SEGMENTS

Table of Contents

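FUNCTION

DESCRIPTION

EXAMPLE

OUTPUT

INPUT FILES

RELATED PROGRAMS

RESTRICTIONS

ALGORITHM

CONSIDERATIONS

COMMAND-LINE SUMMARY

LOCAL DATA FILES

PARAMETER REFERENCE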

FUNCTION

Segments aligns and displays the segments of similarity found by WordSearch.

DESCRIPTION

WordSearch uses word comparison, which is very fast, to identify regions of possible similarity between a query sequence and some set of sequences. Segments uses optimal alignment, which is slow but precise, to display the best segment of similarity in the regions identified by WordSearch. WordSearch uses a method similar to the method of Wilbur and Lipman (Proc. Natl. Acad. Sci.(USA) 80; 726-730 (1983)) to find the regions of possible similarity. Segments uses the alignment procedure of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) to search for the segments.

Segments uses a scoring matrix, a gap creation penalty, and a gap extension penalty to find the best region of similarity between two sequences. The best region has the highest quality, where quality is the sum of the matches minus the sum of the mismatches minus the sum of the gap creation and extension penalties for the gaps added. The best region must fall within some "width" around the peak diagonal.
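The quality arithmetic can be sketched in a few lines of Python. This is an illustration only, not the program's code. It assumes a uniform match value of +10 and mismatch value of -6 (the values quoted for segdna.cmp under CONSIDERATIONS), the default penalties of 50 and 3, and that each gap costs the creation penalty plus the extension penalty times the gap length. The function name quality and its arguments are hypothetical.

```python
def quality(matches, mismatches, gap_lengths,
            match_score=10, mismatch_score=-6,
            gap_weight=50, length_weight=3):
    """Return match total + mismatch total - gap penalties.

    gap_lengths lists the length of every gap in the alignment.
    Assumption: one gap costs gap_weight + length_weight * length.
    """
    gap_penalty = sum(gap_weight + length_weight * n for n in gap_lengths)
    return matches * match_score + mismatches * mismatch_score - gap_penalty


# 444 identical symbols, no gaps
print(quality(444, 0, []))    # 4440

# 443 identical symbols and one gap of a single symbol (assumed composition)
print(quality(443, 0, [1]))   # 4377
```

The first value equals the quality reported for GB_PR2:HUMHBGG in the example session. The second equals the quality reported for GB_PAT:I42109, but the composition of that alignment is assumed here rather than taken from the output, so treat the agreement as suggestive only.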

EXAMPLE

Here is a session using Segments to align the regions of similarity between a human globin coding sequence and sequences in the GenBank nucleotide sequence database found in the example session for WordSearch:

% segments

 (BestFit) SEGMENTS from what WORDSEARCH file ?  ggammacod.word

 What should I call the output file (* ggammacod.pairs *) ?

 Aligning ......................-...

 GB_PR2:HUMHBGG    545 bp  Gaps:  0  Quality:   4440 / Length: 444

 Aligning ......................-...

 GB_PAT:I42109    443 bp  Gaps:  1  Quality:   4377 / Length: 444

 Aligning .....................-..

 GB_PR1:HSGGGPHG    521 bp  Gaps:  0  Quality:   3814 / Length: 383

 //////////////////////////////////////////////////////////////////

OUTPUT

Here is part of the output file:

 (BestFit) SEGMENTS from: ggammacod.word  October 19, 1998 15:06

 (Masked) (Nucleotide) WORDSEARCH of: GenDocData:ggammacod.seq  check: 2906

 from: 1  to: 444

 ASSEMBLE    July 27, 1994 11:40

Symbols:     1 to: 92    from: gamma.seq  ck: 6474,  2179 to: 2270

Symbols:    93 to: 315   from: gamma.seq  ck: 6474,  2393 to: 2615

Symbols:   316 to: 444   from: gamma.seq  ck: 6474,  3502 to: 3630

Human fetal beta globins G and A gamma . . .

 AvMatch: 3.84  AvMisMatch: -6.00  GapWeight: 50  LengthWeight: 3   ..

        Match display thresholds for the alignment(s):

                    | = IDENTITY

                    : =   3

                    . =   1

ggammacod.seq             check: 2906  from: 1      to: 444

GB_PR2:HUMHBGG            check: 7917  from: 17     to: 545

     M15386 Human hemoglobin gamma-G (HBG2) mRNA, partial cds. 3/97

 Gaps: 0  Quality: 4440  Ratio: 10.000  Score: 442  Width: 3  Limits: +/-4

                  .         .         .         .         .

       1 ATGGGTCATTTCACAGAGGAGGACAAGGCTACTATCACAAGCCTGTGGGG 50

         ||||||||||||||||||||||||||||||||||||||||||||||||||

      18 ATGGGTCATTTCACAGAGGAGGACAAGGCTACTATCACAAGCCTGTGGGG 67

                  .         .         .         .         .

      51 CAAGGTGAATGTGGAAGATGCTGGAGGAGAAACCCTGGGAAGGCTCCTGG 100

         ||||||||||||||||||||||||||||||||||||||||||||||||||

      68 CAAGGTGAATGTGGAAGATGCTGGAGGAGAAACCCTGGGAAGGCTCCTGG 117

/////////////////////////////////////////////////////////////////////////

ggammacod.seq             check: 2906  from: 1      to: 444

GB_PR1:MMGGLINE           check: 889   from: 2318   to: 11286

     X53419 M.mulatta gamma-globin-1(G), gamma-globin-2(A) genes and ...

 Gaps: 0  Quality: 2174  Ratio: 9.577  Score: 209  Width: 3  Limits: +/-4

                  .         .         .         .         .

      91 AGGCTCCTGGTTGTCTACCCATGGACCCAGAGGTTCTTTGACAGCTTTGG 140

         ||||||||||||||||||||||||||||||||||||||||||||||||||

    2409 AGGCTCCTGGTTGTCTACCCATGGACCCAGAGGTTCTTTGACAGCTTTGG 2458

                  .         .         .         .         .

     141 CAACCTGTCCTCTGCCTCTGCCATCATGGGCAACCCCAAAGTCAAGGCAC 190

         ||||||||||||||||||||||||||||||||||||||| ||||||||||

    2459 CAACCTGTCCTCTGCCTCTGCCATCATGGGCAACCCCAAGGTCAAGGCAC 2508

/////////////////////////////////////////////////////////////////////////

INPUT FILES

Segments accepts the output file of WordSearch as input. If any of the search set sequences listed in this file have been changed or deleted, Segments acts as if they do not exist. If the WordSearch query sequence listed in this file no longer exists, Segments complains and stops. Segments also reads the beginning and ending positions of the query sequence from the WordSearch output file. If Segments cannot read this range, the entire query sequence is used.

RELATED PROGRAMS

Segments is an automated version of the BestFit program run with -LIMit, with the limits set to plus and minus width+1. The output file of WordSearch is the input file for Segments. Compare/DotPlot and BestFit are more flexible tools for examining the relationship between two sequences when automation is not desired.

SSearch does a rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). This may be the most sensitive method available for similarity searches. Compared to BLAST and FastA, it can be very slow.

BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds.

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST. TFastA does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FastX does a Pearson and Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. FastX translates both strands of the nucleic acid sequence before performing the comparison. It is designed to answer the question, "What implied protein sequences in my nucleic acid sequence are similar to sequences in a protein database?" TFastX does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences, taking frameshifts into account. It is designed to be a replacement for TFastA, and like TFastA, it is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

RESTRICTIONS

The diagonal of comparison cannot be longer than 30,000 and the surface of comparison may not be larger than one million. The surface of comparison can be estimated by multiplying the average length of the two sequences being compared by the sum of the two gap shift limits. (See the ALGORITHM topic below for more information about gap shift limits.) Segments truncates sequences that exceed 30,000 symbols and squeezes the gap shift limits to keep the surface within the one-million limit.
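The surface estimate described above can be sketched as follows. The two limits come from the text; everything else is an assumption made for illustration. In particular, the manual does not say how Segments squeezes the gap shift limits, so the loop below simply shrinks the larger limit one step at a time. The names estimate_surface and fit_to_limits are hypothetical.

```python
MAX_DIAGONAL = 30_000      # longest diagonal of comparison
MAX_SURFACE = 1_000_000    # largest surface of comparison


def estimate_surface(len_a, len_b, left_limit, right_limit):
    """Average sequence length times the sum of the two gap shift limits."""
    return (len_a + len_b) / 2 * (left_limit + right_limit)


def fit_to_limits(len_a, len_b, left_limit, right_limit):
    """Truncate long sequences, then shrink the limits until the surface fits.

    Assumption: the larger limit is reduced first, one step at a time.
    """
    len_a = min(len_a, MAX_DIAGONAL)
    len_b = min(len_b, MAX_DIAGONAL)
    while (estimate_surface(len_a, len_b, left_limit, right_limit) > MAX_SURFACE
           and (left_limit > 0 or right_limit > 0)):
        if left_limit >= right_limit:
            left_limit -= 1
        else:
            right_limit -= 1
    return len_a, len_b, left_limit, right_limit


# A 444-symbol query against a 545-symbol entry with limits of 4 and 4
print(estimate_surface(444, 545, 4, 4))        # 3956.0

# Two 40,000-symbol sequences with limits of 20 and 20
print(fit_to_limits(40_000, 40_000, 20, 20))   # (30000, 30000, 16, 17)
```

For the short sequences in the example session the estimate is far below one million, so neither restriction would come into play.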

ALGORITHM

Segments reads the query sequence and the set of sequences and diagonals in the output list from WordSearch and then executes a limited BestFit on each pair of sequences to make an alignment near that diagonal. For a detailed description, see BestFit (-LIMit), and imagine that the gap shift limits are both set to width + 1. Width is defined as the width of a structure in the histogram from a word comparison (see the WordSearch program). Width is the fifth column of data in the WordSearch output file.
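The relationship between width and the gap shift limits can be sketched as a simple band test. This is a minimal illustration, not BestFit's implementation. It assumes that a diagonal is numbered as the position in the search set sequence minus the position in the query; the manual does not state the sign convention. The function names are hypothetical.

```python
def gap_shift_limits(width):
    """Both limits are width + 1, one on each side of the diagonal."""
    limit = width + 1
    return -limit, limit


def in_band(query_pos, target_pos, diagonal, width):
    """True if the pair of positions lies within the limits around diagonal.

    Assumption: diagonal = target position - query position.
    """
    low, high = gap_shift_limits(width)
    offset = (target_pos - query_pos) - diagonal
    return low <= offset <= high


# Width 3 gives limits of +/-4, as in "Width: 3  Limits: +/-4"
print(gap_shift_limits(3))        # (-4, 4)

# Query position 1 against position 18 defines diagonal 17
print(in_band(1, 18, 17, 3))      # True
print(in_band(1, 23, 17, 3))      # False, five diagonals away
```

Only position pairs that pass this test would be considered when the limited alignment is built, which is why a narrow width keeps the computation small.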

CONSIDERATIONS

There is strong reason to believe that the BestFit algorithm used by Segments is the best way to search for segments of similarity (Lipman and Pearson, "Rapid and Sensitive Protein Similarity Searches," Science 227; 1435-1441 (1985)), but the best parameters to use for Segments are not clear. Like any alignment program, Segments produces alignments that are very different depending on the values assigned for match, mismatch, gap creation penalty, and gap extension penalty. Segments chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with -MATRix, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) Similarly, if you have done a simplified word search and adjust the match and mismatch comparison values with -MATch and -MISmatch, the program will adjust the default gap penalties accordingly. You can use -GAPweight and -LENgthweight to specify alternative gap penalties if you don't want to accept the default values.

The Public Scoring Matrix is Quite Stringent

The public scoring matrix file segdna.cmp scores matches as +10 and mismatches as -6, which means that the segment shown is cut off if there is any significant region where mismatches outnumber matches by about a 2:1 ratio. If the words scored by WordSearch were dispersed along the diagonal, then some of them may not appear in the alignment for that diagonal.
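The arithmetic behind the "about 2:1" figure can be checked with a short sketch. It assumes uniform values of +10 and -6 and ignores gaps; the function names are hypothetical.

```python
def break_even_ratio(match_score=10, mismatch_score=-6):
    """Mismatches per match at which a region contributes nothing."""
    return match_score / -mismatch_score


def region_score(matches, mismatches, match_score=10, mismatch_score=-6):
    """Net contribution of an ungapped region to the quality."""
    return matches * match_score + mismatches * mismatch_score


print(round(break_even_ratio(), 2))   # 1.67
print(region_score(10, 15))           # 10, still worth including
print(region_score(10, 20))           # -20, a 2:1 region lowers the quality
```

A region with more than about 1.67 mismatches per match has a negative net score, so a local alignment gains nothing by extending through it unless the similarity beyond it is strong enough to repay the loss.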

The Alignments Miss Some Words

Segments often fails to display every word scored for the peak diagonal if the words were not tightly grouped along the diagonal. You can use -WHOle to get Needleman-Wunsch alignments that traverse the entire length of the diagonal. If you run Compare with -WORd and plot the output with DotPlot, you can see the exact pattern of word identities between two sequences.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax:  % segments [-INfile=]ggammacod.word -Default

Prompted Parameters:

[-OUTfile=]ggammacod.pairs  names the output file

Local Data Files:

-MATRix=segdna.cmp    assigns the scoring matrix for nucleic acids

-MATRix=blosum62.cmp  assigns the scoring matrix for proteins

Optional Parameters:

-GAPweight=50         sets the gap creation penalty

-LENgthweight=3       sets the gap extension penalty

-WHOle                aligns the whole diagonal, not just the best segment

-MATch=10             sets symbol match value for simplified word searches

-MISmatch=-5          sets symbol mismatch value for simplified word searches

-PAIr=x,5,1           thresholds for displaying '|', ':', and '.'

-WIDth=50             the number of sequence symbols per line

-PAGe=60              adds a line with a form feed every 60 lines

-NOBIGGaps            suppresses abbreviation of large gaps with '.'s

-NOMONitor            suppresses the screen monitor

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Section 4, Using Data Files in the User's Guide.
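The search order for a matrix named without a directory can be sketched as a first-match lookup. This is an illustration under stated assumptions: the logical names MyData, GenMoreData, and GenRunData are represented by an ordinary list of directories supplied by the caller, and the directory paths in the usage line are placeholders. The function name find_matrix is hypothetical.

```python
from pathlib import Path


def find_matrix(name, search_dirs):
    """Return the first existing file called name, or None.

    A name that already includes a directory is used as given.
    search_dirs is ordered: local directory, MyData, GenMoreData, GenRunData.
    """
    candidate = Path(name)
    if candidate.parent != Path("."):
        return candidate if candidate.is_file() else None
    for directory in search_dirs:
        path = Path(directory) / name
        if path.is_file():
            return path
    return None


# Placeholder directories standing in for the four locations
search_order = [".", "mydata", "genmoredata", "genrundata"]
print(find_matrix("mymatrix.cmp", search_order))
```

Because the lookup stops at the first match, a file in your local directory always takes precedence over a file of the same name in a public data directory.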

Segments reads comparison values from the scoring matrix file segdna.cmp (nucleic acids) or blosum62.cmp (peptides). If the WordSearch sequences were simplified, Segments uses the same simplification table used by WordSearch to construct a scoring matrix.

Segments run with -WHOle uses the scoring matrix file seggapdna.cmp for nucleotide sequence comparison instead of segdna.cmp. The scoring matrix for protein sequence comparisons, blosum62.cmp, is unchanged.

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-MATRix=mymatrix.cmp

Allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.

For more information see the Local Scoring Matrices section.

-GAPweight=50

Lets you designate a gap creation penalty if you don't want to use the default penalty. (See the ALGORITHM topic in BestFit for a description of gap creation penalties.)

-LENgthweight=3

Lets you select a gap extension penalty if you don't want to use the default penalty. (See the ALGORITHM topic in BestFit for a description of gap extension penalties.)

-WHOle

Causes this program to make alignments using the method of Needleman and Wunsch instead of the default method of Smith and Waterman. The difference between these two methods is the same as the difference between the programs Gap and BestFit. The Needleman and Wunsch method displays the whole length of both sequences after alignment, while the Smith and Waterman method shows only the best segment of similarity from each sequence.

-WHOle causes Segments to read the local data file seggapdna.cmp for nucleotide sequence comparisons.

-MATch=10

If you have done a simplified word search, Segments must make up a scoring matrix that looks like your simplification scheme. The matrix normally assigns 10 for all the symbol comparisons you treated as equivalent and -20/Alphabet size for all other symbol comparisons. -MATch and -MISmatch allow you to set values other than 10 for matches and -20/Alphabet size for mismatch.

-MISmatch=-5

See -MATch for a description of -MISmatch.

-PAIr=4,2,1

The paired output file from this program displays sequence similarity by printing one of three characters between similar sequence symbols: a pipe character (|), a colon (:), or a period (.). Normally a pipe character is put between symbols that are the same, a colon is put between symbols whose comparison value is greater than or equal to the average positive non-identical comparison value in the scoring matrix, and a period is put between symbols whose comparison value is greater than or equal to 1. You can change these match display thresholds from the command line. The three values associated with -PAIr are the display thresholds for the pipe character, colon, and period. The match display criterion for a pipe character changes from symbolic identity (the default) to the quantitative threshold you have set in the first parameter. A pipe character will no longer be inserted between identical symbols unless their comparison values are greater than or equal to this threshold. If you still want a pipe character to connect identical symbols, use x instead of a number as the first value. (See Appendix VII for more information about scoring matrices.)
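The threshold rules can be sketched as a small function. This is an illustration, not the program's code. The default thresholds of 3 and 1 are taken from the example output file; the comparison values passed in the calls are made up, and the function name match_symbol is hypothetical.

```python
def match_symbol(a, b, score, pipe="x", colon=3, period=1):
    """Pick the character printed between two aligned symbols.

    pipe="x" means a pipe marks symbolic identity; a number makes the
    pipe depend on the comparison value instead.
    """
    if pipe == "x":
        if a == b:
            return "|"
    elif score >= pipe:
        return "|"
    if score >= colon:
        return ":"
    if score >= period:
        return "."
    return " "


print(match_symbol("A", "A", 10))            # |
print(match_symbol("A", "G", -6))            # blank
print(match_symbol("A", "R", 3))             # :
print(match_symbol("A", "A", 10, pipe=12))   # :  identity alone no longer earns a pipe
```

The last call shows the effect described above: once the first value is a number, identical symbols get a pipe only if their comparison value reaches that threshold.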

-WIDth=50

Puts 50 sequence symbols on each line of the output file. You can set the width to anything from 10 to 150 symbols.

-PAGe=60

Printed output from this program may cross from one page to another in an annoying way. Use this parameter to add form feeds to the output file in order to try to keep clusters of related information together. You can set the number of lines per page by supplying a number after -PAGe.

-NOBIGGaps

Suppresses large gap abbreviations, showing all the sequence characters across from large gaps. Usually, gaps that extend one sequence by more than one complete line of output are abbreviated with three dots arranged in a vertical line.

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

Printed: May 27, 2005 14:24

Technical Support: support-us@accelrys.com, support-japan@accelrys.com, or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.