BESTFIT

BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman.

DESCRIPTION

[ Previous | Top | Next ]

BestFit inserts gaps to obtain the optimal alignment of the best region of similarity between two sequences, and then displays the alignment in a format similar to the output from Gap. The sequences can be of very different lengths and have only a small segment of similarity between them. You could take a short RNA sequence, for example, and run it against a whole mitochondrial genome.

SEARCHING FOR SIMILARITY

[ Previous | Top | Next ]

BestFit is the most powerful method in the Accelrys GCG (GCG) for identifying the best region of similarity between two sequences whose relationship is unknown.

EXAMPLE

[ Previous | Top | Next ]

The sequence gamma.seq contains an Alu family sequence somewhere in the first 500 bases. alu.seq contains a generic human Alu family repeat. The two sequences are aligned and the best segment of similarity is found with BestFit.

% bestfit

 BESTFIT of what sequence 1 ?  gamma.seq

                  Begin (* 1 *) ?

                End (* 11375 *) ?  500

               Reverse (* No *) ?

 to what sequence 2 (* gamma.seq *) ?  alu.seq

                  Begin (* 1 *) ?

                End (*   207 *) ?

               Reverse (* No *) ?

 What is the gap creation penalty (* 50 *) ?

 What is the gap extension penalty (* 3 *) ?

 What should I call the paired output display file (* gamma.pair *)

 Aligning ..........-.

 Aligning ..........-..

          Gaps:     3

       Quality:  1293

 Quality Ratio: 6.246

  % Similarity: 84.466

        Length:   209

OUTPUT

[ Previous | Top | Next ]

Here is the output file. Notice how BestFit finds and displays only the best segments of similarity:

 BESTFIT of: gamma.seq  check: 6474  from: 1  to: 500

Human fetal beta globins G and A gamma

from Shen, Slightom and Smithies,  Cell 26; 191-203.

Analyzed by Smithies et al. Cell 26; 345-353.

 to: alu.seq  check: 4238  from: 1  to: 207

HSREP2 from the EMBL data library

Human Alu repetitive sequence located near the insulin gene

Dhruva D.R., Shenk T., Subramanian K.N.; "Integration in vivo into

Simian virus 40 DNA of a sequence that resembles a certain family of

genomic interspersed repeated sequences"; Proc. Natl. Acad. Sci. USA

77:4514-4518(1980).  . . .

 CompCheck: 2335

         Gap Weight:     50      Average Match: 10.000

      Length Weight:      3   Average Mismatch: -9.000

            Quality:   1293             Length:    209

              Ratio:  6.246               Gaps:      3

 Percent Similarity: 84.466   Percent Identity: 84.466

        Match display thresholds for the alignment(s):

                    | = IDENTITY

                    : =   5

                    . =   1

 gamma.seq x alu.seq       September 24, 1998 17:06  ..-+-+

                  .         .         .         .         .

     137 AGACCAACCTGGCCAACATGGTGAAATCCCATCTCTAC.AAAAATACAAA 185

         |||||| |||||||||||||||||||  ||||||||||  ||||||||||

       1 AGACCAGCCTGGCCAACATGGTGAAACTCCATCTCTACTGAAAATACAAA 50

                  .         .         .         .         .

     186 AATTAGACAGGCATGATGGCAAGTGCCTGTAATCCCAGCTACTTGGGAGG 235

         |||||| |||||||| ||    ||||||| |||||||||||||| |||||

      51 AATTAGCCAGGCATGGTGATGCGTGCCTGGAATCCCAGCTACTTAGGAGG 100

                  .         .         .         .         .

     236 CTGAGGAAGGAGAATTGCTTGAACCTGGAAGGCAGGAGTTGCAGTGAGCC 285

         |||||  || |||||  ||| ||||  | |||  |  |||||||||||||

     101 CTGAGACAGAAGAATCCCTTAAACCAAG.AGGTGGAGGTTGCAGTGAGCC 149

                  .         .         .         .         .

     286 GAGATCATACCACTGCACTCCAGCCTGGGTGACAGAACAAGACTCTGTCT 335

         ||||||  ||  |||||||||||||| ||||||||| | ||||||  |||

     150 GAGATCGCACGGCTGCACTCCAGCCT.GGTGACAGAGCGAGACTCCATCT 198

     336 CAAAAAAAA 344

         |||||||||

     199 CAAAAAAAA 207

INPUT FILES

[ Previous | Top | Next ]

BestFit accepts two individual nucleotide sequences or protein sequences as input. The function of BestFit depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

[ Previous | Top | Next ]

When you want an alignment that covers the whole length of both sequences, use Gap. When you are trying to find only the best segment of similarity between two sequences, use BestFit. PileUp creates a multiple sequence alignment of a group of related sequences, aligning the whole length of all sequences. DotPlot displays the entire surface of comparison for a comparison of two sequences. GapShow displays the pattern of differences between two aligned sequences. PlotSimilarity plots the average similarity of two or more aligned sequences at each position in the alignment. Pretty displays alignments of several sequences. CompTable helps generate scoring matrices for peptide comparison.

ALGORITHM

[ Previous | Top | Next ]

BestFit uses the local homology algorithm of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) to find the best segment of similarity between two sequences. BestFit reads a scoring matrix that contains values for every possible GCG symbol match (see the LOCAL DATA FILES topic below) . The program uses these values to construct a path matrix that represents the entire surface of comparison with a score at every position for the best possible alignment to that point. The quality score for the best alignment to any point is equal to the sum of the scoring matrix values of the matches in that alignment, less the gap creation penalty times the number of gaps in that alignment, less the gap extension penalty times the total length of all gaps in that alignment. The gap creation and gap extension penalties are set by you. If the best path to any point has a negative value, a zero is put in that position.

After the path matrix is complete, the highest value on the surface of comparison represents the end of the best region of similarity between the sequences. The best path from this highest value backwards to the point where the values revert to zero is the alignment shown by BestFit. This alignment is the best segment of similarity between the two sequences.

For nucleic acids, the default scoring matrix has a match value of 10 for each identical symbol comparison and -9 for each non-identical comparison (not considering nucleotide ambiguity symbols for this example). The quality score for a nucleic acid alignment can, therefore, be determined using the following equation:

       Quality = 10 x TotalMatches + -9 x TotalMismatches

                    - (GapCreationPenalty x GapNumber)

                    - (GapExtensionPenalty x TotalLengthOfGaps)

The quality score for a protein alignment is calculated in a similar manner. However, while the default nucleic acid scoring matrix has a single value for all non-identical comparisons, the default protein scoring matrix has different values for the various non-identical amino acid comparisons. The quality score for a protein alignment can therefore be determined using the following equation (where Total_(AA) is the total number of A-A (Ala-Ala) matches in the alignment, CmpVal_(AA) is the value for an A-A comparison in the scoring matrix, Total_(AB) is the total number of A-B (Ala-Asx) matches in the alignment, CmpVal_(AB) is the value for an A-B comparison in the scoring matrix, ...) :

       Quality =  CmpVal_(AA) x Total_(AA)

                + CmpVal_(AB) x Total_(AB)

                + CmpVal_(AC) x Total_(AC)

                + CmpVal_(ZZ) x Total_(ZZ)

                - (GapCreationPenalty x GapNumber)

                - (GapExtensionPenalty x TotalLengthOfGaps)

For a more complete discussion of scoring matrices, see Appendix VII.

CONSIDERATIONS

[ Previous | Top | Next ]

BestFit Always Finds Something

BestFit always finds an alignment for any two sequences you compare -- even if there is no significant similarity between them! You must evaluate the results critically to decide if the segment shown is not just a random region of relative similarity.

The Segments Shown Obscure Alternative Segments

BestFit only shows one segment of similarity, so if there are several, all but one is obscured. You can approach this problem with graphic matrix analysis (see the Compare and DotPlot programs). Alternatively, you can run BestFit on ranges outside the ranges of similarity found in earlier runs to bring other segments out of the shadow of the best segment.

The Best Fit is Only One Member of a Family

Like all fast gapping algorithms, the alignment displayed is a member of the family of best alignments. This family may have other members of equal quality, but will not have any member with a higher quality. The family is usually significantly different for different choices of gap creation and gap extension penalties. See the CONSIDERATIONS topic in the entry for the Gap program in the Program Manual to learn more about how to assign gap creation and gap extension penalties.

Default Gap Penalties are Specific to Each Scoring Matrix

BestFit chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with -MATRix, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) You can use -GAPweight and -LENgthweight to specify alternative gap penalties if you don't want to accept the default values.

The Public Scoring Matrix for Nucleic Acid Comparisons is Very Stringent

The scoring matrix swgapdna.cmp awards matches +10 and penalizes mismatches -9 so the segments found may be very brief. This penalty means that the alignment cannot be extended by even two bases to pick one extra match. The scoring matrix used by Smith and Waterman, when local alignments were first described, used -.33 for the mismatch penalty and +1 for the match score. You can use Fetch to copy randomdna.cmp and rename it swgapdna.cmp to use a value of -3 for each mismatch and +10 for each match, or use nwsgapdna.cmp, which has no mismatch penalty at all.

Rapid Alignment

When possible, BestFit tries to find the optimal alignment very quickly. If this rapid alignment is not unambiguously optimal, BestFit automatically realigns the sequences to calculate the optimal alignment. When this occurs, the monitor of alignment progress on your terminal screen (Aligning...) is displayed twice for a single alignment.

ALIGNING LONG SEQUENCES

[ Previous | Top | Next ]

The program attempts to allocate enough computer memory to align the input sequences. In the worst case, where the two sequences being aligned are unrelated, the allocation is proportional to the product of the lengths of the two input sequences. However, in many cases where the sequences being aligned are more closely related, the computer can determine an optimal alignment using less memory. When memory on your computer is limiting and the program cannot allocate all of the memory it needs to align long sequences, it completes the alignment in whatever memory it can allocate and displays the message *** Alignment is not guaranteed to be optimal ***. Because the criteria used in the calculation for guaranteeing an optimal alignment are very stringent, the alignment often may be optimal even if this message is displayed.

If you know roughly where the alignment of interest for long sequences begins, you can run the program with -LIMit. Then set the starting coordinates for each sequence near the point where the alignment of interest begins and set gap shift limits on each sequence. The program then aligns the sequences from your starting point such that the sequences do not get out of phase by more than the gap shift limits you have set. If you started both sequences at base number one and set the gap shift limit for sequence one to 100 and for sequence two to 50, then base 350 in sequence one could not be gapped to any base outside of the range from 300 to 450 on sequence two. These limited alignments often require less computer memory than unlimited alignments.

EVALUATING ALIGNMENT SIGNIFICANCE

[ Previous | Top | Next ]

This program can help you evaluate the significance of the alignment, using a simple statistical method, with -RANdomizations. The second sequence is repeatedly shuffled, maintaining its length and composition, and then realigned to the first sequence. The average alignment score, plus or minus the standard deviation, of all randomized alignments is reported in the output file. You can compare this average quality score to the quality score of the actual alignment to help evaluate the significance of the alignment. The number of randomizations can be specified by adding an optional value to -RANdomizations; the default is 10. You can preserve the dinucleotide or dipeptide composition of the input sequence in the shuffled sequence by using -PREServe=2. Use -PREServe=3 to preserve the trinucleotide or tripeptide composition of the input sequence.

The score of each randomized alignment is reported to the screen. You can use <Ctrl>C to interrupt the randomizations and output the results from those randomized alignments that have been completed.

By ignoring the statistical properties of biological sequences, this simple Monte Carlo statistical method may give misleading results. Please see Lipman, D.J., Wilbur, W.J., Smith, T.F., and Waterman, M.S. (Nucl. Acids Res. 12; 215-226 (1984)) for a discussion of the statistical significance of nucleic acid similarities.

ALIGNMENT METRICS

[ Previous | Top | Next ]

BestFit and Gap display four figures of merit for alignments: Quality, Ratio, Identity, and Similarity.

The Quality (described above) is the metric maximized in order to align the sequences. Ratio is the quality divided by the number of bases in the shorter segment. Percent Identity is the percent of the symbols that actually match. Percent Similarity is the percent of the symbols that are similar. Symbols that are across from gaps are ignored in the calculation of Percent Identity and Percent Similarity. A similarity is scored when the scoring matrix value for a pair of symbols is greater than or equal to the average positive non-identical comparison value in the matrix, the similarity threshold. This threshold is also used by the display procedure to decide when to put a ':' (colon) between two aligned symbols. You can change this threshold by specifying optional values to -PAIr. For instance, -PAIr=10,5 would set the similarity threshold to 5.

The similarity and identity metrics are not optimized by alignment programs so they should not be used to compare alignments.

PEPTIDE SEQUENCES

[ Previous | Top | Next ]

If your input sequences are peptide sequences, this program uses a scoring matrix, blosum62.cmp, with comparison values derived from a study of substitutions between amino acid pairs in ungapped block of aligned protein segments as measured by Henikoff and Henikoff (Proc. Natl. Acad. Sci. USA 89; 10915-10919 (1992)).

RESTRICTIONS

[ Previous | Top | Next ]

Input sequences may not be more than 32,000 symbols long.

COMMAND-LINE SUMMARY

[ Previous | Top | Next ]

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % bestfit [-INfile1=]gamma.seq  [-INfile2=]alu.seq -Default

Prompted Parameters:

-BEGin1=1 -BEGin2=1      sets the beginning of each sequence

-END1=500 -END2=207      sets the end of each sequence

-NOREV1     -NOREV2      specifies the strand of each sequence

-GAPweight=50            sets the gap creation penalty (8 is protein default)

-LENgthweight=3          sets the gap extension penalty (2 is protein default)

[-OUTfile1=]gamma.pair   names the alignment output file

Local Data Files:

-MATRix=swgapdna.cmp     assigns the scoring matrix for nucleic acids

-MATRix=blosum62.cmp     assigns the scoring matrix for proteins

Optional Parameters:

-OUTfile2=gamma.gap     names seq. output file for sequence 1 with gaps added

-OUTfile3=alu.gap       names seq. output file for sequence 2 with gaps added

-PENAlizedlength=12     penalizes gaps longer than 12 sequence characters

                            the same as gaps of length 12

-LIMit1=499 -LIMit2=206 limits the surface of comparison

-RANdomizations[=10]    determines average score from 10 randomized

                            alignments

  -PREServe=2             preserves dinucleotide or dipeptide composition

                            in randomized sequence

           =3             preserves trinucleotide or tripeptide composition

                            in randomized sequence

-PAIr=x,5,1             thresholds for displaying '|', ':', and '.'

-WIDth=50               the number of sequence symbols per line

-PAGe=60                adds a line with a form feed every 60 lines

-NOBIGGaps              suppresses abbreviation of large gaps with '.'s

-HIGhroad               makes the top alignment for your parameters

-LOWroad                makes the bottom alignment for your parameters

-NOSUMmary              suppresses the screen summary

LOCAL DATA FILES

[ Previous | Top | Next ]

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Section 4, Using Data Files in the User's Guide.

If the first sequence you name is a nucleic acid, BestFit uses the scoring matrix in the public file swgapdna.cmp. (SW stands for Smith and Waterman.) If the first sequence you name is a peptide sequence, BestFit reads blosum62.cmp instead.

PARAMETER REFERENCE

[ Previous | Top ]

You can set the parameters listed below from the command line.

-REVerse1 and -REVerse2

Sets the program to use the reverse strand for the two input sequences.

-GAPweight=8

Sets the gap creation penalty that is subtracted from the alignment score whenever a gap is created.

-LENgthweight=2

Sets the gap extension penalty that is subtracted from the alignment score for each gapped symbol.

-MATRix=mymatrix.cmp

Allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.

For more information see the Local Scoring Matrices section.

-OUTfile2=seqname1.gap -OUTfile3=seqname2.gap

This program can write three different output files. The first displays the alignment of sequence one with sequence two. The second is a new sequence file for sequence one, possibly expanded by gaps to make it align with sequence two. The third, like the second, is a new sequence file for sequence two, possibly expanded by gaps to make it align with sequence one. The program writes only the first file unless there are output file options on the command line. If there are any output files named on the command line, only those output files are written. If you add -OUT to the command line without an accompanying file name, then the program will write the second and third output files after prompting you for their names.

Aligned sequences (in sequence files) can be displayed with GapShow. Their similarity can be displayed with PlotSimilarity.

-PENAlizedlength=12

lets you set the maximum penalty for any gap in the alignment. For instance, if you specify -PENAlizedlength=12, then any gap longer than 12 characters is penalized the same as a gap of length 12. Using this parameter, alignments can contain large gaps without incurring large gap extension penalties. This may be useful, for instance, if you are aligning a cDNA sequence with the corresponding genomic DNA sequence containing large introns.

-LIMit1=20 and -LIMit2=20

let you set gap shift limits for each sequence. When you already know of a long similarity between two sequences you can "zip" them together using this mode. The beginning coordinates for each sequence must be near the beginning of the alignment you want to see. The alignment continues so that gaps inserted do not require the sequences to get out of step by more than the gap shift limits. You can align very long sequences rapidly. When you set gap shift limits for one or both input sequences, the maximum surface of comparison available to your alignment is 3.5 million. The size of the surface of comparison that your alignment actually requires can be predicted by multiplying the average length of the two sequences by the sum of the two shift limits.

If you add just -LIMit to the command line without specifying any value, the program prompts you to enter gap shift limits for each sequence.

-RANdomizations=10

reports the average alignment score and standard deviation from 10 randomized alignments in which the second sequence is repeatedly shuffled, maintaining the length and composition of the original sequence, and then aligned to the first sequence. You can use the optional parameter to set the number of randomized alignment to some number other than 10.

-PREServe=2

preserves the dinucleotide or dipeptide composition of the original sequence in the shuffled sequence. Use -PREServe=3 to preserve the trinucleotide or tripeptide composition.

-PAIr=4,2,1

The paired output file from this program displays sequence similarity by printing one of three characters between similar sequence symbols: a pipe character(|), a colon (:), or a period (.). Normally a pipe character is put between symbols that are the same, a colon is put between symbols whose comparison value is greater than or equal to the average positive non-identical comparison value in the scoring matrix, and a period is put between symbols whose comparison value is greater than or equal to 1. You can change these match display thresholds from the command line. The three values associated with -PAIr are the display thresholds for the pipe character, colon, and period. The match display criterion for a pipe character changes from symbolic identity (the default) to the quantitative threshold you have set in the first parameter. A pipe character will no longer be inserted between identical symbols unless their comparison values are greater than or equal to this threshold. If you still want a pipe character to connect identical symbols, use x instead of a number as the first value. (See Appendix VII for more information about scoring matrices.)

-WIDth=50

puts 50 sequence symbols on each line of the output file. You can set the width to anything from 10 to 150 symbols.

-PAGe=60

Printed output from this program may cross from one page to another in an annoying way. Use this parameter to add form feeds to the output file in order to try to keep clusters of related information together. You can set the number of lines per page by supplying a number after -PAGe.

-NOBIGGaps

suppresses large gap abbreviations, showing all the sequence characters across from large gaps. Usually, gaps that extend one sequence by more than one complete line of output are abbreviated with three dots arranged in a vertical line.

-LOWroad and -HIGhroad

The insertion of gaps is arbitrary in many cases, and equally optimal alignments can be generated by inserting gaps differently. When equally optimal alignments are possible, this program can insert the gaps differently if you select either the -LOWroad or the -HIGhroad parameter. Here are examples for the alignment of GACCAT with GACAT with different parameters.

For:       Match = 10         MisMatch = -9

      Gap weight = 10    Length Weight =  0

LowRoad:   1 GACCAT 6

             || |||      Quality = 40

           1 GA.CAT 5

HighRoad:  1 GACCAT 6

             ||| ||      Quality = 40

           1 GAC.AT 5

For:       Match = 10         MisMatch = 0

      Gap weight = 30    Length Weight = 0

HighRoad:  1 GACCAT 6

             |||           Quality = 30

           1 GACAT. 5

LowRoad:   1 GACCAT 6

                |||        Quality = 30

           1 .GACAT 5

Essentially the low road shifts all of the arbitrary gaps in sequence two to the left and all of the arbitrary gaps in sequence one to the right. The high road does exactly the opposite. When neither high road nor low road is selected, the program tries not to insert a gap whenever that is possible and uses the high road alternative for all collisions.

-SUMmary

writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: May 27, 2005 11:44

Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ® and, the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.