COMPARE

FUNCTION

Compare compares two protein or nucleic acid sequences and creates a file of the points of similarity between them for plotting with DotPlot. Compare finds the points using either a window/stringency or a word match criterion. The word comparison is about 1,000 times faster than the window/stringency comparison, but somewhat less sensitive.

DESCRIPTION

Compare is the first program of a two-program set that produces dot-plots. Compare compares two sequences and writes a file of the points where matches of a certain quality are found. The points in the output file can be plotted with the DotPlot program. Dot-plotting is the best method in Accelrys GCG (GCG) for comparing two sequences when you suspect that they share more than one segment of similarity.

Compare makes a file with the coordinates of each point where two sequences are similar. The sequences are compared in every possible register and a point is added to the file wherever some match criterion for similarity is met. The match criterion can be met in two different ways:

The standard way compares two sequences in every register, searching for all the places where a given number of matches (stringency) occur within a given range (window). See Maizel and Lenk (1981) "Enhanced Graphic Matrix Analysis of Nucleic Acid and Protein Sequences" Proc. Natl. Acad. Sci. USA 78; 7665-7669 for a description of the matrix analysis of biological sequences.

The other way to find points of similarity is to search for short perfect matches of some set length. Short perfect matches are referred to as words. The word comparison between two sequences is about 1,000 times faster than the window/stringency match described above, but it requires that the sequences contain short perfect matches for any similarity to be found. Word comparison is discussed in detail by Wilbur and Lipman (1983) "Rapid Similarity Searches of Nucleic Acid and Protein Data Banks" Proc. Natl. Acad. Sci. USA 80; 726-730. The authors refer to a word as a k-tuple. Compare does a word comparison if it is run with -WORdsize.

You may limit the number of points that Compare finds with -LIMit.

EXAMPLE

The example session below was used to find the places where two haptoglobin sequences are similar. The output from this session is plotted in the Program Manual entry for DotPlot.

% compare

 COMPARE what horizontal sequence ? hpr.seq

                  Begin (* 1 *) ?

                End (*  2966 *) ?

               Reverse (* No *) ?

 To what vertical sequence (* hpr.seq *) ?  hpf.seq

                  Begin (* 1 *) ?

                End (*  2740 *) ?

               Reverse (* No *) ?

 What comparison window size (* 21 *) ?

 What stringency (* 14 *) ?

 What should I call the output file (* hpr.pnt *) ?

 Number of points: 4986

 Writing ..........

OUTPUT

The output file from this session can be read by the DotPlot program to produce a dot-plot. The plots generated by DotPlot from this session and from another session with -WORdsize=8 are shown in the Program Manual entry for DotPlot. The example session with DotPlot uses the file from this session with Compare. Here is part of the output file:

 COMPARE of: hpr.seq  check: 8102  from: 1  to: 2966

Haptoglobin related sequence

HindIII fragment sequenced 12/27/83

  (partially from hpf sequence)

 *** To: hpf.seq  check: 2624  from: 1  to: 2740

Haptoglobin alpha2

HindIII fragment , region equivalent to hp1f

 Window: 21  Stringency: 14  Points: 4986  September 27, 1998 12:15  ..

    131   2639    187   2624    276   2670    277   2671    278   2672

     94   2454     95   2455     96   2456    128   2389    132   2389

     32   2281    146   2389    164   2389     47   2098    656   2662

    //////////////////////////////////////////////////////////////////

   2861    123   2864    126   2865    127   2866    128   2867    129

   2911     56      0      0      0      0      0      0      0      0

      0      0      0      0      0      0      0      0      0      0

       hpr.seq   8102      1   2966  F

       hpf.seq   2624      1   2740  F

            21     14      0  COMPARE

INPUT FILES

Compare accepts two individual nucleotide sequences or protein sequences as input. The function of Compare depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

DotPlot makes a dot-plot with the output file from Compare or StemLoop.

StemLoop finds stems (inverted repeats) within a sequence. You specify the minimum stem length, minimum and maximum loop sizes, and the minimum number of bonds per stem. All stems or only the best stems can be displayed on your screen or written into a file.

BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman.

Repeat finds direct repeats in sequences. You must set the size, stringency, and range within which the repeat must occur; all the repeats of that size or greater are displayed as short alignments.

RESTRICTIONS

No more than 200,000 points may be produced in the plot file. The point files can be quite large and should be deleted as soon as they have been examined. For a window/stringency comparison, the window must be between 1 and 100. For a word comparison, the word size must be between 1 and 25.

ALGORITHM

Compare makes a file of every point where two sequences are similar according to a set match criterion. The points are the Cartesian coordinates of each point of similarity in units of the original sequence coordinates. If the window is greater than 1, the point recorded by Compare is in the middle of the window.

Window/Stringency Comparisons

For window/stringency comparisons, Compare reads a scoring matrix (see Section 4, Using Data Files in the User's Guide) that defines a match value for every possible GCG symbol comparison. Compare then slides the vertical sequence along the horizontal in order to generate every possible register of comparison. For each register, Compare slides a window along the pair of sequences. The match values for each pair of symbols within the window are summed to determine a score for the window at each window position. When the score is greater than or equal to the stringency, then the match criterion has been met and a point is added to the file at the position of the middle of the window on both axes. When the window has no integral center (windows of even length), then Compare rounds the coordinates up. If you have used -ALL, then points are added to the file at all of the positions within the window that have match values greater than or equal to the average positive non-identical comparison value in the scoring matrix (see -ALL in the PARAMETER REFERENCE topic).
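The window/stringency procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Compare's actual code: the function name window_points is hypothetical, and the default score function (1.0 for identical symbols, 0.0 otherwise) is a stand-in for a real scoring matrix. Coordinates are 1-based, and the middle of an even-length window is rounded up as the text describes.

```python
def window_points(horiz, vert, window=21, stringency=14, score=None):
    """Return 1-based (x, y) points where the window score meets the stringency."""
    if score is None:
        # Assumed stand-in for a scoring matrix: 1.0 for identical symbols, else 0.0.
        score = lambda a, b: 1.0 if a == b else 0.0
    points = []
    n, m = len(horiz), len(vert)
    # Each register is one diagonal, identified by offset = i - j (0-based).
    for offset in range(-(m - 1), n):
        i = max(offset, 0)          # start of the diagonal on the horizontal axis
        j = i - offset              # start of the diagonal on the vertical axis
        length = min(n - i, m - j)  # number of symbol pairs in this register
        if length < window:
            continue
        values = [score(horiz[i + k], vert[j + k]) for k in range(length)]
        total = sum(values[:window])
        for start in range(length - window + 1):
            if start > 0:
                # Slide the window by one: add the new pair, drop the old one.
                total += values[start + window - 1] - values[start - 1]
            if total >= stringency:
                mid = start + window // 2  # 0-based middle; even windows round up
                points.append((i + mid + 1, j + mid + 1))
    return points
```

For example, window_points("AAAA", "AAAA", window=4, stringency=4) yields the single point (3, 3): the window covers positions 1 through 4, has no integral center, and the coordinate 2.5 is rounded up to 3.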

Word Comparisons

For word comparisons, you set a word length. Compare then slides the vertical sequence along the horizontal in order to generate every possible register of comparison. For each register, Compare slides a window whose size is equal to the word length along the pair of sequences. If all of the symbols in the two sequences within the window are identical, Compare puts a point in the file at the middle of the word's position in the two sequences. If the word has no integral center (words of even length), then Compare rounds the coordinates up.
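A word comparison can be sketched as follows. The sketch assumes a dictionary lookup of words rather than the register-by-register scan described above; both find the same perfect matches, but this is not necessarily how Compare is implemented. The name word_points is hypothetical, and symbols are compared literally, with no handling of case or ambiguity codes.

```python
from collections import defaultdict


def word_points(horiz, vert, wordsize=6):
    """Return sorted 1-based (x, y) points at the middle of each shared word."""
    # Index every word of the vertical sequence by its 0-based start position.
    index = defaultdict(list)
    for j in range(len(vert) - wordsize + 1):
        index[vert[j:j + wordsize]].append(j)
    mid = wordsize // 2  # 0-based middle offset; even word sizes round up
    points = []
    for i in range(len(horiz) - wordsize + 1):
        # Every vertical position holding the same word is a perfect match.
        for j in index.get(horiz[i:i + wordsize], ()):
            points.append((i + mid + 1, j + mid + 1))
    return sorted(points)
```

With horiz "ACGTAC", vert "GTACGT", and a word size of 4, the shared words are ACGT and GTAC, so the sketch returns the points (3, 5) and (5, 3).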

Alphabet

The parameter alphabet that appears in the output is the number of symbols in the alphabet that could make up each word. The alphabet contains four symbols for nucleic acids and up to 31 for peptide sequences.

CONSIDERATIONS

Dot-plotting helps recognize large regions of similarity. It is not really sensitive enough, in most uses, to see small structures like promoters. In general, you should not try to look for structures that are smaller than the stringency. The window/stringency comparison is usually more sensitive than the word comparison for regions that are only weakly related.

For window/stringency comparisons, Compare chooses a default stringency that is appropriate for the scoring matrix that it reads. If you select a different scoring matrix with -MATRix, the program will adjust the default stringency accordingly.

SUGGESTIONS

Try a Word Comparison First

Word comparisons are very fast, so run Compare with -WORdsize first. Usually, this pilot run gives you a rough idea what the dot-plot for the more sensitive window/stringency comparison is going to look like. See the two plots in the Program Manual entry for DotPlot for examples of each type of comparison.

Setting Window and Stringency

A window 21 symbols wide with a stringency of 14 is a good place to start when comparing nucleic acid sequences that have very few ambiguity codes in them. The number of points you get should be of the same order of magnitude as the number of symbols in your sequences. We have had good results with a window of 30 and a stringency of 11 for peptide sequence comparison. You can use -LIMit to stop the program before the number of points gets unreasonable.

Batch Queue

Unless you are using the -WORdsize parameter, Compare is one of the few programs in GCG that can take more than a few minutes to run. Therefore, large comparisons should probably be run in the batch queue. You can specify that this program run at a later time in the batch queue by using -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Section 3, Using Programs in the User's Guide. Very large comparisons may exceed the CPU limit set by some systems. In practice you should probably limit the range of the sequences compared to about 10,000 for each batch job.

Setting Word Size

You might try a word size of 6 for nucleic acid sequences of 1,000 bases and perhaps 8 for 10,000 bases. You can start with a word size of 2 or 3 for peptide-sequence comparisons.
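One way to reason about these suggestions is to estimate how many word matches two unrelated sequences would produce by chance. The sketch below assumes random sequences with equally frequent, independent symbols, which real sequences only approximate; the function name is hypothetical. Under that assumption, each pair of word positions matches with probability 1 / alphabet ** wordsize.

```python
def expected_random_points(len1, len2, wordsize, alphabet=4):
    """Expected number of chance word matches between two random sequences."""
    positions1 = len1 - wordsize + 1  # word start positions in sequence 1
    positions2 = len2 - wordsize + 1  # word start positions in sequence 2
    return positions1 * positions2 / alphabet ** wordsize


# The two nucleic acid suggestions from the text above.
for length, wordsize in [(1000, 6), (10000, 8)]:
    estimate = expected_random_points(length, length, wordsize)
    print(length, wordsize, round(estimate))
```

Under this model both suggested settings give an expected background that is well below the number of symbols in the sequences, a few hundred chance points in the first case and on the order of 1,500 in the second.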

<CTRL>C

If you need to stop this program, use <Ctrl>C to reset your terminal and session as gracefully as possible. Searches and comparisons write out the results from the part of the search that is complete when you use <Ctrl>C.

Compare writes out all of the points that were found before the comparison was interrupted.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % compare [-INfile1=]hpr.seq [-INfile2=]hpf.seq -Default

Prompted Parameters:

-BEGin1=1  -BEGin2=1   sets the beginning of each sequence (1 is horizontal)

-END1=2966 -END2=2740  sets the end of each sequence (2 is vertical)

-REVerse1  -REVerse2   specifies the strand of each sequence

-WINdow=21             sets the comparison window

-STRIngency=14         sets stringency to find a match in comparison window

[-OUTfile=]hpr.pnt     names the output file

Local Data files:

-MATRix=compardna.cmp  assigns the scoring matrix for nucleic acids

-MATRix=blosum62.cmp   assigns the scoring matrix for proteins

Optional Parameters:

-WORdsize=6      makes a rapid word comparison for perfect 6-mer matches

  -NOSORtpoints    doesn't sort points by diagonal (word comparison only)

  -MINPOints=45    sorts points on diagonals that have at least 45 points

                     (word comparison only)

  -NORANdom        doesn't show points on diagonals with only a

                     few points (word comparison only)

-LIMit=3000      limits the number of points found to 3,000

-ALL[=1]         shows all of the points under the window whose symbol

                   comparison values are greater than or equal to 1

-BATch           submits the program to the batch queue

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Section 4, Using Data Files in the User's Guide.

For window/stringency comparisons, Compare uses the scoring matrix found in either compardna.cmp or blosum62.cmp to find the match values for any position of the window. You should recognize that stringency is really the sum of the match values (defined in this file) for the symbols compared under the window. The public version of compardna.cmp (for nucleic acid comparisons) scores a 1.0 for all IUB nucleic acid ambiguity symbol comparisons where there is any overlap between the sets defined by the symbols (see Appendix III). However, no symbol matches X or N. The public version of blosum62.cmp is based on substitutions between amino acid pairs in ungapped blocks of aligned protein segments as measured by Henikoff and Henikoff.
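The overlap rule for nucleic acid symbols can be illustrated with a small sketch. It assumes the standard IUB ambiguity sets and assumes that non-overlapping comparisons score 0.0, which the text above does not state; the names IUB_SETS and dna_match_value are hypothetical, and the real values are whatever compardna.cmp defines.

```python
# Standard IUB ambiguity codes mapped to the bases they stand for.
# X and N are deliberately left out, so they never match anything.
IUB_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def dna_match_value(a, b):
    """Return 1.0 if the base sets of two IUB symbols overlap, else 0.0."""
    a, b = a.upper(), b.upper()
    if a not in IUB_SETS or b not in IUB_SETS:
        return 0.0  # covers X, N, and any symbol not listed above
    # Assumption: a comparison with no overlap scores 0.0.
    return 1.0 if set(IUB_SETS[a]) & set(IUB_SETS[b]) else 0.0
```

Here R (A or G) matches A, and B matches D because both sets contain G and T, while R against Y and anything against N score 0.0.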

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-REVerse1 and -REVerse2

Sets the program to use the reverse strand for the two input sequences.

-WINdow=21

Sets the size of the window within which the comparison score is calculated (when doing a window/stringency comparison).

-STRIngency=14

Sets the minimum comparison score that defines a match (when doing a window/stringency comparison). The comparison score is the sum of the individual match values for each pair of symbols within the window.

-MATRix=mymatrix.cmp

Allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.

For more information see the Local Scoring Matrices section.

-WORdsize=6

Indicates that you wish to use Compare to do a word comparison, rather than the default window/stringency comparison. Compare prompts you for the word size if you do not specify the word length as the value of the -WORdsize parameter.

The next three parameters only affect word comparisons.

-NOSORtpoints

Compare normally sorts the points from a word comparison so that points on diagonals with a lot of other points appear together in the output file. This sort greatly speeds up plotting since many adjacent points on a diagonal can be represented with a single line, but it adds computing time to the Compare program. -NOSORtpoints suppresses the sorting; the points then appear in the file (and on the plot) in the order in which they were found by the algorithm.

-MINPOints=45

The only points that are sorted in word comparisons are from diagonals that have some minimum number of points. Compare normally sets this number to four times what you would expect on the longest diagonal in the surface of comparison by chance. You can reset the minimum with this parameter.

-NORANdom

Suppresses the display of all of the points that are not on diagonals that have some minimum number of points. Compare normally sets this minimum number to four times what you would expect on the longest diagonal in the surface of comparison by chance. You can reset the minimum with -MINPOints.
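The effect of a minimum number of points per diagonal can be sketched as below. The sketch assumes that points are (x, y) pairs and that two points lie on the same diagonal when their x - y differences are equal; the function name is hypothetical, and the sketch takes the minimum as an argument rather than computing Compare's default.

```python
from collections import Counter


def filter_by_diagonal(points, min_points):
    """Keep only points whose diagonal holds at least min_points points."""
    # Points on the same diagonal share the same x - y difference.
    counts = Counter(x - y for x, y in points)
    return [(x, y) for x, y in points if counts[x - y] >= min_points]
```

Given the points (1, 1), (2, 2), (3, 3), (5, 1), and (9, 2) and a minimum of 3, only the three points on the main diagonal survive; the two isolated points are dropped, which is the kind of thinning -NORANdom describes.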

-LIMit=3000

In many applications it is impractical to generate more than a certain number of points. You may limit the maximum number of points found with this parameter. Compare automatically stops and writes only those points already determined when the limit was reached.

-ALL=1

For detailed comparisons, you may want to see every position where your sequences are similar. Usually, Compare puts only one point in the file at the middle position of the window whenever the window/stringency match criterion is met (see the ALGORITHM topic above). Use the -ALL parameter to see all the positions under the window that are similar. Compare then displays all of the points under the window that have scoring matrix values greater than or equal to the average positive non-identical comparison value in the matrix. Use an optional value with -ALL to change the threshold above which a point is shown to a different number.
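The default threshold for -ALL can be illustrated with a short sketch. It assumes the scoring matrix is held as a dictionary keyed by (horizontal symbol, vertical symbol) pairs, which is a convenience for this example and not the format of a .cmp file; the function names and the toy matrix values in the usage note are hypothetical.

```python
def all_threshold(matrix):
    """Average of the positive values for non-identical symbol pairs."""
    values = [v for (a, b), v in matrix.items() if a != b and v > 0]
    if not values:
        raise ValueError("no positive non-identical values; pass a threshold")
    return sum(values) / len(values)


def points_under_window(horiz, vert, i, j, window, matrix, threshold=None):
    """1-based points in a window starting at 0-based (i, j) that meet the threshold."""
    if threshold is None:
        threshold = all_threshold(matrix)
    points = []
    for k in range(window):
        # Pairs missing from the matrix are assumed to score 0.0.
        if matrix.get((horiz[i + k], vert[j + k]), 0.0) >= threshold:
            points.append((i + k + 1, j + k + 1))
    return points
```

With a toy matrix in which I/V scores 3, I/L scores 2, V/L scores 1, identities score 4, and I/P scores -3, the default threshold is (3 + 2 + 1) / 3 = 2.0. Comparing "IVLI" with "VLLP" under a window of 4 then gives the points (1, 1) and (3, 3); passing threshold=1 also admits (2, 2).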

-BATch

Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

Printed: May 27, 2005 11:56

Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.