Compare compares two protein or nucleic acid sequences and creates a file of the points of similarity between them for plotting with DotPlot. Compare finds the points using either a window/stringency or a word match criterion. The word comparison is about 1,000 times faster than the window/stringency comparison, but somewhat less sensitive.
Compare is the first program of a two-program set that produces dot-plots. It compares two sequences and writes a file of the points where matches of a certain quality are found. The points in the output file can be plotted with the DotPlot program. Dot-plotting is the best method in Accelrys GCG (GCG) for comparing two sequences when you suspect that there could be more than one segment of similarity between the two.
Compare makes a file with the coordinates of each point where two sequences are similar. The sequences are compared in every possible register and a point is added to the file wherever some match criterion for similarity is met. The match criterion can be met in two different ways:
The standard way compares two sequences in every register, searching for all the places where a given number of matches (stringency) occur within a given range (window). See Maizel and Lenk (1981) "Enhanced Graphic Matrix Analysis of Nucleic Acid and Protein Sequences" Proc. Natl. Acad. Sci. USA 78; 7665-7669 for a description of the matrix analysis of biological sequences.
The other way to find points of similarity is to search for short perfect matches of some set length. Short perfect matches are referred to as words. The word comparison between two sequences is about 1,000 times faster than the window/stringency match described above, but it requires that the sequences contain short perfect matches for any similarity to be found. Word comparison is discussed in detail by Wilbur and Lipman (1983) "Rapid Similarity Searches of Nucleic Acid and Protein Data Banks" Proc. Natl. Acad. Sci. USA 80; 726-730. The authors refer to a word as a k-tuple. Compare does a word comparison if it is run with -WORdsize.
You may limit the number of points that Compare finds with -LIMit.
COMPARE what horizontal sequence ? hpr.seq
Begin (* 1 *) ?
End (* 2966 *) ?
Reverse (* No *) ?
To what vertical sequence (* hpr.seq *) ? hpf.seq
Begin (* 1 *) ?
End (* 2740 *) ?
Reverse (* No *) ?
What comparison window size (* 21 *) ?
What stringency (* 14 *) ?
What should I call the output file (* hpr.pnt *) ?
Number of points: 4986
The output file from this session can be read by the DotPlot program to produce a dot-plot. The plots generated by DotPlot from this session and from another session with -WORdsize=8 are shown in the figures in the Program Manual entry for DotPlot. The example session with DotPlot uses the file from this session with Compare. Here is part of the output file:
COMPARE of: hpr.seq check: 8102 from: 1 to: 2966
Haptoglobin related sequence
HindIII fragment sequenced
(partially from hpf sequence)
*** To: hpf.seq check: 2624 from: 1 to: 2740
HindIII fragment , region equivalent to hp1f
Window: 21 Stringency: 14 Points: 4986 September 27, 1998 12:15 ..
131 2639 187 2624 276 2670 277 2671 278 2672
94 2454 95 2455 96 2456 128 2389 132 2389
32 2281 146 2389 164 2389 47 2098 656 2662
2861 123 2864 126 2865 127 2866 128 2867 129
2911 56 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
hpr.seq 8102 1 2966 F
hpf.seq 2624 1 2740 F
21 14 0 COMPARE
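The layout of the point file can be read back with a few lines of Python. The sketch below is not part of GCG. It assumes, from the excerpt above only, that the heading ends at the line finishing with "..", that the numbers after it are (horizontal, vertical) coordinate pairs, that "0 0" pairs are padding, and that the first non-numeric line starts the trailer. The function name parse_points is hypothetical.

```python
def parse_points(text):
    """Return (header_lines, points) from the text of a Compare point file.

    Assumptions (inferred from the manual's excerpt, not from a format spec):
    the heading ends at the first line that finishes with "..", and the
    trailer begins at the first line that is not purely numeric.
    """
    lines = text.splitlines()
    divider = next(
        i for i, line in enumerate(lines) if line.rstrip().endswith("..")
    )
    header = lines[: divider + 1]
    points = []
    for line in lines[divider + 1:]:
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines between the heading and the data
        if not all(field.isdigit() for field in fields):
            break  # reached the trailer (file names, window, stringency)
        # Consecutive numbers form (horizontal, vertical) pairs.
        for x, y in zip(fields[0::2], fields[1::2]):
            if (x, y) != ("0", "0"):  # "0 0" pairs only pad the last rows
                points.append((int(x), int(y)))
    return header, points
```

Called on the text of hpr.pnt, this would return the heading lines and a list of coordinate pairs such as (131, 2639), ready to pass to any plotting tool.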
Compare accepts two individual nucleotide sequences or protein sequences as input. The function of Compare depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.
DotPlot makes a dot-plot with the output file from Compare or StemLoop. StemLoop finds stems (inverted repeats) within a sequence. You specify the minimum stem length, minimum and maximum loop sizes, and the minimum number of bonds per stem. All stems or only the best stems can be displayed on your screen or written into a file. BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman. Repeat finds direct repeats in sequences. You must set the size, stringency, and range within which the repeat must occur; all the repeats of that size or greater are displayed as short alignments.
No more than 200,000 points may be produced in the plot file. The point files can be quite large and should be deleted as soon as they have been examined. The window must be between 1 and 100; for a word comparison, the word size must be between 1 and 25.
Compare makes a file of every point where two sequences are similar according to a set match criterion. The points are the Cartesian coordinates of each point of similarity in units of the original sequence coordinates. If the window is greater than 1, the point recorded by Compare is in the middle of the window.
For window/stringency comparisons, Compare reads a scoring matrix (see Section 4, Using Data Files in the User's Guide) that defines a match value for every possible GCG symbol comparison. Compare then slides the vertical sequence along the horizontal in order to generate every possible register of comparison. For each register, Compare slides a window along the pair of sequences. The match values for each pair of symbols within the window are summed to determine a score for the window at each window position. When the score is greater than or equal to the stringency, then the match criterion has been met and a point is added to the file at the position of the middle of the window on both axes. When the window has no integral center (windows of even length), then Compare rounds the coordinates up. If you have used -ALL, then points are added to the file at all of the positions within the window that have match values greater than or equal to the average positive non-identical comparison value in the scoring matrix (see -ALL in the PARAMETER REFERENCE topic).
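The window/stringency procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not GCG's implementation: the function names are hypothetical, the identity function stands in for a real scoring matrix, and -ALL is not modeled. Coordinates are 1-based, and the midpoint of an even window is rounded up as the text describes.

```python
def identity(a, b):
    """Stand-in scoring function (assumption): 1.0 for identical symbols."""
    return 1.0 if a == b else 0.0


def window_stringency_points(horizontal, vertical, window, stringency, match):
    """Return 1-based (x, y) window midpoints whose summed score >= stringency."""
    n, m = len(horizontal), len(vertical)
    middle = window // 2 + 1  # 1-based offset of the middle; even windows round up
    points = []
    # Each register is one diagonal: a fixed offset between the two sequences.
    for register in range(-(m - window), n - window + 1):
        i = max(register, 0)    # 0-based window start in the horizontal sequence
        j = max(-register, 0)   # 0-based window start in the vertical sequence
        while i + window <= n and j + window <= m:
            score = sum(
                match(horizontal[i + k], vertical[j + k]) for k in range(window)
            )
            if score >= stringency:
                points.append((i + middle, j + middle))
            i += 1
            j += 1
    return points
```

For example, window_stringency_points("GATTACA", "TTAC", 3, 3, identity) finds the two windows TTA and TAC that match perfectly. The sketch recomputes every window sum from scratch for clarity; a running sum along each diagonal would avoid the repeated work.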
For word comparisons, you set a word length. Compare then slides the vertical sequence along the horizontal in order to generate every possible register of comparison. For each register, Compare slides a window whose size is equal to the word length along the pair of sequences. If all of the symbols in the two sequences within the window are identical, Compare puts a point in the file at the middle of the word's position in the two sequences. If the word has no integral center (words of even length), then Compare rounds the coordinates up.
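A word comparison can be sketched with a lookup table of words, in the spirit of the k-tuple approach of Wilbur and Lipman. The code below is an assumption-laden illustration, not the code Compare runs: word_points is a hypothetical name, and the dictionary index is simply one common way to find the same perfect matches without sliding a window over every register.

```python
from collections import defaultdict


def word_points(horizontal, vertical, wordsize):
    """Return 1-based (x, y) midpoints of every perfect word match."""
    # Index every word of the vertical sequence by its 0-based start position.
    index = defaultdict(list)
    for j in range(len(vertical) - wordsize + 1):
        index[vertical[j:j + wordsize]].append(j)

    middle = wordsize // 2 + 1  # 1-based offset of the middle; even words round up
    points = []
    for i in range(len(horizontal) - wordsize + 1):
        word = horizontal[i:i + wordsize]
        for j in index.get(word, ()):
            points.append((i + middle, j + middle))
    return points
```

Each word of the horizontal sequence is looked up once, so the work grows with the sequence lengths plus the number of matches rather than with their product, which is consistent with the large speed advantage the text reports for word comparisons.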
The parameter alphabet that appears in the output is the number of symbols in the alphabet that could make up each word. The alphabet contains four symbols for nucleic acids and up to 31 for peptide sequences.
Dot-plotting helps recognize large regions of similarity. It is not really sensitive enough, in most uses, to see small structures like promoters. In general, you should not try to look for structures that are smaller than the stringency. The window/stringency comparison is usually more sensitive than the word comparison for regions that are only weakly related.
For window/stringency comparisons, Compare chooses a default stringency that is appropriate for the scoring matrix that it reads. If you select a different scoring matrix with -MATRix, the program will adjust the default stringency accordingly.
Try a Word Comparison First
Word comparisons are very fast, so run Compare with -WORdsize first. Usually, this pilot run gives you a rough idea what the dot-plot for the more sensitive window/stringency comparison is going to look like. See the two plots in the Program Manual entry for DotPlot for examples of each type of comparison.
Setting Window and Stringency
A window 21 symbols wide with a stringency of 14 is a good place to start when comparing nucleic acid sequences that have very few ambiguity codes in them. The number of points you get should be of the same order of magnitude as the number of symbols in your sequences. We have had good results with a window of 30 and a stringency of 11 for peptide sequence comparison. You can use -LIMit to stop the program before the number of points gets unreasonable.
Unless you are using the -WORdsize parameter, Compare is one of the few programs in GCG that can take more than a few minutes to run. Therefore, large comparisons should probably be run in the batch queue. You can specify that this program run at a later time in the batch queue by using -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Section 3, Using Programs in the User's Guide. Very large comparisons may exceed the CPU limit set by some systems. In practice you should probably limit the range of the sequences compared to about 10,000 for each batch job.
Setting Word Size
You might try a word size of 6 for nucleic acid sequences of 1,000 bases and perhaps 8 for 10,000 bases. You can start with a word size of 2 or 3 for peptide-sequence comparisons.
If you need to stop this program, use <Ctrl>C to reset your terminal and session as gracefully as possible. Searches and comparisons write out the results from the part of the search that is complete when you use <Ctrl>C.
Compare writes out all of the points that were found before the comparison was interrupted.
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.
Minimal Syntax: % compare [-INfile1=]hpr.seq [-INfile2=]hpf.seq -Default
-BEGin1=1 -BEGin2=1 sets the beginning of each sequence (1 is horizontal)
-END1=2966 -END2=2740 sets the end of each sequence (2 is vertical)
-REVerse1 -REVerse2 specifies the strand of each sequence
-WINdow=21 sets the comparison window
-STRIngency=14 sets stringency to find a match in comparison window
[-OUTfile=]hpr.pnt names the output file
Local Data files:
-MATRix=compardna.cmp assigns the scoring matrix for nucleic acids
-MATRix=blosum62.cmp assigns the scoring matrix for proteins
-WORdsize=6 makes a rapid word comparison for perfect 6-mer matches
-NOSORtpoints doesn't sort points by diagonal (word comparison only)
-MINPOints=45 sorts points on diagonals that have at least 45 points
(word comparison only)
-NORANdom doesn't show points on diagonals with only a
few points (word comparison only)
-LIMit=3000 limits the number of points found to 3,000
-ALL[=1] shows all of the points under the window whose symbol
comparison values are greater than or equal to 1
-BATch submits the program to the batch queue
The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.
Local Scoring Matrices
This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Section 4, Using Data Files in the User's Guide.
For window/stringency comparisons, Compare uses the scoring matrix found in either compardna.cmp or blosum62.cmp to find the match values for any position of the window. Keep in mind that the stringency is compared against the sum of the match values (defined in this file) for the symbols under the window. The public version of compardna.cmp (for nucleic acid comparisons) scores 1.0 for every comparison of IUB nucleic acid ambiguity symbols whose sets overlap at all (see Appendix III). However, no symbol matches X or N. The public version of blosum62.cmp is based on substitutions between amino acid pairs in ungapped blocks of aligned protein segments as measured by Henikoff and Henikoff.
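The overlap rule described for compardna.cmp can be illustrated with a small Python function. This is a sketch, not the contents of the matrix file: dna_match is a hypothetical name, the table lists the standard IUB ambiguity codes, and the value 0.0 for non-overlapping symbols is an assumption, since the text states only the 1.0 score for overlaps.

```python
# Standard IUB nucleotide ambiguity codes and the bases each one stands for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
}


def dna_match(a, b):
    """Return 1.0 when the base sets of two symbols overlap, else 0.0."""
    a, b = a.upper(), b.upper()
    # Per the text, X and N match nothing (so they are left out of IUB here).
    if a in ("X", "N") or b in ("X", "N"):
        return 0.0
    overlap = set(IUB.get(a, "")) & set(IUB.get(b, ""))
    return 1.0 if overlap else 0.0  # 0.0 for a mismatch is an assumption
```

Under this rule R (A or G) matches A, and S (C or G) matches B (C, G, or T), while R and Y share no base and do not match.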
You can set the parameters listed below from the command line.
-REVerse1 and -REVerse2
Sets the program to use the reverse strand for the two input sequences.
-WINdow=21
Sets the size of the window within which the comparison score is calculated (when doing a window/stringency comparison).
-STRIngency=14
Sets the minimum comparison score that defines a match (when doing a window/stringency comparison). The comparison score is the sum of the individual match values for each pair of symbols within the window.
-MATRix=compardna.cmp
Allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.
For more information see the Local Scoring Matrices section.
-WORdsize=6
Indicates that you wish to use Compare to do a word comparison, rather than the default window/stringency comparison. Compare prompts you for the word size if you do not supply a word length with -WORdsize.
The next three parameters only affect word comparisons.
-NOSORtpoints
Compare normally sorts the points from a word comparison so that points on diagonals with a lot of other points appear together in the output file. This sort greatly speeds up plotting since many adjacent points on a diagonal can be represented with a single line, but it adds computing to the Compare program. -NOSORtpoints suppresses the sorting; the points then appear in the file (and on the plot) in the order in which they were found by the algorithm.
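The benefit of grouping points by diagonal can be seen in a short Python sketch. It is illustrative only: diagonal_segments is a hypothetical name, and the exact ordering Compare writes to the file is not described in this manual. Points with the same value of x - y lie on one diagonal, and runs of adjacent points collapse into a single (start, end) segment that a plotter could draw as one line.

```python
from collections import defaultdict


def diagonal_segments(points, min_points=1):
    """Group (x, y) points by diagonal and merge adjacent points into segments.

    Returns {diagonal: [(start_point, end_point), ...]}, most populated
    diagonal first; diagonals with fewer than min_points points are skipped.
    """
    diagonals = defaultdict(list)
    for x, y in points:
        diagonals[x - y].append((x, y))

    result = {}
    by_size = sorted(diagonals.items(), key=lambda item: len(item[1]), reverse=True)
    for diagonal, members in by_size:
        if len(members) < min_points:
            continue
        segments = []
        for point in sorted(members):
            # On one diagonal, adjacent points differ by exactly 1 in x (and y).
            if segments and point[0] == segments[-1][1][0] + 1:
                segments[-1] = (segments[-1][0], point)
            else:
                segments.append((point, point))
        result[diagonal] = segments
    return result
```

Three consecutive points such as (4, 2), (5, 3), and (6, 4) become one segment from (4, 2) to (6, 4), so the plot needs one line instead of three dots.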
-MINPOints=45
The only points that are sorted in word comparisons are from diagonals that have some minimum number of points. Compare normally sets this number to four times what you would expect on the longest diagonal in the surface of comparison by chance. You can reset the minimum with this parameter.
-NORANdom
Suppresses the display of all of the points that are not on diagonals that have some minimum number of points. Compare normally sets this minimum number to four times what you would expect on the longest diagonal in the surface of comparison by chance. You can reset the minimum with -MINPOints.
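The manual does not say how the chance expectation behind this default minimum is computed. The Python sketch below shows one plausible estimate under loudly stated assumptions: symbols are independent and equally frequent, so a word of length k matches by chance with probability (1/alphabet)^k, and the longest diagonal is as long as the shorter sequence. The function name is hypothetical and the result should not be read as the value Compare actually uses.

```python
def estimated_min_points(len_horizontal, len_vertical, wordsize, alphabet):
    """Estimate four times the chance word matches on the longest diagonal.

    Assumes uniform, independent symbols (an assumption of this sketch,
    not a documented property of Compare).
    """
    longest_diagonal = min(len_horizontal, len_vertical)
    word_positions = max(longest_diagonal - wordsize + 1, 0)
    expected_by_chance = word_positions * (1.0 / alphabet) ** wordsize
    return 4 * expected_by_chance
```

The estimate grows with sequence length and falls sharply as the word size increases, which fits the advice to use longer words for longer sequences.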
-LIMit=3000
In many applications it is impractical to generate more than a certain number of points. You may limit the maximum number of points found with this parameter. When the limit is reached, Compare automatically stops and writes only the points it has already found.
-ALL=1
For detailed comparisons, you may want to see every position where your sequences are similar. Usually, Compare puts only one point in the file at the middle position of the window whenever the window/stringency match criterion is met (see the ALGORITHM topic above). Use the -ALL parameter to see all the positions under the window that are similar. Compare then displays all of the points under the window that have scoring matrix values greater than or equal to the average positive non-identical comparison value in the matrix. Give -ALL an optional value to set a different threshold; points whose values are greater than or equal to that number are shown.
-BATch
Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.
Printed: May 27, 2005 11:56
Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.
Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.