Repeat finds direct repeats in sequences. You must set the size, stringency, and range within which the repeat must occur; all the repeats of that size or greater are displayed as short alignments.
Repeat lets you choose a minimum repeat length (window), a stringency within the window, and a search range and then finds all the repeats of at least that size and stringency within the search range chosen. The repeats are sorted by position and displayed in an output file as alignments of those parts of the sequence that make up the repeats. Repeat tells you the number of repeats found for your settings of window and stringency before filing the results. If you feel there are too many repeats, you may reset the parameters before writing the repeats out to a file. You can limit the number of repeats shown, or sort the repeats by quality so that the longest repeats come at the top of the list. See the ALGORITHM topic below to understand precisely what Repeat does.
Here is a session using Repeat to find all the direct repeats in the first 1,000 bases of gamma.seq that are at least 10 bases long, occur within 100 bases of each other, and have at least 9 out of 10 matched bases:
REPEATs from what sequence ? gamma.seq
Begin (* 1 *) ?
End (* 11375 *) ? 1000
What minimum repeat window (* 7 *) ? 10
What minimum stringency (* 10 *) ? 9
Find repeats through what range (* 50 *) ? 100
There are 11 repeats, would you like to
1) File the repeats
4) Set new parameters
Please choose one (* 1 *):
What should I call the output file (* gamma.rpt *)
Each repeat is shown as an alignment of the repeated regions along with the beginning and ending coordinates of each region. The size and stringency of each repeat are shown to the right of the alignment. The stringency is the sum of the repeat's pairwise residue values, which are found in the scoring matrix. Here is some of the output file for the example above:
REPEAT of: gamma.seq check: 6474 from: 1 to: 1000
Human fetal beta globins G and A gamma
from Shen, Slightom and Smithies, Cell 26; 191-203.
Analyzed by Smithies et al. Cell 26; 345-353.
Window: 10 Stringency: 9 Range: 100 Repeats: 11
October 7, 1998..
  79 TGTAATCCCA 88
     || |||||||      10   9
 158 TGAAATCCCA 167

 158 TGAAATCCCATCT 170
     || ||||||| ||   13  11
 213 TGTAATCCCAGCT 225

 395 ACCAGTCTCT 404
     ||||| ||||      10   9
 444 ACCAGACTCT 453

 937 AAAAAACAAAA 947
     |||||| ||||     11  10
 965 AAAAAATAAAA 975

 965 AAAAAATAAAAA 976
     |||||||||| |    12  11
 985 AAAAAATAAAGA 996

 981 AAAGAAAAA 989
     |||||||||        9   9
 992 AAAGAAAAA 1000
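The size and stringency columns can be checked by hand for any repeat in the listing. The sketch below is a minimal illustration, assuming the simplest possible scoring (1 for identical bases, 0 otherwise), which only approximates repeatdna.cmp; the function name is made up for this example and is not part of Repeat.

```python
def size_and_stringency(top, bottom):
    """Return (size, stringency) for two aligned copies of a repeat.

    Assumption: identical symbols score 1 and all others score 0,
    a simplification of the real scoring matrix.
    """
    if len(top) != len(bottom):
        raise ValueError("repeat copies must have the same length")
    stringency = sum(1 for a, b in zip(top, bottom) if a == b)
    return len(top), stringency


# First repeat in the sample output (positions 79-88 and 158-167).
print(size_and_stringency("TGTAATCCCA", "TGAAATCCCA"))  # (10, 9)
```

The result agrees with the 10 and 9 printed beside that alignment.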
Repeat accepts a single sequence file as input. The function of Repeat depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.
Xnu replaces statistically significant tandem repeats in protein sequences with X characters. If a resulting protein sequence is used as a query for a BLAST search, the regions with X characters are ignored.
Using Compare/DotPlot to create a dot-plot comparison of a sequence to itself is functionally equivalent to running Repeat. The dot-plot is a much more graphic way to show where the repeats occur and what the background of random repeats looks like.
Repeat cannot find more than 1,000 repeats.
For window/stringency comparisons, Repeat reads a scoring matrix that defines a match value for every possible GCG symbol comparison. (See Section 4, Using Data Files in the User's Guide for more information.) Repeat then slides the sequence along itself in order to generate every register of comparison (diagonal) for the search range you have set. For each diagonal, Repeat slides a window along the pair of sequences. The match values for each pair of symbols within the window are summed to determine a score at each position. When the score under the window is greater than or equal to the set stringency, then the match criterion has been met and the repeat is recorded.
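The window search described above can be sketched in a few lines. This is an illustration rather than Repeat's actual code: it assumes the search range is the largest allowed offset between the two copies, it uses an identity score in place of a scoring matrix file, and it reports every qualifying window separately instead of merging overlapping windows into longer repeats. All names are hypothetical.

```python
def identity_score(a, b):
    """Stand-in for a scoring matrix lookup: 1 for identical symbols, else 0."""
    return 1 if a == b else 0


def find_window_hits(seq, window, stringency, search_range, score=identity_score):
    """Return (start_1, start_2, total) for each window meeting the stringency.

    Coordinates are 1-based. Assumption: search_range is the largest
    offset allowed between the two copies of a repeat.
    """
    hits = []
    n = len(seq)
    # Each offset d is one register of comparison (one diagonal).
    for d in range(1, search_range + 1):
        # Slide the window along this diagonal.
        for i in range(n - d - window + 1):
            total = sum(score(seq[i + k], seq[i + d + k]) for k in range(window))
            if total >= stringency:
                hits.append((i + 1, i + d + 1, total))
    return hits


# Made-up sequence with one exact 7-base direct repeat, 9 bases apart.
print(find_window_hits("GATTACACCGATTACA", window=7, stringency=7, search_range=10))
# [(1, 10, 7)]
```

Recomputing the sum for every window keeps the sketch short; a running total updated as the window slides would do the same work in fewer operations.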
Before the repeats are presented, they are nibbled from both ends so that the symbol pair on each end has a scoring matrix value at least as great as the minimum match threshold. By default, this threshold is the average positive non-identical comparison value in the matrix; you can reset it with the -PAIr command-line parameter. Because of this nibbling, repeats shorter than the minimum repeat length may be shown.
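The nibbling step can be pictured as trimming low-scoring pairs from each end of an aligned repeat. The sketch below is a hedged illustration, not the program's implementation; the function name, the identity score, and the example strings and coordinates are all invented for this example.

```python
def nibble(top, bottom, start_top, start_bottom, score, threshold):
    """Trim end pairs scoring below threshold.

    Returns (top, bottom, start_top, start_bottom) for the trimmed repeat,
    or None if every pair is nibbled away.
    """
    lo, hi = 0, len(top)
    # Nibble from the left end.
    while lo < hi and score(top[lo], bottom[lo]) < threshold:
        lo += 1
    # Nibble from the right end.
    while hi > lo and score(top[hi - 1], bottom[hi - 1]) < threshold:
        hi -= 1
    if lo == hi:
        return None
    return top[lo:hi], bottom[lo:hi], start_top + lo, start_bottom + lo


def same(a, b):
    """Assumed score: 1 for identical symbols, else 0."""
    return 1 if a == b else 0


# A 10-symbol window with 9 matches whose first pair is a mismatch
# shrinks to a 9-symbol repeat.
print(nibble("CAAAGAAAAA", "GAAAGAAAAA", 1, 12, same, 1))
# ('AAAGAAAAA', 'AAAGAAAAA', 2, 13)
```

This is how a repeat found with a window of 10 can appear in the output with a size of 9. With this score, a threshold of 2 makes the function return None for any input, since no pair can reach the threshold.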
Repeat can show several repeats that are part of the same structure if there is a simple sequence with a repeat period shorter than the minimum repeat length.
Repeat chooses a default minimum stringency that is appropriate for the scoring matrix it reads. If you select a different scoring matrix with the -MATRix command-line parameter, the program will adjust the default minimum stringency accordingly.
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.
REPEAT does not support complete command-line control.
Local Data Files:
-MATRix=repeatdna.cmp assigns the scoring matrix for nucleic acids
-MATRix=blosum62.cmp assigns the scoring matrix for proteins
-LIMit limits the number of repeats written into the output file
-SORt sorts the repeats on quality
-PAIr=5 sets match threshold for displaying "|"
The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.
Local Scoring Matrices
This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Section 4, Using Data Files in the User's Guide.
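The search order for a matrix file named without a directory can be summarized as a short lookup routine. This is only a sketch: how the logical names MyData, GenMoreData, and GenRunData map to real directories depends on your installation, so the mapping is passed in as a plain dictionary, and the function name is hypothetical.

```python
from pathlib import Path

# Logical names searched after the current working directory, in order.
SEARCH_ORDER = ("MyData", "GenMoreData", "GenRunData")


def resolve_matrix(name, logical_dirs, cwd="."):
    """Return the first existing matrix file for name, or None.

    logical_dirs maps logical names to directory paths (an assumption
    made for this sketch; the real mapping is site-specific).
    """
    path = Path(name)
    if path.parent != Path("."):
        # A directory specification was given, so no search is done.
        return path if path.is_file() else None
    directories = [Path(cwd)]
    directories += [Path(logical_dirs[k]) for k in SEARCH_ORDER if k in logical_dirs]
    for directory in directories:
        candidate = directory / path.name
        if candidate.is_file():
            return candidate
    return None
```

A call such as resolve_matrix("mymatrix.cmp", {"MyData": "/some/dir"}) returns the copy in the working directory if there is one, and otherwise falls through the logical names in the order given above.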
By default, Repeat uses the scoring matrix found in either repeatdna.cmp (for nucleotide sequences) or blosum62.cmp (for protein sequences) to find the pairwise match values when determining the stringency of the repeat. You should recognize that stringency is really the sum of the match values (defined in this file) for the symbols compared under the window. The public version of repeatdna.cmp scores a 1 for all IUPAC-IUB nucleic acid ambiguity symbol comparisons where there is ANY overlap between the sets defined by the symbols (see Appendix III). No symbols match the symbols X or N, however. In the public version of blosum62.cmp, the scores for pairwise values for amino acids range from -4 to +11. You can use the Fetch program to create copies of these scoring matrix files in your working directory, where you may modify them to suit your own needs.
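The overlap rule described for the public repeatdna.cmp can be expressed as a small function over the standard IUPAC-IUB ambiguity sets. The sketch below is an approximation under stated assumptions: non-overlapping pairs are given a score of 0, which the text above does not specify, and the function reads no actual matrix file. The names are illustrative.

```python
# Standard IUPAC-IUB nucleotide ambiguity sets (N and X handled separately).
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def match_value(a, b):
    """Return 1 if the two symbols share at least one base, else 0.

    N and X match nothing, as in the public repeatdna.cmp. The value 0
    for non-matching pairs is an assumption of this sketch.
    """
    a, b = a.upper(), b.upper()
    if a in ("N", "X") or b in ("N", "X"):
        return 0
    set_a, set_b = IUPAC_SETS.get(a), IUPAC_SETS.get(b)
    if set_a is None or set_b is None:
        return 0
    return 1 if set(set_a) & set(set_b) else 0


print(match_value("R", "A"))  # 1: R means A or G, which overlaps A
print(match_value("R", "Y"))  # 0: purines and pyrimidines do not overlap
print(match_value("N", "A"))  # 0: N matches nothing
```

Summing this value over the symbol pairs under a window gives the stringency for that window.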
You can set the parameters listed below from the command line.
-MATRix allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.
For more information see the Local Scoring Matrices section.
-SORt sorts the repeats by quality score instead of position so that the longest repeats (those with the highest quality scores) are at the top of the output.
-LIMit limits the output report to the largest repeats. This parameter automatically causes the repeats to be sorted by quality score instead of position. If you use this parameter, the program asks you to specify how many repeats you want to see.
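The combined effect of sorting and limiting can be sketched as follows. This is an illustration only: it assumes the quality score is the stringency value printed beside each alignment, and it keeps tied repeats in position order, which the program may or may not do. The function name is hypothetical, and the tuples are taken from the sample output shown earlier.

```python
def select_repeats(repeats, sort_by_quality=False, limit=None):
    """Order and truncate a list of (start, size, quality) tuples.

    Setting a limit implies sorting by quality, as described above.
    """
    if limit is not None:
        sort_by_quality = True
    if sort_by_quality:
        # Stable sort: repeats with equal quality stay in position order.
        ordered = sorted(repeats, key=lambda r: r[2], reverse=True)
    else:
        ordered = sorted(repeats, key=lambda r: r[0])
    return ordered if limit is None else ordered[:limit]


sample = [(79, 10, 9), (158, 13, 11), (395, 10, 9),
          (937, 11, 10), (965, 12, 11), (981, 9, 9)]
print(select_repeats(sample, limit=3))
# [(158, 13, 11), (965, 12, 11), (937, 11, 10)]
```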
The output from this program has a '|' (vertical bar) between sequence symbols that match. This match display character is added to the output whenever the symbol comparison value for the two symbols in your scoring matrix is greater than or equal to the average positive non-identical comparison value in the matrix. The -PAIr parameter lets you specify a match display threshold appropriate for the scoring matrix you are using.
The repeat nibbling, referred to in the ALGORITHM topic above, uses the threshold value set by -PAIr to decide which symbol pairs should be nibbled away from the ends of each repeat. If you set the pairing threshold too high, all repeats will be nibbled away!
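The match display rule is simple enough to show directly. The sketch below builds the line of '|' characters for one aligned repeat; it assumes an identity score (1 for identical symbols, 0 otherwise) in place of a real scoring matrix, and the function names are made up for illustration.

```python
def match_line(top, bottom, score, threshold):
    """Return the display line: '|' where the pair scores at least threshold."""
    return "".join(
        "|" if score(a, b) >= threshold else " "
        for a, b in zip(top, bottom)
    )


def same(a, b):
    """Assumed score: 1 for identical symbols, else 0."""
    return 1 if a == b else 0


top, bottom = "TGTAATCCCA", "TGAAATCCCA"
print(top)
print(match_line(top, bottom, same, threshold=1))
print(bottom)
```

With a threshold of 1 this reproduces the bar pattern of the first repeat in the sample output. Raising the threshold above the largest value the score can return leaves the line blank, which is the same condition that causes every repeat to be nibbled away.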
Printed: May 27, 2005 14:22
Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.
Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.