WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.
WordSearch uses an algorithm similar to the algorithm of Wilbur and Lipman (Proc. Natl. Acad. Sci. (USA) 80; 726-730 (1983)) to compare one sequence (the query) to any group of sequences. You should think of the comparisons as a set of dot-plots with the query as the vertical sequence and the group of sequences to which the query is being compared as the different horizontal sequences (the search set). The search finds the registers of comparison (diagonals) that have the largest number of short perfect matches (words). The best segment of similarity along each diagonal can be viewed with the program Segments.
What is a Word?
A word is any short sequence (n-mer) where you have set n to some small constant like six or seven. The word GGATGGC is one of the 16,384 possible words of length seven that can be created from an alphabet consisting of the four letters G, A, T, and C. The word QQL is one of the 8,000 possible words of length three that you can make with the 20 letters of the amino acid alphabet.
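The counts quoted above are simply the alphabet size raised to the word length. A minimal sketch in Python; the function names are illustrative and are not part of WordSearch:

```python
def possible_words(alphabet_size: int, n: int) -> int:
    """Number of distinct words of length n over an alphabet."""
    return alphabet_size ** n


def words_in(sequence: str, n: int) -> list[str]:
    """All overlapping n-mers of a sequence, in order of position."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(possible_words(4, 7))     # nucleotide words of length seven
print(possible_words(20, 3))    # amino acid words of length three
print(words_in("GGATGGCA", 7))  # the two 7-mers in an 8-symbol sequence
```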
What is a Word Mask?
The symbols that match between two words need not be contiguous. You can use the characters + and - to define a word mask like ++-++-++. This mask means that matching words should match at positions 1, 2, 4, 5, 7, and 8 and that positions 3 and 6 may or may not match.
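As an illustration of how a mask is applied, the sketch below compares two words only at the + positions. It is a hedged example in Python, not the program's own code, and words_match is a hypothetical helper name:

```python
def words_match(a: str, b: str, mask: str) -> bool:
    """True if a and b agree at every '+' position of the mask.

    Positions marked '-' may or may not match. Case is ignored.
    """
    if not (len(a) == len(b) == len(mask)):
        raise ValueError("words and mask must have the same length")
    return all(
        x.upper() == y.upper()
        for x, y, m in zip(a, b, mask)
        if m == "+"
    )


# Positions 3 and 6 differ, but the mask does not require them to match.
print(words_match("GGATGGCA", "GGCTGTCA", "++-++-++"))
# Position 2 differs and is required to match.
print(words_match("GGATGGCA", "GCATGGCA", "++-++-++"))
```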
What is a Diagonal?
A diagonal is a register of comparison for two sequences -- a path across a surface of comparison where X - Y for every point is a constant. A series of dots along a diagonal represents a segment of similarity between two sequences. Each diagonal can be defined by the constant X - Y for that diagonal. The path up from the origin is numbered zero. The paths above the zero diagonal are negative and the paths below the zero diagonal are positive. The diagonals are then numbered between minus the length of the vertical (query) sequence and plus the length of the horizontal (search set) sequence.
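The numbering can be sketched in a few lines of Python. The function names are illustrative; x is a position on the horizontal (search set) sequence and y a position on the vertical (query) sequence, as in the description above:

```python
def diagonal(x: int, y: int) -> int:
    """Diagonal number for a point of the comparison surface."""
    return x - y


def diagonal_bounds(query_length: int, search_length: int) -> tuple[int, int]:
    """Lowest and highest diagonal numbers, as described in the text."""
    return -query_length, search_length


# A five-symbol segment of similarity starting at x=2301, y=1:
# every point of the segment falls on the same diagonal.
segment = [(2301 + i, 1 + i) for i in range(5)]
print({diagonal(x, y) for x, y in segment})
print(diagonal_bounds(444, 1497))
```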
What is the Output?
WordSearch sorts the scores of all the diagonals in your comparison and shows you a list of the best diagonals where you have restricted the size of the list to some finite number like 50 or 100. You can see optimal alignments of the segments of similarity in the WordSearch output file with the Segments program.
WordSearch compares both strands of your query sequence to any set of sequences you name and shows the best diagonals and the number of symbols within matching words on each of these best diagonals. The diagonals are identified with the coordinate X - Y (described above), the number of symbols within the matching words for that diagonal, the strand of the query sequence, and the name of the search set sequence.
Score Distribution Plot
WordSearch makes a histogram showing the number of diagonals observed for each diagonal score. The histogram shows the distribution of diagonal scores so you can see if a particular diagonal in your list of best diagonals is significant.
Here is a session using WordSearch to find sequences in the GenBank nucleotide sequence database with similarities to a human globin coding sequence.
% wordsearch -PLOt -MASk
(Masked) WORDSEARCH with what query sequence ? ggammacod.seq
Begin (* 1 *) ?
End (* 444 *) ?
Search for query in what sequence(s) (* GenBank:* *) ?
What word-mask (* ++-++-++ *) ?
List how many best diagonals (* 50 *) ?
Integrate how many adjacent diagonals (* 3 *) ?
What should I call the output file (* ggammacod.word *) ?
1 A16SRRNA Len: 1,497
101 AB000354 Len: 607
201 AB001715 Len: 876
8-mers found: 2,000,000,000
Diagonals with words: 154,369,323
Total diagonals: 2,000,000,000
Sequences searched: 552,323
CPU time: 17:06.78
Output file: ggammacod.word
When your LaserWriter attached to tty07 is ready, press <Return>.
WordSearch produces a list file containing the names of sequences that contain the best diagonals in your search and optionally can plot the distribution of scores from the search. Here is some of the output file:
(Masked) (Nucleotide) WORDSEARCH of: ggammacod.seq check: 2906
from: 1 to: 444
July 27, 1994
Symbols: 1 to: 92 from: gamma.seq ck: 6474, 2179 to: 2270
Symbols: 93 to: 315 from: gamma.seq ck: 6474, 2393 to: 2615
Symbols: 316 to: 444 from: gamma.seq ck: 6474, 3502 to: 3630
Human fetal beta globins G and A gamma
from Shen, Slightom and Smithies, Cell 26; 191-203. . . .
TO: GenBank:* Sequences: 552,323 Total-length: 1,036,534,882
October 19, 1998
Database Release Information:
GenBank, Release 108.0, Released on 16Aug1998, Formatted on
EMBL, Release 55.0, Released on 16Jun1998, Formatted on
Word-size: 8 Words: 2000000000 Diagonals: 154,369,323
Integral-width: 3 Alphabet: 4 List-size: 50 CPU minutes: 17.11
Sequence Strd Diag Score Width Documentation ..
GB_PR2:HUMHBGG + 17 442 3 M15386 Human hemoglob ...
GB_PAT:I42109 + -1 440 4 I42109 Sequence 4 fro ...
GB_PR1:HSGGGPHG + -20 378 3 X55656 H.sapiens mRNA ...
GB_PR1:HUMHBBGG + 2300 211 3 M32723 Human G-gamma- ...
GB_PR1:MMGGLINE + 7197 209 3 X53419 M.mulatta gamm ...
GB_PR1:MMGGLINE + 2318 209 3 X53419 M.mulatta gamm ...
If you run WordSearch with -PLOt, it plots a histogram showing the number of diagonals observed with each different score. This plot should help you judge which of the diagonals in your output list are significant and whether the output list was large enough to contain all of the significant diagonals. Here is the score distribution plot from the example session:
By looking at a plot like this one, you can conclude that observations with a score of less than about 80 are probably part of the population of diagonals with only random similarity to ggammacod.seq. (The example has an unusual number of significant similarities arising from the fact that many similar globins have been sequenced.)
You can set the resolution of the score distribution plot with -BINsize. By default, each histogram is integrated into bins that are the size of the word length. For words of length 6, the histograms would normally show the frequency of diagonals with scores from 0 to 5, 6 to 11, 12 to 17, and so forth.
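The binning described above can be sketched as follows. This Python example assumes that bins start at a score of zero, as the ranges in the text suggest; the function names are illustrative:

```python
from collections import Counter


def bin_range(score: int, binsize: int) -> tuple[int, int]:
    """Lowest and highest score that share a histogram bin with `score`."""
    low = (score // binsize) * binsize
    return low, low + binsize - 1


def score_histogram(scores: list[int], binsize: int) -> Counter:
    """Number of diagonals observed in each bin."""
    return Counter(bin_range(s, binsize) for s in scores)


print(bin_range(11, 6))
print(score_histogram([0, 5, 6, 13, 17], 6))
```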
The Histogram Shows Scores for Structures
The histogram shows the scores for diagonals after processing into structures. See the ALGORITHM topic below for a description of how scores accumulate on diagonals and the way scores are grouped into structures before becoming eligible to join the list of best diagonals.
Ideally the list of best diagonals should be large enough to include some diagonals from the high end of the random scores. The list of best diagonals may not have been large enough, however, to show all of the diagonals with significant scores. The cutoff or lowest score in the output list is marked on the "Diagonal Scores" axis with an asterisk (*). Notice that the list size was not large enough to include all of the globin sequences in GenBank.
The end of the histogram with the best observations (highest scores) is magnified into a small plot in the upper-right corner. The inset plot simply expands the vertical axis tenfold so that the number of high-scoring diagonals can be read exactly.
WordSearch accepts either a nucleotide or a protein sequence as input. The function of WordSearch depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.
Segments aligns and displays the segments of similarity found by WordSearch.
If you run Compare with -WORd, the program calculates the points for a dot plot that shows where common words between two sequences occur.
SSearch does a rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). This may be the most sensitive method available for similarity searches. Compared to BLAST and FastA, it can be very slow.
FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST. TFastA does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"
ProfileSearch uses a profile (representing a group of aligned sequences) as a query to search the database for new sequences with similarity to the group. The profile is created with the program ProfileMake. HmmerSearch uses a profile hidden Markov model as a query to search a sequence database to find sequences similar to the family from which the profile HMM was built. Profile HMMs can be created using HmmerBuild.
FastX does a Pearson and Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. FastX translates both strands of the nucleic sequence before performing the comparison. It is designed to answer the question, "What implied protein sequences in my nucleic acid sequence are similar to sequences in a protein database?" TFastX does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences, taking frameshifts into account. It is designed to be a replacement for TFastA, and like TFastA, it is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"
FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.
The query sequence may not be more than 30,000 symbols long. You may not select a list size of more than 1,000 "best" diagonals. The word size should be from 1 to 30. Word searching is subject to many limitations and considerations which are discussed further below.
The Match Criterion
Two words match only if their symbols are identical at every position that is required to match. If a word mask has been set, the required positions need not be contiguous, but each of them must still be identical, except for case: lowercase and uppercase letters are equivalent. There is no scoring matrix and no support for the equivalence of nucleic acid ambiguity codes.
Word Searching Requires some Perfect Identity
The basic assumption of word comparisons is that patterns of similarity have an unusual number of common words (short perfect matches) along a set of closely spaced diagonals. This is often the case for nucleic acid sequences that have diverged recently, but it may not be true for protein comparisons. You should consider this assumption carefully. When two sequences have diverged sufficiently so that an optimal alignment of them has one mismatch for every six bases, then a word comparison with words of length six may not recognize their similarity.
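The last point can be checked with a small worked example. The Python sketch below uses a made-up 36-base sequence and a copy with one mismatch in every block of six bases; the sequences and the helper name are illustrative only. The two sequences are about 83 percent identical, yet they share no word of length six in the same register:

```python
def shared_words_in_register(a: str, b: str, n: int) -> int:
    """Count offsets at which the n-mers of a and b are identical."""
    length = min(len(a), len(b))
    return sum(a[i:i + n] == b[i:i + n] for i in range(length - n + 1))


original = "GATTACAGGCTTAACCGTATGCAGTCCATGGACTTA"  # 36 bases, invented
swap = {"A": "C", "C": "A", "G": "T", "T": "G"}

# Change the last base of every block of six.
diverged = "".join(
    swap[base] if i % 6 == 5 else base
    for i, base in enumerate(original)
)

identical = sum(x == y for x, y in zip(original, diverged))
print(identical, "of", len(original), "bases identical")
print("shared 6-mers:", shared_words_in_register(original, diverged, 6))
print("shared 5-mers:", shared_words_in_register(original, diverged, 5))
```

A shorter word size recovers some matches in this example, at the cost of more chance matches elsewhere in a real search.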
Sequence Simplification May Increase the Level of Perfect Identity
-SIMplify allows you to map the sequences' symbols into a simpler subset of symbols to find matches between categories of sequence symbols.
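The idea can be sketched in Python as follows. The equivalence groups below are hypothetical and chosen only for illustration; they are not the contents of simplify.txt, and the file format itself is not shown here:

```python
# Hypothetical equivalence groups (an assumption, not the GCG table).
GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]

# Map every symbol to the first symbol of its group.
SIMPLIFY = {symbol: group[0] for group in GROUPS for symbol in group}


def simplify(sequence: str) -> str:
    """Replace each symbol by its group representative; others are kept."""
    return "".join(SIMPLIFY.get(s, s) for s in sequence.upper())


# Two different peptides become identical after simplification,
# so a word search would now find perfect matches between them.
print(simplify("MKVLDE"))
print(simplify("IRLIED"))
```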
Queries Containing Repetitive or Simple Sequences
If you use a query sequence containing a mammalian Alu-family sequence, you are in danger of finding the hundreds of Alu-family sequences that have been published to the exclusion of anything else. The ideal query sequence contains no simple (e.g., polyA) or repeated sequences. Ideally the query should be short enough so that any segment of similarity generates an unusual peak on the histogram. If the query is shorter than 500 bases, most of the diagonals are approximately the same length. Short diagonals of similar length increase the probability that word scores from a small segment of similarity are not lost in the background noise.
You might try a word size of six and an integral width of three for nucleic acid searches as suggested by the program's defaults. You should recognize that when the average word occurs in the query sequence more than zero times, the amount of CPU time rises dramatically. You could start with a word size of two for protein sequence comparisons.
A word mask calling for two matches followed by one uncertainty is more sensitive for recognizing protein coding sequences than a simple contiguous word search. You can set up a word mask by including an expression like -MASk=++-++-++, using a plus sign (+) to show the positions where the symbols must match and a minus sign (-) to show the positions where symbols may or may not match. Wobble in the third codon position in the genetic code would make a mask like ++-++ more sensitive than +++++ for recognizing similar coding regions.
It does not make sense to define a mask with leading or trailing - characters, and therefore WordSearch removes these. Defining a word mask suppresses the word size query since the word size is inherent in the mask you have chosen. The word size of the mask ++-++-++ is eight, even though only six of the eight characters under the mask must match.
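A short Python sketch of this bookkeeping, with illustrative function names: leading and trailing - characters are removed, the word size is the length of what remains, and the number of symbols that must match is the number of + characters.

```python
def normalize_mask(mask: str) -> str:
    """Strip leading and trailing '-' characters from a word mask."""
    if set(mask) - {"+", "-"}:
        raise ValueError("a mask may contain only '+' and '-'")
    return mask.strip("-")


def mask_sizes(mask: str) -> tuple[int, int]:
    """Return (word size, number of positions that must match)."""
    trimmed = normalize_mask(mask)
    return len(trimmed), trimmed.count("+")


print(mask_sizes("++-++-++"))
print(normalize_mask("-++-++-"))
```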
The list should be large enough to cover all of the significant scores with at least 10 scores seeming to arise from the high end of the random scores. The default list size of 50 is large enough for most query sequences, but it is not large enough to include all of the globins in the sample session.
Identifying the Search Set
For information about naming groups of sequences, see Section 2, Using Sequence Files and Databases of the User's Guide.
WordSearch is one of the few programs in Accelrys GCG (GCG) that can take more than a few minutes to run. Most comparisons should probably be run in the batch queue. You can specify that this program run at a later time in the batch queue by using -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Section 3, Using Programs in the User's Guide. Very large comparisons may exceed the CPU limit set by some systems.
When WordSearch is run in batch as % wordsearch -PLOt -BATch, instructions for plotting the optional histogram are written to a figure file named wordsearch.figure, unless the plot has been directed to a specific file or graphics device from the command line. Please see the entry for the Figure program in the Program Manual for instructions on how to plot a figure file to any graphics device that GCG supports.
Interrupting a Search: <Ctrl>C
You can type <Ctrl>C to interrupt a search and see the results from the part of the search that has already been completed.
The database programs LookUp, Names, StringSearch, FindPatterns, FastA, TFastA, FastX, TFastX, SSearch, and WordSearch can be used for list refinement if you are looking for sequences with something in common. For instance, you could identify human globin nucleotide sequences with LookUp. The output list from LookUp could then be refined further with FindPatterns to show only those human globin sequences containing EcoRI sites. If you run FindPatterns with -NAMes, you could then do a FastA sequence search on the FindPatterns list file output to see if a sequence you have is similar to any of these EcoRI-containing human globin sequences.
Adding Lists Together
You can add two lists together by simply appending one of the files to the other. It is better if you use a text editor to modify the heading of the combined list so that the annotation in the list correctly reflects what you have done. Remember to delete the text heading from the second file so that it does not occur in the middle of the list.
Suppress any item in a list by typing an exclamation point (!) in front of the item. You can also put comments into a list anywhere on a line by placing an exclamation point before the comment.
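The two rules above amount to ignoring everything from an exclamation point to the end of the line. A minimal Python sketch, assuming the lines passed in are list items only (it does not handle the text heading of a list file); active_entries is a hypothetical helper name:

```python
def active_entries(lines: list[str]) -> list[str]:
    """Return list items with suppressed items and comments removed."""
    entries = []
    for line in lines:
        # Everything from the first '!' onward is ignored.
        text = line.split("!", 1)[0].strip()
        if text:
            entries.append(text)
    return entries


items = [
    "GB_PR2:HUMHBGG",
    "!GB_PAT:I42109",                    # suppressed item
    "GB_PR1:HSGGGPHG  ! checked by hand",  # item with a comment
    "",
]
print(active_entries(items))
```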
WordSearch assembles a list of the best places in your search set to look for similarities to your query sequence. The output is a list file and is therefore suitable for input to any program that allows indirect file specifications. (For information about indirect file specification, see Section 2, Using Sequence Files and Databases of the User's Guide.)
The first part of the output file contains heading information about the parameters of the search, including a definition of the query sequence, the word size, the window of integration, the size of the desired list, the number of symbols found within matching words (after integration), the number of diagonals on which those words were found, the total number of diagonals in the search, and the size of the alphabet of symbols used. Several lines of the WordSearch output file have a specific format; if these lines are altered, the Segments program will not be able to read the file.
The List of Best Diagonals
The second part of the file contains the list of significant diagonals. These diagonals are defined by the following features: the sequence name, the strand (+ or -), the X - Y coordinate that identifies the peak diagonal (Diag), the number of symbols on the diagonal that were within matching words (Score), the width of the structure (Width), and a short line of documentation. All of this information is read by the Segments program. (See the ALGORITHM topic below for a further explanation of the information listed with each significant diagonal.)
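For readers who want to post-process the list themselves, the fields can be pulled apart as sketched below. This Python example assumes the columns are separated by whitespace, as they appear in the excerpt of the sample output; the real file layout may differ, and the names BestDiagonal and parse_diagonal_line are illustrative:

```python
from typing import NamedTuple


class BestDiagonal(NamedTuple):
    sequence: str
    strand: str
    diag: int
    score: int
    width: int
    documentation: str


def parse_diagonal_line(line: str) -> BestDiagonal:
    """Split one line of the list of best diagonals into its fields."""
    name, strand, diag, score, width, *doc = line.split(None, 5)
    return BestDiagonal(
        name, strand, int(diag), int(score), int(width),
        doc[0].strip() if doc else "",  # empty if documentation is suppressed
    )


print(parse_diagonal_line("GB_PAT:I42109 + -1 440 4 I42109 Sequence 4 fro ..."))
```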
The algorithm described below may be referred to as a hash-table/linked-list search. Wilbur and Lipman searches are an example of a class of comparisons that use direct addressing or k-tuple preprocessing to reduce search time.
You set a word size or define a word mask, which implies a word size. Then WordSearch makes up a dictionary of all of the possible words of that size in the query sequence. A second dictionary is compiled for the opposite strand if the query is a nucleic acid sequence. The dictionary has an entry for every possible word. Imagine each word, such as GGATGG, as a number in base four that corresponds to an entry in the dictionary. At each entry, there is a number telling the positions (coordinates) where the word occurs in the query sequence. If the word does not occur, the number at the entry is zero. Then, for each word in the searched sequences, WordSearch just looks up the word in the dictionary to find out if it occurs in the query sequence.
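The dictionary can be pictured with the Python sketch below. It is an illustration under stated assumptions, not the program's implementation: the assignment of base-four digits to G, A, T, and C is arbitrary, each entry holds a list of 1-based positions (an empty list stands in for the zero described above), and only one strand is handled.

```python
# Arbitrary digit assignment for the four-letter alphabet (an assumption).
CODE = {"G": 0, "A": 1, "T": 2, "C": 3}


def word_index(word: str) -> int:
    """Read a word as a number in base four."""
    index = 0
    for symbol in word.upper():
        index = index * 4 + CODE[symbol]
    return index


def build_dictionary(query: str, n: int) -> list[list[int]]:
    """One entry per possible word, listing where it occurs in the query."""
    table: list[list[int]] = [[] for _ in range(4 ** n)]
    for i in range(len(query) - n + 1):
        word = query[i:i + n].upper()
        if all(symbol in CODE for symbol in word):
            table[word_index(word)].append(i + 1)  # 1-based coordinate
    return table


table = build_dictionary("GGATGGA", 3)
print(table[word_index("GGA")])  # positions of GGA in the query
print(table[word_index("CCC")])  # CCC does not occur
```

Looking up a word from a search set sequence is then a single index operation, which is what keeps the search time low.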
If the word from a search set sequence does occur in the query sequence, WordSearch adds the length of the word to the score for the diagonal on which the word occurs. If a word match overlaps another one, only the new symbols are added to the score for the diagonal. For instance, two adjacent word matches of length six would contribute a total of seven to the score for their diagonal.
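The overlap rule can be sketched as follows. This Python example takes the start coordinates of matching words as input and counts each symbol on a diagonal only once; the function name and the input format are assumptions made for illustration:

```python
from collections import defaultdict


def diagonal_scores(matches: list[tuple[int, int]], word_size: int) -> dict[int, int]:
    """Score each diagonal by the number of symbols inside matching words.

    `matches` holds (x, y) start coordinates of words found in both the
    search set sequence (x) and the query (y).
    """
    scores: dict[int, int] = defaultdict(int)
    last_end: dict[int, int] = {}  # last x already counted on each diagonal
    for x, y in sorted(matches):
        diag = x - y
        end = x + word_size - 1
        # Skip symbols already counted by an earlier, overlapping word.
        start = max(x, last_end.get(diag, x - 1) + 1)
        if end >= start:
            scores[diag] += end - start + 1
        last_end[diag] = max(end, last_end.get(diag, end))
    return dict(scores)


# Two words of length six, offset by one symbol on the same diagonal.
print(diagonal_scores([(10, 10), (11, 11)], 6))
```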
The parameter alphabet that appears in the output is the number of symbols that could make up each word. For protein sequences, the alphabet is the number of sequence symbols that were actually used in the query sequence. The alphabet should be four for nucleic acids. Notice that nucleic acid ambiguity codes are not supported by this alphabet and that they confound word comparison! Any word in any search set sequence that contains characters that are not part of the comparison "alphabet" is ignored. U and T are equivalent in nucleic acid sequences however, so DNA patterns may be found in RNA sequences. Uppercase and lowercase sequence symbols are equivalent in all comparisons.
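These equivalences can be sketched in Python as a normalization step followed by a filter. The helper names are illustrative, and the sketch covers only the nucleic acid case described above:

```python
NUCLEOTIDES = frozenset("GATC")


def normalize(sequence: str) -> str:
    """Fold case and treat U as T."""
    return sequence.upper().replace("U", "T")


def usable_words(sequence: str, n: int):
    """Yield (1-based position, word) for words made only of G, A, T, C."""
    seq = normalize(sequence)
    for i in range(len(seq) - n + 1):
        word = seq[i:i + n]
        if set(word) <= NUCLEOTIDES:  # words with ambiguity codes are ignored
            yield i + 1, word


# The ambiguity code N removes every word that overlaps it.
print(list(usable_words("ggauNgcau", 3)))
```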
The Histogram: Score
An array of counters, one for the score on each diagonal, is maintained. Each time a word is found in both the horizontal and vertical sequences, the counter for the diagonal on which it was found is incremented by the number of symbols in the word. After each sequence is searched with the dictionary from the query sequence, the result is an array of numbers that tells how many symbols occur within matching words along each diagonal of the comparison. This array of diagonal counters is referred to as the histogram.
The Histogram is Integrated
To make the search more tolerant of short length differences (gaps) between the query and the sequences in the database to which it is similar, WordSearch combines the scores of a user-defined number of adjacent diagonals and puts the combined score (rounded up) at the center of this "window of integration." Wilbur and Lipman call this region of adjacent diagonals a window-space.
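Integration can be pictured as a sliding-window sum over the diagonal scores. The Python sketch below assumes an odd window width so that the window has an unambiguous center; it does not attempt to reproduce the rounding mentioned above, whose exact rule is not given here:

```python
def integrate(scores: dict[int, int], width: int) -> dict[int, int]:
    """Sum each diagonal with its neighbors in a centered window."""
    if width < 1 or width % 2 == 0:
        raise ValueError("this sketch assumes an odd window width")
    if not scores:
        return {}
    half = width // 2
    integrated = {}
    for center in range(min(scores) - half, max(scores) + half + 1):
        total = sum(
            scores.get(d, 0)
            for d in range(center - half, center + half + 1)
        )
        if total:
            integrated[center] = total
    return integrated


# Scores spread over three adjacent diagonals, as a small gap would produce.
print(integrate({16: 100, 17: 300, 18: 42}, 3))
```

In this invented example the combined score peaks at diagonal 17, where the three contributions are added together.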
Finding the N-Best Diagonals: Structures
After integration, the histogram is searched for a position in which there is a score above the average. A structure is defined as a region of diagonal scores in the integrated histogram from the first above-average score to the last; that is, to where the scores fall back to the average again. If the peak score for a structure is better than the worst score in the list of the N-best diagonals observed so far, then the structure is put in the list and the existing worst observation in the list is discarded. The structure is recorded by recording the file and entry being searched, the coordinate of the diagonal at the center of the peak region rounded up, the peak score (after integration), the width of the structure, and whether the top or bottom strand of the query sequence was being used for the comparison. When all of the files in the horizontal search set have been examined, the list of N-best structures is reported, as shown in the output file above.
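The two steps above, finding structures and keeping the N best, can be sketched in Python. Several simplifications are assumptions of this sketch: the average is taken over the range of diagonals supplied, the peak diagonal is reported instead of the center of the peak region, and a heap is used to hold the N best structures (the data structure WordSearch actually uses is not stated here).

```python
import heapq


def summarize(run: list[tuple[int, int]]) -> tuple[int, int, int]:
    """Return (peak diagonal, peak score, width) for one structure."""
    peak_diag, peak_score = max(run, key=lambda item: item[1])
    return peak_diag, peak_score, len(run)


def find_structures(integrated: dict[int, int]):
    """Yield one summary per run of above-average integrated scores."""
    if not integrated:
        return
    diagonals = range(min(integrated), max(integrated) + 1)
    values = [integrated.get(d, 0) for d in diagonals]
    average = sum(values) / len(values)
    run: list[tuple[int, int]] = []
    for d, v in zip(diagonals, values):
        if v > average:
            run.append((d, v))
        elif run:
            yield summarize(run)
            run = []
    if run:
        yield summarize(run)


def n_best(structures, n: int) -> list[tuple[int, int, int]]:
    """Keep the n structures with the highest peak scores, best first."""
    heap: list[tuple[int, int, int]] = []
    for peak_diag, peak_score, width in structures:
        item = (peak_score, peak_diag, width)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif peak_score > heap[0][0]:  # better than the current worst
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


example = {0: 1, 1: 1, 2: 9, 3: 12, 4: 8, 5: 1, 6: 1, 7: 7, 8: 1}
print(n_best(find_structures(example), 5))
```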
GCG must be configured for graphics before you run any program with graphics output! If the % setplot command is available in your installation, this is the easiest way to establish your graphics configuration, but you can also use commands like % postscript that correspond to the graphics languages GCG supports. See Section 5, Using Graphics in the User's Guide for more information about configuring your process for graphics.
If you need to stop this program, use <Ctrl>C to reset your terminal and session as gracefully as possible. Searches and comparisons write out the results from the part of the search that is complete when you use <Ctrl>C. The graphics device should stop plotting the current page and start plotting the next page. If the current page is the last page, plotters should put the pen away and graphic terminals should return to interactive mode.
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.
Minimal Syntax: % wordsearch [-INfile1=]ggammacod.seq -Default
-BEGin=1 -END=444 sets the range of interest
[-INfile2=]GenBank:* specifies the search set
-WORdsize=6 or -MASk=++-++-++ sets the word size or mask pattern
-LIStsize=50 sets the size of the output list
-INTegrate=3 sets the width of integration window
[-OUTfile=]ggammacod.word names the output file
Local Data Files:
[-SIMplify=]simplify.txt assigns an optional simplification table
-SIMplify[=filename] simplifies sequences using the specified file
-SINce=6.90 limits search to sequences dated on or after June 1990
-LOWscore=10 sets minimum score (from 1 to 100) for diagonal to be listed
-RESORt sorts output list by name instead of score
-NOSHOwfiles suppresses documentation at the end of each line in the output
-PLOt makes a plot of the score distribution
-BINsize=6 sets the resolution of the score distribution plot
-NOMONitor suppresses the screen trace during the search
-NOSUMmary suppresses the screen summary at the end of the search
-BATch submits the program to run in the batch queue
All GCG graphics programs accept these and other switches. See the Using Graphics section of the User's Guide for descriptions.
-FIGure[=filename] stores plot in a file for later input to FIGURE
-FONT=3 draws all text on the plot using font 3
-COLor=1 draws entire plot with pen in stall 1
-SCAle=1.2 enlarges the plot by 20 percent (zoom in)
-XPAN=10.0 moves plot to the right 10 platen units (pan right)
-YPAN=10.0 moves plot up 10 platen units (pan up)
-PORtrait rotates plot 90 degrees
The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.
If you use -SIMplify, WordSearch reads the local data file simplify.txt to find the symbol equivalences you want to use. You can specify a simplification table with another name using an expression like -SIMplify=mysimplify.txt. There is more on the subject of sequence simplification in the documentation for the Simplify program.
The simplify.txt file in the public data directory is only appropriate for simplifying protein sequences. You must create your own simplify.txt file to define equivalences for nucleic acid simplifications.
You can set the parameters listed below from the command line.
-WORdsize=6
Sets the size of the word, or n-mer, used in the search. Matches between the query sequence and a sequence in the search set are identified by large numbers of identical words shared between the two sequences.
-MASk=++-++-++
Specifies the word mask used in the search. With a word mask, you use a plus sign (+) to show those positions of the word where the sequence symbols must match and a minus sign (-) to show the positions where symbols may or may not match. The word size is implicitly defined by the size of the mask.
-LIStsize=50
Sets the number of top-scoring entries to save in the output list.
-INTegrate=3
Specifies the number of adjacent diagonals whose word scores are summed together. By summing the scores of adjacent diagonals, the search is tolerant of small gaps between the query sequence and the sequences being searched.
-SIMplify[=filename]
Simplifies the sequences before comparison according to a table of equivalences in the local data file called simplify.txt (see the LOCAL DATA FILES topic above). Many investigators feel that protein sequence pattern recognition for word searching is more sensitive if similar amino acids are treated as equivalent. You can name a file other than simplify.txt.
-SINce=6.90
Limits the search to sequences that have been entered into the database or modified since June 1990. As this is being written, only the EMBL, GenBank, and SWISS-PROT databases support this parameter.
-LOWscore=10
Sets a threshold score, from 1 to 100, at or below which a diagonal cannot be considered.
-RESORt
Causes WordSearch to sort the list of diagonals a second time by sequence name, so that all of the diagonals from the same sequence appear together in the output list. Usually, the diagonal list from WordSearch is shown with the most significant (highest score) diagonal first and diagonals with successively lower scores following. While this is the obvious order, it slows down the Segments display program that has to read each sequence in the list to make the display.
-NOSHOwfiles
Suppresses the documentation at the end of each line in the output list.
-PLOt
Makes a plot showing the distribution (frequency) of diagonal scores. The score distribution plot is useful for determining if a score in the output list is significant. You must have a plotter or graphic screen to use this parameter. The score distribution plot is described in detail above.
-BINsize=6
Sets the resolution of the score distribution plot (how many scores will be reported in each bin of the histogram).
-MONitor
This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.
-SUMmary
Writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.
You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.
-BATch
Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.
The parameters below apply to all GCG graphics programs. These and many others are described in detail in Section 5, Using Graphics of the User's Guide.
-FIGure[=filename]
Writes the plot as a text file of plotting instructions suitable for input to the Figure program instead of sending it to the device specified in your graphics configuration.
-FONT=3
Draws all text characters on the plot using Font 3 (see Appendix I).
-COLor=1
Draws the entire plot with the pen in stall 1.
The parameters below let you expand or reduce the plot (zoom), move it in either direction (pan), or rotate it 90 degrees (rotate).
-SCAle=1.2
Expands the plot by 20 percent by resetting the scaling factor (normally 1.0) to 1.2 (zoom in). You can expand the axes independently with -XSCAle and -YSCAle. Numbers less than 1.0 contract the plot (zoom out).
-XPAN=10.0
Moves the plot to the right by 10 platen units (pan right).
-YPAN=10.0
Moves the plot up by 10 platen units (pan up).
-PORtrait
Rotates the plot 90 degrees. Usually, plots are displayed with the horizontal axis longer than the vertical (landscape). Note that plots are reduced or enlarged, depending on the platen size, to fill the page.
Printed: May 27, 2005 15:06
Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.
Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.