StringSearch identifies sequences by searching for character patterns such as "globin" or "human" in the sequence documentation.
Annotations and Definitions
In addition to the actual sequence data, GCG databases contain two additional types of data: sequence annotations, and definitions.
The annotations contain the complete documentation for each entry in the sequence database, including journal and author names, sequence features, comments, etc. The annotations appear at the top of sequences copied from a GCG database with the Fetch program.
The definitions contain a minimal amount of the annotations documentation for each entry: the name of the organism, the name of the gene, the sequence length, and usually the date. Definitions for the GenBank, EMBL, and SWISS-PROT databases also contain the primary accession number for the sequences.
The StringSearch program searches through either the definitions alone or the complete sequence annotations for text patterns that you specify. Annotations take much longer to search than definitions.
Searching Sequence Definitions
The expression % stringsearch GenBank:* human finds every entry in the GenBank sequence database whose definition contains the text pattern human. The databases available in addition to GenBank are EMBL, SWISS-PROT, and PIR-Protein. GenEMBL specifies the sequences in both GenBank and EMBL. Additionally, definition searches can be done on any of the individual divisions of GenBank or EMBL. If you believe that a published human sequence in the database is 1,531 bases long, you can search for entries that contain both human and 1531.
When searching definitions, you can specify the set of sequences you want to search in the same way as for all other Accelrys GCG (GCG) programs, with one exception: the specified sequences must be contained in a database, so you cannot search the definitions of user sequences. For instance, the specification Primate:hum* searches through the definitions of all the sequences in the Primate division of GenBank whose names begin with the pattern hum. You may also specify the database sequences to search by means of a list file. Each sequence in a list file must be preceded by a logical name for one of the databases or database divisions. Sequence specification is described in detail in Section 2, Using Sequence Files and Databases, of the User's Guide.
Searching Complete Sequence Annotations
When you are searching complete sequence annotations, you can specify the set of sequences you want to search in the same way as for all other GCG programs. Sequence specification is described in detail in Section 2, Using Sequence Files and Databases of the User's Guide.
If your sequence specification is not preceded by a logical name, StringSearch looks in all of the databases and in all of the GCG data files to find all possible entry names. The specification GenBank:hum* searches only GenBank for sequences whose names begin with hum, while hum* searches every database, including GenBank, and all GCG data files. A search of all the entries in all the databases takes a very long time.
Special Considerations for Searching
Keep in mind that filenames are case sensitive and database entry names are case insensitive. Because this program searches for both filenames and database entry names, you must take care when you enter the character pattern that makes up your specification.
For example, if you entered Gamma* as a file specification, this program would find all entries in the databases whose names begin with Gamma, but it would find no GCG-supplied files, because all the files in GCG are named using lowercase letters. Conversely, if you entered gamma*, this program would find all of the entries in the databases and all GCG-supplied files whose names begin with gamma.
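The case rule can be illustrated with a short Python sketch. This is not GCG code: the function name and the sample names are hypothetical, and the sketch only models the behavior described here, in which file names are matched exactly as typed and database entry names are matched without regard to case.

```python
from fnmatch import fnmatchcase


def matching_names(spec, file_names, entry_names):
    """Return (files, entries) matched by a wildcard specification.

    Assumption for illustration: file names are compared case-sensitively,
    entry names are compared after lowercasing both sides.
    """
    files = [f for f in file_names if fnmatchcase(f, spec)]
    entries = [e for e in entry_names if fnmatchcase(e.lower(), spec.lower())]
    return files, entries


# Hypothetical names, chosen only to show the rule.
files = ["gamma.dat", "gammaglobin.txt"]
entries = ["Gamma1", "GAMMA2", "Beta1"]

print(matching_names("Gamma*", files, entries))  # entries only, no files
print(matching_names("gamma*", files, entries))  # entries and files
```

With the capitalized specification, the lowercase file names are skipped while the entry names still match. With the lowercase specification, both groups match.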
Searching for More Than One Pattern
You can search for more than one text pattern by answering the program prompt with a comma-separated list such as Human,Globin. StringSearch then finds all the entries that contain both human and globin. To show all the entries that contain either human or globin instead, use -MATch=OR.
Blank spaces are removed from the beginning of each pattern unless that pattern is enclosed in double quotes. For instance, specifying the pattern Globin shows all entries that contain globin, while specifying " Globin" excludes entries containing terms like myoglobin in which globin is not preceded by a space.
To specify a double quote (") as part of a pattern, use two double quote marks (""). To specify a comma as part of a pattern, enclose the whole pattern in quotes.
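The quoting rules above can be summarized in a small Python sketch. It is an illustration only, not the parser StringSearch actually uses. The function names are hypothetical, and the handling of cases the text does not cover (trailing blanks, empty patterns) is an assumption.

```python
def split_patterns(response):
    """Split a prompt response such as 'Human,Globin' into patterns.

    Rules modeled from the text: commas separate patterns, leading blanks
    are dropped unless the pattern was quoted, a doubled quote ("") stands
    for one literal quote, and a comma inside quotes is part of the pattern.
    """
    def finish(chars, quoted):
        text = "".join(chars)
        return text if quoted else text.lstrip(" ")

    patterns, current = [], []
    in_quotes = was_quoted = False
    i = 0
    while i < len(response):
        ch = response[i]
        if ch == '"':
            if response[i + 1:i + 2] == '"':
                current.append('"')      # doubled quote -> literal quote
                i += 2
                continue
            in_quotes = not in_quotes    # opening or closing quote
            was_quoted = True
        elif ch == "," and not in_quotes:
            patterns.append(finish(current, was_quoted))
            current, was_quoted = [], False
        else:
            current.append(ch)
        i += 1
    patterns.append(finish(current, was_quoted))
    return [p for p in patterns if p]    # assumption: drop empty patterns


print(split_patterns("Human, Globin"))
print(split_patterns('" Globin"'))
print(split_patterns('"a,b",c'))
```

The second call keeps the leading blank because the pattern is quoted, which is what lets " Globin" exclude terms such as myoglobin.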
Here is a session using StringSearch to search for nucleotide sequences with pseudogenes in GenBank:
STRINGSEARCH through what sequence(s) (* GenBank:* *) ?
Do you want to search through:
B) complete sequence annotation
Please choose one (* A *):
Search for what text patterns ? Pseudo
What should I call the output file (* GenBank.strings *) ?
*** Gb_ba:Ab000361 ***
AB000361 Pseudomonas cichorii gene for D-Tagatose 3-epimerase, ...
*** Gb_sts:Ppu85464 ***
U85464 Pseudomyrmex pallidus clone Psd2523CAC trinucleotide ...
Sequences searched: 552323
Sequences with matches: 6237
Patterns sought: Pseudo
Output file: GenBank.strings
The output file from StringSearch is a list file. (See Section 2, Using Sequence Files and Databases of the User's Guide for more information.) Here is what the output from the example session looks like:
! STRINGSEARCH from: GenBank:*
October 22, 1998
! searching for: "pseudo" ..
Gb_ba:Ab000361 AB000361 Pseudomonas cichorii gene for D-Tagatose ...
Gb_ba:Ab001577 AB001577 Pseudomonas sp. DNA for low specificity ...
Gb_ba:Ab001722 AB001722 Pseudomonas stutzeri carbazole catabolic ...
Gb_sts:Ppu85462 U85462 Pseudomyrmex pallidus clone Psd2421AAG tri ...
Gb_sts:Ppu85463 U85463 Pseudomyrmex pallidus clone Psd2427AAG tri ...
Gb_sts:Ppu85464 U85464 Pseudomyrmex pallidus clone Psd2523CAC tri ...
! Sequences searched: 552323
StringSearch takes as input any valid GCG sequence database specification. This may represent a single sequence, for example GenBank:humcyc, but usually you specify multiple sequences, either by using a database specification with an asterisk (*) wildcard, for example GenBank:*, or by using a list file, for example @project.list, that contains the names of database sequences.
When searching complete sequence annotations, you may also search one or more user sequences, using a wildcard asterisk or a list file to specify multiple sequences. To search user sequences in your own directories instead of in the GCG data files directories, you must preface the specification with the path to your sequences, for example: /usr/user/burgess/seqs/*.seq
LookUp identifies sequence database entries by name, accession number, author, organism, keyword, title, reference, feature, definition, length, or date. The output is a list of sequences.
The search is case insensitive.
The database programs LookUp, Names, StringSearch, FindPatterns, FastA, TFastA, FastX, TFastX, SSearch, and WordSearch can be used for list refinement if you are looking for sequences with something in common. For instance, you could identify human globin nucleotide sequences with LookUp. The output list from LookUp could then be refined further with FindPatterns to show only those human globin sequences containing EcoRI sites. If you run FindPatterns with -NAMes, you could then do a FastA sequence search on the FindPatterns list file output to see if a sequence you have is similar to any of these EcoRI-containing human globin sequences.
Adding Lists Together
You can add two lists together by simply appending one of the files to the other. It is better if you use a text editor to modify the heading of the combined list so that the annotation in the list correctly reflects what you have done. Remember to delete the text heading from the second file so that it does not occur in the middle of the list.
Suppress any item in a list by typing an exclamation point (!) in front of the item. You can also put comments into a list anywhere on a line by placing an exclamation point before the comment.
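As a rough illustration of the exclamation-point rule, the Python sketch below reads list-file text and returns only the active sequence names. It is not part of GCG. The function name is hypothetical, and the sketch assumes that heading lines begin with an exclamation point, as they do in the example output earlier in this document.

```python
def read_list(text):
    """Return the active sequence names from list-file text.

    Everything from the first '!' on a line is treated as a comment, so a
    line that starts with '!' (a heading line or a suppressed item)
    contributes nothing.
    """
    names = []
    for line in text.splitlines():
        active = line.split("!", 1)[0].strip()
        if active:
            names.append(active.split()[0])  # first field is the name
    return names


sample = """\
! STRINGSEARCH from: GenBank:*   October 22, 1998
! searching for: "pseudo" ..
Gb_ba:Ab000361   AB000361 Pseudomonas cichorii gene ...
!Gb_ba:Ab001577  AB001577 Pseudomonas sp. DNA ...
Gb_sts:Ppu85464  U85464 Pseudomyrmex pallidus clone ... ! check this one
"""

print(read_list(sample))
```

The suppressed entry Gb_ba:Ab001577 is skipped, and the trailing comment on the last line does not affect the name that is returned.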
You cannot assume that a text pattern search is exhaustive. The text you choose may not have been used by the data collectors. Worse yet, all databases contain errors -- the misspelling psuedo appears 14 times in the definitions for GenBank Release 108.0!
Hyphenation is particularly prone to inconsistent usage. A search for pseudogene would only be complete if pseudo-gene (or pseudo gene) were never used to refer to pseudogenes.
Using a nonspecific pattern such as pseudo to find sequences of pseudogenes will result in many false matches. (In the example session, there were 1570 instances of Pseudomonas, 321 instances of pseudoobscura, and 42 instances of pseudoautosomal out of 6237 total matches.) But restricting the search by setting -MATch=AND and searching for pseudo,gene will miss pseudogene sequences whose definitions use terms like pseudoexon or the prefix pseudo- used with the name of the gene. It's usually better to use a less-specific search pattern and then edit the resulting list file to remove entries that you aren't interested in.
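Editing a long list by hand can be tedious, so the clean-up step might be scripted. The Python sketch below is one possible approach, not a GCG feature: it puts an exclamation point in front of list lines whose documentation contains a term you want to exclude, which suppresses those items without deleting them. The function name and the sample lines are hypothetical.

```python
def suppress_unwanted(lines, unwanted):
    """Prefix '!' to list lines that contain any unwanted term.

    Lines that are blank or already begin with '!' are left unchanged.
    The comparison ignores case.
    """
    result = []
    for line in lines:
        is_entry = bool(line.strip()) and not line.lstrip().startswith("!")
        if is_entry and any(term.lower() in line.lower() for term in unwanted):
            result.append("!" + line)
        else:
            result.append(line)
    return result


lines = [
    '! searching for: "pseudo" ..',
    "Gb_ba:Ab000361   AB000361 Pseudomonas cichorii gene ...",
    "Gb_sts:Ppu85464  U85464 Pseudomyrmex pallidus clone ...",
]

for line in suppress_unwanted(lines, ["Pseudomonas", "pseudoobscura"]):
    print(line)
```

Because the items are suppressed rather than removed, you can restore any of them later by deleting the exclamation point.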
In short, a StringSearch match shows that an entry is available, but the absence of a match does not show that no such entry exists.
The complete annotation search takes a lot of computing, but the search includes a lot of information, such as author and journal names, that is not found in the sequence definitions. You can speed up the search considerably by using a sequence specification like Primate:hum* to look only at the group of sequences in which you really expect the text pattern to be found.
Use the expression % typedata primate:hum* to see some examples of sequence annotations.
StringSearch is one of the few programs in GCG that can take more than a few minutes to run. Searches should probably be run in the batch queue if an entire database is being searched, especially if the complete annotations search (-MENu=B) is chosen. You can specify that this program run at a later time in the batch queue by using -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Section 3, Using Programs in the User's Guide.
If you want this (or any) program to stop so you can read the screen, use <Ctrl>S. Restart the program by using <Ctrl>Q.
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.
Minimal Syntax: % stringsearch [-INfile=]GenBank:* [-STRings=]pseudo -Default
-MENu=a                     selects the sequence documentation to search:
                              A=definitions, B=complete records
[-OUTfile=]GenBank.strings  names the output list file
-MATch=or                   finds entries with any of the patterns specified
-WIDth=100                  limits length of documentation in the output file
-NOHEAding                  suppresses the heading in the output file
-BATch                      submits the program to run in the batch queue
-NOSCReen                   suppresses the screen output
-NOMONitor                  suppresses the '.'s in the screen trace
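To show how these parameters combine on one command line, here is a small Python sketch that assembles an argument list. It only builds and prints the command and does not run StringSearch. The helper name is hypothetical, only parameters listed in this summary are used, and the particular combination shown is an untested illustration.

```python
import shlex


def stringsearch_args(infile, strings, menu=None, match=None,
                      outfile=None, batch=False):
    """Assemble a stringsearch argument list from the documented parameters."""
    args = ["stringsearch", f"-INfile={infile}", f"-STRings={strings}"]
    if menu is not None:
        args.append(f"-MENu={menu}")        # A=definitions, B=complete records
    if match is not None:
        args.append(f"-MATch={match}")
    if outfile is not None:
        args.append(f"-OUTfile={outfile}")
    if batch:
        args.append("-BATch")
    args.append("-Default")
    return args


args = stringsearch_args("Primate:hum*", "pseudo,gene",
                         menu="B", match="OR", batch=True)
print(shlex.join(args))  # shell-quoted so the asterisk is not expanded
```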
You can set the parameters listed below from the command line.
-STRings=pseudo
String pattern or patterns to search for.
-MENu=B
Searches complete entry records (-MENu=B) or just the definition lines of the entries (-MENu=A, the default).
-MATch=OR
When you are looking for more than one text pattern, -MATch=OR sets StringSearch to find sequence entries that contain any one, but not necessarily all, of the text patterns you have specified. -MATch=AND requires that the sequences found contain all of the patterns sought. -MATch=2 requires that each of the sequences found have two of the patterns sought.
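The three matching modes can be sketched in Python as follows. This is an illustration of the described behavior rather than GCG's implementation. The function name is hypothetical, and reading a numeric value as "at least that many patterns" is an assumption, since the text does not say whether the count is a minimum or an exact number.

```python
def entry_matches(text, patterns, match="AND"):
    """Decide whether one entry's documentation satisfies a -MATch setting.

    The comparison ignores case. 'AND' needs every pattern, 'OR' needs at
    least one, and a number needs at least that many (an assumption).
    """
    lowered = text.lower()
    hits = sum(1 for p in patterns if p.lower() in lowered)
    if match == "AND":
        return hits == len(patterns)
    if match == "OR":
        return hits >= 1
    return hits >= int(match)


definition = "Human beta globin gene"
print(entry_matches(definition, ["human", "globin"]))                # AND
print(entry_matches("Human insulin gene", ["human", "globin"], "OR"))
print(entry_matches(definition, ["human", "globin", "pseudo"], 2))
```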
-WIDth=100
StringSearch normally appends a line of documentation after each sequence name in the output list file, starting at the 20th column. Use this parameter to set the length of the documentation. A value of 100 gives lines that are a maximum of 120 characters long. -WIDth=0 suppresses the documentation next to each sequence name completely.
-NOHEAding
Suppresses the heading at the top of the list file that shows the input specification and the time.
-BATch
Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.
-NOSCReen
Suppresses the output on the screen that shows each sequence as it is found. You must direct output to a file for this parameter to work.
-NOMONitor
When searching complete sequence annotations, a dot normally appears on your screen for every 50 complete sequence annotations that are searched without a find. This parameter suppresses the display of the dots.
Printed: May 27, 2005 14:47
Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.
Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.