LOOKUP

LookUp identifies sequence database entries by name, accession number, author, organism, keyword, title, reference, feature, definition, length, or date. The output is a list of sequences.

DESCRIPTION

LookUp uses the Sequence Retrieval System (SRS) created by Dr. Thure Etzold to identify sequences in sequence databases (CABIOS 9(1); 49-57 (1993)). For example, you can find all of the protein sequences published by a particular author or all of the sequences whose annotation contains a particular word.

The expressions you use to find sequences in a database are known as queries. LookUp presents a form on your screen that lets you enter the elements of your query, then finds all the sequences that contain those elements. The output of LookUp is a list file that can be used as input to any GCG program that accepts multiple sequence input.

EXAMPLE

Here is a session with LookUp that finds the sequences in the UniProt database that were published by any author whose last name starts with Smithies.

> lookup

LookUp identifies sequence database entries by name, accession number,

author, organism, keyword, title, reference, feature, definition, length, or

date. The output is a list of sequences.

 Complete the query form below:

                 All text:

               Definition:

                   Author:  smithies <ctrl D>

                  Keyword:

            Sequence name:

         Accession number:

                 Organism:

                Reference:

                    Title:

                  Feature:

 On or after(dd-mmm-yyyy):              On or before(dd-mmm-yyyy):

 Shortest sequence length:                Longest sequence length:

     Inter-field operator:  AND             Form of output list:  Whole Entries

Searching uniprot

 26 entries were found.

 Do you wish to:

   1) write out this list to a file

   2) preview the results

   3) refine the query

   4) choose different libraries

   q) quit

 Please choose one (* 1 *):  1

 What should I call the output file (* lookup.list *) ?

 26 entries were written to "lookup.list"

OUTPUT

LookUp writes a list file naming the sequences which conform to your query. Associated with each sequence in the list file is an ID number. If you use this list file to specify the search set for another session with LookUp (for example with -INfile=@lookup.list), the ID numbers help LookUp quickly find the entries in the database.

 !!SEQUENCE_LIST 1.0

LOOKUP in: uniprot  of: "[SQ-AUT: smithies*]"

 26 entries  March 31, 2005 12:07 ..

UNIPROT_SPROT:B2MG_CANFA ! ID: 88260001

! DE   Beta-2-microglobulin (Fragment).

! GN   B2M.

UNIPROT_SPROT:CYTN_HUMAN ! ID: ef620001

! DE   Cystatin SN precursor (Salivary cystatin SA-1) (Cystain SA-I).

! GN   CST1.
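
If you want to post-process a list file outside of GCG, the entries and their ID numbers can be pulled out with a few lines of code. The following is only a sketch: the function name parse_lookup_list is hypothetical, and it assumes that the header ends at the first line finishing with two periods (as in the example above) and that lines starting with an exclamation point are annotation.

```python
import re

def parse_lookup_list(text):
    """Return (entry specification, ID) pairs from a LookUp list file."""
    entries = []
    in_body = False
    for line in text.splitlines():
        line = line.strip()
        if not in_body:
            # Assumption: everything up to the line ending in ".." is header.
            if line.endswith(".."):
                in_body = True
            continue
        if not line or line.startswith("!"):
            continue  # blank line or annotation comment
        match = re.match(r"(\S+)\s*(?:!\s*ID:\s*(\S+))?", line)
        entries.append((match.group(1), match.group(2)))
    return entries

sample = """\
!!SEQUENCE_LIST 1.0
LOOKUP in: uniprot  of: "[SQ-AUT: smithies*]"
 26 entries  March 31, 2005 12:07 ..
UNIPROT_SPROT:B2MG_CANFA ! ID: 88260001
! DE   Beta-2-microglobulin (Fragment).
! GN   B2M.
UNIPROT_SPROT:CYTN_HUMAN ! ID: ef620001
! DE   Cystatin SN precursor (Salivary cystatin SA-1) (Cystain SA-I).
! GN   CST1.
"""

for spec, entry_id in parse_lookup_list(sample):
    print(spec, entry_id)
```

The ID is returned as a string because the example shows values such as ef620001 that are not decimal numbers.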

INPUT FILES

In most cases the search set for LookUp is an entire database. On the command line, this is specified like -LIBrary=uniprot. Note that this usage is different from that used by other GCG programs, which specify databases with a wildcard expression such as Uniprot:*. Alternatively, the search set can be specified by a list file created by a previous LookUp session. This is done by placing a parameter such as -INfile=@lookup.list on the command line. Any sequences in the list that are not indexed for use with LookUp are ignored.

RELATED PROGRAMS

StringSearch identifies sequences by searching for character patterns such as "globin" or "human" in the sequence documentation. Names identifies GCG data files and sequence entries by name. It can show you what set of sequences is implied by any sequence specification. FindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal.

BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds. NetBLAST searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA. The programs FastA, TFastA, FastX, TFastX, and SSearch can also be used to search databases or sequence sets local to your installation for sequences that are similar to a query sequence.

RESTRICTIONS

You can never be certain that the list of sequences in an output list contains every sequence of interest. Usually this is because of inconsistent annotation within the databases. See the CONSIDERATIONS topic for more information about this problem.

Accelrys GCG (GCG) cuts sequences in GenBank that are longer than 350,000 bases into fragments of 110,000 bases each. Queries that make finds in such fragmented sequences return only the first fragment in the series of fragments. See the VERY LONG SEQUENCES topic for more information.

If you search both protein and nucleotide databases in the same session of LookUp, your output list will usually contain sequences of both types. Most GCG programs that analyze multiple sequences do not support lists of mixed sequence type, and so these lists are not suitable for input to programs such as PileUp, WordSearch, FastA, FrameSearch, etc.

If you use a list file that is the output from another GCG program to specify the search set for LookUp (for example with -INfile=@lookup.list), the program ignores the sequences that are not indexed for use with LookUp.

Most of the advanced features of GCG list files are not supported by LookUp. In particular, you cannot include a reference to another list within a list. You cannot include sequences specified by accession number. You cannot include sequences that are specified ambiguously. For instance, the specification GenBank:Pp* has no meaning to LookUp. Note that the specification Viral:Ppv is also ambiguous, as this could refer either to EM_Vi:Ppv or GB_Vi:Ppv. The specification GB_Vi:Ppv is allowed, since this plum pox virus sequence is indexed for LookUp.

Very ambiguous queries do not always work. For example, if you search for all sequences whose names start with hum, LookUp loops endlessly.

Not all fields are present in every database. For example, PIR does not have a Feature or Date index, and UNIPROT does not have a Title index.

ABOUT DATABASES

What is a Database?

A database is a structured way to represent a group of things that have common attributes. Most sequence databases consist of different fields such as accession number, definition, author, etc., that are filled with appropriate values like U01317, Human beta globin region on chromosome 11, Smithies, etc. Fields are grouped together into larger units referred to as entries. LookUp identifies sequences in GenBank, PIR, and UNIPROT based on the values found in the different fields associated with each sequence entry.

What is an Index?

An indexed database has one or more of its fields organized into a data structure that allows rapid searching. An indexed field is like the index of a book. The subjects in the book are organized alphabetically into an index at the back of the book. When you look up a subject in the index, you will find the page numbers where that subject is mentioned in the body of the book. Likewise in an indexed database, if there were an author index, you could find all of the entries in the database where Smithies is one of the authors just by looking up the name in the index.

What Fields in the Databases Can I Search?

The Sequence Retrieval System (SRS) on which LookUp is based has indices for each of several fields that usually occur in the annotation of a sequence database. These indexed fields are: accession number, author, date, definition, feature, keyword, length, entry name, organism, reference, and title. Each of these indices is described in detail under the INDEXED FIELDS topic below.

NOTE: Use the lookup.config file in the $GCGROOT/etc/seqlab directory to specify the databases that are installed on your machine. Editing this file updates the LookUp menu in SeqLab. For more information on editing lookup.config, refer to the PDF version of the SeqLab Support document.

THE QUERY FORM

The query form has a line for each field that has been indexed for retrieval with LookUp. You can search for values in one or more fields, and LookUp finds all the sequences containing those values. To move the cursor from field to field, use the <Up-Arrow>, <Down-Arrow>, or <Return> keys.

The field on the query form that is labeled "Form of output list" toggles between the values Whole entries (the default) and Fragments when the cursor is positioned in the field and you press the <Space Bar>. You may want to use the Fragments value if you are searching the Features index for a particular feature that occurs in the sequences. LookUp can represent that feature more precisely by showing its beginning and ending positions and, for nucleic acid sequences, the strand. See the FRAGMENT OUTPUT topic below.

WRITING QUERIES

Keep the following guidelines in mind as you write LookUp queries.

Logical Operators

You can type one or more values on each line of the query form. There are three logical operators that let you combine values in different ways:

- AND. Use & to specify AND. A & B means find all entries that contain both A and B.

- OR. Use | to specify OR. A | B means find all entries that contain either A or B.

- BUT-NOT. Use ! to specify BUT-NOT. A ! B means find all entries that contain A but do not contain B. Notice that BUT-NOT is order dependent. A but not B would find a completely different set of entries from B but not A.

Special Case: (C shell only) If you are specifying a query on the command line, and the query expression contains the ! (BUT-NOT) logical operator, you must preface the ! with a backslash (\), for example -AUThor=McDonald\!Strand. (This does not apply to Korn shell users.) If you are using an init file to specify a query, you must enclose any query expression that contains a ! in double quotation marks. In this case, you do not include a backslash before a !, for example -AUThor="McDonald!Strand".

If you type values for more than one index, LookUp finds entries where each field conforms to the values you have typed. This is equivalent to saying that LookUp joins the values on the different lines with the logical operator AND. You can change this to OR by moving the cursor to the field labeled "Inter-field operator" and pressing the <Space Bar>.
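
One way to picture the three operators is as set operations on the entries that each value finds. The sketch below uses Python sets; the entry names E1 through E4 and the author-to-entry mapping are invented for illustration and do not come from any database.

```python
# Hypothetical entries found for each author value.
entries_by_author = {
    "SMITHIES": {"E1", "E2", "E3"},
    "SLIGHTOM": {"E2", "E3", "E4"},
}
a = entries_by_author["SMITHIES"]
b = entries_by_author["SLIGHTOM"]

both = a & b       # A & B : entries that contain both values
either = a | b     # A | B : entries that contain either value
a_not_b = a - b    # A ! B : entries that contain A but not B
b_not_a = b - a    # B ! A : a different set, since BUT-NOT is order dependent

print(sorted(both), sorted(either), sorted(a_not_b), sorted(b_not_a))
```

In the same picture, the default inter-field operator AND corresponds to intersecting the sets found for each field, and switching it to OR corresponds to taking their union.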

Case Insensitivity

All queries are case insensitive. Regardless of whether you type uppercase or lowercase letters, LookUp converts all queries to uppercase.

Parentheses

If more than one logical operator appears in an expression without parentheses, LookUp evaluates the expression from left to right. However, you can group expressions within a query to define the order in which they are performed. Use parentheses to group expressions you want LookUp to evaluate first. For example, when you type Smithies & (Slightom | Blechl) as the value for the Author field, LookUp first searches for sequence entries containing references with Slightom or Blechl as authors. Then, out of those entries it searches for those which also contain Smithies as an author.
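
The effect of left-to-right evaluation and of parentheses can be sketched with a small evaluator. This is an illustration only, not LookUp's implementation: the function name evaluate and the sample data are hypothetical, values are limited to single words, and wildcard extension is ignored.

```python
import re

def evaluate(expression, lookup):
    """Evaluate a query strictly left to right; parentheses group first.

    `lookup` maps an uppercase value to the set of entries containing it.
    """
    tokens = re.findall(r"[()&|!]|[^()&|!\s]+", expression)
    pos = 0

    def operand():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = sequence()
            pos += 1  # skip the closing ")"
            return result
        return lookup.get(token.upper(), set())

    def sequence():
        nonlocal pos
        result = operand()
        # No operator precedence: apply each operator as it is encountered.
        while pos < len(tokens) and tokens[pos] in "&|!":
            op = tokens[pos]
            pos += 1
            right = operand()
            if op == "&":
                result = result & right
            elif op == "|":
                result = result | right
            else:
                result = result - right
        return result

    return sequence()

# Hypothetical author index.
authors = {
    "SMITHIES": {"E1", "E2"},
    "SLIGHTOM": {"E2", "E3"},
    "BLECHL": {"E1", "E4"},
}

print(sorted(evaluate("Smithies & Slightom | Blechl", authors)))
print(sorted(evaluate("Smithies & (Slightom | Blechl)", authors)))
```

Without parentheses the first query is read as (Smithies & Slightom) | Blechl, so it also returns entries where Blechl appears without Smithies. The parenthesized form keeps only entries that cite Smithies.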

Wildcard Extension

LookUp accepts question marks (?) or asterisks (*) as wildcards anywhere within a value. A question mark represents any single character. If you type s?ith, you will retrieve entries containing authors named Smith, Slith, Sjith, etc., but not named Sith.

An asterisk represents zero or more characters. Typing *smith* will retrieve entries with authors named Smith, Hocsmith, Smithies, Hocsmithels, etc. Values with leading wildcards, such as *Smith, significantly reduce the speed of LookUp. Trailing wildcards usually have little effect on performance.

By default, LookUp treats every value in your query as if it ended with an asterisk wildcard. This automatic wildcard extension means that when you type pseudo, LookUp treats it as pseudo* and will retrieve entries containing the patterns pseudo, pseudo-, pseudogene, pseudoknot, etc.

You can turn off automatic wildcard extension for a single value by appending a pound sign (#) to the value, for instance pseudo#. LookUp then will find only those entries where the word pseudo occurs by itself. You can turn off automatic wildcard extension for every value in a query with -NOWILdcardextension. If you have automatic wildcard extension turned off, you can still use * to tell LookUp to extend a particular value. When automatic wildcard extension is turned off, the # character is treated as a literal part of your query.
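
The wildcard rules in this topic can be approximated with Python's fnmatch module, which uses the same meaning for ? and *. This is a sketch of the matching rules as described here, not of how LookUp searches its indices; the function name value_matches is hypothetical, and fnmatch also gives square brackets a special meaning that LookUp is not documented to share.

```python
from fnmatch import fnmatchcase

def value_matches(value, word, auto_extend=True):
    """Return True if a query value matches an indexed word."""
    pattern = value.upper()          # queries are case insensitive
    if auto_extend:
        if pattern.endswith("#"):
            pattern = pattern[:-1]   # "#" suppresses the implied "*"
        else:
            pattern += "*"           # implied trailing wildcard
    # With auto_extend off, "#" stays in the pattern as a literal character.
    return fnmatchcase(word.upper(), pattern)

print(value_matches("s?ith", "Smith"))          # single-character wildcard
print(value_matches("s?ith", "Sith"))           # "?" must match one character
print(value_matches("*smith*", "Hocsmithels"))  # leading and trailing "*"
print(value_matches("pseudo", "pseudoknot"))    # automatic extension
print(value_matches("pseudo#", "pseudoknot"))   # extension turned off by "#"
```

Passing auto_extend=False mimics -NOWILdcardextension: pseudo then matches only the word pseudo, while pseudo* still matches pseudogene.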

Double Quotation Marks

If you are specifying a query on the command line, and the value contains a shell special character, such as a space, #, or &, you must enclose the value in double quotation marks ("), for example -KEYword="ribosomal proteins" or -DEFInition="transport#".

Special Case: If you are specifying a query on the command line, and the value has a comma in it, you must enclose the value in single (') AND double quotation marks ("), for example -AUThor='"Slightom,J.L."'. (Within init files however, use only the double quotation marks.) If you specify a query where the value has a comma or dash in it, you must enclose the value in double quotation marks ("), for example -AUThor="Slightom,J.L.".
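
To see what these quoting rules do, it can help to look at the arguments a shell would hand to the program. The sketch below uses Python's shlex module, which follows POSIX shell quoting rules; it is only an approximation of your own shell (C shell handling of ! in particular is not modeled). The reading that the inner double quotation marks are meant to reach LookUp itself is an inference from the init file rule above, not something this document states.

```python
import shlex

# Command line as you would type it, using parameters from this document.
command = r"""lookup -KEYword="ribosomal proteins" -AUThor='"Slightom,J.L."'"""

arguments = shlex.split(command)
for argument in arguments:
    print(argument)
```

The double quotation marks around ribosomal proteins are consumed by the shell and keep the two words in one argument. The single quotation marks around the author value are consumed as well, but the double quotation marks inside them are passed through unchanged.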

INDEXED FIELDS

Below is a description of each indexed field on the LookUp query form. Following the example for each index is the parameter you would use to set a value for this index from the command line.

All text: globin & duplication (-ALLtext="globin & duplication")

This index combines most of the other indices, including Author, Definition, Feature, Keyword, Organism, Reference, and Title. If you think a word like globin or duplication might occur in a title, or a definition, or a feature, this index can search all three indices at once without making you type the value for each index separately.

Definition: globin (-DEFInition=globin)

This index contains each word in the definition of every database entry. The words are each indexed separately, without any regard for the order in which they appear. A definition like Human beta globin region on chromosome 11 generates completely independent indices for the words human, beta, globin, region, on, chromosome, and 11. A query value that would likely find this definition would be human & beta & globin. Hyphenated terms, such as beta-globin, are indexed as two separate words.
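
The word-by-word indexing described above can be sketched as a small inverted index. This is an illustration of the idea, not the SRS data structure: the function name build_definition_index is hypothetical, and the second entry (EXAMPLE:One) is invented to show how a hyphenated term is split.

```python
import re
from collections import defaultdict

def build_definition_index(definitions):
    """Map each uppercase word to the set of entries whose definition has it."""
    index = defaultdict(set)
    for entry, definition in definitions.items():
        # Split on white space and hyphens, so beta-globin yields two words.
        for word in re.split(r"[\s-]+", definition.strip()):
            if word:
                index[word.upper()].add(entry)
    return index

definitions = {
    "GB_Pr:Humhbb": "Human beta globin region on chromosome 11",
    "EXAMPLE:One": "Mouse beta-globin gene",  # hypothetical entry
}
index = build_definition_index(definitions)

# Equivalent of the query value: human & beta & globin
hits = index["HUMAN"] & index["BETA"] & index["GLOBIN"]
print(sorted(hits))
```

Because word order is discarded, the query globin & beta & human would find the same entries.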

Author: Slightom (-AUThor=Slightom)

This index contains all of the authors cited in each database entry. Most databases do not use "et al.," so second, third, and fourth authors will usually be present. The index includes the author's surname followed by first and middle initials. No spaces separate the surname and initials. For example, Dr. J. L. Slightom would be indexed as Slightom,J.L. If you do not include the initials, LookUp will find all entries with an author whose surname starts with slightom.

Keyword: ribosomal proteins (-KEYword="ribosomal proteins")

This index contains every keyword in each database entry. Unlike most of the fields in LookUp, keyword values may contain spaces, as in ribosomal proteins. The discipline of assigning keywords differs greatly from database to database; for example, you cannot be sure that both the organism name and enzyme superfamily name will appear in every entry's keyword list. (See the All text index and also the CONSIDERATIONS topic.)

Entry name: Humhbb (-NAMe=Humhbb)

This index contains all of the sequence names in each database entry. These names (referred to as locus names in GenBank) should be unique, so you should not find more than one entry in a database for any name.

Accession number: J00179 (-ACCession=J00179)

This index contains all of the accession numbers in each database entry. While primary accession numbers are supposed to be unique in each database, secondary accession numbers can appear in more than one sequence.

Organism: Homo sapiens (-ORGanism="Homo sapiens")

Most sequence databases name the organism from which the sequence is derived. Organism values may contain spaces. In recent years both EMBL and GenBank have used systematic nomenclature whenever possible. If you want to specify a species name, the genus must precede the species (for example Homo sapiens). Typing just sapiens will find nothing.

The higher-order systematic names like Eukaryota, Animalia, Metazoa, Chordata, Vertebrata, Mammalia, Theria, Eutheria, Primates, Haplorhini, Catarrhini, Hominidae are indexed independently. If your query is not a species name, use only one higher-order systematic name.

Reference: EMBO&6#&523-&1987 (-REFerence="EMBO&6#&523-&1987")

*** The Reference index does not work correctly in this release! ***

Each reference is indexed into four independent subfields: journal name, volume number, beginning page number, and year. You can specify values for any or all of these subfields on this line. The order of the subfield values does not matter.

Journal names are indexed exactly as they appear in each database. If the curators of one database call a journal Nucleic Acids Res., the curators of another call the same journal Nucl. Acids Res., and a third database uses NAR, these differences will be reflected in the indices used by LookUp. Notice, however, that the expression (NAR | NUCL) & 1989 would probably find all of the sequences published in 1989 in Nucleic Acids Research.

The volume is a number less than 1950, the date is a number greater than 1950, and the beginning page is a number followed by a hyphen. If you specify values for more than one subfield, you must join the subfields with logical operators (see Logical Operators in the WRITING QUERIES topic).

For most references, specifying both the volume number and starting page number is definitive.
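
The rules for telling the four subfields apart can be written down as a short classifier. The sketch below only restates the rules given in this topic; the function name classify_reference_value is hypothetical, and because the text does not say how the number 1950 itself is treated, the sketch reports it as ambiguous. Keep in mind the warning above that the Reference index does not work correctly in this release.

```python
def classify_reference_value(value):
    """Guess which Reference subfield a query value belongs to."""
    token = value.strip().rstrip("#")  # "#" only disables wildcard extension
    if token.endswith("-") and token[:-1].isdigit():
        return "beginning page"        # a number followed by a hyphen
    if token.isdigit():
        number = int(token)
        if number < 1950:
            return "volume"
        if number > 1950:
            return "year"
        return "ambiguous"             # 1950 is not covered by the stated rule
    return "journal"

query = "EMBO&6#&523-&1987"
classified = [(value, classify_reference_value(value)) for value in query.split("&")]
for value, subfield in classified:
    print(value, "->", subfield)
```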

Title: globin & duplication (-TITle="globin & duplication")

This index contains all words in the titles of each citation in the databases. Some databases do not include the title for each citation, so failure to find a word that you think occurs does not imply that the reference of interest is not cited in one of the databases (see the Reference index). The words are indexed without regard for the order in which they first appeared. A title like A history of the human fetal globin gene duplication generates independent indices for each separate word: a, history, of, the, human, fetal, globin, gene, and duplication. An expression likely to find this title, if it were present, would be: globin & duplication & history.

This index is not available for UNIPROT.

Feature: cds (-FEAture=cds)

A feature is a region of a sequence that is identified in the feature table of a sequence database. Associated with each feature is a set of words that may include a gene name, a function, an EC number, etc. Every word associated with each feature is indexed independently without regard for order. If you type cds as the value, every coding sequence that is documented by a CDS feature will be found.

You can have the output show where each feature occurs within a sequence by selecting Fragments instead of Whole entries on the query form (see the FRAGMENT OUTPUT topic).

This index is not available for PIR.

On or before: 31-dec-1994 (-LATest=31-dec-1994)

On or after: 1-jan-1994 (-EARliest=1-jan-1994)

A Date index contains the date sequences were entered or updated in each database. With these two fields, you can identify sequences that were updated between any two dates. The format for a date value is DD-MMM-YYYY where D, M, and Y stand for day, month, and year respectively. The English abbreviations for the months of the year are: Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec.

This index is not available for PIR.
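
If you generate date values from a script, a small helper can check the DD-MMM-YYYY format before you pass the values to -EARliest or -LATest. This is a sketch; the function name parse_lookup_date and the sample entry date are hypothetical, and the month table is the list of abbreviations given above.

```python
from datetime import date

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def parse_lookup_date(text):
    """Convert a DD-MMM-YYYY value such as 31-dec-1994 into a date object."""
    day, month, year = text.strip().split("-")
    # MONTHS.index raises ValueError for an unknown month abbreviation.
    return date(int(year), MONTHS.index(month.lower()) + 1, int(day))

earliest = parse_lookup_date("1-jan-1994")
latest = parse_lookup_date("31-dec-1994")
entry_date = parse_lookup_date("15-Jun-1994")  # hypothetical entry date

# An entry is kept when its date falls inside the inclusive range.
print(earliest <= entry_date <= latest)
```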

Shortest sequence length: 10 (-SHOrtest=10)

Longest sequence length: 400 (-LONgest=400)

The lengths of every sequence in each database are indexed. You can use these fields to restrict the sequences found to those with certain lengths, for example between 10 and 400 characters.

LIST INPUT

Normally LookUp's search set consists of all of the sequences in one or more of the sequence databases. If you have a list file created by an earlier session with LookUp, you can use this smaller set of sequences as your search set by adding a parameter like -INfile=@lookup.list to the command line. (An at sign (@) must precede the name of the list file.)

You should use only list files that were created by earlier sessions with LookUp, as the input list file can contain only sequences that have been indexed for searching with LookUp. You cannot add sequences to the list that were not in the libraries when the LookUp indices were created.

Most of the advanced features of GCG list files are not supported by LookUp. In particular, you cannot include a reference to another list within a list. You cannot include sequences specified by accession number. You cannot include sequences that are specified ambiguously. For instance, the specification GenBank:Pp* has no meaning to LookUp. Note that the specification Viral:Ppv is also ambiguous, as this could refer either to EM_Vi:Ppv or GB_Vi:Ppv. The specification GB_Vi:Ppv is allowed, since this plum pox virus sequence is indexed for LookUp.

FRAGMENT OUTPUT

LookUp normally writes a list of sequences defined only by their database and entry names. Each element of such a list refers to the whole entry. The field on the query form that is labeled "Form of output list" toggles between the values Whole entries (the default) and Fragments when the cursor is positioned in the field and you press the <Space Bar>. You may want to use the Fragments value if you are searching the Features index for a particular feature that occurs in the sequences. LookUp can represent that feature more precisely by showing its beginning and ending positions and, for nucleic acid sequences, the strand. Features consisting of separate fragments are listed contiguously in the output list file and share the same Join name.

Here is some fragment output from a query designed to find complete coding regions for genes encoding xanthine dehydrogenase.

LOOKUP in: genbank  of: "[SQ-ALL: complete* & xanthine* &

                                dehydrogenase*] > [SQ-FTS: cds*]"

 6 features  July  3, 1995 14:34 ..

GB_IN:DROXDHA  Begin: 1086 End: 1139 Strand: + Join: DROXDHA-8

GB_IN:DROXDHA  Begin: 2164 End: 4776 Strand: + Join: DROXDHA-8

GB_IN:DROXDHA  Begin: 4839 End: 5981 Strand: + Join: DROXDHA-8

GB_IN:DROXDHA  Begin: 6049 End: 6216 Strand: + Join: DROXDHA-8

GB_IN:DROXDHA  Begin: 6284 End: 6334 Strand: + Join: DROXDHA-8

GB_PR:HSU06117  Begin: 64 End: 4065 Strand: + Join: HSU06117-3

GB_PR:HUMXDH  Begin: 131 End: 4147 Strand: + Join: HUMXDH-2

GB_PR:HUMXDHA  Begin: 58 End: 4059 Strand: + Join: HUMXDHA-2

GB_RO:RATXDHA  Begin: 27 End: 4022 Strand: + Join: RATXDHA-2

If you are querying LookUp from the command line, you can get this form of output with the -FRAgments parameter.

PIR does not support fragment output.

Extracting Features

Each fragment in the fragment output list file is accompanied by Begin, End, Strand, and Join sequence attributes. You can use the Assemble program to extract the features into separate GCG sequence files. All sequences listed contiguously in the list file that share the same Join name are concatenated into a single sequence and the resulting sequence file is given the same name as the Join name. Fragments listed individually are extracted into separate sequence files, each with the same name as the corresponding Join name.

Using the features list file example above, Assemble writes five new sequence files. The first file, called droxdha-8.seg, contains the assembly from the first five sequence segments in the list file. The second file, hsu06117-3.seg, contains the assembly from the sixth sequence segment in the list file. Similarly, the remaining three sequence segments in the list file are extracted into separate sequence files. See the entry for Assemble in the Program Manual for more information about extracting features according to sequence attributes in a list file.
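
The grouping by Join name can be previewed outside of GCG with a short script. The sketch below is not how Assemble works internally: the function name group_by_join is hypothetical, it groups by Join name alone rather than checking that the lines are contiguous, and the lowercase .seg file names for the last three features are inferred from the two names given above.

```python
import re

FRAGMENT = re.compile(
    r"(?P<spec>\S+)\s+Begin:\s*(?P<begin>\d+)\s+End:\s*(?P<end>\d+)"
    r"\s+Strand:\s*(?P<strand>[+-])\s+Join:\s*(?P<join>\S+)"
)

def group_by_join(lines):
    """Collect (begin, end) ranges for each Join name, in order of appearance."""
    groups = {}
    for line in lines:
        match = FRAGMENT.search(line)
        if match:
            ranges = groups.setdefault(match["join"], [])
            ranges.append((int(match["begin"]), int(match["end"])))
    return groups

listing = """\
GB_IN:DROXDHA  Begin: 1086 End: 1139 Strand: + Join: DROXDHA-8
GB_IN:DROXDHA  Begin: 2164 End: 4776 Strand: + Join: DROXDHA-8
GB_IN:DROXDHA  Begin: 4839 End: 5981 Strand: + Join: DROXDHA-8
GB_IN:DROXDHA  Begin: 6049 End: 6216 Strand: + Join: DROXDHA-8
GB_IN:DROXDHA  Begin: 6284 End: 6334 Strand: + Join: DROXDHA-8
GB_PR:HSU06117  Begin: 64 End: 4065 Strand: + Join: HSU06117-3
GB_PR:HUMXDH  Begin: 131 End: 4147 Strand: + Join: HUMXDH-2
GB_PR:HUMXDHA  Begin: 58 End: 4059 Strand: + Join: HUMXDHA-2
GB_RO:RATXDHA  Begin: 27 End: 4022 Strand: + Join: RATXDHA-2
"""

groups = group_by_join(listing.splitlines())
for join_name, ranges in groups.items():
    # Length of the joined feature: sum of the inclusive ranges.
    length = sum(end - begin + 1 for begin, end in ranges)
    print(join_name.lower() + ".seg", len(ranges), "segment(s)", length, "bases")
```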

You can use the Translate program to translate features according to the sequence attributes in a features list file and write each translated sequence into its own GCG sequence file. See the entry for Translate in the Program Manual for more information about translating features according to sequence attributes in a list file.

Features with Ambiguous Begin or End Positions

Use -COMplete to ignore features whose start or end positions are not accurately identified. See the PARAMETER REFERENCE topic for more information.

CONSIDERATIONS

Note that the databases are inconsistent in their annotation. You can find misspellings as well as differences in hyphenation and the type of information entered in fields. As a result, you can never be certain that the list of entries in an output list contains every sequence of interest.

Suppose you wanted to find pseudogenes. You might consider searching nucleotide database entries using pseudo* as the value for the Definition field. But you cannot assume that such a search would be exhaustive. The text you choose may not have been used by every annotator who created an entry containing a pseudogene. If the pseudogene were an incidental part of the entry, the annotator may have noted it only in the feature table. For example, the definition for the sequence GB_Pr:Humhbb, which contains two pseudogenes, is Human beta globin region on chromosome 11.

In addition, be aware that all databases contain spelling errors; the misspelling psuedo occurred 10 times in the definitions of GenBank Release 95.0.

Hyphens in particular are used inconsistently. For example, to find as many entries as possible that are pseudogenes, you should search for pseudo-gene as well as pseudogene. Another example where inconsistent use of hyphens can cause problems is the globin family. GenBank definition lines may contain the terms beta-globin, beta globin, and beta-hemoglobin. One way to deal with this is to specify just one of the words, since LookUp indexes the words on either side of a hyphen separately. You can also use a leading wildcard. For example, if you type *globin, LookUp will retrieve the following members of the globin family: haptoglobin, hemoglobin, haemoglobin, myoglobin, cyanoglobin, plakoglobin, alphaglobin, alpha-globin, alpha-globin-3, alpha-1 globin, beta-min-globin, beta-3-globin, beta-2-globin, beta-H1-globin, beta-B globin, uteroglobin, y2-globin, beta-major globin, zeta-globin, and so on. Note that using leading wildcards significantly reduces the speed of LookUp.

Another consideration is that a value such as pseudo may occur in words other than pseudogene. In addition to pseudogene sequences, your output list may also contain RNA sequences known to form pseudoknots or sequences from the organism Pseudomonas.

VERY LONG SEQUENCES

Future releases of GenBank are expected not to have any sequences longer than 350,000 bases. However as release 10.0 of GCG was being prepared, two sequences longer than 350,000 bases were still present in GenBank proper (GB_Ba:Ecouw67 and GB_Pl:Scchrix) and several dozen such sequences were present in the High Throughput Genome (HTG) division. These sequences are broken into overlapping fragments in GCG. The 372 kilobase sequence Ecouw67, for instance, is divided into four fragments: Ecouw67_0, Ecouw67_1, Ecouw67_2, and Ecouw67_3. Each fragment is 110,000 bases long and overlaps the one following it by 10,000 bases. All of the annotation appears with the first fragment, so LookUp normally returns only the first fragment if your query makes a hit on one of these long sequences. If you are searching for features and you are asking for fragment output, LookUp tries to infer which fragment contains the feature of interest. If a feature you find spans two of the fragments, it will not be represented correctly.
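
The arithmetic behind the fragment scheme can be checked with a short calculation. The sketch below is an assumption about how the boundaries fall, not a description of the GCG code: the function name split_long_sequence is hypothetical, coordinates are taken to be 1-based, and the last fragment is assumed to be shorter than 110,000 bases when the sequence runs out.

```python
def split_long_sequence(length, fragment=110_000, overlap=10_000):
    """Return 1-based (begin, end) ranges of overlapping fragments."""
    step = fragment - overlap  # each fragment starts 100,000 bases after the last
    pieces = []
    start = 0
    while True:
        end = min(start + fragment, length)
        pieces.append((start + 1, end))
        if end == length:
            break
        start += step
    return pieces

# A 372 kilobase sequence such as Ecouw67.
for number, (begin, end) in enumerate(split_long_sequence(372_000)):
    print(f"Ecouw67_{number}", begin, end)
```

Under these assumptions a 372,000 base sequence yields four fragments, which agrees with the four names listed above. A feature running from base 105,000 to base 215,000 would not fit inside any single fragment, which is the situation the last sentence above warns about.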

SUGGESTIONS

Note that the output list can contain any number of entries and may result in an extremely large output list file.

Become familiar with the format of each database by doing a number of simple queries and looking at the output carefully. The topic ANNOTATING LISTS tells you how to display the original records from each database.

Use the # symbol to turn off automatic wildcard extension, thereby reducing the number of entries in your output (see Wildcard Extension in the WRITING QUERIES topic).

If you search both protein and nucleotide databases in a single session, your output list will probably contain sequences of both types. Most GCG programs that do multiple sequence analysis do not support lists of mixed sequence type. For example, mixed lists are not suitable for input to programs such as PileUp, WordSearch, FastA, and FrameSearch. Therefore, if you want to use the output list as input to other GCG programs, you should search protein and nucleotide databases separately.

ANNOTATING LISTS

LookUp normally writes a simple list of sequences identified by entry name and definition. The -ANNotate parameter lets you add other annotation from the original sequence record to each sequence in the list to help you identify the sequence and understand how LookUp processed your query.

The values you can use with this parameter correspond to fields that are indexed for LookUp: ACCession, AUThor, DATe, DEFInition, FEAture, NAMe, KEYword, ORGanism, REFerence, and TITle. For example, -ANNotate=AUThor annotates each sequence in an output list with author names.

If you use -ANNotate=FEAture with Whole entries chosen in the "Form of output list" field of the query form, LookUp includes the whole feature table next to each sequence. This can create large output files. With Fragments chosen in this field, -ANNotate=FEAture includes only the feature of interest.

The date does not appear on a separate line in GenBank, so if you want to see the date for GenBank entries, use -ANNotate=NAMe instead of -ANNotate=DATe. Note that LookUp does not support date or reference annotation for PIR.

Annotated lists, like other lists, are compatible with GCG programs that support multiple sequence specifications.

You can turn off annotation altogether with -NOANNotate. LookUp is much faster with annotation turned off.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % lookup [-ALLtext=]globin -Default

Prompted Parameters:

-LIBrary=pir[,...]        specifies one or more data libraries

-ALLtext=globin           searches all text indices

-DEFInition=globin        searches definition index for one or more words

                            indexed independently, e.g. "Globin & Region"

-AUThor=smithies          searches author index for one or more authors,

                            e.g. "Smithies, O. & Slightom, J.L."

-KEYword=globin           see document before using keywords

-NAMe=hsggl3              searches entry name index

-ACCessionnumber=s12345   searches accession number index

-ORGanism="Homo Sapiens"  searches genus and species index

-REFerence=cell&1981      searches complete reference index

-TITle=history            searches title of citation index

-FEAture=gamma            searches for any word in a feature table

-SHOrtest=100             finds only sequences of length 100 or more

-LONgest=400              finds only sequences of length 400 or less

-EARliest=01-apr-1992     searches for sequences modified on or after

                            specified date

-LATest=30-apr-1992       searches for sequences modified on or before

                            specified date

-MATch=or                 specifies inter-field logic (AND is default)

-OUTfile=lookup.list      names output file for list of sequences

Local Data Files:         None

Optional Parameters:

-NOWILdcardextension      turns off automatic wildcard extension

-INfile=@lookup.list      searches in lookup.list instead of libraries

-ANNotate=feature[,...]   shows fields from original annotation in output

                            acceptable values include: ACCession, AUThor,

                            DATe, DEFInition, FEAture, NAMe, KEYword,

                            ORGanism, REFerence, and TITle

-FRAgments                shows features as fragments instead of whole

                            entries

-COMplete                 shows only features with unambiguous coordinates

-MONitor                  shows databases searched and how many hits found

-ENTries                  prints the entries of the libraries menu and quits

-SRS                      prints the existing SRS libraries and quits

-NORC                     disables use of gendbconfigure:lookuprc
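If you run LookUp from a script, values such as cell&1981 or Homo Sapiens must be quoted so that the shell passes each parameter as a single word. The following Python sketch builds such a command line without running it. The function name is hypothetical, the parameter names are taken from the summary above, and the quoting relies on the standard shlex module (Python 3.8 or later).

```python
import shlex


def build_lookup_command(params: dict, use_defaults: bool = True) -> str:
    """Assemble a shell-safe LookUp command line from name/value pairs.

    Keys are parameter names without the leading hyphen, for example
    "LIBrary" or "REFerence". Nothing is executed here.
    """
    words = ["lookup"]
    for name, value in params.items():
        words.append(f"-{name}={value}")
    if use_defaults:
        words.append("-Default")  # suppresses program interaction
    return shlex.join(words)  # quotes any word the shell would misread
```

For example, build_lookup_command({"LIBrary": "pir", "REFerence": "cell&1981"}) returns the string lookup -LIBrary=pir '-REFerence=cell&1981' -Default, in which the ampersand is protected from the shell.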

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-LIBrary=UniProt,GenBank

Searches the UniProt and GenBank data libraries.

-ALLtext=Globin

Searches all text indices for the word globin. The text indices are Author, Definition, Feature, Keyword, Organism, Reference, and Title. (Note that the Name and Accession Number indices are not included.)

-DEFInition=Globin

Searches for entries whose definition line contains the word globin.

-AUThor=Smithies

Searches for entries derived from publications containing an author whose surname is Smithies.

-KEYword=Globin

Searches for entries that contain the word globin in their KEYWORDS field.

-NAMe=hsggl3

Searches for the sequence entry whose name is HSGGL3. Depending on the database, the name may correspond to the LOCUS name, the ID name, the ENTRY name, etc.

-ACCessionnumber=S12345

Searches for the sequence entry whose accession number is S12345.

-ORGanism="Homo Sapiens"

Searches for any sequence entries deriving from the organism Homo sapiens. The genus and species names are indexed as a unit. If you want to search on the species name alone, you must preface it with a wild card: -ORGanism=*Sapiens.

-REFerence=Cell&1981

Searches for entries reported in the journal Cell in 1981. (This index does not work correctly in this release.)

-TITle=History

Searches for sequences reported in articles whose title contains the word history.

-FEAture=Gamma

Searches for sequence entries whose feature table contains the word gamma.

-SHOrtest=100

Searches for sequences containing 100 or more residues.

-LONgest=400

Searches for sequences containing 400 or fewer residues.

-EARliest=01-apr-1992

Searches for sequence entries that were entered or last modified on or after April 1, 1992.

-LATest=30-apr-1992

Searches for sequence entries that were entered or last modified on or before April 30, 1992.
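Both date parameters take values in dd-mmm-yyyy form, and both bounds are inclusive. The following Python sketch shows one way to read such dates and apply the same test to an entry's modification date. It is an illustration only: the function names are hypothetical, and the sketch assumes three-letter English month abbreviations in any letter case.

```python
from datetime import date

# Three-letter month abbreviations, as in 01-apr-1992.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def parse_gcg_date(text: str) -> date:
    """Convert a dd-mmm-yyyy string into a date object."""
    day, month, year = text.strip().split("-")
    return date(int(year), MONTHS[month.lower()], int(day))


def in_date_range(modified: str, earliest: str = None, latest: str = None) -> bool:
    """Apply inclusive 'on or after' and 'on or before' bounds."""
    when = parse_gcg_date(modified)
    if earliest is not None and when < parse_gcg_date(earliest):
        return False
    if latest is not None and when > parse_gcg_date(latest):
        return False
    return True
```

With earliest set to 01-apr-1992 and latest set to 30-apr-1992, an entry modified on 15-Apr-1992 passes the test and one modified on 01-may-1992 does not.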

-MATch=OR

Specifies the logic to be used to combine index fields (the default is AND).
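The effect of the inter-field operator can be pictured with sets. In the Python sketch below, each field of the query has already produced a set of matching entry names; AND keeps the entries found by every field, and OR keeps the entries found by any field. The function name and the entry names in the example are hypothetical, and the sketch says nothing about how LookUp itself stores or combines its hits.

```python
def combine(field_hits: dict, match: str = "AND") -> set:
    """Combine per-field hit sets with AND (intersection) or OR (union)."""
    hit_sets = list(field_hits.values())
    if not hit_sets:
        return set()
    if match.upper() == "OR":
        return set().union(*hit_sets)
    return set.intersection(*hit_sets)  # AND is the default


# Hypothetical hits for an author field and a keyword field.
hits = {
    "author": {"ENTRY1", "ENTRY2", "ENTRY3"},
    "keyword": {"ENTRY2", "ENTRY3", "ENTRY4"},
}
both = combine(hits)          # entries matching both fields
either = combine(hits, "OR")  # entries matching at least one field
```

Here both contains ENTRY2 and ENTRY3, while either contains all four entries.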

-NOWILdcardextension

LookUp normally treats all values in your query as if they ended with an asterisk wildcard (see Wildcard Extension in the WRITING QUERIES topic). You can suppress this automatic wildcard extension either by adding a # to the end of any value that you do not want extended or by using this parameter to suppress it for all values. With automatic wildcard extension turned off, you must explicitly append an asterisk to make any particular field value wild.
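The rules in the preceding paragraph can be sketched in Python as follows. This is an illustration rather than LookUp's own matching code: the function names are hypothetical, the comparison is assumed to ignore letter case, and the standard fnmatch module stands in for the SRS wildcard matcher (fnmatch also gives special meaning to ? and [, which may differ from LookUp).

```python
from fnmatch import fnmatchcase


def to_pattern(value: str, auto_extend: bool = True) -> str:
    """Turn a query value into a wildcard pattern."""
    value = value.lower()
    if value.endswith("#"):
        return value[:-1]  # a trailing # suppresses extension for this value
    if auto_extend and not value.endswith("*"):
        return value + "*"  # automatic wildcard extension
    return value


def matches(indexed_word: str, value: str, auto_extend: bool = True) -> bool:
    """Test an indexed word against a query value."""
    return fnmatchcase(indexed_word.lower(), to_pattern(value, auto_extend))
```

Under these assumptions, the value smithies finds the hypothetical index word smithies-jones, but smithies# does not. Passing auto_extend=False mimics -NOWILdcardextension, in which case only smithies* finds it.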

-INfile=@lookup.list

LookUp can use a list file created during a previous session with LookUp as the search set. A parameter like the one in this example can be used in place of -LIBrary=UniProt. If both -LIBrary and -INfile are used, the program uses the list file and ignores the library parameter.

-ANNotate=AUThor[,...]

The values you can use with this parameter correspond to fields that are indexed for LookUp: ACCession, AUThor, DATe, DEFInition, FEAture, NAMe, KEYword, ORGanism, REFerence, and TITle. For example, -ANNotate=AUThor annotates each sequence in an output list with authors.

If you have chosen Whole entries for the "Form of output list" field of the form and use -ANNotate=FEAture, LookUp includes the whole feature table next to each sequence. This can create large output files. If you have chosen Fragments for this field and use -ANNotate=FEAture, LookUp includes only the feature of interest.

The date does not appear on a separate line in GenBank, so if you want the date for GenBank entries in your output list, use -ANNotate=NAMe instead of -ANNotate=DATe. Note that PIR entries are not indexed for date.

Annotated lists, like other lists, are compatible with GCG programs that support multiple sequence specifications.

You can turn off annotation altogether with -NOANNotate. LookUp is much faster with annotation turned off.

-FRAgments

LookUp normally writes a list file even if you search for sequences in the Feature index. In addition, it can show the exact locations of most features. If you select Fragments from the "Form of output list" field in the form or use -FRAgments, LookUp will represent features with their beginning and ending coordinates, the strand on which they are found, and whether they are joined to other features appearing below them in the list. These multi-fragment features can be joined together into a new composite sequence with Assemble or Translate.

-COMplete

Some features have starting and ending positions that are beyond the bounds of the sequence data archived in a particular database entry. In the feature tables of GenBank and EMBL, these features are represented with ranges that have a < before the beginning coordinate and/or a > before the ending coordinate. Here is a feature whose beginning lies before the first base stored in the sequence entry:

      CDS             <1. .81

                      /note="gamma globin;  NCBI gi: 386767"

There is no way to represent this ambiguous coordinate in a GCG list file, so it is written as if the coordinate were exact. Here is how the feature is normally represented in the list file:

GBPR:HUMHBG3E  Begin: 1 End: 81 Strand: + Join: HUMHBG3E-2 !Id: 0200...

!      CDS             <1. .81

!                      /note="gamma globin;  NCBI gi: 386767"

If you want programs that read your LookUp output to skip such features, use -COMplete. The ambiguous features still appear in your output, but each is printed with an exclamation point preceding its name so that programs that use the list as input ignore it. Here is how the fragment is represented when you use -COMplete:

! GBPR:HUMHBG3E  Begin: 1 End: 81 Strand: + Join: HUMHBG3E-2 !Id: 0200...

!      CDS             <1. .81

!                      /note="gamma globin;  NCBI gi: 386767"
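A program that reads fragment lists has to skip the lines that begin with an exclamation point and pick the coordinates out of the remaining lines. The Python sketch below does this for lines shaped like the examples above. It is a minimal illustration, not a full list-file reader: the function name is hypothetical, text after an exclamation point is assumed to be a comment, and other parts of a list file (such as header text) are not handled.

```python
import re

# Attribute names as they appear in the fragment lines shown above.
ATTRIBUTE = re.compile(r"(Begin|End|Strand|Join):\s*(\S+)")


def parse_list_line(line: str):
    """Return a dict for a fragment line, or None for a comment or blank line."""
    stripped = line.strip()
    if not stripped or stripped.startswith("!"):
        return None  # commented out, for example by -COMplete
    body = stripped.split("!", 1)[0]  # assumption: "!" starts a trailing comment
    spec, _, rest = body.partition(" ")
    record = {"spec": spec}
    for key, value in ATTRIBUTE.findall(rest):
        record[key.lower()] = int(value) if key in ("Begin", "End") else value
    return record
```

Applied to the two list-file lines shown above, the function returns the specification GBPR:HUMHBG3E with begin 1, end 81, strand +, and join HUMHBG3E-2 for the normal form, and None for the form written with -COMplete.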

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

LookUp prints a period on your screen every time it writes 100 lines into your output file.

RESTRICTIONS

LookUp searches on large EMBL or GenBank sequences may not yield correct results for the following fields: All text, Author, Reference, Earliest, and Latest. Ambiguous LookUp queries for large sequences may not find any hits or can cause LookUp to crash.

EXAMPLE

LookUp may not find correct results when a query like the one below is specified.

            % lookup -LIB=EMBLNEW -ALLTEXT=* -REF=* -DEFAULT

This problem is seen for large sequences which are formatted using embltogcg.

WORKAROUND

Use Dataset+ instead of genbanktogcg or embltogcg to format very large GenBank or EMBL sequences, and then build the LookUp indices from this formatted data.

Printed: June 2, 2005 17:11

Technical Support: support-us@accelrys.com, support-japan@accelrys.com, or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.