SSEARCH


 

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

OUTPUT

INPUT FILES

RELATED PROGRAMS

RESTRICTIONS

ALGORITHM

CONSIDERATIONS

SUGGESTIONS

COMMAND-LINE SUMMARY

LOCAL DATA FILES

PARAMETER REFERENCE


FUNCTION


SSearch does a rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). This may be the most sensitive method available for similarity searches. Compared to BLAST and FastA, it can be very slow.

DESCRIPTION


SSearch uses William Pearson's implementation of the method of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) to search for similarities between one sequence (the query) and any group of sequences of the same type (nucleic acid or protein) as the query sequence.

EXAMPLE


Here is a session using SSearch to identify sequences in the PIR protein sequence database that are similar to a human globin protein sequence:

 
 
% ssearch
 
 SSEARCH with what query sequence ?  ggamma.pep
 
 Removing terminal * from query sequence...
 
                  Begin (* 1 *) ?
                End (*   147 *) ?
 
 Search for query in what sequence(s) (* PIR:* *) ?
 
 Don't show scores whose E() value exceeds: (* 10.0 *):
 
 What should I call the output file (* ggamma.ssearch *) ?
 
          1 Sequences         105 aa searched    PIR1:CCHU
        501 Sequences      93,217 aa searched    PIR1:IHQFT
 
        ///////////////////////////////////////////////////
 
 CPU time used:
       Database scan:  0:10:19.8
Post-scan processing:  0:00: 4.6
      Total CPU time:  0:10:24.6
 Output File: ggamma.ssearch
 
%

OUTPUT


The output from SSearch is a list file, and is suitable for input to any GCG program that allows indirect file specifications. (For information about indirect file specification, see Section 2, Using Sequence Files and Databases of the User's Guide.)

Here is some of the output file:

 
 
!!SEQUENCE_LIST 1.0
 
(Peptide) SSEARCH of: ggamma.pep  from: 1 to: 147  October 16, 1998 12:08
 
TRANSLATE of: gamma.seq check: 6474 from: 2179 to: 2270
      and of: gamma.seq check: 6474 from: 2393 to: 2615
      and of: gamma.seq check: 6474 from: 3502 to: 3630
generated symbols 1 to: 148.
Human fetal beta globins G and A gamma
from Shen, Slightom and Smithies,  Cell 26; 191-203. . . .
 
 TO: PIR:*  Sequences:    109,075  Symbols: 34,814,664
 
 Databases searched:
   NBRF, Release 57.0, Released on 30Jun1998, Formatted on 18Aug1998
 
 Scoring matrix: GenRunData:Blosum50.Cmp
 Variable pamfactor used
 Gap creation penalty: 12  Gap extension penalty: 2
 
Histogram Key:
 Each histogram symbol represents 179 search set sequences
 Each inset symbol represents 17 search set sequences
 z-scores computed from opt scores
 
z-score obs    exp
        (=)    (*)
 
< 20    879      0:=====
  22      8      0:=
  24     15      0:=
  26     21      2:*
  28     63     25:*
  30    153    149:*
  32    433    577:===*
  34   1130   1565:======= *
  36   2412   3213:==============   *
  38   4595   5310:==========================   *
  40   6860   7408:=======================================  *
  42   8728   9055:================================================= *
  44  10204   9988:=======================================================*==
  46  10709  10173:========================================================*===
  48  10286   9740:======================================================*===
  50   9525   8888:=================================================*====
  52   8477   7814:===========================================*====
  54   7042   6674:=====================================*==
  56   5774   5575:===============================*=
  58   4499   4577:=========================*
  60   3812   3708:====================*=
  62   2981   2972:================*
  64   2282   2364:=============*
  66   1778   1868:==========*
  68   1284   1470:========*
  70   1053   1152:======*
  72    791    900:=====*
  74    561    702:===*
  76    485    546:===*
  78    311    424:==*
  80    239    330:=*
  82    212    252:=*
  84    149    200:=*
  86    105    155:*
  88     87    120:*
  90     55     93:*
  92     50     72:*         :=== *
  94     43     55:*         :===*
  96     31     43:*         :==*
  98     23     33:*         :=*
 100     22     26:*         :=*
 102     17     20:*         :=*
 104      9     15:*         :*
 106      7     12:*         :*
 108      7      9:*         :*
 110      4      7:*         :*
 112      5      6:*         :*
 114      7      4:*         :*
 116      1      3:*         :*
 118      1      3:*         :*
>120    850      2:*====     :*=======================================
 
 Smith-Waterman (PGopt): reg.-scaled
 
The best scores are:                                    s-w    z-sc  E(108303)..
 
PIR1:HGCZG
! hemoglobin gamma-G chain - chimpanzee                 971  1317.6  1.5e-66
PIR1:I37025
! hemoglobin gamma-G chain - gorilla                    971  1317.6  1.5e-66
PIR1:HGHUG
! hemoglobin gamma-G chain - human                      971  1317.6  1.5e-66
 
////////////////////////////////////////////////////////////////////////////
 
\\End of List
 
ggamma.pep
PIR1:HGCZG
 
P1;HGCZG - hemoglobin gamma-G chain - chimpanzee
N;Alternate names: hemoglobin gamma-1 chain
C;Species: Pan troglodytes (chimpanzee)
C;Date: 31-May-1996 #sequence_revision 21-Jan-1997 #text_change 14-Nov-1997
C;Accession: I36939; I61853
R;Slightom, J.L.; Chang, L.Y.; Koop, B.F.; Goodman, M. . . .
 
SCORES   z-score: 1317.6 E(): 1.5e-66
Smith-Waterman score: 971;   100.0% identity in 147 aa overlap
 
                     10        20        30        40        50        60
ggamma.pep   MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPK
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HGCZG        MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPK
                     10        20        30        40        50        60
 
                     70        80        90       100       110       120
ggamma.pep   VKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFG
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HGCZG        VKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFG
                     70        80        90       100       110       120
 
                    130       140
ggamma.pep   KEFTPEVQASWQKMVTGVASALSSRYH
             |||||||||||||||||||||||||||
HGCZG        KEFTPEVQASWQKMVTGVASALSSRYH
                    130       140
 
/////////////////////////////////////////////////////////////////////////
 
! Distributed over 1 thread.
!      Start time: Fri Oct 16 11:56:55 1998
! Completion time: Fri Oct 16 12:08:38 1998
 
! CPU time used:
!        Database scan:  0:10:19.8
! Post-scan processing:  0:00:04.6
!       Total CPU time:  0:10:24.6
! Output File: ggamma.ssearch

What is the Output?

The first part of the output file contains a histogram showing the distribution of the z-scores between the query and search set sequences. (See the ALGORITHM topic for an explanation of z-score.) The histogram is composed of bins of size 2 that are labeled according to the higher score for that bin (the leftmost column of the histogram). For example, the bin labeled 24 stores the number of sequence pairs that had scores of 23 or 24.

The next two columns of the histogram list the number of z-scores that fell within each bin. The second column lists the number of z-scores observed in the search and the third column lists the number of z-scores that were expected.

The body of the histogram displays a graphical representation of the score distributions. Equal signs (=) indicate the number of scores of that magnitude that were observed during the search, while asterisks (*) plot the number of scores of that magnitude that were expected.
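The binning rule described above can be sketched in a few lines of Python. This is only an illustration, not SSearch's own code: the names bin_label and histogram_counts are hypothetical, and the treatment of scores at or below 20, above 120, and of non-integer z-scores is an assumption read off the sample histogram.

```python
import math
from collections import Counter


def bin_label(z, low=20, high=120, width=2):
    """Return the histogram row label for one z-score.

    Bins are `width` wide and named after their upper score, so 23 and 24
    both land in the bin labeled 24. The open-ended first and last rows
    are assumptions based on the sample output.
    """
    if z <= low:
        return f"< {low}"
    if z > high:
        return f">{high}"
    return str(width * math.ceil(z / width))


def histogram_counts(z_scores):
    """Count how many z-scores fall into each labeled bin."""
    return Counter(bin_label(z) for z in z_scores)


counts = histogram_counts([19, 23, 24, 25, 46, 47.5, 130])
print(counts["24"], counts["48"])
```

Dividing each count by the number of sequences per symbol reported in the Histogram Key would give the approximate length of the row of equal signs.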

At the bottom of the histogram is a list of some of the parameters pertaining to the search.

Below the histogram, SSearch displays a listing of the best scores. Strand:- after the sequence name in this list indicates that the match was found between the search set sequence and the reverse complement of the query sequence.

Following the list of best scores, SSearch displays the alignments of the regions of best overlap between the query and search sequences. /rev following the query sequence name indicates that the search sequence is aligned with the reverse complement of the query sequence.

This program displays only the region of overlap between the two aligned sequences (plus some residues on either side of the region to provide context for the alignment) unless you use -SHOWall. The display of identities and conservative replacements between the aligned sequences depends on the value of -MARKx. By default (-MARKx=3), the pipe character (|) is used to denote identities and the colon (:) to denote conservative replacements.

INPUT FILES


SSearch accepts a single protein sequence or a single nucleic acid sequence as the query sequence. The search set is either a single sequence or multiple sequences of the same type as the query. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenBank:*. The function of SSearch depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS


FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST.

BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds. NetBLAST searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.

TFastA does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

TFastX does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences, taking frameshifts into account. It is designed to be a replacement for TFastA, and like TFastA, it is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FastX does a Pearson and Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. FastX translates both strands of the nucleic sequence before performing the comparison. It is designed to answer the question, "What implied protein sequences in my nucleic acid sequence are similar to sequences in a protein database?"

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.

ProfileSearch and MotifSearch use a profile (derived from a set of aligned sequences) instead of a query sequence to search a collection of sequences. FindPatterns uses a pattern described by a regular expression to search a collection of sequences. HmmerSearch uses a profile hidden Markov model as a query to search a sequence database to find sequences similar to the family from which the profile HMM was built. Profile HMMs can be created using HmmerBuild.

StringSearch, LookUp, and Names identify sequences by searching the annotation (non-sequence) portions of sequence files or sequence databases.

FastA+ does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA+ may be more sensitive than BLAST+.

BLAST+ searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST+ can produce gapped alignments for the matches it finds. NetBLAST+ searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST+ can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.

TFastA+ does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA+ translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

TFastX+ does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences, taking frameshifts into account. It is designed to be a replacement for TFastA+, and like TFastA+, it is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FastX+ does a Pearson and Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. FastX+ translates both strands of the nucleic sequence before performing the comparison. It is designed to answer the question, "What implied protein sequences in my nucleic acid sequence are similar to sequences in a protein database?"

RESTRICTIONS


The query sequence cannot be longer than 32,000 symbols. You cannot select a list size of more than 1,000 best scores nor view more than 1,000 alignments. The sequence type (nucleic acid or protein) of the query sequence and the search set sequences must match.

For the estimates of statistical significance to be valid, the search set must contain a large sample of unrelated sequences. The statistical estimates will not be calculated at all if there are fewer than 10 sequences in the search set (20 sequences if only one strand is searched).

ALGORITHM


SSearch uses William Pearson's implementation of the method of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) to search for similarities between one sequence (the query) and any group of sequences of the same type (nucleic acid or protein) as the query sequence. This method uses a scoring matrix (containing match/mismatch scores), a gap creation penalty, and a gap extension penalty as scoring criteria to determine the best region of local similarity between a pair of sequences. This score is reported as the Smith-Waterman score.
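As a rough illustration of the scoring criteria named above, the following Python sketch computes a local alignment score with a gap creation penalty and a gap extension penalty. It is a textbook dynamic-programming version, not Pearson's implementation: the function names and the toy match/mismatch scores (+5/-4) are assumptions, standing in for a real scoring matrix such as BLOSUM50. A gap of length k is charged gap_open + gap_extend x (k - 1), the convention this manual attributes to the FastA-family programs.

```python
def simple_score(x, y):
    """Toy substitution score (hypothetical): +5 for a match, -4 otherwise."""
    return 5 if x == y else -4


def smith_waterman_score(a, b, score=simple_score, gap_open=12, gap_extend=2):
    """Return the best local alignment score between sequences a and b."""
    neg_inf = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    # best[i][j]: best local score of an alignment ending at a[i-1], b[j-1]
    best = [[0] * cols for _ in range(rows)]
    # gap_in_a / gap_in_b: best score of an alignment ending in a gap
    gap_in_a = [[neg_inf] * cols for _ in range(rows)]
    gap_in_b = [[neg_inf] * cols for _ in range(rows)]
    top = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Open a new gap (first gapped residue) or extend an existing one.
            gap_in_a[i][j] = max(best[i][j - 1] - gap_open,
                                 gap_in_a[i][j - 1] - gap_extend)
            gap_in_b[i][j] = max(best[i - 1][j] - gap_open,
                                 gap_in_b[i - 1][j] - gap_extend)
            diagonal = best[i - 1][j - 1] + score(a[i - 1], b[j - 1])
            # The floor of 0 is what makes the alignment local.
            best[i][j] = max(0, diagonal, gap_in_a[i][j], gap_in_b[i][j])
            top = max(top, best[i][j])
    return top


print(smith_waterman_score("ACDEFGHIKL", "ACDEFWWGHIKL"))
```

In the usage line, the two sequences differ by a two-residue insertion, so the best local alignment spans all ten shared residues and pays for one gap of length 2.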

After the Smith-Waterman score for a pairwise alignment is determined, SSearch uses a simple linear regression against the natural log of the search set sequence length to calculate a normalized z-score for the sequence pair. (See William R. Pearson, Protein Science 4; 1145-1160 (1995) for an explanation of how this z-score is calculated.)
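The regression step can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure from Pearson (1995) in full: it fits a single least-squares line of score against ln(length) over all sequences, and it assumes the z-scores are scaled to a mean of 50 and a standard deviation of 10, which is consistent with the peak of the sample histogram. The function name is hypothetical.

```python
import math


def length_normalized_z(scores, lengths):
    """Return z-scores from a linear regression of score on ln(length)."""
    xs = [math.log(n) for n in lengths]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual: how far each score lies above the value expected for its length.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, scores)]
    sd = math.sqrt(sum(r * r for r in residuals) / (count - 2))
    # Assumed scaling: mean 50, standard deviation 10.
    return [50.0 + 10.0 * r / sd for r in residuals]


z = length_normalized_z([30, 35, 41, 44, 52, 95],
                        [100, 150, 200, 300, 400, 250])
print([round(v, 1) for v in z])
```

A sequence whose score is far above what its length predicts, like the last one in the usage line, receives the highest z-score.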

The distribution of the z-scores tends to closely approximate an extreme-value distribution; using this distribution, the program can estimate the number of sequences that would be expected to produce, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the E() score.
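The conversion from z-score to E() can be sketched with the standard extreme-value (Gumbel) distribution that has mean 0 and standard deviation 1. The sketch assumes that the reported z-scores are scaled to a mean of 50 and a standard deviation of 10, and that E() is the tail probability multiplied by the number of sequences scored; the function name is hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329


def expect_value(z_score, n_sequences):
    """Estimate how many sequences would reach z_score or better by chance."""
    z = (z_score - 50.0) / 10.0  # undo the assumed 50 + 10z scaling
    e = math.exp(-math.pi / math.sqrt(6.0) * z - EULER_GAMMA)
    # P(Z >= z) = 1 - exp(-e); expm1 keeps precision when e is tiny.
    p = -math.expm1(-e)
    return n_sequences * p


print(expect_value(1317.6, 108303))
```

With the z-score of 1317.6 and the 108,303 sequences shown in the sample output, this gives roughly 1.5e-66, in line with the E() column of that listing.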

When all of the search set sequences have been compared to the query, the list of best scores is printed. If alignments were requested, the alignments are also printed.

In evaluating the E() scores, the following rules of thumb can be used: for searches of a protein database of 10,000 sequences, sequences with E() less than 0.01 are almost always found to be homologous. Sequences with E() between 1 and 10 frequently turn out to be related as well.

CONSIDERATIONS


The Accelrys GCG (GCG) version of SSearch searches using both strands of nucleic acid queries unless you use -ONEstrand. The SSEARCH program distributed with Dr. Pearson's FASTA package searches with one strand only.

The E() scores are affected by similarities in sequence composition between the query sequence and the search set sequence. Unrelated sequences may have "significant" scores because of composition bias.

If there is a database entry that overlaps your query in several places, but there are large gaps between the matching regions, only the best overlap appears in the alignment display.

There are two ways to control the size of the list of best scores. By default, scores are listed until a specific E() value is reached. You may set the value in response to the program prompt or by using -EXPect; otherwise the program uses 10.0 for protein searches and 2.0 for nucleic acid searches. (If you are running the program interactively, it will show no more than 40 scores initially, and ask if you want to see more scores if there are any more that are less than the E() value.)

If you use -LIStsize, the E() value is ignored, and the program will list the number of scores you requested.
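The interaction between the two list controls can be sketched as a small selection function. This is an illustration of the rule as described here, not the program's code: the function name and the (name, E-value) tuples are hypothetical, and whether a score exactly at the cutoff is listed is an assumption.

```python
def select_best_scores(hits, expect=10.0, listsize=None):
    """Pick the hits to list, given (name, e_value) pairs.

    If listsize is given, it wins and the E() cutoff is ignored;
    otherwise hits are listed until the E() cutoff is exceeded.
    """
    ranked = sorted(hits, key=lambda hit: hit[1])  # most significant first
    if listsize is not None:
        return ranked[:listsize]
    # Assumption: a hit exactly at the cutoff is still listed.
    return [hit for hit in ranked if hit[1] <= expect]


hits = [("A", 1.5e-66), ("B", 0.3), ("C", 4.0), ("D", 25.0)]
print(select_best_scores(hits, expect=2.0))
print(select_best_scores(hits, listsize=3))
```

In the usage lines, the first call lists only the hits within the E() cutoff of 2.0, while the second lists the three best hits regardless of their E() values.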

You can control the number of alignments using -NOALIgn and -ALIgn. The program behaves differently depending on whether it is being run noninteractively (in batch or with -Default on the command line) or interactively. In the noninteractive case, the program displays the number of alignments set by -ALIgn. (If this is not present, it shows 40 alignments or the number of scores that were listed, whichever is smaller.) If you run the program interactively, it displays the list of best scores, then asks you how many alignments you want to see. (This prompt does not appear if you use -NOALIgn or -ALIgn.)

Adjusting Gap Creation and Extension Penalties

Unlike other GCG programs, SSearch does not read the default gap creation and gap extension penalties from the scoring matrix file. It uses default gap creation and extension penalties that were empirically determined to be appropriate for the default scoring matrices. If you select a different scoring matrix with -MATRix, you may need to change the gap penalties. The histogram display gives a qualitative view of the quality of fit between the actual distribution of scores and the expected distribution of scores. This information may indicate whether or not suitable gap creation and extension penalties were used for the search. When the histogram shows poor agreement between the actual distribution and the theoretical distribution, you might consider using -GAPweight and/or -LENgthweight to specify higher gap creation and extension penalties, respectively. For example, you might increase the gap creation penalty from 12 to 16 and the gap extension penalty from 2 to 4.

Differences in Applying Gap Extension Penalties

There are two different philosophies on how to penalize gaps in an alignment. One way is to penalize a gap by the gap creation penalty plus the extension penalty times the length of the gap (gapweight + (lengthweight x gap length)). The other way is to use the gap creation penalty plus the extension penalty times the gap length excluding the first residue in the gap (gapweight + (lengthweight x (gap length - 1))).
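The two conventions can be compared directly with a short Python sketch. The function names are hypothetical; the arithmetic follows the two formulas given above. Because the second formula charges one fewer extension, the same gapweight and lengthweight values penalize every gap by exactly one lengthweight less under the second convention.

```python
def gap_cost_full_length(gapweight, lengthweight, gap_length):
    """First convention: every residue in the gap pays the extension penalty."""
    return gapweight + lengthweight * gap_length


def gap_cost_excluding_first(gapweight, lengthweight, gap_length):
    """Second convention: the first residue in the gap pays only the creation penalty."""
    return gapweight + lengthweight * (gap_length - 1)


for length in (1, 3, 10):
    first = gap_cost_full_length(12, 2, length)
    second = gap_cost_excluding_first(12, 2, length)
    print(length, first, second)
```

To get the same gap costs under both conventions, the creation penalty has to be adjusted by one extension penalty: a gapweight of 12 with a lengthweight of 2 in the second convention corresponds to a gapweight of 10 with a lengthweight of 2 in the first.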

"Native" GCG programs, such as Framesearch and Bestfit, handle gap extension penalties the first way, while the FastA-family programs use the second way. Therefore a value for -LENgthweight that gives good results with one of the FastA-family programs may not give equivalent results with a native GCG program, and vice versa.

Increasing Program Speed Using Multithreading

This program is multithreaded. It has the potential to run faster on a machine equipped with multiple processors because different parts of the analysis can be run in parallel on different processors. By default, the program assumes you have one processor, so the analysis is performed using one thread. You can use -PROCessors to increase the number of threads up to the number of physical processors on the computer.

Under ideal conditions, the increase in speed is roughly linear with the number of processors used. But conditions are rarely ideal. If your computer is heavily used, competition for the processors can reduce the program's performance. In such an environment, try to run multithreaded programs during times when the load on the system is light.

As the number of threads increases, the amount of memory required increases substantially. You may need to ask your system administrator to increase the memory quota for your account if you want to use more than two threads.

Never use -PROCessors to set the number of threads higher than the number of physical processors that the machine has -- it does not increase program performance, but instead uses up a lot of memory needlessly and makes it harder for other users on the system to get processor time. Ask your system administrator how many processors your computer has if you aren't sure.

SUGGESTIONS


Identifying the Search Set

If you want to search a single database division instead of an entire database, see the "Using Database Sequences" topic of Section 2, Using Sequence Files and Databases of the User's Guide for a list of the logical names used for the databases and the divisions of each database. The search set can also consist of a group of sequence files that are not in a database. Use a multiple sequence specification to name these. For information about naming groups of sequences for the search set, see the topics "Specifying Files" and "Using Wildcards" in Section 1, Getting Started, and "Using Database Sequences," "Using Multiple Sequence Format (MSF) Files", "Using Rich Sequence Format (RSF) Files", and "Using List Files" in Section 2, Using Sequence Files and Databases of the User's Guide.

Batch Queue

SSearch is one of the few programs in GCG that can take more than a few minutes to run. Most comparisons should probably be run in the batch queue. You can specify that this program run at a later time in the batch queue by using -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Section 3, Using Programs in the User's Guide. Very large comparisons may exceed the CPU limit set by some systems.

Interrupting a Search: <Ctrl>C

You can type <Ctrl>C to interrupt a search and see the results from the part of the search that has already been completed. Because the program is multithreaded, the search may not be interrupted immediately, but will continue until one of the threads finishes processing its data and returns for more data.

COMMAND-LINE SUMMARY


All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % ssearch [-INfile1=]ggamma.pep -Default
 
Prompted Parameters:
 
[-INfile2=]pir:*               specifies the search set
[-OUTfile=]ggamma.ssearch      names the output file
-BEGin=1 -END=148              sets the range of interest
-EXPect=2.0                    lists scores until E() value reaches 2.0
 
Local Data Files:
 
-MATRix=fastadna.cmp           assigns the scoring matrix for nucleic acids
-MATRix=blosum50.cmp           assigns the scoring matrix for proteins
 
Optional Parameters:
 
-PROCessors=2      sets the number of threads devoted to the analysis
                     on a multiprocessor computer
-MINLength=1000    searches only sequences of 1000 or more residues
-MAXLength=5000    searches only sequences of 5000 or fewer residues
-SINce=6.90        limits search to sequences dated on or after June 1990
-ONEstrand         searches using only the top strand of nucleotide queries
-GAPweight=16      sets the gap creation penalty (12 is protein default)
-LENgthweight=4    sets the gap extension penalty (2 is protein default)
-LIStsize=40       shows the best 40 scores (overrides EXPect)
-ALIgn=20          shows the best 20 alignments
-NOALIgn           suppresses sequence alignments
-SHOWall           shows complete sequences in alignment, not just overlaps
-MARKx=3           sets the alignment display mode
-NOHIStogram       suppresses printing the histogram
-LINesize=60       sets number of sequence symbols per line of the alignment
-NODOCLines        suppresses sequence documentation in the alignment
-BATch             submits the program to run in the batch queue
-NOMONitor         suppresses the screen trace for each search set sequence

LOCAL DATA FILES


The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Section 4, Using Data Files in the User's Guide.

SSearch reads a scoring matrix containing the values for every possible match from your working directory or the public database. The files fastadna.cmp (for nucleic acid sequences) and blosum50.cmp (for protein sequences) contain the default values for matches. blosum50.cmp is a BLOSUM50 matrix. You can use the Fetch program to obtain a copy of these files in order to modify them to suit your own needs.

PARAMETER REFERENCE


You can set the parameters listed below from the command line.

-MATRix=mymatrix.cmp

Allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.

For more information see the Local Scoring Matrices section.

-EXPect=2.0

Shows all scores whose E() value is less than 2.0. Ignored if -LIStsize is used.

-PROCessors=2

Tells the program to use 2 threads for the database search on a multiprocessor computer.

-MINLength=1000

Restricts the search to search set sequences that are equal to or longer than 1000 residues.

-MAXLength=5000

Restricts the search to search set sequences that are equal to or shorter than 5000 residues.

-SINce=6.1990

Limits the search to sequences that have been entered into the database or modified since June 1990. As this is being written, only the EMBL, GenBank, and SWISS-PROT databases support this parameter.

-ONEstrand

Searches using only the top strand of a nucleotide query sequence.

-GAPweight=12

Specifies the gap creation penalty that is subtracted from the alignment score whenever a gap is created.

-LENgthweight=2

Specifies the gap extension penalty that is subtracted from the alignment score for each residue added to an existing gap.

-LIStsize=40

Shows the best 40 scores. Overrides -EXPect.

-ALIgn=10

Limits the number of alignments to display in the output file to the 10 best matches in the list. Use -NOALIgn to suppress the sequence alignments in the output file.

-SHOWall

Shows entire sequences in the alignment display, instead of just the best region of overlap and its surroundings.

-MARKx=3

Determines the alignment display mode -- especially the symbols that identify matches and mismatches. The default value, 3, uses a pipe character (|) to show identities and a colon (:) to show conservative replacements. -MARKx=0 uses a colon to show identities and a period (.) to show conservative replacements. -MARKx=1 will not mark identities; instead, conservative replacements are connected with a lowercase x, and non-conservative substitutions are connected with an uppercase X. If -MARKx=2, the residues in the second sequence are shown only if they differ from the first sequence.

Use -MARKx=10 to get aligned sequences in the FastA "parsable" output format. A document describing this format appears after FastA in the Program Manual.

-NOHIStogram

Suppresses printing the histogram.

-LINesize=60

Lets you set the number of sequence symbols in each line of the alignment to any number between 60 and 200.

-NODOCLines

Suppresses the documentation from the search set sequence accompanying the alignment in the output file. Use -DOCLines=5 to copy only five non-blank lines of documentation.

-BATch

Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

-MONitor=500

Monitors this program's progress on your screen. Use this parameter to see this same monitor in the log file for a batch process. If the monitor is slowing down the program because your terminal is connected to a slow modem, suppress it with -NOMONitor.

The monitor is updated every time the program processes 500 sequences or files. You can use a value after the parameter to set this monitoring interval to some other number.

Printed: May 27, 2005 14:44 




Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

www.accelrys.com/bio