FRAMESEARCH

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

OUTPUT

SCORE DISTRIBUTION PLOT

INPUT FILES

RELATED PROGRAMS

ALGORITHM

ALIGNMENT METRICS

CONSIDERATIONS

INCREASING PROGRAM SPEED USING MULTITHREADING

SUGGESTIONS

GRAPHICS

<CTRL>C

COMMAND-LINE SUMMARY

LOCAL DATA FILES

PARAMETER REFERENCE


FUNCTION

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

DESCRIPTION

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program creates the optimal local alignment of the best region of similarity between the protein sequence and all possible codons on each strand of the nucleotide sequence. Because FrameSearch can match the protein to codons in different reading frames of the nucleotide sequence as part of the same alignment, it can identify sequence similarity even when the nucleotide sequence contains reading frame shifts.

In standard sequence alignment programs, you routinely specify gap creation and extension penalties. In addition to these penalties, FrameSearch also allows you to specify a separate frameshift penalty for the creation of gaps that result in reading frame shifts in the nucleotide sequence. (See the ALGORITHM topic for a more detailed explanation of how gaps are penalized.)

By default, the search proceeds as a local alignment between the query sequence and each sequence in the search set. Optionally, you can search using a global alignment procedure where FrameSearch inserts gaps to optimize the alignment between the entire nucleotide sequence and the entire protein sequence.

The search output contains an ordered list of the sequences in the search set that have the highest comparison scores when aligned to the query sequence. The actual alignments for these top-scoring matches are displayed after the list.

You can specify multiple query sequences (such as a list file or a sequence specification using an asterisk (*) wildcard) as input to FrameSearch. The program compares each query sequence separately to the sequences specified in the search set, and it writes a separate output file for each query search. If you use a list file as your query, you can add begin and end sequence attributes to specify the range for each query sequence. For more information about list files, see "Using List Files" in Section 2, Using Sequence Files and Databases in the User's Guide.

EXAMPLE

Here is a session using FrameSearch to find sequences in PIR with similarities to the translation product of the cDNA sequence EST:Z17438.

 
 
% FrameSearch
 
 FRAMESEARCH with what query sequence(s) ? EST:Z17438
 
                  Begin (* 1 *) ?
                End (*   286 *) ?
 
 Search for query in what sequence(s) (* PIR:* *) ?
 
 What is the gap creation penalty (* 8 *) ?
 
 What is the gap extension penalty (* 2 *) ?
 
 What is the frameshift penalty (* 0 *) ?
 
 This program can plot the distribution of alignment search scores graphically.
 Do you want to:
 
     A) Plot to a FIGURE file called "z17438.figure"
     B) Plot graphics on LaserWriter attached to /dev/tty10
     C) Suppress the plot
 
 Please choose one (* A *):
 
 What should I call the output file (* z17438.framesearch *) ?
 
          1 Sequences         105 aa searched    PIR1:CCHU
        101 Sequences      10,862 aa searched    PIR1:CCQFM2
 
        //////////////////////////////////////////////////////
 
    108,901 Sequences  34,781,270 aa searched    PIR4:GNLRL1
    109,001 Sequences  34,799,349 aa searched    PIR4:I54264
 
 Aligning........................................
 
 FIGURE instructions are now being written into z17438.figure.
 
 CPU time used:
        Search time: 13:53:45.3
   Post-search time:  0: 0: 4.4
     Total CPU time: 13:53:49.7
 
 Output File: z17438.framesearch
 
%

OUTPUT

Here is some of the output file:

 
 
!!SEQUENCE_LIST 1.0
  FRAMESEARCH of: GB_EST2:Z17438  check: 2422   from: 1  to: 286
 
LOCUS       Z17438        286 bp    mRNA            EST       10-NOV-1992
DEFINITION  ATTS0012 AC16H Arabidopsis thaliana cDNA clone TAT1B11 5' similar
            to GLYCERALDEHYDE 3-PHOSPHATE DEHYDROGENASE. Swiss-Prot entry
            P25858, mRNA sequence.
ACCESSION   Z17438
VERSION     Z17438.1  GI:16580 . . .
 
 TO: PIR:*  Sequences: 250,417  Total-length: 85,931,133  December 29, 2001 06:30
 
 Databases searched:
        NBRF, Release 70.0, Released on 17Oct2001, Formatted on 14Nov2001
 
 Scoring matrix: GenRunData:blosum62.cmp
 Translation table: GenRunData:translate.txt
 
  Gap creation penalty:      8
 Gap extension penalty:      2
    Frameshift penalty:      0
 
The best scores are:                                                  ..
 
PIR2:JQ1287  glyceraldehyde-3-phosphate dehydrogenase (EC 1.2.1.12)...  363
PIR1:DEIS3C  glyceraldehyde-3-phosphate dehydrogenase (EC 1.2.1.12)...  351
PIR1:DENDG  glyceraldehyde-3-phosphate dehydrogenase (EC 1.2.1.12) ...  333
 
///////////////////////////////////////////////////////////////////////////
 
PIR2:D85788  glyceraldehyde-3-phosphate dehydrogenase A [imported] ...  262
PIR2:H90939  glyceraldehyde-3-phosphate dehydrogenase A [imported] ...  262
PIR1:DEBYG3  glyceraldehyde-3-phosphate dehydrogenase (EC 1.2.1.12)...  262
 
\\End of list
 
        Match display thresholds for the alignment(s):
                    | = IDENTITY
                    : =   2
                    . =   1
 
z17438
JQ1287
 
            Quality:    363             Length:    240
              Ratio:  4.654               Gaps:      2
 Percent Similarity: 98.718   Percent Identity: 97.436
 
                  .         .         .         .         .
       3 GAAATCAAGAAGGCCATCAAGGAGGAATCTGAAGGCAAAATGAAGGGAAT 52
         |||||||||||||||||||||||||||||||||||||||:::||||||||
     261 GluIleLysLysAlaIleLysGluGluSerGluGlyLysLeuLysGlyIl 277
                  .         .         .         .         .
      53 TTTGGGATACTCTGAGGATGATGTTGTGTCTACCGACTTTGTTGGTGACA 102
         ||||||||||...|||||||||||||||||||||||||||||||||||||
     278 eLeuGlyTyrThrGluAspAspValValSerThrAspPheValGlyAspA 294
                  .         .         .         .         .
     103 ACAGGTCAAGCATTTTCGATGCCAAGGCTGGATTGCATTGCATTGAGCGA 152
         ||||||||||||||||||||||||||||||||    ||||||||||||||
     295 snArgSerSerIlePheAspAlaLysAlaGly....IleAlaLeuSerAs 309
                  .         .         .         .         .
     153 CAAGTTTGTGAAGTTGGTGTCATGGTACGACAACGAATGGGGTTACACAG 202
         ||||||||||||||||||||||||||||||||||||||||||||||  ||
     310 pLysPheValLysLeuValSerTrpTyrAspAsnGluTrpGlyTyr..Se 325
                  .         .         .         .
     203 TTCTCGTGTCGTTGACCTTATCGTTCACATGTCAAAGGCC 242
         ||||||||||||||||||||||||||||||||||||||||
     326 rSerArgValValAspLeuIleValHisMetSerLysAla 338
 
///////////////////////////////////////////////////////////////
! CPU time used:
!        Search time: 13:53:45.3
!   Post-search time:  0: 0: 4.4
!     Total CPU time: 13:53:49.7
 

The FrameSearch output is an ordered list of those sequences with the highest alignment scores when compared to the query sequence. It reports each high-scoring sequence name along with a short line of sequence documentation and the alignment score. If /rev follows the sequence name, the match is to the reverse-complement strand of the nucleotide sequence.

By default, each line of the output list has space for 70 characters, including the sequence name and documentation. You can increase this space for documentation that accompanies each reported sequence by specifying a larger number with -LINesize.

Following the list of best scores, FrameSearch displays the optimal alignments between the query sequence and the top-scoring sequences in the search list. The alignment output displays sequence similarity by printing one of three characters between a codon and an amino acid: a pipe character (|), a colon (:), or a period (.). Normally, a pipe character is put between a codon and an amino acid when the translated codon is identical to the amino acid. A colon is put between a codon and an amino acid when the comparison value between the translated codon and the amino acid is greater than or equal to the average positive non-identical comparison value in the amino acid substitution matrix. A period is put between a codon and an amino acid when the comparison value between the translated codon and the amino acid is greater than or equal to 1. You can change these match display thresholds by specifying -PAIr. (See Appendix VII for more information about comparison values in scoring matrices.)
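
The display rule described above can be sketched in Python. This is only an illustration, not the program's own code: the function name match_symbol is hypothetical, and the default thresholds of 2 and 1 are taken from the "Match display thresholds" lines in the example output.

```python
def match_symbol(translated_codon, amino_acid, score,
                 colon_threshold=2, period_threshold=1):
    """Return the character printed between a codon and an amino acid.

    translated_codon and amino_acid are one-letter amino acid codes;
    score is their comparison value in the substitution matrix.
    The default thresholds are the ones shown in the example output
    (an assumption for this sketch; -PAIr changes them).
    """
    if translated_codon == amino_acid:
        return "|"          # identity
    if score >= colon_threshold:
        return ":"          # strong non-identical similarity
    if score >= period_threshold:
        return "."          # weak similarity
    return " "              # no symbol printed


print(match_symbol("K", "K", 5))   # identical residues
print(match_symbol("M", "L", 2))   # meets the colon threshold
print(match_symbol("S", "T", 1))   # meets only the period threshold
```

In the example alignment, the ATG (Met) codon aligned with Leu is marked with colons and the TCT (Ser) codon aligned with Thr is marked with periods, which is consistent with comparison values of 2 and 1 under these thresholds.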

The FrameSearch output file can be used as a list file for input to other Accelrys GCG (GCG) programs.

If you specify multiple query sequences as input (see the INPUT FILES topic), FrameSearch writes a separate text output file for each query sequence used to search the search set.

SCORE DISTRIBUTION PLOT

By default, FrameSearch plots a histogram showing the number of sequence comparisons with each different score. This plot can help you judge which of the sequences in your output list are significant and whether the output list was large enough to contain all of the significant scores. Here is the score distribution plot from the example session:

By looking at this plot, you can conclude that comparisons with a score of less than about 65 are probably part of the population of sequences with only random similarity to EST:Z17438.

If you specify multiple query sequences as input (see the INPUT FILES topic), or you add either -BATch or -Default, the score distribution plot for each query search is written to its own Figure file. Each Figure file is named after the query sequence and given the .figure file name extension. You can then use the Figure program to display any of the score distribution plots on the supported graphics device of your choice.

INPUT FILES

The input to FrameSearch is one or more query sequences and one or more search set sequences. If the query input is one or more nucleotide sequences, the program will search a set of protein sequences; if the query input is one or more protein sequences, the program will search a set of nucleotide sequences. If the query input contains both nucleotide and protein sequences, the program will skip those query sequences that are not of the same type as the first sequence in the group. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenBank:*.

If you use a list file to specify query sequences, you can add begin and end sequence attributes to specify a range for each sequence.

RELATED PROGRAMS

FastX does a Pearson and Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. FastX translates both strands of the nucleic acid sequence before performing the comparison. It is designed to answer the question, "What implied protein sequences in my nucleic acid sequence are similar to sequences in a protein database?" TFastX does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences, taking frameshifts into account. It is designed to be a replacement for TFastA, and like TFastA, it is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST. TFastA does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

SSearch does a rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). This may be the most sensitive method available for similarity searches. Compared to BLAST and FastA, it can be very slow.

BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds.

ProfileSearch uses a profile (representing a group of aligned sequences) as a query to search the database for new sequences with similarity to the group. The profile is created with the program ProfileMake.

MotifSearch uses a set of profiles (representing similarities within a family of sequences) as a query to either a) search a database for new sequences similar to the original family, or b) annotate the members of the original family with details of the matches between the profiles and each of the members. Normally, the profiles are created with the program MEME.

FindPatterns, LookUp, StringSearch, and Names are other sequence identification programs.

FrameAlign creates an optimal alignment of the best segment of similarity (local alignment) between a protein sequence and the codons in all possible reading frames on a single strand of a nucleotide sequence. Optimal alignments may include reading frame shifts.

ALGORITHM

FrameSearch aligns the query sequence to each sequence in the search set. The alignment procedure is an extension of the local alignment algorithm of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) that is modified to determine the score of the best segment of similarity between a protein sequence and the codons in a nucleotide sequence.

Scoring Matrix

To create the alignments, FrameSearch requires a scoring matrix that contains values for matches between all possible amino acids and codons. FrameSearch derives this amino acid - codon scoring matrix on the fly from a translation table and an amino acid substitution matrix. The translation table contains a list of all possible codons for each amino acid. The amino acid substitution matrix contains match values for the comparison of all possible amino acids.

In the derived amino acid - codon scoring matrix, the value of a match between any amino acid and any codon is the value of the match between the amino acid and the translated codon in the amino acid substitution matrix. If a codon contains IUB nucleotide ambiguity symbols (described in Appendix III), and all possible unambiguous representations of the codon translate to the same amino acid (e.g. MGR always translates to arginine in the standard genetic code), then the value of a match between that codon and any amino acid can be similarly determined. If all possible unambiguous representations of the codon do not translate to the same amino acid, then that codon is assumed to translate to an 'X'.
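
The derivation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the program's implementation: the names translate_codon, codon_score, and TOY_MATRIX are hypothetical, the codon table is the standard genetic code written out inline rather than read from translate.txt, and the matrix holds only a few illustrative values rather than the contents of blosum62.cmp.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons are ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# IUB nucleotide ambiguity symbols and the bases they stand for.
IUB = {"A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
       "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
       "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def translate_codon(codon):
    """Translate a codon, returning 'X' if its expansions disagree."""
    expansions = product(*(IUB[base] for base in codon.upper()))
    residues = {CODON_TABLE["".join(e)] for e in expansions}
    return residues.pop() if len(residues) == 1 else "X"


def codon_score(amino_acid, codon, matrix):
    """Score an amino acid against a codon via the translated codon."""
    return matrix[(amino_acid, translate_codon(codon))]


# Illustrative values only (hypothetical stand-in for a real matrix).
TOY_MATRIX = {("R", "R"): 5, ("K", "R"): 2, ("K", "X"): -1}

print(translate_codon("MGR"))                # R: AGA, AGG, CGA, CGG are all Arg
print(translate_codon("NNN"))                # X: expansions disagree
print(codon_score("K", "MGR", TOY_MATRIX))   # 2
```

The sketch looks up scores lazily, one codon at a time. A program that scores many alignments would more plausibly build the full amino acid - codon table once, as the text describes.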

FrameSearch chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with -MATRix, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) You can use -GAPweight and -LENgthweight or respond to the program prompts to specify alternative gap penalties if you don't want to accept the default values.

Protein-Nucleotide Alignment

FrameSearch uses the values in the amino acid - codon scoring matrix to determine the score of the best alignment between the protein and nucleotide sequences. If you consider a graph, or path matrix, with the nucleotide sequence placed on the X axis and the protein sequence placed on the Y axis, then every point on the path matrix represents the best alignment between the sequences that ends at that point. For any point on the path matrix, the X coordinate is the first nucleotide of the final codon in the alignment, and the Y coordinate is the final amino acid in the alignment. Each possible alignment end point is associated with a path, which is a series of steps (insertions, deletions, matches) through the path matrix required to create the alignment. Each step has its own score, and the scores for all the steps in an alignment path determine the quality score for the alignment. The quality score for an alignment is equal to the sum of the scoring matrix values of the matches in the alignment, minus the gap creation penalty multiplied by the number of gaps in the alignment, minus the frameshift penalty multiplied by the number of gaps in the alignment that change the reading frame, minus the gap extension penalty multiplied by the total length of all gaps in the alignment. (You can set the value for each of the penalties.)

 
 
quality = SUM(scoring matrix values of the matches in the alignment) -
          gap creation penalty  x  number of gaps in the alignment -
          frameshift penalty    x  number of gaps in the alignment
                                   that change the reading frame -
          gap extension penalty x  total length of all gaps
                                   in the alignment
 

For example, the following protein-nucleotide alignment consists of six steps:

 
 
       1 UGUUGUAUUCG....UGGUGG 17
         ||||||:::      ||||||
       1 CysCysValGlnIleTrpTrp 7
 

The first two steps are UGU-Cys matches. The third step is an AUU-Val match. The fourth step is a four nucleotide deletion. The last two steps are UGG-Trp matches. The quality score for this alignment is the sum of the scoring matrix values for two UGU-Cys matches, one AUU-Val match, and two UGG-Trp matches, minus one gap creation penalty, minus four gap extension penalties, minus one frameshift penalty.
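
The arithmetic for this example can be sketched in Python. The function name quality and its argument layout are hypothetical, the default penalties are the ones shown in the example session (8, 2, and 0), and the match values used below (9 for Cys-Cys, 3 for Ile-Val, 11 for Trp-Trp) are illustrative assumptions rather than values read from the program's matrix file.

```python
def quality(match_values, gap_lengths, frameshift_gaps,
            gap_creation=8, gap_extension=2, frameshift=0):
    """Quality score as defined by the formula above.

    match_values    -- scoring matrix value of each match in the alignment
    gap_lengths     -- length of each gap in the alignment
    frameshift_gaps -- number of gaps that change the reading frame
    """
    return (sum(match_values)
            - gap_creation * len(gap_lengths)
            - frameshift * frameshift_gaps
            - gap_extension * sum(gap_lengths))


# Two UGU-Cys matches, one AUU-Val match, two UGG-Trp matches,
# and one four-nucleotide gap that shifts the reading frame.
matches = [9, 9, 3, 11, 11]
print(quality(matches, gap_lengths=[4], frameshift_gaps=1))                # 27
print(quality(matches, gap_lengths=[4], frameshift_gaps=1, frameshift=5))  # 22
```

With the default frameshift penalty of 0, the frame-shifting gap costs the same as any other gap of length four: 8 + 4 x 2 = 16.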

Matches between an amino acid and a partial codon, like

                  CG.
                  Gln

in the above example, do not add any match value to the alignment score. By convention, all gap characters in partial codons are placed at the end of the codon. For example, the partial codon CG. in the above example will never be written as C.G.

If the best alignment ending at any point has a negative value, a zero is put at that position of the path matrix; otherwise, the quality score for the alignment is put at that position. After the path matrix is completely filled, the highest value in the matrix represents the score of the best region of similarity between the sequences (optimal local alignment). This highest value is reported as the comparison score between the nucleotide and protein sequences. The alignment itself can be reconstructed for display by following the best path from this point of highest value backward to the point where the path matrix has a value of zero.
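
The zero floor and the search for the highest value can be sketched in Python with a deliberately simplified version of the procedure. This sketch is not FrameSearch's algorithm: it compares two sequences of the same alphabet, uses a single linear gap penalty, and knows nothing about codons or frameshifts. The function name and the match, mismatch, and gap values are assumptions chosen for illustration.

```python
def local_alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap=2):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # path[i][j] is the best score of an alignment ending at (i, j).
    path = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score = max(
                0,                            # negative scores are floored at zero
                path[i - 1][j - 1] + step,    # match or mismatch
                path[i - 1][j] - gap,         # gap in seq_b
                path[i][j - 1] - gap,         # gap in seq_a
            )
            path[i][j] = score
            best = max(best, score)           # highest value in the matrix
    return best


print(local_alignment_score("ACGT", "TTACGTTT"))   # 8
print(local_alignment_score("AAAA", "TTTT"))       # 0
```

Recovering the alignment itself would additionally require recording which of the three moves produced each cell and tracing back from the highest cell until a zero is reached.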

ALIGNMENT METRICS

Four figures of merit are displayed along with the optimal alignments between the query sequence and the top-scoring search sequences: Quality, Ratio, Identity, and Similarity.

The Quality score (described above in the ALGORITHM topic) is the measure that is maximized in order to align the sequences. Ratio is the Quality divided by the smaller of one-third the number of bases in the alignment and the number of amino acids in the alignment. Gap symbols are ignored in the calculation of Ratio. Identity is the percent of identical matches between amino acids and codons in the alignment (i.e. the amino acid is identical to the translated codon). Similarity is the percent of matches between amino acids and codons in the alignment whose comparison values exceed the similarity threshold. By default, this threshold is the average positive non-identical comparison value in the scoring matrix. FrameSearch uses this same threshold to decide when to put a colon (:) between an aligned codon and amino acid in the alignment display. You can reset this threshold with -PAIr.
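
As a check on these definitions, the figures reported for the JQ1287 alignment in the OUTPUT topic can be reproduced with a short Python sketch. The function name is hypothetical. The counts of 76 identical and 77 similar matches out of 78 aligned codon/amino acid pairs are not printed by the program; they are inferred here from the reported percentages, and the sketch assumes that identical matches also count as similar and that both percentages use the number of aligned pairs as the denominator.

```python
def alignment_metrics(quality, n_bases, n_amino_acids,
                      n_identical, n_similar, n_pairs):
    """Return (ratio, percent identity, percent similarity)."""
    ratio = quality / min(n_bases / 3, n_amino_acids)
    identity = 100.0 * n_identical / n_pairs
    similarity = 100.0 * n_similar / n_pairs
    return ratio, identity, similarity


# Nucleotides 3-242 (240 bases) aligned with amino acids 261-338 (78 residues).
ratio, identity, similarity = alignment_metrics(
    quality=363, n_bases=240, n_amino_acids=78,
    n_identical=76, n_similar=77, n_pairs=78)
print(f"{ratio:.3f} {identity:.3f} {similarity:.3f}")   # 4.654 97.436 98.718
```

Here one-third of the base count is 80, so the smaller quantity is the 78 amino acids, and 363 / 78 gives the reported Ratio of 4.654.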

CONSIDERATIONS

FrameSearch displays the alignments between each query sequence and the top-scoring sequences in the search set. If the program cannot gain access to enough computer memory to display the alignments, the program stops after listing the top-scoring sequences in the output file.

FrameSearch can take several hours to search the protein database for sequences similar to the translation product of a single nucleotide query sequence (see the SUGGESTIONS topic for details).

INCREASING PROGRAM SPEED USING MULTITHREADING

This program is multithreaded. It has the potential to run faster on a machine equipped with multiple processors because different parts of the analysis can be run in parallel on different processors. By default, the program assumes you have one processor, so the analysis is performed using one thread. You can use -PROCessors to increase the number of threads up to the number of physical processors on the computer.

Under ideal conditions, the increase in speed is roughly linear with the number of processors used. But conditions are rarely ideal. If your computer is heavily used, competition for the processors can reduce the program's performance. In such an environment, try to run multithreaded programs during times when the load on the system is light.

As the number of threads increases, the amount of memory required increases substantially. You may need to ask your system administrator to increase the memory quota for your account if you want to use more than two threads.

Never use -PROCessors to set the number of threads higher than the number of physical processors that the machine has -- it does not increase program performance, but instead uses up a lot of memory needlessly and makes it harder for other users on the system to get processor time. Ask your system administrator how many processors your computer has if you aren't sure.

SUGGESTIONS

Searching Only the Top Strand of Nucleotide Sequences

By default, FrameSearch searches both strands of nucleotide sequences. If your nucleotide query sequence is known to represent the coding strand, you can use -ONEstrand to search using only the top strand of the query sequence. This reduces the time required to search the protein database by 50 percent. If you are searching a nucleotide sequence database for similarity to a protein query sequence, -ONEstrand will search only the top strand of each sequence in the database.
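
The reason the search time halves can be sketched in Python: searching both strands means aligning against the sequence and its reverse complement, so restricting the search to the top strand leaves half as many strands to compare. The function names are hypothetical, and the sketch handles only unambiguous A, C, G, and T symbols.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse-complement strand of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def strands_to_search(seq, one_strand=False):
    """List the strands that would be compared for one nucleotide sequence."""
    return [seq] if one_strand else [seq, reverse_complement(seq)]


print(strands_to_search("ATGC"))                    # ['ATGC', 'GCAT']
print(strands_to_search("ATGC", one_strand=True))   # ['ATGC']
```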

Global Similarity

By default, FrameSearch uses a local alignment algorithm to determine the best segment of similarity between the query sequence and each sequence in the search set (see the ALGORITHM topic for details). If you specify -GLObal, FrameSearch uses a global alignment procedure to determine similarity between the entire length of each query sequence and the entire length of each sequence in the search set.

Nucleotide Sequences Using Nonstandard Genetic Codes

If the nucleotide sequence(s) involved in the search are from an organism or organelle that uses a nonstandard genetic code, then you should specify an appropriate translation table using -TRANSlate. Different translation tables are discussed in Appendix VII.

Batch Queue and Execution Speed

FrameSearch may take a considerable amount of time to run. Very large comparisons may exceed the CPU limit set by some systems.

Because of the extensive search time, you should probably run most searches in the batch queue. You can specify that this program run at a later time in the batch queue by using -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Section 3, Using Programs in the User's Guide.

If you specify a non-zero frameshift penalty with -FRAmeweight or in response to the program prompt, FrameSearch takes about 40% longer to complete a search than if you accept the default frameshift penalty of 0. Our experience using the default search parameters suggests that specifying a non-zero frameshift penalty does not significantly improve the search results.

If you use -PENAlizedlength to specify a maximum gap penalty for any gap in the alignment, FrameSearch takes about 67% longer to complete a search than if you had not specified a maximum gap penalty. Still, you might find this useful, for instance, if you are aligning a protein sequence with the corresponding genomic DNA sequence containing large introns.
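
The effect of a maximum gap penalty can be sketched in Python. The function name gap_penalty is hypothetical, the default penalties are those of the example session, and the cap of 12 is the value shown for -PENAlizedlength in the command-line summary. The sketch assumes the cap applies to the length term only, as the summary's wording suggests.

```python
def gap_penalty(length, gap_creation=8, gap_extension=2,
                penalized_length=None):
    """Penalty for one gap; longer gaps cost no more than the capped length."""
    if penalized_length is not None:
        length = min(length, penalized_length)
    return gap_creation + gap_extension * length


print(gap_penalty(4))                          # 16
print(gap_penalty(500))                        # 1008
print(gap_penalty(500, penalized_length=12))   # 32
```

Without the cap, a 500-nucleotide intron would cost far more than any plausible run of matches could repay, so the alignment would stop at the intron boundary instead of spanning it.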

Interrupting a Search: <Ctrl>C

You can type <Ctrl>C to interrupt a search and see the results from the part of the search that has already been completed. Once you've interrupted a search, you cannot resume it.

GRAPHICS

GCG must be configured for graphics before you run any program with graphics output! If the % setplot command is available in your installation, this is the easiest way to establish your graphics configuration, but you can also use commands like % postscript that correspond to the graphics languages GCG supports. See Section 5, Using Graphics in the User's Guide for more information about configuring your process for graphics.

<CTRL>C

If you need to stop this program, use <Ctrl>C to reset your terminal and session as gracefully as possible. Searches and comparisons write out the results from the part of the search that is complete when you use <Ctrl>C. The graphics device should stop plotting the current page and start plotting the next page. If the current page is the last page, plotters should put the pen away and graphic terminals should return to interactive mode.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % framesearch [-INfile1=]est:atts0012 -Default
 
Prompted Parameters:
 
-BEGin1=1 -END1=286              sets the range of interest for a single
                                   query sequence
[-INfile2=]pir:*                 specifies the search set
-GAPweight=8                     sets the gap creation penalty
-LENgthweight=2                  sets the gap extension penalty
-FRAmeweight=0                   sets the frameshift gap penalty
[-OUTfile=]atts0012.framesearch  specifies the output file name
 
Local Data Files:
 
-MATRix=blosum62.cmp      assigns the scoring matrix for proteins
-TRANSlate=translate.txt  contains the genetic code
 
Optional Parameters:
 
-BEGin1=1 -END1=100    sets the range of interest for each query sequence
-ONEstrand             searches only the top strand of nucleotide sequences
-LIStsize=40           sets the number of scores to show
-ALIgn=40              sets the number of alignments to show
                         ( -NOALIgn suppresses alignments)
-GLObal                searches by global alignment
  -ENDWeight           penalizes end gaps in global alignments like
                         other gaps
-PENAlizedlength=12    penalizes gaps longer than 12 sequence characters
                         the same as gaps of length 12
-HIGhroad              among equally optimal alignments, shows one
                         with maximum gaps in protein sequence
-LOWroad               among equally optimal alignments, shows one
                         with maximum gaps in nucleotide sequence
-INFRame               restores the correct reading frame after frameshifts
                         in the nucleotide sequence by adding gaps to the
                         alignment
-PROCessors=2          sets the number of threads devoted to the analysis
                         on a multiprocessor computer
-LINesize=70           specifies length of documentation for each sequence
                         in the output list
-PAIr=x,2,1            thresholds for displaying "|", ":", and "."
-WIDth=50              sets the number of sequence symbols per line
-PAGe[=60]             adds a line with a form feed every 60 lines
-NOBIGGaps             suppresses abbreviation of large gaps with '.'s
-RSF[=framesearch.rsf] saves the locations of the top-scoring matches in
                         the query sequence as features in an RSF file
-NOPLOt                suppresses the plot of the search score distribution
-BATch                 submits program to the batch queue
-NOMONitor             suppresses the screen trace of program progress
-NOSUMmary             suppresses the screen summary
 
All GCG graphics programs accept these and other switches. See the Using
Graphics section of the User's Guide for descriptions.
 
-FIGure[=filename]  stores plot in a file for later input to FIGURE
-FONT=3             draws all text on the plot using font 3
-COLor=1            draws entire plot with pen in stall 1
-SCAle=1.2          enlarges the plot by 20 percent (zoom in)
-XPAN=10.0          moves plot to the right 10 platen units (pan right)
-YPAN=10.0          moves plot up 10 platen units (pan up)
-PORtrait           rotates plot 90 degrees

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Section 4, Using Data Files in the User's Guide.
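
The search order for a matrix named without a directory can be sketched in Python. This only illustrates the order described above and is not how GCG resolves logical names: the function name find_data_file is hypothetical, and the caller is assumed to supply the directories that MyData, GenMoreData, and GenRunData point to.

```python
from pathlib import Path


def find_data_file(name, search_dirs):
    """Return the first existing file called name, searching in order."""
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


# Hypothetical directory locations standing in for the logical names.
search_order = [
    Path.cwd(),                # your local directory
    Path("mydata"),            # MyData
    Path("genmoredata"),       # GenMoreData
    Path("genrundata"),        # GenRunData
]
print(find_data_file("mymatrix.cmp", search_order))
```

Because the first match wins, a modified matrix in your local directory takes precedence over a file of the same name in any of the other directories.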

FrameSearch creates a scoring matrix on the fly that contains values for matches between all possible amino acids and all possible codons. (See the ALGORITHM topic for details.) FrameSearch creates this amino acid - codon scoring matrix from a translation table and an amino acid substitution matrix. The translation table, containing a list of all possible codons for each amino acid, is defined in the file translate.txt. If the standard genetic code does not apply to your sequence, you can provide a modified version of this file with exactly the same name in your working directory or name an alternative file on the command line with an expression like -TRANSlate=mycode.txt. The amino acid substitution matrix, containing match values for the comparison of all possible amino acids, is defined in the file blosum62.cmp. This matrix is a copy of the BLOSUM62 scoring matrix described by Henikoff and Henikoff (Proc. Natl. Acad. Sci. USA 89; 10915-10919 (1992)). You can use the Fetch program to copy this file to your local directory and modify the match values to suit your own needs. (See Appendix VII for more information about translation tables and scoring matrices.)

PARAMETER REFERENCE

[ Previous | Top ]

You can set the parameters listed below from the command line.

-GAPweight=8

Sets the gap creation penalty that is subtracted from the alignment score whenever a gap is created.

-LENgthweight=2

Sets the gap extension penalty that is subtracted from the alignment score for each gapped symbol.

-FRAmeweight=0

Sets the frameshift creation penalty that is subtracted from the alignment score whenever a gap changes the reading frame of the nucleotide sequence.
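
The three penalties above can be combined as in the following Python sketch. It assumes a simple affine model in which one gap costs the creation penalty plus the extension penalty for every gapped symbol, with the frameshift penalty added once when the gap changes the reading frame; the ALGORITHM topic is the authority on how FrameSearch really charges gaps. The function name gap_penalty and its arguments are hypothetical, and the defaults simply reuse the values shown in the parameter list.

```python
def gap_penalty(length, gap_weight=8, length_weight=2,
                frame_weight=0, shifts_frame=False):
    """Penalty for one gap of `length` gapped symbols (assumed affine model)."""
    if length <= 0:
        return 0
    penalty = gap_weight + length_weight * length
    if shifts_frame:
        # Extra charge only for gaps that change the reading frame.
        penalty += frame_weight
    return penalty

print(gap_penalty(3))                                     # 14
print(gap_penalty(1, frame_weight=5, shifts_frame=True))  # 15
```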

-MATRix=mymatrix.cmp

Allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.

For more information, see the LOCAL DATA FILES topic.

-TRANSlate=filename.txt

Usually, translation is based on the translation table in a default or local data file called translate.txt. This parameter allows you to use a translation table in a different file. (See Appendix VII for information about translation tables.)

-BEGin=1

Sets the beginning position for all query sequences. When the beginning position is set from the command line, FrameSearch ignores beginning positions specified for individual sequences in a list file.

-END=100

Sets the ending position for all query sequences. When the ending position is set from the command line, FrameSearch ignores ending positions specified for individual sequences in a list file.

-ONEstrand

Uses only the top strand of nucleotide sequences in searches.

-LIStsize=40

Sets the number of top-scoring entries to save in the output list.

-ALIgn=40

Sets the number of top-scoring sequence alignments to display in the output file.

Use -NOALIgn to suppress the sequence alignments.

-GLObal

Aligns the entire lengths of the nucleotide and protein sequences (global alignment). By default, FrameSearch determines a local alignment of the best region of similarity between the protein sequence and the codons in the nucleotide sequence.

-ENDWeight

Penalizes gaps placed before the beginning of a sequence and after the end of a sequence the same as gaps inserted within a sequence. By default, gaps placed at the very ends of sequences in global alignments are not penalized at all.

-PENAlizedlength=12

Lets you set the maximum penalty for any gap in the alignment. For instance, if you specify -PENAlizedlength=12, then any gap longer than 12 characters is penalized the same as a gap of length 12. Using this parameter, alignments can contain large gaps without incurring large gap extension penalties. This may be useful, for instance, if you are aligning a protein sequence with the corresponding genomic DNA sequence containing large introns.

If you use -PENAlizedlength FrameSearch takes about 67% longer to complete a search than if you had not specified a maximum gap penalty.
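
The effect of capping the penalized length can be sketched as follows. The Python function below is an illustration only; it assumes the same affine gap cost (creation penalty plus extension penalty per gapped symbol) and the example values 8, 2, and 12 from the parameter list. The name capped_gap_penalty is hypothetical.

```python
def capped_gap_penalty(length, gap_weight=8, length_weight=2,
                       penalized_length=12):
    """Gap penalty where symbols beyond penalized_length cost nothing extra."""
    if length <= 0:
        return 0
    return gap_weight + length_weight * min(length, penalized_length)

print(capped_gap_penalty(12))    # 32
print(capped_gap_penalty(5000))  # 32, same as a gap of length 12
```

Under these assumptions an intron-sized gap of several thousand symbols costs no more than a gap of length 12.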

-HIGhroad

Displays the optimal alignment with the maximal number of gaps in the protein sequence when several equally optimal alignments are possible.

-LOWroad

Displays the optimal alignment with the maximal number of gaps in the nucleotide sequence when several equally optimal alignments are possible.

-INFRame

Restores the correct reading frame after frameshifts in the nucleotide sequence by adding gaps to the alignment. For instance, the alignment

 
 
             GGATCCC
             ||| |||
             Gly.Pro
 

would be written as

 
 
             GGAT..CCC
             ||| |||||
             Gly...Pro
 

if you use -INFRame.

-PROCessors=2

Tells the program to use 2 threads for the database search on a multiprocessor computer. Check with your system manager for the number of processors available at your site. Never set the number of processors greater than what you have available.
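
If you script your searches, a small check like the Python sketch below can keep the value you pass to -PROCessors within what the machine reports. This is a convenience outside of GCG, and the function name usable_threads is hypothetical. Note that os.cpu_count() reports the logical processors of the local machine, which may differ from what your system manager makes available to you.

```python
import os

def usable_threads(requested):
    """Clamp a requested thread count to the processors the OS reports."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(requested, available))
```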

-LINesize=70

Sets the length of documentation for each sequence in the output list.

-PAIr=4,2,1

Changes the thresholds for the display of sequence similarity in the alignment output.

In the program output, the paired alignment displays sequence similarity by printing one of three characters between similar sequence symbols: a pipe character (|), a colon (:), or a period (.). Normally, a pipe character is put between a codon and an amino acid when the translated codon is identical to the amino acid. A colon is put between a codon and an amino acid when the comparison value between the translated codon and the amino acid is greater than or equal to the average positive non-identical comparison value in the amino acid substitution matrix. A period is put between a codon and an amino acid when the comparison value between the translated codon and the amino acid is greater than 1.

The three parameter values for -PAIr are the display thresholds for the pipe character, colon, and period, respectively. By default, a pipe character is inserted between identical sequence symbols. If you specify a numerical threshold as the first value, a pipe character will no longer be inserted between identical symbols unless their comparison value is greater than or equal to this threshold. If you want to specify a threshold for the display of colons and periods, but you still want a pipe character to connect identical symbols, use x instead of a number as the first value. (See Appendix VII for more information about comparison values in scoring matrices.)
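
The threshold logic can be pictured with the short Python sketch below. It is not the program's code. It assumes that a numeric first threshold applies to any pair of symbols, that all three comparisons are "greater than or equal to", and that the tests are tried in the order pipe, colon, period. The function name match_char and the default tuple ("x", 2, 1) are hypothetical; in particular, the program's own default colon threshold is the average positive non-identical value in the matrix, not 2.

```python
def match_char(identical, score, thresholds=("x", 2, 1)):
    """Pick the character printed between a codon and an amino acid.

    thresholds mirrors the three -PAIr values: pipe, colon, period.
    An "x" in the first position keeps the identity rule for the pipe.
    """
    pipe_thr, colon_thr, period_thr = thresholds
    if pipe_thr == "x":
        if identical:
            return "|"
    elif score >= pipe_thr:
        return "|"
    if score >= colon_thr:
        return ":"
    if score >= period_thr:
        return "."
    return " "

print(match_char(True, 3, (4, 2, 1)))  # ":" because 3 is below the pipe threshold
print(match_char(True, 3))             # "|" because x keeps the identity rule
```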

-WIDth=50

Sets the number of sequence symbols on each line of the alignment display.

-PAGe=60

Adds form feeds to the output file so that each alignment begins at the top of a new page. Also, a form feed is added after every 60 lines of each alignment output. You can change the number of lines per page for each alignment display by specifying a number after the -PAGe parameter.
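
The paging behavior can be sketched in Python as follows. The sketch assumes that a form feed precedes every alignment, including the first, and that no form feed is written after the last line of an alignment; neither detail is stated above. The function name paginate is hypothetical.

```python
def paginate(alignments, page_length=60):
    """Join alignments, each a list of lines, with form feeds for paging."""
    out = []
    for alignment in alignments:
        out.append("\f")  # each alignment starts on a new page
        for i, line in enumerate(alignment, start=1):
            out.append(line + "\n")
            # Break long alignments after every page_length lines.
            if i % page_length == 0 and i < len(alignment):
                out.append("\f")
    return "".join(out)
```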

-NOBIGGaps

Normally, if one of the sequences is aligned opposite gap characters for one or more complete lines of the alignment, then that portion of the alignment is abbreviated with three dots arranged in a vertical line. -NOBIGGaps displays the entire alignment without abbreviation.

-RSF=framesearch.rsf

Writes an RSF (rich sequence format) file containing the input sequences annotated with features generated from the results of FrameSearch. This RSF file is suitable for input to other GCG programs that support RSF files. In particular, you can use SeqLab to view the feature annotation graphically. If you don't specify a file name with this parameter, then the program creates one using framesearch for the file basename and .rsf for the extension. For more information on RSF files, see "Using Rich Sequence Format (RSF) Files" in Section 2 of the User's Guide or "Rich Sequence Format (RSF) Files" in Appendix C of the SeqLab Guide.

For each top-scoring entry in the output list, FrameSearch writes the matching segment of the query sequence as a feature in the RSF file.

-NOPLOt

Suppresses the histogram plot of the search score distribution.

-BATch

Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

-MONitor=100

Monitors this program's progress on your screen. Use this parameter to see this same monitor in the log file for a batch process. If the monitor is slowing down the program because your terminal is connected to a slow modem, suppress it with -NOMONitor.

The monitor is updated every time the program processes 100 sequences or files. You can use a value after the parameter to set this monitoring interval to some other number.

-SUMmary

Writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary is normally displayed at the end of a program that is run interactively; you can suppress it with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

The parameters below apply to all GCG graphics programs. These and many others are described in detail in Section 5, Using Graphics, of the User's Guide.

-FIGure=programname.figure

Writes the plot as a text file of plotting instructions suitable for input to the Figure program instead of sending it to the device specified in your graphics configuration.

-FONT=3

Draws all text characters on the plot using Font 3 (see Appendix I).

-COLor=1

Draws the entire plot with the pen in stall 1.

The parameters below let you expand or reduce the plot (zoom), move it in either direction (pan), or rotate it 90 degrees (rotate).

-SCAle=1.2

Expands the plot by 20 percent by resetting the scaling factor (normally 1.0) to 1.2 (zoom in). You can expand the axes independently with -XSCAle and -YSCAle. Numbers less than 1.0 contract the plot (zoom out).

-XPAN=30.0

Moves the plot to the right by 30 platen units (pan right).

-YPAN=30.0

Moves the plot up by 30 platen units (pan up).

-PORtrait

Rotates the plot 90 degrees. Usually, plots are displayed with the horizontal axis longer than the vertical (landscape). Note that plots are reduced or enlarged, depending on the platen size, to fill the page.

Printed: May 27, 2005 12:29




Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

www.accelrys.com/bio