FrameAlign creates an optimal alignment of the best segment of similarity (local alignment) between a protein sequence and the codons in all possible reading frames on a single strand of a nucleotide sequence. Optimal alignments may include reading frame shifts.
FrameAlign inserts gaps to obtain the optimal local alignment of the best region of similarity between a protein sequence and the codons in a nucleotide sequence. Because FrameAlign can align the protein to codons in different reading frames of the nucleotide sequence, it can identify sequence similarity even when the nucleotide sequence contains reading frame shifts.
In standard sequence alignment programs, you routinely specify gap creation and extension penalties. In addition to these penalties, FrameAlign also allows you to specify a separate frameshift penalty for the creation of gaps that result in reading frame shifts in the nucleotide sequence. (See the ALGORITHM topic for a more detailed explanation of how gaps are penalized.)
By default, FrameAlign creates a local alignment between the nucleotide and protein sequences. If you specify -GLObal, FrameAlign creates a global alignment where gaps are inserted to optimize the alignment between the entire nucleotide sequence and the entire protein sequence.
Here is a session using FrameAlign to align the codons in the cDNA sequence EST:Z17438 with the protein sequence PIR:JQ1287.
Local alignment of what sequence 1 ? EST:z17438
Begin (* 1 *) ?
End (* 286 *) ?
Reverse (* No *) ?
to what protein sequence ? PIR:jq1287
Begin (* 1 *) ?
End (* 338 *) ?
What is the gap creation penalty (* 8 *) ?
What is the gap extension penalty (* 2 *) ?
What is the frameshift penalty (* 0 *) ?
What should I call the paired output display file (* atts0012.pair *) ?
Quality Ratio: 4.654
% Similarity: 98.718
Here is the output file:
Local alignment of: Z17438 check: 2422 from: 1 to: 286
LOCUS Z17438 286 bp mRNA EST
DEFINITION ATTS0012 AC16H Arabidopsis thaliana cDNA clone TAT1B11 5' similar
to GLYCERALDEHYDE 3-PHOSPHATE DEHYDROGENASE. Swiss-Prot entry
P25858, mRNA sequence.
VERSION Z17438.1 GI:16580 . . .
to: JQ1287 check: 7459 from: 1 to: 338
P1;JQ1287 - glyceraldehyde-3-phosphate dehydrogenase (EC 1.2.1.12), cytosolic -
C;Species: Arabidopsis thaliana (mouse-ear cress)
31-Mar-1992#sequence_revision 31-Mar-1992#text_change 11-Jun-1999
C;Accession: JQ1287; JS0614
R;Shih, M.C.; Heinrich, P.; Goodman, H.M.
Gene 104, 133-138, 1991 . . .
Scoring matrix: /package/share/10.3/gcgcore/data/rundata/blosum62.cmp
Translation table: /package/share/10.3/gcgcore/data/rundata/translate.txt
Gap Weight: 8 Average Match: 2.778
Length Weight: 2 Average Mismatch: -2.248
Frameshift Weight: 0
Quality: 363 Length: 240
Ratio: 4.654 Gaps: 2
Percent Similarity: 98.718 Percent Identity: 97.436
Match display thresholds for the alignment(s):
| = IDENTITY
: = 2
. = 1
Z17438 x JQ1287 January 2, 2002 10:26 ..
. . . . .
3 GAAATCAAGAAGGCCATCAAGGAGGAATCTGAAGGCAAAATGAAGGGAAT 52
261 GluIleLysLysAlaIleLysGluGluSerGluGlyLysLeuLysGlyIl 277
. . . . .
53 TTTGGGATACTCTGAGGATGATGTTGTGTCTACCGACTTTGTTGGTGACA 102
278 eLeuGlyTyrThrGluAspAspValValSerThrAspPheValGlyAspA 294
. . . . .
103 ACAGGTCAAGCATTTTCGATGCCAAGGCTGGATTGCATTGCATTGAGCGA 152
295 snArgSerSerIlePheAspAlaLysAlaGly....IleAlaLeuSerAs 309
. . . . .
153 CAAGTTTGTGAAGTTGGTGTCATGGTACGACAACGAATGGGGTTACACAG 202
310 pLysPheValLysLeuValSerTrpTyrAspAsnGluTrpGlyTyr..Se 325
. . . .
203 TTCTCGTGTCGTTGACCTTATCGTTCACATGTCAAAGGCC 242
326 rSerArgValValAspLeuIleValHisMetSerLysAla 338
The alignment output displays sequence similarity by printing one of three characters between a codon and an amino acid: a pipe character (|), a colon (:), or a period (.). Normally, a pipe character is put between a codon and an amino acid when the translated codon is identical to the amino acid. A colon is put between a codon and an amino acid when the comparison value between the translated codon and the amino acid is greater than or equal to the average positive non-identical comparison value in the amino acid substitution matrix. A period is put between a codon and an amino acid when the comparison value between the translated codon and the amino acid is greater than or equal to 1. You can change these match display thresholds by specifying -PAIr. (See Appendix VII for more information about comparison values in scoring matrices.)
The input to FrameAlign is a nucleotide sequence and a protein sequence. You can specify the sequences in any order as input to the program.
Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps. BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman. Both Gap and BestFit align two sequences of the same type (i.e. both nucleotide sequences or both protein sequences).
FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.
FrameAlign aligns a nucleotide sequence with a protein sequence. The alignment procedure is an extension of the local alignment algorithm of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) that is modified to determine the score of the best segment of similarity between a protein sequence and the codons in a nucleotide sequence.
To create the alignments, FrameAlign requires a scoring matrix that contains values for matches between all possible amino acids and codons. FrameAlign derives this amino acid - codon scoring matrix on the fly from a translation table and an amino acid substitution matrix. The translation table contains a list of all possible codons for each amino acid. The amino acid substitution matrix contains match values for the comparison of all possible amino acids.
In the derived amino acid - codon scoring matrix, the value of a match between any amino acid and any codon is the value of the match between the amino acid and the translated codon in the amino acid substitution matrix. If a codon contains IUB nucleotide ambiguity symbols (described in Appendix III), and all possible unambiguous representations of the codon translate to the same amino acid (e.g. MGR always translates to arginine in the standard genetic code), then the value of a match between that codon and any amino acid can be similarly determined. If the possible unambiguous representations of the codon do not all translate to the same amino acid, then that codon is assumed to translate to an 'X'.
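The derivation described above can be sketched in Python. This is only an illustration under stated assumptions, not FrameAlign's actual code: the names IUB, build_standard_code, translate_codon, and codon_score are hypothetical, the genetic code is hard-coded rather than read from translate.txt, and the substitution values are passed in as a plain dictionary rather than read from blosum62.cmp.

```python
from itertools import product

# IUB nucleotide ambiguity symbols mapped to the bases they stand for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

def build_standard_code():
    """Return the standard genetic code as a codon -> amino acid dict."""
    bases = "TCAG"
    aas = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
    codons = ("".join(c) for c in product(bases, repeat=3))
    return dict(zip(codons, aas))

def translate_codon(codon, code):
    """Translate a codon; return 'X' if its unambiguous forms disagree."""
    expansions = product(*(IUB[b] for b in codon.upper()))
    residues = {code["".join(e)] for e in expansions}
    return residues.pop() if len(residues) == 1 else "X"

def codon_score(amino_acid, codon, code, substitution):
    """Score an amino acid against a codon via the translated codon."""
    return substitution[(amino_acid, translate_codon(codon, code))]

code = build_standard_code()
print(translate_codon("MGR", code))  # R (arginine)
print(translate_codon("NNN", code))  # X
```

Looking up the translated codon on demand, as here, gives the same values as building the full amino acid - codon matrix in advance; a real implementation would probably precompute the matrix for speed.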
FrameAlign chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with -MATRix, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) You can use -GAPweight and -LENgthweight or respond to the program prompts to specify alternative gap penalties if you don't want to accept the default values.
FrameAlign uses the values in the amino acid - codon scoring matrix to determine the score of the best alignment between the protein and nucleotide sequences. If you consider a graph, or path matrix, with the nucleotide sequence placed on the X axis and the protein sequence placed on the Y axis, then every point on the path matrix represents the best alignment between the sequences that ends at that point. For any point on the path matrix, the X coordinate is the first nucleotide of the final codon in the alignment, and the Y coordinate is the final amino acid in the alignment. Each possible alignment end point is associated with a path, which is a series of steps (insertions, deletions, matches) through the path matrix required to create the alignment. Each step has its own score, and the scores for all the steps in an alignment path determine the quality score for the alignment. The quality score for an alignment is equal to the sum of the scoring matrix values of the matches in the alignment, minus the gap creation penalty multiplied by the number of gaps in the alignment, minus the frameshift penalty multiplied by the number of gaps in the alignment that change the reading frame, minus the gap extension penalty multiplied by the total length of all gaps in the alignment. (You can set the value for each of the penalties.)
quality = SUM(scoring matrix values of the matches in the alignment) -
gap creation penalty x number of gaps in the alignment -
frameshift penalty x number of gaps in the alignment
that change the reading frame -
gap extension penalty x total length of all gaps
in the alignment
For example, the following protein-nucleotide alignment consists of six steps:
1 UGUUGUAUUCG....UGGUGG 17
1 CysCysValGlnIleTrpTrp 7
The first two steps are UGU-Cys matches. The third step is an AUU-Val match. The fourth step is a four nucleotide deletion. The last two steps are UGG-Trp matches. The quality score for this alignment is the sum of the scoring matrix values for two UGU-Cys matches, one AUU-Val match, and two UGG-Trp matches, minus one gap creation penalty, minus four gap extension penalties, minus one frameshift penalty.
Matches between an amino acid and a partial codon, like the CG.-Gln pair in the above example, do not add any match value to the alignment score. By convention, all gap characters in partial codons are placed at the end of the codon. For example, the partial codon CG. in the above example will never be written as C.G.
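As a rough check of the arithmetic, the quality score of the example alignment can be computed with a few lines of Python. The function name quality and its arguments are hypothetical. The match values are assumed to be the standard BLOSUM62 values (Cys/Cys = 9, Ile/Val = 3, Trp/Trp = 11), which may differ from the values in the matrix file your installation reads, and the penalties are the defaults shown in the example session.

```python
def quality(match_scores, gap_lengths, frameshift_gaps,
            gap_weight=8, length_weight=2, frame_weight=0):
    """Sum of match values minus creation, frameshift, and extension penalties."""
    return (sum(match_scores)
            - gap_weight * len(gap_lengths)      # one creation penalty per gap
            - frame_weight * frameshift_gaps     # gaps that shift the frame
            - length_weight * sum(gap_lengths))  # total length of all gaps

# Two UGU-Cys matches, one AUU-Val match (AUU translates to Ile, so it is
# scored as Ile against Val), and two UGG-Trp matches. The partial codon
# contributes no match value. One gap of four nucleotides shifts the frame.
matches = [9, 9, 3, 11, 11]
print(quality(matches, gap_lengths=[4], frameshift_gaps=1))                   # 27
print(quality(matches, gap_lengths=[4], frameshift_gaps=1, frame_weight=5))   # 22
```

With the default frameshift penalty of 0, the frameshift term drops out; the second call shows the effect of a nonzero penalty.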
If the best alignment ending at any point has a negative value, a zero is put at that position of the path matrix; otherwise, the quality score for the alignment is put at that position. After the path matrix is completely filled, the highest value in the matrix represents the score of the best region of similarity between the sequences (optimal local alignment). This highest value is reported as the comparison score between the nucleotide and protein sequences. The alignment itself can be reconstructed for display by following the best path from this point of highest value backward to the point where the path matrix has a value of zero.
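The zero floor and the search for the highest value can be illustrated with a deliberately simplified sketch. It is not the FrameAlign algorithm: it compares two plain strings symbol by symbol, ignores codons and reading frames, and charges a single per-symbol gap penalty instead of separate creation, extension, and frameshift penalties. The names local_score and score are hypothetical.

```python
def local_score(seq_a, seq_b, score, gap=4):
    """Return (best score, end position) of the best local alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    matrix = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            value = max(
                0,  # a negative best alignment is replaced by zero
                matrix[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                matrix[i - 1][j] - gap,   # gap in seq_b
                matrix[i][j - 1] - gap,   # gap in seq_a
            )
            matrix[i][j] = value
            if value > best:              # remember the highest value
                best, best_pos = value, (i, j)
    return best, best_pos

def simple(a, b):
    """Toy comparison values: 2 for identity, -1 otherwise."""
    return 2 if a == b else -1

print(local_score("TTACGT", "GGACGTAA", simple))  # (8, (6, 6))
```

The alignment itself would be recovered by tracing back from the reported end position until a zero is reached, as described above.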
Four figures of merit are displayed along with the optimal alignment between the protein and nucleotide sequences: Quality, Ratio, Identity, and Similarity.
The Quality score (described above in the ALGORITHM topic) is the measure that is maximized in order to align the sequences. Ratio is the Quality divided by the smaller of one-third the number of bases in the alignment and the number of amino acids in the alignment. Gap symbols are ignored in the calculation of Ratio. Identity is the percent of identical matches between amino acids and codons in the alignment (i.e. the amino acid is identical to the translated codon). Similarity is the percent of matches between amino acids and codons in the alignment whose comparison values exceed the similarity threshold. By default, this threshold is the average positive non-identical comparison value in the scoring matrix. FrameAlign uses this same threshold to decide when to put a colon (:) between an aligned codon and amino acid in the alignment display. You can reset this threshold with -PAIr.
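The figures reported in the example session can be reproduced with a short calculation. In this sketch the function names ratio and percent are hypothetical. The base and amino acid counts are read from the alignment coordinates in the example output (bases 3 to 242, amino acids 261 to 338), while the counts of 77 similar and 76 identical pairs are back-calculated assumptions that happen to match the reported percentages.

```python
def ratio(quality, aligned_bases, aligned_residues):
    """Quality divided by the smaller of bases/3 and the amino acid count."""
    return quality / min(aligned_bases / 3, aligned_residues)

def percent(count, compared_pairs):
    """Share of compared codon - amino acid pairs, as a percentage."""
    return 100.0 * count / compared_pairs

# Quality 363; 240 bases give 80 codons; 78 amino acids, so divide by 78.
print(round(ratio(363, 240, 78), 3))  # 4.654
# Assumed counts: 77 similar and 76 identical pairs out of 78.
print(round(percent(77, 78), 3))      # 98.718
print(round(percent(76, 78), 3))      # 97.436
```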
FrameAlign Always Finds Something
FrameAlign always finds an alignment for any protein and nucleotide sequences you compare, even if there is no significant similarity between them. You must evaluate the results critically to decide whether the segment shown is anything more than a random region of relative similarity.
FrameAlign Shows Only a Single Segment of Similarity
FrameAlign shows only one optimal alignment between a protein sequence and a nucleotide sequence. There are reasons why you might want to evaluate several optimal and suboptimal alignments.
- If there are several disjoint segments of similarity, the selection of only a single segment for display does not provide a comprehensive view of the relationship between the nucleotide and protein sequences.
- The alignments displayed by FrameAlign are sensitive to your choices for the scoring matrix and gap penalties. If you vary these choices even slightly, FrameAlign may calculate different optimal alignments for the same segment of similarity between the sequences. If FrameAlign were able to display multiple and suboptimal alignments of the same region, you would be able to use the variation among the different alignments to determine which portions of the alignments were reliably determined.
Nucleotide Sequences Using Nonstandard Genetic Codes
If the nucleotide sequence is from an organism or organelle that uses a nonstandard genetic code, then you should specify an appropriate translation table using -TRANSlate. Different translation tables are discussed in Appendix VII.
Aligning a Protein Sequence with a Genomic Sequence Containing Introns
If you align a genomic sequence containing long introns to its corresponding protein sequence, FrameAlign will often display the local alignment of only one of the exons to its corresponding portion of the protein. To align the entire protein sequence to the entire genomic sequence, use -GLObal and reduce the gap extension penalty in response to the program prompt.
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.
Minimal Syntax: % framealign [-INfile1=]est:atts0012 \
-BEGin1=1 -END1=286 sets the range of interest for first sequence
-BEGin2=1 -END2=338 sets the range of interest for second sequence
-REVerse uses the reverse strand for nuc. sequence
-GAPweight=8 sets the gap creation penalty
-LENgthweight=2 sets the gap extension penalty
-FRAmeweight=0 sets the frameshift gap penalty
[-OUTfile1=]atts0012.pair specifies the output file for alignment
Local Data Files:
-MATRix=blosum62.cmp assigns the scoring matrix for proteins
-TRANSlate=translate.txt contains the genetic code
-GLObal creates global alignment (default is local)
-ENDWeight penalizes end gaps in global alignments like other gaps
-PENAlizedlength=12 penalizes gaps longer than 12 sequence characters
the same as gaps of length 12
-HIGhroad among equally optimal alignments, shows one
with maximum gaps in protein sequence
-LOWroad among equally optimal alignments, shows one
with maximum gaps in nucleotide sequence
-INFRame restores the correct reading frame after
frameshifts in the nucleotide sequence by
adding gaps to the alignment
-PAIr=x,2,1 sets thresholds for displaying "|", ":", and "."
-WIDth=50 sets the number of sequence symbols per line
-PAGe=60 adds a line with a form feed every 60 lines
-NOBIGGaps suppresses abbreviation of large gaps with '.'s
-OUTfile2[=atts0012.gap] specifies new file for nuc. seq. with gaps added
-OUTfile3[=jq1287.gap] specifies new file for prot. seq. with gaps added
-BATch submits program to the batch queue
-NOMONitor suppresses the screen trace of program progress
-NOSUMmary suppresses the screen summary
The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.
Local Scoring Matrices
This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Section 4, Using Data Files in the User's Guide.
FrameAlign creates a scoring matrix on the fly that contains values for matches between all possible amino acids and all possible codons. (See the ALGORITHM topic for details.) FrameAlign creates this amino acid - codon scoring matrix from a translation table and an amino acid substitution matrix. The translation table, containing a list of all possible codons for each amino acid, is defined in the file translate.txt. If the standard genetic code does not apply to your sequence, you can provide a modified version of this file with exactly the same name in your working directory or name an alternative file on the command line with an expression like -TRANSlate=mycode.txt. The amino acid substitution matrix, containing match values for the comparison of all possible amino acids, is defined in the file blosum62.cmp. This matrix is a copy of the BLOSUM62 scoring matrix described by Henikoff and Henikoff (Proc. Natl. Acad. Sci. USA 89; 10915-10919 (1992)). You can use the Fetch program to copy this file to your local directory and modify the match values to suit your own needs. (See Appendix VII for more information about translation tables and scoring matrices.)
You can set the parameters listed below from the command line.
-REVerse
Sets the program to use the reverse strand of the input sequence.
-GAPweight=8
Sets the gap creation penalty that is subtracted from the alignment score whenever a gap is created.
-LENgthweight=2
Sets the gap extension penalty that is subtracted from the alignment score for each gapped symbol.
-FRAmeweight=0
Sets the gap penalty that is subtracted from the alignment score whenever a gap is inserted that shifts the reading frame of the nucleotide sequence.
-MATRix=blosum62.cmp
Allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.
For more information see the Local Scoring Matrices section.
-TRANSlate=translate.txt
Usually, translation is based on the translation table in a default or local data file called translate.txt. This parameter allows you to use a translation table in a different file. (See Appendix VII for information about translation tables.)
-GLObal
Aligns the entire lengths of the nucleotide and protein sequences (global alignment). By default, FrameAlign determines a local alignment of the best region of similarity between the protein sequence and the codons in the nucleotide sequence.
-ENDWeight
Penalizes gaps placed before the beginning of a sequence and after the end of a sequence the same as gaps inserted within a sequence. By default, gaps placed at the very ends of sequences in global alignments are not penalized at all.
-PENAlizedlength=12
Lets you set the maximum penalty for any gap in the alignment. For instance, if you specify -PENAlizedlength=12, then any gap longer than 12 characters is penalized the same as a gap of length 12. Using this parameter, alignments can contain large gaps without incurring large gap extension penalties. This may be useful, for instance, if you are aligning a protein sequence with the corresponding genomic DNA sequence containing large introns.
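The effect of capping the penalized gap length can be sketched as follows. The function gap_penalty is hypothetical and only illustrates the arithmetic: it assumes the penalty for one gap is the creation penalty plus the extension penalty times the charged length, and it leaves out any frameshift penalty.

```python
def gap_penalty(length, gap_weight=8, length_weight=2, penalized_length=None):
    """Penalty for one gap; symbols beyond the cap add no further cost."""
    if penalized_length is None:
        charged = length
    else:
        charged = min(length, penalized_length)
    return gap_weight + length_weight * charged

print(gap_penalty(300))                       # 608 without a cap
print(gap_penalty(300, penalized_length=12))  # 32, same as a gap of 12
print(gap_penalty(5, penalized_length=12))    # 18, short gaps are unaffected
```

Under these assumptions, a 300-base intron costs no more than a 12-base gap once the cap is set, which is why the parameter helps with genomic sequences.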
-HIGhroad
Displays the optimal alignment with the maximal number of gaps in the protein sequence when several equally optimal alignments are possible.
-LOWroad
Displays the optimal alignment with the maximal number of gaps in the nucleotide sequence when several equally optimal alignments are possible.
-INFRame
Restores the correct reading frame after frameshifts in the nucleotide sequence by adding gaps to the alignment. For instance, the alignment
would be written as
if you use -INFRame. If you then use -OUTfile2 to write the nucleotide sequence, with gaps added for alignment to the protein sequence, into a separate output file, you can translate the entire output sequence in-frame.
-PAIr=x,2,1
Changes the thresholds for the display of sequence similarity in the alignment output.
In the program output, the paired alignment displays sequence similarity by printing one of three characters between similar sequence symbols: a pipe character (|), a colon (:), or a period (.). Normally, a pipe character is put between a codon and an amino acid when the translated codon is identical to the amino acid. A colon is put between a codon and an amino acid when the comparison value between the translated codon and the amino acid is greater than or equal to the average positive non-identical comparison value in the amino acid substitution matrix. A period is put between a codon and an amino acid when the comparison value between the translated codon and the amino acid is greater than or equal to 1.
The three parameter values for -PAIr are the display thresholds for the pipe character, colon, and period, respectively. By default, a pipe character is inserted between identical sequence symbols. If you specify a numerical threshold as the first value, a pipe character will no longer be inserted between identical symbols unless their comparison value is greater than or equal to this threshold. If you want to specify a threshold for the display of colons and periods, but you still want a pipe character to connect identical symbols, use x instead of a number as the first value. (See Appendix VII for more information about comparison values in scoring matrices.)
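The choice of display character can be sketched as a small function. The name match_symbol and its arguments are hypothetical. The sketch uses "greater than or equal to" comparisons, as in the description of the example output, and it assumes that a numerical first threshold is applied to the comparison value alone, without also requiring identity; the text above does not settle that point.

```python
def match_symbol(codon_residue, amino_acid, score, thresholds=("x", 2, 1)):
    """Pick the character printed between a codon and an amino acid."""
    pipe, colon, period = thresholds
    if pipe == "x":
        # Default behavior: a pipe connects identical symbols.
        if codon_residue == amino_acid:
            return "|"
    elif score >= pipe:
        # Assumed reading of a numerical first threshold.
        return "|"
    if score >= colon:
        return ":"
    if score >= period:
        return "."
    return " "

print(match_symbol("I", "I", 4))             # |
print(match_symbol("I", "V", 3))             # :
print(match_symbol("S", "T", 1))             # .
print(match_symbol("A", "A", 4, (5, 2, 1)))  # : (identity no longer enough)
```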
-WIDth=50
Sets the number of sequence symbols on each line of the alignment display.
-PAGe=60
Adds form feeds to the output file so that each alignment begins at the top of a new page. Also, a form feed is added after every 60 lines of each alignment output. You can change the number of lines per page for each alignment display by specifying a number after the -PAGe parameter.
-NOBIGGaps
Normally, if one of the sequences is aligned opposite gap characters for one or more complete lines of the alignment, then that portion of the alignment is abbreviated with three dots arranged in a vertical line. -NOBIGGaps displays the entire alignment without abbreviation.
-OUTfile2=atts0012.gap
Writes the nucleotide sequence, with gaps added for alignment to the protein sequence, into a separate output file. You can use the output sequence as input to other GCG programs expecting nucleotide sequence input.
-OUTfile3=jq1287.gap
Writes the protein sequence, with gaps added for alignment to the nucleotide sequence, into a separate output file. You can use the output sequence file as input to other GCG programs expecting protein sequence input.
-BATch
Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.
-MONitor
This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.
-SUMmary
Writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.
You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.
Printed: May 27, 2005 12:26
Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.
Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.