CodonPreference is a frame-specific gene finder that tries to recognize protein coding sequences by virtue of the similarity of their codon usage to a codon frequency table or by the bias of their composition (usually GC) in the third position of each codon.
CodonPreference finds regions of each forward reading frame of a nucleic acid sequence that show either strong codon preference or unusual compositional bias in the third (wobble) position of each codon. CodonPreference is useful for locating protein coding regions, determining their reading frames, estimating the level of expression of a gene, and locating nucleic acid sequencing errors.
The Preference Curves
The codon preference statistic for each reading frame shows the similarity of the codon usage in a window of that reading frame to a previously calculated codon frequency table. The statistic used for the comparison is described below. The codon frequency table is a file of the kind generated by the CodonFrequency program. A window of a given size is moved along the sequence in increments of one codon (three bases) and the statistic is recalculated at every position to make a continuous function. The statistic for the correct reading frame of a real gene may rise significantly above the background if a suitable codon frequency table is used and if the codon usage of the sequence is strongly biased. Suppress the codon preference curves with -NOPREFerence.
The Bias Curves
The bias for each reading frame is the fraction of codons in the window whose third position is either G or C. The bias of other nucleotides may be seen with an expression like -BIAS=AT. Suppress the bias curves with -NOBIAS.
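The bias curve can be illustrated with a short Python sketch. This is not the program's own code: the function name third_position_bias, the 0-based frame offset, and the choice to report one value per full window are assumptions made for illustration.

```python
def third_position_bias(seq, frame, window, bases="GC"):
    """Fraction of codons per window whose third base is in `bases`.

    frame  -- 0, 1 or 2: offset of the first codon (assumed convention)
    window -- window size in codons; the window moves one codon at a time
    """
    seq = seq.upper()
    # Third base of every complete codon in this frame.
    thirds = [seq[i + 2] for i in range(frame, len(seq) - 2, 3)]
    flags = [1 if base in bases else 0 for base in thirds]
    return [
        sum(flags[start:start + window]) / window
        for start in range(len(flags) - window + 1)
    ]


if __name__ == "__main__":
    demo = "ATGGCCAAATTC"
    print(third_position_bias(demo, frame=0, window=2))              # G+C bias
    print(third_position_bias(demo, frame=0, window=2, bases="AT"))  # A+T bias
```

Passing bases="AT" mirrors what -BIAS=AT asks for; any other group of nucleotides can be summed the same way.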
Errors in Sequence Data
A sequencing error that causes a frame shift may make the curves for the correct reading frame fall at the same time that one of the incorrect frames rises. See the example below.
Open reading frames are shown as boxes beneath the plot for their respective translation frames. Potential start codons are shown as short lines that extend above the top of the box, and stop codons extend below the bottom of the box. By default, only the start and stop codons at the beginnings and ends of open reading frames are shown in the frame display. This can be altered with -ALLFrames which shows all start and stop codons. Suppress the frame display with -NOFRAMes.
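As a rough illustration of the frame display, the sketch below scans one forward frame and pairs each potential start codon with the next stop codon downstream. It assumes the standard start (ATG) and stop (TAA, TAG, TGA) codons, whereas the program reads them from a translation table; the function name and the 0-based, end-exclusive coordinates are also assumptions.

```python
# Assumed standard start and stop codons; the real program takes these
# from its translation table.
STARTS = {"ATG"}
STOPS = {"TAA", "TAG", "TGA"}


def open_reading_frames(seq, frame):
    """Return (start, end) pairs for one forward frame.

    Coordinates are 0-based and end-exclusive; `end` includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon in STARTS:
            start = i                      # first start codon opens the frame
        elif start is not None and codon in STOPS:
            orfs.append((start, i + 3))    # next stop downstream closes it
            start = None
    return orfs


if __name__ == "__main__":
    print(open_reading_frames("ATGAAATAGATGCCCTGA", frame=0))
```

A frame that is still open at the end of the sequence is not reported in this sketch.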
The Rare Codon Display
Rare codons in each reading frame are marked below the open reading frame plot. A codon is considered rare when its fraction in the codon frequency table falls below the threshold you set with an expression like -RARe=0.1. Suppress the rare codon display with -NORARe.
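The rare codon test can be sketched in Python as follows. The fraction is taken relative to the codon's synonymous family, as the parameter description for -RARe states. The function names, the dictionary layout of the table, and the 0-based positions are assumptions for illustration, not the program's actual data structures.

```python
def synonymous_fractions(family_counts):
    """Turn occurrence counts for one synonymous family into fractions."""
    total = sum(family_counts.values())
    return {codon: n / total for codon, n in family_counts.items()}


def rare_codon_positions(seq, frame, fractions, threshold=0.1):
    """0-based positions of codons whose family fraction is below threshold."""
    seq = seq.upper()
    hits = []
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        # Codons missing from `fractions` are never flagged in this sketch.
        if fractions.get(codon, 1.0) < threshold:
            hits.append(i)
    return hits


if __name__ == "__main__":
    # Hypothetical counts for the isoleucine family.
    ile = synonymous_fractions({"ATT": 30, "ATC": 65, "ATA": 5})
    print(rare_codon_positions("ATTATAATCATA", frame=0, fractions=ile))
```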
Regions of Known Interest
Regions of known interest can be marked below the x-axis with -MARk.
Here is a session using CodonPreference to plot the similarity of the codon usage in the E. coli outer membrane protein II operon to a codon frequency table derived from E. coli highly expressed genes:
% codonpreference -NOBIAS
CODONPREFERENCE for what sequence ? Bacterial:EcoOmpa
Begin (* 1 *) ?
End (* 2270 *) ?
Reverse (* No *) ?
What codon frequency file (* GenRunData:ecohigh.cod *) ?
What codon preference window size (in codons) (* 25 *) ?
The minimum density for a one-page plot is 74.48 bases/cm.
What density would you like (* 74.48 *) ?
When your LaserWriter attached to tty07 is ready, press <Return>.
Average codon preference for frame 1 = 0.8853
Average codon preference for frame 2 = 0.5070
Average codon preference for frame 3 = 0.4742
Average codon preference for a random sequence = 0.4742
The output from this session is shown in the first figure at the end of this program entry. One of the E. coli outer membrane genes shows a strong pattern of codon choices similar to the pattern in the codon frequency table for highly expressed genes in enteric bacteria. What is just as interesting, however, is that the other gene does not. (The correct reading frame for both genes is shown in the top panel of the plot.) The same sequence is used to illustrate TestCode, which finds significant coding region likeness for both genes!
In the second figure (based on Figure 4 of Uchiyama and Weisblum, Gene 38; 103-110 (1985)), % codonpreference -NOPREFerence was used to display the third position G+C bias for a methyl transferase gene in Streptomyces (Bacterial:Sererme2). The gene of interest shows a very strong third position G+C bias (second panel of the plot). Note that the 3' end of the transferase gene shows less bias than the 5' end, and that the bias of another reading frame rises right where the bias in the correct reading frame falls. Also note that a large open reading frame continues beyond the end of the correct gene. A single extra or missing base in the original data would create such a pattern and cause the end of the putative gene to be identified incorrectly.
In the second figure, note that we used the file sererme2.trans for translation and sererme2.mrk for marking the reading frames identified by the authors.
CodonPreference requires a codon frequency table created by CodonFrequency or by you. See the format for these tables in the Program Manual under CodonFrequency. The program accepts a single nucleic acid sequence as its other input file. If CodonPreference rejects your nucleotide sequence, turn to Appendix VI to see how to change or set the type of a sequence.
CodonFrequency writes codon frequency tables that can be used as input to CodonPreference. CodonFrequency can also be used to assemble a consensus table from several existing codon frequency tables. Correspond determines the similarity of codon usage of two or more codon frequency tables. TestCode plots a measure of the non-randomness of the codon choices along a DNA molecule in order to locate putative genes. Unlike CodonPreference, TestCode does not require you to supply a model (codon frequency table) of codon choices. However, TestCode does not give information on the strand, reading frame, or level of expression of a gene.
Unknown. More than 10,000 bases per page creates a curve that is somewhat condensed. Very small plots may not label properly.
CodonPreference uses the technique of Gribskov et al. (Nucl. Acids Res. 12(1); 539-549 (1984)) to look for coding regions.
The statistic used in CodonPreference is based on the concept of synonymous codons. Synonymous codons are those codons specifying the same amino acid. A codon parameter is calculated for each codon in the reading frame based on the codon's frequency of occurrence (f) and the total number of occurrences of its synonymous family (F) in the codon frequency table, and the calculated occurrences of the codon (r) and its synonymous family (R) in a random sequence with the same base composition as the sequence being analyzed. The codon preference statistic for each codon (p) is then given by:
$$p = \frac{f/F}{r/R}$$
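A minimal Python sketch of the codon parameter, reading the definitions above as p = (f/F) / (r/R). It assumes that the random occurrence r of a codon is the product of the frequencies of its three bases and that R is the sum of r over the synonymous family. The names SYNONYMS, random_codon_freq and codon_parameter, and the example counts, are hypothetical.

```python
# Excerpt of the standard genetic code: two families are enough for a demo.
SYNONYMS = {
    "Ile": ["ATT", "ATC", "ATA"],
    "Met": ["ATG"],
}


def random_codon_freq(codon, base_freq):
    """Expected codon frequency in a random sequence (assumed: product of base frequencies)."""
    r = 1.0
    for base in codon:
        r *= base_freq[base]
    return r


def codon_parameter(codon, family, table_counts, base_freq):
    """p = (f/F) / (r/R) for one codon within its synonymous family."""
    f = table_counts[codon]
    F = sum(table_counts[c] for c in family)
    r = random_codon_freq(codon, base_freq)
    R = sum(random_codon_freq(c, base_freq) for c in family)
    return (f / F) / (r / R)


if __name__ == "__main__":
    table = {"ATT": 30, "ATC": 60, "ATA": 10, "ATG": 25}   # hypothetical counts
    uniform = {base: 0.25 for base in "ACGT"}              # equal base composition
    for name, family in SYNONYMS.items():
        for codon in family:
            print(name, codon, codon_parameter(codon, family, table, uniform))
```

A codon that is the only member of its family always gives p = 1.0 here, since both ratios equal 1.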
For simplicity, the actual plotted statistic is calculated using logarithms. The codon preference statistic for each window (P) is given by
$$P = \exp\left(\frac{\sum_{\text{window}} \ln p}{\text{window}}\right)$$
In other words, P is the windowth root of the product of the p for each codon in the window, that is, the geometric mean of p over the window. This is the statistic shown on the plot.
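The window statistic can be sketched in Python as below: the mean of ln(p) over the window, exponentiated, with the window advancing one codon at a time. The function name and the choice to report only complete windows are assumptions, and the input p values must all be greater than zero.

```python
import math


def window_preference(p_values, window):
    """Geometric mean of p over each full window of `window` codons."""
    logs = [math.log(p) for p in p_values]   # requires every p > 0
    result = []
    for start in range(len(logs) - window + 1):
        mean_log = sum(logs[start:start + window]) / window
        result.append(math.exp(mean_log))
    return result


if __name__ == "__main__":
    # Hypothetical per-codon p values for one reading frame.
    print(window_preference([2.0, 8.0, 0.5, 0.5], window=2))
```

Working with logarithms avoids multiplying many small or large p values directly, which is the simplification the text refers to.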
Since the statistic is strongly affected by codons whose occurrence is zero (log of zero is undefined) in the codon frequency table, these codons are assigned an occurrence of 1. This is equivalent to saying that a zero value in the table doesn't mean that these codons are never seen, it only means that they haven't been seen in F observations, and that the upper bound on their occurrence as a fraction of their synonymous family is 1/F.
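The zero-occurrence rule can be sketched as follows. The sketch assumes that F is the family total before any adjustment, so that an unseen codon gets the fraction 1/F described above; whether the program also adds the extra occurrence to F is not stated. The function name is hypothetical.

```python
import math


def family_fractions(family_counts):
    """Fractions f/F for one synonymous family, with zero counts treated as 1."""
    total = sum(family_counts.values())   # F, taken before adjustment (assumption)
    if total == 0:
        raise ValueError("synonymous family has no observations")
    # Note: after the adjustment the fractions may sum to slightly more than 1.
    return {
        codon: (n if n > 0 else 1) / total
        for codon, n in family_counts.items()
    }


if __name__ == "__main__":
    fractions = family_fractions({"ATT": 30, "ATC": 70, "ATA": 0})
    # Every fraction is now positive, so its logarithm is defined.
    print({codon: math.log(value) for codon, value in fractions.items()})
```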
Using the codon frequencies in the default codon frequency file, ecohigh.cod, the value of p varies from 8.77 for the Arg codon (CGT) to 0.005 for the Ile codon (ATA). Met (ATG) and Trp (TGG) codons have p = 1.0 since they are the sole members of their synonymous families. A p value of 1.0 indicates that a codon is used equally in the random sequence and the codon frequency file. Values greater than 1.0 indicate the codon is present at higher than the random frequency in the codon frequency file, and p values less than 1 indicate a codon is present at less than the random frequency in the codon frequency file.
There are two advantages of calculating the statistic in this way. The statistic is insensitive to the amino acid composition of the protein encoded by the gene since the statistic is based on the occurrence of codons as a fraction of their synonymous families. The statistic is also fairly insensitive to differences in the G+C content of the sequence since G+C content influences the calculated random usage.
Preference Is Based on a Model
The question answered by looking at the codon preference statistic is this: Is the codon usage in this reading frame more like the usage expected of a random sequence or the usage found in the codon frequency table? This question can only be answered if the codon frequency table shows a distinctly non-random codon usage. There can be open reading frames, as indeed there are in the first example, that are genes but do not show the pattern of codon choices of your particular table. For the genomes of at least several organisms (including bacteria and yeast) this is particularly true of weakly expressed genes since they show much less bias in codon usage than do highly expressed genes. The sequence in the file Bacterial:EcoOmpa has two genes, but only one can be recognized by virtue of its codon preference. You should compare the first figure below with the one for the same sequence analyzed by the TestCode program.
Non-Standard Genetic Codes
If the start and stop codons for your analysis are not standard, then you should provide a translation table yourself (see the LOCAL DATA FILES topic below).
Biases Other Than G+C
The compositional bias at the third position is dependent on what genome you are looking at. You can reset the bias to calculate the bias for any nucleotide or group of nucleotides with the -BIAS command-line parameter.
To use CodonPreference, you must have a codon frequency table created by CodonFrequency or by you. See the format for these tables in the Program Manual under CodonFrequency. You can specify the table on the command line with an expression like -FREQuency=tablename.cod. See Appendix VII for a listing of the codon frequency tables currently provided with Accelrys GCG (GCG).
GCG must be configured for graphics before you run any program with graphics output! If the % setplot command is available in your installation, this is the easiest way to establish your graphics configuration, but you can also use commands like % postscript that correspond to the graphics languages GCG supports. See Section 5, Using Graphics in the User's Guide for more information about configuring your process for graphics.
If you need to stop this program, use <Ctrl>C to reset your terminal and session as gracefully as possible. Searches and comparisons write out the results from the part of the search that is complete when you use <Ctrl>C. The graphics device should stop plotting the current page and start plotting the next page. If the current page is the last page, plotters should put the pen away and graphic terminals should return to interactive mode.
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.
Minimal Syntax: % codonpreference [-INfile1=]bacterial:ecoompa -Default
-BEGin=1 -END=2270          sets the range of interest
-REVerse                    use the back strand
-FREQuency[=ecohigh.cod]    specifies the codon frequency table
-PWINdow=25                 sets the preference window in codons
-RARe=0.1                   sets the rare codon display threshold
                            (-NORARe suppresses)
-DENsity=74.48              sets density in bases per centimeter (11 x 17 paper)

Local Data Files:

-TRANSlate=translate.txt    defines the start and stop codons
-MARk=ecoompa.mrk           defines regions of known interest

-BIAS=gc                    shows third position bias curves for G+C
                            (-NOBIAS suppresses)
-NOPREFerence               suppresses the codon preference curves
-BWINdow=25                 sets the bias window in codons
-FILe[=fname]               makes an output file of the preference curve values
-TABle[=fname]              creates a table with the statistics for each codon
-NOPLOt                     suppresses the whole plot
-ALLFrames                  shows all start and stop codons
-NOFRAmes                   suppresses the reading frame part of the plot
-PHEIght=77.0               sets the height of the vertical axis in platen units
-PLENgth=120.0              sets the length of the horizontal axis in platen units
-PSCAlemax=2.2              sets the maximum value on the codon preference scale
-BSCAlemax=1.1              sets the maximum value on the third position bias scale
All GCG graphics programs accept these and other switches. See the Using Graphics section of the USERS GUIDE for descriptions.

-FIGure[=filename]          stores plot in a file for later input to FIGURE
-FONT=3                     draws all text on the plot using font 3
-COLor=1                    draws entire plot with pen in stall 1
-SCAle=1.2                  enlarges the plot by 20 percent (zoom in)
-XPAN=10.0                  moves plot to the right 10 platen units (pan right)
-YPAN=10.0                  moves plot up 10 platen units (pan up)
-PORtrait                   rotates plot 90 degrees
The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.
The translation of codons to amino acids, the identification of potential start codons and stop codons, and the mappings of one-letter to three-letter amino acid codes are all defined in a translation table in the file translate.txt. If the standard genetic code does not apply to your sequence, you can provide a modified version of this file in your working directory or name an alternative file on the command line with an expression like -TRANSlate=mycode.txt. Translation tables are discussed in more detail in Appendix VII.
If you are studying a sequence with known features, this program can mark the plot with small boxes showing the positions of these features. The presence of a file in your directory with the same name as your sequence and the filename extension .mrk causes the program to mark each range specified in the file. You can provide a marking file on the command line with an expression like -MARk=gamma.mrk. The file gamma.mrk contains information about the format of marking files. The figure for the example session shows marked regions. The first figure at the end of this program entry was marked by the local data file ecoompa.mrk and the second figure was marked by the local data file sererme2.mrk.
You can set the parameters listed below from the command line.
-FREQuency=codontablename.cod
Sets the codon preference curve to use the codon frequencies in the file named codontablename.cod. The file ecohigh.cod is used if you do not specify otherwise (see the FILES USED topic above).
-PWINdow=25
Sets the number of codons in the window used to calculate the codon preference curves.
-RARe=0.1
Sets the threshold for the display of rare codons. Accepted values range from 0 to 1 and represent the frequency with which a particular codon is found in the codon frequency table when compared with all codons with which it is synonymous.
-DENsity=74.48
Sets the number of bases or amino acids per 100 platen units (PU). This is usually equivalent to the number of bases or amino acids per page. Output from different GCG graphics programs that are run at the same density can be compared by lining up the plots on a light box.
-TRANSlate=translate.txt
Usually, translation is based on the translation table in a default or local data file called translate.txt. This parameter allows you to use a translation table in a different file. (See Appendix VII for information about translation tables.)
-MARk=ecoompa.mrk
If you are studying a sequence with known features, this program can mark the plot with small boxes showing the positions of these features. The presence of a file in your directory with the same name as your sequence and the file name extension .mrk causes the program to mark each range specified in the file. The file gamma.mrk contains information about the format of marking files.
-BIAS=gc
Makes a plot of the fraction of G+C in the third position of each codon for each reading frame. You can set these third position bias curves to sum any nucleotides you wish. The curves can be suppressed with -NOBIAS.
-NOPREFerence
Suppresses the codon preference curves, leaving only the third position bias curves.
-BWINdow=25
Sets the number of codons in the window used to calculate the third position bias curves.
-FILe[=fname]
Creates a file with the value of the codon preference statistic for each codon and the plotted statistic at each position of each frame. If you do not name the output file, CodonPreference creates a file with the sequence name and the file name extension .dat.
-TABle[=fname]
Creates a file with the input and random codon frequencies and the codon preference statistic for each codon. If you do not name the output file, CodonPreference creates a file with the sequence name and the file name extension .tab.
-NOPLOt
Suppresses the whole plot. This parameter is only useful in conjunction with the -FILe or -TABle parameters.
-ALLFrames
Changes the reading frame display to show all potential start and stop codons while continuing to connect each potential start codon to the next stop codon downstream.
-NOFRAmes
Suppresses the open reading frames display.
The parameters below may help set up plots for publication.
-PHEIght=77.0
Sets the height of the vertical axis in units of percent of the total vertical height available on your plotter.
-PLENgth=120.0
Sets the length of the horizontal axis in units of percent of the total vertical [sic] height available on your plotter. The maximum is 150.0 units.
-PSCAlemax=2.2
Sets the maximum value on the codon preference scale.
-BSCAlemax=1.1
Sets the maximum value on the third position bias scale.
The parameters below apply to all GCG graphics programs. These and many others are described in detail in Section 5, Using Graphics of the User's Guide.
-FIGure[=filename]
Writes the plot as a text file of plotting instructions suitable for input to the Figure program instead of sending it to the device specified in your graphics configuration.
-FONT=3
Draws all text characters on the plot using Font 3 (see Appendix I).
-COLor=1
Draws the entire plot with the pen in stall 1.
The parameters below let you expand or reduce the plot (zoom), move it in either direction (pan), or rotate it 90 degrees (rotate).
-SCAle=1.2
Expands the plot by 20 percent by resetting the scaling factor (normally 1.0) to 1.2 (zoom in). You can expand the axes independently with -XSCAle and -YSCAle. Numbers less than 1.0 contract the plot (zoom out).
-XPAN=30.0
Moves the plot to the right by 30 platen units (pan right).
-YPAN=30.0
Moves the plot up by 30 platen units (pan up).
-PORtrait
Rotates the plot 90 degrees. Usually, plots are displayed with the horizontal axis longer than the vertical (landscape). Note that plots are reduced or enlarged, depending on the platen size, to fill the page.
Printed: May 27, 2005 11:53
Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.
Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.