TestCode helps you identify potential protein coding regions in nucleic acid sequences by plotting a measure of the non-randomness of the composition at every third base. The statistic does not require a codon frequency table.
TestCode helps identify genes when you do not
have specific knowledge of codon preferences for the DNA being examined.
TestCode plots a measure of the period three constraint of each region
of a DNA sequence using a statistic developed by Dr. James Fickett at
The statistic is independent of the reading frame and is based on measurements of the period three compositional constraints found in regions known to be coding and non-coding. The output file plot is divided into three regions for which the statistic makes predictions. For windows larger than 200 nucleotides, the top region is supposed to predict coding regions to a 95 percent level of confidence. The bottom region is supposed to predict non-coding regions to the same confidence level. The middle region is the window of vulnerability for the method where the statistic can make no significant prediction.
In the plot, there are markings above the curve that identify the potential start codons (ATG) and stop codons for the forward reading frame of the sequence. Starts are indicated by short vertical lines and stops by small diamonds.
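As a rough illustration of what the start and stop markings represent, the Python sketch below scans a sequence for ATG and for the three stop codons of the standard genetic code (TAA, TAG and TGA). The function name and the (position, frame) output are hypothetical and are not part of TestCode. Which stop codons and frames TestCode actually marks is an assumption here, so the sketch simply reports the frame of every hit.

```python
# Stop codons of the standard genetic code (assumed, not taken from TestCode).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_starts_and_stops(seq):
    """Return two lists of (position, frame) tuples for the forward strand.

    Positions are 1-based; frame is 1, 2 or 3 depending on where the
    codon begins relative to the first base of the sequence.
    """
    seq = seq.upper()
    starts, stops = [], []
    for i in range(len(seq) - 2):
        codon = seq[i:i + 3]
        hit = (i + 1, i % 3 + 1)
        if codon == "ATG":
            starts.append(hit)
        elif codon in STOP_CODONS:
            stops.append(hit)
    return starts, stops

starts, stops = find_starts_and_stops("ATGAAATAA")
print(starts)  # [(1, 1)]
print(stops)   # [(2, 2), (7, 1)]
```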
Here is a session using TestCode to plot the TestCode statistic for the E. coli outer membrane proteins in the sequence Bacterial:EcoOmpa:
Plot TESTCODE for what sequence ? Bacterial:EcoOmpa
Begin (* 1 *) ?
End (* 2270 *) ?
Reverse (* No *) ?
What window size in bp (* 200 *) ?
The minimum density for a one page plot is: 2270.0 bases/page
A typical density is about 3000.0 bases/page
What density would you like (* 2270.0 *) ?
When your LaserWriter attached to tty07 is ready, press <Return>.
The plot from this session is shown in the figure at the end of this program entry.
TestCode accepts a single nucleotide sequence as input. If TestCode rejects your nucleotide sequence, turn to Appendix VI to see how to change or set the type of a sequence.
The method of Gribskov et al. (Nucl. Acids Res. 12(1); 539-549 (1984)) is available in the CodonPreference program if you have an appropriate codon frequency table in GCG format. CodonPreference displays a separate plot for each of the forward reading frames.
Fickett's TestCode statistic was described by James Fickett in Nucleic Acids Research 10(17); 5303-5318 (1982). We believe that TestCode is a formal implementation of Fickett's method.
The statistic is high when measures of compositional bias with a periodicity of three are high. The key measures of bias are simply the four measures, one for each nucleotide n:
Maximum(n(1), n(2), n(3)) / Minimum(n(1), n(2), n(3))
where n(1), n(2) and n(3) are the composition of each nucleotide at positions (1,4,7,...), (2,5,8,...) and (3,6,9,...). The composition is simply the number of observations of n in the window.
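The ratio above can be sketched in a few lines of Python. This is only an illustration of the Maximum/Minimum measure as it is written here, not the TestCode implementation; the function name is hypothetical, and returning None when the minimum count is zero is an assumption made to avoid dividing by zero.

```python
from typing import Dict, Optional

def position_asymmetry(window: str) -> Dict[str, Optional[float]]:
    """Return Maximum(n(1), n(2), n(3)) / Minimum(n(1), n(2), n(3)) per nucleotide."""
    window = window.upper()
    ratios: Dict[str, Optional[float]] = {}
    for base in "ACGT":
        # counts[k] is the number of times `base` occurs at window
        # positions k+1, k+4, k+7, ... (k = 0, 1, 2).
        counts = [window[k::3].count(base) for k in range(3)]
        # Assumption: the ratio is left undefined (None) when the minimum is 0.
        ratios[base] = max(counts) / min(counts) if min(counts) > 0 else None
    return ratios

print(position_asymmetry("AACAAGATTACA")["A"])  # 4.0
```

In the example, A occurs 4, 2 and 1 times at the three positions, so its ratio is 4 / 1 = 4.0.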
The path to the final TestCode statistic is quite tortuous, but there is good reason. Fickett measured the biases for the coding and noncoding sequences that were then in the database and derived an empirical statistic that would separate coding sequences from non-coding sequences. He did not take a sliding-window approach to that measurement but instead used whole coding sequences. Unfortunately, the exons of many eukaryotic coding sequences are considerably shorter than the resolution of the method. The TestCode statistic does not claim to make a significant prediction for windows of less than 200 bases.
Fickett also found that compositional constraint is characteristic of coding sequences, and his TestCode statistic takes composition into account. However, we have received two personal communications suggesting that the TestCode statistic is actually more sensitive when composition is ignored. We have done no experiments to confirm this.
The method was designed to detect coding regions that are more than 200 bases long. Therefore, the method misses many eukaryotic coding sequences that are considerably shorter than this. The statistic is very sensitive when coding regions have strong codon preferences.
Frameshift errors in the data reduce the TestCode statistic as the window passes over them.
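The following Python sketch suggests why a frameshift lowers a period-three measure. It uses a made-up repeating sequence and only the Maximum/Minimum ratio for one nucleotide, not the full TestCode statistic; the helper name and the test sequence are hypothetical.

```python
def phase_counts(window, base):
    """Count `base` at positions (1,4,7,...), (2,5,8,...) and (3,6,9,...)."""
    return [window[k::3].count(base) for k in range(3)]

unit = "AAGACGCCA"          # artificial 9-base repeat with a period-three bias in A
intact = unit * 10          # 90 bases, all in the same frame
# Insert one base in the middle to simulate a frameshift error.
shifted = intact[:45] + "T" + intact[45:]

for name, seq in (("intact", intact), ("shifted", shifted)):
    counts = phase_counts(seq, "A")
    print(name, counts, max(counts) / min(counts))
# intact [20, 10, 10] 2.0
# shifted [15, 15, 10] 1.5
```

After the inserted base, the A counts from the second half of the window fall at different positions than those from the first half, so the counts even out and the ratio drops.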
Plotting at a density of more than 5,000 bases per page may make a pattern difficult to read.
Accelrys GCG (GCG) must be configured for graphics before you run any program with graphics output! If the % setplot command is available in your installation, this is the easiest way to establish your graphics configuration, but you can also use commands like % postscript that correspond to the graphics languages GCG supports. See Section 5, Using Graphics in the User's Guide for more information about configuring your process for graphics.
If you need to stop this program, use <Ctrl>C to reset your terminal and session as gracefully as possible. Searches and comparisons write out the results from the part of the search that is complete when you use <Ctrl>C. The graphics device should stop plotting the current page and start plotting the next page. If the current page is the last page, plotters should put the pen away and graphic terminals should return to interactive mode.
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.
Minimal Syntax: % testcode [-INfile=]bacterial:ecoompa -Default
-BEGin=1 -END=2270 sets the range of interest
-REVerse uses the reverse strand
-WINdow=200 sets the window size
-DENsity=2270 sets the density in bp per 100 platen units
Local Data Files:
-MARk=ecoompa.mrk marks the plot with regions of known interest
-INCrement=3 lets you set the window slide increment
-POInts makes points instead of a curve
All GCG graphics programs accept these and other switches. See Section 5, Using Graphics in the User's Guide for descriptions.
-FIGure[=filename] stores plot in a file for later input to FIGURE
-FONT=3 draws all text on the plot using font 3
-COLor=1 draws entire plot with pen in stall 1
-SCAle=1.2 enlarges the plot by 20 percent (zoom in)
-XPAN=10.0 moves plot to the right 10 platen units (pan right)
-YPAN=10.0 moves plot up 10 platen units (pan up)
-PORtrait rotates plot 90 degrees
The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.
If you are studying a sequence with known features, this program can mark the plot with small boxes showing the positions of these features. The presence of a file in your directory with the same name as your sequence and the filename extension .mrk causes the program to mark each range specified in the file. You can provide a marking file on the command line with an expression like -MARk=gamma.mrk. The file gamma.mrk contains information about the format of marking files. The figure for the example session shows marked regions.
You can set the parameters listed below from the command line.
-WINdow=200
Sets the width of the window, in bases, used for each TestCode measurement. The default is 200.
-DENsity=2270
Sets the number of bases or amino acids per 100 platen units (PU). This is usually equivalent to the number of bases or amino acids per page. Output from different GCG graphics programs that are run at the same density can be compared by lining up the plots on a light box.
-MARk=ecoompa.mrk
If you are studying a sequence with known features, this program can mark the plot with small boxes showing the positions of these features. The presence of a file in your directory with the same name as your sequence and the file name extension .mrk causes the program to mark each range specified in the file. The file gamma.mrk contains information about the format of marking files.
-INCrement=3
Allows you to set the distance that the window is moved after each TestCode measurement. The default is three.
-POInts
Causes TestCode to plot unconnected points instead of a continuous line.
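To see how the window size and the window slide increment described above interact, the Python sketch below lists the ranges a sliding window would cover. The function name is hypothetical, and it is an assumption that only full-width windows are measured; how TestCode treats the ends of the range is not stated here.

```python
def window_ranges(begin, end, window=200, increment=3):
    """Yield 1-based inclusive (start, stop) coordinates of each full window."""
    start = begin
    while start + window - 1 <= end:
        yield start, start + window - 1
        start += increment

# Values from the example session: range 1 to 2270, window 200, increment 3.
ranges = list(window_ranges(1, 2270))
print(len(ranges), ranges[0], ranges[-1])  # 691 (1, 200) (2071, 2270)
```

Under these assumptions, the example session would make 691 measurements. A larger increment gives fewer measurements and a coarser curve.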
The parameters below apply to all GCG graphics programs. These and many others are described in detail in Section 5, Using Graphics of the User's Guide.
-FIGure[=filename]
Writes the plot as a text file of plotting instructions suitable for input to the Figure program instead of sending it to the device specified in your graphics configuration.
-FONT=3
Draws all text characters on the plot using Font 3 (see Appendix I).
-COLor=1
Draws the entire plot with the pen in stall 1.
The parameters below let you expand or reduce the plot (zoom), move it in either direction (pan), or rotate it 90 degrees (rotate).
-SCAle=1.2
Expands the plot by 20 percent by resetting the scaling factor (normally 1.0) to 1.2 (zoom in). You can expand the axes independently with -XSCAle and -YSCAle. Numbers less than 1.0 contract the plot (zoom out).
-XPAN=10.0
Moves the plot to the right by 10 platen units (pan right).
-YPAN=10.0
Moves the plot up by 10 platen units (pan up).
-PORtrait
Rotates the plot 90 degrees. Usually, plots are displayed with the horizontal axis longer than the vertical (landscape). Note that plots are reduced or enlarged, depending on the platen size, to fill the page.
Printed: May 27, 2005 14:50
Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.
Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ® and the GCG logo are registered trademarks of Accelrys Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.