CONSENSUS

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

FUNCTION

Consensus calculates a consensus sequence for a set of pre-aligned short nucleic acid sequences by tabulating the percent of G, A, T, and C for each position in the set. FitConsensus uses the Consensus output table as a probe to search for the best examples of the derived consensus in other nucleotide sequences.

DESCRIPTION

Consensus reads a file of aligned nucleotide sequences for which you want to know the consensus pattern. It constructs a consensus table giving the percent of each nucleotide at each position, along with the total number of nucleotides contributing to each position. Below the table, Consensus writes the least ambiguous expression of the consensus sequence for the certainty level that you request.

EXAMPLE

Here is a session using Consensus to find the consensus of the intervening sequence acceptor splice sites from the file acceptor.dat:

% consensus

  CONSENSUS on sequences in what file ?  acceptor.dat

  Find consensus to what percent certainty (* 75.0 *) ?

  What should I call the output file (* acceptor.csn *) ?

      ................

OUTPUT

Here is the output file, which is a valid GCG sequence file:

 CONSENSUS of: acceptor.dat

IVS Acceptor Splice Site Sequences

from Stephen Mount NAR 10(2); 459-472 figure 1 page 460

Acceptor

                                             *****

 %G      15   22   10   10   10    6    7    9    7    5    5   24    1    0

 %A      15   10   10   15    6   15   11   19   12    3   10   25    4  100

 %T      52   44   50   54   60   49   48   45   45   57   58   30   31    0

 %C      18   25   30   21   24   30   34   28   36   35   27   21   64    0

 Total  114  114  115  127  127  127  128  128  128  130  131  131  131  131

 %G     100   52   24   19

 %A       0   22   17   20

 %T       0    8   37   29

 %C       0   18   22   32

 Total  131  131  131  131

                                             *****

 CONSENSUS sequence to a certainty level of 75.0 percent at each position:

        Length: 18  July 27, 1994 10:06  Type: N  Check: 3343  ..

       1  BBYHYYYHYY YDYAGVBH

INPUT FILES

Consensus does not use one of the standard GCG file formats as its input file, but instead requires a file in a specific format that you must create with a text editor. This file has a heading of indefinite length, followed by a line containing two adjacent periods (..). The sequences follow with one sequence per line, each sequence starting in the first column. There must be no space characters within the sequence. Gaps must be represented with periods. All sequences must be the same length, up to a maximum of 130 bases. Consensus assumes that the sequences are already in alignment.

Here is part of the input file for the example above:

IVS Acceptor Splice Site Sequences

from Stephen Mount NAR 10(2); 459-472 figure 1 page 460

Acceptor

                /       ..

 .........AAATAGGAT

 .........TTGTAGGTG

 ..........TGTAGGTG

 TTTATTTATTTCAAGATT

 //////////////////

 GTCACTTGTCACTAGGTA
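
The format rules above can be sketched as a small Python reader. This is only an illustration, not part of GCG: the function name read_consensus_input is hypothetical, and stripping leading spaces and skipping blank lines are assumptions made for convenience.

```python
MAX_LENGTH = 130  # maximum sequence length stated in this document


def read_consensus_input(text):
    """Split Consensus-style input text into (heading lines, sequences)."""
    lines = text.splitlines()
    # The heading ends at the first line containing two adjacent periods.
    divider = next((i for i, line in enumerate(lines) if ".." in line), None)
    if divider is None:
        raise ValueError("no heading divider line containing '..' found")
    heading = lines[:divider]

    sequences = []
    for line in lines[divider + 1:]:
        seq = line.strip()  # assumption: tolerate surrounding whitespace
        if not seq:
            continue  # assumption: blank lines are ignored
        if " " in seq or "\t" in seq:
            raise ValueError("space characters are not allowed in a sequence")
        sequences.append(seq.upper())

    if not sequences:
        raise ValueError("no sequences found after the heading")
    length = len(sequences[0])
    if length > MAX_LENGTH:
        raise ValueError("sequences are longer than 130 bases")
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must be the same length")
    return heading, sequences


sample = (
    "IVS Acceptor Splice Site Sequences\n"
    "Acceptor\n"
    "                /       ..\n"
    ".........AAATAGGAT\n"
    ".........TTGTAGGTG\n"
    "TTTATTTATTTCAAGATT\n"
)
heading, sequences = read_consensus_input(sample)
```

Gaps stay in the returned sequences as periods, so later steps can skip them when counting observations.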

RELATED PROGRAMS

FitConsensus uses the file written by Consensus to search for the best places in a nucleotide sequence where the consensus table fits. The mapping programs can be run with the command-line parameter -ALL to search for all potential restriction sites in an ambiguous sequence.

ProfileMake creates a position-specific scoring table, called a profile, that quantitatively represents the information from a group of aligned sequences. The profile can then be used for database searching (ProfileSearch) or sequence alignment (ProfileGap).

CONSIDERATIONS

Consensus makes no attempt to align the sequences in the input file, so you should be sure that they are optimally aligned before running the program. (The input file format is described above.) If two or more nucleotides are observed equally often at a position, the ambiguity code chosen for that position may be arbitrary.

STATISTICS USED

Consensus counts the number of G's, A's, T's, and C's in each position of the prealigned sequences. G, A, T, and C each have a value of one. The value of an ambiguous nucleotide code is divided among the nucleotides it represents: R, for instance, represents A or G and therefore contributes 0.5 to G and 0.5 to A. Periods (gaps) have no value. When the count is complete, the counts of each nucleotide at each position are totaled, normalized to 100, and rounded to the nearest integer. The normalized integers are reported as the %G, %A, etc., at each position. The total number of observations used to generate the percent figures is also shown. An observation is any IUPAC-IUB code (see Appendix III); periods do not count as observations.
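
The counting scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program's actual code: the names IUPAC, tabulate, and percent_row are hypothetical, the code table covers only the standard nucleotide ambiguity codes, and the handling of exact .5 rounding ties (Python's round is used here) is not specified by this document.

```python
# Standard IUPAC-IUB nucleotide codes and the bases each one represents.
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def tabulate(sequences):
    """Return per-position base counts and observation totals."""
    length = len(sequences[0])
    counts = [dict.fromkeys("GATC", 0.0) for _ in range(length)]
    totals = [0] * length
    for seq in sequences:
        for pos, code in enumerate(seq.upper()):
            if code == ".":
                continue  # gaps have no value and are not observations
            bases = IUPAC[code]  # assumption: unknown codes raise KeyError
            for base in bases:
                # An ambiguous code is shared equally among its bases.
                counts[pos][base] += 1.0 / len(bases)
            totals[pos] += 1
    return counts, totals


def percent_row(column, total):
    """Normalize one position's counts to rounded integer percentages."""
    if total == 0:
        return dict.fromkeys("GATC", 0)
    return {base: round(100.0 * n / total) for base, n in column.items()}


counts, totals = tabulate(["AR.", "AG.", "TGC"])
second_position = percent_row(counts[1], totals[1])
```

In this toy input the R in the first sequence adds 0.5 to both G and A at the second position, and the third position has a single observation because two of the three sequences have a gap there.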

For the certainty level that you set, Consensus writes the least ambiguous expression of the sequence in the table using the IUPAC-IUB ambiguity codes. For each column (position) in the table, Consensus starts with the largest member (G, A, T, or C) and adds successively smaller members until the sum is equal to or greater than the certainty level. If two nucleotides have the same score, Consensus arbitrarily picks one of them to add to the consensus. The resulting ambiguity code can therefore be somewhat misleading, because an equally frequent nucleotide may have been left out.
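
The selection rule for one column can be sketched in Python like this. It is an illustration only: the names CODES, SET_TO_CODE, and consensus_code are hypothetical, ties are broken in the fixed order G, A, T, C (an assumption, since the document only says the choice is arbitrary), and returning a period for a column with no observations is also an assumption.

```python
# Standard IUPAC-IUB nucleotide codes and the bases each one represents.
CODES = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}
SET_TO_CODE = {frozenset(bases): code for code, bases in CODES.items()}


def consensus_code(column, certainty=75.0):
    """Pick the least ambiguous code whose bases reach the certainty level."""
    total = sum(column.values())
    if total == 0:
        return "."  # assumption: no observations at this position
    # Largest member first; the stable sort keeps G, A, T, C order for ties.
    ranked = sorted("GATC", key=lambda base: column[base], reverse=True)
    chosen, running = [], 0.0
    for base in ranked:
        chosen.append(base)
        running += column[base]
        # Compare running share with the certainty level, with a small
        # tolerance for floating-point error.
        if running * 100.0 >= certainty * total - 1e-9:
            break
    return SET_TO_CODE[frozenset(chosen)]


# First column of the example table: T (52) + C (18) = 70 is below 75,
# so a third base is added, giving the code B (C, G, or T).
first = consensus_code({"G": 15, "A": 15, "T": 52, "C": 18})
```

Feeding the rounded percentages from the table is only an approximation. At position 4 of the example the rounded %T and %C sum to exactly 75, yet the reported code is H, which suggests that the program compares unrounded values.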

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % consensus [-INfile=]acceptor.dat -Default

Prompted Parameters:

[-OUTfile=]acceptor.csn  names the output file

-CERtainty=75.0          sets the % certainty at which to find consensus

Local Data Files:     None

Optional Parameters:  None

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-CERtainty=75.0

Sets the threshold level for finding a consensus (in percent).

LOCAL DATA FILES

None.

Printed: April 5, 2005 14:37

Technical Support: support-us@accelrys.com, support-japan@accelrys.com, or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.