Diverge

[Genhelp | Program Manual | User's Guide | Data Files | Databases | Release Notes ]

 

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

OUTPUT

INPUT FILES

RELATED PROGRAMS

RESTRICTIONS

ALGORITHM

CONSIDERATIONS

COMMAND-LINE SUMMARY

LOCAL DATA FILES

PARAMETER REFERENCE


FUNCTION

[ Top | Next ]

Diverge estimates the pairwise number of synonymous and nonsynonymous substitutions per site between two or more aligned nucleic acid sequences that code for proteins. It uses a variant of the method published by Li et al.

DESCRIPTION

[ Previous | Top | Next ]

Diverge makes a pairwise codon-by-codon comparison of aligned protein coding sequences to estimate the number of synonymous and nonsynonymous substitutions. The program is based on the method described by Li et al. (Mol. Biol. Evol. 2; 150-174 (1985), as modified by Li (J. Mol. Evol. 36; 96-99 (1993)) and by Pamilo and Bianchi (Mol. Biol. Evol. 10; 271-281 (1993)). (See the ALGORITHM topic for details.) It uses a translation table to determine codon degeneracies and applies Kimura's two-parameter method (J. Mol. Evol. 16; 111-120 (1980)) to correct for multiple hits and to account for the difference in substitution rates for transitions and transversions.

The nucleotide sequences can be input to Diverge as two single sequences that are known to be in alignment or as a single multiple sequence specification. If you give Diverge two single input sequences, the program prompts you for the range and strand for each sequence. If you specify a multiple sequence alignment, all pairwise combinations are analyzed and Diverge will not prompt you for range and strand information. It assumes that you want to analyze each sequence from base 1 to the end of the sequence on the positive strand. If this is not what you want, you can specify range and strand information for each sequence in a list file or specify a range and strand on the command line that will apply to all of the sequences.

EXAMPLE

[ Previous | Top | Next ]

Below is a session using Diverge to estimate the number of synonymous and nonsynonymous substitutions between all pairwise combinations of four GLUT1 glucose transporter coding regions.

 
 
% diverge
 
 DIVERGE between what sequence(s) ?  gts1.msf{*}
 
 Reading sequences...
 
        gts1.msf{bovgluti}: 1479 total, 1479 read
       gts1.msf{humglutrn}: 1479 total, 1479 read
          gts1.msf{rabpgt}: 1479 total, 1479 read
       gts1.msf{ratglutrn}: 1479 total, 1479 read
 
 What should I call the output file (* gts1.diverge *) ?
 
  gts1.msf{bovgluti} vs. gts1.msf{humglutrn}...
  gts1.msf{bovgluti} vs. gts1.msf{rabpgt}...
  gts1.msf{bovgluti} vs. gts1.msf{ratglutrn}...
  gts1.msf{humglutrn} vs. gts1.msf{rabpgt}...
  gts1.msf{humglutrn} vs. gts1.msf{ratglutrn}...
  gts1.msf{rabpgt} vs. gts1.msf{ratglutrn}...
 Results written to the file gts1.diverge
%

OUTPUT

[ Previous | Top | Next ]

Here is an excerpt from the output file:

 
 
DIVERGE October 21, 1998 14:55 between the coding sequences:
 
   gts1.msf{bovgluti} (1 to 1479)
     and
   gts1.msf{humglutrn} (1 to 1479)
 
Translation table: GenRunData:Translate.Txt
 
            Total codon pairs read:    493
Pairs with ambiguous, gap, or stop:      1
       Total codon pairs processed:    492
         Pairs with no differences:    380
         Pairs with one difference:    108
        Pairs with two differences:      4
      Pairs with three differences:      0
 
                      nondegenerate  twofold degenerate  fourfold degenerate
                          sites            sites                sites
 
total sites (L)          951.50           247.00               277.50
transitions (S)            7.00            27.50                48.00
transversions (V)          6.75             3.25                23.50
 
Ks= 0.277 (s.d. 0.0389)    synonymous substitutions per synonymous site
Ka= 0.016 (s.d. 0.0059) nonsynonymous substitutions per nonsynonymous site
 
Ka:Ks ratio= 0.057    Ks:Ka ratio= 17.478
 
////////////////////////////////////////////////////////////////////////////
 
Synonymous substitutions (Ks) per 100 synonymous sites.
(999.99 indicates an invalid value.)
 
Symmatrix version 1
Number of matrices: 1
 
//
Matrix 1, dimension: 4
 
Key for column and row indices:
 
  1 gts1.msf{bovgluti}
  2 gts1.msf{humglutrn}
  3 gts1.msf{rabpgt}
  4 gts1.msf{ratglutrn}
 
 Matrix 1: Part 1
 
                 1         2         3         4
__________________________________________________ ..
|     1  |      0.00     27.74     28.33     51.63
|     2  |                0.00     22.76     43.14
|     3  |                          0.00     44.23
|     4  |                                    0.00
 
______________________________________
 
Nonsynonymous substitutions (Ka) per 100 nonsynonymous sites.
(999.99 indicates an invalid value.)
 
Symmatrix version 1
Number of matrices: 1
 
//
Matrix 1, dimension: 4
 
Key for column and row indices:
 
  1 gts1.msf{bovgluti}
  2 gts1.msf{humglutrn}
  3 gts1.msf{rabpgt}
  4 gts1.msf{ratglutrn}
 
 Matrix 1: Part 1
 
                 1         2         3         4
__________________________________________________ ..
|     1  |      0.00      1.59      1.93      1.54
|     2  |                0.00      1.57      1.65
|     3  |                          0.00      1.39
|     4  |                                    0.00

INPUT FILES

[ Previous | Top | Next ]

Diverge accepts two individual nucleic acid sequences or multiple nucleic acid sequences (two or more). You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenBank:*. If Diverge rejects your nucleotide sequence, turn to Appendix VI to see how to change or set the type of a sequence.

RELATED PROGRAMS

[ Previous | Top | Next ]

Assemble create new sequence files from ranges within existing sequence files. When run with the command-line option -OUT, Gap creates aligned sequences in individual sequence files. PileUp performs multiple sequence alignments. Distances makes a table of the pairwise distances between the sequences in a multiple sequence alignment. GrowTree takes a table of pairwise distances or a table of synonymous or nonsynonymous substitutions and reconstructs a phylogenetic tree.

RESTRICTIONS

[ Previous | Top | Next ]

You must supply Diverge with two or more nucleic acid sequences that code for proteins, with the sequences aligned codon by codon. The starting coordinates for the analysis must correspond to the beginning of codons or the results will not be valid. If the length of a sequence range is not a multiple of three, the program will trim the ending coordinate back until the range is a multiple of three. Since Diverge does not create alignments, it is your responsibility to ensure that the sequences specified by a list file or wild-card file specification are in alignment before using them as input to Diverge . One way to verify this is to use Pretty to display the sequences; if the Pretty output shows an acceptable alignment, the sequences are suitable for use with Diverge.

To obtain valid results, you must use the proper translation scheme (genetic code). All genes in the set to analyze must use the same translation scheme. Diverge ignores stop codons and codons that contain ambiguity symbols or gap characters (see Appendix III).

ALGORITHM

[ Previous | Top | Next ]

The method of Li et al. has several improvements over earlier methods. While the earlier methods assume that nucleotide substitution occurs randomly, this method allows differences in the transitional and tranversional substitution rates. It also corrects more rigorously for multiple substitutions by considering nondegenerate sites, twofold degenerate sites, and fourfold degenerate sites separately.

The method of Li et al. works as follows.

1. It scans a pair of sequences codon by codon and classifies the nucleotide sites into nondegenerate, twofold degenerate, and fourfold degenerate sites. (Threefold degenerate sites, such as the third position of isoleucine (ATH in the universal code), are classified as twofold degenerate sites.) It then divides the totals by two to obtain the averages L(0) (nondegenerate), L(2) (twofold degenerate), and L(4) (fourfold degenerate).

2. It next compares the two sequences codon by codon to classify the observed nucleotide differences. Each difference is classified according to the type of site at which it occurs and whether it is a transition or a transversion. The statistics are accumulated as S(i) (transitions, i = 0, 2, 4) and V(i) (transversions, i = 0, 2, 4).

Twofold degenerate sites are treated differently than the other site types. If a substitution results in a synonymous change, the substitution is classed as a transition, and if it results in a nonsynonymous change, it is classed as a transversion, no matter what the actual change is. In the universal genetic code, the only sites that are misclassified by this convention are in the first position of four of the arginine codons (CGA, CGG, AGA, and AGG) and in the last position of the three isoleucine codons (ATH).

3. It calculates the proportion of transitional differences P(i) = S(i) / L(i) and the proportion of transversional differences Q(i) = V(i) / L(i).

4. It uses Kimura's two-parameter method to estimate the true number of transitional (A(i)) and transversional (B(i)) substitutions at each type of site. (This step corrects for multiple hits and for differing substitution rates for transitions and transversions.)

A(i) = -0.5 * ln(a(i)) + 0.25 * ln(b(i))
B(i) = -0.5 * ln(b(i))

where

a(i) = 1 - 2 * P(i) - Q(i)
b(i) = 1 - 2 * Q(i)

5. Finally, it computes K(s) (synonymous substitutions per synonymous site) and K(a) (nonsynonymous substitutions per nonsynonymous site).

K(s) = B(4) + (L(2) * A(2) + L(4) * A(4)) /
(L(2) + L(4))
K(a) = A(0) + (L(0) * B(0) + L(2) * B(2)) / (L(0) +
L(2))

When codons differ by only one nucleotide, applying this method is straightforward. In cases where codons differ by more than one nucleotide, the substitution order is not unique, resulting in different substitution pathways between the two codons (two possible pathways for two nucleotide differences; six pathways when all three nucleotides differ). The method of Li et al. uses empirically derived relative likelihoods of codon changes to weight these different pathways. Our implementation differs from the original published method at this point.

The method of Li et al. was originally implemented to study mammalian sequences using the universal genetic code or the mammalian mitochondrial code. The authors collected statistics on substitutions in mammalian genes to derive relative likelihoods of codon changes which could be used to weight the substitution pathways.

We wanted to be able to use translation schemes other than the universal genetic code. Therefore, we could not use the relative likelihoods in Li et al. to weight the paths. Instead we follow a suggestion of Pamilo and Bianchi and equally weight the possible paths between the codons, excluding those that contain a stop codon as an intermediate step. (Paths containing a stop codon are assumed to have a likelihood of zero.) The results from this method are very close to the results from the method of Li et al. for mammalian genes.

CONSIDERATIONS

[ Previous | Top | Next ]

The input sequences must be pre-aligned codon by codon. Most computer sequence alignment programs such as Gap or PileUp do not know about codon boundaries and may insert gaps in the middle of a codon or may insert gaps that are not a multiple of three nucleotides. You may need to align the sequences at the amino acid level and edit the nucleic acid alignment accordingly. 

Li et al. caution that "to get a reliable estimate of the rate of nucleotide substitution, the degree of sequence diversion should not be too small or too large." (Examples of genes whose divergence is too small are the highly conserved histone and glucagon genes.)

The Kimura two-parameter method uses equations that contain the natural logarithm function. Because this function is undefined when its argument is less than or equal to zero, it may not be possible to compute the K(s) and/or K(a) for certain pairs of sequences. This situation is rare for actual coding regions, but may occur if you analyze very short sequences. For example, K(s) cannot be computed for the sequences CTACTG and TTATTG (although K(a) can be computed).

The sequences given to Diverge must be nucleotide sequences. If Diverge rejects your nucleotide sequence, turn to Appendix VI to see how to change or set the type of a sequence.

COMMAND-LINE SUMMARY

[ Previous | Top | Next ]

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % diverge [-INfile1=]gts1.msf{*} -Default
 
Prompted Parameters:
 
[-OUTfile=]ratglutrn.diverge  specifies the output file name
 
Local Data Files:
 
-TRANSlate=translate.txt      contains the genetic code
 
Optional Parameters:
 
-BEGin1=1 -END1=1479   sets the range of interest for sequence 1
                         (or for all sequences if multiple alignment)
-BEGin2=1 -END2=1479   sets the range of interest for sequence 2
                         (if not a multiple alignment)
-REV1                  uses the reverse strand for the first sequence
                         (or for all sequences if multiple alignment)
-REV2                  uses the reverse strand for the second sequence
                         (if not a multiple alignment)
-TOFiles               outputs individual files for Ka and Ks matrices
                         (if a multiple alignment)

LOCAL DATA FILES

[ Previous | Top | Next ]

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

The translation of codons to amino acids, the identification of potential start codons and stop codons, and the mappings of one-letter to three-letter amino acid codes are all defined in a translation table in the file translate.txt. If the standard genetic code does not apply to your sequence, you can provide a modified version of this file in your working directory or name an alternative file on the command line with an expression like -TRANSlate=mycode.txt. Translation tables are discussed in more detail in Appendix VII.

PARAMETER REFERENCE

[ Previous | Top ]

You can set the parameters listed below from the command line.

-TRANSlate=filename.txt

Usually, translation is based on the translation table in a default or local data file called translate.txt. This parameter allows you to use a translation table in a different file. (See Appendix VII for information about translation tables.)

-BEGin=1

Sets the beginning position for all input sequences. When the beginning position is set from the command line, Diverge ignores beginning positions specified for individual sequences in a list file.

-END=100

Sets the ending position for all input sequences. When the ending position is set from the command line, Diverge ignores ending positions specified for sequences in a list file.

-REVerse

Sets the program to use the reverse strand for each input sequence. When -REVerse or -NOREVerse is on the command line, Diverge ignores any strand designation for individual sequences in a list file.

-TOFiles

When this option is on the command line and the input to Diverge is a multiple sequence alignment, Diverge will output individual matrix files for the synonymous and nonsynonymous substitutions (file extensions .ks and .ka, respectively) that can be used as input to GrowTree.

Printed: May 27, 2005  12:05


[Genhelp | Program Manual | User's Guide | Data Files | Databases | Release Notes ]


Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ® and, the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

www.accelrys.com/bio