DISTANCES

Distances writes a matrix of the pairwise evolutionary distances between aligned sequences. The distances are expressed as substitutions per 100 bases or amino acids. Several methods may be chosen to correct the distances for multiple substitutions at a site. For nucleic acid sequences, these methods are Kimura's two-parameter method, the Tajima-Nei method, the Jin-Nei gamma distance method, and the Tamura method; for protein sequences, the Kimura method; and for either type of sequence, the Jukes-Cantor method. It is also possible to obtain an uncorrected distance.

EXAMPLE

Here is a session using Distances to determine distances between the aligned sequences in the file hum_gtr.msf.

% distances

 DISTANCES for what aligned sequences ?  hum_gtr.msf{*}

 Reading sequences...

                gtr1_human: 548 total, 548 read

                gtr1_human: 548 total, 548 read

                gtr1_human: 548 total, 548 read

                gtr1_human: 548 total, 548 read

                gtr1_human: 548 total, 548 read

Distances will be computed for 5 protein sequences.

 Which distance correction method to use ?

      1 Uncorrected distance

      2 Jukes-Cantor distance

      3 Kimura protein distance

 Choose the method to use: (* 3 *)

 What should I call the distance matrix file (* hum_gtr.distances *) ?

Computing distances using Kimura method...

 1 x  2:  48.61    1 x  3:  45.50

 1 x  4:  65.74    1 x  5: 107.70

 2 x  3:  61.53    2 x  4:  74.57

 2 x  5: 113.82    3 x  4:  68.93

 3 x  5: 104.43    4 x  5: 110.86

Statistics on pairwise distances:

 5 of 10 pairs have distances exceeding 70.0.

OUTPUT

Here is the 5 x 5 distance matrix created during the example session:

 DISTANCES between protein sequences in: hum_gtr.msf{*}  October 20, 1998 13:00

 Correction method: Kimura protein distance

 Distances are: estimated number of substitutions per 100 amino acids

Symmatrix version 1

Number of matrices: 1

//

Matrix 1, dimension: 5

Key for column and row indices:

  1 gtr1_human

  2 gtr3_human

  3 gtr4_human

  4 gtr2_human

  5 gtr5_human

 Matrix 1: Part 1

                 1         2         3         4         5

____________________________________________________________ ..

|     1  |      0.00     48.61     45.50     65.74    107.70

|     2  |                0.00     61.53     74.57    113.82

|     3  |                          0.00     68.93    104.43

|     4  |                                    0.00    110.86

|     5  |                                              0.00

Details of this distance matrix format

If you are interested in putting your own distance information into this matrix format, for example to draw the tree for non-sequence derived distances using GrowTree, the easiest way to do so would be to make a template matrix with some short random sequences (one character in length is enough) and then replace the data points in the matrix with your own data points.

If you plan to do this often, or have a large number of data points, a script that converts your distance matrix to the GCG distance matrix format may save you some time. Here is the basic format of a GCG distance matrix:

Heading: At the top you can put your own comments. The heading must then contain a line giving the version, "Symmatrix version 1"; the format described here is for version 1. Next comes a line giving the number of matrices in the file, "Number of matrices: 1"; currently GrowTree processes only one matrix per file. After that you can put any number of comment lines, followed by two slashes ("//") on a line by themselves. Then give the matrix number M and the number of dimensions (e.g. sequences) D as "Matrix M, dimension: D". After this you can again put comments. Then list which entity (e.g. sequence) is assigned to which column number: start with a line beginning "Key for column", followed by a blank line, followed by as many lines as there are dimensions, each listing a column number followed by an entity name. End the heading section with two dots ("..").

Matrix: If you have more than 12 dimensions the matrix is split into several parts, each having 12 columns of data points. Each part has as many rows as there are dimensions. This is important, but might easily be missed, since some of the bottom rows in some parts of a multipart matrix will be empty. Each row has the first 10 characters reserved for labeling, after that it contains the not yet listed data points for the respective columns separated by white space. Each row needs to be on one line only. A line containing two dots ".." separates each part of a multipart matrix. You can have comments after one part and before the two dots.
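As a rough illustration of the format just described, the following Python sketch writes a symmetric matrix in this layout. The function name write_symmatrix, the 10-character column width, and the exact spacing are assumptions modeled on the example output above; they are not part of GCG, and the result has not been checked against GrowTree.

```python
def write_symmatrix(names, dist, comment="", cols_per_part=12):
    """Return text in a 'Symmatrix version 1' style layout (hypothetical helper).

    names: entity names, one per dimension.
    dist:  square, symmetric list of lists of distances.
    """
    d = len(names)
    lines = [comment, "Symmatrix version 1", "Number of matrices: 1", "//",
             f"Matrix 1, dimension: {d}", "Key for column and row indices:", ""]
    lines += [f"{i:>3} {name}" for i, name in enumerate(names, start=1)]
    for part, start in enumerate(range(0, d, cols_per_part), start=1):
        cols = range(start, min(start + cols_per_part, d))
        lines.append(f" Matrix 1: Part {part}")
        lines.append(" " * 10 + "".join(f"{c + 1:>10}" for c in cols))
        lines.append("_" * 60 + " ..")   # two dots precede the rows of each part
        for r in range(d):               # every part has one row per dimension
            # Upper triangle only: leave blank what an earlier row already listed.
            cells = "".join(f"{dist[r][c]:>10.2f}" if c >= r else " " * 10
                            for c in cols)
            # The row label occupies the first 10 characters of the line.
            lines.append((f"|{r + 1:>6}  |" + cells).rstrip())
    return "\n".join(lines) + "\n"


names = ["gtr1_human", "gtr3_human", "gtr4_human"]
dist = [[0.0, 48.61, 45.50],
        [48.61, 0.0, 61.53],
        [45.50, 61.53, 0.0]]
print(write_symmatrix(names, dist, comment="example matrix"))
```

With more than 12 dimensions the loop emits additional parts, and the bottom rows of the earlier parts carry only their label, as noted above.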

INPUT FILES

Distances accepts multiple sequences (two or more) all of the same type. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenBank:*. The function of Distances depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

PileUp creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. Pretty displays multiple sequence alignments.

Accelrys GCG (GCG) includes several programs for evolutionary analysis of multiple sequence alignments. Distances creates a matrix of pairwise distances between the sequences in a multiple sequence alignment. Diverge measures the number of synonymous and nonsynonymous substitutions per site of two or more aligned protein coding regions and can output matrices of these values. GrowTree reconstructs a tree from a distance matrix or a matrix of synonymous or nonsynonymous substitutions. PAUPSearch reconstructs phylogenetic trees from a multiple sequence alignment using parsimony, distance, or maximum likelihood criteria; PAUPDisplay can manipulate and display the trees output by PAUPSearch and can also plot the trees output by GrowTree.

RESTRICTIONS

The sequences must be aligned properly for Distances to work. Since Distances does not create alignments, it is your responsibility to ensure that the sequences specified by a list file or wild-card file specification are in alignment before using them as input to Distances. One way to verify this is to use Pretty to display the sequences; if the Pretty output shows an acceptable alignment, the sequences are suitable for use with Distances.

ALGORITHM

Distances examines each pair of aligned sequences symbol-by-symbol and counts the number of exact matches, partial matches, and gap symbols. If the sequences are nucleic acids, transitions (purine-purine or pyrimidine-pyrimidine substitutions) and transversions (purine-pyrimidine substitutions) are also tallied. These counts are used, where appropriate, to compute the distance.
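The tallying step can be sketched in Python as follows. This is only an illustration, not the program's own code: the function name tally_pair and the set of gap characters are assumptions, and positions containing ambiguity codes are simply skipped here rather than scored as partial matches.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
GAP_SYMBOLS = {".", "-", "~"}   # assumed gap characters for this sketch


def tally_pair(seq1, seq2):
    """Count matches, transitions, transversions, and gaps for two aligned
    nucleic acid sequences (hypothetical helper)."""
    counts = {"scored": 0, "matches": 0, "transitions": 0,
              "transversions": 0, "gaps": 0}
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in GAP_SYMBOLS or b in GAP_SYMBOLS:
            counts["gaps"] += 1
            continue
        if a not in "ACGT" or b not in "ACGT":
            continue                      # ambiguity codes are skipped here
        counts["scored"] += 1
        if a == b:
            counts["matches"] += 1
        elif {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            counts["transitions"] += 1    # purine-purine or pyrimidine-pyrimidine
        else:
            counts["transversions"] += 1  # purine-pyrimidine
    return counts


print(tally_pair("AGCTA-", "GGATAC"))
```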

The Need to Correct For Multiple Substitutions

When sequences are very closely related, the observed distance and the actual distance between two sequences are equivalent. As the time since the sequences diverged increases, the probability that more than one substitution occurred at a single site also increases. Therefore for all but closely related sequences, the observed distance between the sequences underestimates the true distance.

In order to construct a valid tree, the observed distances must be corrected to account for multiple substitutions at a single site. A number of methods have been devised to make this correction. Each makes different assumptions about the substitution process.

Uncorrected Distance

This method computes the observed distance between sequences, with no correction for multiple substitutions. This uncorrected distance is sometimes referred to as the p-distance. It can be used for either nucleic acid or protein sequences, and gap positions can be factored into the calculation or ignored. A match score is computed by summing the number of exact matches. If -AMBIGuous is used, partial matches between ambiguous symbols also contribute to the match score as fractional scores (for example, the nucleotide W matched with A would score 0.5, while N matched with A would score 0.25). The similarity S is computed by dividing the match score by the number of positions scored plus the number of gap positions times the gap penalty. The distance is 1 - S. Gaps are ignored unless a nonzero value is specified for -GAPweight. End gaps are penalized as much as internal gaps, so if you choose to apply a gap penalty and gaps exist at the beginning and/or end of some of the sequences in the alignment, make sure to set the beginning and ending coordinates to exclude these regions.

S = matches / (positions_scored + gaps * gap_penalty)

D = uncorrected distance = p-distance = 1 - S
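A minimal Python sketch of the two equations above, working from counts that are assumed to have been tallied already. The function and argument names are illustrative. The result is a fraction of sites; multiply by 100 to obtain substitutions per 100 positions.

```python
def uncorrected_distance(match_score, positions_scored, gaps=0, gap_penalty=0.0):
    """p-distance: D = 1 - S, with S = matches / (positions_scored + gaps * gap_penalty)."""
    denominator = positions_scored + gaps * gap_penalty
    if denominator <= 0:
        raise ValueError("no positions were scored")
    similarity = match_score / denominator
    return 1.0 - similarity


# 80 matches in 100 scored positions, gaps ignored (gap penalty 0)
print(uncorrected_distance(80, 100))
# the same pair with 10 gap positions penalized at weight 1.0
print(uncorrected_distance(80, 100, gaps=10, gap_penalty=1.0))
```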

Jukes-Cantor Distance

This method for correcting distances can be used for nucleic acid or protein sequences. Gap positions can be factored into the equation by specifying a nonzero value for -GAPweight, and partial matches between ambiguous symbols can contribute to the match score if -AMBIGuous is used. The uncorrected distance D is computed and then corrected to account for multiple substitutions at a site using the equation below. The parameter b is 3/4 for nucleic acid sequences, 19/20 for protein sequences. End gaps are penalized as much as internal gaps, so if you choose to apply a gap penalty and gaps exist at the beginning and/or end of some of the sequences in the alignment, make sure to set the beginning and ending coordinates to exclude these regions.

distance = -b ln(1 - D/b)

from "Phylogenetic Inference," Swofford, Olsen, Waddell, and Hillis, in Molecular Systematics, second edition, ed. D. M. Hillis, C. Moritz, and B. K. Mable, Sinauer Associates, Inc., 1996, Ch. 11, derived from "Evolution of Protein Molecules," Jukes and Cantor, in Mammalian Protein Metabolism, vol. III, ed. H. N. Munro, Academic Press, 1969, pp. 21-132.

The Jukes-Cantor method is based on two assumptions: that substitution occurs at any site along the sequence with equal probability, and that the probability of a change from one nucleotide to any of the other three nucleotides or from one amino acid to any of the other 19 amino acids is the same. These assumptions tend to break down as divergence time increases, so this correction method underestimates the true distance for more distantly related sequences.
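The Jukes-Cantor correction can be sketched in Python as shown here. The function name is illustrative; D is the uncorrected distance as a fraction of sites, and the result is in substitutions per site (multiply by 100 for the units Distances reports). The correction is undefined once D reaches b, which this sketch reports as an error.

```python
import math


def jukes_cantor(d, protein=False):
    """Jukes-Cantor distance: -b * ln(1 - D/b), b = 3/4 (nucleic acid) or 19/20 (protein)."""
    b = 19 / 20 if protein else 3 / 4
    if d >= b:
        raise ValueError("uncorrected distance too large for this correction")
    return -b * math.log(1 - d / b)


# An uncorrected protein distance of 70 per 100 corresponds to a
# Jukes-Cantor estimate of about 127 per 100.
print(round(100 * jukes_cantor(0.70, protein=True)))
```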

Tajima-Nei Distance

This method applies to nucleic acid sequences only. It uses the same equation as the Jukes-Cantor method, except that the parameters are calculated somewhat differently: the value of the parameter b varies with the base composition of the sequence pairs. In addition, only exact matches are considered in computing the match score, and gap positions are always ignored. In the equations below, A=1, T=2, C=3, G=4.

b = 1/2 * (1 - SUM(i = A..G) fraction[i]^2 + D^2 / h)

h = SUM(i = A..C) SUM(k = i+1..G) pairfreq[i,k]^2 / (2 * fraction[i] * fraction[k])

distance = -b ln(1 - D/b)

Tajima and Nei, Mol. Biol. Evol. 1; 269-285 (1984), equation 6.

The Tajima-Nei correction method makes two assumptions: substitution occurs at any site along the sequence with equal probability, and substitution occurs according to the "equal input" model of nucleotide substitution. The equal input model assumes that the rate of substitution to a given nucleotide is the same, regardless of the original nucleotide, i.e., that a change from A to T has the same rate as the change from G to T. If these assumptions do not hold, the method underestimates the true distance as the distance increases.
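The following Python sketch computes the Tajima-Nei distance for one pair of aligned sequences. It follows the published form of the equation, in which each squared pair frequency is divided by twice the product of the two base frequencies. The function name is illustrative; taking fraction[i] as the average base frequency over both sequences and pairfreq[i,k] as the fraction of scored sites showing the unordered pair i,k are assumptions of this sketch.

```python
import math
from itertools import combinations


def tajima_nei(seq1, seq2):
    """Tajima-Nei distance in substitutions per site (hypothetical helper)."""
    # Only positions with an unambiguous base in both sequences are scored.
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ATCG" and b in "ATCG"]
    n = len(pairs)
    d = sum(a != b for a, b in pairs) / n
    if d == 0:
        return 0.0
    # Average frequency of each base over the two sequences.
    frac = {x: sum((a == x) + (b == x) for a, b in pairs) / (2 * n)
            for x in "ATCG"}
    h = 0.0
    for i, k in combinations("ATCG", 2):          # the six unordered pairs
        pairfreq = sum({a, b} == {i, k} for a, b in pairs) / n
        if pairfreq:
            h += pairfreq ** 2 / (2 * frac[i] * frac[k])
    b = 0.5 * (1 - sum(f * f for f in frac.values()) + d * d / h)
    return -b * math.log(1 - d / b)


s1 = "AAATTTCCCGGG" + "ATACAGTCTGCG"
s2 = "AAATTTCCCGGG" + "TACAGACTGTGC"
print(tajima_nei(s1, s2))
```

In this example the base frequencies and the six pair frequencies are all equal, so b works out to 3/4 and the result coincides with the Jukes-Cantor distance for D = 0.5.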

Kimura Two-Parameter Distance

This method applies only to nucleic acids and takes into consideration the fact that transition substitutions (purine-purine or pyrimidine-pyrimidine) often occur much more frequently than transversion substitutions (purine-pyrimidine). Gap positions and ambiguous symbols other than R (purine) and Y (pyrimidine) are not scored.

P = transitions / positions_scored

Q = transversions / positions_scored

distance = -1/2 * ln[ (1 - 2P - Q) * sqrt(1 - 2Q) ]

M. Kimura, J. Mol. Evol. 16; 111-120 (1980).

This method gives better distance estimates than the Jukes-Cantor method when the rates of transitional and transversional substitutions are different. However, when the substitution pattern is more complex than this, this method underestimates the true distance for distantly related sequences.
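A short Python sketch of the Kimura two-parameter equation, starting from transition and transversion counts that are assumed to have been tallied already. The function and argument names are illustrative; the result is in substitutions per site.

```python
import math


def kimura_2p(transitions, transversions, positions_scored):
    """Kimura two-parameter distance: -1/2 * ln[(1 - 2P - Q) * sqrt(1 - 2Q)]."""
    p = transitions / positions_scored
    q = transversions / positions_scored
    inner = (1 - 2 * p - q) * math.sqrt(1 - 2 * q)
    return -0.5 * math.log(inner)


# 10 transitions and 5 transversions in 100 scored positions
print(kimura_2p(10, 5, 100))
```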

Tamura Distance

This method applies only to nucleic acids and assumes that substitution occurs at any site along the sequence with equal probability. It takes different rates of transitions and transversions into account and also takes into account deviation of G+C content from the expected value of 50 percent. Gap positions and ambiguous symbols are not scored.

P = transitions / positions_scored

Q = transversions / positions_scored

theta1 = fraction G+C in sequence 1
theta2 = fraction G+C in sequence 2
C = theta1 + theta2 - 2 * theta1 * theta2

distance = -C ln(1 - P/C - Q) - 0.5 (1 - C) ln(1 - 2Q)

K. Tamura, Mol. Biol. Evol. 9; 678-687 (1992).

When there are strong transition-transversion and G+C-content biases, this method can yield better distance estimates than the Jukes-Cantor, Kimura two-parameter, or Tajima-Nei methods. Tamura recommends that it be used only when the corrected distance is "not very large," and implies that estimated distances greater than 50 substitutions per 100 bases may not be accurate.
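The Tamura equation can be sketched in Python as follows. The function and argument names are illustrative; gc1 and gc2 are the G+C fractions of the two sequences (theta1 and theta2 above), and the result is in substitutions per site.

```python
import math


def tamura_distance(transitions, transversions, positions_scored, gc1, gc2):
    """Tamura distance: -C ln(1 - P/C - Q) - 0.5 (1 - C) ln(1 - 2Q)."""
    p = transitions / positions_scored
    q = transversions / positions_scored
    c = gc1 + gc2 - 2 * gc1 * gc2
    return -c * math.log(1 - p / c - q) - 0.5 * (1 - c) * math.log(1 - 2 * q)


# With 50 percent G+C in both sequences, C = 0.5 and the result equals
# the Kimura two-parameter distance for the same counts.
print(tamura_distance(10, 5, 100, 0.5, 0.5))
# The same counts with a strong G+C bias in both sequences
print(tamura_distance(10, 5, 100, 0.7, 0.7))
```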

Jin-Nei Gamma Distance

This is another method that applies only to nucleic acids and that takes transitions and transversions into account. Gap positions and ambiguous symbols other than R and Y are not scored. This method is designed to be used when the substitution rate varies extensively from site to site. The shape parameter a is the square of the inverse of the coefficient of variation.

L = average substitution rate = transition_rate + 2 * transversion_rate

a = (mean of L)^2 / (variance of L)

P = transitions / nScored
Q = transversions / nScored

distance = 1/2 * a * [ (1 - 2P - Q)^(-1/a) + 1/2 * (1 - 2Q)^(-1/a) - 3/2 ]

Jin and Nei, Mol. Biol. Evol. 7; 82-102 (1990).

The gamma distance correction is based on the assumption that the nucleotide substitution rate varies from site to site according to the gamma distribution.
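A Python sketch of the Jin-Nei gamma distance, again starting from counts assumed to have been tallied already. The function and argument names are illustrative; the default a = 1.0 mirrors the default of -APARAMeter, and the result is in substitutions per site.

```python
def jin_nei_gamma(transitions, transversions, positions_scored, a=1.0):
    """Jin-Nei gamma distance:
    a/2 * [(1 - 2P - Q)^(-1/a) + 1/2 * (1 - 2Q)^(-1/a) - 3/2]."""
    p = transitions / positions_scored
    q = transversions / positions_scored
    term1 = (1 - 2 * p - q) ** (-1 / a)
    term2 = 0.5 * (1 - 2 * q) ** (-1 / a)
    return 0.5 * a * (term1 + term2 - 1.5)


# 10 transitions and 5 transversions in 100 scored positions, a = 1.0
print(jin_nei_gamma(10, 5, 100))
```

For the same counts this value is larger than the Kimura two-parameter distance, which is the expected effect of allowing the rate to vary among sites.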

Kimura Protein Distance

This method applies only to proteins. The formula calculates distances based on the relationship between observed amino acid substitutions and actual (corrected) substitutions that was derived by Dayhoff and coworkers. Gap positions are ignored, and only exact matches contribute to the match score.

S = exact_matches / positions_scored

D = 1 - S

distance = -ln(1 - D - 0.2 D^2)

M. Kimura, The Neutral Theory of Molecular Evolution, Cambridge University Press, Cambridge, 1983.

This method overestimates the true distance when the uncorrected distance is greater than about 70 observed substitutions per 100 amino acids (equivalent to a Jukes-Cantor distance estimate of about 127 substitutions per 100 amino acids).
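The Kimura protein correction can be sketched in Python as follows. The function and argument names are illustrative; the result is in substitutions per site, so multiply by 100 for the units used in the distance matrix.

```python
import math


def kimura_protein(exact_matches, positions_scored):
    """Kimura protein distance: -ln(1 - D - 0.2 D^2), with D = 1 - S."""
    s = exact_matches / positions_scored
    d = 1 - s
    return -math.log(1 - d - 0.2 * d * d)


# 30 exact matches in 100 scored positions: D = 0.70, the point beyond
# which the text above warns that the method overestimates the distance.
print(100 * kimura_protein(30, 100))
```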

CONSIDERATIONS

The single most critical step in tree reconstruction is the sequence alignment. If the alignment is poorly done, no amount of care or tweaking of analysis parameters will guarantee a correct tree. Multiple alignments that are created by computerized methods such as PileUp will need to be inspected and edited by hand, using an editor such as SeqLab. Be especially careful with nucleic acid sequences that are coding regions, since computerized alignment methods have no knowledge of codon boundaries. They may insert a gap whose length is not a multiple of three or may insert a gap in the middle of a codon, for example.

Once the alignment is satisfactory, you must decide whether to use the entire alignment, or only portions of it. Only homologous regions of the sequences should be used to reconstruct a tree. Any regions of an alignment that contain data for which no homologs occur in the other sequences should be eliminated from consideration. For example, if there are gap characters at the beginning or end of one or more sequences in the alignment, the sequence data at the extremes of the alignment should not be used, since the longer sequences contain regions that have no homologs in the shorter sequences. Similarly, regions in the interior of the alignment that contain gaps in some of the sequences should probably be edited out of the alignment before trying to reconstruct a tree.

Some biological phenomena can interfere with tree reconstruction. Gene duplication is one of them. When genes are duplicated (by polyploidy or by regional duplication), one of the copies often accumulates mutations and either acquires a different function than the original gene or becomes a pseudogene. In this situation, it is often unclear which of the alternative loci will give the correct tree for the functional gene. Another complication is recombination: if recombination has occurred between sequences in the data set, no single tree can correctly explain the data.

Some data sets can also confound the existing methods for tree construction. For example, a set of sequences consisting of mostly closely related sequences with a few very divergent sequences cannot be analyzed using parsimony or a distance method based on an improperly corrected distance matrix. These methods will systematically group the widely diverged sequences together as sister groups, even if they actually belong to different lineages. If you don't want to drop the diverged sequences from the analysis, you will need to add sequences to the alignment that bridge the distance between the more distant sequences and the group of closely related sequences, or use a distance method based on a properly corrected distance matrix.

Another consideration when computing distances between coding regions is whether to use all three nucleotides in each codon or just the first or second. The substitution rate at the third codon position is usually much higher than that at the other two positions because of the degeneracy of the genetic code. In these cases, it might be best to use just the first position or just the first two positions of each codon to compute the distances.

It is important to use the proper correction method when computing distances, unless the sequences are all very closely related. Some guidelines for choosing a correction method are listed under the SUGGESTIONS topic.

SUGGESTIONS

If the aligned sequences are not in an MSF file format, use Pretty to display the aligned sequences you pass to Distances. If they look properly aligned in the Pretty display, they will work sensibly with Distances.

To get the best nucleotide alignments of coding regions, you also should align the sequences at the protein level and adjust the nucleotide alignment to conform to the amino acid alignment. You can do this manually using SeqLab.

One way of detecting the presence of recombination in your sequence set is to reconstruct trees from different sections of the alignment. If different trees are found for different sections, it's possible that recombination has occurred.

To check the distance distribution of your sequences, create an uncorrected distance matrix from the alignment (using Distances) and examine the contents. If there are mostly closely related sequences with a few very divergent sequences, you must either add sequences to the alignment to bridge the distance between the more distant sequences and the group of closely related sequences, or you must use a distance method based on a properly corrected distance matrix.

Jin and Nei, Mol. Biol. Evol. 7; 82-102 (1990), give a set of guidelines for choosing a distance correction method for nucleic acid sequences. Here is a summary of their suggestions.

First compute the distances using the Jukes-Cantor method. If all the distances are less than or equal to 10 substitutions per 100 bases, there is no need to use another method (all the correction methods calculate about the same distances for closely related sequences). If the distances are greater than 10 substitutions per 100 bases, choose a correction method based on the following criteria:

- If the Jukes-Cantor distances are between 10 and 30 substitutions and there is a difference in the transition and transversion rates, use the Kimura two-parameter distance.

- If the Jukes-Cantor distances are between 30 and 100 substitutions and there is evidence that the substitution rate varies extensively from site to site, use the Jin-Nei gamma distance with -APARAMeter=1.0. If the distances lie between 30 and 100 and the frequencies of the four nucleotides deviate substantially from equality, use the Tajima-Nei distance.

- If the Jukes-Cantor distance is greater than 100 for many pairs of sequences, the tree that will be constructed from the distance data will not be reliable. Depending on your data, and the reason that you are computing the distances, one of the following suggestions may help:

a. For coding regions, try using just the first codon position or the first and second codon positions when computing the distances.

b. For coding regions, align the protein sequences and compute the distances as amino acid substitutions.

c. If you know that a certain region of the sequence is evolving very rapidly compared to the rest of the sequence, remove that region from the alignment and recompute the distances.

If there is a strong G+C content bias as well as a difference in transition and transversion rates, use the Tamura distance.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % distances [-INfile=]hum_gtr.msf{*} -Default

Prompted Parameters:

[-OUTfile=]hum_gtr.distances  names the output file

                              Correction Methods for Nucleic Acid Sequences

-MENu=1                       uncorrected distance

      2                       Jukes-Cantor distance

      3                       Kimura 2-parameter distance

      4                       Jin-Nei gamma distance

      5                       Tajima-Nei distance

      6                       Tamura distance

                              Correction Methods for Protein Sequences

-MENu=1                       uncorrected distance

      2                       Jukes-Cantor distance

      3                       Kimura protein distance

Local Data Files:  None

Optional Parameters:

-BEGin=1 -END=100       sets the range of interest

-FILe=hum_gtr.report    names the table of counts used to calculate distances

-AMBIGuous              considers partial matches between ambiguous

                          symbols

-POSition=5             sets base position(s) to consider

-GAPweight=0.0          sets gap penalty (uncorrected and Jukes-Cantor only)

-APARAMeter=1.0         sets 'a' parameter (Jin-Nei gamma distance only)

-NOMONitor              suppresses screen display of the progress of the

                          analysis

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-MENu=1

Sets the distance correction method to use. For nucleic acid sequences, these are (in order): uncorrected distance, Jukes-Cantor distance, Kimura 2-parameter distance, Jin-Nei gamma distance, Tajima-Nei distance, and Tamura distance. For protein sequences, these are: uncorrected distance, Jukes-Cantor distance, and Kimura protein distance.

-BEGin=1

Sets the beginning position for all input sequences. When the beginning position is set from the command line, Distances ignores beginning positions specified for individual sequences in a list file.

-END=100

Sets the ending position for all input sequences. When the ending position is set from the command line, Distances ignores ending positions specified for sequences in a list file.

-FILe=hum_gtr.report

Creates a table of the counts used to calculate the distances: number of positions scored, exact matches, ambiguous symbol matches, transitions, transversions, gap positions, etc.

-AMBIGuous

Considers partial matches between ambiguous symbols when calculating distances (uncorrected and Jukes-Cantor only).

-POSition=5

Allows you to consider a single specified codon position (1, 2, or 3), the first and second positions only (4), or all three codon positions (5) when calculating distances between nucleic acid sequences.
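The effect of the five -POSition values can be illustrated with a small Python sketch that extracts the chosen codon positions from a coding sequence. The function name is hypothetical, and the sketch assumes that the first symbol of the sequence is codon position 1; how Distances itself determines the reading frame is not described here.

```python
def select_codon_positions(seq, position=5):
    """Keep only the codon positions named by a -POSition style code:
    1, 2, 3 = that position only; 4 = first and second; 5 = all three."""
    keep = {1: (0,), 2: (1,), 3: (2,), 4: (0, 1), 5: (0, 1, 2)}[position]
    return "".join(ch for i, ch in enumerate(seq) if i % 3 in keep)


print(select_codon_positions("ATGCCA", 4))   # first and second positions
print(select_codon_positions("ATGCCA", 3))   # third positions only
```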

-GAPweight=0.0

Allows you to assign a gap penalty when using the Jukes-Cantor or uncorrected distance methods.

-APARAMeter=1.0

Allows you to vary the value of the shape parameter a in the equation used by the Jin-Nei gamma distance correction method.

-NOMONitor

Suppresses screen display of the progress of the analysis.

Printed: May 27, 2005 12:03

Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.