GAP

[Genhelp | Program Manual | User's Guide | Data Files | Databases | Release Notes ]

 

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

OUTPUT

INPUT FILES

RELATED PROGRAMS

RESTRICTIONS

ALIGNING LONG SEQUENCES

EVALUATING ALIGNMENT SIGNIFICANCE

CONSIDERATIONS

ALGORITHM

ALIGNMENT METRICS

PEPTIDE SEQUENCES

COMMAND-LINE SUMMARY

LOCAL DATA FILES

PARAMETER REFERENCE


FUNCTION

[ Top | Next ]

Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.

DESCRIPTION

[ Previous | Top | Next ]

Gap considers all possible alignments and gap positions between two sequences and creates a global alignment that maximizes the number of matched residues and minimizes the number and size of gaps. A scoring matrix is used to assign values for symbol matches. In addition, a gap creation penalty and a gap extension penalty are required to limit the insertion of gaps into the alignment. Gap uses the alignment method of Needleman and Wunsch (J. Mol. Biol. 48; 443-453 (1970)) that has been shown to be equivalent to Sellers (SIAM J. of Applied Math 26; 787-793 (1974), see the CONSIDERATIONS topic below).

EXAMPLE

[ Previous | Top | Next ]

Two haptoglobin genes are aligned with Gap. The alignment from this example is displayed graphically in the example for the GapShow program. The same sequences are compared in the figures included with DotPlot.

 
 
% gap
 
 GAP of what sequence 1 ?  hpr.seq
 
                  Begin (* 1 *) ?
                End (*  2966 *) ?
               Reverse (* No *) ?
 
 to what sequence 2 (* hpr.seq *) ?  hpf.seq
 
                  Begin (* 1 *) ?
                End (*  2740 *) ?
               Reverse (* No *) ?
 
 What is the gap creation penalty (* 50 *) ?
 
 What is the gap extension penalty (* 3 *) ?
 
 What should I call the paired output display file (* hpr.pair *) ?
 
 Aligning ..................................................
          ..................................................
          ....................................-.
 Aligning ..................................................
          ..................................................
          ....................................-.............
          .
 
          Gaps:    13
       Quality: 24426
 Quality Ratio: 8.915
  % Similarity: 94.897
        Length:  2982
 
%

OUTPUT

[ Previous | Top | Next ]

Here is the output from this session:

 
 
 GAP of: hpr.seq  check: 8102  from: 1  to: 2966
 
Haptoglobin related sequence
HindIII fragment sequenced 12/27/83
  (partially from hpf sequence)
 
 to: hpf.seq  check: 2624  from: 1  to: 2740
 
Haptoglobin alpha2
HindIII fragment , region equivalent to hp1f
 
 Symbol comparison table: /package/share/9.0/gcgcore/data/rundata/nwsgapdna.cmp
 CompCheck: 8760
 
         Gap Weight:     50      Average Match: 10.000
      Length Weight:      3   Average Mismatch:  0.000
 
            Quality:  24426             Length:   2982
              Ratio:  8.915               Gaps:     13
 Percent Similarity: 94.897   Percent Identity: 94.897
 
        Match display thresholds for the alignment(s):
                    | = IDENTITY
                    : =   5
                    . =   1
 
 hpr.seq x hpf.seq         September 29, 1998 10:32  ..
 
                  .         .         .         .         .
       1 AAGCTTGGTATGCTCAGAAGCAGCTAAAGCGTGTATGTGGGGCGGAGGGT 50
         ||||||||||||||||||||| ||||||| ||||||| |  |    | ||
       1 AAGCTTGGTATGCTCAGAAGCTGCTAAAGTGTGTATGGGCAG....GTGT 46
 
    ////////////////////////////////////////////////////////////
                  .         .         .         .         .
    1749 TTCCTCTTTCTTCAGAGATGATGAATTATTGTAGCTCCTAGCCCTTTCTT 1798
         ||| |||||||| |||||  |||||||||||||
    1678 TTCATCTTTCTTTAGAGAGAATGAATTATTGTA................. 1710
                                  .
                                  .
                                  .
                  .         .         .         .         .
    1949 TGGCCCCTAGCCCTTTCAATGAATTTCAGGGAATTGTGAAAATTCCTTTG 1998
           |||||||||||||||||||||||||||||||||||| ||||||||||
    1711 ..GCCCCTAGCCCTTTCAATGAATTTCAGGGAATTGTGGAAATTCCTTTA 1758
 
    ////////////////////////////////////////////////////////////
                  .         .         .
    2935 GAGGACACCTGGTACGCGGCTGGGATCTTAAG 2966
         |||||||||||||| ||| |||||||||||||
    2709 GAGGACACCTGGTATGCGACTGGGATCTTAAG 2740

INPUT FILES

[ Previous | Top | Next ]

Gap accepts two individual nucleotide sequences or protein sequences as input. The function of Gap depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

[ Previous | Top | Next ]

When you want an alignment that covers the whole length of both sequences, use Gap. When you are trying to find only the best segment of similarity between two sequences, use BestFit. PileUp creates a multiple sequence alignment of a group of related sequences, aligning the whole length of all sequences. DotPlot displays the entire surface of comparison for a comparison of two sequences. GapShow displays the pattern of differences between two aligned sequences. PlotSimilarity plots the average similarity of two or more aligned sequences at each position in the alignment. Pretty displays alignments of several sequences. CompTable helps generate scoring matrices for peptide comparison.

RESTRICTIONS

[ Previous | Top | Next ]

Input sequences may not be more than 32,000 symbols long.

ALIGNING LONG SEQUENCES

[ Previous | Top | Next ]

The program attempts to allocate enough computer memory to align the input sequences. In the worst case, where the two sequences being aligned are unrelated, the allocation is proportional to the product of the lengths of the two input sequences. However, in many cases where the sequences being aligned are more closely related, the computer can determine an optimal alignment using less memory. When memory on your computer is limiting and the program cannot allocate all of the memory it needs to align long sequences, it completes the alignment in whatever memory it can allocate and displays the message *** Alignment is not guaranteed to be optimal ***. Because the criteria used in the calculation for guaranteeing an optimal alignment are very stringent, the alignment often may be optimal even if this message is displayed.

If you know roughly where the alignment of interest for long sequences begins, you can run the program with -LIMit. Then set the starting coordinates for each sequence near the point where the alignment of interest begins and set gap shift limits on each sequence. The program then aligns the sequences from your starting point such that the sequences do not get out of phase by more than the gap shift limits you have set. If you started both sequences at base number one and set the gap shift limit for sequence one to 100 and for sequence two to 50, then base 350 in sequence one could not be gapped to any base outside of the range from 300 to 450 on sequence two. These limited alignments often require less computer memory than unlimited alignments.

EVALUATING ALIGNMENT SIGNIFICANCE

[ Previous | Top | Next ]

This program can help you evaluate the significance of the alignment, using a simple statistical method, with -RANdomizations. The second sequence is repeatedly shuffled, maintaining its length and composition, and then realigned to the first sequence. The average alignment score, plus or minus the standard deviation, of all randomized alignments is reported in the output file. You can compare this average quality score to the quality score of the actual alignment to help evaluate the significance of the alignment. The number of randomizations can be specified by adding an optional value to -RANdomizations; the default is 10. You can preserve the dinucleotide or dipeptide composition of the input sequence in the shuffled sequence by using -PREServe=2. Use -PREServe=3 to preserve the trinucleotide or tripeptide composition of the input sequence.

The score of each randomized alignment is reported to the screen. You can use <Ctrl>C to interrupt the randomizations and output the results from those randomized alignments that have been completed.

By ignoring the statistical properties of biological sequences, this simple Monte Carlo statistical method may give misleading results. Please see Lipman, D.J., Wilbur, W.J., Smith, T.F., and Waterman, M.S. (Nucl. Acids Res. 12; 215-226 (1984)) for a discussion of the statistical significance of nucleic acid similarities.

CONSIDERATIONS

[ Previous | Top | Next ]

Other Tools May Be Better Than Gap

Gap is capable of ignoring a region of excellent similarity or similarity between two sequences if it can produce an alignment with equal or better quality in some other way. BestFit is a better tool to search for weak or unknown similarity or similarity that you suspect is not coextensive along the sequences. It is extremely important that you think formally about what Gap does. Using Gap rather than BestFit implies that you want an alignment where neither sequence is truncated.

Gap presents you with one member of the family of best alignments. There may be (and usually are) many members of this family, but no other member has a better quality. When two sequences are closely related, Gap is a good way to see the relationship between them; however, a gapped alignment obscures, or can even be confounded by, internal repeats. Graphic matrix analysis is more powerful for seeing internally repeated structures and approximating the frame of best alignment between two sequences that have never been previously compared. (See the Compare and DotPlot programs.)

Scoring Matrices

The modification of scoring matrices is discussed in Appendix VII.

There is considerable evidence that more sensitive nucleic acid alignments may be possible by scoring transitions slightly positive and transversions slightly negative.

Gap chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with -MATRix, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) You can use -GAPweight and -LENgthweight to specify alternative gap penalties if you don't want to accept the default values.

CompTable helps you create scoring matrices based on a simplification scheme for amino acid differences. There is a also a short C program that can be modified to help you write a new scoring matrix quickly. The program is called cmpvals.c, and it is located in the public database. You may Fetch and modify cmpvals.c if you are comfortable working with the C programming language.

Forced Pairing

You can get a position in sequence one to pair with some other position in sequence two by choosing a special symbol not used in the rest of the sequences and giving it a very high match value in the scoring matrix. The alphabet of legitimate GCG sequence symbols is defined in Appendix III.

Needleman-Wunsch Versus Sellers

Gap makes an alignment to find the maximum similarity between two sequences by the method of Needleman and Wunsch (J. Mol. Biol. 48; 443-453 (1970)) that is similar to finding the minimum difference according to the method of Sellers (SIAM J. of Applied Math 26; 787-793 (1974)). Smith, Waterman, and Fitch (J. Mol. Evol. 18; 38-46 (1981)) showed that the methods were precisely equivalent when the Needleman and Wunsch gap creation penalty is equal to the Sellers gap creation penalty - 0.5 and when the end gaps for Needleman and Wunsch are penalized in same way as all the other gaps. -ENDWeight allows you to penalize the end gaps introduced by Gap.

Rapid Alignment

When possible, Gap tries to find the optimal alignment very quickly. If this rapid alignment is not unambiguously optimal, Gap automatically realigns the sequences to calculate the optimal alignment. When this occurs, the monitor of alignment progress on your terminal screen (Aligning...) is displayed twice for a single alignment.

ALGORITHM

[ Previous | Top | Next ]

Gap reads a scoring matrix that contains values for every possible GCG symbol match. Gap finds an alignment with the maximum possible quality where the quality of an alignment is equal to the sum of the values of the matches (each match scored with the scoring matrix) less the gap creation penalty times the number of internal gaps and less the gap extension penalty times the total length of the internal gaps. The alignment found by Gap is, therefore, sensitive to the scoring matrix values and the gap penalties. There is no penalty if either sequence is shifted to the place where the alignment begins unless end gaps are penalized by using -ENDWeight.

ALIGNMENT METRICS

[ Previous | Top | Next ]

BestFit and Gap display four figures of merit for alignments: Quality, Ratio, Identity, and Similarity.

The Quality (described above) is the metric maximized in order to align the sequences. Ratio is the quality divided by the number of bases in the shorter segment. Percent Identity is the percent of the symbols that actually match. Percent Similarity is the percent of the symbols that are similar. Symbols that are across from gaps are ignored in the calculation of Percent Identity and Percent Similarity. A similarity is scored when the scoring matrix value for a pair of symbols is greater than or equal to the average positive non-identical comparison value in the matrix, the similarity threshold. This threshold is also used by the display procedure to decide when to put a ':' (colon) between two aligned symbols. You can change this threshold by specifying optional values to -PAIr. For instance, -PAIr=10,5 would set the similarity threshold to 5.

The similarity and identity metrics are not optimized by alignment programs so they should not be used to compare alignments.

PEPTIDE SEQUENCES

[ Previous | Top | Next ]

If your input sequences are peptide sequences, this program uses a scoring matrix, blosum62.cmp, with comparison values derived from a study of substitutions between amino acid pairs in ungapped block of aligned protein segments as measured by Henikoff and Henikoff (Proc. Natl. Acad. Sci. USA 89; 10915-10919 (1992)).

COMMAND-LINE SUMMARY

[ Previous | Top | Next ]

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % gap [-INfile1=]hpr.seq [-INfile2=]hpf.seq -Default
 
Prompted Parameters:
 
-BEGin1=1  -BEGin2=1    sets the beginning of each sequence
-END1=2966 -END2=2740   sets the end of each sequence
-NOREV1    -NOREV2      sets the strand of each sequence
-GAPweight=50           sets the gap creation penalty (8 is protein default)
-LENgthweight=3         sets the gap extension penalty (2 is protein default)
[-OUTfile1=]hpr.pair    name the alignment output file
 
Local Data Files:
 
-MATRix=nwsgapdna.cmp   assigns the scoring matrix for nucleic acids
-MATRix=blosum62.cmp    assigns the scoring matrix for proteins
 
Optional Parameters:
 
-OUTfile2=hpr.gap       names new file for sequence 1 with gaps added
-OUTfile3=hpf.gap       names new file for sequence 2 with gaps added
-PENAlizedlength=12     penalizes gaps longer than 12 sequence characters
                          the same as gaps of length 12
-LIMit1=1 -LIMit2=240   limits the surface of comparison
-RANdomizations[=10]    determines average score from 10 randomized
                           alignments
  -PREServe=2             preserves dinucleotide or dipeptide composition
                            in randomized sequence
           =3             preserves trinucleotide or tripeptide composition
                            in randomized sequence
-PAIr=x,5,1             thresholds for displaying '|', ':', and '.'
-WIDth=50               the number of sequence symbols per line
-PAGe=60                adds a line with a form feed every 60 lines
-NOBIGGaps              suppresses abbreviation of large gaps with '.'s
-ENDWeight              penalizes end gaps like other gaps
-HIGhroad               makes the top alignment for your parameters
-LOWroad                makes the bottom alignment for your parameters
-NOSUMmary              suppresses the screen summary

LOCAL DATA FILES

[ Previous | Top | Next ]

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Section 4, Using Data Files in the User's Guide.

Gap reads a scoring matrix from your local directory or the public database with the values for every possible match. The file nwsgapdna.cmp (NWS stands for Needleman, Wunsch, and Sellers) has a 10 at every place where the set of bases implied by the alphabetic IUB ambiguity codes (see Appendix III) overlap. All of the other locations have zeros. In the file blosum62.cmp, the scores for pairwise amino acid comparisons range from -4 to +11. You can use the Fetch program to copy, view, and possibly modify these scoring matrix files to suit your own needs.

PARAMETER REFERENCE

[ Previous | Top ]

You can set the parameters listed below from the command line.

-REVerse1 and -REVerse2

Sets the program to use the reverse strand for the two input sequences.

-GAPweight=8

Sets the gap creation penalty that is subtracted from the alignment score whenever a gap is created.

-LENgthweight=2

Sets the gap extension penalty that is substracted from the alignment score for each gapped symbol.

-MATRix=mymatrix.cmp

Allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.

For more information see the Local Scoring Matrices section.

-OUTfile2=seqname1.gap -OUTfile3=seqname2.gap

This program can write three different output files. The first displays the alignment of sequence one with sequence two. The second is a new sequence file for sequence one, possibly expanded by gaps to make it align with sequence two. The third, like the second, is a new sequence file for sequence two, possibly expanded by gaps to make it align with sequence one. The program writes only the first file unless there are output file options on the command line. If there are any output files named on the command line, only those output files are written. If you add -OUT to the command line without an accompanying file name, then the program will write the second and third output files after prompting you for their names.

Aligned sequences (in sequence files) can be displayed with GapShow. Their similarity can be displayed with PlotSimilarity.

-PENAlizedlength=12

Lets you set the maximum penalty for any gap in the alignment. For instance, if you specify -PENAlizedlength=12, then any gap longer than 12 characters is penalized the same as a gap of length 12. Using this parameter, alignments can contain large gaps without incurring large gap extension penalties. This may be useful, for instance, if you are aligning a cDNA sequence with the corresponding genomic DNA sequence containing large introns.

-LIMit1=20 and -LIMit2=20

Let you set gap shift limits for each sequence. When you already know of a long similarity between two sequences you can "zip" them together using this mode. The beginning coordinates for each sequence must be near the beginning of the alignment you want to see. The alignment continues so that gaps inserted do not require the sequences to get out of step by more than the gap shift limits. You can align very long sequences rapidly. When you set gap shift limits for one or both input sequences, the maximum surface of comparison available to your alignment is 3.5 million. The size of the surface of comparison that your alignment actually requires can be predicted by multiplying the average length of the two sequences by the sum of the two shift limits.

If you add just -LIMit to the command line without specifying any value, the program prompts you to enter gap shift limits for each sequence.

-RANdomizations=10

Reports the average alignment score and standard deviation from 10 randomized alignments in which the second sequence is repeatedly shuffled, maintaining the length and composition of the original sequence, and then aligned to the first sequence. You can use the optional parameter to set the number of randomized alignment to some number other than 10.

-PREServe=2

Preserves the dinucleotide or dipeptide composition of the original sequence in the shuffled sequence. Use -PREServe=3 to preserve the trinucleotide or tripeptide composition.

-PAIr=4,2,1

The paired output file from this program displays sequence similarity by printing one of three characters between similar sequence symbols: a pipe character(|), a colon (:), or a period (.). Normally a pipe character is put between symbols that are the same, a colon is put between symbols whose comparison value is greater than or equal to the average positive non-identical comparison value in the scoring matrix, and a period is put between symbols whose comparison value is greater than or equal to 1. You can change these match display thresholds from the command line. The three values associated with -PAIr are the display thresholds for the pipe character, colon, and period. The match display criterion for a pipe character changes from symbolic identity (the default) to the quantitative threshold you have set in the first parameter. A pipe character will no longer be inserted between identical symbols unless their comparison values are greater than or equal to this threshold. If you still want a pipe character to connect identical symbols, use x instead of a number as the first value. (See Appendix VII for more information about scoring matrices.)

-WIDth=50

Puts 50 sequence symbols on each line of the output file. You can set the width to anything from 10 to 150 symbols.

-PAGe=60

Printed output from this program may cross from one page to another in an annoying way. Use this parameter to add form feeds to the output file in order to try to keep clusters of related information together. You can set the number of lines per page by supplying a number after -PAGe.

-NOBIGGaps

Suppresses large gap abbreviations, showing all the sequence characters across from large gaps. Usually, gaps that extend one sequence by more than one complete line of output are abbreviated with three dots arranged in a vertical line.

-ENDWeight

Causes the end gaps to be penalized in the same way as all other gaps.

-LOWroad and -HIGhroad

The insertion of gaps is arbitrary in many cases, and equally optimal alignments can be generated by inserting gaps differently. When equally optimal alignments are possible, this program can insert the gaps differently if you select either the -LOWroad or the -HIGhroad parameter. Here are examples for the alignment of GACCAT with GACAT with different parameters.

 
 
For:       Match = 10         MisMatch = -9
      Gap weight = 10    Length Weight =  0
 
LowRoad:   1 GACCAT 6
             || |||      Quality = 40
           1 GA.CAT 5
 
HighRoad:  1 GACCAT 6
             ||| ||      Quality = 40
           1 GAC.AT 5
 
For:       Match = 10         MisMatch = 0
      Gap weight = 30    Length Weight = 0
 
HighRoad:  1 GACCAT 6
             |||           Quality = 30
           1 GACAT. 5
 
LowRoad:   1 GACCAT 6
                |||        Quality = 30
           1 .GACAT 5
 

Essentially the low road shifts all of the arbitrary gaps in sequence two to the left and all of the arbitrary gaps in sequence one to the right. The high road does exactly the opposite. When neither high road nor low road is selected, the program tries not to insert a gap whenever that is possible and uses the high road alternative for all collisions.

-SUMmary

Writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: May 27, 2005  12:31


[Genhelp | Program Manual | User's Guide | Data Files | Databases | Release Notes ]


Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ® and, the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

www.accelrys.com/bio