PROFILEGAP

ProfileGap uses the method of Gribskov et al. (Proc. Natl. Acad. Sci. USA 84; 4355-4358 (1987)) to make an optimal alignment between a profile and one or more sequences. Multiple sequences may be specified by an ambiguous file specification, a multiple sequence format (MSF) or rich sequence format (RSF) file specification, or a list file. ProfileGap works like BestFit but accepts a profile instead of one of the sequences. ProfileGap uses the alignment procedure of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) to search for and align the segment of similarity. The scoring matrix values are present in the profile itself and need not be set. The gap creation and gap extension penalties specified in ProfileGap are maximum values. The actual gap penalties at each position are determined by multiplying the gap creation penalty by the percent value in the second-to-last column of the profile, and the gap extension penalty by the percent value in the last column of the profile.
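The scaling of the maximum gap penalties described above can be illustrated with a minimal Python sketch. It is not ProfileGap's own code. The function name, the four-position percent columns, and the assumption that the percent values are on a 0-100 scale (so that 100 leaves the maximum penalty unchanged) are all hypothetical; the maximum penalties 24.00 and 0.27 are the defaults shown in the example session.

```python
def position_gap_penalties(max_gap, max_len, gap_pct, len_pct):
    """Return a (creation, extension) penalty pair for each profile position.

    gap_pct and len_pct hold the percent values from the second-to-last and
    last profile columns. ASSUMPTION: they are on a 0-100 scale.
    """
    if len(gap_pct) != len(len_pct):
        raise ValueError("percent columns must have the same length")
    return [
        (max_gap * g / 100.0, max_len * l / 100.0)
        for g, l in zip(gap_pct, len_pct)
    ]


# Hypothetical percent columns for a four-position profile.
penalties = position_gap_penalties(
    24.00, 0.27, [100, 100, 33, 100], [100, 100, 10, 100]
)
for pos, (create, extend) in enumerate(penalties, start=1):
    print(f"position {pos}: creation {create:.2f}, extension {extend:.3f}")
```

Under these assumptions, a position with percent values below 100 is cheaper to gap than the rest of the profile, which is how a profile can favor gaps in variable regions.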

EXAMPLE


Here is a session using ProfileGap to align a 75 kd membrane peptide sequence from Chlamydia with a profile generated from 75 kd heat shock and heat shock cognate peptide sequences:

% profilegap

(Local) PROFILEGAP of what sequence(s) ?  PIR:s10435

                  Begin (* 1 *) ?

                 End (*  660 *) ?

 and what profile (* s10435.prf *) ? hsp70.prf

 What is the gap creation penalty (* 24.00 *) ?

 What is the gap extension penalty (* 0.27 *) ?

 What should I call the paired output display file (* hsp70.pair *) ?

        The following levels will be marked in the alignments:

                   Bar: 1.79

                 Colon: 1.10

                   Dot: 0.55

 Aligning ...................................-..........

 PIR2:S10435

          Gaps:      46

       Quality: 1431.67

 Quality Ratio:    2.18

        Length:     751

OUTPUT


Here is part of the output file:

(Local) PROFILEGAP of: S10435  check: 8538  from: 1  to: 660

P1;A40158 - dnaK-type molecular chaperone - Chlamydia trachomatis

N;Alternate names: 75K membrane protein; hsp70 homolog, outer membrane

C;Species: Chlamydia trachomatis

C;Date: 13-May-1992 #sequence_revision 13-May-1992 #text_change 30-Jan-1998

C;Accession: A40158; A48866; B37840; A41498; S10435

R;Birkelund, S.; Lundemose, A.G.; Christiansen, G. . . .

 to: hsp70.prf  check: 9086  from: 1  to: 718

(Peptide) PROFILEMAKE v4.50 of: hsp70.msf{*}  Length: 718

  Sequences: 25  MaxScore: 2145.48  October 8, 1998 10:41

                          Gap: 1.00              Len: 1.00

                     GapRatio: 0.33         LenRatio: 0.10

             hsp70.msf{S11448}  From: 1         To: 718       Weight: 1.00

             hsp70.msf{S06443}  From: 1         To: 718       Weight: 1.00 . . .

 Profile: hsp70.prf

         Gap Weight: 24.000      Average Match:  1.105

      Length Weight:  0.267   Average Mismatch: -1.022

            Quality: 1123.35             Length:    710

              Ratio:   1.74               Gaps:     42

 s10435 x hsp70.prf        October 8, 1998 11:01  ..

                  .         .         .         .         .

S     11 IGIDLGTTNSCVSVMEGGQPKVIASSEGTRTTPSIVAFK.GGETLVGIPA 59

         :||||||| ||| : . .   |||. :| ||||| |||    | |||  |

P     33 IGIBLGTTYSCVGVWQHGRVEIIANBQGNRTTPSYVAFTQBTERLIGBAA 82

                  .         .         .         .         .

S     60 KRQAVTNPEKTLASTKRFIGRKFSE..VESEIKTVPYKVAPNSKGDAVFD 107

         | |   ||  |.   || |||:| :  |:::.|  |:::     .

P     83 KNQVAMNPHNTVFBAKRLIGRKFNBPVVQSBMKHWPFKVVNKBGGKPKVQ 132

 //////////////////////////////////////////////////////////////

INPUT FILES


ProfileGap requires a profile as one of its input files. You can create profiles from aligned sequences by means of the ProfileMake program. In the ProfileDir directory, Accelrys GCG (GCG) provides a large number of amino acid profiles derived from the PROSITE database.

ProfileGap accepts as its other input one or more sequences of the same type as the sequences used to create the profile. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenBank:*. The function of ProfileGap depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS


PileUp creates a multiple sequence alignment from a group of related sequences. Pretty displays multiple sequence alignments.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between one or more sequences and a group of aligned sequences represented as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using predetermined parameters to determine significance.

HmmerAlign uses a profile hidden Markov model (HMM) as a template to create an optimal multiple alignment of a group of sequences.

RESTRICTIONS


We have little experience using nucleotide sequences with profile analysis.

The surface of comparison (see BestFit) may not be more than some value set within the program (5.5 million at most institutions). Profiles may not be longer than 1,000 residues or bases. Sequences that are too long for the surface of comparison are divided into smaller segments that are aligned separately (see the CONSIDERATIONS topic, below).

ALGORITHM


See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

ProfileGap uses the same algorithm as BestFit to align the profile to each sequence. The alignment is scored with the position-specific comparison values and gap penalties stored in the profile rather than with a separate scoring matrix. The alignment is displayed with the consensus sequence from the profile aligned to the sequence.

For a detailed description of Smith and Waterman-style alignments, see the entry for BestFit in the Program Manual.
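As a rough illustration of a Smith and Waterman-style alignment against a profile, the following Python sketch computes only the best local score, with no traceback. It is a simplified sketch, not ProfileGap's implementation. The function name, the toy profile, the mismatch score used for residues missing from a profile row, and the choice of which profile position supplies the penalties for a given gap are all assumptions made for the example. A gap of k symbols is charged the creation penalty once plus k times the extension penalty.

```python
def profile_local_score(seq, profile, gap_open, gap_ext, mismatch=-1.0):
    """Best local alignment score of seq against a profile (score only).

    profile[i] maps a residue to its comparison value at profile position i.
    gap_open[i] and gap_ext[i] are the penalties at position i.
    ASSUMPTION: gaps opened or extended in row i use row i's penalties.
    """
    n, m = len(profile), len(seq)
    neg = float("-inf")
    # H: best score ending at (i, j). E: ending with sequence symbols
    # unmatched (gap in the profile). F: ending with profile positions
    # unmatched (gap in the sequence).
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        row = profile[i - 1]
        go, ge = gap_open[i - 1], gap_ext[i - 1]
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] - go - ge, E[i][j - 1] - ge)
            F[i][j] = max(H[i - 1][j] - go - ge, F[i - 1][j] - ge)
            match = H[i - 1][j - 1] + row.get(seq[j - 1], mismatch)
            # The floor at zero is what makes the alignment local.
            H[i][j] = max(0.0, match, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


# Hypothetical four-position profile and uniform penalties.
profile = [{"A": 5.0}, {"C": 5.0}, {"D": 5.0}, {"E": 5.0}]
gap_open = [3.0] * 4
gap_ext = [0.5] * 4
score = profile_local_score("ACWWDE", profile, gap_open, gap_ext)
print(score)
```

In the toy call, the two W symbols are bridged by a single two-symbol gap, so the four matching positions are joined at the cost of one creation penalty and two extension penalties.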

CONSIDERATIONS


There is strong reason to believe that the BestFit algorithm used by ProfileGap is the best known way to find segments of similarity, but the best parameters must be empirically determined. Like any alignment program, ProfileGap can produce very different alignments depending on the scoring matrix values and gap coefficients used to make up the profile, and on the gap penalties used as input to ProfileGap.

ProfileGap attempts to choose default gap creation and extension penalties that are appropriate for the profile it reads. You can use -GAPweight and -LENgthweight or respond to the program prompts to specify alternative gap penalties if you don't want to accept the default values.

Unless you use -LIMit, sequences that are too long for the surface of comparison are always divided into smaller, overlapping segments that are aligned separately. -LIMit may permit long sequences to be aligned without division. Sequences longer than 32,000 symbols are always divided and aligned as separate segments. Although ProfileGap and ProfileSegments overlap the points of division by the whole length of the profile, divided sequences may not align properly if the segment of similarity crosses the point where the sequence was divided.
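The division into overlapping segments can be pictured with a small Python sketch. This is only a model of the behavior described above, not the program's actual procedure. It assumes that the surface of comparison is the profile length multiplied by the segment length, and that each segment is made as long as the limit allows. The function name and the default limit (taken from the RESTRICTIONS topic) are illustrative, and the 32,000-symbol rule is not modeled.

```python
def split_for_surface(seq_len, profile_len, surface_limit=5_500_000):
    """Return 1-based (begin, end) ranges that cover a sequence.

    ASSUMPTIONS: surface of comparison = profile_len * segment length, and
    consecutive segments overlap by profile_len symbols.
    """
    max_seg = surface_limit // profile_len
    if max_seg <= profile_len:
        raise ValueError("profile too long for this surface limit")
    if seq_len <= max_seg:
        return [(1, seq_len)]
    step = max_seg - profile_len  # advance so segments share profile_len symbols
    segments = []
    begin = 1
    while True:
        end = min(begin + max_seg - 1, seq_len)
        segments.append((begin, end))
        if end == seq_len:
            break
        begin += step
    return segments


# Small hypothetical numbers so the overlap is easy to see.
segments = split_for_surface(seq_len=25, profile_len=3, surface_limit=30)
print(segments)
```

In this model each pair of neighboring segments shares a stretch as long as the profile, which matches the description of the overlap but does not show how a segment of similarity can still be misaligned near a division point.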

-GLObal makes ProfileGap and ProfileSegments display the alignment of the whole sequence to the whole profile, instead of just the most-similar segment between the sequence and the profile. This is analogous to running Gap on the profile and the sequence.

If multiple sequences are specified as input to ProfileGap, the command-line parameters -BEGin, -END, and -REVerse are ignored. If the sequences are specified by means of a list file, the Begin, End, and Strand list file attributes are used, if present. Otherwise, the entire length of each input sequence is used.

COMMAND-LINE SUMMARY


All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % profilegap [-INfile1=]pir:s10435 \

                  [-INfile2=]hsp70.prf -Default

Prompted Parameters:

-BEGin=1 -END=652        sets the range of interest

-REVerse                 uses the reverse strand

-GAPweight=4.50          sets maximum position-specific gap creation penalty

-LENgthweight=0.05       sets maximum position-specific gap extension penalty

[-OUTfile=]hsp70.pair    names the output file for the alignment

Local Data Files:        None

Optional Parameters:

-GLObal                  aligns the whole sequence and profile (global

                           alignment)

-LOCal                   aligns the best segment of similarity between

                           the sequence and profile (local alignment is

                           the default)

-ENDWeight               penalizes end gaps like other gaps

-LIMit1=719              lets you set a gap shift limit for the sequence

-LIMit2=659              lets you set a gap shift limit for the profile

-OUTfile2=s10435.gap     names new file for sequence 1 with gaps added

-OUTfile3=hsp70.gap      names new file for the profile consensus with

                           gaps added

-MSF[=hsp70.msf]         names new MSF file containing alignment of all the

                           sequences with the profile consensus

-PAIr=1.0,0.5,0.1        sets thresholds for displaying "|", ":", and "."

-NOMONitor               suppresses the screen summary for each alignment

LOCAL DATA FILES


None.

PARAMETER REFERENCE


You can set the parameters listed below from the command line.

-GAPweight=4.5

Sets the gap creation penalty that is subtracted from the alignment score whenever a gap is created.

-LENgthweight=0.05

Sets the gap extension penalty that is subtracted from the alignment score for each gapped symbol.

-GLObal

Causes this program to make alignments using the method of Needleman and Wunsch instead of the default method of Smith and Waterman. The difference between these two methods is the same as the difference between the programs Gap and BestFit. The Needleman and Wunsch method displays the whole length of both sequences after alignment, while the Smith and Waterman method shows only the best segment of similarity from each sequence.

-LOCal

Forces this program to make alignments using the default method of Smith and Waterman instead of the method of Needleman and Wunsch. The difference between these two methods is the same as the difference between the programs BestFit and Gap. The Smith and Waterman method shows only the best segment of similarity from each sequence, while the Needleman and Wunsch method displays the whole length of both sequences after alignment.

-ENDWeight

Causes the end gaps to be penalized in the same way as all other gaps. This parameter is ignored unless -GLObal is also present on the command line.

-LIMit1=20 and -LIMit2=20

Lets you set gap shift limits for each sequence (-LIMit1 sets a gap shift limit for the sequence and -LIMit2 sets a gap shift limit for the profile). When you already know of a long similarity between two sequences, you can "zip" them together using this mode. The beginning coordinates for each sequence must be near the beginning of the alignment you want to see. The alignment continues so that gaps inserted do not require the sequences to get out of step by more than the gap shift limits. You can align very long sequences rapidly. The surface of comparison is still limited to one million. The size of a comparison can be predicted by multiplying the average length of the two sequences by the sum of the two shift limits.

If you add just -LIMit to the command line without supplying a value, the program prompts you to enter gap shift limits for each sequence.
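The size prediction given above for gap shift limits is simple arithmetic, sketched here in Python. The function name is hypothetical. The lengths 660 and 718 come from the example session, and the limits 719 and 659 are the values shown in the command-line summary.

```python
def limited_comparison_size(seq_len, profile_len, limit1, limit2):
    """Predicted comparison size: average length times the sum of the limits."""
    average_length = (seq_len + profile_len) / 2
    return average_length * (limit1 + limit2)


size = limited_comparison_size(660, 718, limit1=719, limit2=659)
print(f"predicted size: {size:,.0f}")
```

Smaller limits shrink the predicted size in proportion, which is why this mode can align very long sequences quickly when the sequences are already known to stay roughly in step.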

-MSF=profilename.msf

-OUTfile2=seqname.gap

-OUTfile3=profilename.gap

This program can write up to four different output files. The primary output file (-OUTfile1) displays a pairwise alignment between the profile consensus sequence and each of the input sequences. This file is always created unless you specify -NOOUTfile1 on the command line. If you wish to create a file in MSF format that contains a multiple alignment of the consensus profile with all of the sequences, specify -MSF on the command line. If the input file is a single sequence, you can output two new sequence files that may contain gaps to reflect the sequence-profile alignment. The gapped file for the sequence is specified by -OUTfile2 and the one for the profile consensus is specified by -OUTfile3.

Aligned sequences (in sequence files) can be displayed with GapShow.

-PAIr=1.0,0.5,0.1

The paired output file from this program displays sequence similarity by putting a pipe character (|), colon (:), and period (.) between similar sequence symbols. The default thresholds for the characters are determined by the values in the profile. The pipe character is put between symbols whose comparison value in the profile is at least the average positive value in the profile plus one tenth the difference between the maximum and average values in the profile. The colon character threshold is the average positive value in the profile. The period character threshold is the larger of the average positive value in the profile minus one tenth the difference between the maximum and average values, and one half the average value.
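The default threshold rules above can be written out as a short Python sketch. The function name is hypothetical, and the sketch reads "the average value" in each rule as the average positive value in the profile, which is an interpretation rather than something the text states. The inputs 1.10 and 8.00 are illustrative values chosen because they reproduce the levels shown in the example session (Bar 1.79, Colon 1.10, Dot 0.55); the actual maximum value in hsp70.prf is not given in this document.

```python
def pair_thresholds(avg_positive, max_value):
    """Return default (bar, colon, dot) display thresholds for a profile.

    ASSUMPTION: every "average" in the rules is the average positive value.
    """
    spread = 0.1 * (max_value - avg_positive)
    bar = avg_positive + spread
    colon = avg_positive
    dot = max(avg_positive - spread, 0.5 * avg_positive)
    return bar, colon, dot


bar, colon, dot = pair_thresholds(avg_positive=1.10, max_value=8.00)
print(f"Bar: {bar:.2f}  Colon: {colon:.2f}  Dot: {dot:.2f}")
```

With these inputs the period threshold comes from the "one half the average" branch, because subtracting the spread would give a smaller value.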

-NOMONitor

Suppresses the screen summary, which reports some statistics for each alignment.

Printed: May 27, 2005 14:09

Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.