PROFILESEGMENTS

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

OUTPUT

INPUT FILES

RELATED PROGRAMS

RESTRICTIONS

ALGORITHM

CONSIDERATIONS

COMMAND-LINE SUMMARY

LOCAL DATA FILES

PARAMETER REFERENCE


FUNCTION

ProfileSegments makes optimal alignments showing the segments of similarity found by ProfileSearch.

DESCRIPTION

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

ProfileSearch and ProfileSegments use the method of Gribskov et al. (Proc. Natl. Acad. Sci. USA 84; 4355-4358 (1987)). ProfileSearch compares a profile to a set of sequences and lists the sequences that contain a region similar to the profile. ProfileSegments then displays an optimal alignment between the profile and the best segment of similarity in each sequence in the list. ProfileSegments uses the alignment procedure of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) to search for and align the segments. The scoring matrix values, gap creation penalties, and gap extension penalties used to find the best region of similarity between the profile and the sequence are taken from the profile and from the ProfileSearch output file, so you do not need to set them.

EXAMPLE

Here is a session using ProfileSegments to display optimal alignments of the segments of similarity reported in the output file from the example session of ProfileSearch:

 
 
% profilesegments
 
 PROFILESEGMENTS from what PROFILESEARCH output file ? hsp70.pfs
 
 Stop after how many alignments (* 15 *) ?
 
 What should I call the paired output display file (* hsp70.pairs *)?
 
        The following levels will be marked in the alignments:
                   Bar: 1.78
                 Colon: 1.09
                   Dot: 0.55
 
 Aligning ...................................-...............................
 PIR2:JC4853
 
          Gaps:     17
       Quality: 1726.80
 Quality Ratio:   2.67
        Length:    693
 
 Aligning ...................................-...............................
 PIR2:S07197
 
          Gaps:     17
       Quality: 1726.80
 Quality Ratio:   2.67
        Length:    693
 
 ////////////////////////////////////////////////////////////////////////////
 
%

OUTPUT

Here is some of the output file:

 
 
 (Local) PROFILESEGMENTS of: JC4853  check: 4250  from: 1  to: 646
 
P1;JC4853 - dnaK-type molecular chaperone hsc73 - mouse
N;Alternate names: heat-shock protein 73
C;Species: Mus musculus (house mouse)
C;Date: 15-Aug-1996 #sequence_revision 18-Oct-1996 #text_change 13-Mar-1998
C;Accession: JC4853
R;Soulier, S.; Vilotte, J.L.; L'Huillier, P.J.; Mercier, J.C. . . .
 
 to: hsp70.prf  check: 1246  from: 1  to: 743
 
(Peptide) PROFILEMAKE v4.50 of: hsp70.msf{*}  Length: 743
  Sequences: 25  MaxScore: 2168.13  October 7, 1998 17:48
                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10
             hsp70.msf{S11448}  From: 1         To: 743       Weight: 1.00
             hsp70.msf{S06443}  From: 1         To: 743       Weight: 1.00 . . .
 
 Profile: hsp70.prf
 
         Gap Weight: 24.000      Average Match:  1.094
      Length Weight:  0.270   Average Mismatch: -0.993
 
            Quality: 1726.80             Length:    693
              Ratio:   2.67               Gaps:     17
 
 jc4853 x hsp70.prf        October 21, 1998 16:51  ..
 
                  .         .         .         .         .
S      1 MSKGPAVGIDLGTTYSCVGVFQHGKVEIIANDQGNRTTPSYVAFT.DTER 49
            . |:||||||||||||::::..|||||||||||||||||||| |:||
P     27 MTKGPAIGIBLGTTYSCVGVWQHGRVEIIANBQGNRTTPSYVAFTQBTER 76
 
//////////////////////////////////////////////////////////////
 
 (Local) PROFILESEGMENTS of: S07197  check: 4250  from: 1  to: 646
 
P1;S07197 - dnaK-type molecular chaperone hsc73 - rat
N;Alternate names: heat shock cognate protein hsc70; heat shock cognate protein
 hsc73
C;Species: Rattus norvegicus (Norway rat)
C;Date: 29-Jan-1993 #sequence_revision 29-Jan-1993 #text_change 30-Jan-1998
C;Accession: S07197; I57594; S35606
R;Sorger, P.K.; Pelham, H.R.B. . . .
 
 to: hsp70.prf  check: 1246  from: 1  to: 743
 
(Peptide) PROFILEMAKE v4.50 of: hsp70.msf{*}  Length: 743
  Sequences: 25  MaxScore: 2168.13  October 7, 1998 17:48
                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10
             hsp70.msf{S11448}  From: 1         To: 743       Weight: 1.00
             hsp70.msf{S06443}  From: 1         To: 743       Weight: 1.00 . . .
 
 Profile: hsp70.prf
 
         Gap Weight: 24.000      Average Match:  1.094
      Length Weight:  0.270   Average Mismatch: -0.993
 
            Quality: 1726.80             Length:    693
              Ratio:   2.67               Gaps:     17
 
 s07197 x hsp70.prf        October 21, 1998 16:51  ..
 
                  .         .         .         .         .
S      1 MSKGPAVGIDLGTTYSCVGVFQHGKVEIIANDQGNRTTPSYVAFT.DTER 49
            . |:||||||||||||::::..|||||||||||||||||||| |:||
P     27 MTKGPAIGIBLGTTYSCVGVWQHGRVEIIANBQGNRTTPSYVAFTQBTER 76
                  .         .         .         .         .
 
//////////////////////////////////////////////////////////////

INPUT FILES

ProfileSegments reads the ProfileSearch output file in order to obtain the names of the sequences, the name of the profile, and the gap creation and gap extension penalties. You can tell ProfileSegments to ignore any of the sequence files in the list by editing this file. To do this, insert an exclamation point (!) as the first character of the line that you wish to comment out. If the profile that was used in ProfileSearch cannot be identified and read correctly from the information in the text heading of the input file, ProfileSegments complains and stops.

If the gap creation penalty and gap extension penalty cannot be read correctly from the information in the text heading of the input file, ProfileSegments calculates values from the profile itself, using the maximum match value that is present in the profile: 3 x MaxMatch for the gap creation penalty and MaxMatch / 30 for the gap extension penalty.
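The fallback rule above is simple arithmetic, and a minimal sketch may make it concrete. The function name and the sample MaxMatch value of 8.0 are illustrative assumptions, not part of the program.

```python
def fallback_gap_penalties(max_match):
    """Return (gap_creation, gap_extension) derived from a profile's maximum match value.

    Mirrors the rule stated in the text: 3 x MaxMatch and MaxMatch / 30.
    """
    gap_creation = 3.0 * max_match
    gap_extension = max_match / 30.0
    return gap_creation, gap_extension


# 8.0 is an arbitrary illustrative MaxMatch value, not read from a real profile.
creation, extension = fallback_gap_penalties(8.0)
print(f"gap creation: {creation:.3f}  gap extension: {extension:.3f}")
```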

RELATED PROGRAMS

PileUp creates a multiple sequence alignment from a group of related sequences. Pretty displays multiple sequence alignments.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between one or more sequences and a group of aligned sequences represented as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using predetermined parameters to determine significance.

HmmerSearch uses a profile hidden Markov model as a query to search a sequence database to find sequences similar to the family from which the profile HMM was built. Profile HMMs can be created using HmmerBuild. HmmerAlign uses a profile hidden Markov model (HMM) as a template to create an optimal multiple alignment of a group of sequences.

RESTRICTIONS

We have little experience using nucleotide sequences with profile analysis.

The surface of comparison (see BestFit) may not be more than some value set within the program (one million at most institutions). Profiles may not be longer than 1,000 residues or bases. Sequences that are too long for the surface of comparison are divided into smaller segments that are aligned separately (see the CONSIDERATIONS topic, below).

ALGORITHM

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

ProfileSegments reads the profile, the profile's consensus sequence, and the set of sequences in the list created by ProfileSearch and then uses the same algorithm as BestFit to align each sequence to the profile. The alignment is made with the values in the profile. The display is made with the consensus sequence and values from the profile. For a detailed description of Smith and Waterman-style alignments, see the BestFit program's entry.
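As a rough illustration of a Smith and Waterman-style local alignment against a profile, the sketch below computes only the best local score (no traceback) for a toy profile. It is not the program's implementation. The profile representation (one dictionary of residue scores per position), the default score for unlisted residues, the uniform gap penalties, and the convention that a gap of k symbols costs creation + k x extension are all assumptions made for this example. Real profiles may also carry position-specific gap weights, which this sketch ignores.

```python
def local_profile_score(seq, profile, gap_creation, gap_extension, default=-1.0):
    """Best local alignment score of seq against a toy profile.

    profile: list with one dict per position, mapping residue -> score.
    A gap of k symbols is assumed to cost gap_creation + k * gap_extension.
    """
    n, m = len(seq), len(profile)
    neg_inf = float("-inf")
    # h[i][j]: best score of an alignment ending at sequence residue i and
    # profile position j; the floor of 0 lets an alignment start anywhere.
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    # e[i][j]: best score ending with profile position j opposite a gap.
    # f[i][j]: best score ending with sequence residue i opposite a gap.
    e = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    f = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e[i][j] = max(h[i][j - 1] - gap_creation - gap_extension,
                          e[i][j - 1] - gap_extension)
            f[i][j] = max(h[i - 1][j] - gap_creation - gap_extension,
                          f[i - 1][j] - gap_extension)
            match = h[i - 1][j - 1] + profile[j - 1].get(seq[i - 1], default)
            h[i][j] = max(0.0, match, e[i][j], f[i][j])
            best = max(best, h[i][j])
    return best


# Toy three-position profile favoring A, C, then D (hypothetical values).
toy_profile = [{"A": 2.0}, {"C": 2.0}, {"D": 2.0}]
print(local_profile_score("XXACDXX", toy_profile, 3.0, 0.1))  # prints 6.0
```

Because the score is floored at zero, the flanking X residues do not reduce the result; only the best-matching segment contributes, which is the behavior that distinguishes a local alignment from a global one.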

CONSIDERATIONS

There is strong reason to believe that the BestFit algorithm used by ProfileSegments is the best known way to find segments of similarity, but the best parameters must be determined empirically. Like any alignment program, ProfileSegments produces alignments that are very different depending on the scoring matrix values and gap coefficients used to make up the profile, and the gap penalties used as input to ProfileSearch.

Unless you use -LIMit, sequences that are too long for the surface of comparison are always divided into smaller, overlapping segments that are aligned separately. -LIMit may permit long sequences to be aligned without division. Sequences longer than 32,000 are always divided and aligned as separate segments. Although ProfileGap and ProfileSegments overlap the points of division by the whole length of the profile, divided sequences may not align properly if the segment of similarity crosses the point where the sequence was divided.
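One way to picture the division is sketched below. This is a hypothetical scheme, not the program's documented procedure: it assumes the surface of comparison is the segment length times the profile length, picks the longest segment that fits under the one-million figure quoted for -LIMit, and overlaps consecutive segments by the profile length as the text describes. The function name and all sample values are illustrative.

```python
def split_with_overlap(seq_len, segment_len, overlap):
    """Return half-open (start, end) ranges covering seq_len positions.

    Consecutive ranges share `overlap` positions.
    """
    if segment_len <= overlap:
        raise ValueError("segment_len must be larger than overlap")
    segments = []
    start = 0
    while True:
        end = min(start + segment_len, seq_len)
        segments.append((start, end))
        if end == seq_len:
            break
        # Step back by the overlap so a match near the cut is seen whole
        # in at least one segment, provided it is no longer than the overlap.
        start = end - overlap
    return segments


# Hypothetical numbers: a 743-position profile and an assumed surface limit.
profile_len = 743
surface_limit = 1_000_000
segment_len = surface_limit // profile_len  # assumed: surface = segment x profile
for start, end in split_with_overlap(3000, segment_len, profile_len):
    print(start, end)
```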

-GLObal makes ProfileGap and ProfileSegments display the alignment of the whole sequence to the whole profile, instead of just the most-similar segment between the sequence and the profile. This is analogous to executing a Gap between the profile and sequence.

ProfileSearch/ProfileSegments finds only the best fit of the profile to any sequence. Be aware that other regions with a lower degree of similarity to the profile may also exist in the same sequence, especially in nucleic acid sequences.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax:  % profilesegments [-INfile=]hsp70.pfs -Default
 
Prompted Parameters:
 
-SEQLimit=15            limits the number of alignments
[-OUTfile=]hsp70.pairs  names the output file for the alignments
 
Local Data Files: None
 
Optional Parameters:
 
-LOCal                 aligns the best segment of similarity between
                         the sequence and profile (local alignment is
                         the default)
-GLObal                aligns the whole sequence and profile (global
                         alignment)
  -ENDWeight             penalizes end gaps like other gaps
-LIMit1=20             sets a gap shift limit for the sequence
-LIMit2=20             sets a gap shift limit for the profile
-MSF[=hsp70.msf]       names a new MSF file containing alignment
                         of all the sequences with the profile consensus
-OUTfile2=jc4853.gap   names new file for sequence 1 with gaps added
-OUTfile3=hsp70.gap    names new file for profile consensus with gaps added
-PAIr=1.0,0.5,0.1      sets thresholds for displaying "|", ":", and "."
-NOMONitor             suppresses the screen summary for each alignment

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-SEQLimit=15

Limits the number of alignments created from the segments of similarity reported by ProfileSearch.

-LOCal

Forces this program to make alignments using the default method of Smith and Waterman instead of the method of Needleman and Wunsch. The difference between these two methods is the same as the difference between the programs BestFit and Gap. The Smith and Waterman method shows only the best segment of similarity from each sequence, while the Needleman and Wunsch method displays the whole length of both sequences after alignment. By default, ProfileSegments creates a local alignment between the profile and each sequence being aligned.

-GLObal

Causes this program to make alignments using the method of Needleman and Wunsch instead of the default method of Smith and Waterman. The difference between these two methods is the same as the difference between the programs Gap and BestFit. The Needleman and Wunsch method displays the whole length of both sequences after alignment, while the Smith and Waterman method shows only the best segment of similarity from each sequence.

-ENDWeight

Causes the end gaps to be penalized in the same way as all other gaps. This parameter is ignored unless -GLObal is also present on the command line.

-LIMit1=20 and -LIMit2=20

Lets you set gap shift limits for each sequence ( -LIMit1 sets a gap shift limit for the sequence and -LIMit2 sets a gap shift limit for the profile). When you already know of a long similarity between two sequences you can "zip" them together using this mode. The beginning coordinates for each sequence must be near the beginning of the alignment you want to see. The alignment continues so that gaps inserted do not require the sequences to get out of step by more than the gap shift limits. You can align very long sequences rapidly. The surface of comparison is still limited to one million. The size of a comparison can be predicted by multiplying the average length of the two sequences times the sum of the two shift limits.

If you add just -LIMit to the command line without supplying a value, the program prompts you to enter gap shift limits for each sequence.
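The size estimate described for gap shift limits can be checked before running the program. The sketch below applies the stated rule (average length of the two sequences times the sum of the two shift limits) and compares the result with the one-million limit mentioned for this mode. The function name and the sample lengths are illustrative assumptions.

```python
SURFACE_LIMIT = 1_000_000  # limit quoted in the -LIMit description


def limited_surface(len1, len2, limit1, limit2):
    """Estimate the surface of comparison when gap shift limits are in effect."""
    average_length = (len1 + len2) / 2
    return average_length * (limit1 + limit2)


# Hypothetical lengths with the default shift limits of 20 and 20.
size = limited_surface(20000, 18000, 20, 20)
print(size, size <= SURFACE_LIMIT)
```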

-MSF=profilename.msf

-OUTfile2=seqname.gap

-OUTfile3=profilename.gap

This program can write up to four different output files. The primary output file (-OUTfile1) displays a pairwise alignment between the profile consensus sequence and each of the input sequences. This file is always created unless you specify -NOOUTfile1 on the command line. If you wish to create a file in MSF format that contains a multiple alignment of the consensus profile with all of the sequences, specify -MSF on the command line. If the input file is a single sequence, you can output two new sequence files that may contain gaps to reflect the sequence-profile alignment. The gapped file for the sequence is specified by -OUTfile2 and the one for the profile consensus is specified by -OUTfile3.

Aligned sequences (in sequence files) can be displayed with GapShow.

-PAIr=1.0,0.5,0.1

The paired output file from this program displays sequence similarity by putting a pipe character (|), colon (:), and period (.) between similar sequence symbols. The default thresholds for the characters are determined by the values in the profile. The pipe character is put between symbols whose comparison value in the profile is at least the average positive value in the profile plus one tenth the difference between the maximum and average values in the profile. The colon character threshold is the average positive value in the profile. The period character threshold is the larger of the average positive value in the profile minus one tenth the difference between the maximum and average values, and one half the average value.
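The default threshold rules above can be written out as a short calculation. This is a sketch of the rules as worded in this section, not the program's code. It assumes that "one half the average value" refers to the same average positive value used elsewhere in the rule. The function name and sample inputs are illustrative.

```python
def pair_thresholds(avg_positive, max_value):
    """Return (bar, colon, dot) display thresholds from two profile statistics."""
    tenth_of_spread = (max_value - avg_positive) / 10.0
    bar = avg_positive + tenth_of_spread
    colon = avg_positive
    # The dot threshold is the larger of the two candidate values.
    dot = max(avg_positive - tenth_of_spread, avg_positive / 2.0)
    return bar, colon, dot


# Hypothetical profile statistics: average positive value 1.0, maximum 8.0.
bar, colon, dot = pair_thresholds(1.0, 8.0)
print(f"Bar: {bar:.2f}  Colon: {colon:.2f}  Dot: {dot:.2f}")
```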

-NOMONitor

Suppresses the screen summary that reports statistics for each alignment.

Printed: April 5, 2005  15:34


Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

www.accelrys.com/bio