PROFILEMAKE

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

SPECIFYING SEQUENCES FOR PROFILEMAKE

CALCULATING THE PROFILE

FUNCTION

ProfileMake creates a position-specific scoring table, called a profile, that quantitatively represents the information from a group of aligned sequences. The profile can then be used for database searching (ProfileSearch) or sequence alignment (ProfileGap).

DESCRIPTION

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

ProfileMake uses the method of Gribskov et al. (Proc. Natl. Acad. Sci. USA 84, 4355-4358 (1987)) to create a profile from a group of aligned sequences. A profile is a table that contains all of the comparison information of a group of aligned sequences. These sequences must already be aligned (see the RELATED PROGRAMS topic below) before you run ProfileMake. The profile contains as many rows as there are positions in the aligned sequences. Each row contains a score for the alignment of the corresponding position of the aligned sequences with each possible base or residue.

The profile is the input data for ProfileSearch, which can find sequences in the database similar to your group of aligned sequences, and ProfileGap, which can make an optimal alignment between the aligned sequences and another sequence.

The aligned sequences may be specified to ProfileMake with an ambiguous file expression or in a list file similar to the input for Pretty. (See Section 2, Using Sequence Files and Databases in the User's Guide for more information.)

EXAMPLE

Here is a session using ProfileMake to make a profile from aligned 70 kd heat shock and heat shock cognate peptide sequences (these sequences were aligned in the example session for PileUp):

% profilemake

    Profile of what aligned sequence(s) hsp70.msf{*}

 hsp70.msf{s11448}, begin: 1  end: 718  len: 743  weight: 1.00

 hsp70.msf{s06443}, begin: 1  end: 718  len: 743  weight: 1.00

 hsp70.msf{a25398}, begin: 1  end: 718  len: 743  weight: 1.00

 /////////////////////////////////////////////////////////////////

    What should I call the output file (* hsp70.prf *) ?

OUTPUT

Here is some of the output file:

!!AA_PROFILE 1.0

(Peptide) PROFILEMAKE v4.50 of: hsp70.msf{*}  Length: 743

  Sequences: 25  MaxScore: 2172.36  October 7, 1998 11:41

                          Gap: 1.00              Len: 1.00

                     GapRatio: 0.33         LenRatio: 0.10

         hsp70.msf{S11448}  From: 1         To: 743       Weight: 1.00

         hsp70.msf{S06443}  From: 1         To: 743       Weight: 1.00

         /////////////////////////////////////////////////////////////////

         hsp70.msf{S29261}  From: 1         To: 743       Weight: 1.00

Symbol comparison table: GenRunData:blosum62.cmp  FileCheck: 6430

     Relaxed treatment of non-observed characters

     Exponential weighting of characters

Cons A    B    C    D    E    F    G    H    I    K    L  ... Gap  Len  ..

 M   -1   -4   -1   -4   -2    0   -4   -2    1   -1    2 ...   9    9

 L   -1   -5   -1   -5   -4    0   -5   -4    2   -2    4 ...   9    9

 /////////////////////////////////////////////////////////////////////

 E   -2    5  -10    5   12   -7   -5    0   -7    2   -7 ...   2    2

 V    0   -7   -3   -7   -5   -2   -7   -7    7   -5    2 ...   2    2

 B   -5   15   -7   15    5   -7   -3   -2   -7   -2  -10 ...   2    2

 * 1390    0  114 1140 1219  600 1333  167 1011 1254 1183 ...

INPUT FILES

ProfileMake accepts multiple sequences (two or more) all of the same type. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenBank:*. The function of ProfileMake depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

PileUp creates a multiple sequence alignment from a group of related sequences. Pretty displays multiple sequence alignments.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between one or more sequences and a group of aligned sequences represented as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using predetermined parameters to determine significance.

HmmerBuild creates a position-specific scoring table, called a profile hidden Markov model (HMM), that is a statistical model of the consensus of a multiple sequence alignment. The profile HMM can be used for database searching (HmmerSearch), sequence alignment (HmmerAlign) or generating random sequences that match the model (HmmerEmit). HmmerCalibrate "calibrates" a profile hidden Markov model in order to increase the sensitivity of database searches performed using that profile HMM as a query. The program compares the original profile HMM with a large number of randomly generated sequences and computes the extreme value distribution (EVD) parameters for this simulated search. The original profile HMM is replaced with a new one that contains these EVD parameters.

RESTRICTIONS

We have little experience using nucleotide sequences with profile analysis.

Profiles must be no more than 1000 residues long. ProfileMake cannot accept more than 5000 aligned sequences for the profile. It is your responsibility to ensure that the sequences input to ProfileMake are in alignment.

SPECIFYING SEQUENCES FOR PROFILEMAKE

The sequences used to make the profile can be specified in two ways. (See Section 2, Using Sequence Files and Databases in the User's Guide for more information.) A group of sequences may be named with an ambiguous expression like kf*.pep or pileup.msf{*}. The sequences may also be specified in a list file, and a beginning and ending position can be assigned to each sequence in the list with the begin: and end: sequence attributes, respectively. (See "Using List Files" in Section 2, Using Sequence Files and Databases in the User's Guide.) Make sure that the sequence ranges you specify will result in the sequences being in alignment. If beginning and ending positions are not specified, the entire sequence is used.

If the sequences are specified in a list file, you can optionally specify a weight for each sequence with the weight: sequence attribute. A weight of 1.0 is assumed if none is specified with the sequence.

You can assign weights to sequences in an MSF file by editing the MSF file and modifying the weight on the name/weight line for each sequence. (See "Using Multiple Sequence (MSF) Files" in Section 2, Using Sequence Files and Databases in the User's Guide for a complete description of MSF files.)

You can assign vote weights to sequences in an RSF (rich sequence format) file by modifying the weight attribute for each sequence within SeqLab. (See "Using Rich Sequence Format (RSF) Files" in Section 2, Using Sequence Files and Databases in the User's Guide for a complete description of RSF files. Also see "Viewing and Editing Sequence Attribute and Reference Information" in Section 2, Editing Sequences and Alignments in the SeqLab Guide for more information about modifying the weight attribute for each sequence within an RSF file.)

If a sequence from an MSF or RSF file is listed in a list file with a weight, the sequence weight is taken from the list file (the sequence weight in the MSF or RSF file is ignored).

Part of a file of sequence names that could be used as input to ProfileMake follows.

A multiple sequence alignment represented as a list file for input to

the program PROFILEMAKE.

5/3/90   ..

fa10.ugly    begin: 201       end: 250       weight: 0.5

fa12.ugly    begin: 201       end: 250       weight: 0.5

fo1k.ugly    begin: 201       end: 250       weight: 1.0

e.ugly       begin: 201       end: 250       weight: 1.0

////////////////////////////////////////////////////////
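To check the attributes in a list file before you run ProfileMake, you could read the entries with a short script. The following Python sketch is not part of the package. It assumes, as in the sample above, that sequence entries follow the line ending in two periods and that each entry is a name followed by optional begin:, end:, and weight: attributes; real list files may contain constructs that this sketch ignores. The function name parse_list_entries is hypothetical.

```python
import re

# Attribute names come from the sample above; the parsing rules are assumed.
ATTRIBUTE = re.compile(r"(begin|end|weight):\s*(\S+)")

def parse_list_entries(text):
    """Return one dict per sequence entry found after the '..' divider line."""
    entries = []
    in_body = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_body:
            # Everything up to and including the line ending in '..' is heading text.
            in_body = stripped.endswith("..")
            continue
        if not stripped or set(stripped) == {"/"}:
            continue  # skip blank lines and the '////' elision marker
        attrs = dict(ATTRIBUTE.findall(stripped))
        entries.append({
            "name": stripped.split()[0],
            "begin": int(attrs["begin"]) if "begin" in attrs else None,
            "end": int(attrs["end"]) if "end" in attrs else None,
            # A weight of 1.0 is assumed if none is specified.
            "weight": float(attrs.get("weight", 1.0)),
        })
    return entries

sample = """A multiple sequence alignment represented as a list file for input to
the program PROFILEMAKE.
5/3/90   ..
fa10.ugly    begin: 201       end: 250       weight: 0.5
fo1k.ugly    begin: 201       end: 250
"""
entries = parse_list_entries(sample)
```

Comparing the begin and end values across entries is a quick way to confirm that every range has the same length before the profile is made.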

CALCULATING THE PROFILE

Similarity Scores

In a scoring matrix, a score can be found for the comparison of any two sequence symbols. (See Appendix VII for more information.) Given a group of aligned sequences, a score can be calculated for the comparison of a symbol to each position of the aligned sequences. This comparison score differs from position to position in the aligned sequences, because each position contains a different spectrum of sequence symbols. The overall score is, in a sense, the average of the comparison scores for the sequence symbols found at a particular aligned sequence position.

Each row of a profile contains the scores for a comparison of the corresponding position of a multiple sequence alignment to each possible sequence symbol. For example, if a profile is made from a group of aligned protein sequences, the 10th row of the profile has values for the comparison of the 10th position in the alignment to each possible amino acid. The profile has as many rows as there are positions in the alignment, and each row has as many comparison scores as there are amino acid symbols. Thus, the profile is a position-specific scoring matrix for every position in a multiple sequence alignment.

The consensus sequence character is the symbol with the largest value in each row of the profile. It is used solely for the display of alignments and not for the calculation of the optimal alignment between a profile and a sequence.
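The calculation of one profile row and its consensus character can be illustrated with a small Python sketch. It is an illustration only: the three-symbol scoring matrix is invented (it is not taken from blosum62.cmp), simple frequency weighting is used, all sequence weights are taken to be 1, and gap characters are not handled. The names profile_row and consensus are hypothetical.

```python
# Toy symmetric scoring matrix over three symbols; the values are invented
# for illustration and are not taken from blosum62.cmp.
SCORES = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "L"): -1,
    ("G", "G"): 6, ("G", "L"): -4,
    ("L", "L"): 4,
}
ALPHABET = "AGL"

def score(a, b):
    """Look up a pair score regardless of the order of the two symbols."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]

def profile_row(column):
    """Average, over the symbols observed in one alignment column, of their
    scores against each possible symbol (simple frequency weighting)."""
    return {
        candidate: sum(score(candidate, observed) for observed in column) / len(column)
        for candidate in ALPHABET
    }

def consensus(row):
    """The consensus character is the symbol with the largest value in the row."""
    return max(row, key=row.get)

column = ["L", "L", "L", "A"]   # symbols found at one alignment position
row = profile_row(column)
print(row, consensus(row))
```

Because three of the four sequences have L at this position, the row scores L highest and L becomes the consensus character, while G, which is never observed and scores poorly against L, receives a negative value.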

The last row of the profile contains the composition for the whole profile. In the A column, for instance, the total number of A's in the multiple sequence alignment is shown.

Sequence Symbol Weights

As stated above, the comparison score of an alignment position and a given sequence symbol is an average of the comparison scores for the different sequence symbols at that position. This average is weighted so that a symbol's weight in the calculation of the average score increases along with its fraction of the symbols at that position. Two types of weighting are currently used. Linear weighting (chosen with -NOLOGwgt) gives a weight to each symbol that is directly proportional to the number of occurrences of that symbol at a given position. The default logarithmic weighting gives a symbol that predominates at a given position a disproportionately higher weight than a symbol that occurs only once. This causes positions in the aligned sequences that have many identical residues to bias the profile more strongly towards the identical residues than when linear weighting is used.

Using either kind of weighting, the weight for a residue is 0 when that residue does not occur at a given position; the weight is 1 when only that residue is found at a given position.

If the number of aligned sequences is fairly small, the sequence symbols observed at each position of the alignment may not represent the whole spectrum of symbols that would be observed if more sequences were available. In these cases, even residues that are not observed at a given position in the alignment should perhaps be given a small weight. For nucleic acids, non-observed bases are given a weight of 0 by default. The default for proteins is to give non-observed amino acids a weight equal to 0.025 divided by the sum of the sequence weights. -STRINgent gives non-observed sequence symbols a weight of 0.
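The difference between the two weighting schemes can be seen in a short Python sketch. The linear form follows directly from the description above. The logarithmic form shown here is an assumption: it was chosen because it gives a weight of 0 to a residue that does not occur and a weight of 1 to a residue that is the only one found, as stated above, but this manual does not give the exact formula that ProfileMake uses. The sketch also takes every sequence weight to be 1, so counts are plain occurrence counts. All function names are hypothetical.

```python
import math

def linear_weight(count, total):
    """Weight directly proportional to the symbol's share of the column."""
    return count / total

def log_weight(count, total):
    """Assumed logarithmic form: ln(1 - n/(N+1)) / ln(1/(N+1)).
    It is 0 when count is 0 and 1 when count equals total."""
    return math.log(1 - count / (total + 1)) / math.log(1 / (total + 1))

def non_observed_weight(sum_of_sequence_weights, protein=True, stringent=False):
    """Default weight for a symbol that never occurs at a position."""
    if stringent or not protein:
        return 0.0          # nucleic acids, or -STRINgent
    return 0.025 / sum_of_sequence_weights

# Ten aligned sequences: compare a symbol seen 9 times with one seen once.
for n in (1, 5, 9):
    print(n, round(linear_weight(n, 10), 3), round(log_weight(n, 10), 3))
```

Under linear weighting the symbol seen nine times carries nine times the weight of the symbol seen once; under the assumed logarithmic form the ratio is considerably larger, which is the stronger bias toward predominant residues described above.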

Gap Coefficients

The profile also includes position-specific gap coefficients, expressed as percentages. The gap coefficient determines the penalty that an alignment must pay in order to create a gap, and the gap length coefficient determines the penalty that must be paid in order to extend a gap. The actual gap penalties are calculated by multiplying the position-specific gap coefficients by the gap penalties specified when running the other Profile programs.

All gaps in the aligned sequences that overlap are treated as a single gap for purposes of calculating gap coefficients. The gap is considered to begin at the position of the leftmost gap character (. or ~) in any of the sequences, and to end at the rightmost gap character. The position-specific gap coefficients are reduced from 100 percent as a function of the longest gap through the position of interest in the aligned sequences. The gap coefficient G and gap length coefficient L are calculated as

G = C_G x R_G / ( 1 + GapLength x R_L )

L = C_L x R_G / ( 1 + GapLength x R_L )

where GapLength is the length of the gap as defined above. GapCoefficient (C_G), LengthCoefficient (C_L), GapRatio (R_G), and GapLengthRatio (R_L) have default values of 100, 100, 0.33, and 0.1 respectively, but can be changed with -GAPCoefficient, -LENGTHCoefficient, -GAPRatio, and -LENGTHRatio.
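The gap handling described above can be sketched in Python. This is one reading of the text, not the program's source: gaps in different sequences are merged only when they share at least one column, every column inside a merged gap is assigned the full length of that gap, and a column with no gap keeps the maximum coefficient (as the -GAPCoefficient description in the PARAMETER REFERENCE states). The same function serves for both coefficients, with GAPCoefficient or LENGTHCoefficient passed as the maximum. The function names are hypothetical.

```python
GAP_CHARS = ".~"

def merged_gap_lengths(alignment):
    """For each column, the length of the merged gap covering it (0 if none)."""
    intervals = []
    for seq in alignment:
        start = None
        for i, ch in enumerate(seq + "X"):       # sentinel closes a trailing gap
            if ch in GAP_CHARS and start is None:
                start = i
            elif ch not in GAP_CHARS and start is not None:
                intervals.append((start, i - 1))
                start = None
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)   # overlap: extend the gap
        else:
            merged.append([start, end])
    lengths = [0] * len(alignment[0])
    for start, end in merged:
        for i in range(start, end + 1):
            lengths[i] = end - start + 1
    return lengths

def coefficient(maximum, gap_length, gap_ratio=0.33, length_ratio=0.1):
    """Position-specific coefficient, in percent, for one profile row."""
    if gap_length == 0:
        return maximum                           # no gap: full coefficient
    return maximum * gap_ratio / (1 + gap_length * length_ratio)

alignment = ["ACD..FGHIK",
             "ACDE...HIK",
             "ACDEFGHIKL"]
lengths = merged_gap_lengths(alignment)
gap_coefficients = [coefficient(100, n) for n in lengths]
```

In this example the two overlapping gaps merge into a single gap of length 4, so the four affected rows receive a coefficient of 100 x 0.33 / 1.4, about 23.6, and all other rows keep 100.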

You can edit the profile with a text editor and change the gap coefficients to any values you wish.

CONSIDERATIONS

If you edit a profile, the "length:" entry must agree with the actual length of the profile (number of rows).

If you create a profile from a single peptide sequence, you should use -STRINgent to give a weight of 0 to all symbols not occurring at each position in the sequence.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.
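The capitalization convention can be expressed as a small Python check. This sketch is only an illustration of the rule stated above. It assumes that typed parameter names are compared without regard to case and that any letters typed beyond the capitalized prefix must still spell out the full parameter name; neither assumption is confirmed by this manual. The function name is hypothetical.

```python
def is_valid_abbreviation(parameter, typed):
    """True if `typed` contains at least the capitalized prefix of `parameter`
    and is itself a prefix of the full parameter name."""
    name = parameter.lstrip("-")
    required = ""
    for ch in name:
        if not ch.isupper():
            break
        required += ch                      # leading capitals must be typed
    candidate = typed.lstrip("-").upper()
    return candidate.startswith(required) and name.upper().startswith(candidate)

print(is_valid_abbreviation("-GAPCoefficient", "-gapc"))   # enough letters
print(is_valid_abbreviation("-GAPCoefficient", "-gap"))    # too short
```

Under these assumptions -GAPC selects -GAPCoefficient, while -GAP is rejected because it could equally begin -GAPRatio.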

Minimal Syntax: % profilemake [-INfile=]hsp70.msf{*} -Default

Prompted Parameters:

[-OUTfile=]hsp70.prf     names output file containing profile

Local Data Files:

-MATRix=blosum62.cmp     assigns the scoring matrix for proteins

-MATRix=profiledna.cmp   assigns the scoring matrix for nucleic acids

Optional Parameters:

-BEGin=1                 sets the beginning position in the aligned

                           sequences

-END=738                 sets the ending position in the aligned

                           sequences

-WEIGHT=1                sets the weight for all input sequences

-GAPCoefficient=100      sets the maximum gap creation penalty in a

                           region WITH NO gaps

-LENGTHCoefficient=100   sets the maximum gap extension penalty in a region

                           WITH NO gaps

-GAPRatio=0.33           GAPRatio multiplied by GAPWeight sets the

                           maximum gap creation and extension penalties

                           in a region WITH gaps

-LENGTHRatio=0.1         determines how rapidly gap creation and extension

                           penalties decrease with increasing gap size

-NOLOGwgt                uses linear weighting for symbols to produce

                           the profile score.  The default is exponential

                           weighting

-STRINgent               gives a weight of 0 to symbols not occurring at

                           a particular position in the aligned sequences

-SEQout[=pretty.pep]     writes the consensus into a sequence file

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Section 4, Using Data Files in the User's Guide.
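The search order for a matrix named without a directory specification can be pictured as a first-match lookup. The Python sketch below is not how the package resolves logical names; it simply assumes that you already know the directory paths behind MyData, GenMoreData, and GenRunData and pass them in order after your local directory. The function name find_matrix and all paths in the usage comment are hypothetical.

```python
from pathlib import Path

def find_matrix(filename, search_dirs):
    """Return the first existing path for `filename`, trying directories in order."""
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None

# Hypothetical usage; substitute the directories your site actually uses:
# find_matrix("mymatrix.cmp",
#             [Path.cwd(), "/path/to/MyData", "/path/to/GenMoreData", "/path/to/GenRunData"])
```

A file in your local directory therefore takes precedence over a file of the same name in any of the public data directories.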

ProfileMake reads a scoring matrix file called blosum62.cmp for peptide alignments or profiledna.cmp for nucleotide alignments. The peptide scoring matrix is based on substitutions between amino acid pairs in ungapped blocks of aligned protein segments as measured by Henikoff and Henikoff. The nucleotide scoring matrix has 10 for matches, -6 for mismatches, and intermediate positive values for overlaps between IUPAC-IUB ambiguity symbols. All comparisons to four-way ambiguity symbols N, X, or gap (. or ~) are given a value of 0. Read the header of the matrix files for more information about their construction. (See Appendix VII for more information about scoring matrices.)

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-MATRix=mymatrix.cmp

Allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.

For more information see the Local Scoring Matrices section.

-BEGin=1

Sets the beginning position for all input sequences. When the beginning position is set from the command line, ProfileMake ignores beginning positions specified for individual sequences in a list file.

-END=100

Sets the ending position for all input sequences. When the ending position is set from the command line, ProfileMake ignores ending positions specified for sequences in a list file.

-WEIGHT=1.0

Sets the sequence weight for all input sequences. When the weight is set with this parameter, ProfileMake ignores weights specified for individual sequences in a list file, MSF file, or RSF file.

-GAPCoefficient=100

Sets the maximum gap coefficient for the profile. This coefficient is expressed as a percentage and has a default maximum value of 100 percent. This value is found in each row of the profile where the corresponding alignment has no gaps at all. The gap coefficient is reduced from 100 percent at positions in the alignment that have gaps. In the other profile programs, the gap coefficient in each row of the profile is multiplied by an interactively specified gap creation penalty to calculate the penalty for creating a gap at that position.

-LENGTHCoefficient=100

Sets the maximum gap length coefficient for the profile. This coefficient is expressed as a percentage and has a default maximum value of 100 percent. This value is found in each row of the profile where the corresponding alignment has no gaps at all. The gap length coefficient is reduced from 100 percent at positions in the alignment that have gaps. In the other profile programs, the gap length coefficient in each row of the profile is multiplied by an interactively specified gap extension penalty to calculate the penalty for extending a gap at that position.

-GAPRatio=0.33

Is used to calculate the gap and gap length coefficients for a row of the profile where the multiple sequence alignment has gaps. GAPRatio multiplied by GAPCoefficient is approximately equal to the maximum gap coefficient in a region with gaps. Similarly, GAPRatio multiplied by LENGTHCoefficient is approximately equal to the maximum gap length coefficient in a region with gaps.

-LENGTHRatio=0.1

Determines how rapidly the gap coefficient and gap length coefficient decrease with increasing gap size. With a gap of length GapLength, both of these coefficients decrease from their maximum values by a factor of

GAPRatio / ( 1 + (LENGTHRatio x GapLength) )

-NOLOGwgt

Uses linear weighting of the residues at each position in the aligned sequences. The weight of each residue is directly proportional to the number of times the residue occurs at a given position in the aligned sequences. The default is exponential weighting that causes positions in the aligned sequences with many identical residues to bias the profile more strongly towards the identical residues than does linear weighting.

-STRINgent

Gives a weight of 0 to all symbols not occurring at a given position in the aligned sequences. This is the default for nucleic acids. For proteins, residues not occurring at a position in the aligned sequences are given a small weight by default.

-SEQout=hsp70.pep

Writes the consensus from the profile into a new sequence file. This sequence output file is written in addition to the file with the profile. You can name the sequence file yourself; otherwise ProfileMake gives it the same name as the profile, but with the extension .seq for DNA or .pep for protein.

Printed: May 27, 2005 14:11

Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.