Profile Analysis

Associating Distantly Related Proteins and Finding Structural Motifs

by John Devereux

Introduction

Profile analysis is a sequence comparison method for finding and aligning distantly related sequences. It allows a new sequence to be aligned optimally to a family of similar sequences. The comparison uses a scoring matrix (a derivative of the Dayhoff evolutionary distances table, or PAM matrix) and an existing optimal alignment of two or more similar protein sequences. The group or "family" of similar sequences is first aligned to create a multiple sequence alignment. The information in the multiple sequence alignment is then represented quantitatively as a table of position-specific symbol comparison values and gap penalties. This table is called a profile.
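
As a rough illustration of how an alignment can become a table of position-specific values, the sketch below averages scoring matrix values over the symbols observed in each alignment column. The three-letter alphabet, the matrix values, and the names TOY_MATRIX, pair_score, and make_profile are hypothetical, and simple frequency weighting is an assumption; this is not the ProfileMake implementation.

```python
from collections import Counter

# Hypothetical symmetric scoring matrix over a toy alphabet.
# These numbers are illustrative only; they are NOT Dayhoff values.
ALPHABET = "AGS"
TOY_MATRIX = {
    ("A", "A"): 2.0, ("A", "G"): 0.5, ("A", "S"): 1.0,
    ("G", "G"): 3.0, ("G", "S"): 0.5,
    ("S", "S"): 2.0,
}

def pair_score(a, b):
    """Look up a pair score in either order, since the matrix is symmetric."""
    return TOY_MATRIX.get((a, b), TOY_MATRIX.get((b, a)))

def make_profile(alignment):
    """Return one row per alignment column; each row maps symbol -> value.

    Assumption: every sequence has equal weight, so the value for symbol a
    is the frequency-weighted average of its scores against the symbols
    observed in the column. Gap characters ('-') contribute nothing.
    """
    n_seqs = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        row = {
            a: sum(counts[b] / n_seqs * pair_score(a, b) for b in counts)
            for a in ALPHABET
        }
        profile.append(row)
    return profile

alignment = ["AGS", "AGA", "SG-"]
profile = make_profile(alignment)
for position, row in enumerate(profile, start=1):
    print(position, {symbol: round(value, 2) for symbol, value in row.items()})
```

In this toy alignment the second column is all G, so G receives the full G-G value there, while the mixed first column spreads its values across A and S.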

The similarity of new sequences to an existing profile can be tested by comparing each new sequence to the profile with the same algorithm used to make optimal alignments. To understand how this is done, we must first recall what alignment algorithms do. Alignment algorithms find alignments between two sequences that maximize the number of matches and minimize the number of gaps. The match, for any pair of symbols being compared, is really a value that comes from a scoring matrix that contains a value for every possible pair of sequence symbols; see Appendix VII for more information. (Scoring matrices were referred to as symbol comparison tables in previous releases of Accelrys GCG.) Gaps are given penalties in the same units as the values in the scoring matrix. The best alignment is then simply defined as the alignment for which the sum of the scoring matrix values minus the gap penalties is maximal.
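
The definition of an alignment score can be made concrete with a short sketch. The function below scores one fixed alignment as the sum of the pair values minus the gap penalties. The names alignment_score and toy_pair_score, the toy scoring values, and the penalty form (a gap weight charged once per gap plus a length weight charged per gapped position) are assumptions made for illustration.

```python
def alignment_score(top, bottom, pair_score, gap_weight, gap_length_weight):
    """Score two gapped strings of equal length.

    Assumption: a gap of length n costs gap_weight + n * gap_length_weight.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0.0
    open_gap_in = None  # which string currently carries a gap, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            side = "top" if a == "-" else "bottom"
            if open_gap_in != side:
                total -= gap_weight  # charged once when a gap starts
            total -= gap_length_weight  # charged for every gapped position
            open_gap_in = side
        else:
            total += pair_score(a, b)
            open_gap_in = None
    return total

def toy_pair_score(a, b):
    """Hypothetical scoring matrix: 1.0 for identical symbols, -0.5 otherwise."""
    return 1.0 if a == b else -0.5

score = alignment_score("GATTACA", "GA--ACA", toy_pair_score,
                        gap_weight=3.0, gap_length_weight=0.1)
print(round(score, 2))
```

Here five matching pairs contribute 5.0 and the single two-position gap costs 3.2. An alignment algorithm searches over all possible alignments for the one that maximizes this quantity.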

So how does alignment work when a sequence is being aligned to a profile? Each row in the profile corresponds to a position in the original multiple sequence alignment. Each possible sequence symbol has a value (a column) in each row of the profile. The comparison of a sequence symbol to any row of the profile defines a specific value or "profile comparison value." The best alignments of a sequence to a profile are found by aligning the symbols of the sequence to the profile in such a way that the sum of the profile comparison values minus the gap penalties is maximal. The profile also contains gap coefficients that are specific for each position so the penalty for inserting a gap in one part of the alignment might be more or less than in another part. The position-specific gap coefficients penalize gaps in conserved regions more heavily than gaps in more variable regions.
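
A minimal dynamic programming sketch of this idea follows. It is a simplification: it computes a global alignment score and uses a single linear gap penalty per profile position, whereas the text describes separate gap coefficients. The function name align_to_profile and the example profile values are hypothetical.

```python
def align_to_profile(seq, profile, gap_penalty):
    """Return the best global alignment score of seq against a profile.

    profile:     list of dicts (symbol -> profile comparison value),
                 one dict per profile position (row).
    gap_penalty: list of floats, one per profile position; a gap opened
                 next to row i costs gap_penalty[i] per gapped position
                 (a simplifying assumption).
    """
    n, m = len(profile), len(seq)
    # score[i][j]: best score for the first i rows and the first j symbols
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - gap_penalty[i - 1]
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - gap_penalty[0]
    for i in range(1, n + 1):
        g = gap_penalty[i - 1]
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + profile[i - 1][seq[j - 1]],  # align row i with symbol j
                score[i - 1][j] - g,  # row i has no partner in the sequence
                score[i][j - 1] - g,  # symbol j has no partner in the profile
            )
    return score[n][m]

# Hypothetical three-position profile over the symbols A and G.
profile = [
    {"A": 2.0, "G": 0.0},
    {"A": 0.0, "G": 3.0},
    {"A": 2.0, "G": 0.0},
]
conserved_everywhere = [2.0, 2.0, 2.0]
variable_last_position = [2.0, 2.0, 0.5]

print(align_to_profile("AGA", profile, conserved_everywhere))
print(align_to_profile("AG", profile, conserved_everywhere))
print(align_to_profile("AG", profile, variable_last_position))
```

The shorter sequence "AG" needs one gap. Lowering the penalty at the last position raises its score, which mirrors how position-specific coefficients make gaps cheaper in variable regions.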

The profile contains a consensus sequence for the display of alignments of other sequences to the profile. The consensus sequence character corresponds to the highest value in the row. Since the table on which the profile is based is usually the Dayhoff evolutionary distance table, the consensus residue is the residue that has the smallest evolutionary distance from all of the residues in that position of the alignment rather than simply the most frequent residue at that position.
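
The difference between the consensus residue and the most frequent residue can be seen in a small example. The similarity values below are hypothetical (they are not Dayhoff values), as are the names TOY, pair_score, and column_values; the column values are computed by simple frequency-weighted averaging, which is an assumption.

```python
from collections import Counter

# Hypothetical similarity values; any unlisted pair scores -2.0.
TOY = {
    ("D", "D"): 4.0, ("I", "I"): 4.0, ("L", "L"): 4.0, ("V", "V"): 4.0,
    ("I", "L"): 2.0, ("I", "V"): 3.0, ("L", "V"): 1.0,
}

def pair_score(a, b):
    """Symmetric lookup with a default for dissimilar pairs."""
    return TOY.get((a, b), TOY.get((b, a), -2.0))

def column_values(column, alphabet="DILV"):
    """Average score of each candidate symbol against one alignment column."""
    counts = Counter(column)
    return {
        a: sum(n * pair_score(a, b) for b, n in counts.items()) / len(column)
        for a in alphabet
    }

column = "DDILV"  # one alignment column: two D, one each of I, L, V
row = column_values(column)
consensus = max(row, key=row.get)  # symbol with the highest value in the row
most_frequent = Counter(column).most_common(1)[0][0]
print(consensus, most_frequent)
```

In this column D is the most frequent symbol, but I has the highest value because it is similar to L and V as well as to itself, so I is reported as the consensus.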

Looking for Structural Motifs with Profiles

Gribskov et al. (CABIOS 4; 61-66 (1988)) have aligned the sequences from a number of known protein structural motifs and calculated a group of profiles from these alignments. ProfileScan compares any new protein sequence to each of the profiles in this motif database to find out if any of these known motifs occur in the protein. This is one of the few techniques that can reliably predict the location of structural features in protein sequences.

Database Searching with Profiles

A search of the database using a profile as a probe involves making an optimal alignment of every sequence in the database to the profile and listing the alignments for which the alignment score is outstanding.
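
One simple way to decide which scores are "outstanding" is to express each score as a number of standard deviations above the mean of all scores. The sketch below assumes that criterion, with a hypothetical function name, cutoff, and made-up scores; the criteria that ProfileSearch actually applies are not described in this section.

```python
from statistics import mean, pstdev

def outstanding_hits(scores, z_cutoff=2.0):
    """Return (name, z) pairs whose z-score is at least z_cutoff, best first.

    scores: dict mapping sequence name -> alignment score against the profile.
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all scores identical, nothing stands out
    hits = [(name, (s - mu) / sigma) for name, s in scores.items()]
    return sorted(
        (h for h in hits if h[1] >= z_cutoff),
        key=lambda h: h[1],
        reverse=True,
    )

# Made-up alignment scores for a tiny search set.
scores = {
    "seq1": 10.0, "seq2": 11.0, "seq3": 9.0, "seq4": 10.0,
    "seq5": 12.0, "seq6": 8.0, "globin_like": 40.0,
}
hits = outstanding_hits(scores)
for name, z in hits:
    print(name, round(z, 2))
```

A real search set is far larger, and longer sequences tend to reach higher scores by chance, so a practical criterion would also need to account for sequence length.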

The profile method has several advantages over most sequence comparison methods. A profile represents the common characteristics of a family of similar sequences, whereas any single sequence is just one realization of the family's characteristics. Since the profile represents the alignment of a number of known sequences, it contains information that defines where the family of sequences is conserved and where it is variable. The comparison of a new sequence to a profile can emphasize similarity to conserved regions while tolerating diversity in variable regions. A database search can be more sensitive since each sequence in the database is compared to more generalized information than is possible in searches based on pairwise comparisons between two sequences.

Conventional database searching methods require some minimal level of sequence identity between the sequences for any signal to be generated. The profile search, since it is based on quantitative symbol comparisons, can find similarities between sequences with little or no sequence identity.

The alignment of a sequence to a profile is inherently more sensitive since the whole surface of comparison can be used to find the optimal alignment. Conventional searching methods, like the Wilbur and Lipman method, use scores that come from one or a small number of adjacent diagonals. The aligned sequences of many protein families suggest that gaps are frequent even in very similar proteins, and each gap shifts the rest of the alignment onto a different diagonal, so a search restricted to a few diagonals can miss part of the similarity.

Experiments Confirm the Sensitivity of Profile Searching

Experiments reported by Gribskov et al. (Proc. Natl. Acad. Sci. USA 84; 4355-4358 (1987)) show that searching the database with a globin profile creates a distribution of alignment scores that distinguishes known globins from unrelated sequences more clearly than conventional searching does. Even globins distantly related to the group used to make the profile were clearly distinguished from non-globin sequences. The non-random part of the distribution of the alignment scores also contained a large number of credibly "globin-like" sequences that were not identified when conventional database searching algorithms were used.

For comparison, the authors searched the PIR protein sequence database with the Lipman-Pearson FASTP program (almost identical to FastA) using human alpha hemoglobin as a probe. The FASTP program selected 244 of the 271 globins in the database. The leghemoglobins could not be clearly distinguished from non-globin sequences.

Steps in Profile Searching

Profile searching has four steps: assembly of a family of related sequences into a multiple sequence alignment with PileUp, construction of a profile from the alignment with the program ProfileMake, comparison of the profile to a database of sequences with ProfileSearch, and finally display of the best similarities found with ProfileSegments. The starting point for the creation of a profile is a sequence or group of aligned sequences. This probe is generally a group of functionally related proteins that have been aligned with tools such as PileUp. A profile, however, can be created from a single sequence.

The profile is then calculated from the multiple sequence alignment with the program ProfileMake. The profile contains position-specific gap coefficients based on the position and length of the gaps in the aligned sequences. The gap and gap length penalty coefficients are higher in regions in which no gaps are observed in the aligned sequences, and lower where gaps are observed. When a sequence is aligned to a profile, gaps will tend to be placed in the same regions where they occur in the aligned sequences used to generate the profile.
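
The qualitative rule for gap coefficients (high where no gaps are observed, lower where gaps are observed) can be sketched as follows. The formula used here, which scales the coefficient by the fraction of sequences without a gap in each column, is a hypothetical stand-in chosen for illustration; it is not the formula ProfileMake uses, and the function name and default limits are likewise assumptions.

```python
def gap_coefficients(alignment, max_coeff=1.0, min_coeff=0.1):
    """Return one gap coefficient per alignment column.

    Hypothetical rule: columns with no gaps get max_coeff; the coefficient
    falls in proportion to the fraction of sequences that have a gap in
    the column, but never below min_coeff.
    """
    n_seqs = len(alignment)
    coeffs = []
    for column in zip(*alignment):
        gap_fraction = column.count("-") / n_seqs
        coeffs.append(max(min_coeff, max_coeff * (1.0 - gap_fraction)))
    return coeffs

alignment = [
    "AG-SA",
    "AGTSA",
    "A--SA",
    "AG-SA",
]
coeffs = gap_coefficients(alignment)
print(coeffs)
```

Multiplying a base gap penalty by these coefficients would make gaps expensive in the fully conserved first, fourth, and fifth columns and cheap in the third column, where most of the aligned sequences already have a gap.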

Profiles, once generated, are provided as the input to ProfileSearch along with a sequence specification like SwissProt:* (the search set). ProfileSearch aligns each sequence in the search set to the profile and makes a list of the sequences with the best alignment scores.

The list is a file of sequence names suitable for input to ProfileSegments, which makes and displays an optimal alignment of each sequence in the list to the profile consensus sequence. When you have identified a new sequence that belongs to the sequence family from which your profile was calculated, you can align it to the family's entire multiple sequence alignment with ProfileGap.

A sequence may be compared to a library of defined profiles, representing known sequence and structural features, with ProfileScan.

References

1. Gribskov, M., McLachlan, A. D., and Eisenberg, D. (1987). Profile Analysis: Detection of Distantly Related Proteins. Proceedings of the National Academy of Sciences USA 84; 4355-4358.

2. Gribskov, M., Homyak, M., Edenfield, J., and Eisenberg, D. (1988). Profile Scanning for Three-Dimensional Structural Patterns in Protein Sequences. Computer Applications in the Biosciences 4; 61-66.

3. Gribskov, M. and Eisenberg, D. (1989). Detection of Protein Structural Features With Profile Analysis. In Techniques in Protein Chemistry (pp. 108-117), Academic Press, San Diego, California, USA.

4. Gribskov, M., Luethy, R., and Eisenberg, D. (1989). Profile Analysis. In Methods in Enzymology 183 (pp. 146-159), Academic Press, San Diego, California, USA.

Printed: May 27, 2005 14:08
