PROFILESEARCH

[Genhelp | Program Manual | User's Guide | Data Files | Databases | Release Notes ]

 

 

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

OUTPUT

INPUT FILES

RELATED PROGRAMS

RESTRICTIONS

ALGORITHM

NORMALIZATION OF SCORES

CONSIDERATIONS

PROFILE FILE FORMAT

COMMAND-LINE SUMMARY

LOCAL DATA FILES

PARAMETER REFERENCE


FUNCTION

[ Top | Next ]

ProfileSearch uses a profile (representing a group of aligned sequences) as a query to search the database for new sequences with similarity to the group. The profile is created with the program ProfileMake.

DESCRIPTION

[ Previous | Top | Next ]

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

Using the method of Gribskov, et al. (Methods in Enzymology, 183; 146-159 (1989)), ProfileSearch accepts a profile from ProfileMake and uses it to search a database (or any set of sequences you specify) for sequences that are similar to the aligned probe sequences used to create the profile. The algorithm calculates the score (quality) of the optimal alignment between the profile and each sequence in the database and creates a list of all of the sequences in the database with an alignment score above some threshold. The results of ProfileSearch are corrected for systematic effects of the sequence length on the score. The output list can be displayed as optimal alignments with ProfileSegments.

The gap creation and gap extension penalties specified for ProfileSearch are maximum values. The actual position-specific gap penalties at any position are determined by multiplying the gap creation penalty by the percent value in the second to the last column of the profile, and the gap extension penalty by the percent value in the last column of the profile.

ProfileSearch does a lot of computing so you will probably want to run it in the batch queue (see the CONSIDERATIONS topic below).

EXAMPLE

[ Previous | Top | Next ]

Here is a session using ProfileSearch to search through the entire protein sequence database with a profile generated from a group of aligned 70 kd heat shock and heat shock cognate protein sequences. The profile was generated in the example session from the ProfileMake program.

 
 
% profilesearch
 
  PROFILESEARCH with what query profile ?  hsp70.prf
 
  "hsp70.prf" is a profile of length: 718
 
  Search for query in what sequences(s) (* PIR:* *) ?
 
  What is the gap creation penalty (* 24.00 *) ?
 
  What is the gap extension penalty (* 0.27 *) ?
 
  What should I call the output file (* hsp70.pfs *) ?
 
       1 Sequences          105 aa searched      PIR1:CCHU
     501 Sequences       93,899 aa searched      PIR1:S25690
   1,001 Sequences      275,279 aa searched      PIR1:DEBSSC
 
  ////////////////////////////////////////////////////////////
 
  98,501 Sequences   34,365,060 aa searched      PIR2:H70346
  99,001 Sequences   34,521,439 aa searched      PIR4:A33152
 
 Writing results...
 
              Sequences searched: 99165
 
   CPU time (seconds) for search: 5745.62 sec.
        Total CPU time (seconds): 5768.57 sec.
            Output file: hsp70.pfs
 
%

OUTPUT

[ Previous | Top | Next ]

Here is some of the output file:

 
 
!!SEQUENCE_LIST 1.0
(Peptide) PROFILESEARCH of: hsp70.prf Length: 718 to: PIR:*
 
         Scores are not corrected for composition effects
 
                 Gap Weight: 24.00
          Gap Length Weight: 0.27
         Sequences Examined: 99165
         CPU time (seconds): 3040
*    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *
Profile information:
(Peptide) PROFILEMAKE v4.50 of: Hsp70.MSF{*}  Length: 718
  Sequences: 28  MaxScore: 2172.36  October 11, 1996 11:41
                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10
         Hsp70.MSF{Hs70_Plafa}  From: 1         To: 718       Weight: 1.00
         Hsp70.MSF{Hs70_Thean}  From: 1         To: 718       Weight: 1.00 . . .
*    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *
Normalization:                                October 16, 1998 11:26
 
         Curve fit using 48 length pools
         2 of 50 pools were rejected
         Normalization equation:
 
                Calc_Score = 138.82 * ( 1.0 - exp(-0.0020*SeqLen - 0.3503) )
 
         Correlation for curve fit: 0.911
 
         Z score calculation:
         Average and standard deviation calculated using 98922 scores
         243 of 99165 scores were rejected
 
                Z_Score = ( Score/Calc_Score - 0.992 ) / 0.107
 
          Sequence  Strd ZScore   Orig Length ! Documentation  ..
PIR2:JC4853           +  136.40 1740.39    646 ! dnaK-type molecular chape ...
PIR2:S07197           +  136.40 1740.39    646 ! dnaK-type molecular chape ...
PIR2:A27077           +  136.39 1740.20    646 ! dnaK-type molecular chape ...
 
/////////////////////////////////////////////////////////////////////////// ...
 
PIR2:B70338           +    2.52  80.21    133 ! general secretion pathway  ...
PIR2:I40149           +    2.51 100.70    257 ! outer surface protein D -  ...
PIR2:JC6163           +    2.51  83.98    154 ! ubiquitin-conjugating enzy ...

The output file from ProfileSearch is an ordered list of the sequences with the highest alignment scores when compared to the profile. Unless the normalization procedure is disabled or fails, the file is ordered according to the Z scores (see the NORMALIZATION OF SCORES topic below).

The documentation section of the file (before the sequence listing) is divided by rows of asterisks into three parts. The first section describes the conditions used in running ProfileSearch, the number of sequences examined, and the amount of CPU time used. The second section reports the documentary information from the profile. The third section of the documentation records the process of the normalization (see the NORMALIZATION OF SCORES topic below).

INPUT FILES

[ Previous | Top | Next ]

ProfileSearch requires a profile as one of its input files. You can create profiles from aligned sequences by means of the ProfileMake program. In the ProfileDir directory, Accelrys GCG (GCG) provides a large number of amino acid profiles derived from the PROSITE database.

ProfileSearch accepts multiple (two or more) sequences of the same type as the search set. (They should be of the same type as the sequences that were used to create the profile.) You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenBank:*. The function of ProfileSearch depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

[ Previous | Top | Next ]

PileUp creates a multiple sequence alignment from a group of related sequences. Pretty displays multiple sequence alignments.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between one or more sequences and a group of aligned sequences represented as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using predetermined parameters to determine significance.

HmmerSearch uses a profile hidden Markov model as a query to search a sequence database to find sequences similar to the family from which the profile HMM was built. Profile HMMs can be created using HmmerBuild.

RESTRICTIONS

[ Previous | Top | Next ]

We have little experience using nucleotide sequences with profile analysis.

Because of memory constraints, ProfileSearch will only use the first 100,000 sequences for performing normalization (see the NORMALIZATION OF SCORES topic, below).

ALGORITHM

[ Previous | Top | Next ]

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

NORMALIZATION OF SCORES

[ Previous | Top | Next ]

The scores for comparison of sequences and a profile are systematically correlated with the lengths of the sequences. ProfileSearch corrects the observed alignment scores for these systematic affects of sequence length to give normalized scores.

The relationship between the length of a sequence (SeqLen) and the observed score for comparison of the profile and sequence (Score) is usually very close to

 
 
 
Score = C * (1 - e(A*SeqLen+B))
 

where A, B, and C are some constants. These constants can be empirically determined by fitting the alignment scores of the profile and the search set sequences to the above equation.

The scores for each sequence are sorted by the sequence length and pooled together in groups. Each pool contains at least 50 sequences (20 sequences if there are fewer than 1,500 total scores to normalize), but if the sequences are very similar in length, the program continues to add sequences to the pool until a sequence is encountered that is more than 50 residues longer than the shortest sequence in the pool. For each pool, the mean length, score, and the standard deviation of the scores are determined. If a pool contains more than the minimum number of sequences (50 or 20), these calculations are derived from an evenly-spaced sample of 50 (or 20) sequences from the pool rather than from all sequences in the pool.

The best fit of these values to the above equation is determined by non-linear fitting by the Levenberg-Marquardt method. The fit is weighted by the standard deviations of the pools; pools that contain values that represent true similarity between the sequence and profile have large standard deviations and, therefore, make relatively small contributions to the fit.

The assumption of the normalization is that the alignment scores being normalized represent sequences unrelated to the profile. Because there are usually some sequences that are related to the profile, a second pass of normalization is made. In the second pass, the average score of each length pool is compared to the calculated score for a sequence of the same length. Pools whose average scores differ from their expected scores by more than 5.0 times the average standard deviation of all pools are eliminated from consideration. This effectively eliminates most pools that contain many sequences related to the profile, since these pools have average scores with large deviations from the predicted scores. The curve fitting procedure is now repeated and new values for A, B, and C calculated.

Each normalized score (Score(N)) is then calculated

 
 
 
Score(N) = Score(orig) / (C * (1 - e(A*SeqLen+B)))
 

The average (Score(N)(avg)) and standard deviation (SD(N)) of all normalized scores are then calculated. Normalized scores whose original scores differ from the calculated score for a sequence of the same length by more than 5.0 times the average standard deviation of all length pools are omitted from this calculation. This ensures that sequences similar to the profile will not affect the calculation of the mean and standard deviation.

The Z scores are the differences in standard deviation units between each normalized comparison score and the mean normalized comparison score for sequences unrelated to the profile. Therefore, a Z score of 5.0 means that a comparison score is significant at the 5.0 sigma level. The Z score is calculated as

 
 
 
Z_Score = (Score(N) - Score(N)(avg)) / SD(N)
 

The Z score has a mean of 0.0 and a standard deviation of 1.0.

The third section of the documentation in the ProfileSearch output file records the process of the normalization described above. First, the number of length pools used in the normalization, and the number of pools rejected because of high standard deviation are reported. The empirically derived equation of the curve is then presented, along with the correlation coefficient for the agreement between the calculated curve and the observed results. This value is typically about 0.95. If it is much lower (e.g., less than 0.90), the reported Z scores are not accurate. Finally, the equation used for the calculation of Z scores is reported, along with the number of scores omitted from the calculation of the normalized mean and standard deviation.

Optionally, a file recording the observed and calculated scores for the length pools used in the normalization procedure may be produced with -FITfile. For each pool used in the calculation, the file contains the average and standard deviation of the lengths and observed scores, and the calculated score for a sequence of the average length of the pool.

Failure of Normalization

The normalization procedure described above appears to be robust, but it must be remembered that it is based on an empirically derived description of the distribution of scores. In particular, this normalization may not be appropriate for very long profiles with large numbers of rows with low gap creation penalties. It also fails if the sequences in the search set are all very similar in length.

If the first pass of the curve fitting procedure is unsuccessful, the Z scores are all reported as zero. If the first pass is successful, but the second pass fails, Z scores are calculated and the entries in the output file are sorted based on the Z scores; however, a warning message is produced in the documentation of the output. If fewer than 400 entries are saved in the output or if fewer than three length pools are present, normalization will not be attempted and all Z scores will be reported as zero.

CONSIDERATIONS

[ Previous | Top | Next ]

ProfileSearch attempts to choose default maximum gap creation and extension penalties that are appropriate for the query profile it reads. You can use -GAPweight and -LENgthweight or respond to the program prompts to specify alternative maximum gap penalties if you don't want to accept the default values.

ProfileSearch requires the computer to make a calculation that is proportional to the product of the profile length times the length of the database. On slower computers, large profiles searched against large databases (e.g. all of PIR) may require hours of CPU time. Searches of the nucleotide sequence database with profiles prepared from nucleotide sequence alignments may take substantially longer because of the larger size of the database and the necessity to search both forward and reverse strands of each database sequence.

Since ProfileSearch can take such a long time to run, most searches should probably be run in the batch queue. You can specify that this program run at a later time in the batch queue by using -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Section 3, Using Programs in the User's Guide.

ProfileSearch/ProfileSegments finds only the best fit of the profile to any sequence. Be aware that other regions with a lower degree of similarity to the profile may also exist in the same sequence, especially in nucleic acid sequences.

PROFILE FILE FORMAT

[ Previous | Top | Next ]

Look at the entry for ProfileMake for a description and an example of what profile files look like.

COMMAND-LINE SUMMARY

[ Previous | Top | Next ]

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % profilesearch [-INfile1=]hsp70.prf -Default
 
Prompted Parameters:
 
[-INfile2=]pir:*         specifies the search set
-GAPweight=24.00         sets maximum position-specific gap creation penalty
-LENgthweight=0.27       sets maximum position-specific gap extension penalty
[-OUTfile=]hsp70.pfs     names the output file
 
Local Data Files: None
 
Optional Parameters:
 
-LIStsize=10000         sets a limit to the size of the output list
-SEQLimit=100000000     sets maximum number of sequences to search
-MINList=2.5            sets lowest Z score to report in output list
-MINSeq=50              sets minimum length sequence to examine in the search
-MINNorm=50             sets minimum length sequence to use in normalization
-NONORmalize            turns off normalization of comparison scores
-NOSEArch               normalizes a pre-existing file of search scores
-FITfile[=hsp70.fit]    names output file with curve fitting information
                          for normalized scores
-CPUlimit=86400         limits the search to 86,400 seconds
-SINce=6.90             limits search to sequences dated on or after
                             June 1990 (does not work for PIR databases)
-NOMONitor              suppresses the screen trace for the search set
                             sequences
-BATch                  submits program to the batch queue

LOCAL DATA FILES

[ Previous | Top | Next ]

None.

PARAMETER REFERENCE

[ Previous | Top ]

You can set the parameters listed below from the command line.

-GAPweight=24.00

Sets the maximum position-specific gap creation penalty that is subtracted from the alignment score whenever a gap is created.

-LENgthweight=0.27

Sets the maximum position-specific gap extension penalty that is subtracted from the alignment score for each gapped symbol.

-LIStsize=10000

Sets a limit to the number of entries in the output list.

-SEQLimit=100000000

Sets a limit on the number of sequences to search.

-MINList=2.5

Sets the lowest Z score for an entry to be reported in the output list. If Z scores aren't calculated, then all entries (up to the maximum number determined by -LIStsize) are reported. The default reports an entry if its score is 2.5 standard deviations above the expected score for a sequence of the same length.

-MINSeq=50

Sets the minimum length for a sequence to be searched.

-MINNorm=50

Set the minimum length of a sequence to be used in the normalization.

-NONORmalize

Turns off the normalization of alignment scores. The listed sequences are sorted by their original comparison scores, rather than the scores adjusted for systematic effects of sequence length. In the output file, all Z scores are 0.

-NOSEArch

Turns off the database search and normalizes a previously existing file of ProfileSearch scores, instead. The normalization occurs only if the existing file contains over 400 saved entries.

-FITfile=hsp70.fit

Writes an output file containing the observed and calculated scores for the length pools used in the normalization procedure.

-CPUlimit=86400

Sets a limit in seconds beyond which the search is aborted. The limit defaults to 86,400 seconds (24 hours). ProfileSearch reports the results for the sequences searched before the time limit is exceeded.

-SINce=6.1990

Limits the search to sequences that have been entered into the database or modified since June 1990. As this is being written, only the EMBL, GenBank, and SWISS-PROT databases support this parameter.

-MONitor=500

Monitors this program's progress on your screen. Use this parameter to see this same monitor in the log file for a batch process. If the monitor is slowing down the program because your terminal is connected to a slow modem, suppress it with -NOMONitor.

The monitor is updated every time the program processes 500 sequences or files. You can use a value after the parameter to set this monitoring interval to some other number.

-BATch

Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

Printed: May 27, 2005  14:12


[Genhelp | Program Manual | User's Guide | Data Files | Databases | Release Notes ]


Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ® and, the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

www.accelrys.com/bio