HMMERCALIBRATE

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

FUNCTION

[ Top | Next ]

HmmerCalibrate "calibrates" a profile hidden Markov model in order to increase the sensitivity of database searches performed using that profile HMM as a query. The program compares the original profile HMM with a large number of randomly generated sequences and computes the extreme value distribution (EVD) parameters for this simulated search. The original profile HMM is replaced with a new one that contains these EVD parameters.

DESCRIPTION

[ Previous | Top | Next ]

HmmerCalibrate provides a Accelrys GCG (GCG) interface to the hmmcalibrate program of Dr. Sean Eddy's HMMER package. It allows you to access most of hmmcalibrate's parameters from GCG command line.

HmmerCalibrate "calibrates" a profile HMM by generating random sequences and computing a raw score for the comparison between each sequence and the profile HMM. The program then fits the distribution of these scores to an extreme value distribution, and estimates the EVD parameters mu and lambda for this distribution. These parameters are inserted into the profile HMM file.

When using a profile HMM to search a sequence database or using a sequence query to search a profile HMM database, the search is more sensitive if the profile HMM has been calibrated. Calibration is time-consuming because it is, in a sense, equivalent to a database search in itself. However, the calibration only has to be performed once, as the extreme value distribution parameters are stored with the profile HMM and will be used whenever you perform profile HMM-sequence database searches.

EXAMPLE

[ Previous | Top | Next ]

Here is a session using HmmerCalibrate to calibrate the hidden Markov model profile made in the HmmerBuild example session:

% hmmercalibrate

HMMERCALIBRATE what profile HMM ? hsp70.hmm_g

Creating temp file for input to hmmcalibrate.

Calling hmmcalibrate to perform analysis ...

hmmcalibrate -- calibrate HMM search statistics

HMMER 2.1.1 (Dec 1998)

HMMER is freely distributed under the GNU General Public License (GPL).

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

HMM file: /usr/users/share/smith/hsp70.hmm_g

Length distribution mean: 325

Length distribution s.d.: 200

Number of samples: 5000

random seed: 966630244

histogram(s) saved to: [not saved]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

HMM : hsp70

mu : -492.892303

lambda : 0.088041

max : -432.713989

OUTPUT

[ Previous | Top | Next ]

Here is some of the output file:

HMMER2.0

NAME  hsp70

LENG  677

ALPH  Amino

RF    no

CS    no

MAP   yes

COM   gcg_hmmbuild /usr/users/share/smith//hsp70.hmm__g hsp70.msf

COM   gcg_hmmcalibrate /usr/users/share/smith/hsp70.hmm__g

NSEQ  25

DATE  Fri Jul  9 14:09:29 1999

CKSUM 3252

XT      -8455     -4  -1000  -1000  -8455     -4  -8455     -4

NULT      -4  -8455

NULE     595  -1558     85    338   -294    453  -1158  ...  -1998   -644

EVD   -491.783325   0.094483

HMM        A      C      D      E      F      G      H  ...      W      Y

         m->m   m->i   m->d   i->m   i->i   d->m   d->d

   b->m   m->e

         -368      *  -2150

     1    482  -1538   -242    709  -1777   -345     14 ...  -1783  -1160    27

     -   -149   -500    233     43   -381    399    106 ...  -294   -249

     -    -18  -6880  -7922   -894  -1115   -701  -1378   -368      *

     2    209  -1369   -338    519  -1590   -486    -48 ...  -1681  -1096    28

     -   -149   -500    233     43   -381    399    106 ...  -294   -249

     -    -18  -6880  -7922   -894  -1115   -701  -1378      *      *

///////////////////////////////////////////////////////////////////////////////

   675   -664  -1701     33   2728  -1813  -1361   -295 ...  -1992  -1378   717

     -   -149   -500    233     43   -381    399    106 ...  -294   -249

     -    -24  -6511  -7553   -894  -1115   -701  -1378      *      *

   676   -576   -717  -2213  -1989  -1191  -1755  -1601 ...  -1918  -1514   718

     -   -149   -500    233     43   -381    399    106 ...  -294   -249

     -    -24  -6511  -7553   -894  -1115   -701  -1378      *      *

   677  -1248  -2485   3484    210  -3124   -286   -821 ...  -3039  -2364   719

     -      *      *      *      *      *      *      * ...      *      *

     -      *      *      *      *      *      *      *      *      0

//

The profile HMM itself wasn't changed but several new lines were added to the file. The two lines labeled COM are present for documentation purposes; they contain the command lines that were used to create the original profile HMM and to perform the calibration. The line labeled EVD contains the parameters mu and lambda that describe the extreme value distribution of the calibration search. The values on this line are used by HmmerSearch and HmmerPfam.

INPUT FILES

[ Previous | Top | Next ]

HmmerCalibrate requires as its only input a file containing one or more profile HMMs.

RELATED PROGRAMS

[ Previous | Top | Next ]

PileUp creates a multiple sequence alignment from a group of related sequences. Pretty displays multiple sequence alignments.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between one or more sequences and a group of aligned sequences represented as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using predetermined parameters to determine significance.

HmmerBuild makes a profile hidden Markov model from a multiple sequence alignment. HmmerAlign aligns one or more sequences to a profile HMM. HmmerPfam searches a database of profile HMMs with a sequence query in order to identify known domains within the sequence. HmmerSearch uses a profile HMM as a query to search a sequence database for sequences similar to the original aligned sequences. HmmerCalibrate calibrates a hidden Markov model so that database searches using it as a query will be more sensitive. HmmerIndex creates a binary GSI ("generic sequence index") for a database of profile HMMs. HmmerFetch retrieves a profile hidden Markov model by name from an indexed database of profile HMMs. HmmerEmit randomly generates sequences that match a profile HMM. HmmerConvert converts between different profile HMM file formats and from profile HMM to GCG profile file format.

MEME finds conserved motifs in a group of unaligned sequences and saves these motifs as a set of profiles. You can search a database of sequences with these profiles using the MotifSearch program.

RESTRICTIONS

[ Previous | Top | Next ]

Unknown.

ALGORITHM

[ Previous | Top | Next ]

See the Profile HMM Analysis Essay for an introduction to profile hidden Markov models and the terminology associated with them.

In effect, HmmerCalibrate performs a preprocessing step that enables HmmerSearch and HmmerPfam to determine more accurately the significance of the matches obtained from a database search. It pre-calculates certain parameters that HmmerSearch and HmmerPfam can use when computing E-values for the scores between the profile HMM query and each sequence in the database.

What Are E-Values and How Are They Determined?

Like many other database searching programs, HmmerSearch and HmmerPfam give you an estimate of the significance of a score X obtained from the search by calculating an E-value (expectation) from the score. Simply put, the E-value is the number of pairwise matches expected to score X or higher purely by chance when searching a sequence database of this size.

Calculation of E-values is possible because the limiting distribution for the scores obtained in a search of a database of mostly unrelated sequences is an extreme value distribution. Parameters that describe a particular extreme value distribution can be determined empirically. These EVD parameters are then substituted into the equation that is used to calculate E-values from the pairwise alignment scores.

In order to obtain these EVD parameters, some programs (such as FastA) fit the observed distribution of scores to an extreme data distribution on the fly during the database search. For example, some use the distribution of scores from the first 50,000 or 100,000 sequences of the database and fit this distribution to an EVD. Other programs randomly sample the scores from the entire database and fit the distribution of these scores to an EVD. Both of these approaches can be problematical because if the database sequences (or the subset of sampled database sequences) are closely related, the distribution of scores will deviate from an extreme value distribution and the E-values calculated from the EVD will not be accurate.

How Does HmmerCalibrate Calculate the EVD Parameters?

HmmerCalibrate gets around this problem by "searching" an artificial database of sequences that are not closely related in order to estimate the EVD parameters. It randomly generates a large set of sequences, computes the score between each sequence and the profile HMM, fits the distribution of these scores to an extreme value distribution, and saves the parameters that describe this EVD in the profile HMM file. If a large enough number of random sequences were used, the parameters will be very accurate.

Instead of having to calculate the EVD parameters every time a profile HMM is used for a database search, HmmerSearch and HmmerPfam can simply read these EVD parameters from calibrated profile HMM files.

By default, HmmerCalibrate generates 5000 sequences with a variety of lengths. The length distribution follows a Gaussian distribution centered at 350 residues with a standard deviation of 350. (Since sequences must have a length of at least one, the left side of the length distribution may be truncated.) You can change all of these values by means of command-line parameters.

What Happens If You Use Uncalibrated Profile HMMs?

In general, when using calibrated profile HMMs, E-values of 0.1 or less are very significant hits. If the query profile HMM has not been calibrated, HmmerSearch instead computes E-values using an analytic upper bound calculation that errs on the side of caution to avoid false positives. These E-values are also reliable, but because of the conservative way they are calculated, remote homolog in the sequence database may be missed.

For more information on extreme value distributions and the significance of database search scores, see topic 2.7 in Section 2 of Biological Sequence Analysis by Richard Durbin, et al. (Cambridge University Press, 1998).

CONSIDERATIONS

[ Previous | Top | Next ]

Unless you specify a different number with the -SEQNUM parameter, 5000 sequences are used to obtain the score distribution. This number was selected as a trade-off between precision and computation time. You can obtain more accurate estimates of the EVD parameters by increasing the number of sequences, at the cost of increased computation time. You should not reduce the number of sequences below about 1000 -- the curve-fitting calculations for the extreme value distribution may fail.

When using profile HMMs as queries for database searches, E-values of 0.1 or less usually represent very significant hits. However, be aware that database searches performed using uncalibrated profile HMMs may miss remote homologs.

To generate sequences, HmmerCalibrate uses a random number generator that is initialized by a seed value. By default, this seed value is derived from the system clock of your computer, so each program run will use a different seed and thus generate a somewhat different set of sequences. If you want HmmerCalibrate to generate the same set of sequences each time, you can set the seed by means of the -SEED parameter.

If you would like to save the uncalibrated HMM, you should copy it to another file, since HmmerCalibrate will modify the input HMM instead of creating a new HMM.

Increasing Program Speed Using Multithreading

This program is multithreaded. It has the potential to run faster on a machine equipped with multiple processors because different parts of the analysis can be run in parallel on different processors. By default, the program assumes you have one processor, so the analysis is performed using one thread. You can use -PROCessors to increase the number of threads up to the number of physical processors on the computer.

Under ideal conditions, the increase in speed is roughly linear with the number of processors used. But conditions are rarely ideal. If your computer is heavily used, competition for the processors can reduce the program's performance. In such an environment, try to run multithreaded programs during times when the load on the system is light.

As the number of threads increases, the amount of memory required increases substantially. You may need to ask your system administrator to increase the memory quota for your account if you want to use more than two threads.

Never use -PROCessors to set the number of threads higher than the number of physical processors that the machine has -- it does not increase program performance, but instead uses up a lot of memory needlessly and makes it harder for other users on the system to get processor time. Ask your system administrator how many processors your computer has if you aren't sure.

COMMAND-LINE SUMMARY

[ Previous | Top | Next ]

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % hmmercalibrate [-INfile1=]hsp70.hmm_g -Default

Local Data Files: None

Optional Parameters:

-SEQNUM=5000 sets the number of synthetic sequences to 5000

-LENgth=300 sets the length of the random sequences to 300

-MEAnlength=350 sets the mean length of the synthetic sequences to 350

-STDdevlength=350 sets the standard deviation of the sequence length

distribution to 350

-SEED=317 sets the seed for the random number generator to 317

-HIStogram[=hsp70.hist] saves the histogram of the scores to hsp70.hist

-PROCessors=2 sets the maximum number of threads that the program

will use to 2

-NOMONitor doesn't display information about analysis parameters

-BATch submits program to the batch queue

LOCAL DATA FILES

[ Previous | Top | Next ]

None.

PARAMETER REFERENCE

[ Previous | Top ]

You can set the parameters listed below from the command line.

Following some of the parameters described below is a short expression in parentheses. These are the names of the corresponding parameters used in the native HMMER package. Some of the GCG parameters are identical to the original HMMER parameters, others have been changed to fit GCG's conventions.

-SEQNUM=5000 (--num 5000)

Sets the number of synthetic sequences to 5000 (must be a positive integer). The default value is 5000; higher numbers will give better estimates of the EVD parameters, lower values result in faster computation time. Values below 1000 may cause the curve-fitting calculations to fail.

-LENgth=300 (--fixed 300)

Sets the length of all of the randomly generated sequences to 300 residues (must be a positive integer). For meaningful results, the length must fall within reasonable biological limits. If this parameter is not used, the random sequences will have a length distribution that follows a Gaussian (normal) distribution. You cannot use parameters -MEAnlength and -STDdevlength with this option, they will be ignored.

-MEAnlength=350 (--meanlength 350)

Sets the mean length of the synthetic sequences to 350 (must be a positive real number). The default is 350. If option -LENgth is specified as well, this option will be ignored.

-STDdevlength=350 (--sd 350)

Sets the standard deviation of the synthetic sequence length distribution to 350. The value must be a positive real number. The default is 350. If option -LENgth is specified as well, this option will be ignored.

-SEED=317 (--seed 317)

Sets the seed for the random number generator to 317 (must be a positive integer) instead of using the system clock to create the seed.

-HIStogram=hsp70.hist (--histfile hsp70.hist)

Saves a histogram of the scores and the fitted theoretical curve to a file named hsp70.hist.

-PROCessors=2 (--cpu 2)

Tells the program to use 2 threads on a multiprocessor computer. The default is 1.

-NOMONitor

Suppresses the display of the program's progress on the screen.

-BATch

Submits the program to the batch queue for processing after prompting you for all required user inputs. All output files are written to your current directory, unless you direct the ouput to another directory when you specify the output file.

Printed: May 27, 2005 12:41

Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ® and, the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.