TransMem

TransMem builds on the method of Sonnhammer et al. (Proc. of Sixth Int. Conf. on Intelligent Systems for Molecular Biology, 175-182 (1998)) to predict likely transmembrane helices in one or more input proteins. The method is based upon a Hidden Markov Model (HMM) that has been trained on a set of membrane proteins with helical membrane spanning regions.

EXAMPLE

Here is a session using TransMem to generate two predictions for the delta subunit of the mouse GABA(A) receptor.

% transmem sw:gad_mouse

TransMem scans for likely transmembrane helices in one or more input protein sequences.

Number of different annotations of each sequence (* 1 *) ? 2

Proximity of feature boundaries to consider annotations equivalent (* 0 *) ?

What should I call the output file (* gad_mouse.transmem *) ?

              Helix   Inside  Outside  Relative Score
GAD_MOUSE0        4        2        3          0.0000
GAD_MOUSE1        4        2        3          0.4922

CPU time: 0.560000

Sequences examined: 1

Sequences written: 2

Results written to "gad_mouse.transmem"

OUTPUT

The output from TransMem is a list file, and is suitable for input to any GCG program that allows indirect file specifications. (For information about indirect file specification, see Section 2, Using Sequence Files and Databases of the User's Guide.)

!!SEQUENCE_LIST 1.0

TransMem of sw:gad_mouse

 -MINHelix = 1

 -MEthod=Nbest

   -NBest = 2

   -PROXimity = 0

 August 13, 2001 16:36

                                Helix       Inside      Outside  Relative Score

..

sw:gad_mouse         !            4            2            3     0.0000

sw:gad_mouse         !            4            2            3     0.4922

\\End of List

>>SW:GAD_MOUSE

P22933 mus musculus (mouse). gamma-aminobutyric-acid receptor delta subunit prec

                       Begin     End

Outside                    1     249

Helix                    250     272

Inside                   273     278

Helix                    279     297

Outside                  298     311

Helix                    312     334

Inside                   335     425

Helix                    426     448

Outside                  449     449

Outside                    1     248

Helix                    249     271

Inside                   272     277

Helix                    278     296

Outside                  297     310

Helix                    311     333

Inside                   334     425

Helix                    426     448

Outside                  449     449

         OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO

         OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO

     1   MDVLGWLLLP LLLLCTQPHH GARAMNDIGD YVGSNLEISW LPNLDGLMEG

         OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO

         OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO

    51   YARNFRPGIG GAPVNVALAL EVASIDHISE ANMEYTMTVF LHQSWRDSRL

         OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO

         OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO

   101   SYNHTNETLG LDSRFVDKLW LPDTFIVNAK SAWFHDVTVE NKLIRLQPDG

         OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO

         OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO

   151   VILYSIRITS TVACDMDLAK YPLDEQECML DLESYGYSSE DIVYYWSENQ

         OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOH

         OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOHH

   201   EQIHGLDRLQ LAQFTITSYR FTTELMNFKS AGQFPRLSLH FQLRRNRGVY

         HHHHHHHHHH HHHHHHHHHH HHIIIIIIHH HHHHHHHHHH HHHHHHHOOO

         HHHHHHHHHH HHHHHHHHHH HIIIIIIHHH HHHHHHHHHH HHHHHHOOOO

   251   IIQSYMPSVL LVAMSWVSFW ISQAAVPARV SLGITTVLTM TTLMVSARSS

         OOOOOOOOOO OHHHHHHHHH HHHHHHHHHH HHHHIIIIII IIIIIIIIII

         OOOOOOOOOO HHHHHHHHHH HHHHHHHHHH HHHIIIIIII IIIIIIIIII

   301   LPRASAIKAL DVYFWICYVF VFAALVEYAF AHFNADYRKK RKAKVKVTKP

         IIIIIIIIII IIIIIIIIII IIIIIIIIII IIIIIIIIII IIIIIIIIII

         IIIIIIIIII IIIIIIIIII IIIIIIIIII IIIIIIIIII IIIIIIIIII

   351   RAEMDVRNAI VLFSLSAAGV SQELAISRRQ GRVPGNLMGS YRSVEVEAKK

         IIIIIIIIII IIIIIIIIII IIIIIHHHHH HHHHHHHHHH HHHHHHHHO

         IIIIIIIIII IIIIIIIIII IIIIIHHHHH HHHHHHHHHH HHHHHHHHO

   401   EGGSRPGGPG GIRSRLKPID ADTIDIYARA VFPAAFAAVN IIYWAAYTM

 CPU time: 0.560000

 Sequences examined: 1

 Sequences written:  2

INTERPRETING OUTPUT

The first part of the output file contains a list of all the sequences searched and the predictions generated for a given sequence. When multiple predictions are generated for each sequence, the predictions are listed in order of prediction quality, with the best prediction on top and the sub-optimal predictions below.

Next to each sequence, the file contains the raw counts of how many transmembrane helices, inner loops, and outer loops were found. If you have generated more than one prediction per sequence, there is also a score reported for comparing the quality of the prediction with the best prediction for each sequence. This is a relative measure only and should not be used to compare the quality of predictions between different sequences. In general, a score of 10 or more indicates that the prediction is significantly different from the best prediction.

Following this list of sequences, TransMem displays a table listing the specific boundaries of each feature predicted, followed by the sequence aligned with the predicted labels.
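
The feature table can be read programmatically. The following is a minimal sketch, assuming rows have the "Label Begin End" layout shown in the example output and that each prediction restarts at residue 1. The function names are hypothetical and are not part of TransMem.

```python
import re

# Matches rows such as "Helix                    250     272".
FEATURE_ROW = re.compile(r"^\s*(Outside|Helix|Inside)\s+(\d+)\s+(\d+)\s*$")


def parse_features(lines):
    """Return (label, begin, end) tuples for every feature row found."""
    features = []
    for line in lines:
        match = FEATURE_ROW.match(line)
        if match:
            label, begin, end = match.groups()
            features.append((label, int(begin), int(end)))
    return features


def split_predictions(features):
    """Start a new prediction whenever a feature begins at residue 1.

    Assumption: every prediction labels the sequence from residue 1 onward.
    """
    predictions = []
    for feature in features:
        if feature[1] == 1 or not predictions:
            predictions.append([])
        predictions[-1].append(feature)
    return predictions


# First rows of the two GAD_MOUSE predictions from the example output.
sample = """\
Outside                    1     249
Helix                    250     272
Inside                   273     278
Outside                    1     248
Helix                    249     271
Inside                   272     277
""".splitlines()

predictions = split_predictions(parse_features(sample))
helix_counts = [sum(1 for f in p if f[0] == "Helix") for p in predictions]
print(helix_counts)
```

Splitting on residue 1 is only a convenience for this sketch; a real parser may need a different rule if the file layout differs.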

INPUT FILES

TransMem takes any valid GCG specification for one or more protein sequences.

RELATED PROGRAMS

TransMem+ scans for likely transmembrane helices in one or more input protein sequences.

SPScan scans protein sequences for the presence of secretory signal peptides (SPs).

HTHScan scans protein sequences for the presence of helix-turn-helix motifs, indicative of sequence-specific DNA-binding structures often associated with gene regulation.

HelicalWheel plots a peptide sequence as a helical wheel to help you recognize amphiphilic regions.

PeptideStructure makes secondary structure predictions for a peptide sequence. The predictions include (in addition to alpha, beta, coil, and turn) measures for antigenicity, flexibility, hydrophobicity, and surface probability. PlotStructure displays the predictions graphically.

PepPlot plots measures of protein secondary structure and hydrophobicity in parallel panels of the same plot.

CoilScan locates coiled-coil segments in protein sequences.

SPScan+ scans protein sequences for the presence of secretory signal peptides (SPs).

HTHScan+ scans protein sequences for the presence of helix-turn-helix motifs, indicative of sequence-specific DNA-binding structures often associated with gene regulation.

CoilScan+ locates coiled-coil segments in protein sequences.

RESTRICTIONS

TransMem only works on protein sequences.

ALGORITHM

TransMem is based upon a Hidden Markov Model (HMM) architecture. The architecture is made up of 7 types of states: the core of the transmembrane helix, the helix caps on either side of the core, the cytoplasmic loop, the short and long non-cytoplasmic loops, and the globular domains that are part of each loop.

The states have a close relationship with the biology of membrane proteins: each loop state connects to another loop through a helix cap, the helix core, and another helix cap. Each state corresponds to one of three labels: Inside (cytoplasmic), Helix (membrane-spanning helix), or Outside (non-cytoplasmic).

The prediction of transmembrane helices is done by aligning the sequence with the model using the N-Best algorithm, which uses the model architecture to find the best labeling of the sequence, given the model.

Alternatively, you can run TransMem using the Viterbi algorithm, which finds the optimal alignment of the sequence with the model, then uses that alignment to read the labels. In general, the Viterbi algorithm will give the same results as the N-Best, but in some cases the predictions will differ.
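
The difference between the best single alignment and the best labeling can be illustrated on a toy model. The sketch below is not TransMem's trained HMM: the states, probabilities, and two-letter alphabet are invented for illustration. It computes the Viterbi path by dynamic programming and finds the most probable labeling by exhaustive enumeration, which stands in here for the N-Best search and is feasible only for very short sequences.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy model. Two core states share the Helix label, so several
# state paths (alignments) can produce the same labeling.
STATES = ["loop_in", "core_a", "core_b", "loop_out"]
LABEL = {"loop_in": "I", "core_a": "H", "core_b": "H", "loop_out": "O"}
START = {"loop_in": 0.5, "core_a": 0.0, "core_b": 0.0, "loop_out": 0.5}
TRANS = {
    "loop_in": {"loop_in": 0.6, "core_a": 0.2, "core_b": 0.2, "loop_out": 0.0},
    "core_a": {"loop_in": 0.1, "core_a": 0.8, "core_b": 0.0, "loop_out": 0.1},
    "core_b": {"loop_in": 0.1, "core_a": 0.0, "core_b": 0.8, "loop_out": 0.1},
    "loop_out": {"loop_in": 0.0, "core_a": 0.2, "core_b": 0.2, "loop_out": 0.6},
}
EMIT = {
    "loop_in": {"h": 0.3, "p": 0.7},
    "core_a": {"h": 0.9, "p": 0.1},
    "core_b": {"h": 0.7, "p": 0.3},
    "loop_out": {"h": 0.4, "p": 0.6},
}
SEQ = "pphhhpp"  # toy alphabet: h = hydrophobic, p = polar


def path_probability(path, seq):
    """Joint probability of one state path and the sequence."""
    prob = START[path[0]] * EMIT[path[0]][seq[0]]
    for prev, cur, sym in zip(path, path[1:], seq[1:]):
        prob *= TRANS[prev][cur] * EMIT[cur][sym]
    return prob


def viterbi(seq):
    """Return (probability, path) of the single most probable state path."""
    best = {s: (START[s] * EMIT[s][seq[0]], [s]) for s in STATES}
    for sym in seq[1:]:
        step = {}
        for cur in STATES:
            prob, path = max(
                ((best[prev][0] * TRANS[prev][cur], best[prev][1]) for prev in STATES),
                key=lambda candidate: candidate[0],
            )
            step[cur] = (prob * EMIT[cur][sym], path + [cur])
        best = step
    return max(best.values(), key=lambda candidate: candidate[0])


def best_labeling(seq):
    """Sum path probabilities per labeling and return the most probable one."""
    totals = defaultdict(float)
    for path in product(STATES, repeat=len(seq)):
        labels = "".join(LABEL[s] for s in path)
        totals[labels] += path_probability(path, seq)
    labels, prob = max(totals.items(), key=lambda item: item[1])
    return prob, labels


vit_prob, vit_path = viterbi(SEQ)
lab_prob, labels = best_labeling(SEQ)
print("Viterbi labels:", "".join(LABEL[s] for s in vit_path), vit_prob)
print("Best labeling: ", labels, lab_prob)
```

Because a labeling collects the probability of every path that produces it, the best labeling is always at least as probable as the Viterbi path, and the two label strings need not be the same.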

The output of the raw probabilities is based upon the forward-backward algorithm, in which TransMem finds the probability of each label (Inside, Outside, or Helix) at each position, summed over all the possible alignments of the sequence to the model. Because these values are based upon all possible alignments of the sequence to the model instead of a single optimal alignment, the raw probabilities will occasionally contradict the final labeling.
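
As an illustration of per-position label probabilities, here is a minimal forward-backward sketch on a toy model. The states, probabilities, and two-letter alphabet are assumptions made for this example and do not reflect TransMem's trained model or its -RAWProb output format.

```python
# Hypothetical toy model; two core states share the Helix label.
STATES = ["loop_in", "core_a", "core_b", "loop_out"]
LABEL = {"loop_in": "I", "core_a": "H", "core_b": "H", "loop_out": "O"}
START = {"loop_in": 0.5, "core_a": 0.0, "core_b": 0.0, "loop_out": 0.5}
TRANS = {
    "loop_in": {"loop_in": 0.6, "core_a": 0.2, "core_b": 0.2, "loop_out": 0.0},
    "core_a": {"loop_in": 0.1, "core_a": 0.8, "core_b": 0.0, "loop_out": 0.1},
    "core_b": {"loop_in": 0.1, "core_a": 0.0, "core_b": 0.8, "loop_out": 0.1},
    "loop_out": {"loop_in": 0.0, "core_a": 0.2, "core_b": 0.2, "loop_out": 0.6},
}
EMIT = {
    "loop_in": {"h": 0.3, "p": 0.7},
    "core_a": {"h": 0.9, "p": 0.1},
    "core_b": {"h": 0.7, "p": 0.3},
    "loop_out": {"h": 0.4, "p": 0.6},
}
SEQ = "pphhhpp"  # toy alphabet: h = hydrophobic, p = polar


def forward(seq):
    """f[t][s]: probability of seq[:t+1] with the path in state s at position t."""
    f = [{s: START[s] * EMIT[s][seq[0]] for s in STATES}]
    for sym in seq[1:]:
        prev = f[-1]
        f.append({
            cur: EMIT[cur][sym] * sum(prev[p] * TRANS[p][cur] for p in STATES)
            for cur in STATES
        })
    return f


def backward(seq):
    """b[t][s]: probability of seq[t+1:] given state s at position t."""
    b = [{s: 1.0 for s in STATES}]
    for sym in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {
            p: sum(TRANS[p][cur] * EMIT[cur][sym] * nxt[cur] for cur in STATES)
            for p in STATES
        })
    return b


def label_posteriors(seq):
    """Per-position label probabilities, summed over states sharing a label."""
    f, b = forward(seq), backward(seq)
    total = sum(f[-1].values())  # probability of the whole sequence
    result = []
    for f_t, b_t in zip(f, b):
        probs = {"I": 0.0, "H": 0.0, "O": 0.0}
        for s in STATES:
            probs[LABEL[s]] += f_t[s] * b_t[s] / total
        result.append(probs)
    return result


posteriors = label_posteriors(SEQ)
for residue, probs in zip(SEQ, posteriors):
    print(residue, {label: round(p, 3) for label, p in probs.items()})
```

Choosing the most probable label independently at each position can produce a label string that differs from the single best labeling, which is why the raw probabilities and the final prediction may disagree.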

CONSIDERATIONS

When no transmembrane helices are predicted, it is not a good idea to treat the Inside/Outside prediction as an accurate measure of whether or not the peptide is secreted. The inner and outer labeling is only meaningful for integral membrane proteins.

When using the N-Best algorithm, you can also choose to merge predictions with a given overlap. The boundaries of transmembrane helices have an experimental error of a few residues, a fact which was incorporated into the training of the model architecture. By allowing a merging of overlapping predictions, TransMem allows you to blur edges of the predicted helices, which in turn will cause the N-Best algorithm to generate predictions with significant differences.
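
One plausible reading of this merging rule can be sketched as follows. This is an assumption for illustration, not TransMem's documented implementation: two predictions are treated as equivalent when they contain the same sequence of labels and every feature boundary differs by no more than the chosen proximity. The function name `equivalent` is hypothetical.

```python
def equivalent(pred_a, pred_b, proximity):
    """Return True if two predictions differ only by small boundary shifts.

    Each prediction is a list of (label, begin, end) tuples. The rule used
    here is an assumption made for this sketch.
    """
    if [f[0] for f in pred_a] != [f[0] for f in pred_b]:
        return False
    return all(
        abs(a_begin - b_begin) <= proximity and abs(a_end - b_end) <= proximity
        for (_, a_begin, a_end), (_, b_begin, b_end) in zip(pred_a, pred_b)
    )


# The two GAD_MOUSE predictions from the example output.
first = [
    ("Outside", 1, 249), ("Helix", 250, 272), ("Inside", 273, 278),
    ("Helix", 279, 297), ("Outside", 298, 311), ("Helix", 312, 334),
    ("Inside", 335, 425), ("Helix", 426, 448), ("Outside", 449, 449),
]
second = [
    ("Outside", 1, 248), ("Helix", 249, 271), ("Inside", 272, 277),
    ("Helix", 278, 296), ("Outside", 297, 310), ("Helix", 311, 333),
    ("Inside", 334, 425), ("Helix", 426, 448), ("Outside", 449, 449),
]

print(equivalent(first, second, 0))
print(equivalent(first, second, 1))
```

Under this rule the two example predictions are distinct at a proximity of 0, because their boundaries differ by one residue, and would be merged at a proximity of 1.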

The N-Best algorithm will always try to find the best labeling of your sequence that matches the parameters of minimum and maximum number of helices, even if this is not the best overall labeling. For example, if you have some other experimental evidence that suggests you are working with a 7 transmembrane protein, yet the algorithm gives you a prediction of 8 transmembrane helices, you can specify a minimum and maximum helix range of 7, which will force the algorithm to find this prediction. If the application is not able to find any matching predictions, try increasing the value of N-Best, which will increase the number of different predictions that the algorithm will consider.

Increasing the number of predictions generated increases the number of alternative labelings that TransMem analyzes. Consequently, you may see weakly predicted helices that would otherwise not be visible, as well as many more false predictions. Conversely, a helix that appears in a large number of predictions is more likely to be an actual helix and not a false positive. Because more predictions are considered, computation time also increases dramatically with the value for N-Best.

Because the N-Best algorithm tries to find a prediction that matches the restrictions you set, it may not be useful for screening protein sequences for a given number of transmembrane helices. Instead, we recommend using the Viterbi algorithm, which is more discriminating and runs faster.

If you have a sequence for which you have experimental evidence of a particular number of transmembrane helices, yet the algorithm does not predict the correct number, specify this number with -MINHelix and -MAXHelix, then try increasing the value for N-Best and the tolerance for merging overlapping predictions. In some cases, this will allow the algorithm to find the helices.

If you are screening large amounts of data for 7 transmembrane proteins, for example, it probably isn't a good idea to limit the search to predictions of exactly 7 transmembrane regions. Instead, a more complete search can be made by looking for anything containing 6-8 transmembrane regions.
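
A post-processing filter for such a range might look like the sketch below. It assumes list-file rows of the form "name ! helices inside outside score", as in the example output. The `demo:` entries are made-up rows used only to exercise the filter, and the function names are hypothetical.

```python
import re

# Matches rows such as "sw:gad_mouse         !            4            2            3     0.0000".
LIST_ROW = re.compile(r"^(\S+)\s+!\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+\.\d+)\s*$")


def parse_list_rows(lines):
    """Return (name, helices, inside, outside, score) tuples from list rows."""
    rows = []
    for line in lines:
        match = LIST_ROW.match(line.strip())
        if match:
            name, helices, inside, outside, score = match.groups()
            rows.append((name, int(helices), int(inside), int(outside), float(score)))
    return rows


def within_helix_range(rows, min_helix, max_helix):
    """Keep rows whose helix count lies in [min_helix, max_helix]."""
    return [row for row in rows if min_helix <= row[1] <= max_helix]


sample = [
    "sw:gad_mouse         !            4            2            3     0.0000",
    "demo:six_tm          !            6            4            3     0.0000",  # made up
    "demo:seven_tm        !            7            4            4     0.0000",  # made up
    "demo:eight_tm        !            8            4            5     0.0000",  # made up
]

candidates = within_helix_range(parse_list_rows(sample), 6, 8)
print([row[0] for row in candidates])
```

Filtering on a 6-8 window after the run keeps near misses in view, so sequences with one helix missed or one spurious helix are not silently discarded.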

TransMem only recognizes transmembrane alpha helices. All other types of membrane spanning regions are not recognized.

Since TransMem will produce a self-consistent topology prediction, if it misses any transmembrane helices, the topology will be wrong.

Predicted transmembrane helices in the N-terminal region sometimes turn out to be signal peptides.

SCIENTIFIC VALIDITY

A non-redundant data set of 148 sequences, composed of all known transmembrane proteins (Möller et al., 2001), was used for validation of this program. The data set was run through the public server and through this implementation.

All except 8 sequences showed identical results (95% identical). NB: When the predictions differed, this program found the other prediction as the second best answer.

Of these 8 differences, 4 (COX2_BOVIN, IMMA_CITFR, RCEL_RHOVI, and TCR2_ECOLI) only differed in the exact positions of the helix boundaries. All predicted helices from the two implementations overlapped by at least 16 residues, and the topology predictions were identical.
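
The overlap between two predicted helices can be computed from their boundaries. The following sketch uses inclusive residue coordinates and, because the validation data are not reproduced here, takes its example from the first helix of the two GAD_MOUSE predictions shown in the OUTPUT section. The function name is hypothetical.

```python
def helix_overlap(helix_a, helix_b):
    """Number of residues shared by two (begin, end) ranges, ends inclusive."""
    begin = max(helix_a[0], helix_b[0])
    end = min(helix_a[1], helix_b[1])
    return max(0, end - begin + 1)


# First helix of each GAD_MOUSE prediction in the example output.
shared = helix_overlap((250, 272), (249, 271))
print(shared)
```

Helices shifted by one residue, as in this example, share all but one position, so they would comfortably pass a 16-residue overlap criterion.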

There were 2 of the 8 proteins (COXH_BOVIN and CYB_RHOSH) where the topology predictions of the two implementations were reversed in addition to minor helix boundary differences. This implementation was correct for COXH_BOVIN and the public server was correct for CYB_RHOSH.

In the final 2 sequences (CITN_KLEPN and CYOB_ECOLI), the two predictions differed in the presence or absence of a given TM helix. This implementation correctly found an additional helix in CITN_KLEPN. For CYOB_ECOLI, the public server correctly found an additional TM helix that this implementation did not find.

In conclusion, the two implementations are scientifically comparable. Half of the differences could be attributed to minor variation in TM helix boundaries, which are not significant differences, due to the inherent uncertainty in experimental determination of the helix boundaries. When the different implementations gave significant differences, there was an even split between which answer was correct.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % transmem [-INfile=]sw:gad_mouse -Default

Prompted Parameters:

[-OUTfile=]gad_mouse.transmem  names the output file

-NBest=1                Number of different annotations of each sequence

-PROXimity=0            Proximity of feature boundaries to consider annotations equivalent

Optional Parameters:

-RSF[=transmem.rsf]      save predicted domains as features in an RSF file

-MEthod=Nbest,Viterbi    selects which method to use to generate the prediction. By default, Nbest is selected. (Selecting Viterbi suppresses -NBest and -PROXimity.)

-RAWProb                 writes out the raw probabilities of each label for each sequence character

-MAXHelix=10             only show proteins with at most this many transmembrane helices (default is unlimited)

-MINHelix=1              only show proteins with at least this many transmembrane helices (default is 1)

-MONitor                 displays screen trace of progress

-NOSUMmary               suppresses screen summary at the end of the program
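
When runs are scripted, the command line can be assembled from the parameters summarized above. The helper below is a hypothetical sketch that only builds the argument list from flags named in this summary. It does not run TransMem, and its name and keyword arguments are inventions for this example.

```python
import shlex


def build_command(infile, method=None, nbest=None, proximity=None,
                  min_helix=None, max_helix=None):
    """Assemble a non-interactive transmem command as a list of arguments."""
    args = ["transmem", infile, "-Default"]
    if method is not None:
        args.append(f"-MEthod={method}")
    # The summary states that selecting Viterbi suppresses -NBest and
    # -PROXimity, so they are omitted in that case.
    if method != "Viterbi":
        if nbest is not None:
            args.append(f"-NBest={nbest}")
        if proximity is not None:
            args.append(f"-PROXimity={proximity}")
    if min_helix is not None:
        args.append(f"-MINHelix={min_helix}")
    if max_helix is not None:
        args.append(f"-MAXHelix={max_helix}")
    return args


two_best = build_command("sw:gad_mouse", nbest=2, proximity=0)
screen = build_command("sw:gad_mouse", method="Viterbi", nbest=5,
                       min_helix=6, max_helix=8)
print(shlex.join(two_best))
print(shlex.join(screen))
```

Building a list rather than a single string keeps each parameter a separate argument if the list is later passed to a process launcher.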

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-NBest

Specify how many predictions you want to see. The most likely prediction is listed first. Note that larger numbers of predictions can greatly reduce program speed.

-PROXimity

When using the N-Best algorithm, treat predictions as equivalent, and merge them, when their feature boundaries lie within this many residues of each other. This allows you to avoid lists of predictions that are not functionally different.

-MEthod=Viterbi

Use the Viterbi algorithm for predictions instead of the N-Best. This algorithm is faster than the N-Best, but can only generate a single prediction per sequence. If -MEthod=Viterbi is used, the values for -NBest and for -PROXimity are ignored.

-RAWProb

Output the raw probabilities for observing each label at each sequence character. These values are based upon the forward-backward algorithm and may not agree with the final predicted label.

-MAXHelix

Limit the output to include only proteins with this many transmembrane helices or fewer. By default, the maximum number of helices is unlimited.

-MINHelix

Limit the output to include only proteins with this many or more transmembrane helices. If this value is greater than the value specified with -MAXHelix, the value for -MAXHelix is used. By default, the output only includes proteins with one or more helices.

-SUMmary

Writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: May 27, 2005 14:57

Technical Support: support-us@accelrys.com, support-japan@accelrys.com, or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.