MEME

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

FUNCTION

MEME finds conserved motifs in a group of unaligned sequences. MEME saves these motifs as a set of profiles. You can search a database of sequences with these profiles using the MotifSearch program.

DESCRIPTION

MEME uses the method of Bailey and Elkan to identify likely motifs within the input set of sequences. You may specify a range of motif widths to target, as well as the number of unique motifs to search for. MEME uses Bayesian probability to incorporate prior knowledge of the similarities among amino acids into its predictions of likely motifs. The resulting motifs are output as profiles. A profile is a log-odds matrix used to judge how well an unknown sequence segment matches the motif.
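
To make the role of a profile concrete, here is a minimal Python sketch of scoring every ungapped window of a sequence against a log-odds matrix. The three-column matrix, the default score for unlisted letters, and the function names are invented for illustration; they are not taken from MEME or from a real profile file.

```python
# Toy log-odds profile: one dict per motif column (hypothetical values).
TOY_PROFILE = [
    {"C": 5.7, "S": -2.4},
    {"D": 4.1, "E": -1.1},
    {"K": 3.9, "R": 0.5},
]
DEFAULT_SCORE = -4.0  # assumed penalty for letters not listed in a column


def score_windows(sequence, profile, default=DEFAULT_SCORE):
    """Return (start, score) for every ungapped window of the profile's width."""
    width = len(profile)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # The window score is the sum of the per-column log-odds values.
        score = sum(col.get(res, default) for col, res in zip(profile, window))
        results.append((start, score))
    return results


def best_match(sequence, profile):
    """Return the (start, score) pair with the highest score."""
    return max(score_windows(sequence, profile), key=lambda pair: pair[1])


start, score = best_match("GTICDKTGCER", TOY_PROFILE)
print(start, round(score, 1))
```

A high-scoring window is one whose residues are individually favored by the corresponding columns; a search program would typically report windows whose score exceeds some cutoff.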

EXAMPLE

Here is a session with MEME that was used to find motifs in a group of calcium-transporting membrane proteins listed in the file pircat.list.

% meme

 Find motifs in what sequences? @pircat.list

 How many motifs should I search for (* 6 *) ?

 What should I call the profile file (* meme.prf *) ?

 What should I call the report file (* meme.meme *) ?

 Reading sequences ...

  PIR2:A42764                (     919 aa)

  PIR2:S71168                (     946 aa)

  PIR1:PWBYR1                (     950 aa)

  PIR2:S24359                (     994 aa)

  PIR2:A32792                (     994 aa)

  PIR2:A48849                (     994 aa)

  PIR2:B31981                (     997 aa)

   Identifying motifs in: 7 sequences

  Shortest sequence (aa): 919

   Longest sequence (aa): 997

                Total aa: 6794

   Finding 1st motif

   Testing starts of width 8 ... done

   Testing starts of width 11 ... done

   Testing starts of width 15 ... done

   Testing starts of width 21 ... done

   Testing starts of width 29 ... done

   Testing starts of width 41 ... done

   Testing starts of width 57 ... done

     Running EM from 21 starting motifs ......... done

   Finding 2nd motif

   Testing starts of width 8 ... done

///////////////////////////////////////////////////////////

 Search completed after finding the 6 motifs requested.

               Sequences searched: 7

      Number of motifs identified: 6

              Output profile file: meme.prf

                    Output report: meme.meme

OUTPUT

MEME generates a report and a file containing one or more ungapped GCG profiles. (See CONSIDERATIONS for notes on how this "multiple profile file" differs from earlier versions of profile files.)

MEME's report file gives details about the motifs that help you analyze the validity and usefulness of the results. The file first lists the training set, that is, the input sequences. ("Training set" is a common term for a set of examples from which an intelligent program learns a general concept.) After echoing the parameters you specified, the file gives a detailed description of each motif found. This description includes three different representations of the motif: two versions of a letter-probability matrix, and a consensus sequence showing all likely letters for each position. (A fourth representation is the ungapped profile that is written to the other output file.) Six different types of information are presented for each motif:

- The simplified letter-probability matrix shows probabilities for each letter at each position of the motif. (Probabilities are multiplied by 10 and displayed as integers. Values below 0.5 are displayed as ':'. Values above 9.5 are displayed as 'a'.)

- The information content bar graph shows how many bits of information are provided by each position in the motif. This is a measure of how well-conserved the positions of the motif are.

- The multilevel consensus sequence shows, for each position, all letters with a probability >= 0.2 of appearing in that position.

- The BLOCKS format section uses Henikoff's BLOCKS format to display occurrences of the motif within the sequences of the training set.

- The list of possible examples shows the highest scoring matches to the motif, with scores and sequence context included.

- The letter probability matrix shows the probabilities for each letter at each position of the matrix. Note that this matrix is transposed with respect to the simplified letter-probability matrix. That is, the first row of the simplified matrix corresponds to the first column of this matrix.
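
The three derived representations in the list above can be sketched in a few lines of Python. This is an illustration under assumptions, not MEME's code: the function names are invented, the rounding is assumed to be "round half up", and the information content is assumed to be the relative entropy of a column against background letter frequencies (the text above says only that it is measured in bits).

```python
import math


def simplified_symbol(p):
    """Encode one probability as described for the simplified matrix."""
    x = p * 10
    if x < 0.5:
        return ":"
    if x > 9.5:
        return "a"
    return str(min(9, int(x + 0.5)))  # assumed rounding: half up, capped at 9


def multilevel_consensus(column, cutoff=0.2):
    """Letters with probability >= cutoff, most probable first."""
    hits = [(p, letter) for letter, p in column.items() if p >= cutoff]
    hits.sort(key=lambda item: (-item[0], item[1]))
    return "".join(letter for _, letter in hits)


def information_content(column, background):
    """Relative entropy in bits of one column against background frequencies."""
    return sum(p * math.log2(p / background[letter])
               for letter, p in column.items() if p > 0)


# Hypothetical column over a reduced alphabet, for illustration only.
column = {"T": 0.72, "S": 0.21, "K": 0.07}
print("".join(simplified_symbol(p) for p in column.values()))
print(multilevel_consensus(column))
```

With this encoding, the values 0.960896, 0.883780, and 0.053144 from the letter-probability matrix in the sample output map to 'a', '9', and '1', which agrees with the corresponding entries of the simplified matrix shown there.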

For more details about the output, consult Tim Bailey's MEME website at http://www.sdsc.edu/MEME. (Note that the log-odds matrices referred to at the website correspond to the profiles that Accelrys GCG (GCG) writes to a separate output file.)

Here is some of the output from the EXAMPLE:

********************************************************************************

TRAINING SET

********************************************************************************

DATAFILE= @gendocdata:pircat.list

ALPHABET= ACDEFGHIKLMNPQRSTVWY

Sequence name           Weight Length  Sequence name           Weight Length

-------------           ------ ------  -------------           ------ ------

PIR2:A42764             1.0000    919  PIR2:S71168             1.0000    946

PIR1:PWBYR1             1.0000    950  PIR2:S24359             1.0000    994

PIR2:A32792             1.0000    994  PIR2:A48849             1.0000    994

PIR2:B31981             1.0000    997

********************************************************************************

meme @gendocdata:pircat.list

********************************************************************************

MOTIF  1               width =  14    sites =  7.0

********************************************************************************

Simplified     A  ::::::::::::::

motif letter-  C  :a::::::::::::

probability    D  :::9::::::::::

matrix         E  ::::::::::::1:

               F  ::::::::::::::

               G  ::::::9:::::::

               H  ::::::::::::1:

               I  8:::::::::::::

               K  ::::9:::::1:::

               L  ::::::::9:::::

               M  :::::::::::::9

               N  :::::::::::9::

               P  ::::::::::::::

               Q  ::::::::::::7:

               R  ::::::::::::::

               S  ::9:::::::1:::

               T  :::::9:9:97:::

               V  1:::::::::::::

               W  ::::::::::::::

               Y  ::::::::::::::

         bits 6.2

              5.6

              5.0  *

              4.4  *           *

Information   3.7  *           *

content       3.1  * ***** * * *

(46.1 bits)   2.5 ********** ***

              1.9 **************

              1.2 **************

              0.6 **************

              0.0 --------------

Multilevel        ICSDKTGTLTTNQM

consensus

sequence

--------------------------------------------------------------------------------

        Motif 1 in BLOCKS format

--------------------------------------------------------------------------------

BL   MOTIF 1 width=14 seqs=7

PIR2:A42764 (  347) ICSDKTGTLTKNEM  1

PIR2:S71168 (  453) ICSDKTGTLTTNHM  1

PIR1:PWBYR1 (  368) ICSDKTGTLTSNHM  1

PIR2:S24359 (  348) ICSDKTGTLTTNQM  1

PIR2:A32792 (  348) ICSDKTGTLTTNQM  1

PIR2:A48849 (  348) ICSDKTGTLTTNQM  1

PIR2:B31981 (  348) ICSDKTGTLTTNQM  1

//

---------------------------------------------------------------------------

        Possible examples of motif 1 in the training set

---------------------------------------------------------------------------

Sequence name             Start  Score                 Site

-------------             -----  -----            --------------

PIR2:A42764                 347  49.31 IVETLGCCNV ICSDKTGTLTKNEM TVTHILTSDG

PIR2:S71168                 453  54.45 ACETMGSATT ICSDKTGTLTTNHM TVVKACICEQ

PIR1:PWBYR1                 368  51.67 SVETLGSVNV ICSDKTGTLTSNHM TVSKLWCLDS

PIR2:S24359                 348  57.17 SVETLGCTSV ICSDKTGTLTTNQM SVCRMFVIDK

PIR2:A32792                 348  57.17 SVETLGCTSV ICSDKTGTLTTNQM SVCKMFIVDK

PIR2:A48849                 348  57.17 SVETLGCTSV ICSDKTGTLTTNQM SVCKMFIIDK

PIR2:B31981                 348  57.17 SVETLGCTSV ICSDKTGTLTTNQM SVCRMFILDR

---------------------------------------------------------------------------

letter-probability matrix: alength= 20 w= 14 n= 6703

 0.008087  0.002302  0.002373  0.002713  0.006260  0.002731  . . . 0.003240

 0.005496  0.960896  0.001406  0.002052  0.001540  0.001695  . . . 0.000837

 0.018012  0.003426  0.003942  0.003053  0.002362  0.007243  . . . 0.002152

 0.007451  0.001425  0.883780  0.029280  0.002250  0.005306  . . . 0.002494

 0.006351  0.001550  0.002156  0.003411  0.001104  0.002936  . . . 0.001393

 0.011903  0.002867  0.003468  0.003026  0.002450  0.004654  . . . 0.001764

 0.015497  0.001668  0.007345  0.005243  0.001717  0.913036  . . . 0.001877

 0.011903  0.002867  0.003468  0.003026  0.002450  0.004654  . . . 0.001764

 0.006462  0.002132  0.001649  0.002919  0.012373  0.002479  . . . 0.003691

 0.011903  0.002867  0.003468  0.003026  0.002450  0.004654  . . . 0.001764

 0.017689  0.004707  0.006756  0.006618  0.004024  0.005798  . . . 0.002820

 0.005775  0.001939  0.011769  0.003912  0.002961  0.006143  . . . 0.002930

 0.014075  0.002380  0.007727  0.053144  0.002750  0.004727  . . . 0.002803

 0.003448  0.001568  0.001422  0.001525  0.002821  0.001914  . . . 0.003225

********************************************************************************

MOTIF  2               width =  14    sites =  7.0

********************************************************************************

////////////////////////////////

 Search completed after finding the 6 motifs requested.

And here is an excerpt from the profile file:

!!AA_PROFILE 2.0

(Peptide) ..

MEME v2.2 of: @gendocdata:pircat.list  Length: 14

!  Sequences: 7  MaxScore: 1.00  January 7, 2002 15:37

!PIR2:A42764  From: 347       To: 360       Weight: 1.000000

!PIR2:S71168  From: 453       To: 466       Weight: 1.000000

!PIR1:PWBYR1  From: 368       To: 381       Weight: 1.000000

!PIR2:S24359  From: 348       To: 361       Weight: 1.000000

!PIR2:A32792  From: 348       To: 361       Weight: 1.000000

!PIR2:A48849  From: 348       To: 361       Weight: 1.000000

!PIR2:B31981  From: 348       To: 361       Weight: 1.000000

                                               Gap: 1.00              Len: 1.00

                     GapRatio: 0.0          LenRatio: 0.0

Cons   A      C      D      E      F      G      H  . . .    W      Y   Gap  Len

 I   -317   -297   -444   -452   -268   -466   -424 . . .  -383   -333  100  100

! 1

 C   -373    572   -520   -492   -470   -535   -454 . . .  -536   -528  100  100

 S   -202   -240   -371   -435   -409   -325   -357 . . .  -412   -392  100  100

 D   -329   -367    409   -109   -416   -370   -252 . . .  -391   -371  100  100

 K   -352   -355   -458   -419   -518   -456   -345 . . .  -400   -455  100  100

 T   -261   -266   -389   -436   -403   -389   -351 . . .  -403   -421  100  100

 G   -223   -344   -281   -357   -455    371   -327 . . .  -379   -412  100  100

 T   -261   -266   -389   -436   -403   -389   -351 . . .  -403   -421  100  100

 L   -350   -309   -497   -441   -170   -480   -360 . . .  -312   -314  100  100

 T   -261   -266   -389   -436   -403   -389   -351 . . .  -403   -421  100  100

 T   -204   -194   -293   -323   -332   -357   -258 . . .  -335   -353  100  100

! 11

 N   -366   -322   -213   -399   -376   -349    -96 . . .  -331   -347  100  100

 Q   -237   -293   -274    -23   -387   -387    147 . . .  -285   -354  100  100

 M   -440   -353   -518   -535   -383   -517   -451 . . .  -318   -334  100  100

*       0      7      7      1      0      7      2 . . .     0      0

MEME v2.2 of: @gendocdata:pircat.list  Length: 14

!  Sequences: 7  MaxScore: 1.00  January 7, 2002 15:38

///////////////////////////////////////////////////////////////////////////////

*      12      1     21      4      1     16      2 . . .     0      0

INPUT FILES

The input to MEME is a set of either nucleotide or protein sequences (not both). The function of MEME depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

MEME respects the begin and end attributes for controlling the range of interest for sequences in list files (but see RESTRICTIONS, below). MEME also respects the strand list file attribute for nucleotide sequences.

RELATED PROGRAMS

PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. ProfileMake creates a position-specific scoring table, called a profile, that quantitatively represents the information from a group of aligned sequences. The profile can then be used for database searching (ProfileSearch) or sequence alignment (ProfileGap). ProfileSearch uses a profile (representing a group of aligned sequences) as a query to search the database for new sequences with similarity to the group. The profile is created with the program ProfileMake. ProfileScan uses a database of profiles to find structural and sequence motifs in protein sequences. ProfileGap makes an optimal alignment between a profile and one or more sequences.

MEME's output can best be appreciated by running the output profiles through MotifSearch, another program in GCG. You will probably want to run MotifSearch at least twice. First, you should use the profiles to search the original training set of sequences. Second, you may wish to search a larger database to identify similar sequences. See the documentation for MotifSearch for details.

MEME+ finds conserved motifs in a group of unaligned sequences. MEME+ saves these motifs as a set of profiles. You can search a database of sequences with these profiles using the MotifSearch program.

RESTRICTIONS

You can analyze at most 1,000,000 residues at one time.

If you wish to use both strands of nucleotide sequences, you must specify the one-per model (described in ALGORITHM, below) via the -ONEEXactly parameter.

MEME cannot process multiple sequences with the same name. If MEME encounters a second sequence with the identical name as a previous one, it will ignore the second. Thus, you cannot analyze several segments of a single sequence by creating several list file entries of that sequence and specifying different begin and end attributes for each entry.
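
Before a long run it can be useful to check a training set against these restrictions. The following Python sketch is a hypothetical pre-check, not part of GCG: it assumes you already have the sequence names and lengths as (name, length) pairs, keeps the first entry for each name, and reports the total residue count against the 1,000,000 limit.

```python
MAX_RESIDUES = 1_000_000  # limit stated under RESTRICTIONS


def check_training_set(entries):
    """Split (name, length) pairs into kept and ignored entries.

    Mirrors the documented behavior: a second sequence with the same
    name as an earlier one is ignored.
    """
    seen = set()
    kept, ignored = [], []
    for name, length in entries:
        if name in seen:
            ignored.append((name, length))
            continue
        seen.add(name)
        kept.append((name, length))
    total = sum(length for _, length in kept)
    return kept, ignored, total


entries = [("PIR2:A42764", 919), ("PIR2:S71168", 946), ("PIR2:A42764", 919)]
kept, ignored, total = check_training_set(entries)
print(len(kept), "kept,", len(ignored), "ignored,", total, "residues")
print("within limit:", total <= MAX_RESIDUES)
```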

ALGORITHM

MEME implements the method of Bailey and Elkan (see ACKNOWLEDGMENTS), to find one or more motifs that characterize a family of sequences. The core of MEME is Expectation Maximization (EM), an unsupervised learning algorithm guaranteed to converge to a local maximum. That is, any motif found by MEME will be "better" (according to MEME's statistical criteria) than any other motif that differs infinitesimally from the first.

One of the criteria applied by MEME depends on your choice of a model. MEME can either a) favor motifs that appear exactly once in each sequence in the training set (the one-per model); b) favor motifs that appear zero or one time in each sequence in the training set (the zero-or-one-per model); or c) give no preference to the number of occurrences (the zero-or-more-per model).

MEME makes use of Dirichlet priors in its EM calculations for protein sequences. These are empirical statistical measures of the interchangeability of amino acids within subsequences of similar function. Suppose there are two amino acid sequences, S1 and S2, having the same length. If the first residue in S1 is I, and the first residue in S2 is V, then there is some likelihood that S1 and S2 have the same function, given their similarity in the first position. We can estimate that likelihood by analyzing the set of subsequences whose functionality is established.
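
The effect of such a prior can be illustrated with the simplest case, a single Dirichlet prior acting as position-independent pseudocounts. This is a simplified sketch: the reduced four-letter alphabet and the prior weights are invented for illustration, and the priors in MEME's data file are presumably more elaborate than a single set of pseudocounts.

```python
def posterior_mean(counts, alpha):
    """Posterior mean letter probabilities under a single Dirichlet prior.

    counts: observed letter counts in one motif column.
    alpha:  prior pseudocounts, one per letter of the alphabet.
    """
    total = sum(counts.get(a, 0) + alpha[a] for a in alpha)
    return {a: (counts.get(a, 0) + alpha[a]) / total for a in alpha}


# Hypothetical prior that treats I, V, and L as interchangeable to a degree
# and makes D unlikely in the same column.
ALPHA = {"I": 0.8, "V": 0.8, "L": 0.6, "D": 0.1}

observed = {"I": 5, "V": 2}  # a column with five I and two V
probs = posterior_mean(observed, ALPHA)
for letter, p in probs.items():
    print(letter, round(p, 3))
```

Although neither L nor D was observed, the prior gives L a noticeably higher probability than D, which is the sense in which prior knowledge of amino acid similarity enters the estimate.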

A drawback to EM is that the maximum it finds is only local. There may be better solutions that were overlooked due to an unlucky choice of the starting point -- EM's initial guess at the solution. This is a nontrivial and heavily studied problem. One approach is to run the algorithm from a large subset of the possible starting points. You may choose the subset to be evenly distributed across the solution space, or to be randomly selected. In any case, this may take a daunting amount of time.

MEME refines this approach by taking a carefully chosen subset of possible solutions and running a single iteration of EM on each. It then chooses one from among these as its best candidate, and runs EM to convergence from there. When searching for a starting point, MEME does not consider all possible starting points within the range of widths it is given; rather, it surveys starting points at particular steps within the range given. Thus, if using the default range of 8 to 57, MEME will only consider initial motifs whose widths are in the set {8, 11, 15, 21, 29, 41, 57}.
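
The strategy just described (one EM iteration from each candidate start, then EM to convergence from the best candidate) can be sketched for the one-per model on a toy nucleotide training set. Everything below is an illustrative simplification rather than MEME's implementation: the uniform background, the pseudocount, the seed weight of 0.7, the use of every subsequence as a candidate start, and all function names are assumptions.

```python
import math

ALPHABET = "ACGT"
BACKGROUND = 0.25   # assumed uniform background frequency
PSEUDOCOUNT = 0.1   # assumed pseudocount to avoid zero probabilities


def seed_motif(seed, weight=0.7):
    """Turn a subsequence into a starting motif: each seed letter gets `weight`."""
    rest = (1.0 - weight) / (len(ALPHABET) - 1)
    return [{a: (weight if a == ch else rest) for a in ALPHABET} for ch in seed]


def site_ratios(seq, motif):
    """Motif-versus-background likelihood ratio for every start in seq."""
    w = len(motif)
    return [math.prod(motif[j][seq[i + j]] / BACKGROUND for j in range(w))
            for i in range(len(seq) - w + 1)]


def log_likelihood(seqs, motif):
    """Log likelihood ratio assuming exactly one site per sequence."""
    total = 0.0
    for s in seqs:
        r = site_ratios(s, motif)
        total += math.log(sum(r) / len(r))
    return total


def em_step(seqs, motif):
    """One EM iteration: weight every start (E), re-estimate columns (M)."""
    w = len(motif)
    counts = [{a: PSEUDOCOUNT for a in ALPHABET} for _ in range(w)]
    for s in seqs:
        r = site_ratios(s, motif)
        norm = sum(r)
        for i, ratio in enumerate(r):
            for j in range(w):
                counts[j][s[i + j]] += ratio / norm
    return [{a: c[a] / sum(c.values()) for a in ALPHABET} for c in counts]


def find_motif(seqs, width, threshold=1e-3, max_iterations=50):
    """Run one EM step from every seed, then EM to convergence from the best."""
    seeds = {s[i:i + width] for s in seqs for i in range(len(s) - width + 1)}
    candidates = [em_step(seqs, seed_motif(seed)) for seed in sorted(seeds)]
    motif = max(candidates, key=lambda m: log_likelihood(seqs, m))
    previous = log_likelihood(seqs, motif)
    for _ in range(max_iterations):
        motif = em_step(seqs, motif)
        current = log_likelihood(seqs, motif)
        if current - previous < threshold:
            break
        previous = current
    sites = []
    for s in seqs:
        r = site_ratios(s, motif)
        sites.append(r.index(max(r)))
    return motif, sites


# Toy training set: the same 7-mer planted between G/C-only flanks.
PLANTED = "GATTACA"
FLANKS = [("CCGCG", "CGGCC"), ("GC", "GGCGCCGC"),
          ("CGGCCGCGC", "G"), ("GGC", "GCGGCGC")]
TRAINING = [left + PLANTED + right for left, right in FLANKS]

motif, sites = find_motif(TRAINING, width=len(PLANTED))
consensus = "".join(max(col, key=col.get) for col in motif)
print(consensus, sites)
```

On this toy input the sketch recovers the planted 7-mer and its start position in each sequence. The `threshold` and `max_iterations` arguments play the same roles as the -EMTHReshold and -MAXEMiterations parameters, although the exact convergence test MEME applies may differ.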

Despite limiting the initial set of widths under consideration, MEME can find a motif of any width in the given range. This is due to a shortening technique that trims low-information columns from the ends of the motif. However, the motif will never be shortened below the minimum width specified for the search.
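
A minimal Python sketch of such end trimming follows. The criterion MEME actually uses to decide that a column carries too little information is not given here, so the fixed cutoff of 0.5 bits, the tie-breaking rule, and the function name are hypothetical; only the two documented properties (trimming from the ends, never below the minimum width) are taken from the text.

```python
def shorten(info, min_width, cutoff=0.5):
    """Trim low-information columns from the ends of a motif.

    info: per-column information content in bits.
    Returns (start, end) such that info[start:end] is the kept motif.
    """
    start, end = 0, len(info)
    while end - start > min_width:
        left, right = info[start], info[end - 1]
        if min(left, right) >= cutoff:
            break  # both end columns are informative enough
        # Drop whichever end column carries less information.
        if left <= right:
            start += 1
        else:
            end -= 1
    return start, end


bits = [0.1, 0.2, 2.0, 3.0, 2.5, 0.1]
print(shorten(bits, min_width=4))
print(shorten(bits, min_width=2))
```

With a minimum width of 4 the second column survives even though it is below the cutoff, because removing it would shorten the motif below the minimum.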

CONSIDERATIONS

Version 2.0 profile files

MEME generates a version 2.0 profile file, which permits multiple profiles to be included in one file. Version 2.0 profile files include an auxiliary data block (encased in {}'s) prior to each profile. This block contains parsable information, including the width of the profile and the column labels for the log-odds matrix.

When reading version 2.0 profile files generated by MEME, most GCG programs (e.g. ProfileSearch, ProfileGap) will read only the first profile found. At this time, the only exception is MotifSearch, which reads and processes all of the profiles.

Also note that MEME's profiles always have Gap and Len values of 100 -- MEME's profiles should always be thought of as ungapped. This is a characteristic of MEME, not of the version 2.0 profile file format.

For more details about version 2.0 profiles, see Appendix VII.

Time-complexity of the algorithm

MEME's algorithm for finding the best initial motifs of width W requires k * W * n^2 calculations, where k is an unknown constant (probably between 10 and 100) and n is the total number of residues in the input set. If you allow a large range of widths, this becomes very time-consuming. Searching with the default range of widths requires (8 + 11 + 15 + 21 + 29 + 41 + 57) = 182 iterations of k * n^2 calculations.
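
The widths tested in the EXAMPLE session (8, 11, 15, 21, 29, 41, 57) are consistent with each width being about the square root of 2 times the previous one, truncated to an integer. The Python sketch below uses that rule, which is an inference from the session log and not a documented property of MEME, to list the starting widths for a range and to compare the relative cost of two ranges. The function names are invented, and the constant k is left out because it is unknown.

```python
import math


def starting_widths(min_width=8, max_width=57):
    """Widths growing by about sqrt(2) per step (assumed rule), truncated."""
    widths = []
    w = min_width
    while w <= max_width:
        widths.append(w)
        w = max(w + 1, int(w * math.sqrt(2)))
    return widths


def relative_cost(widths, n):
    """Sum of W * n^2 over the tested widths, omitting the unknown constant k."""
    return sum(widths) * n ** 2


default = starting_widths()
narrow = starting_widths(8, 25)
n = 6794  # total residues in the EXAMPLE training set

print(default, sum(default))
print(narrow, sum(narrow))
print(relative_cost(default, n) / relative_cost(narrow, n))
```

Under these assumptions, restricting the maximum width to 25 leaves four widths to test and reduces the start-point search to roughly a third of the default cost.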

In any event, running on a training set of more than 20 or 30 typical proteins will require a lot of processor cycles.

Effects of the choice of model

By default, MEME assumes the zero-or-one-per model; that is, it assumes that each motif occurs at most once in each sequence in the training set, but may not occur at all in some sequences. This runs much faster than the zero-or-more-per model, in which a motif may occur any number of times in a sequence (about five times faster, going by the relative speeds given under PARAMETER REFERENCE). It is important to understand that using the zero-or-one-per model does not necessarily prevent MEME from finding motifs that are duplicated within a sequence; however, the zero-or-more-per model may rank such motifs higher relative to other candidates.

Multiple motifs

When told to look for more than one motif, MEME attempts to minimize the overlap between the current motif and any previously identified motifs.

SUGGESTIONS

Choosing the minimum and maximum search widths

As noted under CONSIDERATIONS, the algorithm slows down when searching large ranges of widths. If you have some idea of the width of the target motifs, you can (and should) restrict the range of allowable widths. This will save a lot of computation, especially if you can forego searching beyond a width of 25 or 30.

If the training set may include proteins that are not related to the family of interest, you might first run with -MINWidth and -MAXWidth both set to the same small number (perhaps 10 for proteins), and -NMOtifs set to 1 or 2. (Be sure to use the default zero-or-one-per model!) This may find a motif (possibly part of a larger motif) that discriminates between family and non-family members, allowing you to remove the unrelated proteins before running a more exhaustive MEME over a larger range of widths.

Finding repeats in a sequence

You can identify repeated motifs within a single sequence by specifying -ZEROORMore to choose the zero-or-more-per model (described in ALGORITHM).

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % meme [-INfile=]@pircat.list -Default

Prompted Parameters:

-BEGin=1 -END=100        sets the range of interest for all sequences

-REVerse                 uses the reverse strand of all sequences

[-OUTfile1=]meme.prf     specifies the output file of profiles

[-OUTfile2=]meme.meme    specifies the output report file

-NMOTifs=6               sets the maximum number of motifs to search for

Local Data Files:

-DATa=prior30.plib       specifies Dirichlet priors for proteins

Optional Parameters:

-ONEEXactly         requires each motif to occur exactly once in each sequence

-ONEORZero          allows each motif to occur up to once per sequence (default)

-ZEROORMore         allows motifs to occur any number of times in any sequence

-TWOStrands         searches both strands of nucleotide sequence

-MINWidth=8         requires motifs to be at least this wide

-MAXWidth=57        limits motifs to a maximum of this width

-EMTHReshold=.001   sets the convergence criterion for EM

-MAXEMiterations=50 stops EM after this many iterations without convergence

-NOSUMmary          suppresses report of run information to screen at exit

-NOMONitor          suppresses screen trace during processing

-NOREPort           suppresses creation of report file

-BATch              submits program to the batch queue
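
If you drive MEME from a script, it can help to check the documented parameter constraints before launching a run. The Python sketch below assembles an argument list from parameters named in the summary above and rejects the one combination this documentation explicitly forbids (-TWOStrands without -ONEEXactly). The helper name is hypothetical, the check that the minimum width does not exceed the maximum is an added assumption, and nothing here actually runs the program.

```python
MODELS = {"ONEEXactly", "ONEORZero", "ZEROORMore"}


def meme_command(infile, model="ONEORZero", two_strands=False,
                 min_width=8, max_width=57, nmotifs=6):
    """Assemble a meme argument list and check documented constraints."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if two_strands and model != "ONEEXactly":
        raise ValueError("-TWOStrands may be used only with -ONEEXactly")
    if min_width > max_width:  # assumed sanity check, not stated in the text
        raise ValueError("-MINWidth must not exceed -MAXWidth")
    args = ["meme", f"-INfile={infile}", f"-{model}",
            f"-MINWidth={min_width}", f"-MAXWidth={max_width}",
            f"-NMOtifs={nmotifs}"]
    if two_strands:
        args.append("-TWOStrands")
    args.append("-Default")  # suppress prompting, as in the minimal syntax
    return args


# A quick first pass as suggested under SUGGESTIONS: one fixed small width.
print(" ".join(meme_command("@pircat.list", min_width=10, max_width=10,
                            nmotifs=1)))
```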

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

When processing proteins, MEME uses a data file of Dirichlet priors for its Bayesian statistics. By default, the file is GenRunData:prior30.plib. Although it is possible to specify your own priors, it is not advised unless you have a very strong understanding of MEME's inner workings.

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-BEGin=1

Sets the beginning position for all input sequences. When the beginning position is set from the command line, MEME ignores beginning positions specified for individual sequences in a list file.

-END=100

Sets the ending position for all input sequences. When the ending position is set from the command line, MEME ignores ending positions specified for sequences in a list file.

-REVerse

Sets the program to use the reverse strand for each input sequence. When -REVerse or -NOREVerse is on the command line, MEME ignores any strand designation for individual sequences in a list file.

-NMOtifs=6

Gives the number of unique motifs for which to search.

-ONEEXactly

Specifies a model in which each motif should occur exactly once in every sequence in the training set. If a given motif gets a low score in any sequence, it is very unlikely to be chosen. This is the fastest model.

-ONEORZero

Specifies a model in which each motif should occur zero or one times in any sequence in the training set. If a given motif scores well at more than one position in a sequence, the motif might still be chosen, but the additional "hits" will not contribute to its score. This is the default model. This model is about two times slower than the -ONEEXactly model.

-ZEROORMore

Specifies a model in which each motif may occur any number of times in any sequence in the training set. In this case, additional "hits" after the first within a sequence will contribute to the motif's score. This model is about ten times slower than the -ONEEXactly model.

-TWOStrands

Searches forward and reverse strands of nucleotide sequences. This parameter may be used only with the -ONEEXactly parameter!

-MINWidth=8

Specifies the smallest acceptable motif for the search. When shortening the chosen motif, MEME will NOT shorten below this value.

-MAXWidth=57

Specifies the largest acceptable motif for the search. If -MINWidth is equal to -MAXWidth, MEME will either find a motif of that width, or find nothing at all.

-EMTHReshold=.001

Gives a convergence criterion for the EM phase of the algorithm. Raising this criterion will make MEME run faster, but give inferior results.

-MAXEMiterations=50

Overrules the convergence criterion given by -EMTHReshold. That is, if EM has failed to converge to the -EMTHReshold value after -MAXEMiterations iterations, the program will cut off the calculation and settle for its result to that point.

-SUMmary

Writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

-NOREPort

Tells the program not to generate a report file.

-BATch

Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

Printed: May 27, 2005 12:57

Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.