MOTIFS

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.

DESCRIPTION

Motifs looks for protein motifs by searching protein sequences for regular-expression patterns described in the PROSITE Dictionary. Motifs can recognize the patterns with some of the symbols mismatched, but not with gaps. Motifs can only be used to search for patterns in protein sequences.

There is a very informative abstract on every motif in the PROSITE Dictionary. These abstracts are included in the output if any motif is found in your sequence.

The PROSITE Dictionary was compiled and is maintained by Dr. Amos Bairoch of the University of Geneva.

EXAMPLE

Here is a session using Motifs to look for sequence motifs in PIR:Kihua:

% motifs

 MOTIFS from what protein sequence(s) ?  PIR:Kihua

 What should I call the output file (* kihua.motifs *) ?

               KIHUA len:        194 ................................

             Total finds:          1

            Total length:        194

         Total sequences:          1

          CPU time (sec):       3.60

             Output file:"kihua.motifs"

OUTPUT

Here is some of the output file:

 MOTIFS from: PIR:Kihua

 Mismatches: 0                September 25, 1998 11:39  ..

               KIHUA  Check: 1665  Length: 194   ! adenylate kinase (EC 2.7.4.3)

 1 - human

______________________________________________________________________________

Adenylate_Kinase      (L,I,V,M,F,Y,W)3DG(F,Y,I)PRx3(N,Q)

                           (L,I,F){3}DG(Y)PRx{3}(Q)

            90: NTSKG            FLIDGYPREVQQ            GEEFE

******************************

* Adenylate kinase signature *

******************************

Adenylate kinase  (EC 2.7.4.3) (AK) [1]  is  a  small  monomeric  enzyme  that

catalyzes the reversible transfer of MgATP to AMP (MgATP + AMP = MgADP + ADP).

In mammals there are three different isozymes:

 - AK1 (or myokinase), which is cytosolic.

 - AK2, which is located in the outer compartment of mitochondria.

 - AK3 (or GTP:AMP phosphotransferase),  which is located in the mitochondrial

   matrix and which uses MgGTP instead of MgATP.

The sequence of  AK has also  been  obtained from different  bacterial species

and from plants and fungi.

Two other enzymes have been found to be evolutionary related to AK. These are:

 - Yeast uridylate kinase  (EC 2.7.4.-) (UK)  (gene URA6) [2]  which catalyzes

   the transfer of a phosphate group from ATP to UMP to form UDP and ADP.

 - Slime mold UMP-CMP kinase (EC 2.7.4.14) [3] which catalyzes the transfer of

   a phosphate group from ATP to either CMP or UMP to form CDP or UDP and ADP.

Several regions of  AK  family enzymes  are well conserved, including the ATP-

binding domains.  We have  selected the  most conserved  of  all  regions as a

signature for this type  of  enzyme.   This  region includes  an aspartic acid

residue that is  part of the  catalytic  cleft  of  the  enzyme  and  that  is

involved in  a salt  bridge.    It  also  includes an  arginine  residue whose

modification leads to inactivation of the enzyme.

-Consensus pattern: [LIVMFYW](3)-D-G-[FYI]-P-R-x(3)-[NQ]

-Sequences known to belong to this class detected by the pattern: ALL,  except

 for Schistosoma mansoni (blood fluke) and Yersinia enterocolitica AK.

-Other sequence(s) detected in SWISS-PROT: NONE.

-Note: archaebacterial AK do not belong to this family [4].

-Last update: November 1997 / Pattern and text revised.

[ 1] Schulz G.E.

     Cold Spring Harbor Symp. Quant. Biol. 52:429-439(1987).

[ 2] Liljelund P., Sanni A., Friesen J.D., Lacroute F.

     Biochem. Biophys. Res. Commun. 165:464-473(1989).

[ 3] Wiesmueller L., Noegel A.A., Barzu O., Gerisch G., Schleicher M.

     J. Biol. Chem. 265:6339-6345(1990).

[ 4] Kath T.H., Schmid R., Schaefer G.

     Arch. Biochem. Biophys. 307:405-410(1993).

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Above each find, the regular expression found by the program is displayed ((L,I,V,M,F,Y,W)3DG(F,Y,I)PRx3(N,Q)). Below this is a simplification of the expression showing only the amino acids and repeat ranges that were actually matched ((L,I,F){3}DG(Y)PRx{3}(Q)), so that you can better see what was found. The find is displayed between the five residues that flank it on the N-terminal side and the five that flank it on the C-terminal side. The number to the left of the find is the coordinate of the first symbol of the motif (not of the flanking symbols). In the example above, 90 is the coordinate of the first F in FLIDGYPREVQQ, not of the first N in NTSKG.
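The coordinate and flanking display can be illustrated with a short Python sketch. This is not how Motifs itself is implemented; it simply re-creates the display logic with Python's re module. The function name show_find is hypothetical, the fragment is the 22 residues visible in the output above (not the full KIHUA sequence), and the fragment's starting coordinate of 85 is derived from the reported find at 90 minus the five flanking residues.

```python
import re

# Residues visible in the sample output: 5 flanking + 12 motif + 5 flanking.
FRAGMENT = "NTSKGFLIDGYPREVQQGEEFE"
# 1-based coordinate of the fragment's first residue (90 - 5 flanking residues).
FRAGMENT_START = 85

# Hand-written Python rendering of (L,I,V,M,F,Y,W)3DG(F,Y,I)PRx3(N,Q).
ADENYLATE_KINASE = re.compile(r"[LIVMFYW]{3}DG[FYI]PR.{3}[NQ]")


def show_find(seq, regex, seq_start=1, flank=5):
    """Return (coordinate, left flank, find, right flank) for the first match."""
    match = regex.search(seq)
    if match is None:
        return None
    left = seq[max(0, match.start() - flank):match.start()]
    right = seq[match.end():match.end() + flank]
    # The coordinate refers to the first symbol of the motif, not of the flank.
    coordinate = seq_start + match.start()
    return coordinate, left, match.group(), right


print(show_find(FRAGMENT, ADENYLATE_KINASE, FRAGMENT_START))
# (90, 'NTSKG', 'FLIDGYPREVQQ', 'GEEFE')
```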

PROSITE ABSTRACTS

The PROSITE Dictionary contains an extensive abstract summarizing current information for each motif. Motifs displays the abstract below each pattern that is found. If the same pattern is found in more than one sequence, the abstract is shown only below the pattern in the first sequence in which it is found. Several different patterns may share the same abstract. To reduce the size of your output, you can suppress these abstracts with -NOREFerence. When abstracts are suppressed, a filename such as 0179.pdoc appears in parentheses below each pattern found. You can use the Fetch program to make a copy of this file in order to look at the abstract.

INPUT FILES

Motifs takes as input one or more protein sequence files. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenBank:*. If Motifs rejects your protein sequence, turn to Appendix VI to see how to change or set the type of a sequence.

RELATED PROGRAMS

FindPatterns and all of the Accelrys GCG (GCG) mapping programs use the same search algorithm and pattern file format as Motifs. ProfileScan uses a database of profiles to find structural and sequence motifs in protein sequences.

RESTRICTIONS

A motif pattern may not be more than 350 characters long.

MISMATCHES

Motifs will not introduce gaps, but it can tolerate mismatches when run with -MISmatch=n, where n is the number of mismatches allowed. Mismatched symbols in a find are shown in the output in lowercase. Mismatches cannot occur within NOT expressions (see the DEFINING PATTERNS topic below).
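As a rough illustration of gap-free matching with a mismatch allowance, the following Python sketch slides a fixed-length pattern along a sequence and lowercases each mismatched symbol. It is a simplified model, not the Motifs algorithm: the function name find_with_mismatches and the list-of-strings pattern representation are hypothetical, repeat ranges and NOT expressions are not modeled, and the sequence is assumed to be uppercase.

```python
def find_with_mismatches(seq, positions, max_mismatch):
    """Scan seq for a fixed-length pattern, allowing mismatches but no gaps.

    positions holds one string per pattern position listing the allowed
    residues; the string "x" stands for any residue.
    Returns (1-based coordinate, displayed find) tuples.
    """
    finds = []
    width = len(positions)
    for start in range(len(seq) - width + 1):
        shown = []
        mismatches = 0
        for residue, allowed in zip(seq[start:start + width], positions):
            if allowed == "x" or residue in allowed:
                shown.append(residue.upper())
            else:
                # Mismatched symbols are displayed in lowercase.
                mismatches += 1
                shown.append(residue.lower())
        if mismatches <= max_mismatch:
            finds.append((start + 1, "".join(shown)))
    return finds


# The Rgd pattern against a toy sequence containing RGE.
print(find_with_mismatches("AARGEAA", ["R", "G", "D"], max_mismatch=0))  # []
print(find_with_mismatches("AARGEAA", ["R", "G", "D"], max_mismatch=1))  # [(3, 'RGe')]
```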

PATTERN FILE

In addition to your input protein sequence files, Motifs reads a local data file like the one below to find the search patterns. This file is modeled on the enzyme data files for the mapping programs described in Appendix VII. The offset field is not used by Motifs, but the field must have a number in it to make the file compatible with the mapping files.

The exact column used for each field does not matter, only the order of the fields in the line. You may give several patterns the same name, but put all of the entries for that name on adjacent lines of this file. The patterns may not be more than 350 characters long. Blank lines and lines that start with an exclamation point (!) are ignored.

Here is part of the default data file used by Motifs:

PROSITETOGCG of: prosite.doc and prosite.dat  August 30, 1999 14:26

Release 16.0  (7/1999)

Name            Offset Pattern                  ..                  PDoc_Name

11s_Seed_Storage     1 NGx(D,E)2x(L,I,V,M,F)C(S,T)x{11,12}(P,A,G)D 00284.pdoc

1433_1               1 RNL(L,I)SV(G,A)YKN(I,V)                     00633.pdoc

1433_2               1 YK(D,E)STLIMQLL(R,H)DNLTLW(T,A)(S,A)        00633.pdoc

25a_Synth_1          1 GGSx(A,G)(K,R)xTxL(K,R)(G,S,T)xSD(A,G)      00653.pdoc

25a_Synth_2          1 RPVILDPx(D,E)PT                             00653.pdoc

////////////////////////////////////////////////////////////////////////////

Zinc_Finger_C2h2     1 Cx{2,4}Cx3(L,I,V,M,F,Y,W,C)x8Hx{3,5}H       00028.pdoc

Zinc_Finger_C3hc4    1 CxHx(L,I,V,M,F,Y)Cx2C(L,I,V,M,Y,A)          00449.pdoc

Zinc_Protease        1 (G,S,T,A,L,I,V,N)x2HE(L,I,V,M,F,Y,W)~(D,E,H,R,K,P ...

Zn2_Cy6_Fungal       1 (G,A,S,T,P,V)Cx2C(R,K,H,S,T,A,C,W)x2(R,K,H)x2Cx{5 ...

Zp_Domain            1 (L,I,V,M,F,Y,W)x7(S,T,A,P,D,N)x3(L,I,V,M,F,Y,W)x( ...
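The field layout described above can be read with a few lines of Python. This is a minimal sketch, not GCG code: the names PatternEntry and parse_pattern_file are hypothetical, and it assumes that everything up to and including the line containing the ".." divider is heading text rather than data, which the sample file suggests but this section does not state. Fields are split on whitespace because only their order matters, not their columns.

```python
from typing import List, NamedTuple


class PatternEntry(NamedTuple):
    name: str
    offset: int       # not used by Motifs, but must be present
    pattern: str
    pdoc_name: str


def parse_pattern_file(text: str) -> List[PatternEntry]:
    """Parse pattern lines of the form: Name Offset Pattern PDoc_Name."""
    entries = []
    in_data = False
    for line in text.splitlines():
        fields = line.split()
        if not in_data:
            # Assumption: data begins after the line holding the ".." divider.
            if ".." in fields:
                in_data = True
            continue
        # Blank lines and lines starting with "!" are ignored.
        if not fields or line.lstrip().startswith("!"):
            continue
        if len(fields) < 4:
            continue  # e.g. a truncated line; skipped in this sketch
        name, offset, pattern, pdoc_name = fields[:4]
        if len(pattern) > 350:
            raise ValueError("pattern %s is longer than 350 characters" % name)
        entries.append(PatternEntry(name, int(offset), pattern, pdoc_name))
    return entries


SAMPLE = """\
PROSITETOGCG of: prosite.doc and prosite.dat  August 30, 1999 14:26

Name            Offset Pattern                  ..                  PDoc_Name

! comment lines like this one are skipped
1433_1               1 RNL(L,I)SV(G,A)YKN(I,V)                     00633.pdoc
25a_Synth_2          1 RPVILDPx(D,E)PT                             00653.pdoc
"""

for entry in parse_pattern_file(SAMPLE):
    print(entry.name, entry.pattern, entry.pdoc_name)
```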

FREQUENT MOTIFS

The PROSITE Dictionary contains a number of short sequence patterns that occur frequently in protein sequences. Most of these frequently found patterns are sites of post-translational modification, but more specific patterns such as leucine zippers also fall into this category. Such frequently found patterns are not normally shown by Motifs, but you can display them with -FREquent. Even more than with other patterns in the PROSITE Dictionary, finding one of these frequently occurring patterns does not assure you that the protein actually has the corresponding function.

Here are some of the patterns that the PROSITE Dictionary classifies as frequently occurring:

;Amidation           1 xG(R,K)(R,K)                             0009.pdoc

;Asn_Glycosylation   1 N~(P)(S,T)~(P)                           0001.pdoc

;Camp_Phospho_Site   1 (R,K)2x(S,T)                             0004.pdoc

;Ck2_Phospho_Site    1 (S,T)x2(D,E)                             0006.pdoc

;Glycosaminoglycan   1 SGxG                                     0002.pdoc

;Leucine_Zipper      1 Lx6Lx6Lx6L                               0029.pdoc

;Microbodies_Cter    1 (S,A,G,C,N)(R,K,H)(L,I,V,M,A,F)>         0299.pdoc

;Myristyl            1 G~(E,D,R,K,H,P,F,Y,W)x2(S,T,A,G,C,N)~(P) 0008.pdoc

;Pkc_Phospho_Site    1 (S,T)x(R,K)                              0005.pdoc

;Rgd                 1 RGD                                      0016.pdoc

;Tyr_Phospho_Site    1 (R,K)x{2,3}(D,E)x{2,3}Y                  0007.pdoc

SUGGESTIONS

The PDoc_Name field in the pattern file prosite.patterns has the name of a PDoc (PROSITE Document) file containing the abstract for each pattern. You can use Fetch to look at any abstracts of interest. If you run Motifs with -NOREFerence, the name of the corresponding PDoc file is shown below each pattern found.

If you specify more than one sequence, Motifs displays each one's name on the screen as it is searched. However, unless you use -SHOw, the output file shows only those sequences in which a motif was actually found.

If you run Motifs with -NAMes, the output file is a list file. (See "Using List Files" in Section 2, Using Sequence Files and Databases of the User's Guide for more information about list files.)

CONSIDERATIONS

With the publication of the PROSITE Dictionary, Amos Bairoch has shown that regular expressions can reliably recognize known protein pattern motifs. When new examples of a known motif are discovered, these expressions can usually be modified to recognize the new example. The process of modifying a regular expression so that it covers all of the members of a newly expanded family of similar sequence patterns could be referred to as "ambiguation."

The problem with regular expressions is that they often fail to recognize sequences that are not yet known to be members of the sequence family. You should consider using Profile technology if your aim is to bring together similar sequences whose association has not yet been recognized.

There are a few patterns in PROSITE that are defined with rules rather than regular expressions. Motifs does not look for these patterns.

DEFINING PATTERNS

FindPatterns, Map, MapSort, MapPlot, and Motifs all let you search with ambiguous expressions that match many different sequences. The expressions can include any legal GCG sequence character (see Appendix III). The expressions can also include several non-sequence characters, which are used to specify OR matching, NOT matching, begin and end constraints, and repeat counts. For instance, the expression TAATA(N){20,30}ATG means TAATA, followed by 20 to 30 of any base, followed by ATG. Following is an explanation of the syntax for pattern specification.

Implied Sets and Repeat Counts

Parentheses () enclose one or more symbols that can be repeated some number of times. Braces {} enclose numbers that tell how many times the symbols within the preceding parentheses must be found.

Sometimes, you can leave out part of an expression. If braces appear without preceding parentheses, the numbers in the braces define the number of repeats for the immediately preceding symbol. One or both of the numbers within the braces may be missing. For instance, both the pattern GATG{2,}A and the pattern GATG{2}A mean GAT, followed by G repeated from 2 to 350,000 times, followed by A; the pattern GATG{}A means GAT, followed by G repeated from 0 to 350,000 times, followed by A; the pattern GAT(TG){,2}A means GAT, followed by TG repeated from 0 to 2 times, followed by A; the pattern GAT(TG){2,2}A means GAT, followed by TG repeated exactly 2 times, followed by A. (If the pattern in the parentheses is an OR expression (see below), it cannot be repeated more than 2,000 times.)

OR Matching

If you are searching nucleic acids, the ambiguity symbols defined in Appendix III let you define any combination of G, A, T, or C. If you are searching proteins, you can specify any of several symbol choices by enclosing the different choices in parentheses and separating the choices with commas. For instance, RGF(Q,A)S means RGF followed by either Q or A followed by S. The length of each choice need not be the same, and there can be up to 31 different choices within each set of parentheses. The pattern GAT(TG,T,G){1,4}A means GAT followed by any combination of TG, T, or G from 1 to 4 times followed by A. The sequence GATTGGA matches this pattern. There can be several parentheses in a pattern, but parentheses cannot be nested.

NOT Matching

The pattern GC~CAT means GC, followed by any symbol except C, followed by AT. The pattern GC~(A,T)CC means GC, followed by any symbol except A or T, followed by CC.

Begin and End Constraints

The pattern <GACCAT can only be found if it occurs at the beginning of the sequence range being searched. Likewise, the pattern GACCAT> would only be found if it occurs at the end of the sequence range.
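To make the syntax concrete, here is a small Python sketch that translates a pattern written in this notation into an equivalent Python regular expression. It is an illustration only, not the GCG search algorithm, and the names gcg_to_regex and _atom are hypothetical. Several points are assumptions: a lowercase x is treated as "any residue" and a bare number after a symbol or set (as in x3 or (D,E)2) as an exact repeat count, both inferred from the sample pattern file and the PROSITE consensus line rather than from the text above; nucleic acid ambiguity symbols are not expanded; and the limits on repeats and choices are not enforced. Following the text, a single number in braces such as {2} is rendered as "2 or more".

```python
import re


def _atom(choices, negate):
    """Render one symbol or one parenthesized set as a regex fragment."""
    if all(len(c) == 1 for c in choices):
        if choices == ["x"]:
            if negate:
                raise ValueError("cannot negate x (any residue)")
            return "."
        members = "".join(re.escape(c) for c in choices)
        if negate:
            return "[^" + members + "]"
        return members if len(choices) == 1 else "[" + members + "]"
    if negate:
        raise ValueError("NOT of multi-symbol choices is not handled here")
    return "(?:" + "|".join(re.escape(c) for c in choices) + ")"


def gcg_to_regex(pattern):
    """Translate a pattern in the notation above into a Python regex string."""
    out = []
    i, n = 0, len(pattern)
    if pattern.startswith("<"):          # begin constraint
        out.append(r"\A")
        i = 1
    anchored_end = pattern.endswith(">")  # end constraint
    if anchored_end:
        n -= 1
    while i < n:
        negate = pattern[i] == "~"       # NOT matching
        if negate:
            i += 1
        if i >= n:
            raise ValueError("pattern ends after ~")
        if pattern[i] == "(":            # OR set or repeatable group
            close = pattern.index(")", i)
            atom = _atom(pattern[i + 1:close].split(","), negate)
            i = close + 1
        else:
            atom = _atom([pattern[i]], negate)
            i += 1
        quant = ""
        if i < n and pattern[i] == "{":  # {m,n}, {m,}, {m}, {,n} or {}
            close = pattern.index("}", i)
            low, comma, high = pattern[i + 1:close].partition(",")
            quant = "{%s,%s}" % (low or "0", high if comma else "")
            i = close + 1
        elif i < n and pattern[i].isdigit():  # bare count, assumed exact
            j = i
            while j < n and pattern[j].isdigit():
                j += 1
            quant = "{" + pattern[i:j] + "}"
            i = j
        out.append(atom + quant)
    if anchored_end:
        out.append(r"\Z")
    return "".join(out)


print(gcg_to_regex("(L,I,V,M,F,Y,W)3DG(F,Y,I)PRx3(N,Q)"))
# [LIVMFYW]{3}DG[FYI]PR.{3}[NQ]
print(gcg_to_regex("GC~(A,T)CC"))
# GC[^AT]CC
print(bool(re.fullmatch(gcg_to_regex("GAT(TG,T,G){1,4}A"), "GATTGGA")))
# True
```

Open-ended repeats are emitted as {m,} rather than with an explicit upper bound of 350,000, which keeps the output independent of any repeat limit in the regex engine.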

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % motifs [-INfile=]pir:kihua -Default

Prompted Parameters:

[-OUTfile=]kihua.motifs  names the output file

Local Data Files:

-DATa=prosite.patterns   names the file of protein sequence patterns

Optional Parameters:

-NOREFerence      suppresses the PROSITE abstract for each pattern found

-FREquent         shows motifs that are frequently found in proteins

-MISmatch=1       allows one mismatch

-NAMes            writes the output as a list file

-APPend           appends the pattern data file to your output file

-SHOw             shows every file searched, even if no pattern was found

-MINCuts=2        limits finds to patterns found a minimum of 2 times

-MAXCuts=2        limits finds to patterns found a maximum of 2 times

-ONCe             limits finds to patterns found only once

-EXCLude=n1,n2    excludes patterns found between positions n1 and n2

-LIStfile[=motifs.list] writes names of matching sequences to a list file

-RSF[=motifs.rsf] saves motifs as features in an RSF file

-NOMONitor        suppresses the screen trace showing each file

-NOSUMmary        suppresses the screen summary at the end of the program
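If you drive Motifs from a script, the command line can be assembled from the parameters listed above. The Python sketch below only builds the argument list; it does not run the program. The function name build_motifs_command is hypothetical, only a few of the optional parameters are covered, and placing -Default last is an assumption about style rather than a documented requirement.

```python
def build_motifs_command(infile, outfile=None, mismatch=None,
                         frequent=False, no_reference=False):
    """Return an argument list for a non-interactive Motifs run."""
    argv = ["motifs", "-INfile=" + infile]
    if outfile is not None:
        argv.append("-OUTfile=" + outfile)
    if mismatch is not None:
        argv.append("-MISmatch=%d" % mismatch)
    if frequent:
        argv.append("-FREquent")
    if no_reference:
        argv.append("-NOREFerence")
    # -Default suppresses the prompts for any parameters not given.
    argv.append("-Default")
    return argv


print(" ".join(build_motifs_command("pir:kihua")))
# motifs -INfile=pir:kihua -Default
print(" ".join(build_motifs_command("pir:kihua", mismatch=1, frequent=True)))
# motifs -INfile=pir:kihua -MISmatch=1 -FREquent -Default
```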

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

Motifs reads the regular expressions for the motifs of interest from the file prosite.patterns.

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-NOREFerence

Suppresses the PROSITE abstract that normally appears below each pattern that is found.

-FREquent

Displays frequently found patterns, such as post-translational modifications.

-MISmatch=1

Causes Motifs to recognize places where patterns are found with one or fewer mismatches. The display uses case to distinguish between matches and mismatches.

-LIStfile[=motifs.list]

Writes the names of matching sequences to a list file suitable for input to other GCG programs that support indirect file specification (see "Using List Files" in Section 2, Using Sequence Files and Databases of the User's Guide). This list file is in addition to the normal output file.

-APPend

Appends the pattern data file to your output file. (See the PATTERN FILE topic above.)

-SHOw

Usually, Motifs shows that a sequence was searched only if there were one or more matches in the sequence. With -SHOw, Motifs shows every sequence searched whether or not a pattern was actually found in it. (-SHOw is equivalent to setting -MINCuts=0.)

The descriptions of the exclusionary parameters below were written for GCG mapping programs. A find in these applications is referred to as a cut while a pattern is referred to as a restriction enzyme recognition site.

The -MINCuts, -MAXCuts, -ONCe, and -EXClude parameters suppress the display of selected enzymes. The list of excluded enzymes in the program output includes both selected enzymes that cut within excluded ranges and selected enzymes that did not cut the right number of times.

-MINCuts=2

Excludes enzymes that do not cut at least two times.

-MAXCuts=2

Excludes enzymes that cut more than two times.

-ONCe

Excludes, from the set of enzymes displayed, those enzymes that cut your sequence more than once (equivalent to setting both -MINCuts and -MAXCuts to one).
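In Motifs terms, these three parameters filter patterns by the number of finds. The Python sketch below models that filtering under stated assumptions: the function name filter_by_count is hypothetical, the default minimum of one find is inferred from the statement that -SHOw is equivalent to -MINCuts=0, and no maximum applies unless one is given.

```python
def filter_by_count(find_counts, min_cuts=1, max_cuts=None, once=False):
    """Keep the patterns whose number of finds lies within the limits.

    find_counts maps a pattern name to the number of times it was found.
    """
    if once:
        # -ONCe is equivalent to setting both limits to one.
        min_cuts, max_cuts = 1, 1
    kept = {}
    for name, count in find_counts.items():
        if count < min_cuts:
            continue
        if max_cuts is not None and count > max_cuts:
            continue
        kept[name] = count
    return kept


counts = {"Rgd": 1, "Pkc_Phospho_Site": 3, "Amidation": 0}
print(filter_by_count(counts))               # {'Rgd': 1, 'Pkc_Phospho_Site': 3}
print(filter_by_count(counts, min_cuts=2))   # {'Pkc_Phospho_Site': 3}
print(filter_by_count(counts, once=True))    # {'Rgd': 1}
```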

-EXClude=n1,n2[,n3,n4,...]

Excludes enzymes that cut anywhere within one or more ranges of the sequence. If an enzyme is found within an excluded range, then the enzyme is not displayed. The list of excluded enzymes includes enzymes that cut within excluded ranges. The ranges are defined with sets of two numbers. The numbers are separated by commas. Spaces between numbers are not allowed. The numbers must be integers that fall within the sequence beginning and ending points you have chosen. The range may be circular if circular mapping is being done. Exclusion is not done if there are any non-numeric characters in the numbers or numbers out of range or if there is an odd number of integers following the parameter.
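The validation rules for the range list can be sketched in Python as follows. The names parse_exclude and in_excluded_range are hypothetical. The sketch returns an empty list whenever the value is invalid, mirroring the statement that exclusion is not done in that case. It handles linear ranges only (circular ranges are not modeled) and assumes that a find counts as being within a range when its first coordinate falls inside it.

```python
def parse_exclude(value, begin, end):
    """Parse 'n1,n2[,n3,n4,...]' into (low, high) pairs.

    Returns [] (no exclusion) if any rule is violated: non-numeric
    characters or spaces, an odd number of integers, or a number outside
    the chosen sequence range begin..end.
    """
    parts = value.split(",")
    if len(parts) % 2 != 0:
        return []
    # isdigit() is False for empty strings, signs, and embedded spaces.
    if not all(p.isascii() and p.isdigit() for p in parts):
        return []
    numbers = [int(p) for p in parts]
    if any(n < begin or n > end for n in numbers):
        return []
    return list(zip(numbers[0::2], numbers[1::2]))


def in_excluded_range(coordinate, ranges):
    """True if the coordinate lies inside any excluded range."""
    return any(low <= coordinate <= high for low, high in ranges)


ranges = parse_exclude("80,100,150,160", begin=1, end=194)
print(ranges)                          # [(80, 100), (150, 160)]
print(in_excluded_range(90, ranges))   # True
print(parse_exclude("80, 100", 1, 194))  # [] because of the space
```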

-RSF[=motifs.rsf]

Writes an RSF (rich sequence format) file containing the input sequences annotated with features generated from the results of Motifs. This RSF file is suitable for input to other GCG programs that support RSF files. In particular, you can use SeqLab to view this features annotation graphically. If you don't specify a file name with this parameter, then the program creates one using motifs for the file basename and .rsf for the extension. For more information on RSF files, see "Using Rich Sequence Format (RSF) Files" in Section 2 of the User's Guide. Or, see "Rich Sequence Format (RSF) Files" in Appendix C of the SeqLab Guide.

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

-SUMmary

Writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: May 27, 2005 13:01

Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.