PROFILESCAN

ProfileScan uses the method of Gribskov et al. (CABIOS 4(1); 61-66 (1988)) to find structural and sequence motifs in protein sequences. These motifs are represented as profiles in a library. ProfileScan aligns each profile motif to the sequence, and displays all alignments between the profile and sequence that have a normalized score above a set threshold. Because more than one alignment between a sequence and a particular motif can be found, each repeat of a duplicated structure (such as the zinc finger motif) can be presented.

EXAMPLE

Here is a session using ProfileScan to search for known structural motifs in the sequence Ygbyad from the PIR database:

% profilescan

 PROFILESCAN of what sequence(s) ?  PIR:Ygbyad

                  Begin (* 1 *) ?

                End (*  1392 *) ?

 What profile library (* profilescan.fil *) ?

 What should I call the alignment output file (* ygbyad.scan *) ?

 What should I call the summary output file (* ygbyad.sum *) ?

Beginning initial scan...

........................................

........................................

........................................

........................................

........................................

........................................

........................................

........................................

........................................

........................................

........................................

........................................

........................................

........................................

........................................

.........

Beginning multiple alignment to matching patterns...

OUTPUT

Here is some of the ygbyad.scan output file:

 PROFILESCAN of : ygbyad  check: 5237  from: 1  to: 1392

P1;YGBYAD - L-aminoadipate-semialdehyde dehydrogenase (EC 1.2.1.31) - yeast

 (Saccharomyces cerevisiae)

N;Alternate names: alpha-aminoadipate reductase; protein YBR0910; protein

 YBR115c

C;Species: Saccharomyces cerevisiae

C;Date: 31-Dec-1991 #sequence_revision 31-Dec-1991 #text_change 12-Dec-1997

C;Accession: JU0448; S48279; S45983; A25815; S37810; S25367; S34171; S44694

R;Morris, M.E.; Jinks-Robertson, S. . . .

 Compare to profile library: GenRunData:profilescan.fil

..

--------------------------------------------------------------------------------

 Profile: profiledir:amp_binding.prf

   Gap weight:  4.50     Gap Length weight:   0.05

   Ave match:   0.12     Ave mismatch     :  -0.10

(Peptide) PROFILEMAKE v4.40 of: 0455.Msf2{*}  Length: 59

  Sequences: 28  MaxScore: 15.35  December 2, 1992  01:06

This profile is derived from PROSITE release 10.0 and has been tested

by a database search against SWISS-PROT release 26.0.  A comparison

of the SWISS-PROT annotation and the results of the database search

follows.  For further information about this motif, consult the . . .

Profile: profiledir:amp_binding.prf     alignment: 1

 Quality:  10.69       Gaps: 0

   Ratio:   0.21     Length: 51

 Normalized quality:  2.34

                  .         .         .         .         .

S    399 DHYKDTRTGVVVGPDSNPTLSFTSGSEGIPKGVLGRHFSLAYYFNWMSKR 448

         :. .:: :.....::. : | |||||:| |||||  | ::.   . ::::

P      7 EQSEDTETTQPDDPEDLAFIIFTSGTTGKPKGVMLTHKGVVNSVSSLSDR 56

S    449 F 449

P     57 F 57

*****************************************

* Putative AMP-binding domain signature *

*****************************************

It has been shown [1 to 5] that a number of prokaryotic and eukaryotic enzymes

which all probably act via  an ATP-dependent  covalent binding of AMP to their

substrate, share a region of sequence similarity. These enzymes are:

//////////////////////////////////////////////////////////////////////////////

-Consensus pattern: [LIVMFY]-x(2)-[STG]-[STAG]-G-[ST]-[STEI]-[SG]-x-[PASLIVM]-

                    [KR]

-Sequences known to belong to this class detected by the pattern: ALL.

-Other sequence(s) detected in SWISS-PROT: 13.

-Note: in a majority of cases the residue that  follows  the Lys at the end of

 the pattern is a Gly.

-Last update: November 1997 / Pattern and text revised.

[ 1] Toh H.

     Protein Seq. Data Anal. 4:111-117(1991).

[ 2] Smith D.J., Earl A.J., Turner G.

     EMBO J. 9:2743-2750(1990).

[ 3] Schroeder J.

     Nucleic Acids Res. 17:460-460(1989).

[ 4] Mallonee D.H., Adams J.L., Hylemon P.B.

     J. Bacteriol. 174:2065-2071(1992).

[ 5] Turgay K., Krause M., Marahiel M.A.

     Mol. Microbiol. 6:529-546(1992).

//////////////////////////////////////////////////////////////////////////////

The file ygbyad.sum lists the number of occurrences of each motif in the sequence of interest, the score for each occurrence, and the threshold score for that motif.

INPUT FILES

ProfileScan takes as input one or more protein sequences. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenBank:*. If ProfileScan rejects your protein sequence, turn to Appendix VI to see how to change or set the type of a sequence.

RELATED PROGRAMS

PileUp creates a multiple sequence alignment from a group of related sequences. Pretty displays multiple sequence alignments.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between one or more sequences and a group of aligned sequences represented as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using predetermined parameters to determine significance.

HmmerPfam compares one or more sequences to a database of profile hidden Markov models, such as the Pfam library, in order to identify known domains within the sequences. HmmerIndex creates an index for a profile hidden Markov model database so that profile HMMs can be retrieved from the database with HmmerFetch.

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns.

RESTRICTIONS

Unknown.

ALGORITHM

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

ProfileScan acts similarly to ProfileGap to align the motif profile to a sequence. Unlike ProfileGap, all alignments with scores above a set threshold are displayed. The scores are normalized for systematic effects of sequence length on the score. Since the average normalized score for sequences unrelated to the profile is expected to be 1.0, the threshold can be viewed as the factor by which an alignment score must exceed the expected alignment score for unrelated sequences to be reported. For instance, if the threshold is set at 2.0, an alignment is reported if its normalized score is at least 2.0 times the expected score for sequences unrelated to the profile.
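
The reporting rule can be sketched in a few lines of Python. This is only an illustration, not ProfileScan's own code: the Alignment record and the reportable function are hypothetical names, the 2.0 threshold is the value from the example in the text, and the second hit is invented. Only the first hit (normalized quality 2.34, residues 399 to 449) comes from the sample output above.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical record for one profile-to-sequence alignment."""
    profile: str
    start: int
    end: int
    normalized_score: float


def reportable(alignments, threshold=2.0):
    """Keep alignments whose normalized score is at least the threshold.

    A normalized score of 1.0 is the expected value for sequences
    unrelated to the profile, so the threshold is a multiple of that.
    """
    return [a for a in alignments if a.normalized_score >= threshold]


hits = [
    Alignment("amp_binding", 399, 449, 2.34),       # from the sample output
    Alignment("hypothetical_motif", 10, 60, 1.10),  # invented for illustration
]

for a in reportable(hits):
    print(a.profile, a.start, a.end, a.normalized_score)
```

With these inputs only the amp_binding alignment is kept, because 1.10 is below the assumed threshold of 2.0.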

In practice, two possible thresholds, high and interesting, can be selected. The threshold values for each motif are present in the motif library file, profilescan.fil. The interesting level is usually set at 3.0 standard deviations above the mean score for sequences in the database unrelated to the profile, and the high level is usually set at the 5.0 to 6.0 standard deviation level. The default high threshold can be overridden with -INTEResting. (See the entry for ProfileSearch in the Program Manual for a complete description of normalized scores.)
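
As a rough sketch of how the two levels relate, the Python below places a threshold a given number of standard deviations above the mean of scores for unrelated sequences, then picks the active threshold the way -INTEResting does. The function names and the score list are made up for illustration, and the use of the population standard deviation is an assumption; the manual does not say how the deviation is estimated.

```python
from statistics import mean, pstdev


def sd_threshold(unrelated_scores, n_sd):
    """Score lying n_sd standard deviations above the mean of unrelated scores.

    Assumption: population standard deviation (pstdev); the manual does
    not state which estimator was used.
    """
    return mean(unrelated_scores) + n_sd * pstdev(unrelated_scores)


def active_threshold(high, interesting, use_interesting=False):
    """HIGH is the default; the INTERESTING level is used only on request."""
    return interesting if use_interesting else high


unrelated = [0.8, 0.9, 1.0, 1.1, 1.2]      # invented normalized scores
interesting = sd_threshold(unrelated, 3.0)  # "usually 3.0" per the text
high = sd_threshold(unrelated, 5.0)         # lower end of "5.0 to 6.0"

print(round(interesting, 3), round(high, 3))
print(active_threshold(high, interesting, use_interesting=True) == interesting)
```

Because the interesting level sits closer to the mean, selecting it reports more alignments than the default high level.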

Validated Profiles

The motif library consists of validated profiles derived from aligned sequences known to contain each structural motif. A validated profile has the following properties: 1) all of the sequences used to create the profile correctly align to the profile; and 2) all sequences known to contain the motif score above the high threshold. The scores for these sequences are higher in every case than the scores for sequences known to lack the motif. Operationally, the process of creating a validated profile is as follows:

1) Each sequence known to contain the motif is aligned to the profile using ProfileGap. The alignment generated should correspond to the original alignment. If the alignments differ significantly, they are repeated with different gap creation and gap extension penalties until they agree.

2) Each motif profile is compared to all the sequences in the database using ProfileSearch. All sequences known to contain the motif represented by the profile should have higher scores than any sequences that lack the motif.

3) If the profile does not adequately discriminate between sequences with the motif and those without, and if changing the gap creation and gap extension penalties does not improve the discrimination, the alignments are examined by eye to determine why the sequences without the motif are giving high scores. The profile can then be edited by hand to reduce the scores in the profile at the positions that are contributing to the high scores of the sequences lacking the motif.
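
The discrimination test applied during validation can be expressed as a simple comparison of two score lists. The sketch below is illustrative only: the function names and all scores are hypothetical, and it assumes the scores have already been produced by a database search.

```python
def discriminates(with_motif, without_motif):
    """True when every sequence known to contain the motif outscores
    every sequence known to lack it."""
    return min(with_motif) > max(without_motif)


def offending(with_motif, without_motif):
    """Scores of motif-lacking sequences that reach or exceed the weakest
    motif-containing sequence; these are the ones to examine by eye."""
    floor = min(with_motif)
    return sorted(s for s in without_motif if s >= floor)


# Invented scores for illustration.
print(discriminates([5.2, 4.8], [2.1, 3.0]))  # clean separation
print(discriminates([5.2, 2.9], [3.0, 1.0]))  # one non-member scores too high
print(offending([5.2, 2.9], [3.0, 1.0]))
```

A non-empty result from offending corresponds to the case where the gap penalties are changed or the profile is edited by hand.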

CONSIDERATIONS

ProfileScan may report multiple occurrences of a motif profile in a protein sequence. The alignments may represent repeats of a duplicated structure, or they may represent distinct alignments between the motif profile and the same region of the protein sequence. These alternatives can be distinguished by looking at the alignments in the .scan file.
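
One way to tell the two cases apart is to compare the sequence ranges of the alignments. The Python below is a sketch under that assumption; the function names are hypothetical, and only the range 399 to 449 comes from the sample output, the other ranges being invented.

```python
def overlaps(a, b):
    """True when two inclusive (start, end) ranges share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify(a, b):
    """Label a pair of alignments to the same profile by their sequence ranges."""
    return "same region" if overlaps(a, b) else "separate repeats"


first = (399, 449)                  # range from the sample alignment
print(classify(first, (410, 460)))  # invented overlapping range
print(classify(first, (900, 950)))  # invented distant range
```

Overlapping ranges suggest alternative alignments of one region, while disjoint ranges suggest repeats of a duplicated structure; the alignments themselves should still be inspected.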

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.
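
The capitalization convention can be illustrated with a small Python sketch. It is not the package's own parser: the function names are hypothetical, and two behaviours are assumptions, namely that matching ignores case and that any longer spelling up to the full name is also accepted.

```python
def required_prefix(spec):
    """Leading capitalized letters of a parameter name as printed in the
    summary, for example 'INTER' for 'INTEResting'."""
    n = 0
    while n < len(spec) and spec[n].isupper():
        n += 1
    return spec[:n]


def matches(typed, spec):
    """True when the typed name contains at least the required letters and
    is a prefix of the full parameter name (assumed case-insensitive)."""
    typed, full = typed.lower(), spec.lower()
    minimum = required_prefix(spec).lower()
    return typed.startswith(minimum) and full.startswith(typed)


print(required_prefix("INTEResting"))
print(matches("inter", "INTEResting"))  # exactly the required letters
print(matches("int", "INTEResting"))    # too short
```

Under these assumptions -INTER and -INTEREST both select -INTEResting, while -INT does not.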

Minimal Syntax: % profilescan [-INfile=]pir:ygbyad -Default

Prompted Parameters:

-BEGin=1 -END=1392          sets the range of interest

-REVerse                    uses the reverse strand (nucleic acid only)

[-LIBrary=]profilescan.fil  specifies profile library file

[-OUTfile=]ygbyad.scan      specifies paired alignment output file name

[-SUMfile=]ygbyad.sum       specifies summary output file name

Local Data Files:  None

Optional Parameters:

-INTEResting       reports scores higher than the INTERESTING threshold,

                     rather than the default HIGH

-NOAVErage         does not adjust quality score for sequence composition

-PAIr=1.0,0.5,0.1  specifies thresholds for displaying '|', ':', and '.'

-BATch             submits the program to run in the batch queue

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

ProfileScan reads the library file, profilescan.fil, containing a list of each validated profile, the high and interesting thresholds, the gap creation and gap extension penalties for each profile, and the three constants A, B, and C used for length dependent normalization of scores. See the entry for ProfileSearch in the Program Manual for details on the calculation of these constants.

Any profile can be used by ProfileScan by including its file name and appropriate values in the library file. Values for the two thresholds and two gap penalties must be included for each profile added to the library file. If values for the three constants A, B, and C are omitted from the library file, the values 0.0, 0.0, and 1.0 are used, respectively.
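
The required and optional values for one library entry can be pictured as a record with defaults. The Python below is only a sketch: LibraryEntry and its field names are hypothetical and do not describe the layout of profilescan.fil. The gap values 4.50 and 0.05 are taken from the sample output above; the two thresholds are invented.

```python
from dataclasses import dataclass


@dataclass
class LibraryEntry:
    """Hypothetical in-memory form of one profile's library values."""
    profile_file: str
    high: float           # required
    interesting: float    # required
    gap_creation: float   # required
    gap_extension: float  # required
    a: float = 0.0        # normalization constants: defaults used
    b: float = 0.0        # when the library file omits them
    c: float = 1.0


entry = LibraryEntry(
    profile_file="amp_binding.prf",
    high=5.0,             # invented threshold
    interesting=3.0,      # invented threshold
    gap_creation=4.50,    # "Gap weight" in the sample output
    gap_extension=0.05,   # "Gap Length weight" in the sample output
)
print(entry.a, entry.b, entry.c)
```

Leaving out any of the four required values raises an error in this sketch, mirroring the rule that thresholds and gap penalties must be supplied for every profile.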

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-LIBrary=profilescan.fil

Specifies the library file containing a list of validated profiles used to find motifs in protein sequences. Usually, the list of validated profiles is found in a default or local data file called profilescan.fil. -LIBrary allows you to name a different file.

-INTEResting

Reports alignments whose scores are higher than the INTERESTING threshold, rather than the more stringent default HIGH threshold.

-NOAVErage

Turns off the adjustment of scores for sequence composition. By default (-AVErage), a score due to the similarity in composition between the profile and the sequence of interest is subtracted from the original alignment score.

-PAIr=1.0,0.5,0.1

The paired output file from this program displays sequence similarity by putting a pipe character (|), colon (:), and period (.) between similar sequence symbols. The default thresholds for the characters are determined by the values in the profile. The pipe character is put between symbols whose comparison value in the profile is at least the average positive value in the profile plus one tenth the difference between the maximum and average values in the profile. The colon character threshold is the average positive value in the profile. The period character threshold is the larger of the average positive value in the profile minus one tenth the difference between the maximum and average values, and one half the average value.
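
The default thresholds described above can be worked through in Python. This is a sketch, not the program's code: the function names and the numeric inputs are invented, and it assumes that "average" means the average positive value everywhere the description mentions an average.

```python
def pair_thresholds(avg_positive, maximum):
    """Default thresholds for '|', ':' and '.' from two profile statistics.

    Assumption: every 'average' in the description is the average
    positive comparison value in the profile.
    """
    step = 0.1 * (maximum - avg_positive)
    pipe = avg_positive + step
    colon = avg_positive
    period = max(avg_positive - step, 0.5 * avg_positive)
    return pipe, colon, period


def symbol(value, thresholds):
    """Character printed between two aligned symbols for a comparison value."""
    pipe, colon, period = thresholds
    if value >= pipe:
        return "|"
    if value >= colon:
        return ":"
    if value >= period:
        return "."
    return " "


defaults = pair_thresholds(avg_positive=2.0, maximum=7.0)  # invented statistics
print(defaults)
print(symbol(2.6, defaults), symbol(2.0, defaults), symbol(1.6, defaults))
print(symbol(0.7, (1.0, 0.5, 0.1)))  # explicit values as in -PAIr=1.0,0.5,0.1
```

For these inputs the thresholds are 2.5, 2.0, and 1.5. Passing explicit values, as -PAIr does, bypasses the calculation and applies the three numbers directly.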

-BATch

Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

Printed: May 27, 2005 14:12