SPSCAN

 

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

OUTPUT

INPUT FILES

RELATED PROGRAMS

CONSIDERATIONS

ALGORITHM

COMMAND-LINE SUMMARY

LOCAL DATA FILES

PARAMETER REFERENCE


FUNCTION

SPScan scans protein sequences for the presence of secretory signal peptides (SPs).

DESCRIPTION

SPScan predicts secretory signal peptides (SPs) in protein sequences. For each sequence, SPScan prints a list of possible secretory signal peptides sorted in descending order according to score. Associated with each score is the probability of achieving that score in the target sequence by chance using the given weight matrix. SPScan has weight matrices for eukaryotes, Gram-positive prokaryotes, and Gram-negative prokaryotes.

EXAMPLE

Here is a session with SPScan that was used to find SPs in the apolipoprotein A-I precursor protein sequence from Salmo salar:

 
 
% spscan
 
  SPScan of what sequence(s)? PIR:Jh0472
 
                  Begin (* 1 *) ?
                End (*   258 *) ?
 
  Search using weight matrix for which organism type:
 
      A.  Eukaryote
      B.  Gram-Positive Prokaryote
      C.  Gram-Negative Prokaryote
 
     Please choose one: (* A *):
 
  Only display SPs whose score exceeds (* 7.0 *) ?
 
  What should I call the output file (* jh0472.spscan *) ?
 
     Number of input sequences processed: 1
  Number of sequences with predicted SPs: 1
                             Output file: jh0472.spscan
                          CPU time (sec): 1.01
 
%

OUTPUT

Here is the output file:

 
 
SPScan of PIR:Jh0472  September 29, 1998 15:02
 
  Weight matrix: GenRunData:speuk.dat
  Minimum score for SPs (threshold): 7.0
 
  Predicted cleavage sites indicated by '^'.
 
> sequence: pir2:jh0472
      name: jh0472  check: 8711  from: 1  to: 258
 
   1. 1 MKFLVLALTILLAAGTQA^FP 20
      Score: 12.2
      Probability: 1.455E-03
      SP length: 18
      McGeoch scan succeeded:
        Charged-region statistics:
          Length: 2   Charge: 1
        Hydrophobic-region statistics:
          Length: 9   Offset: 3   Total hydropathy: 67.8
          Maximum 8-residue hydropathy: 60.6, starting at 5
 
  Databases searched:
        NBRF, Release 57.0, Released on 30Jun1998, Formatted on 18Aug1998
  Input sequences searched: 1
  Number of sequences with predicted SPs: 1
  CPU time (sec): 0.42

SP Representation

The N-terminus->C-terminus direction of the predicted SP is from left to right. The position of the first residue in the SP is shown to the left, and the position of the second residue after the cleavage site is shown to the right. The predicted position of the cleavage site itself is indicated with a caret (^).
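
If you post-process SPScan reports with a script, the SP representation described above can be decoded along the following lines. This is only a minimal sketch in Python: the helper name parse_sp_line and the dictionary keys are hypothetical and not part of SPScan, and the pattern assumes the line layout shown in the sample output.

```python
import re

# Matches a display line such as "   1. 1 MKFLVLALTILLAAGTQA^FP 20":
# optional rank, start position, SP residues, caret, flanking residues, end position.
SP_LINE = re.compile(r"^\s*(?:(\d+)\.\s+)?(\d+)\s+([A-Z]+)\^([A-Z]+)\s+(\d+)\s*$")

def parse_sp_line(line):
    """Split an SP display line into its parts (hypothetical helper)."""
    m = SP_LINE.match(line)
    if m is None:
        raise ValueError(f"not an SP display line: {line!r}")
    rank, start, peptide, flank, end = m.groups()
    return {
        "rank": int(rank) if rank else None,
        "start": int(start),                 # position of the first SP residue
        "peptide": peptide,                  # residues before the caret
        "flank": flank,                      # residues shown after the cleavage site
        "end": int(end),                     # position of the last residue shown
        "sp_length": len(peptide),           # residues preceding the cleavage site
        "last_sp_residue": int(start) + len(peptide) - 1,
    }

record = parse_sp_line("   1. 1 MKFLVLALTILLAAGTQA^FP 20")
print(record["sp_length"], record["last_sp_residue"], record["end"])  # 18 18 20
```

For the sample report, the number of residues before the caret agrees with the SP length of 18 that SPScan prints.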

SP Data

Each predicted SP displayed in the output is followed by a summary of the information used to make the prediction:

Score gives the score computed using the weight matrix for the predicted SP. This is the maximum score generated from the weight matrix as it is moved over a region no longer than 70 residues downstream from a putative SP start site (an SP start site is either an initiator methionine or, if the sequence does not start with a methionine, the first amino acid residue of the sequence). The region immediately downstream of the putative start site is evaluated for certain characteristics indicative of an SP before the weight matrix is applied to the sequence. If you use -ADJustscores, the score is reduced by an amount proportional to that by which the length of the predicted SP exceeds the suggested maximum for the organism type. All the SPs predicted for a particular sequence are sorted according to this value, with the highest scores appearing first.

Unadjusted score, when present, gives the score computed by applying the weight matrix to the predicted SP. This information will appear only when you use -ADJustscores.

Probability, when present, is the probability of the random occurrence of a score at least as high as the one reported in a sequence with the same amino acid composition as that portion of the target sequence scanned (see ALGORITHM topic below) whose positions are all independent of each other. -EVEn causes SPScan to compute score probabilities based on a sequence with even amino acid residue distribution whose positions are all independent of each other. -NOPROBabilities causes SPScan to forgo the calculation of probability. If you specify -ADJustscores, the probability always applies to the unadjusted score.

SP length is the length of the predicted SP from the putative SP start site to the residue immediately preceding the site of enzymatic cleavage. Note that the SP sequence display shows an indication of the cleavage site followed by the first two residues after the SP; the final two residues are not included in the SP length because they are not part of the SP.

McGeoch scan reports either "succeeded" or "failed," based on the result of the scan for McGeoch's criteria for a minimum SP. (See ALGORITHM topic below.)

Charged-region statistics are present only if the McGeoch scan succeeds.

Length gives the length of the charged region, or n-region (see ALGORITHM topic below), as measured from the putative SP start site to the distal charged residue. In a typical SP, the charged region is 1 to 5 amino acids in length and carries a positive charge.

Charge gives the total charge of the n-region. The total charge is the sum of the charges of the charged amino acids in the n-region.

Hydrophobic-region statistics are present only if the McGeoch scan succeeds.

Length gives the length of the hydrophobic region, or h-region (see ALGORITHM topic below), as measured from the residue immediately following the distal charged residue of the charged region to the last amino acid in the maximally hydrophobic 8-residue window beginning 8 to 15 residues downstream from the putative SP start site.

Offset gives the position of the first residue in the hydrophobic region of the potential SP relative to the beginning of the predicted SP.

Total hydropathy gives the total Kyte-Doolittle hydropathy of the hydrophobic region.

Maximum 8-residue hydropathy gives the Kyte-Doolittle hydropathy of the maximally hydrophobic 8-residue window in the hydrophobic region (see ALGORITHM topic below). The position of the first residue in this window is indicated. The final residue of this window is the last amino acid of the hydrophobic region.
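
The per-SP fields described above can be collected from a report with a few lines of Python. This is a sketch under assumptions: the helper name parse_sp_data is hypothetical, the labels "Score", "Probability", "SP length", and "McGeoch scan succeeded" are taken from the sample output, and the spellings "Unadjusted score:" and "McGeoch scan failed" are assumed, because the sample output does not show them.

```python
import re

# Simple "Label: value" fields printed under each predicted SP.
FIELD = re.compile(r"^\s*(Score|Unadjusted score|Probability|SP length):\s*(\S+)\s*$")
# The McGeoch line; the "failed" spelling is an assumption.
MCGEOCH = re.compile(r"^\s*McGeoch scan (succeeded|failed)")

def parse_sp_data(lines):
    """Collect the summary fields of one predicted SP (hypothetical helper)."""
    data = {}
    for line in lines:
        m = FIELD.match(line)
        if m:
            key, value = m.groups()
            data[key] = int(value) if key == "SP length" else float(value)
            continue
        m = MCGEOCH.match(line)
        if m:
            data["McGeoch scan"] = (m.group(1) == "succeeded")
    return data

sample = [
    "      Score: 12.2",
    "      Probability: 1.455E-03",
    "      SP length: 18",
    "      McGeoch scan succeeded:",
]
fields = parse_sp_data(sample)
print(fields)
```

Lines that match neither pattern, such as the charged-region and hydrophobic-region statistics, are ignored by this sketch.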

INPUT FILES

The input to SPScan is one or more protein sequences. If SPScan rejects your protein sequence, turn to Appendix VI to see how to change or set the type of a sequence. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenBank:*.

RELATED PROGRAMS

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds. FindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal. HTHScan scans protein sequences for the presence of helix-turn-helix motifs, indicative of sequence-specific DNA-binding structures often associated with gene regulation. CoilScan locates coiled-coil segments in protein sequences. TransMem scans for likely transmembrane helices in one or more input protein sequences. HTHScan+ scans protein sequences for the presence of helix-turn-helix motifs, indicative of sequence-specific DNA-binding structures often associated with gene regulation. CoilScan+ locates coiled-coil segments in protein sequences. SPScan+ scans protein sequences for the presence of secretory signal peptides (SPs).

 

CONSIDERATIONS

Under normal circumstances it is likely that SPScan will predict more than one SP in your sequence. Often one of these will have a score significantly greater than the others. If not, keep the following points in mind when evaluating the results of SPScan (from Nielsen, H. et al. Protein Engineering 10(1); 1-6 (1997)):

- SPs in eukaryotes are very rarely longer than 35 residues (40 residues for Gram-negative bacteria, 45 for Gram-positive bacteria). -ADJustscores causes the scores of long predictions to diminish linearly as the predicted SP lengthens beyond those empirical limits.

- SPs shorter than 15 residues are extremely rare in both eukaryotes and prokaryotes. SPScan won't find any SPs shorter than 15 residues in length.

The probability value attached to each score, being a measure of the probability of achieving that score or higher by chance with the given weight matrix and target sequence, is extremely useful when evaluating SP predictions. A probability close to 0.0 indicates that achieving the score purely by chance is very unlikely, and that you can have more confidence in the SP prediction. Probabilities closer to 1.0 indicate that the score was likely obtained by chance alone, making the SP prediction more dubious.

Ambiguity codes (such as B or Z) in protein sequences contribute exactly 0 to the score of the sequence window within which they are found. Therefore, the scores and probabilities associated with any predicted motifs from such a sequence window are likely to differ to varying extents from what they would be otherwise. You shouldn't routinely encounter this problem because ambiguity codes are extremely rare in protein sequences.

The "McGeoch scan" information is included in the results to help you decide whether predicted SPs are real when their scores are only marginal or when the probability of achieving those scores seems rather high. The McGeoch scan looks at the upstream part of the predicted SP, beginning with the putative initiator methionine, to determine whether the sequence meets McGeoch's criteria for a minimum acceptable SP (see the ALGORITHM topic below). If a low-scoring SP fails the McGeoch scan, it may be a false positive prediction; if the McGeoch scan succeeds, that SP might merit a closer look.

Because of the way SPScan sorts and stores predicted SPs during scanning, no particular ordering is guaranteed among SPs that have exactly the same score (see the ALGORITHM topic below).

ALGORITHM

SPScan uses the weight matrix method of von Heijne (von Heijne, G. Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit. (1987)), in concert with McGeoch's description of a minimum acceptable SP (McGeoch, D. Virus Research 3; 271-286 (1985)) to predict secretory signal peptides within a protein sequence.

von Heijne's weight matrix method is widely used for detecting SPs in protein sequences. However, this method can misclassify non-functional SPs resulting from events like point mutations. To help reduce false positive predictions, SPScan also determines whether potential SPs meet McGeoch's criteria. SP predictions which fail to meet these criteria are more dubious in general than those that do.

The following is a brief description of how these methods are used to make SP predictions.

Each input protein sequence is scanned from beginning to end. The first residue in each sequence is always examined as a potential SP starting point; subsequently, only methionine residues are considered as potential SP starting points.

At each potential SP starting point, SPScan first checks to see whether McGeoch's criteria for a minimum SP are met. SPScan looks for what von Heijne refers to as an n-region and what McGeoch calls the charged region or CR. This is a window of 11 or fewer residues (including the potential starting residue) containing at least one charged amino acid residue (the charged amino acids are arginine, lysine, aspartic acid, and glutamic acid). In a real SP, the charged region usually has a charge in the range -1 to +2. If a charged residue is not found, the potential SP has failed to meet McGeoch's criteria.

If a charged region is found, the distal charged amino acid residue is taken as the end of the charged region. The scan continues downstream for an 8-residue window within 15 residues of the end of the charged region. This is referred to by von Heijne as the h-region, and by McGeoch as the uncharged region or UR. To qualify as an uncharged region, the maximally hydrophobic 8-residue window within this 15-residue range should have a hydrophobicity on the Kyte-Doolittle scale of at least 15. If a good uncharged region is found, we take the end of that maximally hydrophobic 8-residue window to be the end of the uncharged region and the potential SP is deemed to have met the McGeoch criteria. The potential SP will be evaluated using von Heijne's weight matrix method in the next stage of the scan. If a good h-region is not found, the potential SP has failed to meet McGeoch's criteria.
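
The two-step check described above (charged region, then uncharged region) can be sketched in Python as follows. This is an illustration under stated assumptions, not SPScan's implementation: the function name mcgeoch_scan and its parameter names are hypothetical, the hydropathy values are the raw published Kyte-Doolittle table, and "within 15 residues" is read as "the 8-residue window lies entirely inside the 15 residues that follow the charged region". The hydropathy figures in the sample report (for example 60.6) are evidently on a different scaling, so the sketch does not try to reproduce them.

```python
# Raw Kyte-Doolittle hydropathy values.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Charged residues: arginine, lysine, aspartic acid, glutamic acid.
CHARGE = {"R": 1, "K": 1, "D": -1, "E": -1}

def mcgeoch_scan(seq, start=0, cr_window=11, ur_range=15, ur_window=8,
                 min_hydropathy=15.0):
    """Check a potential SP start for a charged and an uncharged region."""
    head = seq[start:start + cr_window]
    charged = [i for i, aa in enumerate(head) if aa in CHARGE]
    if not charged:
        return {"succeeded": False}          # no charged residue: criteria not met
    cr_length = charged[-1] + 1              # ends at the distal charged residue
    cr_charge = sum(CHARGE.get(aa, 0) for aa in head[:cr_length])

    ur_start = start + cr_length             # first residue after the charged region
    region = seq[ur_start:ur_start + ur_range]
    best = None                              # (hydropathy, window offset in region)
    for i in range(len(region) - ur_window + 1):
        h = sum(KD.get(aa, 0.0) for aa in region[i:i + ur_window])
        if best is None or h > best[0]:
            best = (h, i)
    if best is None or best[0] < min_hydropathy:
        return {"succeeded": False, "cr_length": cr_length, "cr_charge": cr_charge}
    h, i = best
    return {
        "succeeded": True,
        "cr_length": cr_length,
        "cr_charge": cr_charge,
        "ur_length": i + ur_window,          # up to the end of the best window
        "max_window_hydropathy": h,
    }

result = mcgeoch_scan("MKFLVLALTILLAAGTQAFP")
print(result["succeeded"], result["cr_length"], result["cr_charge"])  # True 2 1
```

For the first 20 residues shown in the sample report, the sketch finds a charged region of length 2 with charge +1, in agreement with the charged-region statistics that SPScan prints.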

The potential SP is then subjected to scanning using von Heijne's weight matrix method. The weight matrix is applied beginning with the potential starting residue for the SP, and scanning continues residue by residue until a region 70 residues long has been examined (very few SPs will be longer than 70 residues in eukaryotes or prokaryotes). The cleavage site predicted by the weight matrix application yielding the highest score is reported. The score reported for a predicted SP is just the von Heijne weight matrix score; the result of the scan for the McGeoch criteria is not reflected in that score, but is simply reported as success or failure.
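
The weight matrix stage described above amounts to scoring every candidate cleavage site in the region and keeping the best one. The following Python sketch shows that idea only. The function name best_cleavage_site is hypothetical, the 15-residue minimum is taken from the CONSIDERATIONS topic, and TOY_WEIGHTS is invented purely for illustration: it is not related to the contents of speuk.dat, spgpos.dat, or spgneg.dat, whose format is not described here.

```python
# Toy weights, invented for illustration only. Keys are offsets relative to the
# cleavage site (-1 is the last SP residue); residues not listed score 0.
TOY_WEIGHTS = {
    -3: {"A": 1.5, "V": 1.0, "T": 1.0, "S": 1.0},
    -1: {"A": 2.0, "G": 1.0, "S": 1.0},
}

def best_cleavage_site(seq, weights, start=0, max_region=70, min_sp_length=15):
    """Return (score, sp_length) for the highest-scoring candidate cleavage site."""
    region = seq[start:start + max_region]
    best = None
    for sp_length in range(min_sp_length, len(region) + 1):
        score = 0.0
        inside = True
        for offset, column in weights.items():
            # Negative offsets index SP residues, positive ones the mature protein.
            pos = sp_length + offset if offset < 0 else sp_length + offset - 1
            if not 0 <= pos < len(region):
                inside = False               # matrix would fall off the region
                break
            score += column.get(region[pos], 0.0)
        if inside and (best is None or score > best[0]):
            best = (score, sp_length)
    return best

print(best_cleavage_site("MKFLVLALTILLAAGTQAFP", TOY_WEIGHTS))  # (3.0, 18)
```

With these toy weights the best site falls after residue 18, but the score of 3.0 has no relation to the 12.2 that SPScan reports with its real matrix.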

The statistical significance of each score is computed as the probability of random occurrence of that score in a sequence with the same amino acid residue distribution as that portion of the target sequence scanned and whose positions are all independent of each other (Claverie, J.-M. and Audic, S. CABIOS 12(5); 431-439 (1996)).
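
When positions are independent, the probability that one placement of a weight matrix reaches a given score can be computed exactly by building up the score distribution one matrix column at a time. The Python sketch below illustrates that general idea. It is not taken from SPScan or from the cited paper: the function names are hypothetical, the two-column matrix is a toy, and the sketch covers a single placement of the matrix only; how SPScan combines placements over the scanned region is not described here.

```python
from collections import Counter, defaultdict

def composition(seq):
    """Residue frequencies of the scanned sequence (the default background)."""
    counts = Counter(seq)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

def even_background(alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Equal frequency for every residue, in the spirit of -EVEn."""
    return {aa: 1.0 / len(alphabet) for aa in alphabet}

def tail_probability(columns, background, threshold):
    """P(score >= threshold) for one placement of the matrix.

    columns is a list of {residue: weight} dicts, one per matrix position;
    residues missing from a column contribute 0.
    """
    dist = {0.0: 1.0}                        # score -> probability so far
    for column in columns:
        nxt = defaultdict(float)
        for score, p in dist.items():
            for aa, freq in background.items():
                # Round so that equal scores reached by different paths merge.
                nxt[round(score + column.get(aa, 0.0), 6)] += p * freq
        dist = nxt
    return sum(p for score, p in dist.items() if score >= threshold)

toy_columns = [{"A": 1.0}, {"A": 2.0}]       # toy matrix, illustration only
background = composition("AALL")             # {"A": 0.5, "L": 0.5}
print(tail_probability(toy_columns, background, 3.0))  # 0.25
```

In the toy case only the residue pair AA reaches a score of 3, and with both residues at frequency 0.5 that pair has probability 0.25.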

The weight matrices used to compute scores for potential SPs are from data given in Nielsen, H. et al. Protein Engineering 10(1); 1-6 (1997). There are matrices for eukaryotes, Gram-positive prokaryotes, and Gram-negative prokaryotes.

There is no guarantee of the relative ordering between predicted SPs having exactly the same score. For example, as we scan from the beginning of the sequence to the end, if the first two SPs encountered each have the score 3.7, SPScan may list the second SP before the first in the final report.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % spscan [-INfile=]pir:jh0472 -Default
 
Prompted Parameters:
 
-BEGin=1 -END=258           sets the range of interest
-THRESHold=7.0              sets minimum score for SP detection
[-OUTfile=]jh0472.spscan    specifies name of results file
 
Local Data Files:
 
-DATa=speuk.dat    assigns the weight matrix for eukaryotic SPs
-DATa=spgpos.dat   assigns the weight matrix for Gram-positive prokaryotic SPs
-DATa=spgneg.dat   assigns the weight matrix for Gram-negative prokaryotic SPs
 
Optional Parameters:
 
-NUMTOPscores=3             specifies maximum number of SPs to report
-GRAMPositive               uses Gram-positive prokaryote weight matrix
-GRAMNegative               uses Gram-negative prokaryote weight matrix
-ADJustscores               reduces scores of very long SPs
-EVEn                       assumes even target residue distribution
-NOPROBabilities            doesn't compute score probabilities
-VERbose                    uses verbose output
-RSF[=spscan.rsf]           saves predicted SPs as features in the RSF file
-MONitor                    displays screen trace of progress
-NOSUMmary                  suppresses screen summary at the end of the program

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

If you use -GRAMPositive, SPScan will use the weight matrix file for Gram-positive prokaryotes called spgpos.dat. If you use -GRAMNegative, SPScan will use the weight matrix file for Gram-negative prokaryotes called spgneg.dat. The default behavior is to use the weight matrix file for detecting eukaryotic SPs, speuk.dat.

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-THRESHold=7.0

Sets the minimum score for secretory signal peptide detection.

-NUMTOPscores=3

Allows you to specify the maximum number of predicted SPs to report for each sequence scanned. For example, if you specify -NUMTOPscores=3, SPScan will display no more than three of the highest scoring SPs predicted for each sequence. Use -NUMTOPscores=1 if you want to see only the highest-scoring SP in each sequence. By default, SPScan will display all SPs whose scores meet or exceed the threshold.
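
The combined effect of -THRESHold and -NUMTOPscores can be pictured as a filter followed by a sort and a cut. The Python sketch below is an illustration only: the function name select_predictions and the dictionary layout are hypothetical, and the "meets or exceeds" comparison follows the wording of the paragraph above.

```python
def select_predictions(predictions, threshold=7.0, num_top_scores=None):
    """Keep predictions at or above the threshold, best first, at most N of them."""
    kept = [p for p in predictions if p["score"] >= threshold]
    kept.sort(key=lambda p: p["score"], reverse=True)
    return kept if num_top_scores is None else kept[:num_top_scores]

predictions = [
    {"start": 1, "score": 12.2},
    {"start": 40, "score": 6.5},
    {"start": 75, "score": 8.0},
    {"start": 90, "score": 9.1},
]
top = select_predictions(predictions, threshold=7.0, num_top_scores=2)
print([p["start"] for p in top])  # [1, 90]
```

Python's sort is stable, so tied scores keep their input order in this sketch; SPScan makes no such guarantee for SPs with exactly the same score.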

-GRAMPositive

Tells SPScan to use the Gram-positive prokaryote weight matrix described in Nielsen, H. et al. Protein Engineering 10(1); 1-6 (1997). The default weight matrix is the one for eukaryotes described in the same paper.

-GRAMNegative

Tells SPScan to use the Gram-negative prokaryote weight matrix described in Nielsen, H. et al. Protein Engineering 10(1); 1-6 (1997). The default weight matrix is the one for eukaryotes described in the same paper.

-ADJustscores

Tells SPScan to reduce each computed score by an amount proportional to the difference between the length of the predicted SP and the empirical "maximum" length of SPs for the appropriate organism type (eukaryote, Gram-positive prokaryote, or Gram-negative prokaryote). These maxima are not absolute limits, but are described in Nielsen, H. et al. (Protein Engineering 10(1); 1-6 (1997)) as being the length beyond which genuine SPs appear only very rarely. Use this option to cause predicted SPs that are probably too long to be real to be printed later in the sorted list of predictions.
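
The length adjustment can be sketched in Python as a linear penalty on the residues beyond the empirical maximum. The maxima of 35, 40, and 45 residues come from the CONSIDERATIONS topic. The function name adjust_score and the parameter penalty_per_residue are hypothetical, and the default of 0.1 is an arbitrary placeholder, because the constant of proportionality that SPScan uses is not stated here.

```python
# Empirical "maximum" SP lengths by organism type (see CONSIDERATIONS).
MAX_SP_LENGTH = {"eukaryote": 35, "gram_negative": 40, "gram_positive": 45}

def adjust_score(score, sp_length, organism="eukaryote", penalty_per_residue=0.1):
    """Reduce a score in proportion to the length beyond the empirical maximum.

    penalty_per_residue is a placeholder; the real constant is not documented.
    """
    excess = max(0, sp_length - MAX_SP_LENGTH[organism])
    return score - penalty_per_residue * excess

print(adjust_score(12.2, 18))                           # 12.2, within the limit
print(adjust_score(10.0, 40, penalty_per_residue=0.5))  # 7.5, five residues too long
```

An SP no longer than the maximum for its organism type keeps its score unchanged, so the adjustment only moves unusually long predictions down the sorted list.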

-EVEn

Tells SPScan to assume that amino acid residues are distributed evenly throughout the length of the target sequence for the purpose of calculating score probabilities. This makes SPScan run a little faster, because it does not have to compute the actual distribution of residues in each input sequence, but the reliability of the score probability calculations may be adversely affected.

-NOPROBabilities

Tells SPScan to forgo the calculation of the probability of random occurrence of each score. This makes SPScan run much faster.

-VERbose

Tells SPScan to print more documentation about each sequence to the output file. The number of lines of documentation printed depends upon the value of the % DocLines global switch described in "Using Global Switches" in Section 3, Using Programs in the User's Guide.

-RSF=spscan.rsf

Writes an RSF (rich sequence format) file containing the input sequences annotated with features generated from the results of SPScan. This RSF file is suitable for input to other Accelrys GCG (GCG) programs that support RSF files. In particular, you can use SeqLab to view these feature annotations graphically. If you don't specify a file name with this parameter, then the program creates one using spscan for the file basename and .rsf for the extension. For more information on RSF files, see "Using Rich Sequence Format (RSF) Files" in Section 2 of the User's Guide, or "Rich Sequence Format (RSF) Files" in Appendix C of the SeqLab Guide.

-MONitor=10

Monitors this program's progress on your screen. Use this parameter to see this same monitor in the log file for a batch process. If the monitor is slowing down the program because your terminal is connected to a slow modem, suppress it with -NOMONitor.

The monitor is updated every time the program processes 10 sequences or files. You can use a value after the parameter to set this monitoring interval to some other number.

-SUMmary

Writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: May 27, 2005 14:43 



Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ® and, the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

www.accelrys.com/bio