SPSCAN+

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

FUNCTION

SPScan+ scans protein sequences for the presence of secretory signal peptides (SPs).

DESCRIPTION

Advantages of Plus “+” Programs:

- Plus programs are enhanced to read sequences in a variety of native formats, such as GCG RSF, GCG SSF, GCG MSF, GenBank, EMBL, FastA, SwissProt, PIR, and BSML, without conversion.

- Plus programs remove the sequence length restriction of 350,000 bp.

If you do not need these features and wish to have more interactivity, you might wish to seek out and run the original program version.

SPScan+ predicts secretory signal peptides (SPs) in protein sequences. For each sequence, SPScan+ prints a list of possible secretory signal peptides sorted in descending order by score. Associated with each score is the probability of achieving that score in the target sequence by chance using the given weight matrix. SPScan+ has weight matrices for eukaryotes, Gram-positive prokaryotes, and Gram-negative prokaryotes.

EXAMPLE

Here is a session with SPScan+ that was used to find SPs in the apolipoprotein A-I precursor protein sequence from Salmo salar:

% spscan+

SPScan+ scans protein sequences for the presence of secretory signal peptides

SPScan with what sequence(s) ? pir:jh0472

Begin (* 1 *) ? 1

End (-1 for entire sequence) (* -1 *) ?

Only display SPs whose score exceeds (* 7.0 *) ?

What should I call the output file (* <sequence_name>.spscan+ *) ? jh0472.spscan+

SPScan of pir:jh0472  December 03, 2004 14:56

  Weight matrix: SHARE_MATRIX:speuk.dat

  Minimum score for SPs (threshold): 7.0

  Predicted cleavage sites indicated by '^'.

Analyzing sequence 'JH0472' from 'pir2:JH0472'

Processing results...

Input sequences processed              : 1

Number of sequences with predicted SPs : 1

Output File                            : jh0472.spscan+

Results written to jh0472.spscan+

OUTPUT

Here is the output file:

> sequence: pir2:JH0472

name: JH0472 check: 8711 from: 1 to: 258

1. 1 MKFLVLALTILLAAGTQA^FP 20

Score: 12.2

Probability: 1.455E-03

SP length: 18

McGeoch scan succeeded:

Charged-region statistics:

Length: 2 Charge: 1

Hydrophobic-region statistics:

Length: 9 Offset: 3 Total hydropathy: 67.8

Maximum 8-residue hydropathy: 60.6, starting at 5

*** SUMMARY ***

Input sequences processed : 1

Number of sequences with predicted SPs : 1
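If you want to post-process the report, the fields shown above can be pulled out with a short script. The following is a minimal Python sketch, not part of SPScan+; it assumes the report layout matches the sample above (a ranked line with the cleavage site marked by '^', followed by "Score:", "Probability:", and "SP length:" lines). The function name parse_predictions is hypothetical.

```python
import re

# A fragment copied from the sample report above.
SAMPLE = """\
1. 1 MKFLVLALTILLAAGTQA^FP 20
Score: 12.2
Probability: 1.455E-03
SP length: 18
"""

def parse_predictions(text):
    """Collect one dict per predicted SP from SPScan+-style report text."""
    predictions = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Ranked line: rank, start position, residues with '^', end position.
        header = re.match(r"^(\d+)\.\s+(\d+)\s+([A-Z]+)\^([A-Z]*)\s+(\d+)$", line)
        if header:
            current = {
                "rank": int(header.group(1)),
                "start": int(header.group(2)),
                "signal_peptide": header.group(3),  # residues before '^'
                "end": int(header.group(5)),
            }
            predictions.append(current)
        elif current is not None and line.startswith("Score:"):
            current["score"] = float(line.split(":", 1)[1])
        elif current is not None and line.startswith("Probability:"):
            current["probability"] = float(line.split(":", 1)[1])
        elif current is not None and line.startswith("SP length:"):
            current["sp_length"] = int(line.split(":", 1)[1])
    return predictions

predictions = parse_predictions(SAMPLE)
```

In the sample, the residues before the '^' marker number 18, which agrees with the reported SP length.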

INPUT FILES

The input to SPScan+ is one or more protein sequences. If SPScan+ rejects your protein sequence, turn to Appendix VI to see how to change or set the type of a sequence. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example Genbank:*.

RELATED PROGRAMS

SPScan scans protein sequences for the presence of secretory signal peptides (SPs).

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.

FindPatterns+ identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal.

HTHScan+ scans protein sequences for the presence of helix-turn-helix motifs, indicative of sequence-specific DNA-binding structures often associated with gene regulation.

CoilScan+ locates coiled-coil segments in protein sequences. TransMem scans for likely transmembrane helices in one or more input protein sequences.

FindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal.

HTHScan scans protein sequences for the presence of helix-turn-helix motifs, indicative of sequence-specific DNA-binding structures often associated with gene regulation.

CoilScan locates coiled-coil segments in protein sequences. TransMem scans for likely transmembrane helices in one or more input protein sequences.

CONSIDERATIONS

Under normal circumstances it is likely that SPScan+ will predict more than one SP in your sequence. Often one of these will have a score significantly greater than the others. If not, keep the following points in mind when evaluating the results of SPScan+ (from Nielsen, H. et al. Protein Engineering 10(1); 1-6 (1997)):

- SPs in eukaryotes are very rarely longer than 35 residues in length (40 residues for Gram-negative bacteria, 45 for Gram-positive bacteria). -adjustscores causes the scores of long predictions to linearly diminish as the predicted SP lengthens beyond those empirical limits.

- SPs shorter than 15 residues are extremely rare in both eukaryotes and prokaryotes. SPScan+ won't find any SPs shorter than 15 residues in length.

The probability value attached to each score measures the probability of achieving that score or higher by chance with the given weight matrix and target sequence, and is very useful when evaluating SP predictions. A probability close to 0.0 indicates that achieving the score purely by chance is very unlikely, so you can have more confidence in the SP prediction. A probability closer to 1.0 indicates that you have likely gotten the score by chance alone, making the SP prediction more dubious.

Ambiguity codes (such as B or Z) in protein sequences contribute exactly 0 to the score of the sequence window within which they are found. Therefore, the scores and probabilities associated with any predicted motifs from such a sequence window are likely to differ to varying extents from what they would be otherwise. You shouldn't routinely encounter this problem because ambiguity codes are extremely rare in protein sequences.

The "McGeoch scan" information is included in the results to help you decide whether predicted SPs are real when their scores are only marginal or when the probability of achieving those scores seems rather high. The McGeoch scan looks at the upstream part of the predicted SP, beginning with the putative initiator methionine, to determine whether the sequence meets McGeoch's criteria for a minimum acceptable SP (see the ALGORITHM topic below). If a low-scoring SP fails the McGeoch scan, it may be a false positive prediction; if the McGeoch scan succeeds, that SP might merit a closer look.

Because of the way SPScan+ sorts and stores predicted SPs during scanning, no particular ordering is guaranteed among SPs that have exactly the same score (see the ALGORITHM topic below).

ALGORITHM

SPScan+ uses the weight matrix method of von Heijne (von Heijne, G. Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit. (1987)), in concert with McGeoch's description of a minimum acceptable SP (McGeoch, D. Virus Research 3; 271-286 (1985)), to predict secretory signal peptides within a protein sequence.

Von Heijne's weight matrix method is widely used for detecting SPs in protein sequences. However, this method can misclassify non-functional SPs resulting from events like point mutations. To help reduce false positive predictions, SPScan+ also determines whether potential SPs meet McGeoch's criteria. SP predictions that fail to meet these criteria are generally more dubious than those that do.

The following is a brief description of how these methods are used to make SP predictions.

Each input protein sequence is scanned from beginning to end. The first residue in each sequence is always examined as a potential SP starting point; subsequently, only methionine residues are considered as potential SP starting points.
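The choice of starting points can be sketched in a few lines of Python. This is an illustration of the rule as described above, not SPScan+ source code; the function name candidate_starts and the use of 0-based indices are assumptions made for the example.

```python
def candidate_starts(sequence):
    """Return 0-based indices of potential SP starting points.

    The first residue is always a candidate; after that, only
    methionine (M) residues are considered.
    """
    if not sequence:
        return []
    later_methionines = [i for i in range(1, len(sequence)) if sequence[i] == "M"]
    return [0] + later_methionines

starts = candidate_starts("AKMLLMA")
```

For the toy sequence AKMLLMA, the candidates are index 0 (the first residue, even though it is not methionine) and the two methionines at indices 2 and 5.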

At each potential SP starting point, SPScan+ first checks to see whether McGeoch's criteria for a minimum SP are met. SPScan+ looks for what von Heijne refers to as an n-region and what McGeoch calls the charged region or CR. This is a window of 11 or fewer residues (including the potential starting residue) containing at least one charged amino acid residue (the charged amino acids are arginine, lysine, aspartic acid, and glutamic acid). In a real SP, the charged region usually has a charge in the range -1 to +2. If a charged residue is not found, the potential SP has failed to meet McGeoch's criteria.
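The charged-region check might look like the following Python sketch. It is an approximation under stated assumptions, not the program's actual code: the charged residues are taken to be arginine and lysine (positive) and aspartic acid and glutamic acid (negative), the region is taken to end at the last charged residue in the 11-residue window, and the net charge simply counts +1 or -1 per charged residue. The name charged_region is hypothetical.

```python
POSITIVE = set("RK")  # arginine, lysine
NEGATIVE = set("DE")  # aspartic acid, glutamic acid

def charged_region(sequence, start, max_window=11):
    """Return (length, net_charge) of the charged region, or None.

    The window begins at the 0-based index `start` and spans at most
    `max_window` residues. The region ends at the most distal charged
    residue found in that window (an assumption of this sketch).
    """
    window = sequence[start:start + max_window]
    charged_positions = [
        i for i, aa in enumerate(window) if aa in POSITIVE or aa in NEGATIVE
    ]
    if not charged_positions:
        return None  # fails McGeoch's criteria
    length = charged_positions[-1] + 1
    region = window[:length]
    positives = sum(1 for aa in region if aa in POSITIVE)
    negatives = sum(1 for aa in region if aa in NEGATIVE)
    return length, positives - negatives

result = charged_region("MKFLVLALTILLAAGTQAFP", 0)
```

For the first 20 residues shown in the sample report (MKFLVLALTILLAAGTQAFP), this sketch gives a region of length 2 with charge +1, which agrees with the charged-region statistics printed there.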

If a charged region is found, the distal charged amino acid residue is taken as the end of the charged region. The scan continues downstream for an 8-residue window within 15 residues of the end of the charged region. This is referred to by von Heijne as the h-region, and by McGeoch as the uncharged region or UR. To qualify as an uncharged region, the maximally hydrophobic 8-residue window within this 15-residue range should have hydrophobicity on the Kyte-Doolittle scale of at least 15. If a good uncharged region is found, we take the end of that maximally hydrophobic 8-residue window to be the end of the uncharged region and the potential SP is deemed to have met the McGeoch criteria. The potential SP will be evaluated using von Heijne's weight matrix method in the next stage of the scan. If a good h-region is not found, the potential SP has failed to meet McGeoch's criteria.
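A minimal Python sketch of the uncharged-region search follows. It assumes that candidate 8-residue windows must lie entirely within the 15 residues that follow the charged region, and it sums raw Kyte-Doolittle hydropathy values; the scaling SPScan+ uses for the hydropathy figures in its report may differ, so this sketch does not try to reproduce those numbers. The function name and 0-based indexing are assumptions.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def best_hydrophobic_window(sequence, cr_end, window=8, span=15, threshold=15.0):
    """Return (start_index, hydropathy) of the best window, or None.

    `cr_end` is the 0-based index of the first residue after the
    charged region. Residues missing from the table contribute 0.
    """
    region = sequence[cr_end:cr_end + span]
    best = None
    for offset in range(len(region) - window + 1):
        total = sum(
            KYTE_DOOLITTLE.get(aa, 0.0) for aa in region[offset:offset + window]
        )
        if best is None or total > best[1]:
            best = (cr_end + offset, total)
    if best is None or best[1] < threshold:
        return None  # no acceptable uncharged region
    return best

hit = best_hydrophobic_window("MKFLVLALTILLAAGTQAFP", 2)
```

With the threshold of 15 quoted above, the leucine-rich stretch after the charged region of the sample sequence passes easily, while a run of polar or charged residues does not.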

The potential SP is then subjected to scanning using von Heijne's weight matrix method. The weight matrix is applied beginning with the potential starting residue for the SP, and scanning continues residue by residue until a region 70 residues long has been examined (very few SPs will be longer than 70 residues in eukaryotes or prokaryotes). The cleavage site predicted by the weight matrix application yielding the highest score is reported. The score reported for a predicted SP is just the von Heijne weight matrix score; the result of the scan for the McGeoch criteria is not reflected in that score, but is simply reported as success or failure.
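The weight matrix scan can be illustrated with a toy example in Python. The three-column matrix below is invented purely for illustration and has nothing to do with the real SPScan+ matrices; the function name, the cleavage_offset parameter, and the matrix representation (one dictionary of residue weights per column) are likewise assumptions. Residues absent from a column, including ambiguity codes, contribute 0 to the score.

```python
def scan_weight_matrix(sequence, start, matrix, cleavage_offset, max_span=70):
    """Slide `matrix` over up to `max_span` residues beginning at `start`.

    Returns (best_score, cleavage_index), where cleavage_index is the
    0-based index of the first residue after the predicted cleavage
    site, or None if the region is shorter than the matrix.
    """
    region = sequence[start:start + max_span]
    width = len(matrix)
    best = None
    for pos in range(len(region) - width + 1):
        placed = region[pos:pos + width]
        score = sum(column.get(aa, 0.0) for column, aa in zip(matrix, placed))
        if best is None or score > best[0]:
            best = (score, start + pos + cleavage_offset)
    return best

# Toy matrix (hypothetical weights): favors A, Q, A just before the cleavage site.
toy_matrix = [{"A": 1.0}, {"Q": 0.5}, {"A": 2.0}]
best = scan_weight_matrix("MKAQAFP", 0, toy_matrix, cleavage_offset=3)
```

Here the best placement covers the residues AQA with a score of 3.5, and the cleavage index of 5 splits the toy sequence into MKAQA and FP.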

The statistical significance of each score is computed as the probability of random occurrence of that score in a sequence with the same amino acid residue distribution as that portion of the target sequence scanned and whose positions are all independent of each other (Claverie, J.-M. and Audic, S. CABIOS 12(5); 431-439 (1996)).
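One way to compute such a probability is to build the exact distribution of matrix scores column by column, weighting each residue by its frequency in the scanned sequence. The Python sketch below does this for a single placement of the matrix only; how SPScan+ combines placements, and its exact numerical procedure, are not described here, so treat this as an illustration of the independence assumption rather than a reimplementation. All names and the toy matrix are hypothetical.

```python
from collections import Counter, defaultdict

def residue_frequencies(sequence):
    """Observed frequency of each residue in the scanned sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return {aa: n / total for aa, n in counts.items()}

def score_tail_probability(matrix, frequencies, observed, digits=3):
    """P(score >= observed) for one matrix placement on a random sequence
    whose positions are independent and follow `frequencies`."""
    distribution = {0.0: 1.0}  # score -> probability
    for column in matrix:
        updated = defaultdict(float)
        for partial, p in distribution.items():
            for aa, freq in frequencies.items():
                # Round so equal scores reached by different paths merge.
                key = round(partial + column.get(aa, 0.0), digits)
                updated[key] += p * freq
        distribution = updated
    return sum(p for score, p in distribution.items() if score >= observed - 1e-9)

toy_matrix = [{"A": 1.0}, {"A": 1.0}]  # hypothetical weights
freqs = residue_frequencies("AAGG")    # A and G each at frequency 0.5
p_top = score_tail_probability(toy_matrix, freqs, 2.0)
```

In this toy case the top score of 2.0 requires A at both positions, so its probability is 0.5 x 0.5 = 0.25.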

The weight matrices used to compute scores for potential SPs are from data given in Nielsen, H. et al. Protein Engineering 10(1); 1-6 (1997). There are matrices for eukaryotes, Gram-positive prokaryotes, and Gram-negative prokaryotes.

There is no guarantee of the relative ordering between predicted SPs having exactly the same score. For example, as we scan from the beginning of the sequence to the end, if the first two SPs encountered each have the score 3.7, SPScan+ may list the second SP before the first in the final report.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -check to view the summary below and to specify parameters before the program executes. In the syntax summary below, square brackets ([ and ]) enclose parameter values that are optional. For each program parameter, square brackets enclose the type of parameter value, the default parameter value, and any aliases (shortened forms of the parameter name). Programs with a plus in the name accept either the full parameter name or a specified alias. If “Type” is “Boolean”, the presence of the parameter on the command line indicates a true condition; a false condition must be stated explicitly as parameter=false.

SPScan+ scans protein sequences for the presence of secretory signal peptides.

Minimal Syntax: % spscan+ [-infile=]value -Default

Minimal Parameters (case-insensitive):

-infile         [Type: InFile / Default: EMPTY / Aliases: infile1 in]

                The name of the input file.

Prompted Parameters (case-insensitive):

-begin          [Type: Integer / Default: '1' / Aliases: beg]

                First base of interest for each query sequence.

-end            [Type: Integer / Default: '-1']

                Last base of interest for each query sequence.

-threshold      [Type: Double / Default: '7.0' / Aliases: thresh]

                Sets minimum score for SP detection.

-outfile        [Type: OutFile / Default: '<sequence_name>.spscan+' / Aliases: out outfile1]

                Names the output file.

Optional Parameters (case-insensitive):

-check          [Type: Boolean / Default: 'false' / Aliases: che help]

                Prints out this usage message.

-default        [Type: Boolean / Default: 'false' / Aliases: d def]

                Specifies that sensible default values be used for all parameters where possible.

-documentation  [Type: Boolean / Default: 'true' / Aliases: doc]

                Prints banner at program startup.

-quiet          [Type: Boolean / Default: 'false' / Aliases: qui]

                Tells application to print only a minimal amount of information.

-grampositive   [Type: Boolean / Default: 'false' / Aliases: gramp]

                Uses Gram-positive prokaryote weight matrix.

-gramnegative   [Type: Boolean / Default: 'false' / Aliases: gramn]

                Uses Gram-negative prokaryote weight matrix.

-adjustscores   [Type: Boolean / Default: 'false' / Aliases: adj]

                Reduces scores of very long SPs.

-data           [Type: String / Default: EMPTY / Aliases: dat]

                Assigns weight matrix.

-seqout         [Type: OutFile / Default: EMPTY / Aliases: rsf]

                Annotated sequence output.

-numtopscores   [Type: Integer / Default: '-1' / Aliases: numtop maxhits]

                Specifies maximum number of SPs to report.

-even           [Type: Boolean / Default: 'false' / Aliases:]

                Assumes even target residue distribution.

-probabilities  [Type: Boolean / Default: 'true' / Aliases: prob]

                Compute score probabilities.

-verbose        [Type: Boolean / Default: 'false' / Aliases: ver]

                Print more documentation about each sequence to the output file.

-monitor        [Type: Boolean / Default: 'false' / Aliases: mon]

                Displays screen trace of progress.

-summary        [Type: Boolean / Default: 'true' / Aliases: sum]

                Displays screen summary at end of the program.
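If you drive SPScan+ from a script, the command line can be assembled from the parameters listed above. The following Python sketch only builds the argument list and does not run the program. It uses parameter names taken from this summary, assumes the -name=value form shown for -infile also applies to the other valued parameters, and the helper name build_spscan_command and its organism argument are hypothetical.

```python
def build_spscan_command(infile, outfile=None, threshold=None,
                         organism="eukaryote", adjustscores=False,
                         numtopscores=None):
    """Return an argument list for a non-interactive spscan+ run."""
    argv = ["spscan+", f"-infile={infile}", "-default"]
    if outfile is not None:
        argv.append(f"-outfile={outfile}")
    if threshold is not None:
        argv.append(f"-threshold={threshold}")
    # Boolean parameters are true when present on the command line.
    if organism == "gram_positive":
        argv.append("-grampositive")
    elif organism == "gram_negative":
        argv.append("-gramnegative")
    elif organism != "eukaryote":
        raise ValueError(f"unknown organism type: {organism}")
    if adjustscores:
        argv.append("-adjustscores")
    if numtopscores is not None:
        argv.append(f"-numtopscores={numtopscores}")
    return argv

command = build_spscan_command(
    "pir:jh0472", outfile="jh0472.spscan+", threshold=7.0, numtopscores=1
)
```

The resulting list can be passed to a process launcher such as Python's subprocess.run on a system where the program is installed.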

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -data1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

If you use -grampositive, SPScan+ will use the weight matrix file for Gram-positive prokaryotes, $GCGROOT/share/matrix/spgpos.dat. If you use -gramnegative, SPScan+ will use the weight matrix file for Gram-negative prokaryotes, $GCGROOT/share/matrix/spgneg.dat. The default behavior is to use the weight matrix file for detecting eukaryotic SPs, speuk.dat.
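The file selection and lookup rules in this section can be summarized in a short Python sketch. It is an illustration only: the function names are hypothetical, the precedence of a file named on the command line over a same-named file in the working directory is an assumption, and so is the error raised when both Gram options are given.

```python
from pathlib import Path

def matrix_file_name(grampositive=False, gramnegative=False):
    """Pick the weight matrix file name for the chosen organism type."""
    if grampositive and gramnegative:
        raise ValueError("choose only one of -grampositive and -gramnegative")
    if grampositive:
        return "spgpos.dat"
    if gramnegative:
        return "spgneg.dat"
    return "speuk.dat"  # default: eukaryotes

def resolve_data_file(name, public_dir, cwd=".", data_option=None):
    """Locate a data file: command-line file, then working directory,
    then the public data directory."""
    if data_option:
        return Path(data_option)
    local = Path(cwd) / name
    if local.is_file():
        return local
    return Path(public_dir) / name

default_name = matrix_file_name()
```

For example, resolve_data_file(matrix_file_name(gramnegative=True), public_dir) falls back to the public directory unless a file named spgneg.dat exists in the working directory.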

PARAMETER REFERENCE

You can set the parameters listed below from the command line. Shortened forms of the parameter name, aliases, are shown, separated by commas.

-infile, -infile1, -in

                    The name of the input file.

-begin, -beg

                     First base of interest for each query sequence.

-end

                     Last base of interest for each query sequence.

-outfile, -out, -outfile1

Names the output file.

-check, -che, -help

                      Prints out this usage message.

-default, -d, -def

                      Specifies that sensible default values be used for all parameters where possible.

-documentation, -doc

                     Prints banner at program startup.

-quiet, -qui

This parameter is not supported.

-threshold=7.0, -thresh

Sets the minimum score for secretory signal peptide detection.

-data, -dat

           Assigns weight matrix.

-seqout, -rsf

           Annotated sequence output.

-numtopscores=3, -numtop

Allows you to specify the maximum number of predicted SPs to report for each sequence scanned. For example, if you specify -numtopscores=3, SPScan+ will display no more than three of the highest scoring SPs predicted for each sequence. Use -numtopscores=1 if you want to see only the highest-scoring SP in each sequence. By default, SPScan+ will display all SPs whose scores meet or exceed the threshold.
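The combined effect of the threshold and -numtopscores can be sketched in Python as follows. This is an illustration rather than SPScan+ code; it assumes each prediction is a dictionary with a "score" entry, and the function name is hypothetical. A value of -1, the default, means no limit.

```python
def select_predictions(predictions, threshold=7.0, numtopscores=-1):
    """Keep predictions scoring at or above the threshold, best first,
    truncated to `numtopscores` entries when that value is positive."""
    kept = [p for p in predictions if p["score"] >= threshold]
    kept.sort(key=lambda p: p["score"], reverse=True)
    if numtopscores > 0:
        kept = kept[:numtopscores]
    return kept

candidates = [
    {"site": 18, "score": 12.2},
    {"site": 25, "score": 6.4},
    {"site": 31, "score": 8.9},
    {"site": 22, "score": 7.0},
]
top_one = select_predictions(candidates, numtopscores=1)
```

Python's sort happens to keep tied scores in their original order, but SPScan+ makes no such guarantee for predictions with exactly the same score.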

-grampositive, -gramp

Tells SPScan+ to use the Gram-positive prokaryote weight matrix described in Nielsen, H. et al. Protein Engineering 10(1); 1-6 (1997). The default weight matrix is the one for eukaryotes described in the same paper.

-gramnegative, -gramn

Tells SPScan+ to use the Gram-negative prokaryote weight matrix described in Nielsen, H. et al. Protein Engineering 10(1); 1-6 (1997). The default weight matrix is the one for eukaryotes described in the same paper.

-adjustscores, -adj

Tells SPScan+ to reduce each computed score by an amount proportional to the difference between the length of the predicted SP and the empirical "maximum" length of SPs for the appropriate organism type (eukaryote, Gram-positive prokaryote, or Gram-negative prokaryote). These maxima are not absolute limits, but are described in Nielsen, H. et al. (Protein Engineering 10(1); 1-6 (1997)) as being the length beyond which genuine SPs appear only very rarely. Use this option to cause predicted SPs that are probably too long to be real to be printed later in the sorted list of predictions.
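The adjustment can be pictured with the Python sketch below. The length limits are the empirical maxima quoted in the CONSIDERATIONS topic (35 residues for eukaryotes, 40 for Gram-negative bacteria, 45 for Gram-positive bacteria). The size of the reduction per excess residue is not documented here, so penalty_per_residue is a purely hypothetical value, as are the function and dictionary names.

```python
# Empirical "maximum" SP lengths, in residues.
MAX_SP_LENGTH = {
    "eukaryote": 35,
    "gram_negative": 40,
    "gram_positive": 45,
}

def adjust_score(score, sp_length, organism="eukaryote", penalty_per_residue=0.1):
    """Reduce `score` linearly for each residue beyond the empirical limit.

    `penalty_per_residue` is a made-up constant for illustration only.
    """
    excess = max(0, sp_length - MAX_SP_LENGTH[organism])
    return score - penalty_per_residue * excess

unchanged = adjust_score(12.2, 18)
reduced = adjust_score(10.0, 40, penalty_per_residue=0.5)
```

A prediction of normal length, such as the 18-residue SP in the sample report, keeps its score, while a 40-residue eukaryotic prediction is penalized for its 5 excess residues and so sorts later in the list.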

-even, -eve

Tells SPScan+ to assume that amino acid residues are distributed evenly throughout the length of the target sequence for the purpose of calculating score probabilities. This makes SPScan+ run a little faster, because it does not have to compute the actual distribution of residues in each input sequence, but the reliability of the score probability calculations may be adversely affected.

-probabilities, -prob

           Compute score probabilities.

-verbose, -ver

Tells SPScan+ to print more documentation about each sequence to the output file. The number of lines of documentation printed depends upon the value of the % DocLines global switch described in "Using Global Switches" in Section 3, Using Programs in the User's Guide.

-monitor=10, -mon

The program monitors its progress by displaying a trace on your screen. However, when you use -default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

The monitor is updated every time the program processes 10 sequences or files. You can use a value after the parameter to set this monitoring interval to some other number.

-summary, -sum

Writes a summary of the program's completion to the screen. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -summary=false.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: May 27, 2005 14:44

Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.