BLAST+


 

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

OUTPUT

INTERPRETING OUTPUT

INPUT FILES

RELATED PROGRAMS

RESTRICTIONS

CHOOSING SEARCH SETS

ALGORITHM

PRELIMINARIES

TURNING HITS INTO HSPs

GENERATING GAPPED EXTENSIONS

CONSIDERATIONS

SUGGESTIONS

FILTERING OUT LOW COMPLEXITY SEQUENCES

AMINO ACID SCORING

NUCLEOTIDE SCORING

ALTERNATIVE GENETIC CODES

NETWORK CONSIDERATIONS

Re-directing BLAST Services

COMMAND-LINE SUMMARY

CITING BLAST+

LOCAL DATA FILES

PARAMETER REFERENCE

NEW FUNCTIONS


FUNCTION


BLAST+ searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST+ can produce gapped alignments for the matches it finds.

DESCRIPTION


 

Advantages of Plus “+” Programs:

 

- Plus programs are enhanced to read sequences in a variety of native formats, such as GCG RSF, GCG SSF, GCG MSF, GenBank, EMBL, FastA, SwissProt, PIR, and BSML, without conversion.

 

- Plus programs remove the sequence length restriction of 350,000 bp.

 

If you do not need these features and want more interactivity, you may prefer to run the original version of the program.

BLAST+, or Basic Local Alignment Search Tool, uses the method of Altschul et al. (J. Mol. Biol. 215: 403-410 (1990)) to search for similarities between a query sequence and all the sequences in a database.

This release of BLAST+ implements version 2 of BLAST from the National Center for Biotechnology Information (NCBI), described in Altschul et al. (Nucleic Acids Res. 25(17): 3389-3402 (1997)). Version 2 is known as "gapped BLAST" because, in addition to offering a three-fold speedup over the original BLAST, it generates gapped alignments between query and database sequences.

You can specify any number of query sequences to BLAST+, in any combination of protein and nucleic acid sequences. You can also specify any number of databases to BLAST+. The databases need not be of the same type. In the current release, if you want to specify multiple databases, you must do so on the command line, separating them with commas.

In other words, you cannot specify more than one database from the interactive menu. For example:
 
  % blast+ -INfile2=PIR,SWPLUS

You can also specify multiple queries using any valid multiple sequence specification. For example:
 
  % blast+ -INfile1=hsp70.msf{*}

The Accelrys GCG (GCG) BLAST+ program supports five programs in the BLAST family:

 

BLASTP, Protein Query Searching a Protein Database

Each database sequence is compared to each query in a separate protein-protein pairwise comparison.

BLASTX, Nucleotide Query Searching a Protein Database

Each query is translated, and each of the six products is compared to each database sequence in a separate protein-protein pairwise comparison.

BLASTN, Nucleotide Query Searching a Nucleotide Database

Each database sequence is compared to the query in a separate nucleotide-nucleotide pairwise comparison.

TBLASTN, Protein Query Searching a Nucleotide Database

Each nucleotide database sequence is translated, and each of the six products is compared to the queries in a separate protein-protein pairwise comparison.

TBLASTX, Nucleotide Query Searching a Nucleotide Database

The query and database sequences are each translated in six frames, and the resulting 12 products (for each query sequence) are compared in 36 different pairwise comparisons. Because this program involves more computation than the others, gapped alignments are not available when using TBLASTX.
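
To make the translated comparisons concrete, here is a minimal Python sketch of six-frame translation, the step applied to nucleotide sequences before the protein-protein comparison in BLASTX, TBLASTN, and TBLASTX. It assumes the standard genetic code and unambiguous A, C, G, T input; the function names are illustrative and are not part of BLAST+.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the six products: frames +1, +2, +3, then -1, -2, -3."""
    strands = (dna.upper(), reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


print(six_frames("ATGGCC"))
```

For TBLASTX, both the query and the database sequence yield six products, so pairing every query frame with every database frame gives the 6 x 6 = 36 comparisons described above.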

Normally, BLAST+ decides which BLAST+ program you want to use simply by looking at the type (protein or nucleic acid) of your query sequence and the database you have selected. In the case of nucleotide-nucleotide searches, there are two programs that can do the search. By default, BLASTN is used. To search using TBLASTX instead, use -TBLASTX (but remember that gapped alignments are not available when using TBLASTX).

BLAST+ performs only local searches: It searches databases maintained at your institution. Local searches can consume significant computing resources, and require diligent maintenance of local databases. An alternative to running searches locally is to use NetBLAST+ which sends your query sequences over the internet to a server at NCBI, in Bethesda, MD. Keep in mind, however, that NCBI imposes some limitations on NetBLAST+ searches such as restricting the number of searches that a user is permitted to run in a single day, and prohibiting TBLASTX searches against the NR database. Additionally, NetBLAST+ does not support as many search options as are available with BLAST+.

BLAST+ is a statistically driven search method that finds regions of similarity between your query and database sequences and produces gapped alignments of these regions. Within these aligned regions, the sum of the scoring matrix values of their constituent symbol pairs is higher than some level that you would expect to occur by chance alone.

You are prompted to set an expectation level for the entire search. The expectation of a sequence is the number of times the current search would be expected to find a sequence with as good a score by chance alone. Therefore, setting the maximum expectation level to 10.0, the default, limits the reported sequences to those with scores high enough to have been found by chance only ten or fewer times.

EXAMPLE


Listing the blast databases

> blast+ -dbr

BLAST+ searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST+ can produce gapped alignments for the matches it finds.

 

pir        P  Protein Information Resource
uniprot    P  SWISS-PROT + SP-TREMBL
est_human  N  Human Expressed Sequence Tags (GenBank)
est_mouse  N  Mouse Expressed Sequence Tags (GenBank)
est_other  N  All Other Expressed Sequence Tags (GenBank)
genbank    N  GenBank
htg        N  High Throughput Genomes (HTG from GenBank)
htc        N  High Throughput Genomes (HTC from GenBank)
gss        N  Genome Survey Sequences (GSS from GenBank)
genpept    P  GenPept (Translated GenBank)
vbabuaa    P  Satheesh AA Sequences
vbabuna    N  Satheesh NA Sequences

Note: It is usually good practice to run blast+ with the -dbr option to list the available databases before you run a search or write scripts that name specific databases.

Running BLAST+ with a single public database

After reviewing the list of databases above, you can run a simple BLAST+ session to find the sequences in GenBank that are similar to a gamma-globin gene (ggamma.seq):

blast+ with what query sequence(s) ? ggamma.seq

Begin (* 1 *) ? 1050

End (-1 for entire sequence) (* -1 *) ? 1700

Search for query in what sequence database ? Genbank

Ignore hits expected to occur by chance more than n times (* 10 *) ? 1

Limit the number of sequences in my output to (* 500 *) ? 5

What should I call the output file (* <sequence_name>.blast+ *) ?

 

 

Results written to ggamma.blast+

 

Running BLAST+ with multiple public databases

If you want to specify multiple databases you must do so on the command line.

For example:

% blast+ -INfile2=PIR,SWPLUS

 

Running BLAST+ with one or more personal BLAST databases

 

1) Create a personal BLAST database using gcgtoblast (in Wisconsin Package 10.3) or FormatDB+ (in GCG 11.0)

 

For example, assume the databases created are:

 

$HomeDir/blastdb/smdb and $HomeDir/blastdb/testnadb, created using FormatDB+ in the GCG 11.0 environment.

 

$HomeDir/frmtdb/ggammabst, created using gcgtoblast in the Wisconsin Package 10.3 environment.

 

2) Create a blast.sdbs file in a local folder

Create a new configuration file, blast.sdbs, with one entry for each personal database.

A sample blast.sdbs file, $HomeDir/blast.sdbs, looks like this:

 

Database Type Description ..

 

$HomeDir/blastdb/nadb n TEST NA

$HOME/blastdb/aadb p TEST AA

 

3) Listing the personal databases

Use the -dbr command-line option in BLAST and BLAST+ to list the databases present in the $HOME/blast.sdbs file.

 

blast $HOME/ggamma.seq -data=$HOME/blast.sdbs -dbr

blast+ $HOME/ggamma.seq -data=$HOME/blast.sdbs -dbr

 

4) Running BLAST with personal databases

 

You can run BLAST searches against the personal databases listed in blast.sdbs by using the -data command-line option.

 

blast $HOME/ggamma.seq -in2=nadb -data=$HOME/blast.sdbs

 

blast+ $HOME/ggamma.seq -in2=aadb -data=$HOME/blast.sdbs

 

OUTPUT


Below is part of the output from the search in the example session:

The output has four parts: 1) an introduction that tells where the search occurred and what database and query were compared; 2) a list of the sequences in the database containing HSPs (high-scoring segment pairs) whose scores were least likely to have occurred by chance (the entries in this list have begin and end ranges on them if -fragments is specified); 3) a display of the alignments of the HSPs showing identical and similar residues; and 4) a complete list of the parameter settings used for the search.

By default, BLAST+ looks for alignments that contain gaps. If you only look for alignments that do not contain gaps, there will often be more than one segment pair associated with each database sequence.

 
///////////////////////////////////////////////////////////////////////////////
 

!!SEQUENCE_LIST 1.0
BLASTN 2.2.9 [May-01-2004]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= ggamma.seq (651 letters)

Database: BlastDir/Genbank
1756097 sequences; 1,999,994,996 total letters

Sequences producing significant alignments: Score(bits) E value

..
GB_PR:HUMHBB Begin: 20876 End: 41068
!U01317 Human beta globin region on chromosome 11. 3/2001 1057 0.0
GB_PAT:AX334794 Begin: 20876 End: 41068
!AX334794 Sequence 5303 from Patent WO0194629. 1/2002 1057 0.0
GB_PR:HUMGAMGLOA Begin: 3151 End: 8710
!M91036 Homo sapiens G-gamma globin (G-gamma globin) and A-gamma ... 1047 0.0
GB_PR:AC104389 Begin: 50531 End: 70896
!AC104389 Homo sapiens chromosome 11, clone CTD-2643I7, complete ... 1031 0.0
GB_PR:HUMGAMGLOB Begin: 3146 End: 8730
!M91037 Homo sapiens G-gamma globin and A-gamma globin genes, com... 1023 0.0
\\End of List

>GB_PR:HUMHBB U01317 Human beta globin region on chromosome 11. 3/2001
Length = 73308

Score = 1057 bits (533), Expect = 0.0
Identities = 557/557 (100%)
Strand = Plus / Plus


Query: 95    ttcttttaacgttttcagcctacagcatacagggttcatggtggcaagaagataacaaga 154
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 35596 ttcttttaacgttttcagcctacagcatacagggttcatggtggcaagaagataacaaga 35655


Query: 155   tttaaattatggccagtgactagtgctgcaagaagaacaactacctgcatttaatgggaa 214
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 35656 tttaaattatggccagtgactagtgctgcaagaagaacaactacctgcatttaatgggaa 35715


Query: 215   agcaaaatctcaggctttgagggaagttaacataggcttgattctgggtggaagcttggt 274
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 35716 agcaaaatctcaggctttgagggaagttaacataggcttgattctgggtggaagcttggt 35775


Query: 275   gtgtagttatctggaggccaggctggagctctcagctcactatgggttcatctttattgt 334
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 35776 gtgtagttatctggaggccaggctggagctctcagctcactatgggttcatctttattgt 35835



Query: 382 ggcaatccatttcggcaaagaattc 406
           |||||||||||||||||||||||||
Sbjct: 25  ggcaatccatttcggcaaagaattc 1


>GB_PAT:AR028117 AR028117 Sequence 7 from patent US 5858649. 9/1999
Length = 25

Score = 50.1 bits (25), Expect = 0.0051
Identities = 25/25 (100%)
Strand = Plus / Minus


Query: 407 acccctgaggtgcaggcttcctggc 431
           |||||||||||||||||||||||||
Sbjct: 25  acccctgaggtgcaggcttcctggc 1


Database: BlastDir/Genbank
Posted date: Fri Nov 26 14:40:53 2004
Number of letters in database: 1,999,994,996
Number of sequences in database: 1,756,970

Gapped
Lambda K H
1.37 0.711 1.31

Matrix: blastn matrix:BLOSUM62
Gap Penalties: Existence: 5, Extension: 2
Number of Sequences: 1756097
length of query: 651
length of database: 1,999,994,996
effective HSP length: 22

INTERPRETING OUTPUT


 

Bit Score

Each aligned segment pair has a normalized score expressed in bits that lets you estimate the magnitude of the search space you would have to look through before you would expect to find an HSP score as good as or better than this one by chance. If the bit score is 30, you would have to score, on average, about 1 billion independent segment pairs (2^30) to find a score this good by chance. Each additional bit doubles the size of the search space. The bit score thus represents a probability: one over two raised to this power is the probability of finding such a segment by chance. Bit scores represent a probability level for sequence comparisons that is independent of the size of the search.

The size of the search space is proportional to the product of the query sequence length times the sum of the lengths of the sequences in the database. This product, referred to as N in Altschul's publications, is multiplied by a coefficient K to get the size of the search space. When searching protein databases with protein queries, K is about 0.13. BLAST+ uses estimates of K produced before it runs by random simulation (Altschul & Gish, Methods in Enzymology 266: 460-480 (1996)).
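
The arithmetic described in this section can be sketched in a few lines of Python. The function names are hypothetical, and the sample numbers (query length, database size, and K) are taken from the example output earlier in this document purely for illustration. This is a sketch of the relationships stated above, not of how BLAST+ itself computes its statistics.

```python
import math


def segment_pairs_for_bits(bit_score):
    """Independent segment pairs you would score, on average, per chance hit."""
    return 2 ** bit_score


def search_space_size(query_length, database_letters, k):
    """Search space as described above: K times N, with N = query x database."""
    return k * query_length * database_letters


def bits_matching_search_space(query_length, database_letters, k):
    """Bit score at which about one chance hit is expected in this search space."""
    return math.log2(search_space_size(query_length, database_letters, k))


print(segment_pairs_for_bits(30))  # about 1 billion
# Values from the sample report: 651-letter query, 1,999,994,996 letters, K = 0.711.
print(round(bits_matching_search_space(651, 1_999_994_996, 0.711), 1))
```

Because each additional bit doubles the search space, a score only a few bits above the value returned by bits_matching_search_space is already unlikely to be a chance hit in a search of that size.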

E Value

There is an expectation (E value) associated with each pairwise comparison in the list and with each segment pair alignment. The number shown in the list is the number of times you would expect to observe a score or group of scores as high as the observed score purely by chance when you search a database of this size. For small values, the E value is approximately equal to the probability of observing such a score by chance.

An ideal search would find hits that go from extremely unlikely to ones whose best scores should have occurred by chance alone (that is, with probabilities approaching 1.0).

N

If you specify ungapped alignments to BLAST+, a third column of data will appear in your output under the heading N. The number in that column indicates how many HSPs were involved in computing the statistics for the sequence. If the number is greater than 1, the scores of multiple HSPs were combined to produce the result. See the ALGORITHM topic for more information.

BLAST+ Parameters

At the end of the output is a listing of parameter settings along with some trace information about the search. Some of these parameters are described in this document, but to get more complete documentation on these parameters, look at the BLAST+ release notes on the World Wide Web at http://www.ncbi.nlm.nih.gov/blast/docs/

INPUT FILES


BLAST+ accepts any number of protein or nucleic acid sequences as input. The search set is a specially formatted database. See the FormatDB+ entry in the Program Manual for information on how to create a local database that BLAST+ can search from a set of sequences in GCG format.

The function of BLAST+ depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS


PSIBLAST iteratively searches one or more protein databases for sequences similar to one or more protein query sequences. PSIBLAST is similar to BLAST+ except that it uses position-specific scoring matrices derived during the search.

NetBLAST+ searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST+ can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.

FormatDB+ combines any set of GCG sequences into a database that you can search with BLAST.

FastA+ does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA+ may be more sensitive than BLAST+.

TFastA+ does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA+ translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FastX+ does a Pearson and Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. FastX+ translates both strands of the nucleic acid sequence before performing the comparison. It is designed to answer the question, "What implied protein sequences in my nucleic acid sequence are similar to sequences in a protein database?"

TFastX+ does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences, taking frameshifts into account. It is designed to be a replacement for TFastA+, and like TFastA+, it is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

SSearch+ does a rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). This may be the most sensitive method available for similarity searches. Compared to BLAST+ and FastA+, it can be very slow.

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.

ProfileSearch and MotifSearch use a profile (derived from a set of aligned sequences) instead of a query sequence to search a collection of sequences.

HmmerSearch uses a profile hidden Markov model as a query to search a sequence database to find sequences similar to the family from which the profile HMM was built. Profile HMMs can be created using HmmerBuild.

FindPatterns+ uses a pattern described by a regular expression to search a collection of sequences. Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.


BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds.

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST+.

TFastA does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FastX does a Pearson and Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. FastX translates both strands of the nucleic sequence before performing the comparison. It is designed to answer the question, "What implied protein sequences in my nucleic acid sequence are similar to sequences in a protein database?"

TFastX does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences, taking frameshifts into account. It is designed to be a replacement for TFastA, and like TFastA, it is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

NetBLAST searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.

FindPatterns uses a pattern described by a regular expression to search a collection of sequences. Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns.

 

 

RESTRICTIONS


Because of the way BLAST+ must estimate certain statistical parameters (see the ALGORITHM topic elsewhere in this document), the number of scoring matrices available for use with BLAST+ is limited. Currently, valid choices for the -matrix parameter are BLOSUM62 (the default), BLOSUM45, BLOSUM80, PAM30, and PAM70.

Gap creation and gap extension penalties are supported in limited combinations depending upon which scoring matrix is in use. The following table shows the allowed combinations for amino acids. The first values listed are the defaults for each scoring matrix.

 Scoring Matrix    Gap Opening Penalty    Gap Extension Penalty
 --------------    -------------------    ---------------------
   BLOSUM62                 11                       1
                             7                       2
                             8                       2
                             9                       2
                            10                       1
                            12                       1

   BLOSUM80                 10                       1
                             6                       2
                             7                       2
                             8                       2
                             9                       1
                            11                       1

   BLOSUM45                 14                       2
                            10                       3
                            11                       3
                            12                       3
                            13                       3
                            12                       2
                            13                       2
                            15                       2
                            16                       1
                            17                       1
                            18                       1
                            19                       1

    PAM30                    9                       1
                             5                       2
                             6                       2
                             7                       2
                             8                       1
                            10                       1

    PAM70                   10                       1
                             6                       2
                             7                       2
                             8                       2
                             9                       1
                            11                       1
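
If you generate BLAST+ command lines from a script, it can help to check a requested matrix and gap penalty combination against the table above before launching a search. The following Python sketch simply encodes that table; the dictionary and function names are hypothetical and are not part of BLAST+.

```python
# Allowed (gap opening, gap extension) pairs per scoring matrix, copied from
# the table above. The first pair in each list is the default for that matrix.
ALLOWED_GAP_PENALTIES = {
    "BLOSUM62": [(11, 1), (7, 2), (8, 2), (9, 2), (10, 1), (12, 1)],
    "BLOSUM80": [(10, 1), (6, 2), (7, 2), (8, 2), (9, 1), (11, 1)],
    "BLOSUM45": [(14, 2), (10, 3), (11, 3), (12, 3), (13, 3), (12, 2),
                 (13, 2), (15, 2), (16, 1), (17, 1), (18, 1), (19, 1)],
    "PAM30": [(9, 1), (5, 2), (6, 2), (7, 2), (8, 1), (10, 1)],
    "PAM70": [(10, 1), (6, 2), (7, 2), (8, 2), (9, 1), (11, 1)],
}


def default_gap_penalties(matrix):
    """Return the default (opening, extension) pair for a scoring matrix."""
    return ALLOWED_GAP_PENALTIES[matrix.upper()][0]


def is_allowed(matrix, gap_open, gap_extend):
    """Return True if the combination appears in the table for this matrix."""
    return (gap_open, gap_extend) in ALLOWED_GAP_PENALTIES.get(matrix.upper(), [])


print(default_gap_penalties("blosum62"))
print(is_allowed("PAM30", 8, 1))
print(is_allowed("BLOSUM62", 10, 2))
```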

Gapped alignments are not an option when running TBLASTX.

You may choose multiple query sequences, any of which may be either nucleic acid or protein. You may also choose multiple databases against which to search; however, each of these must be of the same type.

If you used FormatDB+ to create your BLAST+ databases from any source other than a GCG-formatted database (such as from arbitrary sequence files, an MSF or RSF file, etc.), then BLAST+'s list file output won't be a functional list file. If you want to take full advantage of BLAST+'s list file output, make sure that you generate your BLAST+ databases from a GCG-formatted database. You can use DataSet+ to generate such databases from any set of sequences in GCG format.

CHOOSING SEARCH SETS


BLAST+ can search only a specially compressed form of the data. Therefore, you can search only those databases that are available in this form, and you must search them in their entirety. If you want to restrict the search to a specific set of sequences, use the program FormatDB+ to create a specially compressed database consisting of just those sequences.

To name a searchable database interactively, choose the number of the database of interest from the menu. On the command line, use a parameter like -infile2=GenBank to name the database you want to search.

If a nucleic acid and a protein database share the same name, BLAST+ cannot be sure which one of them you mean when you specify one of them using the -infile2 parameter. If the database you want to search cannot be named unambiguously with the -infile2 parameter, add either -dbnucleotideonly or -dbproteinonly to the command line.

ALGORITHM

BLAST+ is a client for an implementation of gapped BLAST (Altschul et al., Nucleic Acids Research 25: 3389-3402 (1997)), a heuristic algorithm for searching protein and nucleic acid databases for similarities to query sequences.

The description below uses BLASTP, which searches for similarities between protein queries and protein databases, as a prototype for BLAST+. However, the ideas are immediately applicable to comparisons involving conceptual translations of query sequences and databases, and extend to similarity searches between nucleic acid sequences as well.

BLAST+ compares a query sequence with a database sequence by first locating two non-overlapping sequence segments that the sequences have in common within a certain distance of each other, and then attempting to extend these putative "hits" into locally optimal alignments between the sequences being compared. A more detailed description is provided below.

PRELIMINARIES


BLAST+ uses a substitution matrix (such as the BLOSUM or PAM matrices) to assign a score to the alignment of any pair of amino acids. An aggregate score for an alignment segment can be computed by summing the scores of each amino acid pair in that segment. When given two sequences to compare, the original (ungapped) BLAST+ algorithm searches for arbitrary but equal length segments within each sequence that have a maximal aggregate score which meets or exceeds some threshold or cutoff score. BLAST+ looks for locally optimal alignments between the two sequences whose scores cannot be improved either by extending or trimming. Such locally optimal alignments are called "high-scoring segment pairs," or HSPs.

If you assume a simple protein model in which amino acids occur randomly at all positions and in proportion to the frequencies at which they are found within the database and query sequences, then you can compute a normalized score (expressed in units called bits) from the nominal score of an HSP. Such normalized scores allow direct statistical comparison of results regardless of the scoring system used (see "Generating Gapped Extensions" for a caveat to this). Furthermore, the normalized score can be used to compute an expect value, or E-value, which is the number of distinct HSPs having at least that normalized score expected to occur by chance. This theory has not been proved for gapped local alignments and their associated scores, but there are indications that it remains valid (Altschul et al., 1997).
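
The normalization can be written out using the standard Karlin-Altschul relations, S' = (lambda * S - ln K) / ln 2 for the bit score and E = m * n * 2^(-S') for the expect value, where S is the nominal score and m and n are the query and database lengths. The Python sketch below is a simplified illustration with hypothetical function names; it uses raw lengths, whereas the sample report earlier in this document shows that BLAST+ also applies an effective HSP length correction, which is omitted here.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score in bits from a nominal score and the parameters lambda, K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_length, database_letters):
    """Expected number of chance HSPs with at least this bit score.

    Simplification: raw lengths are used; no effective length correction.
    """
    return query_length * database_letters * 2.0 ** (-bits)


# Lambda and K as printed (rounded) in the sample report: 1.37 and 0.711.
print(round(bit_score(25, 1.37, 0.711), 1))
print(round(bit_score(533, 1.37, 0.711), 1))
```

With the rounded parameters printed in the sample report, nominal scores of 25 and 533 come out close to the reported 50.1 and 1057 bits; the small differences are consistent with lambda being rounded in the printout.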

TURNING HITS INTO HSPs


The central idea of the BLAST+ algorithm is that any statistically significant alignment between two sequences is likely to contain a high-scoring pair of aligned words. A word is simply a sequence segment of specified length (usually 3 for protein sequences). BLAST+ begins its comparison of a query sequence to a database by scanning the database for words that score at least the threshold score T when aligned with some word within the query sequence. Any word pair satisfying this condition is called a hit. The diagonal of a hit involving words starting at positions (x, y) of the database and query sequences is defined as x-y. The distance between two hits on the same diagonal is defined as the difference between their first coordinates.
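
The definitions of hit and diagonal can be illustrated with a brute-force Python sketch. It is deliberately naive: the scoring function is a hypothetical stand-in for a real substitution matrix such as BLOSUM62, the threshold value is arbitrary, and every database word is compared with every query word, which a production implementation would avoid.

```python
def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix: +5 match, -1 mismatch."""
    return 5 if a == b else -1


def word_score(word_a, word_b, score=toy_score):
    """Sum of the pair scores for two aligned words of equal length."""
    return sum(score(a, b) for a, b in zip(word_a, word_b))


def find_hits(database_seq, query_seq, word_len=3, threshold=11, score=toy_score):
    """Return (x, y, diagonal) for every word pair scoring at least threshold.

    x is the word start in the database sequence, y the word start in the
    query, and the diagonal is x - y as defined above.
    """
    hits = []
    for x in range(len(database_seq) - word_len + 1):
        db_word = database_seq[x:x + word_len]
        for y in range(len(query_seq) - word_len + 1):
            if word_score(db_word, query_seq[y:y + word_len], score) >= threshold:
                hits.append((x, y, x - y))
    return hits


print(find_hits("AKLMN", "KLM"))                # exact word match only
print(find_hits("AKLMN", "KLQ", threshold=9))   # one mismatch tolerated
```

Lowering the threshold T admits similar but non-identical words as hits, which raises sensitivity at the cost of more hits to examine.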

Once a hit is found, BLAST+ determines whether the hit lies within an alignment having an aggregate score high enough to be reported. It does this by extending the hit in both directions until the running alignment's score has dropped more than some quantity X below the maximum score yet attained. This extension step is quite costly, taking upwards of 90% of BLAST+'s execution time under most circumstances.
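
The extension rule can be sketched as follows. This Python example extends a hit in each direction without gaps and stops once the running score has fallen more than X below the best score seen so far. The match and mismatch values and the function names are hypothetical, chosen only to keep the example small.

```python
def toy_score(a, b):
    """Hypothetical pair score: +5 for a match, -4 for a mismatch."""
    return 5 if a == b else -4


def best_prefix(pair_scores, x_drop):
    """Walk outward from the hit; return (best gain, pairs kept).

    Stops when the running score drops more than x_drop below the maximum.
    """
    running = best = best_len = 0
    for count, value in enumerate(pair_scores, start=1):
        running += value
        if running > best:
            best, best_len = running, count
        elif best - running > x_drop:
            break
    return best, best_len


def ungapped_extend(db, query, x, y, word_len, x_drop, score=toy_score):
    """Extend the hit at db[x], query[y]; return (score, db_start, db_end)."""
    hit = sum(score(db[x + i], query[y + i]) for i in range(word_len))
    right = [score(a, b) for a, b in zip(db[x + word_len:], query[y + word_len:])]
    left = [score(a, b) for a, b in zip(reversed(db[:x]), reversed(query[:y]))]
    gain_right, len_right = best_prefix(right, x_drop)
    gain_left, len_left = best_prefix(left, x_drop)
    return hit + gain_left + gain_right, x - len_left, x + word_len + len_right


# The query ACGTACGT occurs at database positions 2..9; the hit GTA is at x=4, y=2.
print(ungapped_extend("TTACGTACGTCC", "ACGTACGT", 4, 2, word_len=3, x_drop=10))
```

A larger X lets the extension ride through a run of mismatches to reach a later stretch of matches, at the cost of examining more residue pairs per hit.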

In order to reduce the number of extensions it has to perform, BLAST+ takes advantage of the fact that an interesting HSP is typically much longer than a single hit. In fact, it is likely to contain multiple hits on the same diagonal within a relatively short distance of one another. Therefore, BLAST+ chooses a length A and invokes an ungapped extension if and only if two non-overlapping hits are found on the same diagonal within distance A of one another. (Any hit that overlaps the most recent one is ignored.)
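
The two-hit rule can be sketched in Python as follows. The sketch assumes hits are given as (x, y) word start positions, treats "within distance A" as a distance less than or equal to A, and uses hypothetical names throughout; it illustrates the rule described above rather than the actual BLAST+ code.

```python
def two_hit_triggers(hits, word_len, window):
    """Return (previous_x, current_x, diagonal) for each hit pair that would
    trigger an ungapped extension.

    hits: iterable of (x, y) word start positions (database, query).
    window: the length A from the text.
    """
    last_on_diagonal = {}  # diagonal -> x of the most recent hit kept
    triggers = []
    for x, y in sorted(hits):
        diagonal = x - y
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = x - previous
            if distance < word_len:
                # Overlaps the most recent hit on this diagonal: ignore it.
                continue
            if distance <= window:
                triggers.append((previous, x, diagonal))
        last_on_diagonal[diagonal] = x
    return triggers


hits = [(10, 5), (11, 6), (14, 9), (60, 55), (3, 20)]
print(two_hit_triggers(hits, word_len=3, window=40))
```

In this example the hit at x = 11 overlaps the one at x = 10 and is ignored, the hit at x = 14 pairs with x = 10 to trigger an extension, and the hit at x = 60 is too far from x = 14 to trigger one.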

GENERATING GAPPED EXTENSIONS


Gapped extensions allow BLAST+ to maintain its sensitivity while tolerating a much higher chance of missing any single moderately scoring HSP. However, gapped extensions take about 500 times longer to execute than ungapped extensions. Therefore, BLAST+ triggers a gapped extension for an HSP only when its score exceeds a moderate score (Sg) specifically chosen so that no more than about one gapped extension is invoked per 50 database sequences.

To generate the gapped local alignment, BLAST+ uses a standard dynamic programming algorithm for pairwise sequence alignment, which traverses the cells of a path graph whose dimensions are the lengths of the two sequences being compared, performing a fixed amount of computation for each cell. Starting from a single aligned pair of residues, called the seed, the dynamic programming proceeds both forward and backward through the path graph considering only those cells for which the optimal local alignment score falls no more than X below the best alignment score yet found. (This description is a generalization of BLAST+'s method for constructing HSPs.) The region of the path graph explored adapts to the alignment being produced.

The seed for the dynamic programming is the central residue pair of the length-11 segment of the HSP having the highest alignment score. If the HSP itself is shorter than 11 residues in length, its central pair of residues is chosen.
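
Seed selection as described above can be sketched with a sliding window over the per-position pair scores of an HSP. The Python function below is illustrative only: its name is hypothetical, ties are resolved in favor of the leftmost window, and for an HSP of even length shorter than the window it picks the left of the two central pairs.

```python
def choose_seed(pair_scores, window=11):
    """Return the index, within the HSP, of the residue pair used as the seed.

    pair_scores: score of each aligned residue pair along the ungapped HSP.
    """
    n = len(pair_scores)
    if n == 0:
        raise ValueError("an HSP must contain at least one residue pair")
    if n < window:
        # HSP shorter than the window: use its central pair.
        return (n - 1) // 2

    best_start = 0
    best_sum = current = sum(pair_scores[:window])
    for start in range(1, n - window + 1):
        # Slide the window one position to the right.
        current += pair_scores[start + window - 1] - pair_scores[start - 1]
        if current > best_sum:
            best_start, best_sum = start, current
    return best_start + window // 2


scores = [1] * 5 + [5] * 11 + [1] * 4
print(choose_seed(scores))            # centre of the high-scoring stretch
print(choose_seed([4, 4, 4, 4, 4]))   # short HSP: central pair
```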

The resulting gapped alignment is reported only if it has an E-value low enough to be of interest. For any alignment actually reported, BLAST+ performs a gapped extension that records "traceback" information (Sankoff and Kruskal, 1983) using a substantially larger X parameter than that employed during the search stage to increase the accuracy of the alignment.

Because BLAST+ produces gapped alignments only for those few database sequences likely to be related to the query, it cannot estimate the parameters necessary to compute normalized scores on the fly. Instead, BLAST+ must rely on estimates of these parameters generated beforehand by random simulation. For this reason, BLAST+ cannot use a scoring system for which no simulation has been performed and still produce accurate estimates of statistical significance.

CONSIDERATIONS

Bit Scores and the Size of the Search

Altschul has shown that for sequences that have diverged by a certain amount, there is an informativeness (or ability to discriminate between chance scores and significant scores) associated with each residue pair in the segment pair. This informativeness is the amount of information obtainable from each residue pair in a real alignment that can be used to distinguish the real alignment from a random one, and it can be expressed in bits. The sum of the information available from each residue pair in a segment is the segment pair's score in bits. Such scores are intuitively understandable as the significance of a segment pair score. To express such a score as a fraction, divide 1 by 2 raised to the number of bits in the score. For example, if a segment pair has a bit score of 16, then the corresponding fraction (1/2^16 = 1/65,536) suggests that you should see a score this high by chance about once for every 65,000 independent segment pairs you examine.

For nucleotide sequences that have not diverged, there should be an informativeness of about 2 bits per nucleotide pair. For protein sequences that have not diverged, the informativeness should be slightly over 4 bits per amino acid pair. (The informativeness per pair goes down as the sequences diverge and a segment pair score is maximally informative only when a scoring matrix appropriate to the extent of divergence between the sequences is used to calculate the score.)

The bit scores are absolute, but the expectation of finding any particular score depends on the size of the search space. The number of places where a segment pair might originate is proportional to the product of the length of the query times the sum of the lengths of all the sequences searched. This product is multiplied by a coefficient K to get the size of the search space. When searching protein databases with protein queries, K is approximately 0.13.

For a query sequence of length 300 searching a database of 12 million residues, the size of the search space would be 300 x 12,000,000 x 0.13 or 468,000,000. For a search this size, a score that only occurs once in every 65,000 potential segment pairs (that is, with a bit score of 16) would be expected to occur about 7,200 times by chance alone.
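
The arithmetic above can be reproduced with a short calculation. The function names are hypothetical, and the coefficient K = 0.13 is the approximate protein-protein value quoted in the text. Using the exact value 2^16 = 65,536 gives roughly 7,100 chance occurrences; the figure of about 7,200 in the text comes from rounding to 65,000.

```python
def search_space(query_length, database_residues, k=0.13):
    """Size of the search space: query length x database length x K."""
    return query_length * database_residues * k


def chance_fraction(bit_score):
    """Fraction of independent segment pairs expected to reach this bit score."""
    return 2.0 ** -bit_score


def expected_chance_hits(query_length, database_residues, bit_score, k=0.13):
    """Number of segment pairs expected to reach bit_score by chance alone."""
    return search_space(query_length, database_residues, k) * chance_fraction(bit_score)


space = search_space(300, 12_000_000)
hits = expected_chance_hits(300, 12_000_000, 16)
print(f"search space:  {space:,.0f}")
print(f"expected hits: {hits:,.0f}")
```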

If the database being searched is highly redundant (as it might be if it contained several hundred homologous cytochromes), then the size of the search space calculated by these methods will overestimate the size of the real search space.

Using BLAST+ for Nucleotide Searches

The detection of distant relationships between proteins is easier than between nucleotide sequences, even if the nucleotide sequences have to be translated in all six frames to make the amino acid comparison. To give a rough magnitude to this generalization, it is possible to detect similarities in proteins that have diverged by 250 substitutions per 100 residues (250 PAM units) while nucleotide similarities become obscure at distances much greater than 50 substitutions per 100 nucleotides (50 DNA PAM units). Nonetheless, when the nucleotide sequences being compared do not code for proteins, you have no alternative but to search at the nucleotide level. We suggest you consider either reducing the word size for BLAST+ from its default of 11 to perhaps six or seven, or using the FastA+ program when looking for nucleotide homologs.

Increasing Program Speed Using Multithreading

This program is multithreaded. It has the potential to run faster on a machine equipped with multiple processors because different parts of the analysis can be run in parallel on different processors. By default, the program assumes you have one processor, so the analysis is performed using one thread. You can use -processors to increase the number of threads up to the number of physical processors on the computer.

Under ideal conditions, the increase in speed is roughly linear with the number of processors used. But conditions are rarely ideal. If your computer is heavily used, competition for the processors can reduce the program's performance. In such an environment, try to run multithreaded programs during times when the load on the system is light.

As the number of threads increases, the amount of memory required increases substantially. You may need to ask your system administrator to increase the memory quota for your account if you want to use more than two threads.

Never use -processors to set the number of threads higher than the number of physical processors that the machine has -- it does not increase program performance, but instead uses up a lot of memory needlessly and makes it harder for other users on the system to get processor time. Ask your system administrator how many processors your computer has if you aren't sure.

When Blastall Produces No Output

You may see an error indicating that blastall produced no output (blastall is the name of the BLAST+ executable provided by NCBI). One of the possible causes of this condition is the presence of a file in your home directory called ".ncbirc" which contains an invalid path to the NCBI data directory. The NCBI data directory should contain seqcode.val, gc.code, BLOSUM62, and perhaps some other data files. If your home directory does indeed contain such a file, we recommend that you either rename it (the safest option), edit it to update the path to the NCBI data directory (this takes some effort, but that path is contained in the logical name "NCBI"), or delete it (the simplest option). Your system administrator should be able to help you do this if you have trouble, or you may contact support at support-us@accelrys.com.

Using PSI-TBLASTN

When searching a nucleotide database with a protein query (i.e. when using TBLASTN) you may optionally use a position-specific matrix (PSSM) instead of a standard scoring matrix. This kind of search is called PSI-TBLASTN and it is enabled when you use -restorecheckpoint to specify a checkpoint file that was created in advance using the program PSIBLAST.

A checkpoint file contains both the PSSM and the query that was used when running PSIBLAST. For this reason, when performing a PSI-TBLASTN search, you must use the exact same query sequence that was used when the checkpoint file was saved. In addition, checkpoint files are platform-specific binary files, which means that checkpoint files created with PSIBLAST on one operating system will not work correctly when running BLAST+ on a different type of system.

BLAST+ filters query sequences by default, in contrast to PSIBLAST which does not. For the sake of compatibility, when you plan to use a PSSM from PSIBLAST to perform a PSI-TBLASTN search you should specify -nofilter unless you specified -filter when you ran PSIBLAST.

SUGGESTIONS

List Size Limit

A list size that is too small to display all the significant hits is a common problem. To see the unlisted hits you must run the search again with the list size limit set high enough to include everything significant.

Segment Pair Alignment Limit

BLAST+ displays alignments of segment pairs from the top 250 sequences in the list. You can adjust this limit with -alignments. BLAST+ will not show alignments for sequences not present in the list.

Sensitivity

For nucleotide sequence comparisons, the word size defaults to 11 -- no segment pair can be scored unless it contains a perfect match of at least 11 consecutive bases. If sensitivity is much more important than selectivity, and your search cannot be done at the amino acid level, you might want to reduce the word size to seven or even six. NCBI has stated that there is only a marginal increase in sensitivity for settings smaller than this.
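
The effect of word size on sensitivity can be illustrated with a small exact-word lookup. This is a sketch of the idea only, not the index BLAST+ actually builds; the function name exact_word_hits and the two short test sequences are invented for the example. The sequences share a run of nine identical bases, so a word size of 11 finds nothing while a word size of 7 finds three hits on the same diagonal.

```python
def exact_word_hits(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where identical words start."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


# Invented sequences: a shared 9-base run and a shared 6-base run,
# separated and flanked by mismatches.
query = "AA" + "GACCATGGC" + "A" + "TCAAGT" + "AA"
subject = "TT" + "GACCATGGC" + "T" + "TCAAGT" + "TT"

print(exact_word_hits(query, subject, 11))
print(exact_word_hits(query, subject, 7))
```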

BLAST+ uses a word size of three for proteins (11 for blastn searches), which is appropriate for a wide range of searches, but you can adjust the synonym threshold T downwards to increase sensitivity at the price of speed. Read the PARAMETER REFERENCE topic for more information on -hitextthreshold and -expect.

Batch Queue

Using BLAST+ to search a large local database can take a long time. You may want to run searches in the batch queue. You can specify that this program run at a later time in the batch queue by using -batch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Section 3, Using Programs in the User's Guide.

Relationship to FastA+

For protein database searches, BLAST+ and FastA+ have similar sensitivity, although the different algorithms employed make it possible, at least in principle, for FastA+ to find things that BLAST+ misses and vice versa. For nucleotide database searches with nucleotide query sequences, FastA+ may be more sensitive, since by default BLAST+ ignores segment pairs that do not contain a perfect match of at least 11 adjacent nucleotides (22 bits). This default misses many obviously significant relationships. If you are looking for nucleotide sequence homologs that do not code for proteins (that is, if your search cannot be done at the amino acid level), we suggest you either reduce the word size to seven or use the FastA+ program instead of BLAST+.

FILTERING OUT LOW COMPLEXITY SEQUENCES

BLAST+ filters out regions of low complexity from query sequences by default. You can turn filtering off by using the -nofilter parameter. Searches against a nucleotide database with nucleotide queries (blastn) employ the DUST filter program (Hancock and Armstrong, Comput. Appl. Biosci. 10: 67-70 (1994); Tatusov and Lipman, unpublished). All other searches employ the SEG filter program (Wootton and Federhen, Computers in Chemistry 17: 149-163 (1993); Wootton and Federhen, Methods in Enzymology 266: 554-571 (1996)). For a general discussion of the role of filtering in search strategies, see Altschul et al., Nature Genetics 6: 119-129 (1994).

Short repeats and low complexity sequences, such as glutamine-rich regions, confound most database searching methods. For BLAST+, the random model against which the significance of segment pair scores is evaluated assumes that at each position, each residue has a probability of occurring which is proportional to its composition in the database as a whole. Low complexity or highly repetitive sequences are inconsistent with this assumption.

Low complexity sequence found by the filter program is replaced with the letter N in nucleotide sequences and the letter X in amino acid sequences. Here is an example of a sequence aligned to a filtered copy of itself to show which parts are filtered out:

  1 MAAKIFCLIMXXXXXXXXXXXXIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60
  1 MAAKIFCLIMLLGLSASAATASIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60
 
 61 AIAAGIXXXXXXXXXXXXXXXXXXXXXXXXXXNIRXXXXXXXXXXXXXXYSQQQQFLPFN 120
 61 AIAAGILPLSPLFLQQSSALLQQLPLVHLLAQNIRAQQLQQLVLANLAAYSQQQQFLPFN 120
 
121 QXXXXXXXXXXXXXXXXPFSQLAAAYPRQFLPFNQLAALNSHAYVXXXXXXPFSQLAAVS 180
121 QLAALNSAAYLQQQQLLPFSQLAAAYPRQFLPFNQLAALNSHAYVQQQQLLPFSQLAAVS 180
 
181 PAAFLTQQQLLPFYLHTAPNVGTXXXXXXXXXXXXXXXTNPAAFYQQPIIGGALF 235
181 PAAFLTQQQLLPFYLHTAPNVGTLLQLQQLLPFDQLALTNPAAFYQQPIIGGALF 235
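
If you need the coordinates of the filtered regions rather than a visual comparison, a few lines of code can recover them from the filtered sequence. This is a minimal sketch: the function name masked_regions is hypothetical, and it assumes that every X in the filtered sequence was introduced by the filter (a protein sequence can also contain a genuine X for an unknown residue). The input is the first 60 residues of the filtered example above.

```python
def masked_regions(filtered, mask_char="X"):
    """Return 1-based inclusive (start, end) ranges of masked residues."""
    regions = []
    start = None
    for pos, residue in enumerate(filtered, start=1):
        if residue == mask_char:
            if start is None:
                start = pos
        elif start is not None:
            regions.append((start, pos - 1))
            start = None
    if start is not None:
        # The sequence ends inside a masked run.
        regions.append((start, len(filtered)))
    return regions


filtered = "MAAKIFCLIM" + "X" * 12 + "IFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ"
print(masked_regions(filtered))
```

For nucleotide sequences, pass mask_char="N".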

AMINO ACID SCORING

BLAST+ normally uses the BLOSUM62 scoring matrix from Henikoff and Henikoff (Proc. Natl. Acad. Sci. USA 89: 10915-10919 (1992)) whenever the sequences being compared are proteins (including cases where nucleotide databases or query sequences are translated into protein sequences before comparison). You can use the other BLOSUM matrices (BLOSUM45, BLOSUM80) or the more traditional PAM matrices (PAM30, PAM70) with -matrix, for example -matrix=PAM70. Each matrix is most sensitive for finding homologs at the corresponding PAM distance. The seminal paper on this subject is Stephen Altschul's "Amino acid substitution matrices from an information theoretic perspective" (J. Mol. Biol. 219: 555-565 (1991)). If you are new to this literature, an easier place to start reading might be Altschul et al., "Issues in searching molecular sequence databases" (Nature Genetics 6: 119-129 (1994)).

NUCLEOTIDE SCORING

There is no external scoring matrix for nucleotide-nucleotide searches (that is, searches where both the query and the database are nucleotide sequences and where you have not used -tblastx). However, you can specify a nucleotide-nucleotide scoring matrix for any PAM distance by changing the match/mismatch ratio. The default ratio is +1/-3. You can change the ratio by specifying a new value for the numerator using -match or for the denominator using -mismatch.
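
To see how the match/mismatch ratio changes a raw score, consider an ungapped comparison of two equal-length nucleotide segments. The function name ungapped_score is hypothetical; the defaults mirror the +1/-3 ratio given above.

```python
def ungapped_score(a, b, match=1, mismatch=-3):
    """Raw score of two equal-length segments under a match/mismatch scheme."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


# Seven matches and one mismatch.
print(ungapped_score("ACGTACGT", "ACGTTCGT"))            # default +1/-3
print(ungapped_score("ACGTACGT", "ACGTTCGT", match=2))   # ratio +2/-3
```

Raising the match reward relative to the mismatch penalty makes the scheme more tolerant of mismatches, which suits more divergent sequences.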

ALTERNATIVE GENETIC CODES

BLAST+ normally uses the standard genetic code if either the query or the database sequences require translation. If your query comes from a system where this genetic code is inappropriate, you can select any of these alternative codes by the numbers given in the following table:

 
     1 Standard or Universal
     2 Vertebrate Mitochondrial
     3 Yeast Mitochondrial
     4 Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma
     5 Invertebrate Mitochondrial
     6 Ciliate Macronuclear
     7 [Do not use this index]
     8 [Do not use this index]
     9 Echinoderm Mitochondrial
    10 Alternative Ciliate (Euplotid) Macronuclear
    11 Eubacterial
    12 Alternative Yeast
    13 Ascidian Mitochondrial
    14 Flatworm Mitochondrial
    15 Alternate Ciliate (Blepharisma) Nuclear
    16 Chlorophycean Mitochondrial
    21 Trematode Mitochondrial

 

You can specify the genetic code for the query and the database independently. Use -translate=2 to tell BLAST+ to use the vertebrate mitochondrial code to translate the query. Use -dbtranslate=3 to tell BLAST+ to use the yeast mitochondrial code to translate the database. (Note that most of the genes in GenBank will be translated inappropriately if you select a nonstandard genetic code for database translation.)
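
The practical effect of choosing a different genetic code can be seen by translating the same DNA with codes 1 and 2. The sketch below is independent of BLAST+: the function names are hypothetical, and the two 64-letter strings are the standard and vertebrate mitochondrial codes written in TCAG codon order, which you may want to verify against the gc.code data file at your site.

```python
BASES = "TCAG"

# Amino acids for the 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
TABLES = {
    1: ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
        "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG"),   # standard
    2: ("FFLLSSSSYY**CCWW" "LLLLPPPPHHQQRRRR"
        "IIMMTTTTNNKKSS**" "VVVVAAAADDEEGGGG"),   # vertebrate mitochondrial
}


def codon_table(code):
    """Build a codon -> amino acid dictionary for the given code number."""
    codons = (a + b + c for a in BASES for b in BASES for c in BASES)
    return dict(zip(codons, TABLES[code]))


def translate(dna, code=1):
    """Translate frame 1 of a DNA string; unknown codons become X."""
    table = codon_table(code)
    dna = dna.upper()
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


dna = "ATGATATGAAGA"
print(translate(dna, code=1))
print(translate(dna, code=2))
```

The codons ATA, TGA and AGA are read differently by the two codes, so the same twelve bases yield different peptides. This is why translating a whole database with a nonstandard code mistranslates most of its genes.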

NETWORK CONSIDERATIONS

BLAST+ searches only local databases. See the NetBlast+ entry in the Program Manual for information on how to run BLAST+ searches remotely. However, if you need to run BLAST+ on a remote server within your intranet, follow the procedure given below.

Re-directing BLAST services

Introduction:

 

BLAST+ has an option called -config that lets you specify your own configuration file. You can use this feature to redirect BLAST+ searches and to set up private searchable databases.

 

Redirect-able BLAST

 

Let us assume that a GCG administrator needs to set up BLAST+ to redirect jobs to a remote server.

 

The steps are as follows:

Create a new configuration file called remoteblast.conf. You can copy $GCGROOT/etc/blast/blastall.conf to this file to create one quickly.

Edit the file so that:

command = (ssh/rsh) -l <login_name for remote machine> <remote_machine_name> <path for blast exe on the remote machine>

blastdir = <path for blast databases on remote machine>

 

# To access the list of databases present on the remote machine, set the blastconfig parameter.

blastconfig = <path for localdbs.conf>

 

# (This file lists the databases available on the remote machine. The file must be named localdbs.conf.)

 

Sample file: remoteblast.conf

 

plugin = libBlastAll.so # gcgblast plugin (required)

command = ssh -l gcg11 crunch.bang.accelrys.com /gcg11/blast-2.2.10-sparc64-solaris/bin/blastall

blastdir = /u/gcg11/gcgdata/gcgblast

blastconfig = /u/gcg11/myblast #(dir to look for localdbs.conf)

 

After this, we manually list the databases installed on the remote machine in /u/gcg11/myblast/localdbs.conf.

Once this is set up, we can redirect BLAST+ to the remote machine as shown below:

 

blast+ ../ggamma.seq -config=/u/gcg11/remoteblast.conf -in2 htg

 

Or we can list the databases set up on the remote machine. (Note that this list is not discovered dynamically; it is taken from /u/gcg11/myblast/localdbs.conf.)

 

blast+ ../ggamma.seq -config=/u/gcg11/remoteblast.conf -dbr

 

If the GCG administrator decides that BLAST+ should always be redirected to another server, the same changes can be made in $GCGROOT/etc/blast/blastall.conf.

 

Note: To make sure that ssh/rsh does not wait at the password prompt, refer to the ssh/rsh documentation on how to set up password-less authentication.

 

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -check to view the summary below and to specify parameters before the program executes. In the syntax summary below, square brackets ([ and ]) enclose parameter values that are optional. For each program parameter, square brackets enclose the type of parameter value specified, the default parameter value, and any aliases (shortened forms of the parameter name). Programs with a plus in the name use either the full parameter name or a specified alias. If "Type" is "Boolean", then the presence of the parameter on the command line indicates a true condition. A false condition must be stated as parameter=false.

 

Minimal Syntax: % blast+ [-infile=]value [-infile2=]value [-outfile=]value -Default

 

 

Minimal Parameters (case-insensitive):

 

-infile [Type: List / Default: EMPTY / Aliases: infile1 in]

Input file specification

 

Prompted Parameters (case-insensitive):

 

-begin [Type: Integer / Default: '1' / Aliases: beg]

First base of interest in each query sequence.

-end [Type: Integer / Default: '-1']

Last base of interest in each query sequence.

-infile2 [Type: List / Default: EMPTY / Aliases: in2 db]

Specifies database to search.

-expect [Type: Double / Default: '10' / Aliases: exp]

Ignores scores that would occur by chance more than n times.

-listsize [Type: Integer / Default: '500' / Aliases: lis list]

Sets maximum number of sequences listed in the output.

-outfile [Type: OutFile / Default: '<sequence_name>.blast+' / Aliases: out outfile1]

Names the output file. '-' for stdout.

 

Optional Parameters (case-insensitive):

 

-check [Type: Boolean / Default: 'false' / Aliases: che help]

Prints out this usage message.

-default [Type: Boolean / Default: 'false' / Aliases: d def]

Specifies that sensible default values be used for all parameters where possible.

-documentation [Type: Boolean / Default: 'true' / Aliases: doc]

Prints banner at program startup.

-quiet [Type: Boolean / Default: 'false' / Aliases: qui]

Tells application to print only a minimal amount of information.

-doclines [Type: Integer / Default: EMPTY / Aliases: docl]

Specifies number of documentation lines to copy.

-config [Type: String / Default: '$GCGROOT/etc/blast/blastall.conf']

Blast configuration file for the plugin.

-format [Type: String / Default: EMPTY / Aliases: fmt]

Output format. Valid values are:

list: Sequence list file of hits

native: Native BLAST report

xml: BLAST XML

-xml [Type: Boolean / Default: 'false']

Output BLAST XML format (same as -format=xml)

-native [Type: Boolean / Default: 'false']

Output native BLAST report format (same as -format=native)

-data [Type: String / Default: EMPTY / Aliases: dat]

List for local blast databases.

-plugin [Type: String / Default: 'libBlastAll.so']

Blast plugin.

-algorithm [Type: String / Default: EMPTY / Aliases: alg program prog]

Set Blast algorithm to blastn, blastp, blastx, tblastn or tblastx.

-tblastx [Type: Boolean / Default: 'false']

If query and database are both nucleotide, translates both and does protein comparisons

-dbreport [Type: Boolean / Default: 'false' / Aliases: dbr dbs listdb listdbs dblist]

Lists valid databases then exits.

-dbnucleotideonly [Type: Boolean / Default: 'false' / Aliases: dbn]

Searches only nucleic databases.

-dbproteinonly [Type: Boolean / Default: 'false' / Aliases: dbp]

Searches only protein databases.

-append [Type: List / Default: EMPTY]

Appends string to pass-through command line.

-alignments [Type: Integer / Default: '250' / Aliases: ali align]

Sets number of sequences for which to show alignments.

-chunksize [Type: Integer / Default: '50' / Aliases: chunk]

Sets number of sequences to submit in parallel, large values may run out of memory.

-wordsize [Type: Integer / Default: '0' / Aliases: word]

Sets word size (0 for default).

-match [Type: Integer / Default: '1' / Aliases: mat]

Sets nucleotide match reward.

-mismatch [Type: Integer / Default: '-3' / Aliases: mis]

Sets nucleotide mismatch penalty.

-matrix [Type: String / Default: 'BLOSUM62' / Aliases: matr]

Assigns the scoring matrix for proteins.

-gapweight [Type: Integer / Default: '0' / Aliases: gap]

Sets gap creation penalty (0 for default)

-lengthweight [Type: Integer / Default: '0' / Aliases: len]

Sets gap extension penalty (0 for default)

-hitextthreshold [Type: Integer / Default: '0' / Aliases: hitextthresh hitext]

Sets minimum score to extend hits (0 for default)

-filter [Type: Boolean / Default: 'true' / Aliases: fil]

Enables filtering of low complexity segments out of query sequences.

-translate [Type: Integer / Default: '1' / Aliases: trans]

Names genetic code for translating query.

-dbtranslate [Type: Integer / Default: '1' / Aliases: dbtrans]

Names genetic code for translating database.

-effdbsize [Type: Integer / Default: '0' / Aliases: eff]

Sets effective database size (0 for real size)

-gaps [Type: Boolean / Default: 'true']

Enables gapped alignments.

-xdropoff [Type: Integer / Default: '0' / Aliases: xdr]

Sets X dropoff value for gapped alignments (0 for default)

-lowercasemask [Type: Boolean / Default: 'false' / Aliases: low lower]

Filters lower case characters in query sequence.

-hitwindow [Type: Integer / Default: '40' / Aliases: hitw]

Sets multiple hits window size (0 for single hit algorithm)

-besthits [Type: Integer / Default: '0' / Aliases: bes best]

Sets number of best hits to keep from a region (off by default, if used a value of 100 is recommended)

-megablast [Type: Boolean / Default: 'false' / Aliases: mega]

Uses MegaBLAST algorithm for search.

-processors [Type: Integer / Default: '1' / Aliases: proc]

Sets the number of processors to use.

-batch [Type: Boolean / Default: 'false']

Submits the job to a batch queue.

CITING BLAST+

The original paper describing BLAST+ is Altschul, Stephen F., Gish, Warren, Miller, Webb, Myers, Eugene W., and Lipman, David J. (1990). Basic local alignment search tool. J. Mol. Biol. 215; 403-410. Gapped BLAST+ is described in Altschul, Stephen F., Madden, Thomas L., Schaffer, Alejandro A., Zhang, Jinghui, Zhang, Zheng, Miller, Webb, and Lipman, David J. (1997). Gapped BLAST+ and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17); 3389-3402.

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -data=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

Default behaviour: BLAST+ reads the file localdbs.conf in the $GCGROOT/etc/blast folder to get the list of available local databases.

If you have sequences of local interest that you would like to search with BLAST+, read the documentation for FormatDB+ to see how to create local BLAST+ searchable databases, then fetch the file blast.sdbs and add the name of the local search set so that it appears in the menu (displayed with the -dbr option).

Custom behaviour: BLAST+ also accepts a local data file, blast.sdbs (site-specific databases). You can specify this data file on the command line with the -data option. This is similar to the -data2 option of Wisconsin Package 10.3, with which users can supply site-specific (private) databases that are not part of the global GCG environment and are local to the user's system. For more information on "Configuring Personal Databases to work with BLAST+", refer to the GCG Software Configuration document.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. Shortened forms of the parameter name, aliases, are shown, separated by commas.

Following some of the optional parameters described below is a letter or short expression in parentheses. These are the names of the corresponding parameters at the bottom of your BLAST+ output.

 

-infile, -infile1, -in

 

Input file specification.

 

-begin, -beg

 

First base of interest in each query sequence.

 

-end

 

Last base of interest in each query sequence.

 

-infile2, -in2, -db

 

Specifies database to search.

 

-expect, -exp

 

Ignores scores that would occur by chance more than n times.

 

-listsize, -lis, -list

 

Sets maximum number of sequences listed in the output.

 

-outfile, -out, -outfile1

Names the output file. '-' for stdout.

 

-expect=10.0, -exp

This parameter, for which there is a prompt if you don't set it on the command line, lets you influence the number of hits in your output having scores that would be expected to have occurred by chance alone. There is nothing to prevent many biologically significant but statistically insignificant segment pairs from being screened out, so you may sometimes want to increase this parameter in order to have an opportunity to see them.

-listsize=500, -lis

By default, the BLAST+ output list file will contain 500 sequences (or fragments thereof, depending upon the state of -fragments), even if more than 500 sequences had scores above the cutoff score. The list is sorted in order of increasing probability, that is, with the most significant sequences first. Use -listsize to change the number of sequences in your output to any value between 0 (for blastall's program defaults) and 1000.

-processors=2, -proc

Tells the program to use 2 threads for the database search on a multiprocessor computer. Check with your system manager for the number of processors available at your site. Never set the number of processors greater than what you have available.

-tblastx

When searching a nucleotide sequence database with a nucleotide query sequence, this specifies that tblastx should be run instead of blastn. tblastx translates the query and every sequence in the database and examines all pairwise combinations to find similarities at the amino acid level.

The search set menu can scroll off your screen if it contains all of the searchable databases supported locally on your computer. The next two parameters can reduce the size of that menu.

 

-dbnucleotideonly, -dbn

Confines the menu to search sets containing nucleotide sequences.

-dbproteinonly, -dbp

Confines the menu to search sets containing protein sequences.

-wordsize=0, -word

Sets the size of the short regions of similarity between sequences that BLAST+ initially searches for. If -wordsize=0, BLAST+ uses the default values: 11 for blastn and 3 for the other programs. Smaller word sizes result in a more sensitive search at the expense of a longer search time.

-match=1

Sets the nucleotide match reward to 1 (blastn only).

-mismatch=-3, -mis

Sets the nucleotide mismatch penalty to -3 (blastn only).

-matrix=BLOSUM62, -matr

Sets the amino acid substitution matrix to use. BLAST+ normally uses the BLOSUM62 amino acid substitution matrix from Henikoff and Henikoff for protein sequence comparisons (including all cases where nucleotide database or query sequences are translated before comparison). Other valid options are BLOSUM45, BLOSUM80, PAM30, and PAM70.

-gapweight=11, -gap

Sets the penalty for adding a gap to the alignment. See the RESTRICTIONS topic for more information about setting the gap opening penalty.

-lengthweight=1, -len

Sets the penalty for lengthening an existing gap in the alignment. See the RESTRICTIONS topic for more information about setting the gap extension penalty.

-hitextthreshold=0, -hitextthresh

Sets the threshold for extending hits using the two-hit method. Words with scores at least this high can be extended as ungapped alignments.

-translate=1, -trans

Sets a genetic code to use for the translation of the query sequence. BLAST+ uses the standard ("universal") genetic code unless you specify the number of one of the alternative codes listed under the topic ALTERNATIVE GENETIC CODES.

-dbtranslate=1, -dbtrans

Sets a genetic code to use for the translation of the database sequences. If you are searching for proteins from a system that doesn't use the standard ("universal") genetic code, you can select a more appropriate code from those listed under the topic ALTERNATIVE GENETIC CODES. Note that most of the genes in the nucleotide databases will be translated incorrectly if you select a nonstandard genetic code.

-effdbsize=0, -eff

Sets the effective database size. A value of 0 selects the program default.

-nofragments, -nofra

Suppresses the appearance of begin and end ranges on each output list file entry based on the alignment between the entry and the query sequence.

-alignments=250, -ali

By default, BLAST+ displays the alignments of HSPs from the best 250 sequences in the list. Use -alignments to change the number of sequences for which alignments are shown in your output to any value between 0 and 1000.

-xdropoff=0 [X2], -xdr

Sets the X2 dropoff value for gapped alignments (in bits). Gapped alignments are extended until the score drops below this value. This limits the (computationally expensive) extension of hits. Use -xdropoff=0 for default behavior.

-megablast, -mega

Causes BLAST+ to use Miller's greedy algorithm to align sequences after performing the initial ungapped extension. You can use -megablast only when the query sequence and the database are both nucleotide.

This algorithm is optimized for aligning sequences that differ slightly as a result of sequencing or other similar errors and it can be considerably (up to 10 times) faster than BLASTN. As such, it is particularly useful for comparing large sequences. The minimum word size that is allowed with Megablast is larger than the standard default, which can reduce the sensitivity for relatively short sequences. See Z. Zhang et al., Journal of Computational Biology 7: 203-214 (2000) for more information regarding the greedy sequence alignment algorithm.

For the most part, any parameters that can be used with BLASTN can be used with -megablast. However, unlike BLASTN, options that affect gapped extensions (e.g. -xdropoff) are ignored.

-lowercasemask, -low

Masks lowercase characters in the query sequence by replacing them with the letter X during the search. Masked residues are ignored when calculating scores. This is one of the few cases in GCG where the uppercase and lowercase characters in input sequences can produce different results.
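
To preview which residues -lowercasemask would hide, you can apply the same substitution yourself. This is a sketch of the described behavior only; the function name mask_lowercase is hypothetical and the code does not call BLAST+.

```python
def mask_lowercase(sequence, mask_char="X"):
    """Replace every lowercase character with the mask character."""
    return "".join(mask_char if c.islower() else c for c in sequence)


print(mask_lowercase("MKVLaaaaGHT"))
```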

-hitwindow=40, -hitw

Sets the maximum distance allowed for two non-overlapping sequence segments on the same diagonal, when looking for matches between the query and a database sequence.

-besthits=0, -bes

Sets the maximum number of hits from a given region of the query sequence. Only the highest scoring hits from the region are kept. With -besthits=0, the maximum number is set internally. This parameter can be used to counter the tendency of highly abundant, conserved regions to be so prevalent in the output that the detection of other domains would be precluded.

-HTML

Uses HTML format for output. This parameter has no effect if you use -VIEW=7 (XML output) or -VIEW=8 (tab-delimited output).

-native

Produces output in unmodified BLAST+2 format.

-append="string", -app

The GCG implementation of BLAST+ is a "wrapper" program. After collecting your input parameters, the wrapper calls blastall, the locally built implementation of BLAST+ from NCBI. If you are familiar with the original blastall interface, you can pass parameters to it directly using this parameter. Please call us if there are additional BLAST+ parameters that you would like to look more like GCG parameters.
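The wrapper idea can be sketched as follows. This is not the GCG wrapper's actual code; it is a hypothetical illustration of how a wrapper might assemble a blastall argument list and add the contents of an appended string at the end. The function name build_blastall_command and the example database and file names are assumptions; -p (program), -d (database), -i (query file), and -e (expectation value) are standard blastall options.

```python
import shlex


def build_blastall_command(program, database, query_file, append=""):
    """Assemble a blastall argument list, adding any user-appended options."""
    command = ["blastall", "-p", program, "-d", database, "-i", query_file]
    # Split the appended string the way a shell would, honoring quotes.
    command.extend(shlex.split(append))
    return command


# Hypothetical usage: pass an expectation cutoff straight through to blastall.
args = build_blastall_command("blastn", "genbank", "query.fa", "-e 0.001")
```

Building the command as a list, rather than as one string, avoids quoting problems if the list is later handed to a process-launching call.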

-batch, -bat

Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

-dbreport, -dbr

Lists the valid databases, then exits without searching.

NEW FUNCTIONS

-check, -che, -help

Prints out this usage message.

-default, -d, -def

Specifies that sensible default values be used for all parameters wherever possible.

-documentation, -doc

Prints banner at program startup.

-quiet, -qui

This parameter is not supported.

-data

Lists the local BLAST databases by reading the blast.sdbs file specified on the command line.

-docline, -docl

Specifies the number of documentation lines to copy.

-config

Specifies the BLAST+ configuration file for the plugin. You can choose which configuration file is used for the BLAST analysis.

-defaultnucdb

The default nucleic acid database is GenBank.

-defaultprotdb

The default protein database is UniProt.

-format, -fmt

Sets the output format. Valid values are:

list: Sequence list file of hits

native: Native BLAST+ report

xml: BLAST+ XML

-xml

Writes output in BLAST+ XML format (same as -format=xml).

-plugin

BLAST+ plugin.

-algorithm, -alg

Specifies the BLAST+ algorithm: blastn, blastp, blastx, tblastn, or tblastx.
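The five algorithm names differ in the type of query and database they compare. The lookup below is a small sketch of that relationship using the standard BLAST program definitions; the table name ALGORITHMS and the function candidate_algorithms are hypothetical and are not part of BLAST+.

```python
ALGORITHMS = {
    # name: (query type, database type)
    "blastn": ("nucleotide", "nucleotide"),
    "blastp": ("protein", "protein"),
    "blastx": ("nucleotide", "protein"),      # query is translated
    "tblastn": ("protein", "nucleotide"),     # database is translated
    "tblastx": ("nucleotide", "nucleotide"),  # both are translated
}


def candidate_algorithms(query_type, db_type):
    """Return the algorithm names that accept this query/database combination."""
    return [
        name
        for name, types in ALGORITHMS.items()
        if types == (query_type, db_type)
    ]
```

A nucleotide query against a nucleotide database has two candidates (blastn and tblastx), which is one reason to name the algorithm explicitly.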

-chunksize, -chunk

Sets number of sequences to submit in parallel, large values.
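As a rough sketch of what a chunk size means, the following splits a list of query sequences into groups of at most chunk_size. How BLAST+ actually submits each group is not described here, so the function name chunked and the example sequence names are assumptions made for illustration only.

```python
def chunked(items, chunk_size):
    """Split `items` into consecutive lists of at most `chunk_size` elements."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


# Hypothetical usage: five queries submitted two at a time.
groups = chunked(["seq1", "seq2", "seq3", "seq4", "seq5"], 2)
```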

-filter, -f

Filters low-complexity segments out of the query.

-gaps

Enables gapped alignments.

The release notes for BLAST+ 2.0 can be found at

http://www.ncbi.nlm.nih.gov/blast/docs/

Printed: June 1, 2005 14:46



Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.

Licenses and Trademarks: Discovery Studio, SeqLab, SeqWeb, SeqMerge, GCG, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

www.accelrys.com/bio