FASTA+

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

FUNCTION

FastA+ does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA+ may be more sensitive than BLAST+.

DESCRIPTION

Advantages of Plus “+” Programs:

Plus programs can read sequences in a variety of native formats, such as GCG RSF, GCG SSF, GCG MSF, GenBank, EMBL, FastA, SwissProt, PIR, and BSML, without conversion.

Plus programs remove the sequence length restriction of 350,000 bp.

If you do not need these features and wish to have more interactivity, you might wish to seek out and run the original program version.

FastA+ uses the method of Pearson and Lipman (Proc. Natl. Acad. Sci. USA 85; 2444-2448 (1988)) to search for similarities between one sequence (the query) and any group of sequences of the same type (nucleic acid or protein) as the query sequence.

In the first step of this search, the comparison can be viewed as a set of dot plots, with the query as the vertical sequence and the group of sequences to which the query is being compared as the different horizontal sequences. This first step finds the registers of comparison (diagonals) having the largest number of short perfect matches (words) for each comparison. In the second step, these "best" regions are rescored using a scoring matrix that allows conservative replacements, ambiguity symbols, and runs of identities shorter than the size of a word. In the third step, the program checks to see if some of these initial highest-scoring diagonals can be joined together. Finally, the search set sequences with the highest scores are aligned to the query sequence for display.

What is a Word?

A word is any short sequence (n-mer or k-tuple) where you have set n to some small integer less than or equal to six. The word GGATGG is one of the 4,096 possible words of length six that can be created from an alphabet consisting of the four letters G, A, T, and C. The word QL is one of the 400 possible words of length two that you can make with the 20 letters of the amino acid alphabet.
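The word counts quoted above can be checked with a few lines of Python. This is a minimal sketch for illustration only, not part of FastA+; the function names all_words and words_in are hypothetical.

```python
from itertools import product

NUCLEOTIDES = "GATC"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def all_words(alphabet: str, n: int) -> list[str]:
    """Return every possible word of length n over the given alphabet."""
    return ["".join(letters) for letters in product(alphabet, repeat=n)]


def words_in(sequence: str, n: int) -> list[str]:
    """Return the overlapping words of length n found in a sequence, in order."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(len(all_words(NUCLEOTIDES, 6)))  # number of nucleotide words of length six
print(len(all_words(AMINO_ACIDS, 2)))  # number of amino acid words of length two
print(words_in("GGATGGA", 6))          # the two overlapping six-letter words
```

A sequence of length L contains L - n + 1 overlapping words of length n, which is why short queries offer few words to match.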

EXAMPLE

Here is a session using FastA+ to identify sequences in the PIR protein sequence database that are similar to a human globin protein sequence:

 FastA+ does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA+ may be more sensitive than BLAST.

fasta+ with what query sequence(s) ? ggamma.pep

Begin (* 1 *) ?

End (-1 for entire sequence) (* -1 *) ?

Enter value for search set (*Default DB*) ? PIR:*

What should I call the output file (* <sequence_name>.<program_name> *) ?

# $GCGROOT/bin/fasta34_native -O /var/tmp/bslskBAAjdaWHp.tmp -E 2.0 -b 10 -T 1 /var/tmp/bslskDAAldaWHp.fa "pir:*"

 FASTA searches a protein or DNA sequence data bank

 version 3.4t21 May 14, 2003

Please cite:

 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

Query library /var/tmp/bslskDAAldaWHp.fa vs pir:* library

searching pir:* library

  1>>>GGAMMA.PEP TRANSLATE of: ggamma.seq check: 7694 from: 1 to: 1700 - 566 aa

 vs  pir:* library

    96233341 residues in 283366 sequences

 statistics sampled from 60000 to 282777 sequences

  Expectation_n fit: rho(ln(x))= 4.4293+/-0.000181; mu= 14.2640+/- 0.010

 mean_var=61.1282+/-12.006, 0's: 125 Z-trim: 241  B-trim: 2027 in 1/63

 Lambda= 0.164041

 Kolmogorov-Smirnov  statistic: 0.0235 (N=29) at  50

FASTA (3.45 Mar 2002) function [optimized, BL50 matrix (15:-5)] ktup: 2

 join: 37, opt: 25, open/ext: -10/-2, width:  16

OUTPUT

The output from FastA+ is a list file, and is suitable for input to any GCG program that allows indirect file specifications. (For information about indirect file specification, see Section 2, Using Sequence Files and Databases of the User's Guide.)

Here is some of the output file:

       opt    E()
< 20    617      0:==
  22      0      0:          one = represents 466 library sequences
  24      7      0:=
  26     16      6:*
  28     53     64:*
  30    275    390:*
  32   1148   1506:===*
  34   3617   4085:========*
  36   8145   8390:==================*
  38  14547  13865:=============================*==
  40  20665  19341:=========================================*===
  42  25393  23642:==================================================*====
  44  27782  26079:=======================================================*====
  46  27909  26562:========================================================*===
  48  26282  25430:======================================================*==
  50  23364  23205:=================================================*=
  52  20305  20401:===========================================*
  54  16832  17426:=====================================*
  56  13835  14556:============================== *
  58  11149  11950:======================== *
  60   9046   9681:====================*
  62   7092   7761:================*
  64   5335   6172:============ *
  66   4351   4878:==========*
  68   3473   3837:========*
  70   2654   3007:======*
  72   2149   2350:=====*
  74   1607   1832:===*
  76   1254   1426:===*
  78    965   1108:==*
  80    685    861:=*
  82    542    658:=*
  84    421    521:=*
  86    298    403:*
  88    236    312:*         inset = represents 12 library sequences
  90    155    242:*
  92    124    187:*         :=========== *
  94     97    145:*         :========= *
  96     74    112:*         :======= *
  98     72     87:*         :====== *
 100     49     67:*         :=====*
 102     41     52:*         :====*
 104     37     40:*         :===*
 106     25     31:*         :==*
 108     13     24:*         :=*
 110      9     19:*         :=*
 112      6     14:*         :=*
 114      8     11:*         :*
 116      7      9:*         :*
 118      6      7:*         :*
>120    594      5:*=        :*=======================================

96233341 residues in 283366 sequences

statistics sampled from 60000 to 282777 sequences

Expectation_n fit: rho(ln(x))= 4.4293+/-0.000181; mu= 14.2640+/- 0.010

mean_var=61.1282+/-12.006, 0's: 125 Z-trim: 241 B-trim: 2027 in 1/63

Lambda= 0.164041

Kolmogorov-Smirnov statistic: 0.0235 (N=29) at 50

FASTA (3.45 Mar 2002) function [optimized, BL50 matrix (15:-5)] ktup: 2

join: 37, opt: 25, open/ext: -10/-2, width: 16

The best scores are: opt bits E(283366)

pir:HGHUG Begin: 105 End: 147

! hemoglobin gamma-G chain [validated] - human... 269 464 269 347.5 4.3e-12

pir:I37025 Begin: 105 End: 147

! hemoglobin gamma-G chain - gorilla... 269 464 269 347.5 4.3e-12

pir:HGCZG Begin: 105 End: 147

! hemoglobin gamma-G chain - chimpanzee... 269 464 269 347.5 4.3e-12

pir:A27800 Begin: 105 End: 147

! hemoglobin gamma-1 chain - orangutan... 268 463 268 346.3 5e-12

pir:I78580 Begin: 2 End: 43

! hemoglobin gamma-G - human (fragment)... 260 260 260 343 7.6e-12

pir:I58221 Begin: 2 End: 43

! hemoglobin gamma-A chain - human (fragment)... 259 259 259 341.7 9e-12

pir:HGBAY Begin: 104 End: 146

! hemoglobin gamma chain - yellow baboon... 263 451 263 339.9 1.1e-11

pir:HGMQJ Begin: 104 End: 146

! hemoglobin gamma chain - Japanese macaque... 263 451 263 339.9 1.1e-11

pir:I37036 Begin: 105 End: 147

! hemoglobin gamma-2 chain - common gibbon... 263 453 263 339.9 1.1e-11

pir:I37035 Begin: 105 End: 147

! hemoglobin gamma-1 chain - common gibbon... 263 458 263 339.9 1.1e-11

\\End of List

>>pir:HGHUG hemoglobin gamma-G chain [validated] - human (147 aa)

initn: 464 init1: 269 opt: 269 Z-score: 347.5 bits: 72.2 E(): 4.3e-12

Smith-Waterman score: 269; 97.674% identity (97.674% ungapped) in 43 aa overlap (467-509:105-147)

440 450 460 470 480 490

GGAMMA GWKLGV*LSGGQAGALSSLWVHLYCLLSSQQLLGNVLVTVLAIHFGKEFTPEVQASWQKM

.:::::::::::::::::::::::::::::

HGHUG AIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKM

80 90 100 110 120 130

500 510 520 530 540 550

GGAMMA VTGVASALSSRYH*AHCP*CRAFKDRLYSASNTNNKSILLRDHTWLSSVLFFMSF*IYEP

:::::::::::::

HGHUG VTGVASALSSRYH

140

>>pir:I37025 hemoglobin gamma-G chain - gorilla (147 aa)

initn: 464 init1: 269 opt: 269 Z-score: 347.5 bits: 72.2 E(): 4.3e-12

Smith-Waterman score: 269; 97.674% identity (97.674% ungapped) in 43 aa overlap (467-509:105-147)

440 450 460 470 480 490

GGAMMA GWKLGV*LSGGQAGALSSLWVHLYCLLSSQQLLGNVLVTVLAIHFGKEFTPEVQASWQKM

.:::::::::::::::::::::::::::::

I37025 AIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKM

80 90 100 110 120 130

500 510 520 530 540 550

GGAMMA VTGVASALSSRYH*AHCP*CRAFKDRLYSASNTNNKSILLRDHTWLSSVLFFMSF*IYEP

:::::::::::::

I37025 VTGVASALSSRYH

140

>>pir:HGCZG hemoglobin gamma-G chain - chimpanzee (147 aa)

initn: 464 init1: 269 opt: 269 Z-score: 347.5 bits: 72.2 E(): 4.3e-12

Smith-Waterman score: 269; 97.674% identity (97.674% ungapped) in 43 aa overlap (467-509:105-147)

440 450 460 470 480 490

GGAMMA GWKLGV*LSGGQAGALSSLWVHLYCLLSSQQLLGNVLVTVLAIHFGKEFTPEVQASWQKM

.:::::::::::::::::::::::::::::

HGCZG AIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKM

80 90 100 110 120 130

500 510 520 530 540 550

GGAMMA VTGVASALSSRYH*AHCP*CRAFKDRLYSASNTNNKSILLRDHTWLSSVLFFMSF*IYEP

:::::::::::::

HGCZG VTGVASALSSRYH

140

>>pir:A27800 hemoglobin gamma-1 chain - orangutan (147 aa)

initn: 463 init1: 268 opt: 268 Z-score: 346.3 bits: 72.0 E(): 5e-12

Smith-Waterman score: 268; 97.674% identity (97.674% ungapped) in 43 aa overlap (467-509:105-147)

440 450 460 470 480 490

GGAMMA GWKLGV*LSGGQAGALSSLWVHLYCLLSSQQLLGNVLVTVLAIHFGKEFTPEVQASWQKM

.:::::::::::::::::::::::::::::

A27800 AIKNLDDLKGTFAQLSELHCDKLHVDPENFRLLGNVLVTVLAIHFGKEFTPEVQASWQKM

80 90 100 110 120 130

500 510 520 530 540 550

GGAMMA VTGVASALSSRYH*AHCP*CRAFKDRLYSASNTNNKSILLRDHTWLSSVLFFMSF*IYEP

:::::::::::::

A27800 VTGVASALSSRYH

140

>>pir:I78580 hemoglobin gamma-G - human (fragment) (43 aa)

initn: 260 init1: 260 opt: 260 Z-score: 343.0 bits: 69.6 E(): 7.6e-12

Smith-Waterman score: 260; 97.619% identity (97.619% ungapped) in 42 aa overlap (468-509:2-43)

440 450 460 470 480 490

GGAMMA WKLGV*LSGGQAGALSSLWVHLYCLLSSQQLLGNVLVTVLAIHFGKEFTPEVQASWQKMV

:::::::::::::::::::: :::::::::

I78580 XLLGNVLVTVLAIHFGKEFTPAVQASWQKMV

10 20 30

500 510 520 530 540 550

GGAMMA TGVASALSSRYH*AHCP*CRAFKDRLYSASNTNNKSILLRDHTWLSSVLFFMSF*IYEPQ

::::::::::::

I78580 TGVASALSSRYH

>>pir:I58221 hemoglobin gamma-A chain - human (fragment) (43 aa)

initn: 259 init1: 259 opt: 259 Z-score: 341.7 bits: 69.4 E(): 9e-12

Smith-Waterman score: 259; 97.619% identity (97.619% ungapped) in 42 aa overlap (468-509:2-43)

440 450 460 470 480 490

GGAMMA WKLGV*LSGGQAGALSSLWVHLYCLLSSQQLLGNVLVTVLAIHFGKEFTPEVQASWQKMV

::::::::::::::::::::::::::::::

I58221 XLLGNVLVTVLAIHFGKEFTPEVQASWQKMV

10 20 30

500 510 520 530 540 550

GGAMMA TGVASALSSRYH*AHCP*CRAFKDRLYSASNTNNKSILLRDHTWLSSVLFFMSF*IYEPQ

:.::::::::::

I58221 TAVASALSSRYH

>>pir:HGBAY hemoglobin gamma chain - yellow baboon (146 aa)

initn: 451 init1: 263 opt: 263 Z-score: 339.9 bits: 70.8 E(): 1.1e-11

Smith-Waterman score: 263; 95.349% identity (95.349% ungapped) in 43 aa overlap (467-509:104-146)

440 450 460 470 480 490

GGAMMA GWKLGV*LSGGQAGALSSLWVHLYCLLSSQQLLGNVLVTVLAIHFGKEFTPEVQASWQKM

.:::::::::::::::::::::::::::::

HGBAY AIKNLDDLKGTFAQLSELHCDKLHVDPENFRLLGNVLVTVLAIHFGKEFTPEVQASWQKM

80 90 100 110 120 130

500 510 520 530 540 550

GGAMMA VTGVASALSSRYH*AHCP*CRAFKDRLYSASNTNNKSILLRDHTWLSSVLFFMSF*IYEP

:.:::::::::::

HGBAY VAGVASALSSRYH

140

>>pir:HGMQJ hemoglobin gamma chain - Japanese macaque (146 aa)

initn: 451 init1: 263 opt: 263 Z-score: 339.9 bits: 70.8 E(): 1.1e-11

Smith-Waterman score: 263; 95.349% identity (95.349% ungapped) in 43 aa overlap (467-509:104-146)

440 450 460 470 480 490

GGAMMA GWKLGV*LSGGQAGALSSLWVHLYCLLSSQQLLGNVLVTVLAIHFGKEFTPEVQASWQKM

.:::::::::::::::::::::::::::::

HGMQJ AIKNLDDLKGTFAQLSELHCDKLHVDPENFRLLGNVLVTVLAIHFGKEFTPEVQASWQKM

80 90 100 110 120 130

500 510 520 530 540 550

GGAMMA VTGVASALSSRYH*AHCP*CRAFKDRLYSASNTNNKSILLRDHTWLSSVLFFMSF*IYEP

:.:::::::::::

HGMQJ VAGVASALSSRYH

140

>>pir:I37036 hemoglobin gamma-2 chain - common gibbon (147 aa)

initn: 453 init1: 263 opt: 263 Z-score: 339.9 bits: 70.8 E(): 1.1e-11

Smith-Waterman score: 263; 95.349% identity (95.349% ungapped) in 43 aa overlap (467-509:105-147)

440 450 460 470 480 490

GGAMMA GWKLGV*LSGGQAGALSSLWVHLYCLLSSQQLLGNVLVTVLAIHFGKEFTPEVQASWQKM

.:::::::::::::::::::::::::::::

I37036 AIKNLDDLKGTFAQLSELHCDKLHVDPENFRLLGNVLVTVLAIHFGKEFTPEVQASWQKM

80 90 100 110 120 130

500 510 520 530 540 550

GGAMMA VTGVASALSSRYH*AHCP*CRAFKDRLYSASNTNNKSILLRDHTWLSSVLFFMSF*IYEP

:.:::::::::::

I37036 VAGVASALSSRYH

140

>>pir:I37035 hemoglobin gamma-1 chain - common gibbon (147 aa)

initn: 458 init1: 263 opt: 263 Z-score: 339.9 bits: 70.8 E(): 1.1e-11

Smith-Waterman score: 263; 95.349% identity (95.349% ungapped) in 43 aa overlap (467-509:105-147)

440 450 460 470 480 490

GGAMMA GWKLGV*LSGGQAGALSSLWVHLYCLLSSQQLLGNVLVTVLAIHFGKEFTPEVQASWQKM

.:::::::::::::::::::::::::::::

I37035 AIKNLDDLKGTFAQLSELHCDKLHVDPENFRLLGNVLVTVLAIHFGKEFTPEVQASWQKM

80 90 100 110 120 130

500 510 520 530 540 550

GGAMMA VTGVASALSSRYH*AHCP*CRAFKDRLYSASNTNNKSILLRDHTWLSSVLFFMSF*IYEP

:.:::::::::::

I37035 VAGVASALSSRYH

140

566 residues in 1 query sequences

96233341 residues in 283366 library sequences

Scomplib [34t21]

start: Wed Dec 8 16:17:07 2004 done: Wed Dec 8 16:36:08 2004

Total Scan time: 948.400 Total Display time: 2.120

Function used was FASTA [version 3.4t21 May 14, 2003]

What is the Output?

The first part of the output file contains a histogram showing the distribution of the z-scores between the query and search set sequences. The histogram is composed of bins of size 2 that are labeled according to the higher score for that bin (the leftmost column of the histogram). For example, the bin labeled 24 stores the number of sequence pairs that had scores of 23 or 24.

The next two columns of the histogram list the number of z-scores that fell within each bin. The second column lists the number of z-scores observed in the search and the third column lists the number of z-scores that were expected.

The body of the histogram displays a graphical representation of the score distributions. Equal signs (=) indicate the number of scores of that magnitude that were observed during the search, while asterisks (*) plot the number of scores of that magnitude that were expected.
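The binning and the bar display described above can be illustrated with a minimal Python sketch. It is not the FastA+ source: the function names are hypothetical, the open-ended first and last bins are not modelled, and the rounding rule (one symbol per started block of sequences) is an assumption that happens to agree with the sample histogram.

```python
from collections import Counter
from math import ceil


def bin_label(score: int) -> int:
    """Return the label of the size-2 bin holding an integer score (23 and 24 -> 24)."""
    return score + (score % 2)


def histogram(scores: list[int]) -> Counter:
    """Count how many scores fall into each labeled bin."""
    return Counter(bin_label(s) for s in scores)


def render_row(label: int, observed: int, expected: int, per_symbol: int) -> str:
    """Draw one row: '=' for observed counts, '*' at the expected count."""
    n_obs = ceil(observed / per_symbol)
    n_exp = ceil(expected / per_symbol)
    cells = [" "] * max(n_obs, n_exp)
    for i in range(n_obs):
        cells[i] = "="
    if n_exp:
        cells[n_exp - 1] = "*"  # the asterisk overwrites whatever is in its column
    return f"{label:>4} {observed:>6} {expected:>6}:" + "".join(cells)


print(histogram([23, 24, 25, 31]))
print(render_row(56, 13835, 14556, 466))
```

With one symbol representing 466 library sequences, the row for bin 56 comes out as 30 equal signs, a space, and an asterisk, as in the sample output.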

At the bottom of the histogram is a list of some of the parameters pertaining to the search.

Below the histogram, FastA+ displays a listing of the best scores. Strand:- after the sequence name in this list indicates that the match was found between the search set sequence and the reverse complement of the query sequence.

Following the list of best scores, FastA+ displays the alignments of the regions of best overlap between the query and search sequences. /rev following the query sequence name indicates that the search sequence is aligned with the reverse complement of the query sequence.

This program displays only the region of overlap between the two aligned sequences (plus some residues on either side of the region to provide context for the alignment) unless you use -showall. The display of identities and conservative replacements between the aligned sequences depends on the value of -markx. By default (-markx=3), the pipe character (|) is used to denote identities and the colon (:) to denote conservative replacements.

INPUT FILES

FastA+ accepts a single protein sequence or a single nucleic acid sequence as the query sequence. The search set is either a single sequence or multiple sequences of the same type as the query. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example Genbank:*.

The function of FastA+ depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST.

BLAST+ searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST+ can produce gapped alignments for the matches it finds. NetBLAST+ searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST+ can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.

SSearch+ does a rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). This may be the most sensitive method available for similarity searches. Compared to BLAST+ and FastA+, it can be very slow.

TFastA+ does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA+ translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

TFastX+ does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences, taking frameshifts into account. It is designed to be a replacement for TFastA+, and like TFastA+, it is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FastX+ does a Pearson and Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. FastX+ translates both strands of the nucleic sequence before performing the comparison. It is designed to answer the question, "What implied protein sequences in my nucleic acid sequence are similar to sequences in a protein database?"

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.

ProfileSearch and MotifSearch use a profile (derived from a set of aligned sequences) instead of a query sequence to search a collection of sequences. FindPatterns+ uses a pattern described by a regular expression to search a collection of sequences.

StringSearch, LookUp, and Names identify sequences by searching the annotation (non-sequence) portions of sequence files or sequence databases.

BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds. NetBLAST searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.

SSearch does a rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). This may be the most sensitive method available for similarity searches. Compared to BLAST and FastA, it can be very slow.

TFastA does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

TFastX does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences, taking frameshifts into account. It is designed to be a replacement for TFastA, and like TFastA, it is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FastX does a Pearson and Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. FastX translates both strands of the nucleic sequence before performing the comparison. It is designed to answer the question, "What implied protein sequences in my nucleic acid sequence are similar to sequences in a protein database?"

RESTRICTIONS

The query sequence cannot be longer than 20,000 symbols. You cannot select a list size of more than 1,000 best scores nor view more than 1,000 alignments. The word size must be from 1 to 6 for nucleic acid queries, and from 1 to 2 for protein queries. The sequence type (nucleic acid or protein) of the query sequence and the search set sequences must match.

If the query sequence has more than 20,000 symbols, the program issues a warning message: "The FastA program can efficiently handle complete alignments for sequences less than 20 KB"

For the estimates of statistical significance to be valid, the search set must contain a large sample of unrelated sequences. The statistical estimates will not be calculated at all if there are fewer than 10 sequences in the search set (20 sequences if only one strand is searched).

With -nooptall, the estimates of statistical significance will not be accurate.

The FastA suite of programs works with flat-file databases only. You cannot specify BLAST databases as the search set for FastA+.

On Tru64 (OSF), FastA+ can fail with the following error message:

"While running the child process: Child was terminated by signal 6 (SIGABRT)

Error in cleaning up after application: Exception: Error reading fast program output: Unable to open fasta output file: "/tmp/bslskAAAMGXMCf.tmp" (at /tmp/bslskAAAMGXMCf.tmp:0)."

Workaround

There is an upper limit on the amount of memory that can be allocated per process. On a Tru64 machine, the datasize limit is set to 128M by default. To increase this limit, execute

unlimit datasize (csh) or

ulimit -d unlimited (ksh)

This increases the datasize limit to 1024M, which is the maximum amount of memory that an individual process can use on a Tru64 machine. Because of this ceiling, the default setting for the search set parameter (-infile2) of the FastA suite of programs may still cause a crash. If it does, run the programs with a smaller subset of the search set. The programs have been tested successfully using a search set of 400 thousand sequences.

ALGORITHM

FastA+ uses the method of Pearson and Lipman (Proc. Natl. Acad. Sci. USA 85; 2444-2448 (1988)) to search for similarities between one sequence (the query) and any group of sequences.

Hashing Step

The first step in the search is to locate regions of the query sequence and the search set sequence that have high densities of exact word matches. The algorithm for this step of the search is a modification of the algorithm of Wilbur and Lipman (Proc. Natl. Acad. Sci. USA 80; 726-730 (1983)) and may be referred to as a hash-table look-up search or hashing. Wilbur and Lipman searches (including FastA+) belong to a class of comparisons that use direct addressing or k-tuple preprocessing to increase the speed of the search at the expense of some sensitivity.

The hashing process works as follows. After you give FastA+ a word size, it makes up a dictionary of all of the possible words of that size in the query sequence. A second dictionary is created for the opposite strand if the query is a nucleic acid sequence. Each word, such as GGATGG, is converted to a unique base-4 number that serves as an index to the corresponding dictionary entry. Each entry contains a list of numbers telling the location (coordinates) of the word in the query sequence. If the word does not occur, the entry contains only the number zero. So for each word in the searched sequences, FastA+ only has to look up the word in the dictionary to find out if it occurs in the query sequence.
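The dictionary described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the FastA+ implementation: the digit assigned to each base is arbitrary, positions are 1-based, a word that does not occur is represented by an empty list rather than by the number zero, and the function names are hypothetical.

```python
BASE4_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed digit assignment


def word_index(word: str) -> int:
    """Convert a nucleotide word to a unique base-4 number."""
    index = 0
    for base in word:
        index = index * 4 + BASE4_DIGIT[base]
    return index


def build_dictionary(query: str, wordsize: int) -> list[list[int]]:
    """Directly addressed table: entry i lists where word i occurs in the query."""
    table: list[list[int]] = [[] for _ in range(4 ** wordsize)]
    for pos in range(len(query) - wordsize + 1):
        table[word_index(query[pos:pos + wordsize])].append(pos + 1)
    return table


def lookup(table: list[list[int]], word: str) -> list[int]:
    """Return the query coordinates of a word taken from a search set sequence."""
    return table[word_index(word)]


table = build_dictionary("GGATGGATGG", 6)
print(lookup(table, "GGATGG"))  # query positions where GGATGG starts
print(lookup(table, "CCCCCC"))  # empty: the word does not occur in the query
```

Because the index is computed directly from the word, each lookup costs the same regardless of the query length, which is where the speed of the hashing step comes from.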

It is important to realize that the hashing process cannot deal with ambiguity! To partially compensate for this limitation, FastA+ converts an ambiguous base in a sequence to its most common nonambiguous constituent before calculating the index number of the word that contains the ambiguity. For example, A is the most common nucleotide in the sequence databases, so N is converted to A during the hashing step. Similarly, the ambiguous amino acids B, Z, and X are converted to their most common unambiguous constituent, so B (D or N) gets the same hash code as N, and X (any amino acid) gets the same hash code as alanine, the most common amino acid in the protein databases.
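The substitution of ambiguity symbols before hashing might look like the following Python sketch. Only the replacements named in the text above are included; the replacement for Z is not stated there, so it is deliberately left out. The table and function names are hypothetical.

```python
# Replacements taken from the description above; Z is omitted because
# the text does not say which residue it is converted to.
NUCLEOTIDE_SUBSTITUTES = {"N": "A"}
PROTEIN_SUBSTITUTES = {"B": "N", "X": "A"}


def resolve_for_hashing(sequence: str, substitutes: dict[str, str]) -> str:
    """Replace ambiguity symbols so every word maps to a single hash code."""
    return "".join(substitutes.get(symbol, symbol) for symbol in sequence)


print(resolve_for_hashing("GGNTGG", NUCLEOTIDE_SUBSTITUTES))
print(resolve_for_hashing("QLXB", PROTEIN_SUBSTITUTES))
```

The substitution affects only the hashing step, so a word containing N matches during hashing only where the other sequence has A at that position.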

If a word from a search set sequence occurs in the query sequence, FastA+ computes a score for the word equal to the sum of the scoring factors (see next paragraph) for each symbol in the word. It then adds this score to the score of the diagonal on which the word occurs. If a word match overlaps another word on the same diagonal, only the scoring factor(s) for the non-overlapping symbol(s) is added to the score of the diagonal. If there are intervening mismatches between matching words on a diagonal, a constant penalty for each mismatching residue is subtracted from the score.

When -pamfactor is in effect (the default for protein query sequences), the scoring factors used to score a word are the identical match scores of the scoring matrix used. Thus a word that contains relatively immutable amino acids will add a larger score to the diagonal than a word which contains amino acids which can exchange readily. The default for a nucleic acid query sequence is -nopamfactor. In this case, a single constant value is used for all symbol matches, so all words contribute the same score. The program defaults can be overridden using -pamfactor or -nopamfactor.
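The diagonal scoring described in the two paragraphs above can be sketched in Python. This is a simplified illustration, not the FastA+ code: it uses a single constant scoring factor for every symbol (as with -nopamfactor), the default values of match_factor and mismatch_penalty are arbitrary assumptions, and a diagonal is identified here as the query index minus the search set index.

```python
from collections import defaultdict


def diagonal_scores(query: str, library: str, wordsize: int,
                    match_factor: int = 1, mismatch_penalty: int = 1) -> dict[int, int]:
    """Accumulate word-match scores on each diagonal (query index - library index)."""
    positions = defaultdict(list)
    for i in range(len(query) - wordsize + 1):
        positions[query[i:i + wordsize]].append(i)

    score: dict[int, int] = defaultdict(int)
    covered_to: dict[int, int] = {}  # diagonal -> library index just past the last match
    for j in range(len(library) - wordsize + 1):
        for i in positions.get(library[j:j + wordsize], ()):
            diagonal = i - j
            end = j + wordsize
            previous_end = covered_to.get(diagonal)
            if previous_end is None:
                new_symbols, gap = wordsize, 0
            elif j < previous_end:
                # Overlapping word: count only the symbols not scored already.
                new_symbols, gap = end - previous_end, 0
            else:
                # Intervening mismatches between two word matches are penalized.
                new_symbols, gap = wordsize, j - previous_end
            score[diagonal] += new_symbols * match_factor - gap * mismatch_penalty
            covered_to[diagonal] = end
    return dict(score)


print(diagonal_scores("ACGTACGT", "ACGTACGT", 2))
print(diagonal_scores("ACGTTTGCA", "ACGAAAGCA", 2))
```

In the second call, the main diagonal collects three matching symbols, loses three points for the intervening mismatches, and gains three more, for a total of 3.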

Scoring Step

At the end of the hashing step, the ten highest-scoring regions for the sequence pair (the regions with the highest density of exact word matches) are rescored using a scoring matrix that allows conservative replacements and runs of identities shorter than the size of a word. The ends of each region are trimmed to include only those residues that contribute to the highest score for the region, resulting in ten partial alignments without gaps. These are referred to as the initial regions. The score of the highest scoring initial region is saved as the init1 score.
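One way to picture the rescoring and trimming of a region is as a maximum-scoring segment search along a diagonal. The Python sketch below uses that standard technique as a stand-in; it is not claimed to be how FastA+ implements the step. The identity_score function is a toy substitute for a real scoring matrix, and all names are hypothetical.

```python
def identity_score(a: str, b: str, match: int = 5, mismatch: int = -4) -> int:
    """Toy scoring function standing in for a real scoring matrix."""
    return match if a == b else mismatch


def best_ungapped_segment(query: str, library: str, diagonal: int,
                          score=identity_score) -> tuple[int, int, int]:
    """Return (score, start, end) of the best gap-free segment on a diagonal.

    diagonal is the query index minus the library index; start and end are
    library indices, with end exclusive. Low-scoring ends are trimmed away
    because the running total restarts whenever it drops to zero or below.
    """
    best = (0, 0, 0)
    running = 0
    j = max(0, -diagonal)
    run_start = j
    while j < len(library) and j + diagonal < len(query):
        running += score(query[j + diagonal], library[j])
        if running <= 0:
            running = 0
            run_start = j + 1
        elif running > best[0]:
            best = (running, run_start, j + 1)
        j += 1
    return best


print(best_ungapped_segment("TTACGTTT", "GGACGTGG", 0))
```

Applied to each of the ten candidate regions, the largest score returned would play the role of init1.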

Next, FastA+ determines if any of the initial regions from different diagonals may be joined together to form an approximate alignment with gaps. Only non-overlapping regions may be joined. The score for the joined regions is the sum of the scores of the initial regions minus a joining penalty for each gap. The score of the highest scoring region at the end of this step is saved as the initn score.
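The joining step can be sketched as a small chaining problem. The Python below is an illustrative simplification, not the FastA+ code: regions are given as half-open coordinate ranges, two regions may be joined only if one ends before the other starts in both sequences, and a single constant join_penalty is charged per join. The Region type and function name are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    q_start: int  # query start (inclusive)
    q_end: int    # query end (exclusive)
    l_start: int  # search set sequence start (inclusive)
    l_end: int    # search set sequence end (exclusive)
    score: int


def initn_score(regions: list[Region], join_penalty: int) -> int:
    """Best total score of a chain of non-overlapping regions."""
    ordered = sorted(regions, key=lambda r: (r.q_start, r.l_start))
    best_ending_at: list[int] = []
    for i, region in enumerate(ordered):
        chain = region.score
        for j in range(i):
            earlier = ordered[j]
            if earlier.q_end <= region.q_start and earlier.l_end <= region.l_start:
                chain = max(chain, best_ending_at[j] + region.score - join_penalty)
        best_ending_at.append(chain)
    return max(best_ending_at, default=0)


regions = [
    Region(0, 10, 0, 10, 50),
    Region(15, 25, 12, 22, 40),
    Region(5, 20, 30, 45, 60),
]
print(initn_score(regions, join_penalty=20))  # joins the first two regions
print(initn_score(regions, join_penalty=45))  # joining no longer pays off
```

When no join improves on the best single region, the result equals the best single-region score, which is consistent with initn never being lower than init1.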

Aligning Step

After computing the initial scores, FastA+ determines the best segment of similarity between the query sequence and the search set sequence, using a variation of the Smith-Waterman algorithm. This "local alignment in a band" procedure is described in Chao, Pearson, and Miller (CABIOS 8; 481-487 (1992)). The score for this alignment is reported as the opt score.
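The idea of a local alignment restricted to a band can be sketched in Python. This is a simplified illustration rather than the procedure of Chao, Pearson, and Miller: it uses a single linear gap penalty instead of separate gap creation and extension penalties, a toy identity scoring function instead of a scoring matrix, and hypothetical names throughout.

```python
def identity_score(a: str, b: str, match: int = 5, mismatch: int = -4) -> int:
    """Toy scoring function standing in for a real scoring matrix."""
    return match if a == b else mismatch


def banded_local_score(query: str, library: str, center_diagonal: int,
                       half_width: int, gap: int = 7, score=identity_score) -> int:
    """Best Smith-Waterman score using only cells near one diagonal.

    A cell (i, j) is inside the band when the diagonal i - j lies within
    half_width of center_diagonal. Cells outside the band are never filled
    and count as zero, so no alignment can pass through them.
    """
    best = 0
    previous_row: dict[int, int] = {}
    for i in range(1, len(query) + 1):
        current_row: dict[int, int] = {}
        low = max(1, i - center_diagonal - half_width)
        high = min(len(library), i - center_diagonal + half_width)
        for j in range(low, high + 1):
            diagonal_move = previous_row.get(j - 1, 0) + score(query[i - 1], library[j - 1])
            gap_in_library = previous_row.get(j, 0) - gap
            gap_in_query = current_row.get(j - 1, 0) - gap
            cell = max(0, diagonal_move, gap_in_library, gap_in_query)
            current_row[j] = cell
            best = max(best, cell)
        previous_row = current_row
    return best


print(banded_local_score("ACGTTACGT", "ACGTACGT", 0, half_width=1))
print(banded_local_score("ACGTTACGT", "ACGTACGT", 0, half_width=0))
```

A band of half-width 1 is wide enough to bridge the single extra residue in the first sequence, while a band of half-width 0 reduces to an ungapped comparison along one diagonal. The work grows with the band width rather than with the full product of the two sequence lengths.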

By default, FastA+ determines the opt score immediately if the initn score is greater than a given threshold. The opt scores are then used as the basis for keeping a list of the best matches found. The program calculates the default threshold from the length of the query sequence and the ktup setting. You can override this threshold by adding a positive, nonzero number after -optall, for example -optall=20. A threshold of 1 is the most sensitive setting. Setting the threshold higher than this will speed up the search, at the risk of missing some matches.

Alternatively, you can use -nooptall to direct the program to use the initn scores as the basis for retaining the best matches. In this case, the opt scores are calculated for the matches with the best initn scores only after all of the search set sequences have been scanned. This speeds up the search, but at the cost of sensitivity, and the statistical estimates for such a search will not be valid. When -nooptall is specified, the best scores are sorted and reported in order of their initn scores, even though the opt score is calculated.

Lastly, FastA+ uses a simple linear regression against the natural log of the search set sequence length to calculate a normalized z-score for the sequence pair. (See William R. Pearson, Protein Science 4; 1145-1160 (1995) for an explanation of how this z-score is calculated.) By default, the z-score is calculated from the opt score; with -nooptall, the z-score is calculated from the initn score instead.

The distribution of the z-scores tends to closely approximate an extreme-value distribution; using this distribution, the program can estimate the number of sequences that would be expected to have, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the E() score.
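The following Python sketch illustrates the two calculations above. It is a simplification under stated assumptions, not the FastA+ code: the regression is an ordinary least-squares fit with no outlier trimming, the z-score is scaled to a mean of 50 and a standard deviation of 10 (an assumption that matches the histogram in the sample output), and the extreme-value formula is the standard form for a distribution with zero mean and unit variance. All function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649


def fit_length_regression(lengths: list[int], scores: list[float]) -> tuple[float, float, float]:
    """Least-squares fit of score = rho * ln(length) + mu; also returns residual variance."""
    xs = [math.log(n) for n in lengths]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    rho = sxy / sxx
    mu = mean_y - rho * mean_x
    variance = sum((y - (rho * x + mu)) ** 2 for x, y in zip(xs, scores)) / len(xs)
    return rho, mu, variance


def z_score(score: float, length: int, rho: float, mu: float, variance: float) -> float:
    """Length-corrected score, scaled to mean 50 and standard deviation 10 (assumed)."""
    return 50.0 + 10.0 * (score - (rho * math.log(length) + mu)) / math.sqrt(variance)


def expectation(z: float, library_size: int) -> float:
    """Expected number of library sequences reaching z by chance (extreme-value model)."""
    x = math.exp(-(math.pi / math.sqrt(6.0)) * (z - 50.0) / 10.0 - EULER_GAMMA)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return library_size * -math.expm1(-x)


print(expectation(347.5, 283366))
```

As a check against the sample run, a z-score of 347.5 in a library of 283,366 sequences gives a value close to the 4.3e-12 reported for the best hit.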

When all of the search set sequences have been compared to the query, the list of best scores is printed. If alignments were requested, the alignments are also printed. For searches with a protein query sequence against a protein search set, a full Smith-Waterman local alignment (not restricted to a band, and therefore allowing unlimited gap lengths) is performed, and a Smith-Waterman score is reported along with the other scores and the alignment itself. (This alignment may not be the same alignment that the "local alignment in a band" algorithm used to calculate the opt score during the search.) By default, the alignment for nucleic acid searches and TFastA+ is the same "local alignment in a band" that was performed to calculate the opt score. With -swalign, you can make the program perform the full Smith-Waterman alignment on nucleic acid sequences at the cost of increased computation time.

In evaluating the E() scores, the following rules of thumb can be used: for searches of a protein database of 10,000 sequences, sequences with E() less than 0.01 are almost always found to be homologous. Sequences with E() between 1 and 10 frequently turn out to be related as well. Optimization is important: with -nooptall, E() overestimates the significance of the match, so that unrelated nucleic acid sequences frequently have scores less than 0.0005.

For a detailed description of the FastA+ algorithm, see William R. Pearson, "Rapid and Sensitive Sequence Comparison with FASTP and FASTA," in Methods in Enzymology, 183; 63-98, Academic Press, San Diego, California, USA, 1990.

CONSIDERATIONS

The Accelrys GCG (GCG) version of FastA+ searches using both strands of nucleic acid queries unless you use -onestrand. Dr. Pearson's FASTA searches with one strand only.

The E() scores are affected by similarities in sequence composition between the query sequence and the search set sequence. Unrelated sequences may have "significant" scores because of composition bias.

If there is a database entry that overlaps your query in several places, but there are large gaps between the matching regions, only the best overlap appears in the alignment display.

There are two ways to control the size of the list of best scores. By default, scores are listed until a specific E() value is reached. You may set the value in response to the program prompt or by using -expect; otherwise the program uses 10.0 for protein searches, 2.0 for nucleic acid searches. (If you are running the program interactively, it will show no more than 40 scores initially, and ask if you want to see more scores if there are any more that are less than the E() value.)

If you use -listsize, the E() value is ignored, and the program will list the number of scores you requested.
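The two ways of limiting the score list can be summarized in a short sketch. This is only an illustration of the rules stated above, not the program's code. The function name scores_to_list and the (name, E() value) pair layout are hypothetical, and sorting by E() is an assumption made to keep the example small.

```python
def scores_to_list(hits, is_protein, expect=None, listsize=None):
    """Return the hits that would appear in the list of best scores.

    hits is a list of (name, e_value) pairs.
    """
    ranked = sorted(hits, key=lambda h: h[1])  # smallest E() first (assumed order)
    if listsize is not None:
        return ranked[:listsize]               # -listsize: the E() cutoff is ignored
    if expect is None:
        expect = 10.0 if is_protein else 2.0   # defaults quoted in the text above
    return [h for h in ranked if h[1] < expect]
```

For example, with hits whose E() values are 0.001, 1.5, 5.0, and 50.0, a nucleic acid search with no options would list the first two, a protein search would list the first three, and listsize=4 would list all four.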

You can control the number of alignments using -noalign and -align. The program behaves differently depending on whether it is being run noninteractively (in batch or with -Default on the command line) or interactively. In the noninteractive case, the program displays the number of alignments set by -align. (If this is not present, it shows 40 alignments or the number of scores that were listed, whichever is smaller.) If you run the program interactively, it displays the list of best scores, and then asks you how many alignments you want to see. (This prompt does not appear if you use -noalign or -align.)
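For the noninteractive case, the rule above reduces to a few lines. The sketch below restates it in Python. The function and argument names are hypothetical and the code is not taken from the program.

```python
def alignments_shown(listed_scores, align=None, noalign=False):
    """Number of alignments displayed in a noninteractive run."""
    if noalign:
        return 0                    # -noalign suppresses all alignments
    if align is not None:
        return align                # -align sets the number directly
    return min(40, listed_scores)   # otherwise 40 or the listed scores, whichever is smaller
```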

Increasing Sensitivity By Adjusting Word Size

By default, FastA+ uses the maximum allowable word size in order to maximize the speed of the search. But in some situations this may not be sensitive enough to find matches to your query. In particular, if your query sequence is a short oligonucleotide or peptide and/or the query contains ambiguous residues, you may need to use -wordsize to reduce the word size that is used during the hashing step. (A smaller word size will increase the sensitivity of the search at the expense of increasing the amount of CPU time required to run the program.)

If FastA+ finds few or no matches for short query sequences, rerun the search using a word size of 2 or 3 (for oligonucleotides) or a word size of 1 (for short peptides). Because of the way ambiguous residues are treated during the hashing stage of the search, you should not use a word size larger than the longest run of nonambiguous residues in your query sequence.
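To apply the last rule, you need the length of the longest run of nonambiguous residues in your query. A minimal Python sketch follows. The function name longest_unambiguous_run is hypothetical, and the default alphabet assumes a nucleotide query in which only A, C, G, T, and U count as nonambiguous. For a peptide, pass the 20 amino acid letters instead.

```python
import re

def longest_unambiguous_run(query, alphabet="ACGTU"):
    """Length of the longest run of nonambiguous residues in the query."""
    runs = re.findall(f"[{alphabet}]+", query.upper())
    return max((len(run) for run in runs), default=0)
```

For the query ACGNNACGTAC the longest run is ACGTAC, so a word size above 6 would not be advisable under the rule above. A result of 0 means the query contains no nonambiguous residues at all.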

Adjusting Gap Creation and Extension Penalties

Unlike other GCG programs, FastA+ does not read the default gap creation and gap extension penalties from the scoring matrix file. It uses default gap creation and extension penalties that were empirically determined to be appropriate for the default scoring matrices. If you select a different scoring matrix with -matrix, you may need to change the gap penalties. The histogram display gives a qualitative view of the quality of fit between the actual distribution of scores and the expected distribution of scores. This information may indicate whether or not suitable gap creation and extension penalties were used for the search. When the histogram shows poor agreement between the actual distribution and the theoretical distribution, you might consider using -gapweight and/or -lengthweight to specify higher gap creation and extension penalties, respectively. For example, you might increase the gap creation penalty from 12 to 16 and the gap extension penalty from 2 to 4.

Differences in Applying Gap Extension Penalties

There are two different philosophies on how to penalize gaps in an alignment. One way is to penalize a gap by the gap creation penalty plus the extension penalty times the length of the gap (gapweight + (lengthweight x gap length)). The other way is to use the gap creation penalty plus the extension penalty times the gap length excluding the first residue in the gap (gapweight + (lengthweight x (gap length - 1))).
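The two formulas can be compared directly with a short worked example. The Python functions below simply restate the formulas given above. Their names are hypothetical.

```python
def gap_cost_first_way(gapweight, lengthweight, gap_length):
    """Creation penalty plus extension penalty for every residue in the gap."""
    return gapweight + lengthweight * gap_length

def gap_cost_second_way(gapweight, lengthweight, gap_length):
    """Creation penalty plus extension penalty for every residue after the first."""
    return gapweight + lengthweight * (gap_length - 1)
```

With a gap creation penalty of 12 and a gap extension penalty of 2, a gap of three residues costs 18 the first way and 16 the second way. In general the second way charges exactly one extension penalty less per gap, so it gives the same cost as the first way used with a creation penalty of gapweight - lengthweight.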

"Native" GCG programs, such as Framesearch and Bestfit, handle gap extension penalties the first way, while the FastA+-family programs use the second way. Therefore a value for -lengthweight that gives good results with one of the FastA+-family programs may not give equivalent results with a native GCG program, and vice versa.

Increasing Program Speed Using Multithreading

This program is multithreaded. It has the potential to run faster on a machine equipped with multiple processors because different parts of the analysis can be run in parallel on different processors. By default, the program assumes you have one processor, so the analysis is performed using one thread. You can use -processors to increase the number of threads up to the number of physical processors on the computer.

Under ideal conditions, the increase in speed is roughly linear with the number of processors used. But conditions are rarely ideal. If your computer is heavily used, competition for the processors can reduce the program's performance. In such an environment, try to run multithreaded programs during times when the load on the system is light.

As the number of threads increases, the amount of memory required increases substantially. You may need to ask your system administrator to increase the memory quota for your account if you want to use more than two threads.

Never use -processors to set the number of threads higher than the number of physical processors that the machine has -- it does not increase program performance, but instead uses up a lot of memory needlessly and makes it harder for other users on the system to get processor time. Ask your system administrator how many processors your computer has if you aren't sure.
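If you launch searches from a script, you can cap the requested thread count before building the -processors option. The sketch below is one way to do that in Python. The function names are hypothetical, and it assumes that os.cpu_count() is an acceptable stand-in for the processor count. Note that os.cpu_count() reports logical processors, which can be more than the number of physical processors, so pass the physical count explicitly if you know it.

```python
import os

def thread_count(requested, available=None):
    """Clamp the requested number of threads to the processors available."""
    if available is None:
        available = os.cpu_count() or 1  # logical CPUs; may exceed physical CPUs
    return max(1, min(requested, available))

def processors_option(requested, available=None):
    """Build the -processors option with a safe thread count."""
    return f"-processors={thread_count(requested, available)}"
```

For example, processors_option(8, available=4) yields -processors=4.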

SUGGESTIONS

Identifying the Search Set

If you want to search a single database division instead of an entire database, see the "Using Database Sequences" topic of Section 2, Using Sequence Files and Databases of the User's Guide for a list of the logical names used for the databases and the divisions of each database. The search set can also consist of a group of sequence files that are not in a database. Use a multiple sequence specification to name these. For information about naming groups of sequences for the search set, see the topics "Specifying Files" and "Using Wildcards" in Section 1, Getting Started, and "Using Database Sequences," "Using Multiple Sequence Format (MSF) Files", "Using Rich Sequence Format (RSF) Files", and "Using List Files" in Section 2, Using Sequence Files and Databases of the User's Guide.

Batch Queue

FastA+ is one of the few programs in GCG that can take more than a few minutes to run. Most comparisons should probably be run in the batch queue. You can specify that this program run at a later time in the batch queue by using -batch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Section 3, Using Programs in the User's Guide. Very large comparisons may exceed the CPU limit set by some systems.

Interrupting a Search: <Ctrl>C

You can type <Ctrl>C to interrupt a search and see the results from the part of the search that has already been completed. Because the program is multithreaded, the search may not be interrupted immediately, but will continue until one of the threads finishes processing its data and returns for more data.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -check to view the summary below and to specify parameters before the program executes. In the syntax summary below, square brackets ([ and ]) enclose parameter values that are optional. For each program parameter, square brackets enclose the type of parameter value specified, the default parameter value, and the aliases (shortened forms of the parameter name). Programs with a plus in the name accept either the full parameter name or a specified alias. If “Type” is “Boolean”, the presence of the parameter on the command line indicates a true condition; a false condition must be stated as parameter=false.
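When a search is driven from a script, the Boolean convention described above can be captured in a small helper. The Python sketch below only assembles an argument list and does not run the program. The function name build_args and the file names in the example are hypothetical, and the helper uses only the parameter=false form, not the -no prefixed forms that appear elsewhere in this document.

```python
def build_args(query, **params):
    """Assemble a fasta+ argument list from keyword parameters.

    True  -> bare parameter      (-default)
    False -> parameter=false     (-histogram=false)
    other -> parameter=value     (-wordsize=2)
    """
    args = ["fasta+", f"-infile1={query}"]
    for name, value in params.items():
        if value is True:
            args.append(f"-{name}")
        elif value is False:
            args.append(f"-{name}=false")
        else:
            args.append(f"-{name}={value}")
    return args
```

For example, build_args("query.seq", infile2="mydb", default=True, histogram=False, wordsize=2) returns the list fasta+, -infile1=query.seq, -infile2=mydb, -default, -histogram=false, -wordsize=2.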

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein).

For nucleotide searches, FastA+ may be more sensitive than BLAST.

Minimal Syntax: % fasta+ [-infile1=]value -Default

Minimal Parameters (case-insensitive):

-infile1 [Type: List / Default: EMPTY / Aliases: infile in1 in]

Input file specification.

Prompted Parameters (case-insensitive):

-begin [Type: Integer / Default: '1' / Aliases: beg]

Starting point of the range of interest in the input sequence.

-end [Type: Integer / Default: '-1']

End point of the range of interest in the input sequence. A value of '-1' indicates that the range extends till the end of input sequence.

-infile2 [Type: List / Default: EMPTY / Aliases: in2 db]

Search set specification.

-outfile [Type: OutFile / Default: '<sequence_name>.<program_name>' / Aliases: out]

File to which output is written. A value of '-' means STDOUT. Specifying this option also turns on the 'concat' option. Default value is '-'.

Optional Parameters (case-insensitive):

-check [Type: Boolean / Default: 'false' / Aliases: che help]

Prints out this usage message.

-default [Type: Boolean / Default: 'false' / Aliases: d def]

Specifies that sensible default values be used for all parameters where possible.

-documentation [Type: Boolean / Default: 'true' / Aliases: doc]

Prints banner at program startup.

-quiet [Type: Boolean / Default: 'false' / Aliases: qui]

Tells application to print only a minimal amount of information.

-wordsize [Type: Integer / Default: EMPTY / Aliases: wor]

Size of word (k-tuple) used in the hashing step.

-expect [Type: Double / Default: '2.0' / Aliases: exp]

Shows all scores whose E() value is less than the specified value of expect.

-matrix [Type: String / Default: EMPTY / Aliases: mat]

Assigns the scoring matrix for the comparison.

-processors [Type: Integer / Default: '1' / Aliases: proc]

On a multiprocessor computer, this parameter controls the number of threads to use for database search.

-minlength [Type: Integer / Default: EMPTY / Aliases: minl]

The search set is restricted to sequences whose length is more than the value specified by this parameter.

-maxlength [Type: Integer / Default: EMPTY / Aliases: maxl]

The search set is restricted to sequences whose length is less than the value specified by this parameter.

-pamfactor [Type: Boolean / Default: 'DEFAULT_PARAM_VALUE' / Aliases: pam]

This parameter governs whether a scoring matrix should be used for calculating initial diagonal scores, instead of using the identical match scores from the scoring matrix. Default is to use FASTA+ internal behavior, which differs for protein and nucleotide searches.

-gapweight [Type: Integer / Default: EMPTY / Aliases: gap]

This parameter specifies the gap creation penalty that is subtracted from an alignment every time a gap is created.

-lengthweight [Type: Integer / Default: EMPTY / Aliases: len]

This parameter specifies the gap extension penalty that is subtracted from an alignment every time a gap is extended by one residue.

-optall [Type: Boolean / Default: 'DEFAULT_PARAM_VALUE' / Aliases: opt]

With this parameter, the program immediately performs an alignment and calculates the opt score when the initn score is greater than or equal to the value specified by this parameter. This parameter allows you to override the default threshold calculated by the program. Scores are sorted and saved by opt score during the search. -NOOPTall doesn't compute the opt score until the search is complete. In this case scores are sorted and saved by initn score instead of by opt score.

-listsize [Type: Integer / Default: '10' / Aliases: lis]

This parameter controls the number of top scores shown.

Overrides the -expect parameter.

-alignments [Type: Integer / Default: '20' / Aliases: align ali]

This parameter limits the number of alignments displayed in the output file to the specified number of best matches in the list. Use -noalign to suppress the sequence alignments in the output file.

-showall [Type: Boolean / Default: 'DEFAULT_PARAM_VALUE' / Aliases: show]

Shows entire sequences in the alignment display, instead of just the best region of overlap and its surroundings.

-native [Type: Boolean / Default: 'false']

Output native FastA+ formatted output.

-markx [Type: Integer / Default: EMPTY / Aliases: mark]

This parameter determines the alignment display mode - especially the symbols that identify matches and mismatches. The default value, -MARKx=0 uses a colon to show identities and a period (.) to show conservative replacements.

-MARKx=1 will not mark identities; instead, conservative replacements are connected with a lowercase x, and non-conservative substitutions are connected with an uppercase X.

If -MARKx=2, the residues in the second sequence are shown only if they differ from the first sequence.

-MARKx=3 displays the aligned library sequences without the query sequences; these can be used to build a primitive multiple alignment.

-MARKx=4 provides a graphical display of the boundaries of the alignments.

-MARKx=5 provides a combination of -MARKx=4 and -MARKx=0.

-MARKx=6 provides -MARKx=5 plus HTML formatting.

-MARKx=9 provides percent identity and coordinates with the initial list of high scores as well as the conventional -MARKx=0 alignments.

Use -MARKx=10 to get aligned sequences in the FastA "parsable" output format.

-histogram [Type: Boolean / Default: 'true' / Aliases: his]

Start/Suppress printing the histogram.

-linesize [Type: Integer / Default: EMPTY / Aliases: lin]

This parameter lets you set the number of sequence symbols in each line of the alignment to any number between 60 and 200.

-batch [Type: Boolean / Default: 'false']

Allows submitting a job to a batch queue.

-onestrand [Type: Boolean / Default: 'false' / Aliases: one]

Searches using only the top strand of a nucleotide query sequence.

-swalign [Type: Boolean / Default: 'false' / Aliases: sw]

Does an unlimited Smith-Waterman alignment as the final alignment for the nucleotide searches, instead of 'alignment in a band'.

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either

1) Have a data file with exactly the same name as the program default scoring matrix in your current working directory; or

2) Have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name Share_Matrix; or

3) Name a file on the command line with an expression like -matrix=mymatrix.cmp. If you don't include a directory specification when you name a file with -matrix, the program searches for the file first in your local directory, then in the directory with the logical name Share_Matrix. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Section 4, Using Data Files in the User's Guide.

FastA+ reads a scoring matrix containing the values for every possible match from your working directory or the public database. The files fastadna.cmp (for nucleic acid sequences) and blosum50.cmp (for protein sequences) contain the default values for matches. blosum50.cmp is a BLOSUM50 matrix. You can use the Fetch+ program to obtain a copy of these files in order to modify them to suit your own needs.
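The lookup order for a matrix named with -matrix can be sketched as follows. This Python function is an illustration only. It assumes that the logical name Share_Matrix can be read as an environment variable holding a directory path, which may not match how logical names are resolved on your system, and it does not model the final fallback to the public data directory. The name find_matrix is hypothetical.

```python
import os
from pathlib import Path

def find_matrix(name, cwd=".", env=os.environ):
    """Return the first existing matrix file, or None if none is found.

    A name with a directory specification is used as given. Otherwise the
    local directory is searched first, then the Share_Matrix directory
    (assumed here to be an environment variable).
    """
    path = Path(name)
    if path.parent != Path("."):
        return path if path.is_file() else None
    search_dirs = [Path(cwd)]
    share = env.get("Share_Matrix")
    if share:
        search_dirs.append(Path(share))
    for directory in search_dirs:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

Because the local directory is searched first, a modified copy of a matrix in your working directory takes precedence over a shared copy with the same name.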

PARAMETER REFERENCE

You can set the parameters listed below from the command line. Shortened forms of the parameter name, aliases, are shown, separated by commas.

-infile1, -infile, -in1, -in

Input file specification.

-begin, -beg

Starting point of the range of interest in the input sequence.

-end

End point of the range of interest in the input sequence. A value of '-1' indicates that the range extends till the end of input sequence.

-infile2, -in2, -db

Search set specification.

-outfile, -out

File to which output is written. A value of '-' means STDOUT. Specifying this option also turns on the 'concat' option. Default value is '-'.

-wordsize=2, -wor

Sets the size of the word (k-tuple) to use for the hashing step.

-check, -che

Prints out this usage message.

-default, -d, -def

Specifies that sensible default values be used for all parameters where possible.

-documentation, -doc

Prints banner at program startup.

-quiet, -qui

This parameter is not supported.

-alignments=10, -ali, -align

Limits the number of alignments displayed in the output file to the 10 best matches in the list. Use -noalign to suppress the sequence alignments in the output file.

-histogram, -his

Start/suppress printing the histogram.

-matrix=mymatrix.cmp, -matr

Allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -matrix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.

For more information see the Local Scoring Matrices section.

-expect=2.0, -exp

Shows all scores whose E() value is less than 2.0. Ignored if -listsize is used.

-processors=2, -proc

Tells the program to use 2 threads for the database search on a multiprocessor computer.

-minlength=1000, -minl

Restricts the search to search set sequences that are equal to or longer than 1000 residues.

-maxlength=5000, -maxl

Restricts the search to search set sequences that are equal to or shorter than 5000 residues.

-onestrand, -one

Searches using only the top strand of a nucleotide query sequence.

-pamfactor, -pam

Uses a scoring matrix for the calculation of initial diagonal scores. Instead of using a constant factor for each match in a word, values are obtained from a scoring matrix. This is the default option for protein sequences, while -nopamfactor is the default for nucleic acid sequences.

-gapweight=12, -gap

Specifies the gap creation penalty that is subtracted from the alignment score whenever a gap is created.

-lengthweight=2, -len

Specifies the gap extension penalty that is subtracted from the alignment score for each residue added to an existing gap.

-optall=20, -opt

Immediately performs an alignment and calculates the opt score when the initn score is greater than or equal to 20. This parameter allows you to override the default threshold calculated by the program. Scores are sorted and saved by opt score during the search. -nooptall doesn't compute the opt score until the search is complete. In this case scores are sorted and saved by initn score instead of by opt score.

-swalign, -sw

Does an unlimited Smith-Waterman alignment as the final alignment for nucleotide searches, instead of the "alignment in a band" version of Smith-Waterman. (Note: this can be very slow.)

-listsize=40, -lis

Shows the best 40 scores. Overrides -expect.

-showall, -show

Shows entire sequences in the alignment display, instead of just the best region of overlap and its surroundings.

-markx, -mark

This parameter determines the alignment display mode - especially the symbols that identify matches and mismatches. The default value, -markx=0 uses a colon to show identities and a period (.) to show conservative replacements.

-markx=1 will not mark identities; instead, conservative replacements are connected with a lowercase x, and non-conservative substitutions are connected with an uppercase X.

If -markx=2, the residues in the second sequence are shown only if they differ from the first sequence.

-markx=3 displays the aligned library sequences without the query sequences; these can be used to build a primitive multiple alignment.

-markx=4 provides a graphical display of the boundaries of the alignments.

-markx=5 provides a combination of -markx=4 and -markx=0.

-markx=6 provides -markx=5 plus HTML formatting.

-markx=9 provides percent identity and coordinates with the initial list of high scores as well as the conventional -markx=0 alignments.

Use -markx=10 to get aligned sequences in the FastA "parsable" output format.

-native

Output native FastA+ formatted output.

-linesize=60, -lin

Lets you set the number of sequence symbols in each line of the alignment to any number between 60 and 200.

-batch, -bat

Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

Printed: September 9, 2005 16:21

Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.