TFASTA+

 

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

OUTPUT

INPUT FILES

RELATED PROGRAMS

RESTRICTIONS

ALGORITHM

CONSIDERATIONS

SUGGESTIONS

COMMAND-LINE SUMMARY

LOCAL DATA FILES

PARAMETER REFERENCE


FUNCTION

TFastA+ does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA+ translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

DESCRIPTION

Advantages of Plus “+” Programs:

- Plus programs can read sequences in a variety of native formats, such as GCG RSF, GCG SSF, GCG MSF, GenBank, EMBL, FastA, SwissProt, PIR, and BSML, without conversion.

- Plus programs remove the sequence length restriction of 350,000 bp.

If you do not need these features and want more interactivity, consider running the original version of the program.

TFastA+ uses the method of Pearson and Lipman (Proc. Natl. Acad. Sci. USA 85; 2444-2448 (1988)) to search for similarities between a query protein sequence and any group of nucleotide sequences. TFastA+ translates the nucleotide sequences in all six frames before performing the comparison. Each translated reading frame is treated as a separate sequence to be searched.
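The six-frame translation described above can be sketched in Python. This is only an illustration, not the code TFastA+ runs. The names `translate` and `six_frames` are hypothetical, the standard genetic code is assumed, the frame numbering (1 to 3 on the forward strand, 4 to 6 on the reverse complement) is an assumption, and stop codons are written as `*` purely for readability.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# no alternative translation tables are handled in this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; codons with ambiguity symbols become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return a dict mapping frame number (1-6) to its translated sequence."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])      # forward strand
        frames[offset + 4] = translate(reverse[offset:])  # reverse strand
    return frames
```

For example, `six_frames("ATGGCCTAA")[1]` gives `MA*`. Each of the six values would then be searched as a separate sequence.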

In the first step of this search, the comparison can be viewed as a set of dot plots, with the query as the vertical sequence and the group of sequences to which the query is being compared as the different horizontal sequences. This first step finds the registers of comparison (diagonals) having the largest number of short perfect matches (words) for each comparison. In the second step, these "best" regions are rescored using a scoring matrix that allows conservative replacements, ambiguity symbols, and runs of identities shorter than the size of a word. In the third step, the program checks to see if some of these initial highest-scoring diagonals can be joined together. Finally, the search set sequences with the highest scores are aligned to the query sequence for display.
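The first step, finding the diagonals with the most word matches, can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the function name `diagonal_word_counts` is hypothetical, a diagonal is identified by the difference between the subject position and the query position, and only raw word counts are kept. The rescoring and joining steps are not shown.

```python
from collections import Counter, defaultdict


def diagonal_word_counts(query, subject, ktup=2):
    """Count perfect word matches on each diagonal (subject pos - query pos)."""
    # Index every word of length ktup in the query by its start position.
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)

    # Scan the subject; each shared word adds one hit to its diagonal.
    counts = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            counts[j - i] += 1
    return counts


counts = diagonal_word_counts("ACDEF", "XXACDEFYY")
print(counts.most_common(1))  # the diagonal holding the most word matches
```

In this toy comparison all four shared words fall on the same diagonal, which is the kind of region that the later steps would rescore with a scoring matrix.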

What is a Word?

A word is any short sequence (n-mer or k-tuple) where you have set n to some small integer less than or equal to six. The word GGATGG is one of the 4,096 possible words of length six that can be created from an alphabet consisting of the four letters G, A, T, and C. The word QL is one of the 400 possible words of length two that you can make with the 20 letters of the amino acid alphabet.
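The word counts quoted above follow from raising the alphabet size to the word length. A small sketch (the helper names `possible_words` and `words_in` are hypothetical) makes the arithmetic concrete and shows the overlapping words contained in a sequence:

```python
def possible_words(alphabet_size, word_length):
    """Number of distinct words of a given length over an alphabet."""
    return alphabet_size ** word_length


def words_in(sequence, word_length):
    """All overlapping words of the given length, in order of occurrence."""
    return [
        sequence[i:i + word_length]
        for i in range(len(sequence) - word_length + 1)
    ]


print(possible_words(4, 6))   # nucleotide words of length six
print(possible_words(20, 2))  # amino acid words of length two
print(words_in("GGATGG", 2))
```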

EXAMPLE

Here is a session using TFastA+ to identify sequences in the GenBank nucleotide sequence database that contain translated regions similar to a human globin protein:

 
TFastA+ does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA+ translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"
 
 
tfasta+ with what query sequence(s) ? ggamma.pep
Begin (* 1 *) ?
 
End (-1 for entire sequence) (* -1 *) ?
Enter value for search set (*Default DB*) ?
What should I call the output file (* <sequence_name>.<program_name> *) ?
 
 
# $GCGROOT/bin/tfasta34_native -O /var/tmp/bslskBAAO2a4Ro.tmp -E 2.0 -b 10 -T 1 /var/tmp/bslskDAAQ2a4Ro.fa "Genbank:* 17"
 TFASTA translates and searches a DNA sequence data bank
 version 3.4t21 May 14, 2003
Please cite:
 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
 
Query library /var/tmp/bslskDAAQ2a4Ro.fa vs Genbank:* library
searching Genbank:* 17 library
 
  1>>>GGAMMA.PEP TRANSLATE of: ggamma.seq check: 7694 from: 1 to: 1700 - 566 aa
 vs  Genbank:* library
 
 103380 residues in    14 sequences
 MLE_cen statistics: Lambda= 0.1223;  K=0.0006321 (cen=0)
 
TFASTA (3.45 Mar 2002) function [optimized, BL50 matrix (15:-5)] ktup: 2
 join: 37, opt: 37, open/ext: -14/-2, width:  16
 Scan time:  0.490
The best scores are:                                       opt bits E(14)
AB107101 AB107101 Homo sapiens chromosome 3 cl (91914) [4]  129  33.4   0.065
AB107101 AB107101 Homo sapiens chromosome 3 cl (91914) [5]  128  33.2   0.073
AB107101 AB107101 Homo sapiens chromosome 3 cl (91914) [6]  108  29.7    0.82
A00001 A00001 Cauliflower mosaic virus satelli ( 335) [2]   56  20.5     1.7
A00001 A00001 Cauliflower mosaic virus satelli ( 335) [6]   49  19.3     3.6
AAB2MCG1 AF032092 Aotus azarai beta-2-microglo ( 289) [3]   46  18.7     4.4
AB000556 AB000556 Synthetic unidentified bacte ( 924) [2]   54  20.2     5.1
A16SRRNA X87617 Actinomycete (genus unknown) 1 (1497) [3]   55  20.3     6.6
AA12SRRNA X67626 A.australis mitochondrial gen ( 386) [6]   43  18.2     7.2
AAG311130 AJ311130 Apodemus agrarius mitochond ( 955) [3]   48  19.1     8.7
 
 
566 residues in 1 query   sequences
103380 residues in 14 library sequences
 Scomplib [34t21]
 start: Fri Dec  3 14:58:47 2004 done: Fri Dec  3 14:58:51 2004
 Total Scan time:  0.490 Total Display time:  2.310
 
Function used was TFASTA [version 3.4t21 May 14, 2003]

OUTPUT

!!SEQUENCE_LIST 1.0

# /u/biobuild/dsgcg/nightly/solaris/bin/tfasta34_native -O /var/tmp/bslskBAAO2a4Ro.tmp -E 2.0 -b 10 -T 1 /var/tmp/bslskDAAQ2a4Ro.fa "Genbank:* 17"

 TFASTA translates and searches a DNA sequence data bank

 version 3.4t21 May 14, 2003

Please cite:

 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

 

 GGAMMA.PEP, 566 aa

 vs Genbank:* library

 

 103380 residues in    14 sequences

 MLE_cen statistics: Lambda= 0.1223;  K=0.0006321 (cen=0)

 

TFASTA (3.45 Mar 2002) function [optimized, BL50 matrix (15:-5)] ktup: 2

 join: 37, opt: 37, open/ext: -14/-2, width:  16

The best scores are:                                       opt bits E(14)

 

..

Genbank:AB107101 Begin: 9933 End: 9871

! AB107101 Homo sapiens chromosome 3 clone 15N2...(4) 125 161 129 87.4 0.065

Genbank:AB107101 Begin: 9932 End: 9879

! AB107101 Homo sapiens chromosome 3 clone 15N2...(5) 128 248 128 86.5 0.073

Genbank:AB107101 Begin: 9955 End: 9887

! AB107101 Homo sapiens chromosome 3 clone 15N2...(6) 107 241 108 67.4 0.82

Genbank:A00001  Begin: 80 End: 214

! A00001 Cauliflower mosaic virus satellite cDNA....(2) 43 43 56 61.6 1.7

Genbank:A00001  Begin: 282 End: 217

! A00001 Cauliflower mosaic virus satellite cDNA....(6) 49 49 49 54.9 3.6

Genbank:AAB2MCG1 Begin: 129 End: 188

! AF032092 Aotus azarai beta-2-microglobulin pr...(3) 40 40 46 53.2 4.4

Genbank:AB000556 Begin: 485 End: 637

! AB000556 Synthetic unidentified bacterium/pla...(2) 49 97 54 51.8 5.1

Genbank:A16SRRNA Begin: 663 End: 725

! X87617 Actinomycete (genus unknown) 16S ribos...(3) 51 71 55 49 6.6

Genbank:AA12SRRNA Begin: 366 End: 304

! X67626 A.australis mitochondrial gene for 12...(6) 43 43 43 48.1 7.2

Genbank:AAG311130 Begin: 576 End: 638

! AJ311130 Apodemus agrarius mitochondrial 12S...(3) 48 48 48 45.8 8.7

\\End of List

 

>>Genbank:AB107101 AB107101 Homo sapiens chromosome 3 clone 15N2  (91914 aa)

Frame: 4 initn: 161 init1: 125 opt: 129  Z-score: 87.4  bits: 33.4 E(): 0.065

banded Smith-Waterman score: 129;  60.870% identity (66.667% ungapped) in 23 aa overlap (364-386:9933-9871)

 

           340       350       360       370       380       390  

GGAMMA PLIPDGGKVCPGVRNN*NIWAGVDFESQLCVCVCVCARVCLCVCESVCFF*RFQPTAYRV

                                     ::::::.  :.::: :::.:  .      

AB1071 LYWEKTLGKGIRSSQCFEEDKLMGN*FVWCVCVCVCV--CVCVCLSVCLFCSYFRQAG*E

          10000      9970      9940        9910      9880         

 

           400       410       420       430       440       450  

GGAMMA HGGKKITRFKLWPVTSAARRTTTCI*WESKISGFEGS*HRLDSGWKLGV*LSGGQAGALS

                                                                  

AB1071 RLL*GVTLGSND*NKSAMQRSGQESFRKSK*QLQKPQSQNELDSLQKWKEDMSALGNTGH

   9850      9820      9790      9760      9730      9700         

 

>>Genbank:AB107101 AB107101 Homo sapiens chromosome 3 clone 15N2  (91914 aa)

Frame: 5 initn: 248 init1: 128 opt: 128  Z-score: 86.5  bits: 33.2 E(): 0.073

banded Smith-Waterman score: 128;  77.778% identity (77.778% ungapped) in 18 aa overlap (365-382:9932-9879)

 

          340       350       360       370       380       390   

GGAMMA LIPDGGKVCPGVRNN*NIWAGVDFESQLCVCVCVCARVCLCVCESVCFF*RFQPTAYRVH

                                     :::::. ::.::: ::::           

AB1071 YTGRRH*VKE*DHLSVLKKIN*WEISLYGVCVCVCVCVCVCVCLSVCFVVTLDKLVKKDF

   10020      9990      9960      9930      9900      9870        

 

          400       410       420       430       440       450   

GGAMMA GGKKITRFKLWPVTSAARRTTTCI*WESKISGFEGS*HRLDSGWKLGV*LSGGQAGALSS

                                                                  

AB1071 FEG*L*AVMIRTNQPCKEVDKSLSGRANSNYKSPKARMSLTVYRSGKKTCLLWATLVMMK

    9840      9810      9780      9750      9720      9690         

 

>>Genbank:AB107101 AB107101 Homo sapiens chromosome 3 clone 15N2  (91914 aa)

Frame: 6 initn: 241 init1: 107 opt: 108  Z-score: 67.4  bits: 29.7 E(): 0.82

banded Smith-Waterman score: 108;  56.522% identity (56.522% ungapped) in 23 aa overlap (360-382:9955-9887)

 

     330       340       350       360       370       380        

GGAMMA TNLYPLIPDGGKVCPGVRNN*NIWAGVDFESQLCVCVCVCARVCLCVCESVCFF*RFQPT

                                     ..:   ::::. ::.::: :::.      

AB1071 LCFHGLHTILGEDTR*RNKIISVF*RR*TDGKLVCMVCVCVCVCVCVCVSVCLSVL*LL*

    10040     10010      9980      9950      9920      9890       

 

     390       400       410       420       430       440        

GGAMMA AYRVHGGKKITRFKLWPVTSAARRTTTCI*WESKISGFEGS*HRLDSGWKLGV*LSGGQA

                                                                   

AB1071 TSWLRKTSLRGDFRQ**LEQISHAKKWTRVFQEEQIATTKAPKPE*A*QFTEVERRHVCF

     9860      9830      9800      9770      9740      9710       

 

>>Genbank:A00001 A00001 Cauliflower mosaic virus satellite cDNA.  (335 aa)

Frame: 2 initn:  43 init1:  43 opt:  56  Z-score: 61.6  bits: 20.5 E():  1.7

banded Smith-Waterman score: 56;  24.490% identity (27.273% ungapped) in 49 aa overlap (87-134:80-214)

 

         60        70        80        90       100        110    

GGAMMA R*ALVTRTREGRKDPVPGKSPGRFSGFVAPSDCQTVLVNLTGSW-LSTHGPRGSLTALAT

                                     .   .:.   ::.  :...   ..  .: .

A00001     FCLMENCAEGLYLREDLSLGGVGYLPAKAG*VMFPRTGDRWLASYVRYSQYYTLI*

                20        50        80       110       140        

 

         120       130       140       150       160       170    

GGAMMA CPLPLPSWATPKSRHMARRC*LPWEMP*STWMISRAPLPS*VNCTVTSCMWILRTSR*VQ

        :  . :     .:::.::                                        

A00001 APAQFASR----TRHMVRRYHGISKETLC*VV*VMTHAGRG*GLCYADLRECLSFLHRT

     170           200       230       260       290       320    

 

>>Genbank:A00001 A00001 Cauliflower mosaic virus satellite cDNA.  (335 aa)

Frame: 6 initn:  49 init1:  49 opt:  49  Z-score: 54.9  bits: 19.3 E():  3.6

banded Smith-Waterman score: 49;  22.727% identity (22.727% ungapped) in 22 aa overlap (118-139:282-217)

 

        90       100       110       120       130       140      

GGAMMA DCQTVLVNLTGSWLSTHGPRGSLTALATCPLPLPSWATPKSRHMARRC*LPWEMP*STWM

                                      :::. .  ..  ..    .::       

A00001              VLCRNDRHSRRSA*HKP*PLPACVMTHTT*QSVSFEIPWYRRTMCRV

                         310       280       250       220        

 

       150       160       170       180       190       200      

GGAMMA ISRAPLPS*VNCTVTSCMWILRTSR*VQEMFQHCCL*SRGNLDN*VLI*AQQGVSCLKIL

                                                                  

A00001 LLAN*AGAQMRV*Y*E*RT*LASQRSPVRGNITQPALAGRYPTPPSDRSSRRYNPSAQFS

     190       160       130       100        70        40        

 

>>Genbank:AAB2MCG1 AF032092 Aotus azarai beta-2-microglobulin pr  (289 aa)

Frame: 3 initn:  40 init1:  40 opt:  46  Z-score: 53.2  bits: 18.7 E():  4.4

banded Smith-Waterman score: 46;  30.769% identity (42.105% ungapped) in 26 aa overlap (295-319:129-188)

 

          270       280       290       300       310        320  

GGAMMA FTFPLFLLFVLKHLSGGRTSMVVKKMQAEGIYWLSQSGELWWPNIHC*GYSYI-SWTHIK

                                       ::..    :::      ::   .:   

AAB2MC ALS*LAVPDSV*HK*RRRVARALLQRTTLGSRWLASW---WWPC---SCYSLCLAWRLSS

             60        90       120          150          180     

 

           330       340       350       360       370       380  

GGAMMA CC*CFITNLYPLIPDGGKVCPGVRNN*NIWAGVDFESQLCVCVCVCARVCLCVCESVCFF

                                                                  

AAB2MC VSLSSRPALVLPLPLPPSVAVSVLSGFVT                              

        210       240       270                                   

 

>>Genbank:AB000556 AB000556 Synthetic unidentified bacterium/pla  (924 aa)

Frame: 2 initn:  97 init1:  49 opt:  54  Z-score: 51.8  bits: 20.2 E():  5.1

banded Smith-Waterman score: 54;  21.569% identity (22.449% ungapped) in 51 aa overlap (368-416:485-637)

 

       340       350       360       370       380       390      

GGAMMA DGGKVCPGVRNN*NIWAGVDFESQLCVCVCVCARVCLCVCESVCFF*RFQPTAYRVHGGK

                                          :   :    .  : . :. :.  ..

AB0005 RRSTLKPGHVT*RVCARCASTTV*CTATNCRRPTNCSLRCSVSTWPSR*SKTTARASPSS

          410       440       470       500       530       560   

 

       400         410       420       430       440       450    

GGAMMA KITRFKL--WPVTSAARRTTTCI*WESKISGFEGS*HRLDSGWKLGV*LSGGQAGALSSL

       .  : .   ::..:. ::...                                      

AB0005 SACRPRRTTWPLSSTRRRASSITSHSSWKPGKTCFAPPT*SP*PTPRSISARPVTV*PTA

          590       620       650       680       710       740   

 

>>Genbank:A16SRRNA X87617 Actinomycete (genus unknown) 16S ribos  (1497 aa)

Frame: 3 initn:  71 init1:  51 opt:  55  Z-score: 49.0  bits: 20.3 E():  6.6

banded Smith-Waterman score: 55;  33.333% identity (33.333% ungapped) in 21 aa overlap (393-413:663-725)

 

            370       380       390       400       410       420 

GGAMMA CVCVCVCARVCLCVCESVCFF*RFQPTAYRVHGGKKITRFKLWPVTSAARRTTTCI*WES

                                     . ::  ...  :: .:.: .:        

A16SRR RA*LWACSGYGQARVW*GRLEFLV*R*NAQISGGTPVAKAGLWATTDAEERKHGERTGLD

              600       630       660       690       720       750

 

            430       440       450       460       470       480 

GGAMMA KISGFEGS*HRLDSGWKLGV*LSGGQAGALSSLWVHLYCLLSSQQLLGNVLVTVLAIHFG

                                                                  

A16SRR TLVVHAVNVGR*VWGTFHVFCAAANALSAPPGEYGRKAKTQRN*RGPAQAAEHAD*FDAT

              780       810       840       870       900       930

 

>>Genbank:AA12SRRNA X67626 A.australis mitochondrial gene for 12  (386 aa)

Frame: 6 initn:  43 init1:  43 opt:  43  Z-score: 48.1  bits: 18.2 E():  7.2

banded Smith-Waterman score: 43;  28.571% identity (28.571% ungapped) in 21 aa overlap (253-273:366-304)

 

            230       240       250       260       270       280 

GGAMMA QWQCFRA*GVPLKI*MDNFDFEKREVEMRKMTFLY*ISVERTFTFPLFLLFVLKHLSGGR

                                      :::. .... .:.  .  ::         

AA12SR                         PRAGLK*TFLFCFTAKSSF*GGFHTLFRSILF*KM*

                                 370       340       310       280

 

            290       300       310       320       330       340 

GGAMMA TSMVVKKMQAEGIYWLSQSGELWWPNIHC*GYSYISWTHIKCC*CFITNLYPLIPDGGKV

                                                                  

AA12SR PISPISWAIP*PVLLAGVAVVLAALSF*AGWRRRYVGCVGKRWLGVSWVIDYRTGSSRWV

             250       220       190       160       130       100

 

>>Genbank:AAG311130 AJ311130 Apodemus agrarius mitochondrial 12S  (955 aa)

Frame: 3 initn:  48 init1:  48 opt:  48  Z-score: 45.8  bits: 19.1 E():  8.7

banded Smith-Waterman score: 48;  38.095% identity (38.095% ungapped) in 21 aa overlap (107-127:576-638)

 

         80        90       100       110       120       130     

GGAMMA PGRFSGFVAPSDCQTVLVNLTGSWLSTHGPRGSLTALATCPLPLPSWATPKSRHMARRC*

                                     :..  .::.     :: :.::.       

AAG311 QRTTSYSLKLKGLGGTLYPPRGACSIIDKPRSTSPSLANSAYIPPSSANPKKELK*AQEF

             510       540       570       600       630       660

 

        140       150       160       170       180       190     

GGAMMA LPWEMP*STWMISRAPLPS*VNCTVTSCMWILRTSR*VQEMFQHCCL*SRGNLDN*VLI*

                                                                   

AAG311 SIKTLGQGVANEMGRNGLHFLIKEHLRNPL*N*RIKEDLVVN*E*RA*LN*AMKYAHTAR

             690       720       750       780       810       840

 

 

 

 

566 residues in 1 query   sequences

103380 residues in 14 library sequences

 Scomplib [34t21]

 start: Fri Dec  3 14:58:47 2004 done: Fri Dec  3 14:58:51 2004

 Total Scan time:  0.490 Total Display time:  2.310

 

Function used was TFASTA [version 3.4t21 May 14, 2003]

 

 

 

INPUT FILES

TFastA+ accepts a single protein sequence as the query sequence. The search set is either a single nucleic acid sequence or multiple nucleic acid sequences. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example Genbank:*. If TFastA+ rejects your protein sequence, turn to Appendix VI to see how to change or set the type of a sequence.

 

RELATED PROGRAMS

TFastA does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FastA+ does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA+ may be more sensitive than BLAST+.

BLAST+ searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST+ can produce gapped alignments for the matches it finds. NetBLAST+ searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST+ can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.

SSearch+ does a rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). This may be the most sensitive method available for similarity searches. Compared to BLAST+ and FastA+, it can be very slow.

TFastX+ does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences, taking frameshifts into account. It is designed to be a replacement for TFastA+, and like TFastA+, it is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FastX+ does a Pearson and Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. FastX+ translates both strands of the nucleic sequence before performing the comparison. It is designed to answer the question, "What implied protein sequences in my nucleic acid sequence are similar to sequences in a protein database?"

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.

ProfileSearch and MotifSearch use a profile (derived from a set of aligned sequences) instead of a query sequence to search a collection of sequences. FindPatterns+ uses a pattern described by a regular expression to search a collection of sequences. HmmerSearch uses a profile hidden Markov model as a query to search a sequence database to find sequences similar to the family from which the profile HMM was built. Profile HMMs can be created using HmmerBuild.

StringSearch, LookUp, and Names identify sequences by searching the annotation (non-sequence) portions of sequence files or sequence databases.

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST.

BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds. NetBLAST searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.

SSearch does a rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). This may be the most sensitive method available for similarity searches. Compared to BLAST and FastA, it can be very slow.

TFastX does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences, taking frameshifts into account. It is designed to be a replacement for TFastA+, and like TFastA+, it is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FastX does a Pearson and Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. FastX translates both strands of the nucleic sequence before performing the comparison. It is designed to answer the question, "What implied protein sequences in my nucleic acid sequence are similar to sequences in a protein database?"

RESTRICTIONS

The query sequence may not be longer than 20,000 symbols. You cannot select a list size of more than 1,000 best scores nor view more than 1,000 alignments. The word size must be either 1 or 2.

For the estimates of statistical significance to be valid, the search set must contain a large sample of unrelated sequences. The statistical estimates will not be calculated at all if there are fewer than 60 frames searched (equivalent to 10 sequences when all six frames are searched, or 20 sequences when only the top three frames are searched).
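The 60-frame threshold can be checked with simple arithmetic. The sketch below is illustrative only: the function name `statistics_available` and the constant name are hypothetical, and it encodes nothing beyond the rule stated above.

```python
MIN_FRAMES_FOR_STATISTICS = 60  # threshold stated in the restriction above


def statistics_available(n_sequences, frames_per_sequence=6):
    """True when enough frames are searched for E() values to be calculated."""
    return n_sequences * frames_per_sequence >= MIN_FRAMES_FOR_STATISTICS


print(statistics_available(10))     # six frames each: 60 frames
print(statistics_available(9))      # six frames each: 54 frames
print(statistics_available(20, 3))  # three frames each: 60 frames
```

Meeting the threshold only means the estimates are calculated. They are still valid only if the search set is a large sample of unrelated sequences.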

With -nooptall, the estimates of statistical significance will not be accurate.

The FastA suite of programs works with flat-file databases only. You cannot specify BLAST databases as the search set for TFastA+.

On Tru64 (OSF), TFastA+ can fail with the following error message:

While running the child process: Child was terminated by signal 6 (SIGABRT)

Error in cleaning up after application: Exception: Error reading fast program output: Unable to open tfasta output file: "/tmp/bslskAAAMGXMCf.tmp" (at /tmp/bslskAAAMGXMCf.tmp:0).

 

Workaround

There is an upper limit on the amount of memory that can be allocated per process. On Tru64 machines, the default limit for datasize is 128M. To raise this limit, execute

unlimit datasize (csh) or

ulimit -d unlimited (ksh)

This raises the datasize limit to 1024M, which is the maximum amount of memory that an individual process can use on a Tru64 machine. Because of this ceiling, the default setting of the search set parameter (-infile2) for the FastA suite of programs may still cause a crash. If it does, run the programs with a smaller subset of the search set. The programs have been tested successfully using a search set of 400 thousand sequences.

 

 

ALGORITHM

For a description of the algorithm, see the FastA+ program documentation.

CONSIDERATIONS

TFastA+ treats each reading frame as a different sequence. If a nucleotide sequence contains a gene coding for a protein similar to your query, but with an intervening sequence that changes the reading frame, the program will find and display two matches, one for each reading frame. If the individual matches each have fairly low scores, they may not make the list of best scores. If you suspect that the gene for your query sequence contains intervening sequences, or if you are searching a nucleotide database known to contain sequencing errors that may cause a frameshift (such as the EST division of GenBank), use TFastX+ instead of TFastA+.

TFastA+ translates stop codons in search set sequences to the sequence symbol X.

The E() scores are affected by similarities in sequence composition between the query sequence and the search set sequence. Unrelated sequences may have "significant" scores because of composition bias.

If there is a database entry that overlaps your query in several places, but there are large gaps between the matching regions, only the best overlap appears in the alignment display.

There are two ways to control the size of the list of best scores. By default, scores are listed until a specific E() value is reached. You may set the value in response to the program prompt or by using -expect; otherwise the program uses 10.0 for protein searches, 2.0 for nucleic acid searches. (If you are running the program interactively, it will show no more than 40 scores initially, and ask if you want to see more scores if there are any more that are less than the E() value.)

If you use -listsize, the E() value is ignored, and the program will list the number of scores you requested.
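The two ways of limiting the list can be summarized in a short sketch. It is an assumption-laden illustration rather than the program's logic: the function name `select_scores` is hypothetical, the `expect` and `listsize` arguments merely mirror -expect and -listsize, and whether the E() cutoff is strict or inclusive is a guess.

```python
def select_scores(e_values, expect, listsize=None):
    """Pick which E() values would appear in the list of best scores."""
    ranked = sorted(e_values)  # best (smallest) E() values first
    if listsize is not None:
        # A list size overrides the E() cutoff entirely.
        return ranked[:listsize]
    return [e for e in ranked if e < expect]


# E() values taken from the sample output earlier in this document.
sample = [0.065, 0.073, 0.82, 1.7, 3.6, 4.4, 5.1, 6.6, 7.2, 8.7]
print(len(select_scores(sample, expect=2.0)))
print(len(select_scores(sample, expect=2.0, listsize=10)))
```

With the sample values, a cutoff of 2.0 alone keeps four scores, while a list size of 10 keeps all ten regardless of their E() values.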

You can control the number of alignments using -noalign and -align. The program behaves differently depending on whether it is being run noninteractively (in batch or with -Default on the command line) or interactively. In the noninteractive case, the program displays the number of alignments set by -align. (If this is not present, it shows 40 alignments or the number of scores that were listed, whichever is smaller.) If you run the program interactively, it displays the list of best scores, and then asks you how many alignments you want to see. (This prompt does not appear if you use -noalign or -align.)

Increasing Sensitivity By Adjusting Word Size

By default, TFastA+ uses a word size of 2. If it finds few or no matches, especially if your query sequence is short, rerun the search using -wordsize=1 to increase the sensitivity. Note that this will dramatically increase the amount of CPU time required to run the program.

Adjusting Gap Creation and Extension Penalties

Unlike other GCG programs, TFastA+ does not read the default gap creation and gap extension penalties from the scoring matrix file. It uses default gap creation and extension penalties that were empirically determined to be appropriate for the BLOSUM50 scoring matrix. If you select a different scoring matrix with -matrix, you may need to change the gap penalties. The histogram display gives a qualitative view of the quality of fit between the actual distribution of scores and the expected distribution of scores. This information may indicate whether or not suitable gap creation and extension penalties were used for the search. When the histogram shows poor agreement between the actual distribution and the theoretical distribution, you might consider using -gapweight and/or -lengthweight to specify higher gap creation and extension penalties, respectively. For example, you might increase the gap creation penalty from 16 to 20 and the gap extension penalty from 4 to 6.

Differences in Applying Gap Extension Penalties

There are two different philosophies on how to penalize gaps in an alignment. One way is to penalize a gap by the gap creation penalty plus the extension penalty times the length of the gap (gapweight + (lengthweight x gap length)). The other way is to use the gap creation penalty plus the extension penalty times the gap length excluding the first residue in the gap (gapweight + (lengthweight x (gap length - 1))).
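The two conventions can be compared directly. In this sketch the function names are hypothetical, and the weights 16 and 4 are illustrative values borrowed from the gap penalty example given earlier in this section.

```python
def gap_penalty_full_length(gapweight, lengthweight, gap_length):
    """First convention: every residue in the gap pays the extension penalty."""
    return gapweight + lengthweight * gap_length


def gap_penalty_excluding_first(gapweight, lengthweight, gap_length):
    """Second convention: the first residue in the gap is not extended."""
    return gapweight + lengthweight * (gap_length - 1)


for length in (1, 3, 10):
    print(
        length,
        gap_penalty_full_length(16, 4, length),
        gap_penalty_excluding_first(16, 4, length),
    )
```

For any gap length, the two results differ by exactly one lengthweight, so the same -lengthweight value does not produce equivalent penalties under the two conventions.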

"Native" GCG programs, such as Framesearch and Bestfit, handle gap extension penalties the first way, while the FastA+-family programs use the second way. Therefore a value for -lengthweight that gives good results with one of the FastA+-family programs may not give equivalent results with a native GCG program, and vice versa.

Increasing Program Speed Using Multithreading

This program is multithreaded. It has the potential to run faster on a machine equipped with multiple processors because different parts of the analysis can be run in parallel on different processors. By default, the program assumes you have one processor, so the analysis is performed using one thread. You can use -processors to increase the number of threads up to the number of physical processors on the computer.

Under ideal conditions, the increase in speed is roughly linear with the number of processors used. But conditions are rarely ideal. If your computer is heavily used, competition for the processors can reduce the program's performance. In such an environment, try to run multithreaded programs during times when the load on the system is light.

As the number of threads increases, the amount of memory required increases substantially. You may need to ask your system administrator to increase the memory quota for your account if you want to use more than two threads.

Never use -processors to set the number of threads higher than the number of physical processors that the machine has -- it does not increase program performance, but instead uses up a lot of memory needlessly and makes it harder for other users on the system to get processor time. Ask your system administrator how many processors your computer has if you aren't sure.

SUGGESTIONS

Identifying the Search Set

If you want to search a single database division instead of an entire database, see the "Using Database Sequences" topic of Section 2, Using Sequence Files and Databases of the User's Guide for a list of the logical names used for the databases and the divisions of each database. The search set can also consist of a group of sequence files that are not in a database. Use a multiple sequence specification to name these. For information about naming groups of sequences for the search set, see the topics "Specifying Files" and "Using Wildcards" in Section 1, Getting Started, and "Using Database Sequences," "Using Multiple Sequence Format (MSF) Files", "Using Rich Sequence Format (RSF) Files", and "Using List Files" in Section 2, Using Sequence Files and Databases of the User's Guide.

Batch Queue

TFastA+ is one of the few programs in Accelrys GCG (GCG) that can take more than a few minutes to run. Most comparisons should probably be run in the batch queue. You can specify that this program run at a later time in the batch queue by using -batch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Section 3, Using Programs in the User's Guide. Very large comparisons may exceed the CPU limit set by some systems.

Interrupting a Search: <Ctrl>C

You can type <Ctrl>C to interrupt a search and see the results from the part of the search that has already been completed. Because the program is multithreaded, the search may not be interrupted immediately, but will continue until one of the threads finishes processing its data and returns for more data.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -check to view the summary below and to specify parameters before the program executes. In the syntax summary below, square brackets ([ and ]) enclose parameter values that are optional. For each program parameter, square brackets enclose the type of parameter value specified, the default parameter value, and shortened forms of the parameter name (aliases). Programs with a plus in the name accept either the full parameter name or a specified alias. If “Type” is “Boolean”, then the presence of the parameter on the command line indicates a true condition. A false condition must be stated explicitly as parameter=false.

TFastA+ does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA+ translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

Minimal Syntax: % tfasta+ [-infile1=]value -Default
 
Minimal Parameters (case-insensitive):
 
-infile1        [Type: List / Default: EMPTY / Aliases: infile in1 in]
                Input files specification.
 
Prompted Parameters (case-insensitive):
 
-begin          [Type: Integer / Default: '1' / Aliases: beg]
                Starting point of the range of interest in the input sequence.
 
-end            [Type: Integer / Default: '-1']
                End point of the range of interest in the input sequence. A value of '-1' indicates that the range extends to the end of the input sequence.
 
-infile2        [Type: List / Default: EMPTY / Aliases: in2 db]
                Search set specification.
 
-outfile        [Type: OutFile / Default: '<sequence_name>.<program_name>' / Aliases: out]
                File to which output is written. A value of '-' means STDOUT. Specifying this option also turns on the 'concat' option. Default value is '-'.
 
Optional Parameters (case-insensitive):
 
-check          [Type: Boolean / Default: 'false' / Aliases: che help]
                Prints out this usage message.
 
-default        [Type: Boolean / Default: 'false' / Aliases: d def]
                Specifies that sensible default values be used for all parameters where possible.
 
-documentation  [Type: Boolean / Default: 'true' / Aliases: doc]
                Prints banner at program startup.
 
-quiet          [Type: Boolean / Default: 'false' / Aliases: qui]
                Tells application to print only a minimal amount of information.
 
-wordsize       [Type: Integer / Default: EMPTY / Aliases: wor]
                Size of word (k-tuple) used in the hashing step.
 
-expect         [Type: Double / Default: '2.0' / Aliases: exp]
                Shows all scores whose E() value is less than the specified value of expect.
 
-matrix         [Type: String / Default: EMPTY / Aliases: mat]
                Assigns the scoring matrix for the comparison.
 
-processors     [Type: Integer / Default: '1' / Aliases: proc]
               On a multiprocessor computer, this parameter controls the number of threads to use for database search.
 
-minlength      [Type: Integer / Default: EMPTY / Aliases: minl]
               The search set is restricted to sequences whose length is more than the value specified by this parameter.
 
-maxlength      [Type: Integer / Default: EMPTY / Aliases: maxl]
               The search set is restricted to sequences whose length is less than the value specified by this parameter.
 
-pamfactor      [Type: Boolean / Default: 'DEFAULT_PARAM_VALUE' / Aliases: pam]
                This parameter governs whether a scoring matrix should be used for calculating initial diagonal scores, instead of using the identical match scores from the scoring matrix.
                Default is to use FASTA+ internal behavior, which differs for protein and nucleotide searches.
 
-gapweight      [Type: Integer / Default: EMPTY / Aliases: gap]
                This parameter specifies the gap creation penalty that is subtracted from an alignment every time a gap is created.
 
-lengthweight   [Type: Integer / Default: EMPTY / Aliases: len]
                This parameter specifies the gap extension penalty that is subtracted from an alignment every time a gap is extended by one residue.
 
-optall         [Type: Boolean / Default: 'DEFAULT_PARAM_VALUE' / Aliases: opt]
                With this parameter, the program immediately performs an alignment and calculates the opt score when the initn score is greater than or equal to the value specified by this parameter. This parameter allows you to override the default threshold calculated by the program. Scores are sorted and saved by opt score during the search.
                -NOOPTall doesn't compute the opt score until the search is complete. In this case scores are sorted and saved by initn score instead of by opt score.
 
-listsize       [Type: Integer / Default: '10' / Aliases: lis]
                This parameter controls the number of top scores shown. Overrides the -expect parameter.
 
-alignments     [Type: Integer / Default: '20' / Aliases: align ali]
                This parameter limits the number of alignments displayed in the output file to the specified number of best matches in the list. Use -noalign to suppress the sequence alignments in the output file.
 
 
-showall        [Type: Boolean / Default: 'DEFAULT_PARAM_VALUE' / Aliases: show]
                Shows entire sequences in the alignment display, instead of just the best region of overlap and its surroundings.
 
-native         [Type: Boolean / Default: 'false']
                Writes output in native FastA format.
 

-markx          [Type: Integer / Default: EMPTY / Aliases: mark]
                This parameter determines the alignment display mode, especially the symbols that identify matches and mismatches. The default value, -MARKx=0, uses a colon (:) to show identities and a period (.) to show conservative replacements.
                -MARKx=1 will not mark identities; instead, conservative replacements are connected with a lowercase x, and non-conservative substitutions are connected with an uppercase X.
                If -MARKx=2, the residues in the second sequence are shown only if they differ from the first sequence.
                -MARKx=3 displays the aligned library sequences without the query sequences; these can be used to build a primitive multiple alignment.
                -MARKx=4 provides a graphical display of the boundaries of the alignments.
                -MARKx=5 provides a combination of -MARKx=4 and -MARKx=0.
                -MARKx=6 provides -MARKx=5 plus HTML formatting.
                -MARKx=9 provides percent identity and coordinates with the initial list of high scores as well as the conventional -MARKx=0 alignments.
                Use -MARKx=10 to get aligned sequences in the FastA "parsable" output format.

 
-histogram      [Type: Boolean / Default: 'true' / Aliases: his]
                Start/Suppress printing the histogram.
 
-linesize       [Type: Integer / Default: EMPTY / Aliases: lin]
This parameter lets you set the number of sequence symbols in each line of the alignment to any number between 60 and 200.
 
-batch          [Type: Boolean / Default: 'false']
                Allows submitting a job to a batch queue.
 
-swalign        [Type: Boolean / Default: 'false' / Aliases: sw]
Does an unlimited Smith-Waterman alignment as the final alignment for the nucleotide searches, instead of 'alignment in a band'.
 
-dbtopstrand    [Type: Boolean / Default: 'false' / Aliases: dbtop]
                Translate and search only the top strand of search set sequences.
 
-dbbottomstrand [Type: Boolean / Default: 'false' / Aliases: dbbot]
                Translate and search only the bottom strand of search set sequences.
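
The Boolean convention described at the start of this summary (presence means true, false must be written as parameter=false) can be sketched as a small Python helper that assembles a command line. This is only an illustration: build_args is a hypothetical name, and the file names are made-up placeholders rather than files shipped with the package.

```python
def build_args(program, params):
    """Turn a dict of parameter values into a command-line argument list."""
    args = [program]
    for name, value in params.items():
        if value is True:
            args.append(f"-{name}")          # presence alone means true
        elif value is False:
            args.append(f"-{name}=false")    # false must be stated explicitly
        else:
            args.append(f"-{name}={value}")  # every other type: name=value
    return args


cmd = build_args("tfasta+", {
    "infile1": "myprotein.pep",  # hypothetical query file
    "infile2": "mydb.fasta",     # hypothetical search set
    "histogram": False,
    "default": True,
})
print(" ".join(cmd))
```

The resulting list can be handed to a process launcher as is, which avoids quoting problems that arise when a command is built as a single string.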

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -data1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either

1) Have a data file with exactly the same name as the program default scoring matrix in your current working directory; or

2) Have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name Share_Matrix; or

3) Name a file on the command line with an expression like -matrix=mymatrix.cmp. If you don't include a directory specification when you name a file with -matrix, the program searches for the file first in your local directory, then in the directory with the logical name Share_Matrix. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Section 4, Using Data Files in the User's Guide.

TFastA+ reads a scoring matrix containing the values for every possible match from your working directory or the public database. The default matrix is blosum50.cmp, which is a BLOSUM50 matrix. You can use the Fetch+ program to obtain a copy of this file if you need to modify it for your own needs.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. Shortened forms of the parameter name, aliases, are shown, separated by commas.

-infile1, -infile, -in1, -in

 

Input files specification.

 

-begin, -beg

 

Starting point of the range of interest in the input sequence.

 

-end

End point of the range of interest in the input sequence. A value of '-1' indicates that the range extends to the end of the input sequence.
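
The way -begin and -end select part of the query can be sketched in Python as follows. The sketch assumes coordinates are 1-based and inclusive, which matches the default -begin value of 1 but is not stated explicitly here; range_of_interest is a hypothetical name.

```python
def range_of_interest(sequence, begin=1, end=-1):
    """Return residues begin..end (1-based, inclusive); end=-1 means the last residue."""
    if end == -1:
        end = len(sequence)
    return sequence[begin - 1:end]


print(range_of_interest("MKTAYIAK"))        # defaults select the whole sequence
print(range_of_interest("MKTAYIAK", 2, 4))  # residues 2 through 4
```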

 

-infile2, -in2, -db

 

 Search set specification.

 

-outfile, -out

 

File to which output is written. A value of '-' means STDOUT. Specifying this option also turns on the 'concat' option. Default value is '-'.

 

-wordsize=2, -wor

Sets the size of the word (k-tuple) to use for the hashing step.
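
To make the role of the word size concrete, the following Python sketch indexes the words (k-tuples) of a query and counts perfect word matches on each diagonal of a comparison. It illustrates the general idea of the hashing step only and is not the program's actual implementation; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def word_positions(sequence, wordsize):
    """Map every word (k-tuple) in the sequence to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - wordsize + 1):
        index[sequence[i:i + wordsize]].append(i)
    return index


def diagonal_word_counts(query, target, wordsize=2):
    """Count perfect word matches on each diagonal (target offset minus query offset)."""
    index = word_positions(query, wordsize)
    counts = Counter()
    for j in range(len(target) - wordsize + 1):
        for i in index.get(target[j:j + wordsize], ()):
            counts[j - i] += 1
    return counts


counts = diagonal_word_counts("ACDEFG", "WWACDEFG", wordsize=2)
print(counts.most_common(1))  # the diagonal with the most word matches
```

A smaller word size finds more matches and is more sensitive but slower; a larger word size is faster but can miss weaker similarities.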

-matrix=mymatrix.cmp, -matr

Allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -matrix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.

For more information see the Local Scoring Matrices section.
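
The ordered directory search described above can be sketched in Python as a first-match lookup. This is an illustration only: find_matrix is a hypothetical name, and the caller must supply real directory paths in place of the logical names, since how those names resolve depends on your installation.

```python
import os


def find_matrix(filename, search_dirs):
    """Return the first existing path for filename, trying search_dirs in order."""
    if os.path.dirname(filename):
        # A directory specification was given, so no search takes place.
        return filename if os.path.isfile(filename) else None
    for directory in search_dirs:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


# Hypothetical directories standing in for the local directory and a shared one.
print(find_matrix("mymatrix.cmp", [".", "/hypothetical/shared/matrices"]))
```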

-check, -che, -help

 

Prints out this usage message.

 

-default, -d, -def

 

Specifies that sensible default values be used for all parameters where possible.

 

-documentation, -doc

 

Prints banner at program startup.

 

-quiet, -qui

 

This parameter is not supported.

 

-alignments, -align, -ali

 

This parameter limits the number of alignments displayed in the output file to the specified number of best matches in the list. Use -noalign to suppress the sequence alignments in the output file.

 

-histogram, -his

 

Start/suppress printing the histogram.

 

-expect=2.0, -exp

Shows all scores whose E() value is less than 2.0. Ignored if -listsize is used.

-processors=2, -proc

Tells the program to use 2 threads for the database search on a multiprocessor computer.

-pamfactor, -pam     

This parameter governs whether a scoring matrix should be used for calculating initial diagonal scores, instead of using the identical match scores from the scoring matrix.

Default is to use FASTA+ internal behavior, which differs for protein and nucleotide searches.

-minlength=1000, -minl

Restricts the search to search set sequences that are equal to or longer than 1000 residues.

-maxlength=5000, -maxl

Restricts the search to search set sequences that are equal to or shorter than 5000 residues.

-dbtopstrand, -dbtop

Translates and searches only the three forward reading frames.

-dbbottomstrand, -dbbot

Translates and searches only the three reverse complement reading frames.
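
The effect of the two strand parameters can be illustrated with a minimal Python sketch of six-frame translation. It assumes the standard genetic code, writes stop codons as '*', and writes codons containing ambiguity symbols as 'X'; these choices and all names are assumptions for illustration, not a description of the program's internals.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated with T, C, A, G in each position.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the bottom strand read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset):
    """Translate one reading frame starting at the given offset (0, 1, or 2)."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def translated_frames(dna, top=True, bottom=True):
    """Return the three forward and/or three reverse-complement translations."""
    dna = dna.upper()
    frames = []
    if top:
        frames += [translate(dna, k) for k in range(3)]
    if bottom:
        rc = reverse_complement(dna)
        frames += [translate(rc, k) for k in range(3)]
    return frames


print(translated_frames("ATGGCC"))                # all six frames
print(translated_frames("ATGGCC", bottom=False))  # top strand only
```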

-gapweight=12, -gap

Specifies the gap creation penalty that is subtracted from the alignment score whenever a gap is created.

-lengthweight=2, -len

Specifies the gap extension penalty that is subtracted from the alignment score for each residue added to an existing gap.
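
As a worked example of how the two penalties combine, the Python sketch below computes the total penalty for a single gap. It assumes the extension penalty is charged for every residue in the gap, including the first; the text above does not say whether the first residue is charged, so treat the formula as an assumption. The values 12 and 2 are the example values shown for -gapweight and -lengthweight.

```python
def gap_penalty(gap_length, gapweight=12, lengthweight=2):
    """Total penalty for one gap: creation penalty plus extension per residue.

    Assumes every gapped residue, including the first, incurs lengthweight.
    """
    if gap_length <= 0:
        return 0
    return gapweight + lengthweight * gap_length


for length in (1, 3, 10):
    print(length, gap_penalty(length))
```

Under this assumption one long gap is penalized less than several short gaps covering the same number of residues, because the creation penalty is paid only once.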

-optall=20, -opt

Immediately performs an alignment and calculates the opt score when the initn score is greater than or equal to 20. This parameter allows you to override the default threshold calculated by the program. Scores are sorted and saved by opt score during the search. -nooptall doesn't compute the opt score until the search is complete. In this case scores are sorted and saved by initn score instead of by opt score.

-swalign, -sw

Does an unlimited Smith-Waterman alignment as the final alignment for TFastA+ searches, instead of the "alignment in a band" version of Smith-Waterman. (Note: this can be very slow.)

-listsize=40, -lis

Shows the best 40 scores. Overrides -expect.
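
The interaction between -expect and -listsize can be sketched in Python as follows. The sketch ranks hits by E() value for simplicity, which is an assumption; the hit names and values are invented for illustration, and select_scores is a hypothetical name.

```python
def select_scores(hits, expect=2.0, listsize=None):
    """hits: list of (name, e_value) pairs. Return the entries to list, best first."""
    ranked = sorted(hits, key=lambda hit: hit[1])
    if listsize is not None:
        return ranked[:listsize]  # -listsize overrides -expect
    return [hit for hit in ranked if hit[1] < expect]


hits = [("seqA", 0.001), ("seqB", 5.0), ("seqC", 1.5)]  # made-up example data
print(select_scores(hits))              # only hits with E() below 2.0
print(select_scores(hits, listsize=3))  # best three, regardless of E()
```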

-showall, -show

Shows entire sequences in the alignment display, instead of just the best region of overlap and its surroundings.

-markx, -mark

This parameter determines the alignment display mode, especially the symbols that identify matches and mismatches. The default value, -markx=0, uses a colon (:) to show identities and a period (.) to show conservative replacements.

-markx=1 will not mark identities; instead, conservative replacements are connected with a lowercase x, and non-conservative substitutions are connected with an uppercase X.

If -markx=2, the residues in the second sequence are shown only if they differ from the first sequence.

-markx=3 displays the aligned library sequences without the query sequences; these can be used to build a primitive multiple alignment.

-markx=4 provides a graphical display of the boundaries of the alignments.

-markx=5 provides a combination of -markx=4 and -markx=0.

-markx=6 provides -markx=5 plus HTML formatting.

-markx=9 provides percent identity and coordinates with the initial list of high scores as well as the conventional -markx=0 alignments.

Use -markx=10 to get aligned sequences in the FastA "parsable" output format.
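
The symbol conventions for -markx=0 and -markx=1 can be illustrated with a short Python sketch that builds the marker line between two aligned sequences. Which substitutions count as conservative depends on the scoring matrix, so the sketch takes that test as a caller-supplied function; the function names, the blank used for gap columns, and the blank used for non-conservative substitutions in mode 0 are assumptions.

```python
def marker_line(query, library, is_conservative, markx=0):
    """Build the line of symbols shown between two aligned sequences.

    is_conservative(a, b) is a caller-supplied test (hypothetical here);
    only display modes 0 and 1 are sketched.
    """
    if markx == 0:
        identity, conservative, other = ":", ".", " "
    elif markx == 1:
        identity, conservative, other = " ", "x", "X"
    else:
        raise ValueError("only -markx=0 and -markx=1 are sketched")
    symbols = []
    for a, b in zip(query, library):
        if a == "-" or b == "-":
            symbols.append(" ")  # gap column: assumed to be left blank
        elif a == b:
            symbols.append(identity)
        elif is_conservative(a, b):
            symbols.append(conservative)
        else:
            symbols.append(other)
    return "".join(symbols)


def d_e_only(a, b):
    """Toy rule for the example: treat only D/E exchanges as conservative."""
    return {a, b} == {"D", "E"}


print(repr(marker_line("ACDE", "ACEK", d_e_only, markx=0)))
print(repr(marker_line("ACDE", "ACEK", d_e_only, markx=1)))
```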

 

-native

Writes output in native FastA+ format.

 

-linesize=60, -lin

Lets you set the number of sequence symbols in each line of the alignment to any number between 60 and 200.

-batch, -bat

Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

Printed: September 9, 2005 16:21



Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

www.accelrys.com/bio