FASTA PARSABLE OUTPUT

Introduction

This document may be useful for programmers and script writers, but can be skipped by most users of the FastA program family (FastA, FastX, TFastA, TFastX, and SSearch).

The standard alignment formats of the FastA program family are difficult to parse, which has made it hard to extract alignment information from the output file for further processing. A new command-line parameter, -MARKx=10, saves the alignments in a format that is easily parsed. The rest of this document describes the parsable output file.

Records

The output file has three types of records. The header record starts with >>>. It contains information about the search as a whole: which version of the program was used, which analysis parameters were used, and so on. There is only one header record per output file.

An alignment record contains information pertaining to a pairwise alignment, such as the scores for the alignment. It starts with >>. There will be one alignment record for each alignment that was saved.

Following each alignment record are two aligned sequence records, each of which starts with >. Each of these records contains the information for one of the sequences in the alignment: the length of the sequence, the beginning and end of the alignment in that sequence's coordinates, and so on.

The end of the parsable records is denoted with >>><<<.
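Because the three markers share a common prefix, a script has to test for the longest marker first. The following Python sketch shows one way to do that; the function name classify_line and the returned labels are illustrative and are not part of the FastA programs. It also assumes that leading whitespace on a line can be ignored.

```python
def classify_line(line):
    """Return the kind of record a line opens, or None for any other line."""
    text = line.lstrip()  # assumption: indentation is not significant
    # Test the longest markers first, because ">>>" also starts with ">>" and ">".
    if text.startswith(">>><<<"):
        return "end"
    if text.startswith(">>>"):
        return "header"
    if text.startswith(">>"):
        return "alignment"
    if text.startswith(">"):
        return "sequence"
    return None  # a parameter line, a sequence line, or a blank line
```

Lines for which the function returns None belong to the record that was opened most recently.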

Record Parameters

Information in each record consists of parameters and their values in a specific format. Parameters consist of a parameter tag, followed by an underscore, followed by the parameter's name. The complete format is:

; tag_name: value(s)
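A parameter line in this format can be split into its tag, name, and value with ordinary string operations. The Python sketch below is one possible approach; the function name parse_parameter is hypothetical. It splits on the first colon and the first underscore only, on the assumption that values may contain colons and names may contain underscores.

```python
def parse_parameter(line):
    """Split '; tag_name: value(s)' into (tag, name, value); None if the line does not fit."""
    text = line.strip()
    if not text.startswith(";"):
        return None
    # Split on the first colon only; the value may itself contain colons.
    key, colon, value = text[1:].partition(":")
    if not colon:
        return None
    # Split on the first underscore only; the name may contain underscores.
    tag, underscore, name = key.strip().partition("_")
    if not underscore:
        return None
    return tag, name, value.strip()
```

Values are returned as strings, because some parameters hold more than one value or free text.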

Parameters originating in William Pearson's FASTA package always have a two-character tag. Current FASTA tags are:

mp - main program information: name, version, statistical info, etc.
pg - program function information: function name and version, matrix used, etc.
fa - FastA results: scores, expect values, etc.
sw - Smith-Waterman results: scores, overlap values, etc.
sq - sequence information: length, type, etc.
al - alignment information: start, stop, display offset, etc.

Redistributors of the FASTA package may create their own parameters. If they do, they must use a tag with more than two characters, for example:

; ebi_access:  M61687

; gcg_ver:  9.0

Currently there are no parameters specific to Accelrys GCG (GCG).
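The tag-length rule lets a script tell FASTA parameters apart from redistributor additions without a list of known tags. A minimal Python sketch follows; the function name tag_origin and its return labels are illustrative only.

```python
def tag_origin(parameter):
    """Classify a parameter such as 'sw_score' or 'ebi_access' by the length of its tag."""
    tag = parameter.split("_", 1)[0]
    if len(tag) == 2:
        return "fasta"          # two-character tags come from the FASTA package
    if len(tag) > 2:
        return "redistributor"  # longer tags are added by a redistributor
    return "unknown"            # shorter tags are not described in this document
```

A script that only understands the FASTA parameters can use such a check to skip anything else.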

Interpreting Aligned Sequence Records

Most of the parameters specified by two-character tags correspond to values that are presented in other FastA output formats. A notable exception is parameters with the al tag:

al_start gives the location of the alignment start in the original sequence

al_stop gives the location of the end of the alignment in the original sequence

al_display_start gives the location of the first displayed residue in the original sequence. (This may not be the same as the first residue in the aligned region, because FastA provides some context for an alignment; even if the -SHOWall parameter is not used, FastA will try to provide about 30 residues on either side of the actual aligned region if the alignment is in the middle of one or the other sequence.)

Sequences may be padded with leading hyphens, if necessary. For example, if the beginning of the query sequence aligns with the tenth residue of the library sequence, then the query sequence will be padded with nine leading hyphens (-) to produce the alignment. The leading hyphens are a formatting convenience only; they are not considered in the numbering system for al_display_start, al_start, or al_stop.
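The numbering rule can be made concrete by labeling every column of a displayed sequence with its residue number. The Python sketch below is an illustration only; the function name number_columns is hypothetical. It treats every hyphen as a non-residue, whether it is leading padding or a gap inside the alignment.

```python
def number_columns(display, display_start=1):
    """Return one entry per column: the residue number, or None for a hyphen."""
    numbers = []
    next_residue = display_start  # al_display_start numbers the first displayed residue
    for char in display:
        if char == "-":
            numbers.append(None)  # hyphens never consume a residue number
        else:
            numbers.append(next_residue)
            next_residue += 1
    return numbers
```

For the string "--MK-L" with a display start of 1, the result is [None, None, 1, 2, None, 3].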

As an example, here is a pair of aligned sequence records:

     >gtm1_mouse ..

     ; sq_len: 217

     ; sq_offset: 1

     ; sq_type: p

     ; al_start: 3

     ; al_stop: 180

     ; al_display_start: 1

     ---PMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLN

     EKFKLGLDFPNLPYLIDGSHKITQSNAILRYLARKHH---LDGETEEERI

     RADIVENQVMDTRMQLIMLCYNPDFEKQKPEFLKTIPEKMKLYSEFLGKR

     PWFAGDKVTYVDFLAYDILDQYRMFEPKCLDA------FPNLRDFLARFE

     GLKKISAYMKSSRYIATPIFSKMAHWSNK

     >GTX2_TOBAC ..

     ; sq_len: 223

     ; sq_type: p

     ; al_start: 6

     ; al_stop: 181

     ; al_display_start: 1

     MAEVKLLGFW-YSPFSHRVEWALKIKGVKYE---YIEEDRDN--KSSLLL

     QSNPV---YKKVPVLIHNGKPIVESMIILEYIDETFEGPSILPKDPYDRA

     LARFWAKFLDDKVAAVVNTFFRKGEEQEKGK--EEVYEMLKVLDNELKDK

     KFFAGDKFGFADIAANLVGFWLGVFEEGYGDVLVKSEKFPNFSKWRDEYI

     NCSQVNESLPPRDELLAFFRARFQAVVASRSAPK

To properly display this alignment, the first P of gtm1_mouse must line up with the first V in GTX2_TOBAC, and the actual aligned region (the region that scores as the best local alignment) starts with the first I in gtm1_mouse (amino acid 3) and the first L (amino acid 6) in GTX2_TOBAC.
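The same check can be done in a script by converting al_start into a column of the displayed text. The Python sketch below uses the first displayed line of each record above; the function name column_of is hypothetical, and columns are counted from zero.

```python
def column_of(display, display_start, position):
    """Return the 0-based column of residue number `position` in a padded display string."""
    residue = display_start - 1
    for column, char in enumerate(display):
        if char != "-":  # hyphens are not numbered
            residue += 1
            if residue == position:
                return column
    raise ValueError("residue %d is not in the displayed region" % position)

# First displayed line of each aligned sequence record shown above.
gtm1_mouse = "---PMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLN"
gtx2_tobac = "MAEVKLLGFW-YSPFSHRVEWALKIKGVKYE---YIEEDRDN--KSSLLL"

start_query = column_of(gtm1_mouse, 1, 3)    # al_start: 3, al_display_start: 1
start_library = column_of(gtx2_tobac, 1, 6)  # al_start: 6, al_display_start: 1
print(start_query, start_library)            # both al_start values land in the same column
```

Both calls return the same column, which is what it means for the two records to be displayed in register.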

An Example

Here is a printout of a complete parsable output file containing three alignment records, followed by a printout of the first alignment as it is output by FastA when the default parameter -MARKx=3 is used.

 
 
>>>A41264, 496 aa vs @GLUT4.LIST library
; mp_name: FASTA
; mp_ver: GCG Package 10.0 implementation of FASTA 3.1t12
 
; pg_name: FASTA
; pg_ver: 3.15 August, 1998
; pg_matrix: GenRunData:Blosum50.Cmp
; pg_gap-pen: -12 -2
; pg_ktup: 2
; pg_optcut: 25
; pg_cgap: 37
>>Pir2:A49158
; fa_initn: 1844
; fa_init1: 1201
; fa_opt: 1915
; sw_score: 1915
; sw_ident: 0.593
; sw_overlap: 496
>A41264 ..
; sq_len: 496
; sq_offset: 1
; sq_type: p
; al_start: 4
; al_stop: 493
; al_display_start: 1
-------------MADKKKITASLIYAVSVAAIGSLQFGYNTGVINAPEK
IIQAFYNRTLSQRSG----ETISPELLTSLWSLSVAIFSVGGMIGSFSVS
LFVNRFGRRNSMLLVNVLAFAGGALMALSKIAKAVEMLIIGRFIIGLFCG
LCTGFVPMYISEVSPTSLRGAFGTLNQLGIVVGILVAQIFGLEGIMGTEA
LWPLLLGFTIVPAVLQCVALLFCPESPRFLLINKMEEEKAQTVLQKLRGT
QDVSQDISEMKEESAKMSQEKKATVLELFRSPNYRQPIIISITLQLSQQL
SGINAVFYYSTGIFERAGITQPVYATIGAGVVNTVFTVVSLFLVERAGRR
TLHLVGLGGMAVCAAVMTIALALKEK--WIRYISIVATFGFVALFEIGPG
PIPWFIVAELFSQGPRPAAMAVAGCSNWTSNFLVGMLFPYAEKLCGPYVF
LIFLVFLLIFFIFTYFKVPETKGRTFEDISRGFEEQVETSSPSSPPIEKN
PMVEMNSIEPDKEVA
>A49158 ..
; sq_len: 509
; sq_type: p
; al_start: 17
; al_stop: 507
; al_display_start: 1
MPSGFQQIGSEDGEPPQQRVTGTLVLAVFSAVLGSLQFGYNIGVINAPQK
VIEQSYNETWLGRQGPEGPSSIPPGTLTTLWALSVAIFSVGGMISSFLIG
IISQWLGRKRAMLVNNVLAVLGGSLMGLANAAASYEMLILGRFLIGAYSG
LTSGLVPMYVGEIAPTHLRGALGTLNQLAIVIGILIAQVLGLESLLGTAS
LWPLLLGLTVLPALLQLVLLPFCPESPRYLYIIQNLEGPARKSLKRLTGW
ADVSGVLAELKDEKRKLERERPLSLLQLLGSRTHRQPLIIAVVLQLSQQL
SGINAVFYYSTSIFETAGVGQPAYATIGAGVVNTVFTLVSVLLVERAGRR
TLHLLGLAGMCGCAILMTVALLLLERVPAMSYVSIVAIFGFVAFFEIGPG
PIPWFIVAELFSQGPRPAAMAVAGFSNWTSNFIIGMGFQYVAEAMGPYVF
LLFAVLLLGFFIFTFLRVPETRGRTFDQISAAFHR-----TPSLLEQEVK
PSTELEYLGPDEND
>>Pir2:A32101
; fa_initn: 1822
; fa_init1: 1188
; fa_opt: 1883
; sw_score: 1883
; sw_ident: 0.589
; sw_overlap: 496
>A41264 ..
; sq_len: 496
; sq_offset: 1
; sq_type: p
; al_start: 4
; al_stop: 493
; al_display_start: 1
-------------MADKKKITASLIYAVSVAAIGSLQFGYNTGVINAPEK
IIQAFYNRTLSQRSG----ETISPELLTSLWSLSVAIFSVGGMIGSFSVS
LFVNRFGRRNSMLLVNVLAFAGGALMALSKIAKAVEMLIIGRFIIGLFCG
LCTGFVPMYISEVSPTSLRGAFGTLNQLGIVVGILVAQIFGLEGIMGTEA
LWPLLLGFTIVPAVLQCVALLFCPESPRFLLINKMEEEKAQTVLQKLRGT
QDVSQDISEMKEESAKMSQEKKATVLELFRSPNYRQPIIISITLQLSQQL
SGINAVFYYSTGIFERAGITQPVYATIGAGVVNTVFTVVSLFLVERAGRR
TLHLVGLGGMAVCAAVMTIALALKEKW--IRYISIVATFGFVALFEIGPG
PIPWFIVAELFSQGPRPAAMAVAGCSNWTSNFLVGMLFPYAEKLCGPYVF
LIFLVFLLIFFIFTYFKVPETKGRTFEDISRGFEEQVETSSPSSPPIEKN
PMVEMNSIEPDKEVA
>A32101 ..
; sq_len: 509
; sq_type: p
; al_start: 17
; al_stop: 507
; al_display_start: 1
MPSGFQQIGSEDGEPPQQRVTGTLVLAVFSAVLGSLQFGYNIGVINAPQK
VIEQSYNATWLGRQGPGGPDSIPQGTLTTLWALSVAIFSVGGMISSFLIG
IISQWLGRKRAMLANNVLAVLGGALMGLANAAASYEILILGRFLIGAYSG
LTSGLVPMYVGEIAPTHLRGALGTLNQLAIVIGILVAQVLGLESMLGTAT
LWPLLLAITVLPALLQLLLLPFCPESPRYLYIIRNLEGPARKSLKRLTGW
ADVSDALAELKDEKRKLERERPLSLLQLLGSRTHRQPLIIAVVLQLSQQL
SGINAVFYYSTSIFELAGVEQPAYATIGAGVVNTVFTLVSVLLVERAGRR
TLHLLGLAGMCGCAILMTVALLLLERVPSMSYVSIVAIFGFVAFFEIGPG
PIPWFIVAELFSQGPRPAAMAVAGFSNWTCNFIVGMGFQYVADAMGPYVF
LLFAVLLLGFFIFTFLRVPETRGRTFDQISATFRR-----TPSLLEQEVK
PSTELEYLGPDEND
>>Pir2:B30310
; fa_initn: 1796
; fa_init1: 1179
; fa_opt: 1862
; sw_score: 1862
; sw_ident: 0.585
; sw_overlap: 496
>A41264 ..
; sq_len: 496
; sq_offset: 1
; sq_type: p
; al_start: 4
; al_stop: 493
; al_display_start: 1
-------------MADKKKITASLIYAVSVAAIGSLQFGYNTGVINAPEK
IIQAFYNRTLSQRSG----ETISPELLTSLWSLSVAIFSVGGMIGSFSVS
LFVNRFGRRNSMLLVNVLAFAGGALMALSKIAKAVEMLIIGRFIIGLFCG
LCTGFVPMYISEVSPTSLRGAFGTLNQLGIVVGILVAQIFGLEGIMGTEA
LWPLLLGFTIVPAVLQCVALLFCPESPRFLLINKMEEEKAQTVLQKLRGT
QDVSQDISEMKEESAKMSQEKKATVLELFRSPNYRQPIIISITLQLSQQL
SGINAVFYYSTGIFERAGITQPVYATIGAGVVNTVFTVVSLFLVERAGRR
TLHLVGLGGMAVCAAVMTIALALKEKW--IRYISIVATFGFVALFEIGPG
PIPWFIVAELFSQGPRPAAMAVAGCSNWTSNFLVGMLFPYAEKLCGPYVF
LIFLVFLLIFFIFTYFKVPETKGRTFEDISRGFEEQVETSSPSSPPIEKN
PMVEMNSIEPDKEVA
>B30310 ..
; sq_len: 508
; sq_type: p
; al_start: 17
; al_stop: 506
; al_display_start: 1
MPSGFQQIGSDDGEPPRQRVTGTLVLAVFSAVLGSLQFGYNIGVINAPQK
VIEQSYNATWLGRQGPGGPDSIPQGTLTTLWALSVAIFSVGGMISSFLIG
IISQWLGRKRAMLANNVLAVLGGALMGLANAVASYEILILGRFLIGAYSG
LTSGLVPMYVGEIAPTHLRGALGTLNRLAIVIGILVAQVLGLESMLGTAT
LWPLLLALTVLPALLQLILLPFCPESPRYLYIIRNLEGPARKSLKPLTGW
ADVSDALAELKDEKRKLERERPMSLLQLLGSRTHRQPLIIAVVLQLSQQL
SGINAVFYYSTSIFESAGVGQPAYATIGAGVVNTVFTLVSVLLVERAGRR
TLHLLGLAGMCGCAILMTVALLLLERVPAMSYVSIVAIFGFVAFFEIGPG
PIPWF-VAELFSQGPRPAAMAVAGFSNWTCNFIVGMGFQYVADRMGPYVF
LLFAVLLLGFFIFTFLKVPETRGRTFDQISAAFRR-----TPSLLEQEVK
PSTELEYLGPDEND
 
>>><<<
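A file like the one above can be read in a single pass by tracking which record is currently open. The Python sketch below is one possible approach, not a reference implementation; the function name parse_markx10 and the dictionary layout are assumptions made for illustration. Parameter values are kept as strings.

```python
def parse_markx10(lines):
    """Parse -MARKx=10 output into (header, alignments) built from plain dictionaries."""
    header = None
    alignments = []
    current = None  # the record that receives the following parameter lines
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">>><<<"):  # end of the parsable records
            break
        if line.startswith(">>>"):     # header record
            header = {"title": line[3:], "params": {}}
            current = header
        elif line.startswith(">>"):    # alignment record
            current = {"name": line[2:], "params": {}, "sequences": []}
            alignments.append(current)
        elif line.startswith(">"):     # aligned sequence record
            if not alignments:
                raise ValueError("sequence record found before any alignment record")
            current = {"name": line[1:].split(" ")[0], "params": {}, "display": ""}
            alignments[-1]["sequences"].append(current)
        elif line.startswith(";"):     # parameter line: "; tag_name: value(s)"
            if current is not None:
                key, _, value = line[1:].partition(":")
                current["params"][key.strip()] = value.strip()
        elif current is not None and "display" in current:
            current["display"] += line  # sequence text, including hyphens
    return header, alignments
```

With the file above read into a list of lines, alignments[0]["params"]["sw_score"] would hold the string "1915", and alignments[0]["sequences"][1]["params"]["al_start"] would hold "17".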
 

 

------------------------------------------------------------------------------

 

 

SCORES   Init1: 1201  Initn: 1844  Opt: 1915

Smith-Waterman score: 1915;    59.3% identity in 496 aa overlap

 

                                  10        20        30        40

A41264                    MADKKKITASLIYAVSVAAIGSLQFGYNTGVINAPEKIIQAFYNRTL

                             ::::|::|: ||  |::|||||||| ||||||:|:|:  ||:|

A49158       MPSGFQQIGSEDGEPPQQRVTGTLVLAVFSAVLGSLQFGYNIGVINAPQKVIEQSYNETW

                     10        20        30        40        50        60

 

              50            60        70        80        90       100

A41264       SQRSG----ETISPELLTSLWSLSVAIFSVGGMIGSFSVSLFVNRFGRRNSMLLVNVLAF

               |:|     :| |  ||:||:||||||||||||:|| :::: : :||: :||: ||||

A49158       LGRQGPEGPSSIPPGTLTTLWALSVAIFSVGGMISSFLIGIISQWLGRKRAMLVNNVLAV

                     70        80        90       100       110       120

 

                 110       120       130       140       150       160

A41264       AGGALMALSKIAKAVEMLIIGRFIIGLFCGLCTGFVPMYISEVSPTSLRGAFGTLNQLGI

              ||:||:|:: | : ||||:|||:|| : || :|:||||::|::|| ||||:||||||:|

A49158       LGGSLMGLANAAASYEMLILGRFLIGAYSGLTSGLVPMYVGEIAPTHLRGALGTLNQLAI

                    130       140       150       160       170       180

 

                 170       180       190       200       210       220

A41264       VVGILVAQIFGLEGIMGTEALWPLLLGFTIVPAVLQCVALLFCPESPRFLLINKMEEEKA

             |:|||:||::|||:::|| :|||||||:|::||:|| | | |||||||:| | :  |  |

A49158       VIGILIAQVLGLESLLGTASLWPLLLGLTVLPALLQLVLLPFCPESPRYLYIIQNLEGPA

                    190       200       210       220       230       240

 

                 230       240       250       260       270       280

A41264       QTVLQKLRGTQDVSQDISEMKEESAKMSQEKKATVLELFRSPNYRQPIIISITLQLSQQL

             :  |::| |  |||  ::|:|:|: |: :|:  ::|:|: | ::|||:||:::|||||||

A49158       RKSLKRLTGWADVSGVLAELKDEKRKLERERPLSLLQLLGSRTHRQPLIIAVVLQLSQQL

                    250       260       270       280       290       300

 

                 290       300       310       320       330       340

A41264       SGINAVFYYSTGIFERAGITQPVYATIGAGVVNTVFTVVSLFLVERAGRRTLHLVGLGGM

             |||||||||||:||| ||: ||:||||||||||||||:||::||||||||||||:||:||

A49158       SGINAVFYYSTSIFETAGVGQPAYATIGAGVVNTVFTLVSVLLVERAGRRTLHLLGLAGM

                    310       320       330       340       350       360

 

                 350         360       370       380       390       400

A41264       AVCAAVMTIALALKEK--WIRYISIVATFGFVALFEIGPGPIPWFIVAELFSQGPRPAAM

               || :||:|| | |:   : |:|||| |||||:||||||||||||||||||||||||||

A49158       CGCAILMTVALLLLERVPAMSYVSIVAIFGFVAFFEIGPGPIPWFIVAELFSQGPRPAAM

                    370       380       390       400       410       420

 

                   410       420       430       440       450       460

A41264       AVAGCSNWTSNFLVGMLFPYAEKLCGPYVFLIFLVFLLIFFIFTYFKVPETKGRTFEDIS

             |||| |||||||::|| | |: :  ||||||:| |:|| |||||:::||||:||||::||

A49158       AVAGFSNWTSNFIIGMGFQYVAEAMGPYVFLLFAVLLLGFFIFTFLRVPETRGRTFDQIS

                    430       440       450       460       470       480

 

                   470       480       490

A41264       RGFEEQVETSSPSSPPIEKNPMVEMNSIEPDKEVA

              :|::     :||    | :| :|:: : ||::

A49158       AAFHR-----TPSLLEQEVKPSTELEYLGPDEND

                         490       500
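Comparing the two printouts shows how the parameters of the first alignment record correspond to the score lines of the display above: fa_init1, fa_initn, and fa_opt appear on the SCORES line, and sw_score, sw_ident, and sw_overlap appear on the Smith-Waterman line, with sw_ident expressed as a percentage. The Python sketch below rebuilds those two lines from a dictionary of parameters. The function name score_summary is hypothetical, the text uses single spaces rather than the exact column spacing of the display, and the unit "aa" assumes a protein comparison.

```python
def score_summary(params):
    """Rebuild the two score lines of the display from -MARKx=10 parameter strings."""
    scores = "SCORES Init1: {} Initn: {} Opt: {}".format(
        params["fa_init1"], params["fa_initn"], params["fa_opt"])
    # sw_ident is stored as a fraction; the display shows it as a percentage.
    identity = float(params["sw_ident"]) * 100
    smith_waterman = "Smith-Waterman score: {}; {:.1f}% identity in {} aa overlap".format(
        params["sw_score"], identity, params["sw_overlap"])
    return scores, smith_waterman

# Parameters of the first alignment record (Pir2:A49158) in the parsable file.
first_alignment = {
    "fa_initn": "1844", "fa_init1": "1201", "fa_opt": "1915",
    "sw_score": "1915", "sw_ident": "0.593", "sw_overlap": "496",
}
scores_line, smith_waterman_line = score_summary(first_alignment)
```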

Printed: May 27, 2005  12:12




Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

www.accelrys.com/bio