REFORMAT


Table of Contents

FUNCTION

DESCRIPTION

HEADING

DIVIDING LINE

SEQUENCE

SEQUENCE CHARACTERS

EXAMPLE

OUTPUT FILE

INPUT FILES

RELATED PROGRAMS

RESTRICTIONS

CONSIDERATIONS

FORMAT CONTROL

CHECKSUM

EMBEDDED COMMENTS

COMMAND-LINE SUMMARY

SCORING MATRICES

LOCAL DATA FILES

PARAMETER REFERENCE


FUNCTION

Reformat rewrites sequence file(s), scoring matrix file(s), or enzyme data file(s) so that they can be read by GCG programs.

DESCRIPTION

Reformat rewrites sequence or data files to make them usable by Accelrys GCG (GCG) programs. It can also alter the appearance of single sequence files. Reformat can perform manipulations such as the following:

- Converting single sequence files that were prepared or edited with a text editor into GCG format.

- Converting between multiple sequence (MSF), rich sequence (RSF) and single sequence GCG formats.

- Correcting the sequence type (protein or nucleic acid) of single sequence files that have no type or that were incorrectly typed when they were created.

- Converting nucleic acid sequences between DNA (T, t) and RNA (U, u) representations.

- Converting protein sequences between one-letter and three-letter amino acid representations.

- Converting sequences to all uppercase or all lowercase characters.

- Removing gap characters from sequence files.

In order to use Reformat on single sequence files, the files must contain a heading, a dividing line, and a sequence, as described below. You can use a text editor to make your "foreign" sequence files conform to this arrangement.

HEADING

The heading of a sequence file may contain any number of lines of text at the top of the file to describe the sequence. The heading must not contain two adjacent periods (..) anywhere within it. This area is optional.

DIVIDING LINE

The heading is followed by a dividing line: a line containing two adjacent periods (..). Any information on the line other than the two periods is lost during reformatting. The dividing line may be omitted if there is absolutely no heading. All GCG data files contain a dividing line to separate the data from a documentary heading.

SEQUENCE

After the dividing line comes the sequence in any format you wish. It is conventional to use uppercase letters for known parts of the sequence and lowercase letters for uncertain parts. As in the example below, the sequence may have documentary comments embedded within it. You may either use two adjacent slash characters (//) to mark the end of the sequence data or just allow the sequence to go on until the end of the file.
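
The three-part arrangement described above (optional heading, dividing line, sequence) can be illustrated with a short Python sketch. This is not Reformat's implementation. The function name split_sequence_file is hypothetical, and the sketch assumes that the first line containing two adjacent periods is the dividing line and that a line containing two adjacent slashes ends the sequence data.

```python
def split_sequence_file(text):
    """Split raw file text into (heading, sequence_text).

    Assumptions of this sketch (not taken from Reformat's source):
    the first line containing '..' is the dividing line, and a line
    containing '//' marks the end of the sequence data.
    """
    lines = text.splitlines()
    heading, body = [], lines          # no dividing line: everything is sequence
    for i, line in enumerate(lines):
        if ".." in line:
            heading, body = lines[:i], lines[i + 1:]
            break

    sequence_lines = []
    for line in body:
        if "//" in line:               # optional end-of-sequence marker
            break
        sequence_lines.append(line)

    return "\n".join(heading).strip(), "\n".join(sequence_lines).strip()
```

A file with no heading and no dividing line is treated entirely as sequence, which matches the rule that the dividing line may be omitted when there is no heading.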

SEQUENCE CHARACTERS

The alphabet of legitimate sequence characters and their meanings are defined in Appendix III. GCG programs support the IUB-IUPAC standard ambiguity codes for the representation of nucleic acid ambiguities and the standard one-letter amino acid codes. Reformat, like all other GCG programs, ignores all characters that are not in the alphabet of legitimate sequence characters.

EXAMPLE

Here is a session using Reformat to rewrite a sequence file prepared with a text editor (see the INPUT FILES topic below) to GCG format:

 
 
% reformat
 
 REFORMAT what sequence file(s) ?  reformat.txt
 
    reformat.txt  length: 1636 bp
 
%

OUTPUT FILE

Here is part of the output file from the example above:

 
 
!!NA_SEQUENCE 1.0
 
Human fetal Beta globin G gamma
from Shen, Slightom and Smithies,  Cell 26; 191-203.
Analyzed by Smithies et al. Cell 26; 345-353.
 
The region below is used to demonstrate REFORMAT.  It
starts at base 2101 of the sequence reported in Cell (gamma.seq).
 
reformat.txt  Length: 1636  September 29, 1998 17:28  Type: N  Check: 398  ..
 
       1  AGGAAGCACC CTTCAGCAGT TCCACA
                                      >Cap (G gamma)>
                                      CACT CGCTTCTGGA ACGTCTGAGG
 
      51  TTATCAATAA GCTCCTAGTC CAGACGCC
                                        >coding (G gamma)>
                                        AT GGGTCATTTC ACAGAGGAGG
 
    ////////////////////////////////////////////////////////////
 
    1551  CTTTCAAGGA TAGGCTTTAT TCTGCAAGCA ATACAAATAA TAAATCTATT
 
    1601  CTGCTAAGAG ATCAC
                          <POLYA (G gamma)<
                          ACATG GTTGTCTTCA GTTCTT
 

INPUT FILES

Here is part of the input file used for the example above:

 
 
Human fetal Beta globin G gamma
from Shen, Slightom and Smithies,  Cell 26; 191-203.
Analyzed by Smithies et al. Cell 26; 345-353.
 
The region below is used to demonstrate REFORMAT.  It
starts at base 2051 of the sequence reported in Cell.
 
                            ..
 
AGGAAGCACC CTTCAGCAGT TCCACA>Cap (G gamma)>CACT CGCTT
CTGGA ACGTCTGAGG
TTATCAATAA GCTCCTAGTC CAGACGCC>coding (G gamma)>AT
 
////////////////////////////////////////////////////////
 
GCTCACTGCC CATGATGCAG
AGCTTTCAAG GATAGGCTTT ATTCTGCAAG CAATACAAAT AATAAATCTA
TTCTGCTAAG AGATCAC<POLYA (G gamma)<ACATGGTTGTCTTCAGTTCTT

RELATED PROGRAMS

All GCG programs that write single sequence files, such as Assemble, BackTranslate, PileUp, Reverse, Shuffle, and Translate, write these files in GCG format.

BreakUp reads a GCG-format sequence file containing more than 350,000 sequence characters and writes it as a set of separate, shorter, overlapping sequence files that can be analyzed by GCG programs.

DataSet creates a GCG data library from any set of sequences in GCG format. FormatDB+ combines any set of GCG sequences into a database that you can search with BLAST.

RESTRICTIONS

A sequence may not contain more than 350,000 sequence characters. BreakUp can convert a GCG-format sequence file containing more than 350,000 sequence characters into a set of separate, shorter overlapping sequence files. Embedded comments more than 125 characters long are truncated to 125 characters. Input lines may not be more than 511 characters.
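
Before running Reformat on a hand-edited file, you may want to check it against the limits listed above. The following Python sketch is one way to do so. The function name check_limits is hypothetical, and the caller is assumed to have already counted the sequence characters.

```python
MAX_SEQUENCE_CHARS = 350_000   # limit on sequence characters stated above
MAX_LINE_CHARS = 511           # limit on input line length stated above


def check_limits(lines, sequence_length):
    """Return a list of human-readable problems; an empty list means OK.

    'lines' are the input file's lines without newline characters and
    'sequence_length' is the number of sequence characters, both
    supplied by the caller (an assumption of this sketch).
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if len(line) > MAX_LINE_CHARS:
            problems.append(f"line {number} has {len(line)} characters")
    if sequence_length > MAX_SEQUENCE_CHARS:
        problems.append(f"sequence has {sequence_length} characters")
    return problems
```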

CONSIDERATIONS

Filename Extensions

Nucleic acid and protein sequences are generally named with the filename extensions .seq and .pep, respectively.

Use Staden Format Directly

The command % seqformat Staden sets your process so that most programs accept input sequences in Staden format without the need for reformatting. The command % seqformat GCG restores the system to expect sequences in GCG format.

You can use Reformat on Staden files (or any files that contain only sequence characters) without modification as long as all the sequence characters in the file belong to the IUB-IUPAC code representation. If your Staden file contains any of Staden's ambiguity codes, use the FromStaden program instead.

Use FastA Format Directly

The command % seqformat FastA sets your process so that most programs accept input sequences in FastA format without the need for reformatting. The command % seqformat GCG restores the system to expect sequences in GCG format.

Input from stdin

Reformat accepts input from stdin with -INfile=-. If the stdin input does not contain a heading that is separated from the sequence by a line containing two dots (..), then also use -NOHEAding.

Multiple Sequence Format (MSF) and Rich Sequence Format (RSF) Files

Reformat can be used to convert between MSF, RSF, single sequence format and list files. When single sequence files are specified using a list file, any sequence attributes specified in the list file (e.g. begin and end ranges) are ignored during the conversion to the new file type. When converting from an RSF file, any sequence features are lost. Access to sequence features is currently available only from within SeqLab. (Refer to Section 2 of the User's Guide, Using Sequence Files and Databases, for details. See "Using Multiple Sequence Format (MSF) Files" for help in specifying sequences in MSF files, "Using Rich Sequence Format Files" for help with RSF files, and "Using List Files" for information about list files.)

Following are several examples of the commands you might type to convert between MSF or RSF and single sequence format files. These examples use the files hsp70.msf, hsp70.rsf and pretty.list, which can be copied to your local directory with the % fetch command.

To copy all of the sequences in hsp70.msf into separate sequence files, use

% reformat hsp70.msf{*}

To copy all of the sequences in hsp70.rsf into separate sequence files, use

% reformat hsp70.rsf{*}

To copy the sequence Hs70_Plafa from hsp70.msf into a single sequence file, use

% reformat hsp70.msf{hs70_plafa}

To convert pretty.list into an RSF file, use

% reformat -RSF @pretty.list

If you edit hsp70.msf with a text editor to manually adjust the alignment, you must rewrite the MSF file with Reformat before GCG programs can use it:

% reformat -MSF hsp70.msf{*}

FORMAT CONTROL

For single sequence files and MSF files, you can control the number of sequence characters per line and the number of characters in each block by setting parameters on the command line. Additionally for single sequence files, you can control how many blank lines appear between sequence lines. Reformat defaults to groups of 10 characters in lines of 50, with one blank line between each sequence line.
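
The default layout (blocks of 10 characters, lines of 50, one blank line between sequence lines) can be sketched in Python as follows. The function name layout_sequence is hypothetical. The eight-column position number followed by two spaces mirrors the sample output file shown earlier and is an assumption about the layout, not a statement of how Reformat is written.

```python
def layout_sequence(seq, linesize=50, blocksize=10, blanklines=1):
    """Lay out a plain sequence string in numbered, blocked lines.

    Each line starts with the position of its first character,
    right-aligned in 8 columns (assumed from the sample output),
    followed by the sequence split into space-separated blocks.
    """
    out = []
    for start in range(0, len(seq), linesize):
        line = seq[start:start + linesize]
        blocks = [line[i:i + blocksize] for i in range(0, len(line), blocksize)]
        out.append(f"{start + 1:>8}  " + " ".join(blocks))
        out.extend([""] * blanklines)      # blank lines between sequence lines
    return "\n".join(out).rstrip("\n")
```

The three keyword arguments play the roles of -LINesize, -BLOcksize and -BLAnklines.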

CHECKSUM

For each sequence in an MSF, RSF or single sequence file, Reformat calculates a checksum based on the exact sequence. Reformat always adds the checksum to the file containing the sequence. All GCG programs that read sequences recalculate the checksum and compare it to the value written by Reformat to ensure the integrity of the data. If there is disagreement between the newly calculated and previously written checksum values, the program stops and displays an error message. There is one chance in ten thousand that two different sequences would have the same checksum.
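
This manual does not give the checksum formula. The Python sketch below shows the general idea with a position-weighted character sum taken modulo 10,000, a formula that third-party tools commonly attribute to GCG. Treat both the formula and the function name position_weighted_checksum as assumptions. The modulus is at least consistent with the one-in-ten-thousand collision chance mentioned above.

```python
def position_weighted_checksum(seq):
    """Return a checksum in the range 0..9999 for a sequence string.

    Assumed formula (not confirmed by this manual): each character's
    uppercase code is multiplied by a weight that cycles from 1 to 57,
    and the running total is reduced modulo 10,000.
    """
    total = 0
    for index, char in enumerate(seq):
        total += ((index % 57) + 1) * ord(char.upper())
    return total % 10000
```

A reading program would recompute the value from the sequence and compare it with the number stored after Check: in the file, stopping with an error if the two differ.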

EMBEDDED COMMENTS

You may embed comments of up to 125 characters within a sequence in a single sequence file by enclosing them in special comment-delimiting characters. Comments are very helpful for documenting sequences, especially sequences assembled from several sources or sequences containing many genes.

Comment Delimiting Characters

Embedded comments can begin with one of the characters <, >, or $. Each comment must begin and end with the same character.

Suggestions

The embedded comments below seem useful for the sequences we have annotated.

 
 
        >coding>         beginning of coding sequence
        <coding<         termination of coding sequence
        >Cap>            cap site
        >IVS>            intervening sequence donor
        <IVS<            intervening sequence acceptor
        <PolyA<          poly-A addition site
        >Transcript>     beginning of transcript
        <Transcript<     end of transcript
        >Promoter>       promoter
        >Ribosome>       ribosome binding site
 

Comment Limitations

Comments must start and end with the same delimiting character and may not be more than 125 characters long. Comments that are too long are truncated to 125 characters. Reformat searches through the whole file, if need be, for the second delimiting character that closes the field of a comment. Reformat prints a warning for unclosed comments, but not for comments that are too long.
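
The comment rules above can be illustrated with a small Python sketch that separates embedded comments from sequence characters. It is only a sketch: the function name split_comments is hypothetical, it keeps letters only rather than the full alphabet of Appendix III, it counts the delimiters toward the 125-character limit, and it raises an error for an unclosed comment where Reformat prints a warning.

```python
COMMENT_DELIMITERS = "<>$"     # comment-delimiting characters listed above
MAX_COMMENT_CHARS = 125        # longer comments are truncated


def split_comments(text):
    """Return (sequence, comments) for the sequence part of a file.

    A comment runs from a delimiter to the next occurrence of the
    same delimiter. Simplifications of this sketch: only letters are
    kept as sequence characters, and the 125-character limit is
    applied to the comment including its delimiters.
    """
    residues, comments = [], []
    i = 0
    while i < len(text):
        char = text[i]
        if char in COMMENT_DELIMITERS:
            close = text.find(char, i + 1)   # search the rest of the text
            if close == -1:
                raise ValueError(f"unclosed comment starting at offset {i}")
            comments.append(text[i:close + 1][:MAX_COMMENT_CHARS])
            i = close + 1
        else:
            if char.isalpha():
                residues.append(char)
            i += 1
    return "".join(residues), comments
```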

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % reformat [-INfile=]reformat.txt -Default
 
Prompted Parameters:  None
 
Local Data Files:
 
-DATa=translate.txt       names file of three-letter to one-letter codes
 
Optional Parameters:
 
[-OUTfile=]newseqname     names the output file
-EXTension=.seq           specifies a file name extension for the output
-LIStfile[=reformat.list] writes a list file of output sequence names
-MSF                      reformats sequences into an MSF output file
-RSF                      reformats sequences into an RSF output file
-PROtein or -NUCleotide   insists that the sequences are reformatted as
                          protein or nucleotide sequences
-DEGap                    removes gap characters (. and ~) from the sequence
-LINesize=50              sets number of characters per line
-BLOcksize=10             sets number of characters per block
-BLAnklines=1             puts blank lines between the sequence lines
-NONUMbering              suppresses numbering
-NOCOMments               suppresses comments
-DNA                      changes U into T
-RNA                      changes T into U
-UPPer                    makes all sequence characters uppercase
-LOWer                    makes all sequence characters lowercase
-ONEIntothree             translates one-letter peptides into three-letter
-THReeintoone             translates three-letter peptides into one-letter
-NOHEAding                doesn't include header information for input
                            sequence from stdin
 
-COMparison               reformats a scoring matrix instead of a sequence
                            (used with -PROtein or -NUCleotide, insists
                            that the matrix is reformatted as a protein
                            or nucleotide scoring matrix)
  -GAPweight=8              specifies the gap creation penalty associated
                              with the scoring matrix
  -LENgthweight=2           specifies the gap extension penalty associated
                              with the scoring matrix
  -SCAle=10                 multiplies each value in the scoring matrix
                              by 10 (use any number from 0.01 to 100.0)
  -PROtein or -NUCleotide   insists that the matrix is reformatted as a
                              protein or nucleotide scoring matrix
  -EQUALSformat             writes the scoring matrix in a form that may be
                              more easily read
-OLDCMPformat             converts a pre-Version 9 scoring matrix into
                            a Version 9 scoring matrix (all options used
                            with -COMparison can also be used with
                            -OLDCMPformat; -PROtein or -NUCleotide must be
                            specified with -OLDCMPformat)
-TRANSlate=filename.txt   names the translation table
-NOMONitor                suppresses the screen trace showing each output
                            file

SCORING MATRICES

After modifying a scoring matrix, you may want to reformat it to give it a nicer appearance. To use Reformat for this purpose, run the program with % reformat -COMparison. (See Appendix VII for more information about scoring matrices.)

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

In the rare event that you are using Reformat to convert a three-letter amino acid sequence into a one-letter sequence, Reformat looks for translate.txt as a local data file.

The translation of codons to amino acids, the identification of potential start codons and stop codons, and the mappings of one-letter to three-letter amino acid codes are all defined in a translation table in the file translate.txt. If the standard genetic code does not apply to your sequence, you can provide a modified version of this file in your working directory or name an alternative file on the command line with an expression like -TRANSlate=mycode.txt. Translation tables are discussed in more detail in Appendix VII.

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-OUTfile=newseqname

Selects an output filename other than the name of the input file. This option is most useful for single sequence conversions.

-EXTension=.seq

Selects a filename extension other than the input filename extension. This option is most useful for multiple sequence conversions to a list file.

-LIStfile=reformat.list

Writes a list file with the names of the output sequence files. This list file is suitable for input to other GCG programs that support list files (see Section 2, Using Sequence Files and Databases, in the User's Guide). If you don't specify a file name, Reformat uses reformat for the file name and .list for the file name extension. If -MSF is on the command line, this parameter is ignored and no list file is written.

-MSF

Reformats all input sequences into a multiple sequence format (MSF) output file.

-RSF

Reformats all input sequences into a rich sequence format (RSF) output file.

-PROtein or -NUCleotide

Explicitly reformats the sequences as proteins or nucleic acids.

-DEGap

Removes all gap characters (. and ~) from sequences.
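
As an illustration of the effect of this parameter on the sequence characters, here is a minimal Python sketch. The function name degap is hypothetical, and the sketch works on a plain sequence string rather than on a GCG file.

```python
# Translation table that deletes the two gap characters named above.
_GAP_TABLE = str.maketrans("", "", ".~")


def degap(seq):
    """Return the sequence with all '.' and '~' gap characters removed."""
    return seq.translate(_GAP_TABLE)
```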

-LINesize=50

Lets you set the number of sequence characters per line to any number between 1 and 120 in MSF and single sequence files.

-BLOcksize=10

Lets you set the number of sequence characters in each block to any number between 1 and the line size in MSF and single sequence files.

-BLAnklines=1

Leaves zero or more blank lines between the sequence lines in single sequence files.

-NONUMbering

Suppresses the numbering next to each sequence line in single sequence files.

-NOCOMments

Removes any comments from single sequence files.

-DNA

Substitutes T for U and t for u in sequences.

-RNA

Substitutes U for T and u for t in sequences.
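
The -DNA and -RNA substitutions preserve case, as the following Python sketch illustrates. The function names to_dna and to_rna are hypothetical, and the sketch assumes it is given sequence characters only, with any heading and embedded comments already set aside.

```python
# Case-preserving character substitutions for the two directions.
_TO_RNA = str.maketrans("Tt", "Uu")
_TO_DNA = str.maketrans("Uu", "Tt")


def to_rna(seq):
    """Substitute U for T and u for t (the effect described for -RNA)."""
    return seq.translate(_TO_RNA)


def to_dna(seq):
    """Substitute T for U and t for u (the effect described for -DNA)."""
    return seq.translate(_TO_DNA)
```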

-UPPer

Puts all sequence characters into uppercase.

-LOWer

Puts all sequence characters into lowercase.

-ONEIntothree

Converts a protein sequence in one-letter code to three-letter code (see Appendix III). GCG programs use protein sequences in one-letter code only.

-THReeintoone

Converts a protein sequence from three-letter code to one-letter code (see Appendix III). GCG programs use protein sequences in one-letter codes only.
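
The two peptide conversions can be sketched in Python with the standard three-letter and one-letter codes for the 20 common amino acids. Reformat takes its mapping from translate.txt; the hard-coded table, the function names, and the assumption that three-letter input consists of consecutive three-letter codes (whitespace ignored) are all simplifications made for this illustration.

```python
# Standard codes for the 20 common amino acids (hard-coded for the sketch).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def three_into_one(text):
    """Convert consecutive three-letter codes to one-letter codes."""
    letters = "".join(text.split())            # drop all whitespace
    if len(letters) % 3:
        raise ValueError("length is not a multiple of three")
    codes = [letters[i:i + 3].capitalize() for i in range(0, len(letters), 3)]
    return "".join(THREE_TO_ONE[code] for code in codes)


def one_into_three(seq):
    """Convert one-letter codes to run-together three-letter codes."""
    return "".join(ONE_TO_THREE[residue] for residue in seq.upper())
```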

-COMparison

Reformats a scoring matrix.

-GAPweight

Specifies a default gap creation penalty associated with a scoring matrix. This penalty is written in the auxiliary data block of scoring matrix files. If you don't specify a default gap creation penalty with -GAPweight, the program calculates a reasonable default and writes it in the auxiliary data block. (See Appendix VII for information about the auxiliary data block in scoring matrix files.)

-LENgthweight

Specifies the default gap extension penalty associated with a scoring matrix. This penalty is written in the auxiliary data block of scoring matrix files. If you don't specify a default gap extension penalty with -LENgthweight, the program calculates a reasonable default and writes it in the auxiliary data block. (See Appendix VII for information about the auxiliary data block in scoring matrix files.)

-SCAle=10

Multiplies each value in the scoring matrix and the gap penalties in the auxiliary data block by the number you specify (10 in this example). You can specify any value from 0.01 to 100.0. Each product is rounded to the nearest integer. (See Appendix VII for information about the auxiliary data block in scoring matrix files.)
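
The arithmetic can be sketched in Python as follows. The function name scale_values is hypothetical, and the manual does not say how exact halves are rounded; this sketch rounds them upward, which is an assumption.

```python
import math


def scale_values(values, scale=10.0):
    """Multiply each value by 'scale' and round to the nearest integer.

    The accepted range for 'scale' follows the text above. Rounding of
    exact halves (upward here) is an assumption of this sketch.
    """
    if not 0.01 <= scale <= 100.0:
        raise ValueError("scale must be between 0.01 and 100.0")
    return [int(math.floor(value * scale + 0.5)) for value in values]
```

For example, scale_values([1.5, -0.4, 0.25], 10) returns [15, -4, 3].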

-PROtein or -NUCleotide

Reformats the matrix as either a protein or nucleotide scoring matrix. (See Appendix VII for information about scoring matrix types.)

-EQUALSformat

Writes the scoring matrix in a format which is less compact but may be more easily read. Files converted with this option are readable by all GCG programs.

-OLDCMPformat

Converts a pre-Version 9 scoring matrix to the Version 9 scoring matrix format. By default, each floating point value in the pre-Version 9 matrix is first multiplied by 10 and then rounded to the nearest integer. You must add either -PROtein or -NUCleotide to specify the type of the converted scoring matrix. (See Appendix VII for information about scoring matrix types.) All of the optional parameters that may be used with -COMparison may also be used with -OLDCMPformat.

-NOHEAding

Expects input sequences from stdin to contain no header information.

-TRANSlate=filename.txt

Usually, translation is based on the translation table in a default or local data file called translate.txt. This parameter allows you to use a translation table in a different file. (See Appendix VII for information about translation tables.)

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

Printed:  May 27, 2005 14:21 



Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

www.accelrys.com/bio