Using Sequences

Table of Contents

Overview

Types of sequence files

Using database sequences

Specifying database sequences by name

Specifying database sequences by accession number

Using single sequence files

Creating and editing single sequences

Specifying single sequence files

Specifying sequence type (Nucleotide or Protein)

Using list files

Creating and editing list files by hand

Programs that create list files

Specifying list files

Using Rich Sequence Format (RSF) files

Programs that create RSF files

Editing RSF files

Specifying RSF files

Using Multiple Sequence Format (MSF) files

Programs that create MSF files

Editing MSF files

Specifying MSF sequences

Copying database sequence files

Creating sequences from databases

Viewing sequences

Viewing sequences in your directory

Reformatting sequence files to GCG format

Reformatting sequence files

For advanced users

Using personal databases

Creating personal databases

Specifying personal databases

Refining a sequence list


Overview

This section covers the heart of Accelrys GCG (GCG): using sequences. It provides the information you need to work with sequence databases (such as GenBank, UniProt, and PIR) and to use your own sequences with GCG programs for specific analyses.

You'll learn how to

  • Work with the different types of sequence files that GCG programs accept.
  • Use specific GCG programs to find a related group of sequences from the databases and copy them to your directory.
  • Look at the contents of sequence files.
  • Reformat sequences between GCG file format and other program formats.

Types of sequence files

GCG supports sequence files in the following sequence formats:

·         GCG Single Sequence Format files (SSF). Includes exactly one sequence. The file begins with annotation lines, and the start of the sequence is marked by a line ending with two periods (..). This line also contains the sequence identifier, the sequence length, and a checksum.

·         GCG Rich Sequence Format files (RSF). Includes one or more sequences that are richly annotated. In addition to the sequence data, RSF files store descriptive information about each sequence, such as sequence weight, author/creator, and database features information. RSF files are useful for viewing sequences and their features in the SeqLab Editor.

·         GCG Multiple Sequence Format files (MSF). Includes two or more sequences aligned together. MSF files are created by GCG programs such as PileUp, ClustalW+, and SeqConv+.

·         List files. Includes a list of sequence names and their locations, but no sequence data. List files can also include sequence specifications containing wildcards and nested list files (list files within list files).

·         Bioinformatics Sequence Markup Language (BSML) Format files **. This is an open, Extensible Markup Language (XML) format for communicating genomic information. The BSML format can capture the richness of genomic research data in documents that preserve the biological meaning and relationships of the content.

·         GenBank Format files **. May include one or more sequences. Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, and a table of features that identifies coding regions and other sites of biological significance.

·         FastA Format files **. May include one or more sequences. This format contains a one-line header followed by lines of sequence data. Each sequence in a FastA-formatted file is preceded by a line starting with a ">" symbol.

·         EMBL Format files **. Includes one or more nucleotide sequences. The format is based on the format used in the EMBL Nucleotide Sequence Library.

·         UniProt/SwissProt Format files **. Includes one or more protein sequences. Similar to the EMBL format.

·         Databases. Sequences can be grouped together into a database. More information on using sequences stored in databases can be found in the next section.

** Note: These formats (GenBank, EMBL, BSML, and SwissProt) can be used directly as input to plus programs, but must first be converted to one of the GCG formats (SSF, RSF, or MSF) for use with non-plus programs. You can convert sequence files into GCG format with GCG tools such as SeqConv+, Reformat, FromGenBank, and FromEMBL. For more information, see "Reformatting Sequence Files to GCG Format" later in this section.

 

Using database sequences

Sequence databases

GCG provides you access to nucleotide and protein database sequences. When this User's Guide was created, the following databases were available:

·       GenBank. Composed of nucleic acid sequences from the GenBank Genetic Sequence Data Bank. GenBank exchanges sequence information on a daily basis with EMBL and the DNA Data Bank of Japan (DDBJ). You can search all of GenBank or narrow your search to one of its divisions. GenBank is administered by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) in Bethesda, Maryland, USA. GenBank is released six times a year.

·       UniProt. A central repository of protein sequence and function created by joining the information contained in SwissProt, TrEMBL, and PIR. UniProt is administered by the UniProt Consortium, which comprises the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR).

·       GenPept. Contains mostly non-annotated translations of GenBank sequences and is maintained by the NCBI. GenPept sequences are in FastA format, with each sequence cross-referenced to its original GenBank entry.

Online database tables

To refer to sequences in these databases, use the logical names listed in the online Nucleic Acid Databases and Protein Databases tables.

To display the online database tables:

 

Choose one of the following:

  • Type

% typedata genhtml:moredata/databases.txt

  • Access the online help by typing % genhelp or % genmanual, then choose Databases from the top menu.

In the Nucleic Acid Databases and Protein Databases tables, you will notice that in some cases there is more than one logical name to refer to a database; use whichever you are most comfortable with. For example, to refer to sequences in GenBank, you could use the logical name GenBank or GB.

Note: Because databases are site-dependent, the online database tables may not include all the databases available to you, or your site may name the databases differently. In addition, because the divisions of GenBank are subject to change, these tables may not be complete.

 

To find out more about the databases, read the release notes that accompany each database release. If your site receives the Database Update Service, these release notes are located in the directory with the logical name genscriptdoc. For each database, you will find a file of release notes with the name of the database and the extension ".release". For example, to find out more about the GenBank database, type

 

% typedata genscriptdoc:genbank.release | more

Example Database Sequence

Each sequence in the databases contains not only the sequence data but also taxonomic information about the organism and the bibliographic citation. Below is an example of the sequence Dro5S from the Invertebrate division of GenBank.

Specifying database sequences by name

 

You can specify database sequence entries by name. Note, however, that a sequence name is subject to change from one database release to the next. For instance, if an existing database sequence is merged with another sequence, the complete, merged sequence may acquire the name of the second sequence while the first sequence name is omitted. A more stable way of tracking a sequence from one database release to the next is by its accession number, as described in "Specifying Database Sequences by Accession Number" in this section.

To specify a database sequence entry by name:

Choose one of the following.

Note: Database names are case-insensitive. That is, you can type them in uppercase, lowercase, or mixed case.

  • Single Sequence. Type the name of the database or database division (for example, GenBank), a colon (:), and the name of the sequence (for example, Dro5S)--GenBank:Dro5S. You'll notice that in some cases there is more than one logical name to refer to a database. Thus, you could refer to this same sequence as GB:Dro5S.

There are also a number of logical names that refer to the individual divisions of GenBank, UniProt, and PIR. For example, GB_In refers only to those sequences in the Invertebrate division of the GenBank database, such as GB_In:Dro5S. You can also refer to this same division as Invertebrate or In, for instance In:Dro5S.

  • Multiple Sequences. If a program prompt asks you "What sequence(s)?", it implies that the program can accept multiple sequences. You can specify multiple sequences in the databases using an asterisk (*) wildcard. For example, GenBank:Hiv* refers to all sequences in GenBank whose names start with "Hiv." Or, GenBank:* refers to all sequences in the GenBank database.
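The naming rules above can be sketched in a few lines of Python. This is only an illustration of the database:name syntax, not part of GCG: the function names and the alias table are hypothetical, the table holds only the logical names mentioned in this section, and matching entry names without regard to case is an assumption of the sketch.

```python
import fnmatch

# Hypothetical alias table built only from the logical names mentioned above;
# each site defines its own logical names.
DB_ALIASES = {
    "genbank": "GenBank",
    "gb": "GenBank",
    "gb_in": "GenBank/Invertebrate",
    "invertebrate": "GenBank/Invertebrate",
    "in": "GenBank/Invertebrate",
}


def parse_db_spec(spec):
    """Split 'database:entry' into (database, entry pattern, has_wildcard)."""
    db, sep, entry = spec.partition(":")
    if not sep or not db or not entry:
        raise ValueError(f"not a database specification: {spec!r}")
    # Database names are case-insensitive, so look them up in lowercase.
    canonical = DB_ALIASES.get(db.lower())
    if canonical is None:
        raise ValueError(f"unknown logical name: {db!r}")
    return canonical, entry, "*" in entry


def select_entries(spec, entry_names):
    """Return the names in entry_names that the specification selects."""
    _, pattern, _ = parse_db_spec(spec)
    # Assumption of this sketch: entry names are compared without regard to case.
    return [name for name in entry_names
            if fnmatch.fnmatchcase(name.lower(), pattern.lower())]
```

For example, select_entries("GenBank:Hiv*", names) keeps only the names that start with "Hiv", while parse_db_spec("gb:Dro5S") and parse_db_spec("GenBank:Dro5S") resolve to the same database.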

Specifying database sequences by accession number

 

The sequence names of entries in the databases sometimes change from one database release to the next, and the same entry may have a different name in GenBank than in EMBL. Because of this, publications refer to sequences by accession number. Using accession numbers offers three advantages over sequence names:

  • Accession numbers are more stable than entry names. Where sequence entry names may be deleted from a database between one release and the next, accession numbers always stay with a sequence.
  • Accession numbers are consistent between EMBL and GenBank, whereas entry names may not be.
  • The UniProt protein database has cross references to SwissProt and TrEMBL entries based on accession numbers.

Specifying a database sequence by accession number is much like specifying one by name. Database names and accession numbers are case-insensitive. That is, you can type them in uppercase, lowercase, or mixed case.

To specify a database sequence by accession number:

Type the name of the database (for example, GB, which is the GenBank database), a colon (:), and the accession number (for example, U00069)--GB:U00069.

Note: You cannot use wildcards to specify sequences by accession number.

 

Using single sequence files

 

Much of the work you perform may revolve around single sequences, which are sequence files stored in your personal directories. There are three ways to create single sequence files: 1) by using a text editor, 2) by using the Reformat or SeqConv+ programs, or 3) by using SeqLab, the graphical user interface to GCG.

You can store single database sequences in your personal directories. You can also import single sequences created by other sequence analysis software and reformat them with the SeqConv+ or Reformat programs for use with GCG. For more information on importing sequences, see "Reformatting Sequence Files to GCG Format" in this section.

Creating and editing single sequences

You can create sequences from scratch in GCG or edit existing sequences. Each sequence must have a "type" associated with it, denoting the sequence as either a nucleotide or a protein. To specify the sequence type, you can add the parameter -NUCleotide or -PROtein to the command line when you run Reformat. If you forget to do so, the programs determine the type for you based on the symbols in the sequence. Note that because nucleotide and protein sequences share some symbols, the programs can guess incorrectly at the sequence type.

To create a new sequence or edit an existing one:

Choose from the following.

  • Use the text editor of your choice to create a file, then reformat it into GCG format using the Reformat or SeqConv+ program.
    1. Type the sequence information in the text editor of your choice, for example vi. Include the following information:

Heading. (optional) May contain any number of lines of text at the top of the file describing the sequence.

Dividing Line. Consists of a single line containing two periods in succession (..) to separate heading information from the sequence. This line is required only if you include heading information.

Sequence. Contains the sequence information in any format. Each line of the sequence cannot be longer than 512 characters.

    2. Save the file.
    3. Use Reformat to rewrite the sequence file into GCG format. To do so, type % reformat -NUCleotide filename or % reformat -PROtein filename. For more information on Reformat, see the Program Manual.

OR

Use SeqConv+ to rewrite the raw sequence file (created with the text editor) into a GCG format file. The Program Manual also describes how to use SeqConv+ to create files in a specific format (GenBank, UniProt, PIR, and so on).

Note: You can also use a text editor to modify existing sequence files, although we do not recommend this method. Once you modify a sequence with a text editor, the checksum of the sequence changes, and GCG programs will not recognize the sequence. Therefore, if you use a text editor to modify a sequence, you must use the Reformat program to rewrite the file into GCG format.

  • Use SeqLab, the graphical user interface to GCG. For more information, see "Creating and Editing Sequences" in Section 2, Editing Sequences and Alignments of the SeqLab Guide.

  • Use SeqConv+, a utility program that provides batch conversions between different sequence formats. It lets you convert between file formats so that you can import data into Accelrys' bioinformatics applications, and it converts the internally used formats (for example, BSML and RSF) into formats more commonly accepted by third-party tools. The supported file formats include BSML, SwissProt, GenBank, FastA, EMBL, MSF, and RSF.

 

Specifying single sequence files

To specify a sequence file in response to a program prompt:

Choose one of the following.

  • Single Sequence. If you are running a program in the directory containing the sequence file, type the name of the file, for example, gamma.seq. If the sequence file is in a directory other than where you currently are running the program, type the directory and file specification, for example, /smith/project/gamma.seq.
  • Multiple Sequence Files. If a program prompt asks you "What sequence(s)?", it implies that the program can accept multiple sequences. You can name several sequence files by using an asterisk (*) wildcard. For example, gam* refers to all the sequence files in your directory starting with "gam".

TIP - Sometimes the sequence files do not have characters in common; that is, you cannot use a wildcard to name several of them. If this is the case, you can create a list file to name multiple sequences. For more information, see "Using List Files" in this section.

 

Specifying sequence type (Nucleotide or Protein)

The sequence type (nucleotide or protein) is an inherent part of a sequence. You can determine the type of a sequence by looking at the sequence file. Sequences in GCG format contain a dividing line between the optional text heading and the sequence data. Consider the following example of a typical dividing line:

Gamma.Seq Length:  11375  August 2, 1998 10:09 Type: N Checksum: 6474 ..

The sequence type should appear on the dividing line as either Type: N for nucleotide or Type: P for protein. If the dividing line doesn't contain a Type: field, GCG programs infer the sequence type from the characters in the sequence. This inference may not always be correct.

If the Type: field of any sequence is incorrect or missing, you should correct it with the Reformat program.
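To check many files for a missing Type: field, you could read the dividing line with a short script. The following Python sketch is not a GCG tool: the function name is hypothetical, and the field layout is assumed from the single example line above (accepting Check: as well as Checksum: is a further assumption).

```python
import re


def parse_dividing_line(line):
    """Extract name, length, type, and checksum from a GCG dividing line."""
    text = line.rstrip()
    if not text.endswith(".."):
        raise ValueError("a dividing line must end with two periods (..)")
    length = re.search(r"Length:\s*(\d+)", text)
    seq_type = re.search(r"Type:\s*([NP])\b", text)
    # Assumption: the checksum field may be labeled "Checksum:" or "Check:".
    checksum = re.search(r"Check(?:sum)?:\s*(\d+)", text)
    return {
        "name": text.split()[0],
        "length": int(length.group(1)) if length else None,
        # None means the Type: field is missing and programs must infer the type.
        "type": seq_type.group(1) if seq_type else None,
        "checksum": int(checksum.group(1)) if checksum else None,
    }
```

A result whose "type" value is None marks a file that is worth rewriting with % reformat -NUCleotide or % reformat -PROtein.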

To specify sequence type as either nucleotide or protein:

·         Use the Reformat program. Type % reformat -NUCleotide filename or % reformat -PROtein filename. For more information on Reformat, see the Program Manual.

·         Use the SeqConv+ program. For more information on SeqConv+, see the Program Manual.


Using list files

A list file, formerly known as a file of sequence names, is what its name implies: a file containing a list of sequence names and their locations. You can think of list files as a way to organize your sequences on a project-by-project basis. List files are similar to the bookmark files used by web browsers: they contain links to the desired sequences, but do not contain the sequence data.

You will find list files useful for specifying sequences from multiple files in one file that you can use as input to a program. List files can contain any number of the following types of sequences:

  • Single sequences from the databases or your personal directories, for example, GB_In:Dro5S or /smith/project/gamma.seq.
  • Database sequence names using asterisk (*) wildcards, for example GenBank:Hum*. Note that you cannot use wildcards to include multiple sequence files from your personal directories, for example /smith/project/*.seq.
  • Names of other list files, for example, @hsp70.list.
  • Sequences in RSF or MSF files, for example pileup.msf{ssa4} or hsp.rsf{*}.

You can use list files with any program that accepts multiple sequences as input. A program prompt asking "What sequence(s)?" implies that the program accepts multiple sequences.

Below is an example of a list file.

In addition to sequence specifications, each sequence in a list file may optionally contain sequence attributes. These attributes include:

Begin Position. (Begin:n) Shows the base position you want to start with, where n= 1 to the length of the sequence.

End Position. (End:n) Shows the base position you want to end with, where n = 1 to the length of the sequence.

Strand. (Strand:+ or -) Defines the forward or reverse complement nucleic acid sequence strand, where + = forward strand and - = reverse strand.

Sequence Topology: Linear or Circular. (Circ:T or F) Defines the strand as linear or circular, where T = circular and F = linear.

Sequence Weight. (Wgt:n.n) Defines the sequence weight, or the significance of the sequence in comparison to other sequences. That is, you may not want all sequences accounted for equally to determine a result. Therefore, you can give some sequences greater weight than others. This attribute is of use only when you are using two or more sequences in the analysis.

Join. (Join:Sequence_Name) Indicates that the sequence segment should be concatenated with the next sequence in the list that has an identical Join:Sequence_Name attribute. Several contiguous sequences specified in a list file with the same Join:Sequence_Name attribute are concatenated together. (Assemble, Translate, and LookUp are the only GCG programs that use the Join attribute. SeqLab uses the Join attribute to concatenate list file sequences in the Editor.)

Note: In version 9.0 or later, the following programs use some or all of these sequence attributes in the command-line version of the GCG: Assemble, CodonFrequency, Distances, Diverge, FrameSearch, PileUp, PlotSimilarity, ProfileMake, Seg, Translate, and Xnu.
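The attributes described above can be read from a list-file line with a small parser. The Python sketch below is illustrative only: the function name is hypothetical, and it assumes that each attribute is written as a single Name:value token with no space after the colon, which is how the attributes are shown in this section.

```python
def parse_list_entry(line):
    """Split one list-file line into (specification, attributes, comment)."""
    # Everything after the first exclamation point is a comment.
    body, _, comment = line.partition("!")
    tokens = body.split()
    if not tokens:
        return None  # blank line, or a line that is entirely a comment
    spec, attrs = tokens[0], {}
    for token in tokens[1:]:
        key, sep, value = token.partition(":")
        key = key.lower()
        if not sep:
            raise ValueError(f"unexpected token: {token!r}")
        if key in ("begin", "end"):
            attrs[key] = int(value)            # base positions
        elif key == "strand":
            if value not in ("+", "-"):
                raise ValueError(f"strand must be + or -: {value!r}")
            attrs[key] = value
        elif key == "circ":
            attrs[key] = value.upper() == "T"  # T = circular, F = linear
        elif key == "wgt":
            attrs[key] = float(value)          # sequence weight
        elif key == "join":
            attrs[key] = value                 # name shared by joined segments
        else:
            raise ValueError(f"unknown attribute: {token!r}")
    return spec, attrs, comment.strip()
```

For instance, the line GB_In:Dro5S Begin:10 End:120 Strand:- yields the specification GB_In:Dro5S with a begin position of 10, an end position of 120, and the reverse strand.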

 

Creating and editing list files by hand

To create a list file with a text editor:

  1. Open a new file with the text editor of your choice, for example vi.
  2. Type the appropriate information. A list file contains the following optional and required elements (see the list file example earlier in this section):

File Type. (required) Begins with the line (all uppercase) !!SEQUENCE_LIST 1.0. SeqLab uses the file type to improve performance when loading files. Do not edit or delete the file type. This line must appear on the first line of the file.  In previous software versions, this line was optional. 

Description. (optional) Contains informative text, including the date of creation, describing what is in the file.

Dividing Line. (required) Includes two periods (..) that must appear on the line preceding the sequence list.

Tip - If you have list files that lack the required first line and the dividing line, you can run the command fixlist yourlistfile to add them. This command assumes that Perl 5.005 or above is installed on your system and is available in your PATH. If your directory has many list files, all ending with .list, run the command fixlist *.list to fix all of them. Original files are backed up with a .bak extension.

Sequence List. (required) Includes the single sequences from your personal directory or a database, sequence specifications using wildcards, RSF files, MSF files, or list files. You must provide the database or directory specification. You can add sequences in any order.

Sequence Attributes. (optional) Can include the begin and end position, indicate the forward or reverse strand, define the strand as linear or circular, give the sequence a weight in comparison with other sequences, and indicate whether the sequence is concatenated with other sequences in the list.

Sequence Comments. (optional) Includes an exclamation point (!) followed by a short comment or definition of the sequence(s) or list file.

  3. Save and exit the file.
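If you build list files from a script rather than by hand, the required elements can be assembled in the same order. The Python sketch below is a minimal illustration, assuming that a line containing only two periods is an acceptable dividing line; the function name and the sample entries are hypothetical.

```python
def build_list_file(description, entries):
    """Return list-file text: file type, description, dividing line, entries.

    entries is a sequence of (specification, comment) pairs; the comment
    may be an empty string.
    """
    if ".." in description:
        # Two periods inside the description could be mistaken for the dividing line.
        raise ValueError("the description must not contain two periods (..)")
    lines = ["!!SEQUENCE_LIST 1.0"]  # file type: must be the first line
    if description:
        lines.append(description)    # optional descriptive text
    lines.append("..")               # dividing line preceding the sequence list
    for spec, comment in entries:
        # A comment starts with an exclamation point after the specification.
        lines.append(f"{spec}  ! {comment}" if comment else spec)
    return "\n".join(lines) + "\n"
```

Calling build_list_file("Heat shock proteins", [("GB_In:Dro5S", "5S rRNA"), ("@hsp70.list", "")]) returns text that you can write to a file such as project.list and then specify as @project.list.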

To edit a list file, either one you have manually created or one created by a program:

Use a text editor of your choice and modify the file as necessary.

Tip - One way to specify a subset of sequences is to "comment out" those unwanted sequences within the list file. If you comment out sequences instead of deleting them, you can use them at a later time.

To comment out sequences:

  1. Open the list file in the text editor of your choice and find the sequences you do not want to use.
  2. Type an exclamation point (!) in front of the name of each sequence you do not want, for example !GB_In:Dro5S.
  3. Save the file and exit the text editor.
  4. To specify the list file, type an 'at' sign (@) followed by the list filename and extension, for example @hsp70.list. The program will use only those sequences that are not commented out.
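To preview which sequences remain active after you comment some out, a script can skip the commented lines. This Python sketch is illustrative rather than a description of how GCG reads list files: the function name is hypothetical, and it assumes that the sequence list starts after the first line containing two periods.

```python
def active_specifications(list_text):
    """Return the specifications in a list file that are not commented out."""
    lines = list_text.splitlines()
    for index, line in enumerate(lines):
        if ".." in line:  # dividing line: the sequence list follows it
            entries = lines[index + 1:]
            break
    else:
        raise ValueError("no dividing line (..) found")
    active = []
    for line in entries:
        stripped = line.strip()
        # Skip blank lines and entries commented out with an exclamation point.
        if not stripped or stripped.startswith("!"):
            continue
        active.append(stripped.split()[0])
    return active
```

Removing the exclamation point from a line makes that sequence active again the next time the list file is read.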

Programs that create list files

Some GCG programs can produce output in list file format. Any program that creates multiple sequence output files and can organize those sequence specifications in a list file supports the -LIStfile parameter. You can then use that list file as input to other programs.

Programs that can create list output files and their parameters (if necessary) are listed below.

Program             Parameter (if necessary)

Assemble            -LIStfile
BLAST, BLAST+
Corrupt             -LIStfile
FastA, FastA+
FastX, FastX+
FindPatterns        -NAMes
FindPatterns+
LookUp
Motifs              -NAMes
MotifSearch
Names
Pretty              -UGLy
ProfileSearch
PSIBLAST
Reformat            -LIStfile
Sample              -LIStfile
Seg                 -LIStfile
SeqConv+            -listfile
SeqManip+           -listfile
Simplify            -LIStfile
SSearch
StringSearch
TFastA, TFastA+
TFastX, TFastX+
Translate           -LIStfile
WordSearch
Xnu                 -LIStfile

Note: Some of the programs listed above, such as ProfileSearch, may include additional program-specific information in the output list file. Others, such as FastA and BLAST, may include sequence alignments. This extra information does not affect the list file's performance.

Specifying list files

To specify a list file in response to a program prompt:

Type an ‘at’ sign (@) and the name of the list file and extension, for example @hsp70.list.

Note: You cannot use wildcards to specify a list file. For example, you cannot specify @hsp*.list.
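Because a list file can name other list files, a list-file specification can be thought of as expanding recursively. The Python sketch below illustrates that idea with an in-memory table of list files instead of real files. The function names are hypothetical, and the check for a list file that includes itself is a safeguard added for this sketch, not documented GCG behavior.

```python
def list_file_name(spec):
    """Return the file name from an '@name' specification."""
    if not spec.startswith("@") or len(spec) == 1:
        raise ValueError(f"not a list-file specification: {spec!r}")
    name = spec[1:]
    if "*" in name:
        raise ValueError("wildcards cannot be used to specify a list file")
    return name


def expand(spec, list_files, _active=()):
    """Flatten a specification into plain sequence specifications.

    list_files maps a list-file name to the specifications it contains.
    """
    if not spec.startswith("@"):
        return [spec]  # an ordinary sequence specification
    name = list_file_name(spec)
    if name in _active:
        # Safeguard for this sketch only: stop on self-including lists.
        raise ValueError(f"list file {name!r} includes itself")
    result = []
    for entry in list_files[name]:
        result.extend(expand(entry, list_files, _active + (name,)))
    return result
```

With one list file that contains @hsp70.list among its entries, expand returns the outer entries followed by the entries of hsp70.list, in the order in which they appear.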

Using Rich Sequence Format (RSF) files

A Rich Sequence Format (RSF) file contains one or more sequences that may or may not be related. In addition to the sequence data, each sequence can be richly annotated with descriptive sequence information such as:

  • Creator/author of the sequence
  • Sequence weight
  • Creation date
  • One-line description of the sequence
  • Offset, or the number of leading gaps in a sequence that is part of an alignment or fragment assembly project
  • Known sequence features

RSF files are particularly useful with SeqLab, the graphical user interface to GCG. Because they store positional information, you can display RSF files within SeqLab's Editor Mode to view and edit sequence alignments and features. The features annotation allows you to graphically view and align sequences based on features as well as run programs on sequence regions selected by feature. You will also find RSF files useful for distributing sequences to colleagues, since these files contain each sequence's data and descriptive information.

Note: If you plan on using SeqLab for the bulk of your analyses, it is best to save your files as RSF if possible. RSF files are more richly annotated than list files or MSF files, which do not save sequence features information as part of the file.

Below is an example of an RSF file.

You may find the following components in an RSF file:

  • File Type. (required) Begins with the line (all uppercase) !!RICH_SEQUENCE 1.0. SeqLab uses the file type to improve performance when loading files. Do not edit or delete the file type. It must appear on the first line of the file.
  • Dividing Line. (required) Includes two periods (..) that must appear on the line preceding all sequence information and data. Optional comments may appear between the file type and dividing line.
  • Sequence Attributes. Includes descriptive information about the sequence, such as name, sequence description, sequence type, creator, offset, creation-date, strand, weight, and comments. If the sequence is from a database, this section also includes any taxonomic and bibliographic information about the sequence, compiled by the original database.
  • Features. (optional) Contains the features information, including sequence range, description, and graphical depiction. Consider the following example of features information from an RSF file:

The colors, shapes, and fill patterns depicted in SeqLab's Editor are defined in a resource file called feature.cols. To customize these attributes, copy feature.cols to your current directory by typing % fetch feature.cols. Then edit the file in the text editor of your choice, for example vi. The file is internally documented.

  • Sequence. (required) Contains the sequence data.

Programs that create RSF files

 

To create an RSF file:

Choose from the following.

  • SeqLab. You can save files in RSF format from within SeqLab's Editor. For more information, see "Saving Your Work" in Section 2, Editing Sequences and Alignments in the SeqLab Guide.
  • GCG programs. GCG programs and their parameters (if necessary) that create RSF files are listed below.

Program             Parameter (if necessary)

CoilScan            -RSF
FindPatterns        -RSF
FrameSearch         -RSF
FromTrace           -RSF
HmmerAlign          -RSF
HmmerPfam           -RSF
HTHScan             -RSF
Map                 -RSF
Motifs              -RSF
MotifSearch         -RSF
NetFetch
PeptideMap          -RSF
PeptideStructure    -RSF
Prime               -RSF
PrimePair           -RSF
Reformat            -RSF
SPScan              -RSF
Translate           -RSF
TransMem            -RSF
SeqConv+            -RSF

Note: The corresponding plus versions of the above programs also create RSF files.

 

Editing RSF files

To edit an RSF file:

Use SeqLab. If you load an RSF file into SeqLab's Editor, it graphically displays the sequences in the file. For more information, see Section 2, Editing Sequences and Alignments in the SeqLab Guide.

You can also use a text editor to modify an RSF file. If you do, however, the file's checksum changes, and GCG programs will not recognize the file. Therefore, if you use a text editor to modify an RSF file, you must use the Reformat program with the -RSF parameter to rewrite the file into GCG format.
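To see why even a one-character edit is detected, it helps to look at how a sequence checksum behaves. The Python sketch below uses the scheme that is widely reimplemented as the GCG checksum (each symbol's character code multiplied by a position counter that cycles from 1 to 57, summed modulo 10000). This guide does not document the algorithm, so treat the details as an assumption; the function name is hypothetical.

```python
def gcg_checksum(sequence):
    """Checksum of a sequence under the assumed GCG-style weighting scheme."""
    total = 0
    for index, symbol in enumerate(sequence):
        weight = index % 57 + 1  # position counter cycles 1..57
        total += weight * ord(symbol.upper())
    return total % 10000
```

Under this scheme, gcg_checksum("ACGT") and gcg_checksum("ACGA") differ, so a stored checksum no longer matches once a symbol is changed in a text editor. Rewriting the file with Reformat records a checksum for the edited data.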

Specifying RSF files

 

To specify a single sequence, a subset of sequences, or all sequences within an RSF file:

Choose one of the following.

  • Single Sequence. To specify a single sequence within an RSF file, type the name of the RSF file and extension followed by the name of a sequence in curly brackets, for example opsin.rsf{opsf_human}.
  • Multiple Sequences. To specify a subset of sequences or all sequences within an RSF file, type the name of the RSF file and extension followed by a file specification and/or asterisk (*) wildcard in curly brackets. For example, opsin.rsf{opsg*} specifies all sequences in opsin.rsf beginning with "opsg"; opsin.rsf{*human*} specifies all sequences in opsin.rsf where "human" is part of the sequence name; and opsin.rsf{*} specifies every sequence in opsin.rsf.

Using Multiple Sequence Format (MSF) files

 

You can combine multiple sequences in a single file, called a Multiple Sequence Format (MSF) file. MSF files include not only the sequence name but also the sequence itself, which is usually aligned with the other sequences in the file. You can specify a single sequence within an MSF file, a subset of sequences, or all sequences. Like other sequences, those in an MSF file can be used with other GCG programs.

The following illustration shows an MSF file created with PileUp. (Note: You can also use ClustalW+ to create an MSF file.)

You may find the following components in an MSF file:

  • File Type. (required) Begins with the line (all uppercase) !!NA_MULTIPLE_ALIGNMENT 1.0 for nucleic acid sequences or !!AA_MULTIPLE_ALIGNMENT 1.0 for amino acid sequences. SeqLab uses the file type to improve performance when loading files. Do not edit or delete the file type. If present, it must appear on the first line of the file.
  • Description. (optional) Contains informative text describing what is in the file. You can add this information to the top of the MSF file using a text editor.
  • Dividing Line. (required) Must include the following attributes:

    MSF. Displays the number of bases or residues in the multiple sequence alignment.

    Checksum. Displays an integer value that characterizes the contents of the file.

    Two periods (..). Acts as a divider between the descriptive information and the following sequence information.
  • Name/Weight. (required) Must include the name of each sequence included in the alignment, as well as its length and checksum (both non-editable) and weight (editable).

    Note that the checksum of the individual sequences is important as a safety measure to ensure that you do not change the sequence data inadvertently. If this has happened, you will not be able to use the sequence(s) within the MSF file. You then can use the Reformat program to reformat the sequences and create a new checksum to reflect the file's edited contents.
  • Separating Line. (required) Must include two slashes (//) to divide the name/weight information from the sequence alignment.
  • Multiple Sequence Alignment. (required) Must include each sequence named in the above Name/Weight lines. This alignment allows you to view the relationship among sequences.

Programs that create MSF files

 

To create an MSF file:

Choose from the following.

  • SeqLab. You can export files to MSF from within the Editor of the graphical user interface to GCG. For more information, see "Exporting Sequences to BSML, FastA, MSF, SwissProt, SSF (GCG), EMBL, GenBank, SPTrEMBL, or RSF File Format" in Section 2, Editing Sequences and Alignments of the SeqLab Guide.
  • GCG programs. GCG programs and their parameters (if necessary) that create MSF files are listed below.

    Program            Parameter (if necessary)

    ClustalW+
    HmmerAlign
    PileUp
    PrettyBox
    ProfileGap         -MSF
    ProfileSegments    -MSF
    Reformat           -MSF
    SeqConv+           -format=MSF

Note: If you use % reformat -MSF (or SeqConv+) to create an MSF file, it does not align the sequences.

 

Editing MSF files

To edit an MSF file:

You can use a text editor to modify an MSF file. If you do so, however, the stored checksum no longer matches the file's contents, and GCG programs will not recognize the file. Therefore, after editing an MSF file in a text editor, you must use the Reformat program with the -MSF parameter or the SeqConv+ program with the -format=MSF parameter to rewrite it into MSF format.

 

Specifying MSF sequences

 

To specify a single sequence, a subset of sequences, or all sequences within an MSF file:

Choose from the following.

  • Single Sequence. To specify a single sequence within an MSF file, type the name of the MSF file and extension followed by the name of a sequence in curly brackets, for example, picorna.msf{cb3}.
  • Multiple Sequences. To specify a subset of sequences or all sequences within an MSF file, type the name of the MSF file and extension followed by a file specification and/or asterisk (*) wildcard in curly brackets. For example, picorna.msf{pl*} specifies all sequences in picorna.msf beginning with "pl", whereas picorna.msf{*} specifies every sequence in picorna.msf.

Note: You cannot use wildcards in an MSF filename (that is, you cannot specify pic*.msf). You can use wildcards only between the curly brackets { }.
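
The specification rules above can be sketched in Python as follows. This is an illustration of the filename{pattern} syntax only, not GCG's own parser. The function name select_sequences and the sequence names other than cb3 are hypothetical, and the sketch matches names case-sensitively, which may differ from how GCG treats case.

```python
import fnmatch
import re


def select_sequences(spec, names_in_file):
    """Split 'file.msf{pattern}' and return (filename, matching names)."""
    match = re.fullmatch(r"([^{}]+)\{([^{}]*)\}", spec)
    if match is None:
        raise ValueError("expected filename{pattern}, got %r" % spec)
    filename, pattern = match.groups()
    # Wildcards are allowed only between the curly brackets.
    if "*" in filename:
        raise ValueError("wildcards are not allowed in the MSF filename")
    chosen = [n for n in names_in_file if fnmatch.fnmatchcase(n, pattern)]
    return filename, chosen


# Hypothetical sequence names standing in for the contents of picorna.msf.
names = ["cb3", "pl1", "pl2"]
print(select_sequences("picorna.msf{cb3}", names))   # a single sequence
print(select_sequences("picorna.msf{pl*}", names))   # a subset
print(select_sequences("picorna.msf{*}", names))     # every sequence
```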

TIP - One way to specify a subset of sequences is to "comment out" those unwanted sequences within the MSF file. If you comment out sequences instead of deleting them, you can use them at a later time.

To comment out sequences:

  1. Open the MSF file in the text editor of your choice and find the sequences you do not want to use in the Name/Weight area toward the top of the file.
  2. Type an exclamation point (!) in front of the "Name:" of each sequence you do not want.
  3. Save the file and exit the text editor.
  4. In response to a program prompt, type the MSF filename and extension. Note: For non-plus programs, you must specify the MSF filename and extension followed by an asterisk (*) wildcard in curly brackets, for example picorna.msf{*}. The program uses only those sequences that are not commented out.
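
If you comment out sequences often, the editing steps above can be scripted. The following Python sketch is one possible approach, not a GCG tool. The function name comment_out is hypothetical, and the layout of the Name/Weight lines in the sample is assumed from the description of MSF files earlier in this section.

```python
def comment_out(msf_text, unwanted):
    """Prefix '!' to the Name/Weight line of each sequence named in unwanted."""
    result = []
    in_header = True
    for line in msf_text.splitlines():
        if line.strip() == "//":
            in_header = False  # Name/Weight lines end at the separating line
        fields = line.split()
        if (in_header and len(fields) > 1
                and fields[0] == "Name:" and fields[1] in unwanted):
            # Put the exclamation point directly in front of "Name:".
            line = line.replace("Name:", "!Name:", 1)
        result.append(line)
    return "\n".join(result) + "\n"


# Hypothetical Name/Weight area; the Check: values are placeholders.
sample = (
    " Name: cb3  Len: 8  Check: 0  Weight: 1.00\n"
    " Name: pl1  Len: 8  Check: 0  Weight: 1.00\n"
    "//\n"
    "cb3  ACGTACGT\n"
    "pl1  ACGTACGT\n"
)
print(comment_out(sample, {"pl1"}))
```

Lines that are already commented out are left alone, so running the sketch twice on the same file gives the same result. To restore a sequence later, remove the exclamation point.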

Copying database sequence files


GCG makes it easy for you to copy sequences from databases to your directory. You can copy single or multiple sequences from your local databases using Fetch or Fetch+, or from NCBI using NetFetch or NetFetch+. The plus versions of these commands are preferable because they support unlimited sequence length and a wider variety of file formats. The older Fetch and NetFetch programs are provided primarily for backward compatibility.

 

Creating sequences from databases

To copy sequences:

Choose from the following.

  • Single Sequence. To copy a single sequence from your local databases, type % fetch+ entry_name, for example % fetch+ Dro5S.

TIP - If you know the database in which a sequence resides, you can speed its retrieval by including the database in the entry name specification, for example % fetch+ In:Dro5S.

To copy a single sequence from NCBI, type % netfetch+ entry_name or % netfetch+ accession_number, for example, % netfetch+ 12136. The sequence is retrieved and stored in an RSF file in your current directory.

  • Multiple Sequences. To copy multiple sequences from your local databases, use a wildcard in the specification, for example, % fetch+ hum* or % fetch+ Vi:HIV*.

TIP - You also can copy multiple sequences from the databases by creating a list file of those sequences of interest (see "Using List Files" in this section for more information). This method is useful if the sequence names do not have characters in common. Then, to copy the sequences from the database, type % fetch+ @list_filename, for example % fetch+ @hiv-gag.list. The sequences in the list file are copied to your current directory as separate sequences.

To copy multiple sequences from NCBI, indicate the name of a NetBLAST output file, for example % netfetch+ zea2_maize.blastp. The sequences are retrieved and stored in an RSF file in your current directory.


 

Viewing sequences


You may want to read the reference information associated with a sequence or view the sequence itself. You can easily view the contents of sequence files by using the TypeData+ program. Using this command, you can view database sequences or those in your personal directories, including single sequences, RSF files, MSF files, or list files.

Note: You can also use SeqLab, the graphical interface to the package, to view and edit sequences. For more information, see the SeqLab Guide.

 

Viewing database sequences

To view database sequences:

Type % typedata+ entry_name, for example % typedata+ GB_IN:Dro5S. The sequence data, including reference information, scrolls on your screen. Note that you cannot edit a file using the TypeData+ command.

You can control screen output in the following ways:

  • To temporarily stop the scrolling of the data, press <Ctrl>s.
  • To resume scrolling, press <Ctrl>q.
  • To view sequence data one screen length at a time, type % typedata+ filename | more. To progress through the screens, press the <Space Bar>.
  • To exit TypeData+, press <Ctrl>c.

For more information on controlling screen output, see "Controlling Screen Output" in the "Quick Reference" section of Section 1, Getting Started.

Viewing sequences in your directory

To view the contents of single sequence files, list files, RSF files, or MSF files in your directories:

Type % more filename, for example % more gamma.seq. The sequence data, including reference information, displays one screen at a time. To advance from screen to screen, press the <Space Bar>.


Reformatting sequence files to GCG format


At some point in your work with GCG, you may need to reformat sequence files into GCG format to ensure they can be used as input to all GCG programs. This may happen when

  • You create a sequence file using an automated sequencer.
  • You obtain a sequence directly from a database service (such as the Uniprot, GenBank, or PIR web pages) or through another program.
  • You create a sequence file using a text editor.
  • You modify a GCG-formatted sequence file using a text editor. (Note that this is not a recommended practice.)

 

Reformatting sequence files

You can use a number of differently formatted sequences with GCG: sequences created with a text editor or automated sequencer, sequences in a different software format, or sequences in the database formats of GenBank, Uniprot, PIR, or SwissProt.

Each sequence in GCG must have a "type" associated with it, denoting the sequence as either a nucleotide or a protein. To specify the sequence type, you can add the parameter -NUCleotide or -PROtein to the command line when you run Reformat. If you forget to do so, the programs will determine the type for you based on the symbols in the sequence. Note that because nucleotide and protein sequences share some symbols, the programs can guess incorrectly at the sequence type.
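
To see why a guess based on symbols alone can go wrong, consider the following Python sketch. It is a deliberately simple heuristic written for illustration and is not the method GCG uses. The function name guess_sequence_type and the 0.85 threshold are hypothetical.

```python
def guess_sequence_type(sequence, threshold=0.85):
    """Guess 'nucleotide' or 'protein' from the symbols alone (toy heuristic)."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    # A, C, G, T and N are nucleotide symbols, but they are also one-letter
    # amino acid codes, which is why a symbol-based guess can be wrong.
    nucleotide_like = sum(c in "ACGTUN" for c in letters)
    if nucleotide_like / len(letters) >= threshold:
        return "nucleotide"
    return "protein"


print(guess_sequence_type("GATTACA"))      # nucleotide
print(guess_sequence_type("MKVLAAGIVG"))   # protein
# A short peptide made only of Thr, Ala, Gly and Cys is misjudged:
print(guess_sequence_type("TAGCAT"))       # nucleotide
```

The last call shows the failure mode: a peptide made up of residues whose codes are also nucleotide symbols cannot be told apart from DNA. Specifying -NUCleotide or -PROtein on the command line avoids the ambiguity.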

To reformat sequence files:

Choose one of the following.

  • Sequences with no format. If you create or modify a sequence using an automated sequencer or a text editor, use the Reformat or SeqConv+ program to rewrite the sequence file to GCG format. For more information on Reformat or SeqConv+, see the Program Manual.

Note: If the sequence file is not in a standard sequence file format, as listed at the beginning of this section under the heading “Types of Sequence Files”, then you first must open the file in a text editor and insert a line that contains two periods (..) above the sequence information. Then use SeqConv+ to rewrite the sequence to GCG format.

  • Sequences from a database service or another program. Choose one of the following:
    • SeqConv+. Converts sequences to and from FastA, EMBL, SwissProt, GenBank, RSF, SSF, and MSF formats.
    • FromEMBL. Reformats sequences from the distribution (flat file) format of the EMBL or SWISS-PROT databases to GCG format.
    • FromGenBank. Reformats sequences in the flat file format of the GenBank database to GCG format.
    • FromFastA. Reformats sequences in FastA format to GCG format.
    • Note: You can use FastA sequences directly with GCG non-plus programs, without reformatting them, by adding -FASTA to the command line.
    • FromPIR. Reformats sequences from the protein database of the Protein Identification Resource (PIR) to GCG format.
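
For files that are not in a standard sequence file format, the note above asks you to insert a line containing two periods (..) above the sequence information. A minimal Python sketch of that step follows. It assumes the whole file is sequence data with no leading comments, and the function name add_divider and the description text are hypothetical. The result still needs to be rewritten to GCG format with SeqConv+.

```python
def add_divider(raw_text, description="unformatted sequence"):
    """Return raw_text with a line ending in '..' inserted above the sequence."""
    for line in raw_text.splitlines():
        if line.rstrip().endswith(".."):
            return raw_text  # a dividing line is already present
    # Everything above the dividing line is treated as descriptive text.
    return "%s ..\n%s" % (description, raw_text)


raw = "ACGTACGT\nTTGA\n"
print(add_divider(raw))
```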

 

For advanced users


The information in this section is intended for users who are familiar with using sequences within GCG. This section teaches you how to

  • Create and use your own personal databases.
  • Refine a sequence list.

Using personal databases

You can create your own personal databases, similar to GenBank and Uniprot databases, for searching with GCG. This option is a particular advantage if you frequently work with large list files. A large set of sequences is more compact to store and faster to search if it is assembled into a database. Thus, you can convert your large list files into databases for faster searching capabilities. When sequences are assembled into a database, all GCG programs work with them exactly as they work with public databases (GenBank, Uniprot, Genpept, etc.).

 

Creating personal databases

The program DataSet+ creates databases from any set of sequences you specify.

To create a personal database:

  1. Type % dataset+ -config. The program displays the prompt "Assemble Dataset+ from what sequence(s)?"
  2. Choose from the following:
    • Type the sequence specification of the list, RSF, or MSF file you want to convert to a database, for example @hsp70.list or pileup.msf{*}.
    • Type a file specification from a public database using an asterisk (*) wildcard. For example, SW:Hs70* creates a database of all 70 KD heat shock protein sequences in Uniprot, if that database is available at your site.

The program displays the prompt “Enter logical name for FFDB”.

  3. Type the logical name you want to use to refer to the database, for example HSP. This sets the logical name of your personal database.

Your personal database logical names are automatically assigned in a directory called “.wp” in your home directory.

 

Configuring personal databases

You can also assign a logical name to an existing database by adding the line logical_name = /path_to_database/logical_name to $HOME/.wp/dblogicals.conf.
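
If you maintain several personal databases, adding these lines can be scripted. The Python sketch below appends a logical_name = path line to a configuration file unless that name is already defined. The function name add_logical_name and the example paths are hypothetical, the sketch assumes the file holds one definition per line as described above, and the demonstration writes to a temporary directory rather than to your real $HOME/.wp/dblogicals.conf.

```python
import os
import tempfile
from pathlib import Path


def add_logical_name(conf_path, logical_name, database_path):
    """Append 'logical_name = database_path'; return False if already defined."""
    conf = Path(conf_path)
    text = conf.read_text() if conf.exists() else ""
    for line in text.splitlines():
        if line.split("=", 1)[0].strip() == logical_name:
            return False  # leave an existing definition untouched
    conf.parent.mkdir(parents=True, exist_ok=True)
    # Start on a fresh line if the file does not end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with conf.open("a") as handle:
        handle.write("%s%s = %s\n" % (prefix, logical_name, database_path))
    return True


# Demonstration in a temporary directory standing in for your home directory.
demo_home = tempfile.mkdtemp()
demo_conf = os.path.join(demo_home, ".wp", "dblogicals.conf")
print(add_logical_name(demo_conf, "HSP", "/path_to_database/HSP"))
print(Path(demo_conf).read_text())
```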

 

Specifying personal databases

Specifying a personal database you created using DataSet+ is the same as specifying a sequence from a public database such as GenBank, Uniprot, PIR, etc.

To specify a personal database:

Type the logical name of your database, followed by a colon (:), followed by the sequence(s) of interest. For instance, using the example above, you could type HSP:Hs70_Brelc to specify a single sequence in the personal database, or HSP:* to specify all sequences in the personal database. For more information, see "Using Database Sequences" in this section.

 

Refining a sequence list

You can refine list files, RSF files, or MSF files to fit your analysis needs:

  • You can use the output file from one program as input to another to refine a sequence list. For example, you could identify human globin sequences with LookUp. The output list from this session could be refined with FindPatterns to include only those globin sequences containing EcoRI sites.

For more information on the above programs, see the Program Manual.

  • You can combine two or more list files or RSF files by using a text editor such as vi. See the appropriate text editor documentation for more information on appending files.

Note: You cannot combine MSF files in this way.

Note: You cannot "comment out" sequences in RSF files in this way.




Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

www.accelrys.com/bio