DATASET

FUNCTION

DataSet creates a GCG data library from any set of sequences in GCG format.

DESCRIPTION

A large set of sequences is more compact to store and faster to search if the sequences are assembled into a personal database like the databases we provide with Accelrys GCG (GCG). When sequences are assembled into a personal database, all of the GCG database tools work with them just as they do with the databases we provide (GenBank, PIR-Protein, GenPept, NRL3D, SP-TREMBL, and UniProt).

DataSet assembles any set of sequences you specify into a personal database. The sequences in the output files from DataSet are meant to be accessed the same way as sequences in other GCG databases. When you answer the prompt, What should I call the database?, you are giving it a logical name that will be used to refer to the database from then on. The command assigning the logical name (globin in the example session below) is written into a file called .datasetrc in your home directory. This causes the logical name to be set automatically every time you initialize the package. GCG sequence specification syntax (like Globin:h*) can then be used to identify the sequences in the database. (See Section 2, Using Sequence Files and Databases, of the User's Guide for more information about sequence specification.)

EXAMPLE

Here is a session using DataSet to assemble most of the human globin sequences in GenBank into a separate personal database called globin:

% dataset

 Assemble DATASET from what sequence(s) ?  GenBank:Humhb*

 What should I call the database ?  globin

       humhb16aa

        humhb1az

         humhb24

       /////////

        humhbl2a

        humhbp68

 Running DBINDEX to calculate indices for "globin".

 HomeDir:.datasetrc was modified to assign "globin" and "gl".

 Running SEQCAT to make "globin" available for STRINGSEARCH.

 DATASET complete:

        Sequences: 103

     Total length: 215798

 Output file: globin.header, .offset, .names, .numbers, .seqcat, 1 page of .seq

 and .ref

OUTPUT

DataSet writes seven files in your current working directory. globin_000.ref contains the documentation for all of the sequences specified by GenBank:Humhb*. globin_000.seq contains the sequences. globin.names, globin.numbers, and globin.offset are index files used by GCG to find individual sequences in the database. globin.header provides GCG with information such as logical names, release dates, and formatting information. globin.seqcat is the definition file that is searched by the StringSearch program.
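
As a rough illustration of the naming pattern described above, the following Python sketch builds the file names for a given database name. The function name expected_dataset_files is hypothetical, and the sketch assumes a single sequence page numbered _000, as in the globin example; a larger database may have more pages.

```python
from typing import List


def expected_dataset_files(name: str, page: int = 0) -> List[str]:
    """Return the output file names described above for one database.

    Assumption: one sequence page, numbered with three digits (_000),
    as in the globin example. This is not a GCG utility.
    """
    paged = f"{name}_{page:03d}"
    return [
        f"{paged}.ref",     # documentation for each sequence
        f"{paged}.seq",     # the sequences themselves
        f"{name}.names",    # index file
        f"{name}.numbers",  # index file
        f"{name}.offset",   # index file
        f"{name}.header",   # logical names, release dates, formatting
        f"{name}.seqcat",   # definition file searched by StringSearch
    ]
```

Calling expected_dataset_files("globin") returns the names listed in this section, starting with globin_000.ref.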

INPUT FILES

DataSet accepts as input multiple sequences of the same type. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenBank:*.
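
The three forms of input named above can be told apart by their syntax alone. The Python sketch below is only an illustration of that distinction, not the parser GCG uses; the function name classify_input_spec and the returned labels are hypothetical.

```python
def classify_input_spec(spec: str) -> str:
    """Classify an input specification by the syntax described above.

    Illustrative only: real GCG sequence specifications have more forms
    than the three covered here.
    """
    if spec.startswith("@"):
        return "list file"                 # for example @project.list
    if spec.endswith("{*}"):
        return "MSF or RSF file"           # for example project.msf{*}
    if "*" in spec:
        return "wildcard specification"    # for example GenBank:*
    return "single sequence"
```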

If the input is a list file, DataSet applies any Begin, End, and Strand attributes it finds within that file. However, with one exception, the command-line qualifiers -BEGin, -END, -REVerse, and -NOREVerse override any conflicting attributes found in the list file. The single exception is that if an -END qualifier specified on the command line is less than a Begin attribute found in the list file, the output sequence begins and ends at the base indicated by the Begin attribute.
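
The override rule and its exception can be sketched in Python as follows. This is one reading of the paragraph above, not DataSet's actual code: the function name resolve_range is hypothetical, and the sketch assumes the exception applies only when -BEGin is not also given on the command line.

```python
from typing import Optional, Tuple


def resolve_range(
    list_begin: int,
    list_end: int,
    cmd_begin: Optional[int] = None,
    cmd_end: Optional[int] = None,
) -> Tuple[int, int]:
    """Combine list-file Begin/End attributes with -BEGin/-END qualifiers."""
    # Command-line qualifiers override the list-file attributes.
    begin = cmd_begin if cmd_begin is not None else list_begin
    end = cmd_end if cmd_end is not None else list_end

    # Exception: -END smaller than the list-file Begin yields a
    # one-base range at Begin. (Assumed to apply only without -BEGin.)
    if cmd_end is not None and cmd_begin is None and cmd_end < list_begin:
        begin = end = list_begin
    return begin, end
```

For example, a list-file entry with Begin 50 and End 200 combined with -END=20 gives the range 50 to 50 under this reading.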

RELATED PROGRAMS

DataSet+ creates a GCG data library from any set of sequences in GCG format or from sequences in various flatfile formats (including GenBank, EMBL, SwissProt, and FastA). Entries in the data library can have sequences longer than 350,000 symbols.

Fetch copies GCG sequences or data files from the GCG database into your directory or displays them on your terminal screen.

StringSearch identifies sequences by searching for character patterns such as "globin" or "human" in the sequence documentation.

Names identifies GCG data files and sequence entries by name. It can show you what set of sequences is implied by any sequence specification.

FormatDB+ combines any set of GCG sequences into a database that you can search with BLAST.

BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds.

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST.

WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.

Fetch+ copies GCG sequences or data files from the GCG database into your directory or displays them on your terminal screen.

FastA+ does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA+ may be more sensitive than BLAST+.

RESTRICTIONS

You cannot move or rename the database files unless you change the database name assignment appropriately in HomeDir:.datasetrc.

Upon completion, DataSet tries to run the GCG database utilities DBIndex and SeqCat as spawned processes in order to calculate indices for the new database. These two programs must complete successfully before the sequences in the new database can be used with GCG programs.

The .offset, .names, and .numbers files are not ASCII text and cannot be viewed on your terminal screen or modified in any way.

The architecture of GCG data libraries requires that all of the files that make up a data library be in the same directory and share the same base name. These files are differentiated only by their filename extensions. The individual data libraries that make up a database, such as GenBank, need not be located in the same directory. For each data library you may specify a different location in the file GenDBConfigure:dbnames.map.

The location of a data library must be specified by a logical name. If you need to define a new logical name for a new location, add that definition to the file GenDBConfigure:dblogicals and run the command xNewDBLogx. This is necessary, for example, if you put the data on a new disk.

CONSIDERATIONS

The format of GCG database index files changed starting with Version 8.0 of GCG. Personal databases that were created with DataSet prior to Version 8.0 must be converted to the new format. To make a database compatible with the current version of GCG, use the program DBIndex to create new index files for the database.

DATABASE NAMES

The sequences in your new personal database are meant to be accessed the same way as any other GCG database sequences. GCG recognizes that a sequence specification like Globin:humhbb is a database sequence specification by examining the logical name globin. If globin is assigned to a complete filename without a filename extension and if there are five files that start with that name and end with the extensions .ref, .seq, .offset, .numbers, and .names, then GCG assumes globin is a database and tries to find the entry humhbb.
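
The file test described above can be approximated in Python. This is only a sketch of the stated rule, not the check GCG performs internally; the function name looks_like_database is hypothetical, and "start with that name" is interpreted here as a simple filename prefix match, so that globin_000.ref counts for the base name globin.

```python
import glob
import os

# The five extensions named in the text above.
REQUIRED_EXTENSIONS = (".ref", ".seq", ".offset", ".numbers", ".names")


def looks_like_database(base_path: str) -> bool:
    """Return True if files starting with base_path exist for all five extensions.

    base_path is a complete filename without an extension,
    for example /usr/user/burgess/seq/globin.
    """
    prefix = glob.escape(os.path.expanduser(base_path))
    return all(glob.glob(prefix + "*" + ext) for ext in REQUIRED_EXTENSIONS)
```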

Usually, DataSet permanently assigns a name like globin for you when you run the program, but you can assign a database logical name by yourself with a command like this one:

% name -s globin /usr/user/burgess/seq/globin

In the example session, a logical name globin is assigned by adding a command to the file HomeDir:.datasetrc. If that file did not exist before the session, DataSet would create a new one. Whenever you initialize GCG, this name is assigned correctly.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % dataset [-INfile=]GenBank:humhb* [-OUTfile=]globin -Default

Prompted Parameters:

-BEGin=1 -END=148         sets the range of interest for a single sequence

-REVerse                  uses the back strand of a single sequence

Local Data Files: None

Optional Parameters:

-BEGin=1 -END=148         sets the range of interest for all sequences

-REVerse                  uses the reverse strand for all sequences

-IDToken="DEFINITION"     sets the heading's definition line identifier

-TYPe=n                   sets the dataset type

-TOPROtein                translates nucleotide input to protein output

-TRANSlate=translate.txt  specifies the codon translation table file (used with

                           the -TOPROtein parameter)

-LN=globin                defines the long name

-SN=gl                    defines the short name

-APPend                   appends data to an existing dataset

-NOMONitor                suppresses the screen monitor

-NOSUMmary                suppresses the screen summary

-PAGELimit=kbytes         don't write any sequence data in any page beyond this

                           limit (1 to 2096802, default = 655360)

-FASTA                    creates the .seq file in FastA format
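
The capitalization convention used in the summary above can be illustrated with a short Python sketch. The function names are hypothetical, and two points are assumptions rather than documented behavior: that matching is case-insensitive, and that any longer prefix of the full parameter name is also accepted.

```python
def minimum_abbreviation(spec: str) -> str:
    """Return the leading capitalized letters of a parameter name.

    For example, '-BEGin' gives 'BEG' and '-TOPROtein' gives 'TOPRO'.
    """
    letters = []
    for ch in spec.lstrip("-"):
        if not ch.isupper():
            break
        letters.append(ch)
    return "".join(letters)


def matches_parameter(typed: str, spec: str) -> bool:
    """Check whether a typed qualifier could refer to the parameter spec.

    Assumptions: case-insensitive comparison; the typed word must include
    the capitalized letters and be a prefix of the full name.
    """
    name = spec.lstrip("-").lower()
    word = typed.lstrip("-").lower()
    required = minimum_abbreviation(spec).lower()
    return word.startswith(required) and name.startswith(word)
```

Under these assumptions, -beg and -begin both match -BEGin, while -be does not.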

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-BEGin=1

Sets the beginning position for all input sequences. When the beginning position is set from the command line, DataSet ignores beginning positions specified for individual sequences in a list file.

-END=100

Sets the ending position for all input sequences. When the ending position is set from the command line, DataSet ignores ending positions specified for sequences in a list file.

-REVerse

Sets the program to use the reverse strand for each input sequence. When -REVerse or -NOREVerse is on the command line, DataSet ignores any strand designation for individual sequences in a list file.

-IDTokens="DEFINITION"

Many GCG programs annotate their output with a single line of documentation from each sequence. This is the same line that is searched in the definition search of the StringSearch program. For sequences in GenBank format, this line begins with DEFINITION. You can set this identifier to capture whatever line interests you from the heading of your input sequences. If there is no line that starts with your identification token or if you do not use this parameter, then the first non-blank line in your sequence file is used as the sequence definition.
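
The selection rule above (use the line that starts with the token, otherwise fall back to the first non-blank line) can be sketched in Python. The function name definition_line is hypothetical, and this is an illustration of the rule as stated, not DataSet's implementation.

```python
from typing import Iterable, Optional


def definition_line(heading: Iterable[str], token: Optional[str] = None) -> Optional[str]:
    """Pick the definition line from the heading lines of a sequence file."""
    lines = list(heading)
    # Prefer the first line that starts with the identification token.
    if token:
        for line in lines:
            if line.startswith(token):
                return line
    # No token given, or no line starts with it: first non-blank line.
    for line in lines:
        if line.strip():
            return line
    return None
```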

-TYPE=N

Sets the type of the dataset to N for nucleic acid datasets, and P for protein datasets. The type field is stored in the .header file, and used when reading in sequences to set the type of the sequence to protein or nucleic acid.

-TOPROtein

Translates all six potential reading frames for each nucleotide entry. Peptide sequences representing translations of the three forward reading frames are designated with the original entry name followed by f1, f2, or f3 while those corresponding to the three reverse reading frames have names containing r1, r2, or r3.
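
A six-frame translation of this kind can be sketched in Python as follows. The sketch is not DataSet's code: it assumes the standard genetic code rather than a table read with -TRANSlate, it translates unrecognized codons to X, and it joins the entry name and frame label with an underscore purely for illustration, since the exact name format is not specified here.

```python
from itertools import product
from typing import Dict

# Standard genetic code, with codons ordered by base in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame(name: str, dna: str) -> Dict[str, str]:
    """Return translations of the three forward and three reverse frames."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        # The underscore separator is an assumption made for this sketch.
        frames[f"{name}_f{offset + 1}"] = translate(forward[offset:])
        frames[f"{name}_r{offset + 1}"] = translate(reverse[offset:])
    return frames
```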

-TRANSlate=translate.txt

Specifies a file containing the codon translation matrix.

-LN=globin

Defines the long logical name that is used to refer to data in this dataset.

-SN=gl

Defines the short logical name that is used to refer to data in this dataset.

-APPend

Appends data to an existing dataset.

-PAGELimit=kbytes

Do not write any sequence data in any page beyond this limit (1 to 2096802, default = 655360).

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

-SUMmary

Writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

-FASTA

Writes entries to the .seq file in FASTA format.

Printed: May 27, 2005 12:00

Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.