DATASET+

FUNCTION

DataSet+ creates a GCG data library from any set of sequences in GCG format or from sequences in various flatfile formats (including GenBank, EMBL/SwissProt, and FastA). Entries in the data library can have sequences longer than 350,000 symbols.

DESCRIPTION

Advantages of Plus “+” Programs:

- Plus programs are enhanced to read sequences in a variety of native formats, such as GCG RSF, GCG SSF, GCG MSF, GenBank, EMBL, FastA, SwissProt, PIR, and BSML, without conversion.

- Plus programs remove the sequence length restriction of 350,000 bp.

If you do not need these features and prefer more interactivity, you may want to run the original version of the program.

A large set of sequences is more compact to store and faster to search if the sequences are assembled into a personal database like the databases we provide with Accelrys GCG (GCG). When sequences are assembled into a personal database, all of the GCG database tools work with them just as they do with the databases we provide (GenBank, PIR-Protein, GenPept, and UniProt).

DataSet+ assembles any set of sequences you specify into a personal database. The sequences in the output files from DataSet+ are meant to be accessed the same way as sequences in other GCG databases. When you answer the prompt "Enter logical name for FFDB", you supply the logical name that will be used to refer to the flatfile database (FFDB) from then on. The command assigning the logical name (globin in the example session below) can be written into a personal configuration file ($HOME/.wp/dblogicals.conf), so that the logical name is set automatically every time you initialize GCG. GCG sequence specification syntax (like Globin:h*) can then be used to identify the sequences in the database. (See Section 2, Using Sequence Files and Databases, of the User's Guide for more information about sequence specification.)

EXAMPLE

Here is a session using DataSet+ to assemble most of the human globin sequences in GenBank into a separate personal database called globin:

  14:26~37> dataset+ -config

  Dataset+ creates a flatfile database from any set of sequences.

  Assemble dataset+ from what sequence(s) ? Genbank:humhb*
  Enter value for directory (*  *) ?
  Enter logical name for FFDB (*  *) ? globin

          Input Entries: 98
         Output Entries: 98
       Excluded Entries: 0
           Total Length: 208508
                 Errors: 0

  Running indexing on database '/u/kayyagari/globin'
  Added database name mapping for 'globin' to user configuration file
  '/usr/users/kayyagari/.wp/dblogicals.conf'

OUTPUT

When run as shown in the example, DataSet+ writes seven files in your current working directory. globin_000.ref contains the documentation for all of the sequences specified by Genbank:humhb*. globin_000.seq contains the sequences. globin.names, globin.numbers, and globin.offset are index files used by GCG to find individual sequences in the database. globin.header provides GCG with information such as logical names, release dates, and formatting information. globin.seqcat is the definition file that is searched by the StringSearch program.

INPUT FILES

DataSet+ accepts as input multiple sequences of the same type. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example Genbank:*.

RELATED PROGRAMS

DataSet creates a GCG data library from any set of sequences in GCG format or from sequences in various flatfile formats (including GenBank, EMBL/SwissProt, and FastA). Entries in the data library can have sequences longer than 350,000 symbols.

Fetch copies GCG sequences or data files from the GCG database into your directory or displays them on your terminal screen.

Fetch+ copies GCG sequences or data files from the GCG database into your directory or displays them on your terminal screen.

StringSearch identifies sequences by searching for character patterns such as "globin" or "human" in the sequence documentation.

Names identifies GCG data files and sequence entries by name. It can show you what set of sequences is implied by any sequence specification.

FormatDB+ combines any set of GCG sequences into a database that you can search with BLAST+.

BLAST+ searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST+ can produce gapped alignments for the matches it finds.

FastA+ does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA+ may be more sensitive than BLAST+.

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST.

RESTRICTIONS

You cannot move or rename the database files unless you change the database name assignment appropriately in $HOME/.wp/dblogicals.conf.

The .offset, .names, and .numbers files are not ASCII text and cannot be viewed on your terminal screen or modified in any way.

The architecture of GCG data libraries requires that all of the files that make up a data library be in the same directory and share the same base name. These files are differentiated only by their filename extensions.

The location of a data library must be specified by a logical name. For personal data libraries, the -config command-line option makes DataSet+ add the needed logical name definition to the $HOME/.wp/dblogicals.conf file. By default, DataSet+ creates the new database in the current working directory. To create the database in another directory, add the parameter -dir=/full/path/to/the/directory/. This ensures that the database is correctly mapped in the configuration file.
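As an illustration of the options described above, the following Python sketch assembles a dataset+ argument list without running anything. It assumes that string parameters are written as -name=value, by analogy with the -infile=value and parameter=false forms shown in the command-line summary; the function name and the directory /path/to/dbdir are hypothetical.

```python
def dataset_plus_args(infile, logical, directory=None, config=True):
    """Build an argument list for dataset+ (nothing is executed here)."""
    # Assumption: string parameters take the form -name=value.
    args = ["dataset+", f"-infile={infile}", f"-logical={logical}"]
    if directory is not None:
        # Omitting -directory leaves the current working directory as default.
        args.append(f"-directory={directory}")
    # Boolean parameters: presence means true; false is written name=false.
    args.append("-config" if config else "-config=false")
    # -default asks the program to use default values where possible.
    args.append("-default")
    return args


# /path/to/dbdir is a placeholder directory, not a real location.
print(" ".join(dataset_plus_args("Genbank:humhb*", "globin", "/path/to/dbdir")))
```

If the list is later passed to a process launcher as separate arguments rather than through a shell, the asterisk in the sequence specification is not expanded by the shell.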

CONSIDERATIONS

The format of GCG database index files changed starting with version 8.0 of GCG. Personal databases that were created prior to version 8.0 must be converted to the new format. To make a database compatible with the current version of GCG, use the program DBIndex or DBIndex+ to create new index files for the database.

DATABASE NAMES

The sequences in your new personal database are meant to be accessed the same way as any other GCG database sequences. GCG recognizes that a sequence specification like Globin:humhbb is a database sequence specification by examining the logical name globin. If globin is assigned to a complete filename without a filename extension and if there are five files that start with that name and end with the extensions .ref, .seq, .offset, .numbers, and .names, then GCG assumes globin is a database and tries to find the entry humhbb.
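The file check described above can be approximated outside GCG. The Python sketch below is not GCG's own lookup code; it only tests whether, for a given base path, at least one file starts with that name and ends with each of the five extensions. The function names are hypothetical, and the demonstration uses empty placeholder files named after those in the example session.

```python
import glob
import os
import tempfile

REQUIRED_EXTENSIONS = (".ref", ".seq", ".offset", ".numbers", ".names")


def missing_extensions(base_path):
    """Return the required extensions with no file matching base_path*ext."""
    missing = []
    for ext in REQUIRED_EXTENSIONS:
        # The wildcard allows names such as globin_000.ref as well as globin.names.
        pattern = glob.escape(base_path) + "*" + ext
        if not glob.glob(pattern):
            missing.append(ext)
    return missing


def looks_like_database(base_path):
    """True when all five kinds of file are present for base_path."""
    return not missing_extensions(base_path)


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for filename in ("globin_000.ref", "globin_000.seq",
                     "globin.offset", "globin.numbers", "globin.names"):
        open(os.path.join(tmp, filename), "w").close()
    print(looks_like_database(os.path.join(tmp, "globin")))
```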

Usually, DataSet+ permanently assigns a name like globin for you when you run the program, but you can assign a database logical name yourself by adding a line like

globin = /usr/user/burgess/seq/globin

to the $HOME/.wp/dblogicals.conf file.

In the example session, a logical name globin is assigned by adding a logical name line to the file $HOME/.wp/dblogicals.conf. If that file did not exist before the session, DataSet+ would create a new one. Whenever you initialize GCG, this name globin is assigned correctly.
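To see which logical names a configuration file defines, a short script can read the name = path lines. The Python sketch below assumes the file contains only lines of the form shown above; any other syntax the file may support is not handled. Lowercasing the names is also an assumption, and the function name is hypothetical.

```python
def parse_dblogicals(text):
    """Return {logical_name: path} from lines of the form 'name = path'."""
    mapping = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # assumption: lines without '=' carry no mapping
        name, _, path = line.partition("=")
        name, path = name.strip(), path.strip()
        if name and path:
            # Assumption: logical names are compared without regard to case.
            mapping[name.lower()] = path
    return mapping


sample = "globin = /usr/user/burgess/seq/globin\n"
print(parse_dblogicals(sample)["globin"])
```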

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -check to view the summary below and to specify parameters before the program executes. In the syntax summary below, square brackets ([ and ]) enclose parameter values that are optional. For each program parameter, square brackets enclose the type of parameter value, the default parameter value, and any aliases (shortened forms of the parameter name). Programs with a plus in the name accept either the full parameter name or a specified alias. If “Type” is “Boolean”, the presence of the parameter on the command line indicates a true condition; a false condition must be stated as parameter=false.

NOTE: Default values for some parameters in a -check summary will come from the user’s system. This is true for -day, -month, and -year, as well as -mapfile.

DataSet+ creates a flatfile database from any set of sequences.

Minimal Syntax: % dataset+ [-infile=]value -Default

Minimal Parameters (case-insensitive):

-infile [Type: List / Default: EMPTY / Aliases: infile1 in]

Input file specification

Prompted Parameters (case-insensitive):

-directory [Type: String / Default: EMPTY / Aliases: dir]

Directory in which the FFDB will be created. The default is to use the current working directory.

-logical [Type: String / Default: EMPTY / Aliases: ln]

Logical name which will be used to refer to this FFDB

Optional Parameters (case-insensitive):

-check [Type: Boolean / Default: 'false' / Aliases: che help]

Prints out this usage message.

-default [Type: Boolean / Default: 'false' / Aliases: d def]

Specifies that sensible default values be used for all parameters where possible.

-documentation [Type: Boolean / Default: 'true' / Aliases: doc]

Prints banner at program startup.

-quiet [Type: Boolean / Default: 'false' / Aliases: qui]

Tells application to print only a minimal amount of information.

-mapfile [Type: String / Default: '$GCGROOT/etc/dbnamemap.conf']

Specifies the database name map file to use if the 'mapname' parameter is specified. The default value to use is the database name map file in the system configuration directory.

-mapname [Type: String / Default: EMPTY]

Specifies to look this entry up in the database name map file and use the directory and logical name settings in there. Those settings can still be overridden with command-line parameters.

-shortlogical [Type: String / Default: EMPTY / Aliases: sn short]

A shorter logical name to refer to this FFDB.

-relname [Type: String / Default: EMPTY / Aliases: reln]

Descriptive release name for this database.

-release [Type: String / Default: '1.0' / Aliases: rel]

Release number.

-year [Type: Integer / Default: '2004']

Year in which this database was released. The default is to use the current year.

-month [Type: Integer / Default: '11']

Month in which this database was released. The default is to use the current month.

-day [Type: Integer / Default: '30']

Day on which this database was released. The default is to use the current day.

-index [Type: Boolean / Default: 'true']

Specifies whether the FFDB should be indexed after the sequences have been added to it.

-force [Type: Boolean / Default: 'false' / Aliases: f]

Forces application to go through the post-processing steps even if no new sequences were added to the FFDB.

-config [Type: Boolean / Default: 'false']

Specifies that application will modify the user's config files to include a mapping for this FFDB logical name.

-pagelimit [Type: Integer / Default: '655360' / Aliases: page]

Specifies the maximum size of any FFDB page file in kilobytes.

-exclude [Type: String / Default: EMPTY / Aliases: exc]

Specifies to exclude all sequences whose name appears in the specified file name. The specified file should contain one name per line.

-dbformat [Type: String / Default: 'GCG' / Aliases: dbf]

Specifies the storage format to use for all sequence data in the FFDB pages. The default value allows for efficient storage of all sequence data types and possible compression through the encoding parameter. Note that some values of dbformat are incompatible with sequence types, in which case error messages will be displayed. Valid values are:

gcg: Native GCG format that can be used with all data

nbrf: Format used for PIR data only

data: Like GCG but slower

fasta: Sequence data is stored in FASTA format

-encoding [Type: String / Default: 'ASCII' / Aliases: enc]

Specifies the encoding format to use for the sequence data when dbformat=gcg. Valid values are:

ascii: Stores bases without linefeeds, spaces, or compression. Allows for faithful retrieval of sequence data. This is the most suitable for amino acid data.

2bit: Compresses each nucleotide base into two bits. This loses information such as base capitalization. Bases that aren't ACGT are lost.

4bit: Compresses each nucleotide base into 4 bits, allowing for storage of extended IUPAC characters.

fasta: Similar to ascii.

-mode [Type: String / Default: 'default']

Controls the behavior when the specified output FFDB already exists. Valid values are:

append: All sequences specified by infile are added to the FFDB.

overwrite: The existing FFDB is removed and replaced by all sequences specified by infile.

default: If the FFDB already exists, a fatal error message is printed.

-informat [Type: String / Default: EMPTY / Aliases: infmt]

The input format for STDIN, if applicable. Valid values are: GB GENPEPT FSA EMBL SPT SW RSF SSF.

-annotformat [Type: String / Default: EMPTY / Aliases: annotfmt]

The format in which annotation is stored in the database. Valid values are: GB GENPEPT EMBL SPT SW CODATA.

-summary [Type: Boolean / Default: 'true']

Specifies whether the application should print a summary of the processed sequences.

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. Aliases (shortened forms of the parameter name) are shown, separated by commas.

        -infile, -in, -infile1

Input file specification.

        -directory, -dir

Directory in which the FFDB will be created. The default is to use the current working directory.

        -logical, -ln

Logical name which will be used to refer to this FFDB.

        -check, -che, -help

Prints out the usage summary.

        -default, -d, -def

Specifies that sensible default values be used for all parameters where possible.

        -documentation, -doc

Prints banner at program startup (default). Skip banner with: -doc=false

        -quiet, -qui

This parameter is not supported.

        -mapfile

Specifies the database name map file to use if the 'mapname' parameter is specified. The default value to use is the database name map file in the system configuration directory.

        -mapname

Specifies to look this entry up in the database name map file and use the directory and logical name settings in there. Those settings can still be overridden with command-line parameters.

        -shortlogical, -short, -sn

A shorter logical name to refer to this FFDB.

        -relname, -reln

Descriptive release name for this database.

        -release, -rel

Release number.

        -year

Year in which this database was released. The default is to use the current year.

        -month

Month in which this database was released. The default is to use the current month.

        -day

Day on which this database was released. The default is to use the current day.

        -index

Specifies whether the FFDB should be indexed after the sequences have been added to it.

        -force, -f

Forces application to go through the post-processing steps even if no new sequences were added to the FFDB.

        -config

Specifies that application will modify the user's config files to include a mapping for this FFDB logical name.

        -pagelimit, -page

Specifies the maximum size of any FFDB page file in kilobytes.

        -exclude, -exc

Excludes all sequences whose names appear in the specified file. The file should contain one name per line.
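An exclude file in the one-name-per-line layout can be produced with a few lines of Python. This is only a sketch of preparing such a file: the function name and the file name exclude.txt are hypothetical, humhba4 is a made-up entry name used for illustration, and how DataSet+ compares names (for example, whether case matters) is not covered here.

```python
import os
import tempfile


def write_exclude_file(path, names):
    """Write one sequence name per line, skipping blanks and repeats."""
    seen = set()
    with open(path, "w", encoding="ascii") as handle:
        for name in names:
            name = name.strip()
            if name and name not in seen:
                seen.add(name)
                handle.write(name + "\n")
    return len(seen)


# Demonstration in a temporary directory; humhba4 is a hypothetical name.
with tempfile.TemporaryDirectory() as tmp:
    exclude_path = os.path.join(tmp, "exclude.txt")
    count = write_exclude_file(exclude_path, ["humhbb", "humhba4", "humhbb"])
    print(count, "names written")
```

The resulting file would then be named with the -exclude parameter when running DataSet+.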

        -dbformat, -dbf

Specifies the storage format to use for all sequence data in the FFDB pages. Valid values are:

gcg: Native GCG format that can be used with all data.

nbrf: Format used for PIR data only.

data: Like GCG but slower.

fasta: Sequence data is stored in FASTA format.

        -encoding, -enc

Specifies the encoding format to use for the sequence data when dbformat=gcg.

Valid values are:

ascii: Stores bases without linefeeds, spaces, or compression. Allows for faithful retrieval of sequence data. This is the most suitable for amino acid data.

2bit: Compresses each nucleotide base into two bits. This loses information such as base capitalization. Bases that aren't ACGT are lost.

4bit: Compresses each nucleotide base into 4 bits, allowing for storage of extended IUPAC characters.

fasta: Similar to ascii
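The trade-off behind the 2bit value can be illustrated with a generic two-bit packing scheme. The Python sketch below is not the on-disk layout used by GCG, which is not described here; the bit assignments and function names are assumptions chosen for illustration. It shows why capitalization does not survive and why only A, C, G, and T can be represented. Raising an error for other characters is a choice made for this sketch.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def pack_2bit(seq):
    """Pack a nucleotide string into bytes, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, ch in enumerate(seq):
        code = _CODE.get(ch.upper())  # upper() is where capitalization is lost
        if code is None:
            # Two bits give only four codes, so N, R, Y, etc. do not fit.
            raise ValueError(f"{ch!r} cannot be represented in two bits")
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_2bit(packed, length):
    """Recover an uppercase sequence of the given length."""
    return "".join(
        _BASE[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


original = "acgtACGT"
packed = pack_2bit(original)
print(len(original), "bases ->", len(packed), "bytes")
print(unpack_2bit(packed, len(original)))
```

A four-bit scheme works the same way with sixteen codes per symbol, which is enough for the extended IUPAC nucleotide characters at two bases per byte.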

        -mode

Controls the behavior when the specified output FFDB already exists. Valid values are:

append: All sequences specified by infile are added to the FFDB

overwrite: The existing FFDB is removed and replaced by all sequences specified by infile.

default: If the FFDB already exists, a fatal error message is printed

        -informat, -infmt

The input format for STDIN, if applicable. Valid values are: GB GENPEPT FSA EMBL SPT SW RSF SSF

        -annotformat, -annotfmt

The format in which annotation is stored in the database. Valid values are: GB GENPEPT EMBL SPT SW CODATA

        -summary

Writes a summary of the program's completion to the screen. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -summary=false.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: May 27, 2005 12:02

Technical Support: support-us@accelrys.com, support-japan@accelrys.com, or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio®, SeqLab®, SeqWeb®, SeqMerge®, GCG®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.