SAMPLE

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

FUNCTION

Sample randomly extracts fragments from one or more sequences. You can set a sampling rate to control how many fragments Sample extracts.

DESCRIPTION

Sample is a validation tool we use to extract small random samples of sequence data. It uses a random number generator to extract fragments of constant length from a random position within a sequence. You can set the length of the extracted fragments. You can also set the sampling rate: at 1 in 10, for example, Sample extracts a fragment from every 10th sequence in the set of sequences you have specified.

The output is a set of sequence files, each containing a single fragment. Each file documents where its fragment came from. The current time is used to seed the random number generator, so each run of Sample should yield different results.
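The extraction step described above can be sketched in Python. This is a minimal illustration, not the program's actual source: the function name extract_fragment, the use of Python's random module, and the 1-based inclusive coordinates (chosen to match the start and end positions that Sample reports in its output files) are all assumptions.

```python
import random
import time


def extract_fragment(sequence, length, rng):
    """Return (start, end, fragment) with 1-based inclusive coordinates.

    Hypothetical helper: a sketch of constant-length random extraction.
    """
    if length > len(sequence):
        raise ValueError("fragment length exceeds sequence length")
    # Pick a 0-based start so the whole fragment fits inside the sequence.
    start0 = rng.randint(0, len(sequence) - length)
    return start0 + 1, start0 + length, sequence[start0:start0 + length]


# Seeding from the clock mirrors the documented behavior: runs differ.
rng = random.Random(time.time())
start, end, fragment = extract_fragment("ACGT" * 300, 300, rng)
print(start, end, len(fragment))
```

Passing a generator seeded with a fixed value instead of the clock would make a run repeatable, which can be useful when testing.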

If you give Sample a single input sequence, you can choose the range, strand, and output file name. Otherwise, Sample uses the top strand of the whole sequence and names the output file with the sequence name followed by the file name extension .sample. For a single input sequence, you can choose to extract more than one sequence fragment. In this case, the output files are named with the sequence name and the number of the extracted fragment, followed by the file name extension .sample (e.g., ecoompa1.sample, ecoompa2.sample, ...).
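The naming rule above can be illustrated with a short Python sketch. The function output_names and its arguments are hypothetical and only restate the rule as described; they are not part of Sample.

```python
def output_names(sequence_name, fragments=1, extension=".sample"):
    """Build output file names following the rule described above.

    Hypothetical helper: one fragment gives name + extension; several
    fragments give name + fragment number + extension.
    """
    if fragments == 1:
        return [f"{sequence_name}{extension}"]
    return [f"{sequence_name}{i}{extension}" for i in range(1, fragments + 1)]


print(output_names("hihi0043"))
print(output_names("ecoompa", fragments=2))
```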

EXAMPLE

Here is a session using Sample to extract a sample of 300-base-pair fragments from every 100th bacterial sequence in GenBank:

% sample

  Sample from what sequence(s) ?  Bacterial:*

  Extract fragments of what length (* 300 *) ?

  Sample one in every how many sequences (* 1 *) ?  100

       Gb_BA:AB000222   Len:   2,558

       Gb_BA:AB001637   Len:   1,677

      ///////////////////////////////

       GB_BA:YK16SRRN Len:  1,495

         GB_BA:ZMOFRK  Len:   1,080

 SAMPLE complete with:

       Input sequences: 45,946

      Output sequences: 459

       Fragment length: 300

              Reversed: 0

          Not Reversed: 459

         Sampling Rate: 1 in 100

   Output files called: "*.sample"

OUTPUT

Each .sample output file contains a 300-base-pair fragment from a bacterial sequence. Here is the first one:

!!NA_SEQUENCE 1.0

 (300 bp) SAMPLE of: hihi0043  check: 5029  from: 1  to: 1065

 starting at position: 122  ending at position: 421

ID   HIHI0043   standard; DNA; PRO; 1065 BP.

AC   L44687; L42023;

NI   g1004185

DT   04-OCT-1995 (Rel. 45, Created)

DT   04-OCT-1995 (Rel. 45, Last updated, Version 1)

DE   Haemophilus influenzae Rd predicted coding region HI0043 gene, . . .

hihi0043.sample  Length: 300  October 5, 1998 13:12  Type: N  Check: 5029  ..

       1  CTCAACTTGA ACAAGCATTG AAACCAAAAT CCAGTTTTAG AAAAACTTTA

      51  TTAAAATTTA CTGCACTTTT ATTTGGCTTG GCGACGGTTG CGCAATCCGT

     101  GCAGTGGATT TGGGATAGCT ATCAAAAACA TCAATGGATT TATCTTGCTT

     151  TTGCTTTAGT CAGTTTGATT ATCATTTTAT TGGGTATTAA AGAGATTATT

     201  TGTGAGTGGC GACGTTTAGT TCGTTTAAAA AAACGTGAGC AATGGCAACA

     251  ACAAAGTCAG CAGATTTGGT TAGAAAGTGC GGTAAAAAAT GGTGATGTTT

INPUT FILES

Sample accepts a single sequence or multiple sequences as input. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenBank:*. The function of Sample depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.
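The type check described above can be sketched in Python. This is an illustration under stated assumptions: the function sequence_type is hypothetical, and it assumes that the text heading ends at the first line terminated by two periods (..), as in the example output file shown in this document.

```python
import re


def sequence_type(file_text):
    """Return 'N', 'P', or None for the text of a sequence file.

    Hypothetical helper: looks for 'Type: N' or 'Type: P' on the last
    heading line, assumed to be the first line that ends with '..'.
    """
    for line in file_text.splitlines():
        if line.rstrip().endswith(".."):
            match = re.search(r"Type:\s*([NP])\b", line)
            return match.group(1) if match else None
    return None


example = (
    "ID   HIHI0043   standard; DNA; PRO; 1065 BP.\n"
    "hihi0043.sample  Length: 300  October 5, 1998 13:12  Type: N  Check: 5029  ..\n"
    "       1  CTCAACTTGA ACAAGCATTG\n"
)
print(sequence_type(example))
```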

RELATED PROGRAMS

Corrupt randomly introduces small numbers of substitutions, insertions, and deletions into nucleotide or protein sequence(s). Shuffle randomizes the order of the symbols in a sequence without changing the composition.

RESTRICTIONS

If you give Sample more than one sequence as input, Sample extracts only one fragment from any particular sequence. The sequences are not chosen at random; only the position of the fragment within each chosen sequence is random. For a sampling rate of 1 in 100, Sample extracts a fragment from the first sequence after every 100 sequences that is longer than the set fragment length. Contact us if you would like Sample to sample in some other way.
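One possible reading of this selection rule is sketched below in Python. The function choose_sequences is hypothetical, and the interpretation is an assumption: every time the running count of input sequences reaches a multiple of the sampling rate, the next sequence strictly longer than the fragment length (which may be that very sequence) is chosen. The actual program may count differently.

```python
def choose_sequences(lengths, rate, fragment_length):
    """Return 1-based positions of the sequences that would be sampled.

    Hypothetical helper illustrating a deterministic 1-in-`rate` rule.
    """
    chosen = []
    due = False
    for index, length in enumerate(lengths, start=1):
        if index % rate == 0:
            due = True  # a sample is owed once every `rate` sequences
        if due and length > fragment_length:
            chosen.append(index)
            due = False
    return chosen


# 45,946 sequences, all long enough, sampled at 1 in 100.
print(len(choose_sequences([1000] * 45946, 100, 300)))  # 459
```

Under this reading, the count agrees with the 459 output sequences reported for 45,946 input sequences in the example session.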

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.
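The capitalization convention can be made concrete with a small Python sketch. The helpers required_prefix and matches are hypothetical, and the matching rule is an assumption: a typed name is accepted if it is a case-insensitive prefix of the full parameter name and is at least as long as the capitalized part.

```python
def required_prefix(spec):
    """Return the leading capital letters, e.g. 'SAMplingrate' -> 'SAM'."""
    count = 0
    for ch in spec:
        if not ch.isupper():
            break
        count += 1
    return spec[:count]


def matches(typed, spec):
    """True if `typed` is an acceptable abbreviation of `spec` (assumed rule)."""
    typed, full = typed.lower(), spec.lower()
    return len(typed) >= len(required_prefix(spec)) and full.startswith(typed)


print(required_prefix("SAMplingrate"))
print(matches("sam", "SAMplingrate"), matches("sa", "SAMplingrate"))
```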

Minimal Syntax: % sample [-INfile=]ba:* -Default

Prompted Parameters:

-LENgth=300              extracts fragments of length 300

Prompted Parameters: (for single sequences)

-BEGin=1 -END=11375      sets the range of interest

-REVerse                 uses the reverse strand (for nucleotides)

-SAMplingrate=100        extracts 100 fragments from the input sequence

[-OUTfile=]gamma.sample  names the output file

Prompted Parameters: (multiple sequences only)

-SAMplingrate=100        extracts fragments from 1 in every 100 sequences

Local Data Files:  None

Optional Parameters:

-BOTHstrands            selects from both strands (nucleotide sequences only)

-EXTension=.sample      sets the default output file name extension

-LIStfile[=sample.list] writes a list file of output sequence names

-VALidate               displays details for each sampling action

-NOMONitor              suppresses screen monitor of input sequence names

-NOSUMmary              suppresses the screen summary

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-LENgth=300

Specifies the length of extracted fragments.

-SAMplingrate=100

Sets the sampling rate for the specified set of sequences to 1 in 100.

-BOTHstrands

Sample normally extracts fragments only from the top strand of nucleotide sequences. With this parameter, it selects fragments randomly from both strands.
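The effect of choosing between strands can be sketched in Python. The functions reverse_complement and pick_strand are hypothetical, and two details are assumptions: each strand is chosen with equal probability, and only the symbols A, C, G, and T are complemented (any other symbols pass through unchanged).

```python
import random

# Translation table for the four unambiguous nucleotide symbols.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(fragment):
    """Return the reverse-strand reading of a nucleotide fragment."""
    return fragment.translate(COMPLEMENT)[::-1]


def pick_strand(fragment, both_strands, rng):
    """Return (fragment, reversed_flag); reverses only if both_strands is set."""
    if both_strands and rng.random() < 0.5:
        return reverse_complement(fragment), True
    return fragment, False


print(reverse_complement("AACG"))  # CGTT
print(pick_strand("AACG", False, random.Random()))
```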

-EXTension=.sample

This program normally creates output file names by using the original input file name for the base name and the program name for the name extension. Use this parameter to specify some other file name extension.

-LIStfile=sample.list

Writes a list file with the names of the output sequence files. This list file is suitable for input to other Accelrys GCG (GCG) programs that support list files (see Section 2, Using Sequence Files and Databases, in the User's Guide). If you don't specify a file name, Sample makes one up using sample for the file name and .list for the file name extension.

-VALidate

Displays the location of each sample on your screen (name of sequence sampled, beginning and end coordinates and strand of sample taken, etc.).

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

-SUMmary

Writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: May 27, 2005 14:23

Technical Support: support-us@accelrys.com, support-japan@accelrys.com, or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.