CORRUPT


Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

OUTPUT

INPUT FILES

RELATED PROGRAMS

RESTRICTIONS

CONSIDERATIONS

SUGGESTIONS

COMMAND-LINE SUMMARY

LOCAL DATA FILES

PARAMETER REFERENCE


FUNCTION

Corrupt randomly introduces small numbers of substitutions, insertions, and deletions into nucleotide sequence(s).

DESCRIPTION

Corrupt uses a random number generator to add errors to nucleotide sequences. You can set the number of substitutions and the number of length errors independently. A length error is either an insertion or a deletion; the two are collectively called indels in the literature of mathematical biology. The position of each error is picked at random within the range and on the strand that you choose. The length of each indel is chosen at random between one and the maximum indel size. For an insertion, the symbols added are also chosen at random.
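
The indel step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not Corrupt's actual code: the function name apply_indels, the equal chance of an insertion or a deletion, and the uniform choice of inserted bases are all assumptions made for the example.

```python
import random

BASES = "ACGT"  # assumption: uppercase, unambiguous nucleotide symbols only


def apply_indels(seq, n_indels, max_indel=3, rng=None):
    """Return (new_seq, net_length_change, log) after n_indels random length errors."""
    rng = rng or random.Random()
    symbols = list(seq)
    net = 0
    log = []
    for _ in range(n_indels):
        # Indel length is drawn from one to the maximum indel size.
        size = rng.randint(1, max_indel)
        # Assumption: insertions and deletions are equally likely.
        if rng.random() < 0.5 or not symbols:
            pos = rng.randint(0, len(symbols))  # insert before this 0-based index
            added = [rng.choice(BASES) for _ in range(size)]
            symbols[pos:pos] = added
            net += size
            log.append(("inserted", "".join(added), pos + 1))
        else:
            pos = rng.randrange(len(symbols))
            removed = symbols[pos:pos + size]  # may be shorter near the end
            del symbols[pos:pos + size]
            net -= len(removed)
            log.append(("removed", "".join(removed), pos + 1))
    return "".join(symbols), net, log
```

Calling apply_indels("ACGT" * 50, 3) returns the altered sequence, the net length change, and a list of events that plays the same role as the InDels record in Corrupt's output file.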

The output file contains a complete record of the errors introduced. The actual number of substitutions may be smaller than the number you requested, since about one in four substitutions replaces a base with the same base and so does not change the sequence. The output file also shows the net length added (or subtracted) when all of the indels are taken together. The current time is used to seed the random number generator, so each run of Corrupt yields different results.

If you give Corrupt a single input sequence, you can choose the range, strand, and output file name. Otherwise, Corrupt uses the top strand of the whole sequence and names the output file with the sequence's name followed by the file name extension .corrupt.

EXAMPLE

Here is a session using Corrupt to corrupt the first 200 bases of gamma.seq:

 
 
% corrupt
 
  Corrupt what sequence(s) ?  gamma.seq
 
                    Begin (* 1 *) ?
                  End (* 11375 *) ?  200
                 Reverse (* No *) ?
 
  How many substitutions do you want (* 1 *) ?  3
 
  How many length errors do you want (* 1 *) ?  3
 
  What should I call the output file (* gamma.corrupt *) ?
 
%

OUTPUT

The file gamma.corrupt would contain the corrupted contents of the first 200 symbols in gamma.seq. Here is the output from this session:

 
 
!!NA_SEQUENCE 1.0
 CORRUPT of: gamma.seq  check: 6474  from: 1  to: 200
 
Human fetal beta globins G and A gamma
from Shen, Slightom and Smithies,  Cell 26; 191-203.
Analyzed by Smithies et al. Cell 26; 345-353.
 
 Substitutions:  G at 188,  T at 115,  G at 170,
 
        InDels:  C inserted at 161,  TCA removed  at 116,  G inserted at 16,
 
 InDels: 3   Substitutions: 3   MaxIndel: 3
 
 Actual substitutions: 3  Length change from indels: -1
 
gamma.corrupt  Length: 199  August 20, 1998 13:02  Type: N  Check: 2187  ..
 
       1  GGATCCTAGA TATTCGCTTA GTCTGAGGAG GAGCAATTAA GATTCACTTG
 
      51  TTTAGAGGCT GGGAGTGGTG GCTCACGCCT GTAATCCCAG AATTTTGGGA
 
     101  GGCCAAGGCA GGCAGTCCTG AGGTCAAGAG TTCAAGACCA ACCTGGCCAA
 
     151  CATGGTGACA ATCCCATCGC TACAAAAATA CAAAAAGTAG ACAGGCATG
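
The length figures in this listing can be checked by hand: two single-base insertions and one three-base deletion give a net change of 1 - 3 + 1 = -1, so the 200-symbol range becomes 199 symbols. The short Python sketch below repeats that arithmetic; the function name is purely illustrative and is not part of Corrupt.

```python
def corrupted_length(range_length, indel_sizes):
    """Length after applying indels; insertions are positive, deletions negative."""
    return range_length + sum(indel_sizes)


# From the trace above: C inserted (+1), TCA removed (-3), G inserted (+1).
indel_sizes = [+1, -3, +1]
print(sum(indel_sizes))                    # -1
print(corrupted_length(200, indel_sizes))  # 199
```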
 

INPUT FILES

Corrupt accepts a single nucleotide sequence or multiple nucleotide sequences as input. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenBank:*. If Corrupt rejects your nucleotide sequence, turn to Appendix VI to see how to change or set the type of a sequence.

RELATED PROGRAMS

Sample extracts sequence fragments randomly from sequence(s). You can set a sampling rate to determine how many fragments Sample extracts. Shuffle randomizes the order of the symbols in a sequence without changing the composition. You can enter sequences from the keyboard or from a digitizer.

RESTRICTIONS

Corrupt only works on nucleotide sequences. Contact us if you would like to have it upgraded to also work with proteins. The output is renumbered to start at one.

If an indel is longer than 250 nucleotides, only the first 250 nucleotides of the indel are shown in the output file.

CONSIDERATIONS

Corrupt makes the substitutions first, followed by the insertions and deletions. The substitution algorithm is this: one of the four bases is chosen at random and placed at a randomly chosen position in the sequence. Because the chosen base may be the same as the one already there, about one in four substitutions, on average, will not change the nucleotide.
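
A minimal Python sketch of the substitution step shows why the actual count falls short of the requested count. It is an illustration only, not Corrupt's source: the function name apply_substitutions and the uppercase ACGT alphabet are assumptions, and a change is counted relative to the base present at the moment of the substitution.

```python
import random

BASES = "ACGT"  # assumption: uppercase, unambiguous nucleotide symbols only


def apply_substitutions(seq, n_subs, rng=None):
    """Return (new_seq, actual) where actual counts substitutions that changed a base."""
    rng = rng or random.Random()
    symbols = list(seq)
    actual = 0
    for _ in range(n_subs):
        pos = rng.randrange(len(symbols))  # random position in the sequence
        base = rng.choice(BASES)           # random base, independent of the old one
        if symbols[pos] != base:
            actual += 1                    # only about three in four draws differ
        symbols[pos] = base
    return "".join(symbols), actual
```

With a large number of requested substitutions, the ratio of actual to requested substitutions should settle near 0.75, and the sequence length never changes.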

SUGGESTIONS

If you introduce many indels, the changes can be hard to follow. The best way we know of to reconstruct a corruption is to start with the original sequence. You can use Gap to display the original and corrupted sequences next to one another.
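
For a quick look outside the package, the same side-by-side idea can be roughed out with Python's standard difflib module. This is a stand-in under stated assumptions, not a substitute for Gap: SequenceMatcher performs a plain text comparison with no scoring matrix or gap penalties, and the function name list_differences is hypothetical.

```python
from difflib import SequenceMatcher


def list_differences(original, corrupted):
    """Return (tag, position, original_text, corrupted_text) for each differing span.

    The position is 1-based and refers to the original sequence.
    """
    matcher = SequenceMatcher(None, original, corrupted, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tags are 'replace', 'delete', or 'insert'
            diffs.append((tag, i1 + 1, original[i1:i2], corrupted[j1:j2]))
    return diffs


for diff in list_differences("ACGTACGT", "ACGTTACGT"):
    print(diff)
```

The spans reported this way need not match the events recorded in Corrupt's trace exactly, since several different edits can produce the same corrupted sequence.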

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % corrupt [-INfile=]gamma.seq -Default
 
Prompted Parameters: (for single sequences only)
 
-BEGin=1 -END=11375       sets the range of interest
-REVerse                  uses the back strand of nucleotide sequences
[-OUTfile=]gamma.corrupt  specifies the output file name
 
Other Prompted Parameters:
 
-SUBstitutions=1          sets the number of substitutions to introduce
-INDels=1                 sets the number of length errors to introduce
 
Local Data Files: None
 
Optional Parameters:
 
-MAXindel=3               sets the size of maximum insertion/deletion
-NOTRAce                  suppresses the record of errors in the output file
-EXTension=.corrupt       sets the output file name extension
-LIStfile[=corrupt.list]  writes a list file of output sequence names
-NOMONitor                suppresses screen monitor (of input sequence
                            names)
-NOSUMmary                suppresses the screen summary

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-SUBstitutions=1

Specifies the number of character substitutions to introduce.

-INDels=1

Sets the number of insertions and deletions (length errors) to introduce.

-MAXindel=3

Sets the maximum size of an insertion or deletion. The maximum is three unless you change it with this parameter.

-NOTRAce

Normally Corrupt writes a complete record in the output file of each substitution, insertion, and deletion. You can suppress this information with -NOTRAce.

-EXTension=.corrupt

This program normally creates output file names by using the original input file name for the base name and the program name for the name extension. Use this parameter to specify some other file name extension.

-LIStfile=corrupt.list

Writes a list file with the names of the output sequence files. This list file is suitable for input to other Accelrys GCG (GCG) programs that support list files (see Section 2, Using Sequence Files and Databases, in the User's Guide). If you don't specify a file name, then Corrupt makes one up using corrupt for the file name and .list for the file name extension.

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

-SUMmary

Writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: May 27, 2005  11:59



Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

www.accelrys.com/bio