APPENDIX VII


Table of Contents

DATA FILES

OVERVIEW

VIEWING OR MODIFYING DATA FILES

RESTRICTION ENZYMES

PROTEOLYTIC ENZYMES AND REAGENTS

TRANSCRIPTION FACTOR DATABASE (TFD)

CODON FREQUENCY TABLES

TRANSLATION TABLES

SCORING MATRICES

PROTEIN ANALYSIS DATA FILES

PROSITE

PROFILES

VERSION 2.0 PROFILES


DATA FILES


This appendix contains descriptions of the following types of data files used by Accelrys GCG (GCG) programs:

 
 
    Restriction Enzymes                     Scoring Matrices
    Proteolytic Enzymes and Reagents        Protein Analysis Data Files
    Transcription Factor Database (TFD)     Codon Frequency Tables
    Translation Tables                      PROSITE
    Profiles                                Version 2.0 Profiles
 
 

OVERVIEW


Most GCG programs analyze nucleic acid or protein sequences stored in files or in sequence databases. Additionally, many programs require nonsequence information, or data files, which they use to analyze the sequences. For example, the nucleic acid mapping programs require two data files: enzyme.dat, which contains restriction enzyme names and their corresponding recognition sites; and translate.txt, which associates codons with their corresponding amino acids.

All programs that require a data file have a default file they use, so as a new user, you don't need to worry about supplying one. These default files are public data files. Public data files are located in the public directory with the logical name GenRunData and may be accessed by everyone who uses the package. When you run a program that requires a data file, it automatically finds the appropriate default file in this directory without you having to specify the directory and file name.

GCG also supplies alternative public data files you can have a program use instead of the default. These files are located in the directory with the logical name GenMoreData. There may be times when you want to use an alternative public data file rather than the default file. For example, if you're using the CodonPreference program to analyze a Drosophila sequence, you may want to use the alternative codon frequency table drosophila_high.cod, rather than the default table, ecohigh.cod, which is more appropriate for bacterial sequences.

In each of the following data file descriptions, we provide the names of the default data files used by programs as well as alternative public data files you can specify separately. You will find the following subtopics in each data file's description:

Default data file. You can find all default public data files in the directory with the logical name GenRunData.

Alternative data file. You can find alternative public data files in the directory with the logical name GenMoreData.

You also can create your own data file or personalize a public data file by copying it to your working directory and modifying it. These files are known as local data files. For instance, you could copy the restriction enzyme data file called enzyme.dat to your directory and delete all of the enzymes in it that are not available in your laboratory. Or, let's say you're working with the FindPatterns program and you create a data file of patterns specific to your research. This personal data file, then, would be available only to you.

VIEWING OR MODIFYING DATA FILES


To view a public data file online, use the TypeData program (for example, % typedata enzyme.dat). To copy a public data file to your working directory, use the Fetch program (for example, % fetch enzyme.dat). Then open the copy in the text editor of your choice to view it or modify it to your needs. For information on how to use an alternative data file with a program, see Section 4, Using Data Files in the User's Guide.


RESTRICTION ENZYMES


Function


Nucleotide mapping programs read the list of available restriction enzymes along with their recognition sites, cut positions, and overhangs from an enzyme data file.

Programs that use this file


Map, MapSort, and MapPlot.

Default data file


enzyme.dat

Alternative data files


None.

Format


Heading: An enzyme data file begins with an optional documentary heading. A divider of two adjacent periods (..) separates the heading from the enzymes.

Name: The first field on each line contains the name of the restriction enzyme; the name should have no more than 132 characters. Only one enzyme should appear per line.

Offset: The name is followed by an offset number, which tells the mapping programs where to cut the top strand when the recognition site is found.

Recognition site: The offset is followed by the enzyme recognition sequence. Nucleic acid recognition sequences, like all nucleotide sequences, are represented in 5' to 3' orientation. The recognition sequences should be shorter than 350 characters. They may contain any IUPAC-IUB alphabetic nucleotide character. See Appendix III of the Program Manual for a complete list of supported sequence symbols.

Nonsequence characters in the recognition site: Mapping programs read the offset and overhang fields to find out where each enzyme actually cuts, but the recognition sequences contain non-sequence characters to help humans see the cut points. An apostrophe (') indicates the cut point on the top strand; an underscore ( _ ) indicates the cut point on the bottom strand (when the enzyme does not leave a blunt end). These apostrophes and underscores are ignored by mapping programs and may therefore be absent.

Overhang: The fourth field in the list of enzymes tells the number of bases (positive or negative) from the cut point on the top strand to the point where the bottom strand is cut. A 0 (zero) would leave a blunt end; a 3 would give a 5' overhang of 3 bases; a -3 would leave a 3' overhang of 3 bases. If the recognition site is a palindrome, the overhang field is ignored. If the overhang field is absent or is a non-numeric character (? or . are most often used), the bottom strand is not searched.

Display of isoschizomers: The public file has a semicolon in front of all but one member of each family of isoschizomers. (Isoschizomers, in this context, are restriction endonucleases with the same recognition sequence.) Mapping programs normally ignore isoschizomers whose names are preceded by a semicolon. These isoschizomers are available if you select them individually by name or if you type ** in response to the enzyme prompt.

Isoschizomers, suppliers, and literature: Any information on the line to the right of an exclamation point (!) is documentary and is ignored by mapping programs. The documentary information on each record of the public file contains the names of other isoschizomers, if any are known, along with the commercial suppliers and literature references for the enzyme. (See "Restriction Enzyme Suppliers" and "Restriction Enzyme Literature" below.)

Format requirements: The exact column for each field on a line does not matter; only the order of the fields is important. Each field should be separated from all other fields on the same line by at least one blank space. Blank lines are tolerated. Most GCG programs ignore information to the right of an exclamation mark (!) so you can add comments to the data file.

Asymmetric recognition sequences: If the forward and reverse recognition sites are not the same, then there are two records, one showing the forward and the other the reverse strand. These records must be adjacent to one another in the enzyme file. (See BcgI for an example.) You can give several recognition sites with the same name, but you must put all entries with the same name on adjacent lines of the enzyme data file.
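
The record layout described above can be sketched as a small parser. This is only an illustration in Python, not GCG code: the names EnzymeRecord and parse_enzyme_line and the sample record are hypothetical. Only the field order (name, offset, recognition site, overhang), the leading semicolon, the ' and _ markers, and the ! comment rule are taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EnzymeRecord:
    name: str
    offset: int               # where to cut the top strand
    site: str                 # recognition sequence, ' and _ markers removed
    overhang: Optional[int]   # None when absent or non-numeric (e.g. ? or .)
    hidden: bool              # True when a semicolon precedes the name
    comment: str              # documentary text to the right of "!"


def parse_enzyme_line(line: str) -> Optional[EnzymeRecord]:
    """Parse one record from the body of an enzyme file (after the '..' divider)."""
    data, _, comment = line.partition("!")
    data = data.strip()
    hidden = data.startswith(";")
    if hidden:
        data = data[1:]
    fields = data.split()
    if len(fields) < 3:
        return None  # blank or incomplete line
    try:
        overhang = int(fields[3])
    except (IndexError, ValueError):
        overhang = None  # bottom strand would not be searched
    site = fields[2].replace("'", "").replace("_", "")
    return EnzymeRecord(fields[0], int(fields[1]), site, overhang,
                        hidden, comment.strip())


# Illustrative record only; check your own copy of enzyme.dat for real entries.
record = parse_enzyme_line("EcoRI  1  G'AATT_C  4  ! hypothetical comment")
```

Because the fields are split on blank space, the exact column of each field does not matter, which matches the format requirements above.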

Suggestions


You can put semicolons in front of the enzymes to which you do not have access so that they are not displayed when you create restriction enzyme maps.
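
If your local copy of the enzyme file is long, you could add the semicolons with a short script instead of by hand. The Python sketch below is one possible approach under stated assumptions: the function name hide_unavailable and the sample lines are hypothetical, and the first line containing two adjacent periods is assumed to be the divider that ends the heading.

```python
def hide_unavailable(lines, available):
    """Prefix a semicolon to every enzyme record whose name is not in `available`."""
    out = []
    in_body = False
    for line in lines:
        if not in_body:
            out.append(line)          # heading lines pass through unchanged
            if ".." in line:
                in_body = True        # assumption: first ".." is the divider
            continue
        fields = line.split("!")[0].split()
        if fields and not fields[0].startswith(";") and fields[0] not in available:
            line = ";" + line.lstrip()
        out.append(line)              # blank lines and hidden records are kept
    return out


# Hypothetical file contents, one string per line.
sample = [
    "My laboratory enzymes ..",
    "",
    "EcoRI 1 G'AATT_C 4 !",
    "XyzI 2 CC'GG 0 !",
]
edited = hide_unavailable(sample, {"EcoRI"})
```

Run a script like this only on a copy obtained with Fetch, never on the public file.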

Because GCG mapping programs using the default data file display only one member of each family of isoschizomers, these programs find all possible recognition sites but not all possible cut points. If you find an enzyme displayed near a point of interest, you might want to examine the enzyme file to see if another cut point is available.

Restriction Enzyme Suppliers


Many of the restriction enzymes displayed by GCG mapping programs are available commercially. The file enz_sources.txt shows the main suppliers of restriction enzymes together with the enzymes they make available. This file is for your information only; it is not read by any GCG program.

You can use Fetch to copy this file to your working directory and then search it with a text editor.

Restriction Enzyme Literature


Most of the restriction enzymes displayed by GCG mapping programs are described in the scientific literature. The citations for each enzyme are the numbers that appear last on each record of the enzyme data file enzyme.dat. You can find these citations in the file enz_refs.txt. This file is for your information only; it is not read by any GCG program.

You can use Fetch to copy this file to your working directory and then search it with a text editor.

Acknowledgments


Dr. Richard Roberts at New England Biolabs developed and maintains REBASE, the restriction enzyme database from which the enzyme data in GCG are drawn.


PROTEOLYTIC ENZYMES AND REAGENTS


Function


Peptide mapping programs read enzyme and reagent names, recognition patterns, and cut positions from an enzyme data file.

Programs that use this file


PeptideMap, MapSort, MapPlot, and PeptideSort.

Default data files

 
 
    Program                            Data file
 
 
    PeptideMap, MapSort, and MapPlot   proenzyme.dat
    PeptideSort                        proenzall.dat
 
 

Note: proenzall.dat is a more complete list of proteolytic agents; it contains several agents that cut at the same place.

Alternative data files


None.

Format


Heading: An enzyme data file begins with an optional documentary heading. A divider of two adjacent periods (..) separates the heading from the enzymes.

Name: The first field on each line contains the name of the enzyme. Only one enzyme should appear per line.

Offset: The name is followed by an offset number, which tells the mapping programs where to cut the peptide when the recognition pattern is found.

Cleavage site: The offset is followed by the enzyme recognition sequence. Recognition sequences, like all peptide sequences, are represented in amino -> carboxyl orientation. They may contain any standard amino acid character, but no ambiguity characters (B and Z). See Appendix III of the Program Manual for a complete list of supported sequence symbols.

Nonsequence characters in the recognition site: Mapping programs read the offset field to find out where each enzyme actually cleaves, but the recognition sequences contain non-sequence characters to help humans see the cleavage points. An apostrophe (') indicates the cut point. These apostrophes are ignored by mapping programs and may therefore be absent.

Overhang: The fourth field is the overhang, which is used only in nucleotide restriction enzyme data files. It has no function for proteolytic reagents.

Display of isoschizomers: Mapping programs normally ignore enzymes whose names are preceded by a semicolon (;). These enzymes are available if you select them individually by name or if you type ** in response to the enzyme prompt.

Documentation: Any information on the line to the right of an exclamation point (!) is documentary and is ignored by mapping programs.

Format requirements: The exact column for each field on a line does not matter; only the order of the fields is important. Each field should be separated from all other fields on the same line by at least one blank space. Blank lines are tolerated. Most GCG programs ignore information to the right of an exclamation mark (!) so you can use these marks to create comments within the data.

Multiple specificities: You may include more than one occurrence of an enzyme name if the enzyme has more than one specificity. All records with the same name must appear on adjacent lines of the enzyme data file. If you want to distinguish specificities (for instance, trypsin cutting when arginines are blocked), you can create a unique name that distinguishes trypsin cutting at lysine from trypsin cutting at arginine.
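
To see how an offset and a cleavage pattern combine, consider the following Python sketch. It is an illustration only, and it makes several assumptions that are not confirmed by this appendix: the offset is taken to be the number of residues from the start of the matched pattern to the cut point, the pattern is treated as a literal string of amino acid symbols, and the function names and the example agent (a pattern of K with an offset of 1) are hypothetical.

```python
def cut_positions(peptide, pattern, offset):
    """Return the cut points (as string indices) for one pattern in a peptide."""
    pattern = pattern.replace("'", "")    # the apostrophe is for human readers only
    cuts = []
    start = peptide.find(pattern)
    while start != -1:
        cut = start + offset              # assumption: offset counts from match start
        if 0 < cut < len(peptide):
            cuts.append(cut)
        start = peptide.find(pattern, start + 1)
    return cuts


def fragments(peptide, cuts):
    """Split the peptide at the given cut points."""
    edges = [0] + sorted(set(cuts)) + [len(peptide)]
    return [peptide[a:b] for a, b in zip(edges, edges[1:])]


# Hypothetical agent that cuts on the carboxyl side of every lysine (K).
cuts = cut_positions("AKGRKA", "K'", 1)
pieces = fragments("AKGRKA", cuts)
```

Records that share a name (multiple specificities) could be handled by collecting the cut points from each record before calling fragments once.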

Suggestions


You can put semicolons in front of all the enzymes and reagents that you do not have access to or that you do not want to use. GCG programs will ignore those enzymes and reagents.

GCG programs PeptideMap, MapSort, and MapPlot search for every point of specific cleavage but not every cleavage pattern. PeptideSort tries to identify each known single-digest cleavage pattern. Send us suggestions for other specificities and cleavage patterns that you think these files should include.


TRANSCRIPTION FACTOR DATABASE (TFD)


Function


This data file provides a list of the recognition sequences for eukaryotic sequence-specific transcription factors from the Transcription Factor Database (TFD).

Programs that use this file


FindPatterns. (Map, MapSort, and MapPlot can also read this file.)

Default data file


None.

Alternative data files


tfsites.dat

Format


Heading: tfsites.dat begins with an optional documentary heading. A divider of two adjacent periods (..) separates the heading from the transcription sites.

Name: The first field on each line contains the name of the site; the name should have no more than 132 characters. Only one site should appear per line.

Offset: The name is followed by an offset number, which tells programs where to mark the top strand when the recognition site is found.

Recognition site: The offset is followed by the recognition sequence. Nucleic acid recognition sequences are represented in 5' to 3' orientation. The recognition sequences should be shorter than 350 characters. They may contain any IUPAC-IUB alphabetic nucleotide character. See Appendix III of the Program Manual for a complete list of supported sequence symbols.

Overhang: The fourth field should be set to zero to signal that both strands should be searched.

Display of isoschizomers: The public file has a semicolon (;) in front of frequently found sites. Mapping programs normally do not display sites whose names are preceded by a semicolon. If you want to use any of these sites, use the Fetch program to copy tfsites.dat to your working directory and use a text editor to remove the semicolons you want.

Literature: Any information on the line to the right of an exclamation point (!) is documentary and is ignored by mapping programs. The documentary information on each record of the public file contains a common name as well as a literature reference for the site.

Format requirements: The exact column for each field on a line does not matter; only the order of the fields is important. Each field should be separated from all other fields on the same line by at least one blank space. Blank lines are tolerated. Most GCG programs ignore information to the right of an exclamation mark (!) so you can add comments to the data file.

Suggestions


You can use Fetch to copy tfsites.dat to your working directory and then rename it pattern.dat. FindPatterns will then read it automatically and use it as the default data file.

Also note that you should always search both strands (FindPatterns does this by default) as most transcription factor sites are strand specific.

Acknowledgments


Dr. David Ghosh developed and maintains TFD.


CODON FREQUENCY TABLES


Function


Codon frequency tables reflect the known codon preferences of an organism.

Programs that use these tables


BackTranslate, CodonPreference, and Frames.

Default data file


ecohigh.cod

Alternative data files

drosophila_high.cod

human_high.cod

maize_high.cod

yeast_high.cod

celegans_high.cod

celegans_low.cod

Format


Heading: A codon frequency table begins with an optional documentary heading. A divider of two adjacent periods (..) separates the heading from the table. For example:
 
 

AmAcid  Codon  Number     /1000     Fraction  ..
 
Gly     GGG    13.00       1.89      0.02

AmAcid: The first field of information on each line of the table contains a three-letter code for an amino acid.

Codon: The second field contains an unambiguous codon for that amino acid.

Number: The third field lists the number of occurrences of that codon in the genes from which the table is compiled.

/1000: The fourth field lists the expected number of occurrences of that codon per 1,000 codons in genes whose codon usage is identical to that compiled in the codon frequency table.

Fraction: The last field contains the fraction of occurrences of the codon in its synonymous codon family.

Each field of information is separated from every other field by at least one blank space.
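
The relationship between the Number, /1000, and Fraction fields can be worked through in a few lines of Python. This sketch follows the field definitions above; it is not the CodonFrequency program, and the function name complete_table and the codon counts are invented for illustration.

```python
from collections import defaultdict


def complete_table(counts):
    """Derive the /1000 and Fraction fields from raw codon counts.

    `counts` maps (three-letter amino acid code, codon) to the Number field.
    Returns rows of (AmAcid, Codon, Number, per-1000, Fraction).
    """
    total = sum(counts.values())              # all codons in the table
    family = defaultdict(float)               # total per synonymous codon family
    for (amino_acid, _), number in counts.items():
        family[amino_acid] += number
    rows = []
    for (amino_acid, codon), number in sorted(counts.items()):
        per_thousand = 1000.0 * number / total
        fraction = number / family[amino_acid]
        rows.append((amino_acid, codon, number, per_thousand, fraction))
    return rows


# Invented counts: 100 glycine codons and 100 alanine codons.
counts = {
    ("Gly", "GGG"): 10, ("Gly", "GGA"): 30,
    ("Gly", "GGT"): 40, ("Gly", "GGC"): 20,
    ("Ala", "GCG"): 100,
}
for row in complete_table(counts):
    print("%-6s %-5s %8.2f %8.2f %8.2f" % row)
```

In this invented table, GGG occurs 10 times among 200 codons, so its /1000 value is 50, and it accounts for 10 of the 100 glycine codons, so its Fraction is 0.10.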

Suggestions


You can use the CodonFrequency program to create a codon frequency table from a set of input nucleotide sequences and/or existing codon frequency tables. You also can create or modify a codon frequency table with a text editor. If you choose to use a text editor, you need provide only the first three fields of information on each line of the table. The lines can be in any order; only codons whose use is greater than zero need be present. You should then generate the complete codon usage table -- five fields of information, one line for each codon, and all lines ordered by amino acid -- by using the table you created as the input to the CodonFrequency program.


TRANSLATION TABLES


Function


Translation tables are used by GCG programs for three purposes:

1. To define the relationships between codons and amino acids.

2. To define the relationships between one-letter and three-letter amino acid codes.

3. To identify potential start codons and stop codons.

Programs that use these tables


BackTranslate, CodonFrequency, CodonPreference, Diverge, Frames, Map, MapPlot, MapSort, Publish, Reformat, and Translate.

Default data file


translate.txt

Alternative data files

 
 
Data file             Function

transmitodros.txt     Drosophila mitochondrial translation table
transl_table_02.txt   vertebrate mitochondrial translation table
transl_table_03.txt   yeast mitochondrial translation table
transl_table_04.txt   mold, protozoan, and coelenterate mitochondrial
                      and Mycoplasma/Spiroplasma translation table
transl_table_05.txt   invertebrate mitochondrial translation table
transl_table_06.txt   ciliate, dasycladacean, and Hexamita translation table
transl_table_09.txt   echinoderm mitochondrial translation table
transl_table_10.txt   euplotid translation table
transl_table_11.txt   bacterial translation table
transl_table_12.txt   alternative yeast translation table
transl_table_13.txt   ascidian mitochondrial translation table
transl_table_14.txt   flatworm mitochondrial translation table
transl_table_15.txt   Blepharisma mitochondrial translation table
transl_table_16.txt   chlorophycean mitochondrial translation table
transl_table_21.txt   trematode mitochondrial translation table
transl_table_22.txt   Scenedesmus obliquus mitochondrial translation table
transl_table_23.txt   Thraustochytrium aureum mitochondrial translation table
 
 

To specify an alternative translation data file, add the parameter -TRANSlate=filename.txt on the command line.

Format


Heading: A translation table begins with an optional documentary heading. A divider of two adjacent periods (..) separates the heading from the table. For example:

Symbol  3-letter  Codons            !IUPAC  ..

A       Ala       GCT GCC GCA GCG   !GCX

Symbol: The first field of information on each line of the table is a single-letter amino acid sequence symbol.

3-letter: The second field is the three-letter amino acid code for that sequence symbol.

Codons: The third field must contain a list of all unambiguous codons for the amino acid; this list must come before the exclamation point (!).

!IUPAC: In the fourth field, the exclamation point delimits where the unambiguous codons end and where the ambiguous codons start. The ambiguous codons are provided for documentary purposes only and are completely ignored by GCG programs. Each field is separated from every other field by at least one blank space. Any of the 31 GCG sequence symbols (see Appendix III of the Program Manual) may be associated with a three-letter code and one or more unambiguous codons. Each codon and each sequence symbol may be used only once.
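
The table layout can be read into a codon lookup with a short parser. The Python sketch below is illustrative rather than a description of how GCG programs read the file: the function names, the sample table (including the row for the stop symbol and its three-letter code End), and the use of X for codons that are not in the table are all assumptions.

```python
def parse_translation_table(text):
    """Return (codon -> one-letter symbol, symbol -> three-letter code)."""
    _, _, body = text.partition("..")          # skip the documentary heading
    codon_to_symbol, three_letter = {}, {}
    for line in body.splitlines():
        fields = line.split("!")[0].split()    # drop the documentary !IUPAC part
        if len(fields) < 3:
            continue                           # blank or incomplete line
        symbol, name, codons = fields[0], fields[1], fields[2:]
        three_letter[symbol] = name
        for codon in codons:
            key = codon.upper()
            if key in codon_to_symbol:         # each codon may be used only once
                raise ValueError("codon %s is listed more than once" % key)
            codon_to_symbol[key] = symbol
    return codon_to_symbol, three_letter


def translate(sequence, codon_to_symbol, unknown="X"):
    """Translate complete codons in frame 1; `unknown` is a hypothetical placeholder."""
    sequence = sequence.upper()
    return "".join(codon_to_symbol.get(sequence[i:i + 3], unknown)
                   for i in range(0, len(sequence) - 2, 3))


# Hypothetical three-row table in the layout described above.
SAMPLE = """Symbol 3-letter Codons !IUPAC ..
A Ala GCT GCC GCA GCG !GCX
M Met ATG !ATG
* End TAA TAG TGA !TAR TRA
"""
table, names = parse_translation_table(SAMPLE)
protein = translate("GCTATGTAA", table)
```

Raising an error on a repeated codon mirrors the rule that each codon may be used only once.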

Output


Potential start codons are written only in lowercase letters. Stop codons are translated as the asterisk (*) symbol.


SCORING MATRICES


(formerly Symbol Comparison Tables)

Function


Many sequence comparison programs make comparisons between pairs of sequence symbols by looking up a value in a scoring matrix. The matrix assigns an integer value for the match quality of every possible pair of symbols. If you are comparing nucleotides, the matrix might contain 1's for matching symbols and 0's (zeros) for mismatching symbols. However, if you are comparing amino acids, a number could be assigned that is based on chemical similarity or evolutionary distance. The number might be negative if two residues were very dissimilar.

Programs that use these files


BestFit, Compare, FastA, FrameAlign, FrameSearch, Gap, GapShow, PileUp, PlotSimilarity, Pretty, Prime, ProfileMake, Repeat, Segments, StemLoop, TFastA, and the Consensus operation (in the Edit menu) in the Editor mode of SeqLab.

Default data files


For nucleotides:

 
 
    Program         Default data file

    BestFit         swgapdna.cmp
    Compare         compardna.cmp
    FastA           fastadna.cmp
    Gap             nwsgapdna.cmp
    GapShow         swgapdna.cmp or nwsgapdna.cmp
    PileUp          pileupdna.cmp
    PlotSimilarity  plotsimdna.cmp
    Pretty          prettydna.cmp
    Prime           prime.cmp
    ProfileMake     profiledna.cmp
    Repeat          repeatdna.cmp
    Segments        segdna.cmp
    StemLoop        stemloop.cmp
 
 

For proteins:


All analysis programs, except FastA and TFastA, use blosum62.cmp as the default data file. FastA and TFastA use blosum50.cmp. The Consensus operation (in the Edit menu) in the Editor mode of SeqLab uses identpep.cmp.

Alternative data files for nucleotides


To specify an alternative scoring matrix file, add the parameter -MATRix=filename.txt on the command line.

Global alignments with Segments, ProfileGap, and ProfileSegments


By default, Segments creates local alignments, analogous to those created by BestFit. You can direct Segments to create global alignments, analogous to those created by Gap, by using the command-line parameter -WHOle. Segments then uses the scoring matrix seggapdna.cmp, containing no negative values for mismatches.

ProfileGap and ProfileSegments can be directed to create global alignments by using the command-line parameter -GLObal. If you want to create global alignments using these programs, you might want to create the profile in ProfileMake using the alternative scoring matrix profilegapdna.cmp.

randomdna.cmp


This matrix is most appropriate for programs creating local alignments (BestFit, Segments, ProfileGap, and ProfileSegments). Since all mismatches between IUPAC-IUB nucleotide symbols are given a value of -3 and all matches are given a value of +10, local alignments created using this matrix will be extended further than those created with any of the default scoring matrices for these programs.

Alternative data files for proteins


To specify an alternative scoring matrix file, add the parameter -MATRix=filename.txt on the command line.

BLOSUM matrices


GCG provides a set of BLOSUM matrices for the comparison of peptide sequences. These matrices, derived from substitutions observed in more than 2,000 blocks of aligned sequences (Henikoff, S. and Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences USA 89; 10915-10919), are provided as alternative peptide scoring matrices in the files blosum30.cmp, blosum35.cmp, blosum40.cmp, blosum45.cmp, blosum55.cmp, blosum60.cmp, blosum65.cmp, blosum70.cmp, blosum75.cmp, blosum80.cmp, blosum85.cmp, blosum90.cmp, and blosum100.cmp. The set is completed by blosum50.cmp and blosum62.cmp, which are the default scoring matrices for some analysis programs in GCG.

pam120.cmp and pam250.cmp


These matrices are the log odds form of the mutation data matrix for 120 PAMs and 250 PAMs, respectively (Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. [1979] in Atlas of Protein Sequence and Structure, Dayhoff, M. O. Ed, pp. 345-352 (Figure 84), National Biomedical Research Foundation, Washington D.C.).

structgappep.cmp


This matrix, described by Risler et al. (Journal of Molecular Biology 204; 1019-1029), is derived from an analysis of amino acid substitutions after superposition of homologous protein structures. To construct this matrix, the authors counted only substitutions whose alpha carbon atoms are very close to one another after superposition of the structures. Based on results from test alignments using Gap and BestFit, the authors suggest that this scoring matrix may prove superior to others in finding weak similarities in distantly related proteins.

oldpep.cmp


An alternative peptide scoring matrix in the file oldpep.cmp can be provided to GCG programs as a local data file. This matrix was derived from the default peptide scoring matrix in Version 8 of GCG. Each value in the Version 8 matrix of floating point values was multiplied by 10 and rounded to the nearest integer to determine the comparison values in oldpep.cmp. Perfect matches in oldpep.cmp have a comparison value of 15, and no matches in the matrix have a higher value than perfect matches.

Format


A scoring matrix file consists of a documentary heading, a dividing line with two adjacent periods (..), an optional auxiliary data block that specifies the default gap creation and extension penalties associated with the scoring matrix, and the matrix itself. GCG nucleotide and amino acid symbols are described in Appendix III of the Program Manual.

GCG programs can use two different types of scoring matrices: BLAST format and GCG format.

BLAST-format scoring matrices

If you have a native BLAST-format scoring matrix, for example BLOSUM62, it can be used directly by GCG programs without converting it to GCG format. However, one advantage to converting native BLAST-format scoring matrices to GCG format is that you can explicitly set gap creation and gap extension penalties within the file (see "Auxiliary Data Block: Setting Gap Creation and Extension Penalties" below).

GCG-format scoring matrices

GCG-formatted scoring matrices can be of two forms: rectangular or "equals." You can use either form of scoring matrix with GCG programs with no difference in program performance or results.


Rectangular scoring matrices. The rectangular form organizes the sequence symbols along an x axis (columns) and y axis (rows), where each symbol along the x axis is compared with each symbol along the y axis. There is a row and column for every sequence symbol that has at least one non-zero comparison value. The value of each pair of symbols compared is placed at the intersection of the appropriate row and column. All relationships that are not explicitly defined in the matrix are assigned a value of 0. Every comparison value is separated from every other value by at least one blank space. Blank lines are tolerated.

Consider the example below:
 
 

 
   A  B  C  D  E  F  G  H ...
A  4 -2  0 -2 -1 -2  0 -2
B -2  6 -3  6  2 -3 -1 -1
C  0 -3  9 -3 -4 -2 -3 -3
D -2  6 -3  6  2 -3 -1 -1
E -1  2 -4  2  5 -3 -2  0
F -2 -3 -2 -3 -3  6 -3 -1
G  0 -1 -3 -1 -2 -3  6 -2
H -2 -1 -3 -1  0 -1 -2  8
 ...
 

The intersection of row D with column D has a value of 6, which represents an identical match for a D-D pairwise comparison. A pairwise comparison between non-identical symbols is often given a lower value; for example, a C-D comparison is -3.


Notice that the values are identical at the C-D comparison and at the D-C comparison: -3. Previous versions of the package supported only triangular forms of scoring matrices to eliminate this repetition. However, to make publicly available scoring matrices, which are in a rectangular format, easier to use, GCG now supports rectangular-format rather than triangular-format scoring matrices. See "Converting Scoring Matrices" later in this section for converting pre-Version 9 scoring matrices to the new format.
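
The rectangular layout maps naturally onto a lookup keyed by a pair of symbols. The following Python sketch reads the body of a rectangular matrix (the part after the heading and any auxiliary data block) and returns 0 for any pair that is not defined, as described above. The function names are hypothetical, and the sample uses only the upper left corner of the example matrix shown above.

```python
def parse_rectangular(body):
    """Read a rectangular matrix body into a {(row symbol, column symbol): value} dict."""
    rows = [line.split() for line in body.splitlines() if line.strip()]
    columns = rows[0]                       # first non-blank line names the columns
    matrix = {}
    for row in rows[1:]:
        row_symbol, values = row[0], row[1:]
        for column_symbol, value in zip(columns, values):
            matrix[(row_symbol, column_symbol)] = int(value)
    return matrix


def score(matrix, a, b):
    """Look up a pairwise value; undefined relationships are assigned 0."""
    return matrix.get((a, b), 0)


# Upper left corner of the example matrix shown above.
BODY = """
   A  B  C  D
A  4 -2  0 -2
B -2  6 -3  6
C  0 -3  9 -3
D -2  6 -3  6
"""
matrix = parse_rectangular(BODY)
```

With this sample, score(matrix, "C", "D") and score(matrix, "D", "C") both return -3, which is the repetition that the triangular form was designed to eliminate.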

Equals-form scoring matrices. The second form of GCG-format scoring matrix supported is "equals" form, so named because within the matrix, each pairwise comparison equals a value. For instance, in the example below, an A-A symbol comparison is assigned, or equals, a value of 4.
 
 

AA=      4      AB=     -2      AD=     -2      AE=     -1      AF=     -2
AH=     -2      AI=     -1      AK=     -1      AL=     -1      AM=     -1
AN=     -2      AP=     -1      AQ=     -1      AR=     -1      AS=      1
AW=     -3      AX=     -1      AY=     -2      AZ=     -1      BB=      6
BC=     -1      BD=      6      BE=      2      BF=     -3      BG=     -1
 ...
 

All relationships that are not explicitly defined in the matrix are assigned a value of 0. Every comparison value is separated from every other value by at least one blank space. Blank lines are tolerated. Some people find the equals form of scoring matrix easier to read than the rectangular form.
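
Because undefined relationships are assigned 0, an equals-form matrix can omit zero-valued pairs without changing its meaning (in the example above, no AC or AG entry appears). The Python sketch below illustrates reading and writing this form. It is not the Reformat program: the function names, the five-entries-per-line layout, and the column widths are assumptions modeled on the example above.

```python
import re

# Two symbols, an equals sign, optional blanks, then an integer value.
PAIR = re.compile(r"(\S)(\S)=\s*(-?\d+)")


def parse_equals(body):
    """Read an equals-form matrix body into a {(symbol, symbol): value} dict."""
    return {(a, b): int(value) for a, b, value in PAIR.findall(body)}


def format_equals(matrix, per_line=5):
    """Write non-zero pairs in equals form, `per_line` entries on each line."""
    cells = ["%s%s=%7d" % (a, b, value)
             for (a, b), value in sorted(matrix.items()) if value != 0]
    lines = ["      ".join(cells[i:i + per_line])
             for i in range(0, len(cells), per_line)]
    return "\n".join(lines)


# First line of the example above.
entries = parse_equals("AA=      4      AB=     -2      AD=     -2")
text = format_equals({("A", "A"): 4, ("A", "B"): -2, ("A", "C"): 0})
```

Parsing the output of format_equals returns the same non-zero pairs that went in, so no information is lost by leaving out the zeros.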

Auxiliary Data Block: Setting Gap Creation and Extension Penalties


You can specify gap creation and gap extension penalties within a scoring matrix to ensure that programs reading the scoring matrix use those values as defaults. If you do not specify these penalties, the program calculates reasonable defaults based on the values in the matrix.

Gap creation and gap extension penalties must follow a specific format within a scoring matrix. These penalties must appear in an auxiliary data block, which appears after the dividing line with the two adjacent periods (..) and before the line of sequence symbols in the scoring matrix, as shown below:
 
 

 ..
 
 {
 GAP_CREATE 12
 GAP_EXTEND 4
 }
 
   A  B  C  D  E  F  G  H ...
 

If you create your own scoring matrix, or if you modify an existing one, you must maintain this format for specifying gap creation and extension penalties.


Note that even though gap creation and extension penalties may be set within a scoring matrix, you can override them on the command line. To do so, use the parameters -GAPweight and -LENgthweight on the command line when you run a program that uses scoring matrices.
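
If you maintain your own scoring matrices, you may want to check which penalties a file sets before you run a program. The Python sketch below extracts GAP_CREATE and GAP_EXTEND from an auxiliary data block laid out as shown above. The function name and the sample file text are hypothetical, and the sketch assumes the block is the first pair of braces after the divider.

```python
def read_gap_penalties(text):
    """Return the penalties found in the auxiliary data block, or {} if there is none."""
    _, _, body = text.partition("..")       # the block follows the '..' divider
    start, end = body.find("{"), body.find("}")
    penalties = {}
    if start == -1 or end == -1 or end < start:
        return penalties                    # the program would calculate defaults
    for line in body[start + 1:end].splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ("GAP_CREATE", "GAP_EXTEND"):
            penalties[fields[0]] = int(fields[1])
    return penalties


# Hypothetical scoring matrix file with a heading, a divider, and a data block.
SAMPLE = """My scoring matrix, for illustration only ..

 {
 GAP_CREATE 12
 GAP_EXTEND 4
 }

   A  B
A  4 -2
B -2  6
"""
penalties = read_gap_penalties(SAMPLE)
```

An empty result means the file sets no penalties, in which case the program calculates its own defaults from the values in the matrix.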

Suggestions

Creating new scoring matrices


Use the CompTable program to create scoring matrices. You also can use a text editor to create a scoring matrix; if you do so, use the Reformat program with the command-line parameter -COMparison to rewrite the file into GCG format. Both CompTable and Reformat round the values in the matrix to the nearest integer.

Modifying existing scoring matrices


Several programs may use the same default scoring matrix. However, although the matrices may be identical, the default matrix for each program is contained in a separate file. This allows you to modify a local version of the matrix for one program without affecting the matrix used by another program.

If you make modifications to a matrix, use the Reformat program with the command-line parameter -COMparison to rewrite your scoring matrix data file into GCG format.

Converting scoring matrices

Converting pre-Version 9 scoring matrices to the new format


In Version 9, all scoring matrices provided with the package in GenRunData and GenMoreData have already been converted to the new format. However, you must convert all of the scoring matrices in your personal directories, including your personal directory with the logical name MyData, to the new rectangular format. When you do so, you must specify whether the scoring matrix is nucleotide or protein. GCG programs will not accept pre-Version 9 scoring matrices, and they display the following error message if you try to use one:
 
 

*** ERROR, READSCOREMAT cannot read the scoring matrix in the file
 "filename"!
 
If this is a scoring matrix created before Version 9,
try converting it with "% reformat /OLDCMPformat /PROtein" or
                       "% reformat /OLDCMPformat /NUCleotide"

To convert pre-Version 9 scoring matrices to the new format, type

% Reformat -OLDCMPformat -NUCleotide scoring_matrix

or

% Reformat -OLDCMPformat -PROtein scoring_matrix

Converting scoring matrices to make them more readable


GCG programs can accept two forms of GCG-format scoring matrix files: rectangular and "equals." There is no difference in analysis or performance between the forms. However, some people find "equals" format easier to read, and the package provides a way to convert between the two forms.

To convert rectangular scoring matrices to the more readable "equals" format, type

% Reformat -COMParison -EQUALSformat scoring_matrix

To convert "equals" format scoring matrices to rectangular format, type

% Reformat -COMParison scoring_matrix

Converting BLAST-format scoring matrices to GCG-format


GCG also works with native BLAST-formatted scoring matrices. Although converting BLAST-formatted scoring matrices to GCG-format is unnecessary, you may find it useful to do so, because GCG-formatted scoring matrices let you specify gap creation and extension penalties within the scoring matrix file.

To convert BLAST-formatted scoring matrices to GCG-format, type

% Reformat -COMParison -NUCleotide scoring_matrix

or

% Reformat -COMParison -PROtein scoring_matrix 


PROTEIN ANALYSIS DATA FILES


Function


These data files enable programs to locate motifs in protein sequences and to make predictions about peptide isolation, secondary structure, hydrophobicity, and antigenicity.

Programs that use these tables


PeptideSort, Isoelectric, PepPlot, HelicalWheel, CoilScan, SPScan, and HTHScan.

Default data file

 
 
 
 
 
Program         Default data file       Function

PeptideSort     aminoacid.dat           amino acid residue properties

                extinctcoef.dat         extinction coefficients for amino acids

                isoelectric.dat         residue-specific pK values for the
                                        prediction of a peptide's isoelectric
                                        point

Isoelectric     isoelectric.dat         residue-specific pK values for the
                                        prediction of a peptide's isoelectric
                                        point

PepPlot         pepplot.dat             residue-specific values for the
                                        prediction of protein secondary
                                        structure, hydrophobicity, and helical
                                        hydrophobic moment

                ges.dat                 residue-specific values for identifying
                                        nonpolar transbilayer helices

                garnier.dat             residue-specific values for secondary
                                        structure prediction using the method
                                        of Garnier

HelicalWheel    helicalwheel.dat        residue-specific attributes for the
                                        display of a peptide sequence as a
                                        helical wheel

CoilScan        mtidkcoils.dat          weight matrix of amino acid coiled-coil
                                        propensities

SPScan          speuk.dat               weight matrix for eukaryotic signal
                                        peptides

                spgpos.dat              weight matrix for Gram-positive
                                        bacterial signal peptides

                spgneg.dat              weight matrix for Gram-negative
                                        bacterial signal peptides

HTHScan         htharac.dat             weight matrix for AraC family H-T-Hs

                hthlysr.dat             weight matrix for LysR family H-T-Hs

                hthhomeobox.dat         weight matrix for Homeobox family
                                        H-T-Hs
 
 

Alternative data files

 CoilScan     mtkcoils.dat

 

 

Format


All data files consist of an optional documentary heading, a dividing line with two adjacent periods (..), and the data. The exact column for each field on a line does not matter; only the order of the fields is important. Each field should be separated from all other fields on the same line by at least one blank space. 


PROSITE


Function


You can search protein sequences for motifs that are represented in the PROSITE Dictionary of Protein Sites and Patterns.

Programs that use this file

Motifs

Default data file

prosite.patterns

Alternative data files

None.

Format


The format of GCG pattern files is described in the documentation for programs that use these files.

The exact column for each field on a line does not matter; only the order of the fields is important. Each field should be separated from the other fields on the same line by at least one blank space. Blank lines are tolerated. Most GCG programs ignore information to the right of an exclamation mark (!), so you can use these marks to create comments within the data. You cannot edit prosite.patterns unless your text editor can handle very large records.

Heading: This data file has an optional documentary heading, followed by a dividing line with two adjacent periods (..).

Name: The first field on each line contains the name of the motif; the name should have no more than 132 characters. Motifs prefixed by a semicolon ( ; ) are short patterns that are expected to occur in most protein sequences by chance alone. The Motifs program does not display such frequently found patterns unless you run it with the command-line parameter -FREquent. Only one motif should appear per line.

Offset: The name is followed by an offset number, which tells Motifs where to mark the sequence when the motif expression is found.

Pattern: Patterns should be shorter than 350 characters. They may contain any alphabetic amino acid character. See Appendix III of the Program Manual for a complete list of supported sequence symbols.

For a complete description of the syntax in which motifs are represented, see the topic DEFINING PATTERNS in Motifs in the Program Manual.

Note that some motifs require multiple patterns to identify them. If this is so, these patterns will have the same name and must appear on adjacent lines.

PDoc: The fourth field gives the name of the PROSITE abstract for the pattern. You can copy this file to your directory with the Fetch command, or you can display it with the TypeData command.
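The field layout described above (name, offset, pattern, PDoc) can be read with a short script. The following Python sketch is an illustration only, not GCG code: the names MotifEntry and parse_pattern_file are hypothetical, the sample entries and PDoc names are invented, and the sketch assumes the divider is the first line containing two adjacent periods and that a semicolon prefix is attached directly to the motif name.

```python
from typing import NamedTuple

class MotifEntry(NamedTuple):
    name: str
    offset: int
    pattern: str
    pdoc: str
    frequent: bool  # True when the name carried a ";" prefix

def parse_pattern_file(text):
    """Parse the data part of a pattern file into MotifEntry records."""
    lines = text.splitlines()
    # The heading is optional: skip it only if a ".." divider exists.
    divider = next((i for i, line in enumerate(lines) if ".." in line), -1)
    entries = []
    for line in lines[divider + 1:]:
        # Ignore everything to the right of an exclamation mark.
        data = line.split("!", 1)[0].strip()
        if not data:
            continue  # blank lines are tolerated
        fields = data.split()  # column position does not matter
        name, offset, pattern = fields[0], int(fields[1]), fields[2]
        pdoc = fields[3] if len(fields) > 3 else ""
        entries.append(MotifEntry(name.lstrip(";"), offset, pattern, pdoc,
                                  name.startswith(";")))
    return entries

# Invented entries, for illustration only; not taken from prosite.patterns.
SAMPLE_PATTERNS = """Invented heading text ..

Example_Motif    1   GKST   Pdoc_example_a   ! made-up entry
;Short_Motif     1   NG     Pdoc_example_b
"""

for entry in parse_pattern_file(SAMPLE_PATTERNS):
    print(entry)
```

Motifs that need several patterns appear as adjacent entries with the same name, so a caller could group consecutive records by name.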

Suggestions


prosite.seqcat contains a short description of each motif in prosite.patterns. Use the Fetch command to copy the prosite.seqcat file to your directory or use the TypeData command to view the file online.

The use of Motifs is so straightforward that there are few occasions when you will need to modify this file.

Acknowledgments


Dr. Amos Bairoch of the University of Geneva publishes and maintains the PROSITE Dictionary of Protein Sites and Patterns. PROSITE is distributed by the European Bioinformatics Institute in Cambridge, England.


PROFILES


Function


This database contains validated profiles derived from the motifs in the PROSITE Dictionary of Protein Sites and Patterns.

Programs that use this file


ProfileScan

Default data file


profilescan.fil

Alternative data file


oldprofilescan.fil

Format


Heading: The optional heading documents the contents of each column. A divider of two adjacent periods (..) separates the heading from the profiles.

Name: The first column contains the location and name of each profile (see SUGGESTIONS below). These names correspond to the names of the patterns in the prosite.patterns file. The profile name must contain fewer than 255 characters.

High and Intrst: By default, ProfileScan reports only alignments with normalized scores greater than the HIGH value. If you add the -INTEResting parameter to the command line, ProfileScan will report alignments that score higher than the INTRST value.

Gap and Len: These values specify, respectively, the gap creation and extension penalties used to align the motif profile to the query sequence.

A, B, C, AVE, and SD: These values specify the parameters for length-dependent normalization of the alignment scores. See ProfileSearch in the Program Manual for a description of the derivation of these values and their use in normalizing the alignment scores.
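The reporting rule described under "High and Intrst" can be written as a small predicate. This Python sketch is illustrative only: the function name should_report and the numeric values are invented, and the score is assumed to be already normalized.

```python
def should_report(score, high, intrst, interesting=False):
    """Decide whether an alignment would be reported.

    By default the normalized score must exceed the HIGH value; with the
    -INTEResting parameter the lower INTRST value is used instead.
    """
    threshold = intrst if interesting else high
    return score > threshold

# Invented values, for illustration only.
print(should_report(4.0, high=5.0, intrst=3.0))
print(should_report(4.0, high=5.0, intrst=3.0, interesting=True))
```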

Suggestions


Individual profile files are maintained in the directory with the logical name ProfileDir. To view a profile's documentation, use the Fetch command to copy a profile file to your directory (for example, % fetch apple.prf), or use the TypeData command to view the file online.

Acknowledgments


Dr. Michael Gribskov of the San Diego Supercomputing Center prepared and validated these profiles. Dr. Amos Bairoch of the University of Geneva publishes and maintains the PROSITE Dictionary of Protein Sites and Patterns.


VERSION 2.0 PROFILES


Function


This database contains validated profiles derived from the motifs in the PROSITE Dictionary of Protein Sites and Patterns. Profiles are a special kind of scoring matrix used by several different programs. The addition of MEME and MotifSearch to GCG required the introduction of a new format of profile that allows multiple profiles to be kept in one file.

Programs that use these files


MEME generates version 2.0 profiles, while MotifSearch is intended to process them. ProfileSearch, ProfileGap and ProfileSegments can all read ONLY THE FIRST profile from a version 2.0 file.

Default data file


Not Applicable

Alternative data file


Not Applicable

Format


Heading: The file should begin with a line containing either "!!AAPROFILE 2.0" or "!!NAPROFILE 2.0". Thereafter, you may include any information you like, concluding the heading section with a divider of two adjacent periods (..).

Auxiliary Data Block: The ADB begins with a line having nothing but a "{", and ends with a line having nothing but a "}". These MUST appear in the first column of their respective lines.


The ADB must contain four parsable data lines. The first gives the length of the profile (sometimes thought of as its width), in the form "Length: <value>". The next two lines control the gap creation and extension penalties for the profile, and the fourth gives the labels of the columns used in the profile. Separate the column labels with blank spaces. The first label should always be "Cons" (for Consensus), and it must appear at the beginning of the line, with no indentation.

Here is an example of a simple ADB, with some of the column labels replaced by an ellipsis:

{
  Length: 9
  Gap: 1.00              Len: 1.00
  GapRatio: 0.0          LenRatio: 0.0
Cons   A      C      D      E      F    . . .      W      Y   Gap  Len
}

The ADB may contain any number of "Comment" lines, each indicated by a "!" in the first column.


Profile: The profile itself is made up of rows of log-odds values, with each row corresponding to a position in the profile and (with three exceptions) each column corresponding to a valid symbol for that position. The exceptions are the first column, which contains a letter identifying the consensus symbol for the row, and the last two columns, which give the multiplying factors for the gap creation and extension penalties for the row. (Note that MEME's output profiles are always ungapped, and thus always have 100, the maximum value, in the last two columns.) The last row in a profile does NOT correspond to a position in the profile. Instead, it contains counts of the number of appearances of each letter at any position in the sequences from which the profile was derived. No program currently uses this information, but the row must nonetheless be present. Note that this dummy row is NOT included in the Length count given in the Auxiliary Data Block.
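The pieces of the format described above (header line, divider, Auxiliary Data Block, position rows, and trailing counts row) can be tied together in a short reader. The following Python sketch is a hedged illustration, not GCG code. The function name parse_first_profile and the sample profile are invented, including the "*" that starts the counts row and the number of values on that row. The sketch reads a single profile only, because how additional profiles are delimited within one file is not described here.

```python
import re

def parse_first_profile(text):
    """Parse one version 2.0 profile held in a string."""
    lines = text.splitlines()
    tags = ("!!AAPROFILE 2.0", "!!NAPROFILE 2.0")
    if not lines or not any(tag in lines[0] for tag in tags):
        raise ValueError("missing version 2.0 profile header line")

    # Heading ends at the first later line containing "..".
    divider = next(i for i, line in enumerate(lines) if i > 0 and ".." in line)
    # "{" and "}" must sit alone in the first column of their lines.
    start = next(i for i in range(divider + 1, len(lines))
                 if lines[i].rstrip() == "{")
    end = next(i for i in range(start + 1, len(lines))
               if lines[i].rstrip() == "}")

    params, labels = {}, []
    for line in lines[start + 1:end]:
        if line.startswith("!"):
            continue  # comment line inside the ADB
        if line.startswith("Cons"):
            labels = line.split()
        else:
            for key, value in re.findall(r"(\w+):\s*(\S+)", line):
                params[key] = float(value)

    rows = [line.split() for line in lines[end + 1:] if line.strip()]
    counts_row = rows.pop()  # the final row is not a profile position
    if len(rows) != int(params["Length"]):
        raise ValueError("row count does not match Length")
    positions = [(row[0], [float(v) for v in row[1:]]) for row in rows]
    return {"labels": labels, "params": params,
            "positions": positions, "counts": counts_row}

# Invented two-position profile, for illustration only.
SAMPLE_PROFILE = """!!NAPROFILE 2.0
Invented example profile ..
{
! comment line inside the ADB
  Length: 2
  Gap: 1.00              Len: 1.00
  GapRatio: 0.0          LenRatio: 0.0
Cons   A      C      G      T   Gap  Len
}
 A    12     -3     -3     -3   100  100
 C    -3     12     -3     -3   100  100
 *     1      1      0      0
"""

profile = parse_first_profile(SAMPLE_PROFILE)
print(profile["labels"])
print(profile["positions"])
```

The length check mirrors the rule that the dummy counts row is excluded from the Length value.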
 
 

Printed: May 27, 2005  11:36




Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

www.accelrys.com/bio