APPENDIX VII

Table of Contents

DATA FILES

OVERVIEW

VIEWING OR MODIFYING DATA FILES

RESTRICTION ENZYMES

PROTEOLYTIC ENZYMES AND REAGENTS

TRANSCRIPTION FACTOR DATABASE (TFD)

CODON FREQUENCY TABLES

TRANSLATION TABLES

SCORING MATRICES

PROTEIN ANALYSIS DATA FILES

PROSITE

PROFILES

VERSION 2.0 PROFILES

DATA FILES

This appendix contains descriptions of the following types of data files used by Accelrys GCG (GCG) programs:

    Restriction Enzymes                     Scoring Matrices

    Proteolytic Enzymes and Reagents        Protein Analysis Data Files

    Transcription Factor Database (TFD)     Codon Frequency Tables

    Translation Tables                      PROSITE

    Profiles                                Version 2.0 Profiles

OVERVIEW

Most GCG programs analyze nucleic acid or protein sequences stored in files or in sequence databases. Additionally, many programs require nonsequence information, or data files, which they use to analyze the sequences. For example, the nucleic acid mapping programs require two data files: enzyme.dat, which contains restriction enzyme names and their corresponding recognition sites; and translate.txt, which associates codons with their corresponding amino acids.

All programs that require a data file have a default file they use, so as a new user, you don't need to worry about supplying one. These default files are public data files. Public data files are located in the public directory with the logical name GenRunData and may be accessed by everyone who uses the package. When you run a program that requires a data file, it automatically finds the appropriate default file in this directory without you having to specify the directory and file name.

GCG also supplies alternative public data files you can have a program use instead of the default. These files are located in the directory with the logical name GenMoreData. There may be times when you want to use an alternative public data file rather than the default file. For example, if you're using the CodonPreference program to analyze a Drosophila sequence, you may want to use the alternative codon frequency table drosophila_high.cod, rather than the default table, ecohigh.cod, which is more appropriate for bacterial sequences.

In each of the following data file descriptions, we provide the names of the default data files used by programs as well as alternative public data files you can specify separately. You will find the following subtopics in each data file's description:

Default data file. You can find all default public data files in the directory with the logical name GenRunData.

Alternative data file. You can find alternative public data files in the directory with the logical name GenMoreData.

You also can create your own data file or personalize a public data file by copying it to your working directory and modifying it. These files are known as local data files. For instance, you could copy the restriction enzyme data file called enzyme.dat to your directory and delete all of the enzymes in it that are not available in your laboratory. Or, let's say you're working with the FindPatterns program and you create a data file of patterns specific to your research. This personal data file, then, would be available only to you.

VIEWING OR MODIFYING DATA FILES

To view a public data file online, use the TypeData program, for example, % typedata enzyme.dat. To copy a public data file to your directory, use the Fetch program, for example, % fetch enzyme.dat. Then open the copy in the text editor of your choice to view it or to modify it to your needs. For information on how to use an alternative data file with a program, see Section 4, Using Data Files in the User's Guide.

RESTRICTION ENZYMES

Function

Nucleotide mapping programs read the list of available restriction enzymes along with their recognition sites, cut positions, and overhangs from an enzyme data file.

Programs that use this file

Map, MapSort, and MapPlot.

Default data file

enzyme.dat

Alternative data files

None.

Format

Heading: An enzyme data file begins with an optional documentary heading. A divider of two adjacent periods (..) separates the heading from the enzymes.

Name: The first field on each line contains the name of the restriction enzyme; the name should have no more than 132 characters. Only one enzyme should appear per line.

Offset: The name is followed by an offset number, which tells the mapping programs where to cut the top strand when the recognition site is found.

Recognition site: The offset is followed by the enzyme recognition sequence. Nucleic acid recognition sequences, like all nucleotide sequences, are represented in 5' to 3' orientation. The recognition sequences should be shorter than 350 characters. They may contain any IUPAC-IUB alphabetic nucleotide character. See Appendix III of the Program Manual for a complete list of supported sequence symbols.

Nonsequence characters in the recognition site: Mapping programs read the offset and overhang fields to find out where each enzyme actually cuts, but the recognition sequences contain non-sequence characters to help humans see the cut points. An apostrophe (') indicates the cut point on the top strand; an underscore ( _ ) indicates the cut point on the bottom strand (when the enzyme does not leave a blunt end). These apostrophes and underscores are ignored by mapping programs and may therefore be absent.

Overhang: The fourth field in the list of enzymes tells the number of bases (positive or negative) from the cut point on the top strand to the point where the bottom strand is cut. A 0 (zero) would leave a blunt end; a 3 would give a 5' overhang of 3 bases; a -3 would leave a 3' overhang of 3 bases. If the recognition site is a palindrome, the overhang field is ignored. If the overhang field is absent or is a non-numeric character (? or . are most often used), the bottom strand is not searched.
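The overhang rules above can be summarized in a few lines of Python. This is an illustrative sketch only, not part of GCG; the function name describe_overhang is hypothetical, and it takes the overhang field as raw text exactly as it would appear in the file.

```python
def describe_overhang(field):
    """Classify the end an enzyme leaves, given its raw overhang field."""
    # An absent or non-numeric field (commonly "?" or ".") means the
    # bottom strand is not searched.
    try:
        overhang = int(field)
    except (TypeError, ValueError):
        return "bottom strand not searched"
    if overhang == 0:
        return "blunt end"
    if overhang > 0:
        return f"5' overhang of {overhang} bases"
    return f"3' overhang of {-overhang} bases"


for raw in ("0", "3", "-3", "?", None):
    print(repr(raw), "->", describe_overhang(raw))
```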

Display of isoschizomers: The public file has a semicolon in front of all but one member of each family of isoschizomers. (Isoschizomers, in this context, are restriction endonucleases with the same recognition sequence.) Mapping programs normally ignore isoschizomers whose names are preceded by a semicolon. These isoschizomers are available if you select them individually by name or if you type ** in response to the enzyme prompt.

Isoschizomers, suppliers, and literature: Any information on the line to the right of an exclamation point (!) is documentary and is ignored by mapping programs. The documentary information on each record of the public file contains the names of other isoschizomers, if any are known, along with the commercial suppliers and literature references for the enzyme. (See "Restriction Enzyme Suppliers" and "Restriction Enzyme Literature" below.)

Format requirements: The exact column for each field on a line does not matter; only the order of the fields is important. Each field should be separated from all other fields on the same line by at least one blank space. Blank lines are tolerated. Most GCG programs ignore information to the right of an exclamation mark (!) so you can add comments to the data file.

Asymmetric recognition sequences: If the forward and reverse recognition sites are not the same, then there are two records, one showing the forward and the other the reverse strand. These records must be adjacent to one another in the enzyme file. (See BcgI for an example.) You can give several recognition sites with the same name, but you must put all entries with the same name on adjacent lines of the enzyme data file.
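If you want to check a local enzyme file before using it, the field rules described above can be sketched in Python as follows. This is a minimal reader written for illustration, not the code GCG programs use. The names EnzymeRecord, parse_enzyme_text, and SAMPLE are hypothetical, the sample records are invented (they are not copied from enzyme.dat), and the sketch assumes that a semicolon is attached directly to the enzyme name and that records begin after the first line containing the divider.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EnzymeRecord:
    name: str
    offset: int
    site: str                # recognition sequence without ' and _ markers
    overhang: Optional[int]  # None when the field is absent or non-numeric
    hidden: bool             # True when the name is preceded by a semicolon
    comment: str             # documentary text to the right of "!"


def parse_enzyme_text(text: str) -> List[EnzymeRecord]:
    lines = text.splitlines()
    # Assumption: when a ".." divider is present, records start after it.
    for index, line in enumerate(lines):
        if ".." in line:
            lines = lines[index + 1:]
            break
    records = []
    for line in lines:
        body, _, comment = line.partition("!")
        fields = body.split()
        if len(fields) < 3:
            continue  # blank lines are tolerated
        try:
            overhang = int(fields[3])
        except (IndexError, ValueError):
            overhang = None
        records.append(EnzymeRecord(
            name=fields[0].lstrip(";"),
            offset=int(fields[1]),
            site=fields[2].replace("'", "").replace("_", ""),
            overhang=overhang,
            hidden=fields[0].startswith(";"),
            comment=comment.strip(),
        ))
    return records


SAMPLE = """\
Illustrative excerpt only, not the real enzyme.dat ..

EcoRI    1 G'AATT_C   4 ! sample comment
;FooI    3 CCC'GGG    0 ! invented name, hidden by the semicolon
BarI     2 GANTC      ? ! invented name, overhang unknown
"""

for record in parse_enzyme_text(SAMPLE):
    print(record)
```

Because only the order of the fields matters, the sketch splits each record on blank space rather than reading fixed columns.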

Suggestions

You can put semicolons in front of the enzymes to which you do not have access so that they are not displayed when you create restriction enzyme maps.

Because GCG mapping programs using the default data file display only one member of each family of isoschizomers, these programs find all possible recognition sites but not all possible cut points. If you find an enzyme displayed near a point of interest, you might want to examine the enzyme file to see if another cut point is available.

Restriction Enzyme Suppliers

Many of the restriction enzymes displayed by GCG mapping programs are available commercially. The file enz_sources.txt shows the main suppliers of restriction enzymes together with the enzymes they make available. This file is for your information only; it is not read by any GCG program.

You can use Fetch to copy this file to your working directory and then search it with a text editor.

Restriction Enzyme Literature

Most of the restriction enzymes displayed by GCG mapping programs are described in the scientific literature. The citations for each enzyme are the numbers that appear last on each record of the enzyme data file enzyme.dat. You can find these citations in the file enz_refs.txt. This file is for your information only; it is not read by any GCG program.

You can use Fetch to copy this file to your working directory and then search it with a text editor.

Acknowledgments

Dr. Richard Roberts at New England Biolabs developed and maintains REBASE, the restriction enzyme database from which the enzyme data in GCG are drawn.

PROTEOLYTIC ENZYMES AND REAGENTS

Function

Peptide mapping programs read enzyme and reagent names, recognition patterns, and cut positions from an enzyme data file.

Programs that use this file

PeptideMap, MapSort, MapPlot, and PeptideSort.

Default data files

    Program                            Data file

    PeptideMap, MapSort, and MapPlot   proenzyme.dat

    PeptideSort                        proenzall.dat

Note: proenzall.dat is a more complete list of proteolytic agents; it contains several agents that cut at the same place.

Alternative data files

None.

Format

Heading: An enzyme data file begins with an optional documentary heading. A divider of two adjacent periods (..) separates the heading from the enzymes.

Name: The first field on each line contains the name of the enzyme. Only one enzyme should appear per line.

Offset: The name is followed by an offset number, which tells the mapping programs where to cut the peptide when the recognition pattern is found.

Cleavage site: The offset is followed by the enzyme recognition sequence. Recognition sequences, like all peptide sequences, are represented in amino -> carboxyl orientation. They may contain any standard amino acid character, but no ambiguity characters (B and Z). See Appendix III of the Program Manual for a complete list of supported sequence symbols.

Nonsequence characters in the recognition site: Mapping programs read the offset field to find out where each enzyme actually cleaves, but the recognition sequences contain non-sequence characters to help humans see the cleavage points. An apostrophe (') indicates the cut point. These apostrophes are ignored by mapping programs and may therefore be absent.

Overhang: The fourth field is the overhang, which is used only in nucleotide restriction enzyme data files. It has no function for proteolytic reagents.

Display of isoschizomers: Mapping programs normally ignore enzymes whose names are preceded by a semicolon (;). These enzymes are available if you select them individually by name or if you type ** in response to the enzyme prompt.

Documentation: Any information on the line to the right of an exclamation point (!) is documentary and is ignored by mapping programs.

Multiple specificities: You may include more than one occurrence of an enzyme name if the enzyme has more than one specificity. All records with the same name must appear on adjacent lines of the enzyme data file. If you want to distinguish specificities (for instance, trypsin cutting when arginines are blocked), you can create a unique name that distinguishes trypsin cutting at lysine from trypsin cutting at arginine.
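To see how a name, offset, and cleavage site combine, consider the following Python sketch. It is illustrative only and makes two assumptions that this appendix does not confirm: the site is treated as a literal string of residues (the public files may use a richer pattern syntax), and the offset is taken to count residues from the start of the matched site to the cut. The function name cleave is hypothetical.

```python
def cleave(peptide, site, offset):
    """Split a peptide at every occurrence of a literal cleavage site.

    Assumption: the cut falls `offset` residues after the start of each match.
    """
    site = site.replace("'", "")  # the apostrophe only marks the cut for readers
    cuts = set()
    start = peptide.find(site)
    while start != -1:
        cut = start + offset
        if 0 < cut < len(peptide):  # a cut at either end yields no new fragment
            cuts.add(cut)
        start = peptide.find(site, start + 1)
    fragments = []
    previous = 0
    for cut in sorted(cuts):
        fragments.append(peptide[previous:cut])
        previous = cut
    fragments.append(peptide[previous:])
    return fragments


# A made-up peptide cut after each lysine.
print(cleave("AAKGGKTT", "K'", 1))
```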

Suggestions

You can put semicolons in front of all the enzymes and reagents that you do not have access to or that you do not want to use. GCG programs will ignore those enzymes and reagents.

GCG programs PeptideMap, MapSort, and MapPlot search for every point of specific cleavage but not every cleavage pattern. PeptideSort tries to identify each known single-digest cleavage pattern. Send us suggestions for other specificities and cleavage patterns that you think these files should include.

TRANSCRIPTION FACTOR DATABASE (TFD)

Function

This data file provides a list of the recognition sequences for eukaryotic sequence-specific transcription factors from the Transcription Factor Database (TFD).

Programs that use this file

FindPatterns. (Map, MapSort, and MapPlot can also read this file.)

Default data file

None.

Alternative data files

tfsites.dat

Format

Heading: tfsites.dat begins with an optional documentary heading. A divider of two adjacent periods (..) separates the heading from the transcription sites.

Name: The first field on each line contains the name of the site; the name should have no more than 132 characters. Only one site should appear per line.

Offset: The name is followed by an offset number, which tells programs where to mark the top strand when the recognition site is found.

Recognition site: The offset is followed by the recognition sequence. Nucleic acid recognition sequences are represented in 5' to 3' orientation. The recognition sequences should be shorter than 350 characters. They may contain any IUPAC-IUB alphabetic nucleotide character. See Appendix III of the Program Manual for a complete list of supported sequence symbols.

Overhang: The fourth field should be set to zero to signal that both strands should be searched.

Display of isoschizomers: The public file has a semicolon (;) in front of frequently found sites. Mapping programs normally do not display sites whose names are preceded by a semicolon. If you want to use any of these sites, use the Fetch program to copy tfsites.dat to your working directory and use a text editor to remove the semicolons you want.

Literature: Any information on the line to the right of an exclamation point (!) is documentary and is ignored by mapping programs. The documentary information on each record of the public file contains a common name as well as a literature reference for the site.

Suggestions

You can use Fetch to copy tfsites.dat to your working directory and then rename it pattern.dat. FindPatterns will then read it automatically and use it as the default data file.

Also note that you should always search both strands (FindPatterns does this by default) as most transcription factor sites are strand specific.
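The effect of searching both strands can be illustrated with a short Python sketch that derives the reverse-strand form of a recognition site. It is not GCG code; the names COMPLEMENT, reverse_complement, and both_strand_sites are hypothetical, and only the IUPAC-IUB nucleotide letters shown in the table are handled.

```python
# IUPAC-IUB nucleotide symbols and their complements.
COMPLEMENT = str.maketrans("ACGTRYKMSWBDHVN", "TGCAYRMKSWVHDBN")


def reverse_complement(site):
    """Return the site as it reads 5' to 3' on the opposite strand."""
    return site.upper().translate(COMPLEMENT)[::-1]


def both_strand_sites(site):
    """Return the distinct patterns needed to find a site on either strand."""
    forward = site.upper()
    reverse = reverse_complement(forward)
    return [forward] if reverse == forward else [forward, reverse]


print(both_strand_sites("TGACGTCA"))  # palindromic: one pattern is enough
print(both_strand_sites("GGGCGG"))    # asymmetric: two patterns are needed
```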

Acknowledgments

Dr. David Ghosh developed and maintains TFD.

CODON FREQUENCY TABLES

Function

Codon frequency tables reflect the known codon preferences of an organism.

Programs that use these tables

BackTranslate, CodonPreference, and Frames.

Default data file

ecohigh.cod

Alternative data files

Format

Heading: A codon frequency table begins with an optional documentary heading. A divider of two adjacent periods (..) separates the heading from the table. For example:

AmAcid  Codon  Number     /1000     Fraction  ..

Gly     GGG    13.00       1.89      0.02

AmAcid: The first field of information on each line of the table contains a three-letter code for an amino acid.

Codon: The second field contains an unambiguous codon for that amino acid.

Number: The third field lists the number of occurrences of that codon in the genes from which the table is compiled.

/1000: The fourth field lists the expected number of occurrences of that codon per 1,000 codons in genes whose codon usage is identical to that compiled in the codon frequency table.

Fraction: The last field contains the fraction of occurrences of the codon in its synonymous codon family.

Each field of information is separated from every other field by at least one blank space.
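The relationship between the Number, /1000, and Fraction fields can be sketched in Python as shown below. This is an illustration of the field definitions above, not the CodonFrequency program. The function name complete_codon_table and the counts in the example are invented, and the sketch assumes that the /1000 value is computed against the total of all counts supplied.

```python
from collections import defaultdict


def complete_codon_table(rows):
    """Derive the /1000 and Fraction fields from (amino_acid, codon, number) rows."""
    rows = list(rows)
    total = sum(number for _, _, number in rows)
    family_total = defaultdict(float)
    for amino_acid, _, number in rows:
        family_total[amino_acid] += number
    table = []
    for amino_acid, codon, number in sorted(rows):
        per_thousand = 1000.0 * number / total          # occurrences per 1,000 codons
        fraction = number / family_total[amino_acid]    # share of the synonymous family
        table.append((amino_acid, codon, number,
                      round(per_thousand, 2), round(fraction, 2)))
    return table


# Invented counts, for illustration only.
counts = [("Gly", "GGG", 10), ("Gly", "GGA", 30), ("Ala", "GCT", 60)]
for row in complete_codon_table(counts):
    print("%-6s %-5s %8.2f %8.2f %6.2f" % row)
```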

Suggestions

You can use the CodonFrequency program to create a codon frequency table from a set of input nucleotide sequences and/or existing codon frequency tables. You also can create or modify a codon frequency table with a text editor. If you choose to use a text editor, you need provide only the first three fields of information on each line of the table. The lines can be in any order; only codons whose use is greater than zero need be present. You should then generate the complete codon usage table -- five fields of information, one line for each codon, and all lines ordered by amino acid -- by using the table you created as the input to the CodonFrequency program.

TRANSLATION TABLES

Function

Translation tables are used by GCG programs for three purposes:

1. To define the relationships between codons and amino acids.

2. To define the relationships between one-letter and three-letter amino acid codes.

3. To identify potential start codons and stop codons.

Programs that use these tables

BackTranslate, CodonFrequency, CodonPreference, Diverge, Frames, Map, MapPlot, MapSort, Publish, Reformat, and Translate.

Default data file

translate.txt

Alternative data files

Data file             Function

transmitodros.txt     Drosophila mitochondrial translation table

transl_table_02.txt   vertebrate mitochondrial translation table

transl_table_03.txt   yeast mitochondrial translation table

transl_table_04.txt   mold, protozoan, and coelenterate mitochondrial

                      and mycoplasma/spiroplasma translation table

transl_table_05.txt   invertebrate mitochondrial translation table

transl_table_06.txt   ciliate, dasycladacean, and hexamita translation table

transl_table_09.txt   echinoderm mitochondrial translation table

transl_table_10.txt   euplotid translation table

transl_table_11.txt   bacterial translation table

transl_table_12.txt   alternative yeast translation table

transl_table_13.txt   ascidian mitochondrial translation table

transl_table_14.txt   flatworm mitochondrial translation table

transl_table_15.txt   blepharisma mitochondrial translation table

transl_table_16.txt   chlorophycean mitochondrial translation table

transl_table_21.txt   trematode mitochondrial translation table

transl_table_22.txt   Scenedesmus obliquus mitochondrial translation table

transl_table_23.txt   Thraustochytrium aureum mitochondrial translation table

To specify an alternative translation data file, add the parameter -TRANSlate=filename.txt on the command line.

Format

Heading: A translation table begins with an optional documentary heading. A divider of two adjacent periods (..) separates the heading from the table. For example:

Symbol  3-letter  Codons             !IUPAC ..

A       Ala       GCT GCC GCA GCG    !GCX

Symbol: The first field of information on each line of the table is a single-letter amino acid sequence symbol.

3-letter: The second field is the three-letter amino acid code for that sequence symbol.

Codons: The third field must contain a list of all unambiguous codons for the amino acid; this list must come before the exclamation point (!).

!IUPAC: In the fourth field, the exclamation point delimits where the unambiguous codons end and where the ambiguous codons start. The ambiguous codons are provided for documentary purposes only and are completely ignored by GCG programs.

Each field is separated from every other field by at least one blank space. Any of the 31 GCG sequence symbols (see Appendix III of the Program Manual) may be associated with a three-letter code and one or more unambiguous codons. Each codon and each sequence symbol may be used only once.
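The format rules above can be sketched as a small Python reader. It is illustrative only and is not the code used by Translate or any other GCG program. The names parse_translation_table, translate, and SAMPLE are hypothetical, the sample table is a three-line excerpt written for this example, codons missing from the table are shown as X by assumption, and the sketch does not model start codons.

```python
def parse_translation_table(text):
    """Return (codon -> symbol, symbol -> three-letter code) from table text."""
    lines = text.splitlines()
    # Assumption: when a ".." divider is present, the table starts after it.
    for index, line in enumerate(lines):
        if ".." in line:
            lines = lines[index + 1:]
            break
    codon_to_symbol = {}
    three_letter = {}
    for line in lines:
        # Everything to the right of "!" is documentary and is ignored.
        fields = line.split("!", 1)[0].split()
        if len(fields) < 3:
            continue
        symbol, name, codons = fields[0], fields[1], fields[2:]
        if symbol in three_letter:
            raise ValueError("symbol used more than once: " + symbol)
        three_letter[symbol] = name
        for codon in codons:
            codon = codon.upper()
            if codon in codon_to_symbol:
                raise ValueError("codon used more than once: " + codon)
            codon_to_symbol[codon] = symbol
    return codon_to_symbol, three_letter


def translate(sequence, codon_to_symbol, unknown="X"):
    """Translate complete codons in frame 1; trailing bases are dropped."""
    sequence = sequence.upper()
    return "".join(codon_to_symbol.get(sequence[i:i + 3], unknown)
                   for i in range(0, len(sequence) - 2, 3))


SAMPLE = """\
Symbol  3-letter  Codons             !IUPAC ..
A       Ala       GCT GCC GCA GCG    !GCX
G       Gly       GGT GGC GGA GGG    !GGX
M       Met       ATG
"""

codons, names = parse_translation_table(SAMPLE)
print(translate("ATGGCTGGA", codons))
```

Raising an error on a repeated codon or symbol mirrors the rule that each may be used only once.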

Output

Potential start codons are written only in lowercase letters. Stop codons are translated as the asterisk (*) symbol.

SCORING MATRICES

(formerly Symbol Comparison Tables)

Function

Many sequence comparison programs make comparisons between pairs of sequence symbols by looking up a value in a scoring matrix. The matrix assigns an integer value for the match quality of every possible pair of symbols. If you are comparing nucleotides, the matrix might contain 1's for matching symbols and 0's (zeros) for mismatching symbols. However, if you are comparing amino acids, a number could be assigned that is based on chemical similarity or evolutionary distance. The number might be negative if two residues were very dissimilar.
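As a rough illustration of the lookup described above, the following Python sketch builds the simple nucleotide matrix mentioned in the text (1 for a match, 0 for a mismatch) and sums the values over two aligned sequences. The names identity_matrix and score_pairs are hypothetical; GCG programs read their matrices from the data files described in this section rather than building them this way.

```python
def identity_matrix(symbols="ACGT", match=1, mismatch=0):
    """Build a matrix that assigns one value to matches and another to mismatches."""
    return {(a, b): (match if a == b else mismatch)
            for a in symbols for b in symbols}


def score_pairs(seq1, seq2, matrix):
    """Sum the matrix values over the aligned positions of two equal-length strings."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be the same length")
    # Pairs that the matrix does not define are given a value of 0.
    return sum(matrix.get((a, b), 0)
               for a, b in zip(seq1.upper(), seq2.upper()))


print(score_pairs("ACGT", "ACGA", identity_matrix()))
```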

Programs that use these files

BestFit, Compare, FastA, FrameAlign, FrameSearch, Gap, GapShow, PileUp, PlotSimilarity, Pretty, Prime, ProfileMake, Repeat, Segments, StemLoop, TFastA, and the Consensus operation (in the Edit menu) in the Editor mode of SeqLab.

Default data files

For nucleotides:

    Program         Default data file

    BestFit         swgapdna.cmp

    Compare         compardna.cmp

    FastA           fastadna.cmp

    Gap             nwsgapdna.cmp

    GapShow         swgapdna.cmp or

                    nwsgapdna.cmp

    PileUp          pileupdna.cmp

    PlotSimilarity  plotsimdna.cmp

    Pretty          prettydna.cmp

    Prime           prime.cmp

    ProfileMake     profiledna.cmp

    Repeat          repeatdna.cmp

    Segments        segdna.cmp

    StemLoop        stemloop.cmp

For proteins:

All analysis programs, except FastA and TFastA, use blosum62.cmp as the default data file. FastA and TFastA use blosum50.cmp. The Consensus operation (in the Edit menu) in the Editor mode of SeqLab uses identpep.cmp.

Alternative data files for nucleotides

To specify an alternative scoring matrix file, add the parameter -MATRix=filename.txt on the command line.

Global alignments with Segments, ProfileGap, and ProfileSegments

By default, Segments creates local alignments, analogous to those created by BestFit. You can direct Segments to create global alignments, analogous to those created by Gap, by using the command-line parameter -WHOle. Segments then uses the scoring matrix seggapdna.cmp, containing no negative values for mismatches.

ProfileGap and ProfileSegments can be directed to create global alignments by using the command-line parameter -GLObal. If you want to create global alignments using these programs, you might want to create the profile in ProfileMake using the alternative scoring matrix profilegapdna.cmp.

randomdna.cmp

This matrix is most appropriate for programs creating local alignments (BestFit, Segments, ProfileGap, and ProfileSegments). Since all mismatches between IUPAC-IUB nucleotide symbols are given a value of -3 and all matches are given a value of +10, local alignments created using this matrix will be extended further than those created with any of the default scoring matrices for these programs.

Alternative data files for proteins

To specify an alternative scoring matrix file, add the parameter -MATRix=filename.txt on the command line.

BLOSUM matrices

GCG provides a set of BLOSUM matrices for the comparison of peptide sequences, derived from substitutions observed in more than 2,000 blocks of aligned sequences (Henikoff, S. and Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences USA 89; 10915-10919). These matrices are provided as alternative peptide scoring matrices in the files blosum30.cmp, blosum35.cmp, blosum40.cmp, blosum45.cmp, blosum55.cmp, blosum60.cmp, blosum65.cmp, blosum70.cmp, blosum75.cmp, blosum80.cmp, blosum85.cmp, blosum90.cmp, and blosum100.cmp. To complete this set, blosum50.cmp and blosum62.cmp are also provided as the default scoring matrices for some analysis programs in GCG.

pam120.cmp and pam250.cmp

These matrices are the log odds form of the mutation data matrix for 120 PAMs and 250 PAMs, respectively (Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. [1979] in Atlas of Protein Sequence and Structure, Dayhoff, M. O. Ed, pp. 345-352 (Figure 84), National Biomedical Research Foundation, Washington D.C.).

structgappep.cmp

This matrix, described by Risler, et al. (Journal of Molecular Biology 204; 1019-1029), is derived from an analysis of amino acid substitutions after superposition of homologous protein structures. To construct this matrix the authors converted only substitutions whose alpha carbon atoms are very close to one another after superposition of the structures. Based on results from test alignments using Gap and BestFit, the authors suggest that this scoring matrix may prove superior to others in finding weak similarities in distantly related proteins.

oldpep.cmp

An alternative peptide scoring matrix in the file oldpep.cmp can be provided to GCG programs as a local data file. This matrix was derived from the default peptide scoring matrix in Version 8 of GCG. Each value in the Version 8 matrix of floating point values was multiplied by 10 and rounded to the nearest integer to determine the comparison values in oldpep.cmp. Perfect matches in oldpep.cmp have a comparison value of 15, and no matches in the matrix have a higher value than perfect matches.

Format

A scoring matrix file consists of a documentary heading, a dividing line with two adjacent periods (..), an optional auxiliary data block that specifies the default gap creation and extension penalties associated with the scoring matrix, and the matrix itself. GCG nucleotide and amino acid symbols are described in Appendix III of the Program Manual.

GCG programs can use two different types of scoring matrices: BLAST format and GCG format.

BLAST-format scoring matrices

If you have a native BLAST-format scoring matrix, for example BLOSUM62, it can be used directly by GCG programs without converting it to GCG format. However, one advantage to converting native BLAST-format scoring matrices to GCG format is that you can explicitly set gap creation and gap extension penalties within the file (see "Auxiliary Data Block: Setting Gap Creation and Extension Penalties" below).

GCG-format scoring matrices

GCG-formatted scoring matrices can be of two forms: rectangular or "equals." You can use either form of scoring matrix with GCG programs with no difference in program performance or results.

Rectangular scoring matrices. The rectangular form organizes the sequence symbols along an x axis (columns) and y axis (rows), where each symbol along the x axis is compared with each symbol along the y axis. There is a row and column for every sequence symbol that has at least one non-zero comparison value. The value of each pair of symbols compared is placed at the intersection of the appropriate row and column. All relationships that are not explicitly defined in the matrix are assigned a value of 0. Every comparison value is separated from every other value by at least one blank space. Blank lines are tolerated.

Consider the example below:

   A  B  C  D  E  F  G  H ...

A  4 -2  0 -2 -1 -2  0 -2

B -2  6 -3  6  2 -3 -1 -1

C  0 -3  9 -3 -4 -2 -3 -3

D -2  6 -3  6  2 -3 -1 -1

E -1  2 -4  2  5 -3 -2  0

F -2 -3 -2 -3 -3  6 -3 -1

G  0 -1 -3 -1 -2 -3  6 -2

H -2 -1 -3 -1  0 -1 -2  8

...

The intersection of row D with column D has a value of 6, which represents an identical match for a D-D pairwise comparison. However, the pairwise comparison between non-identical symbols often is given a lower value, for example a C-D comparison is -3.
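To make the row-and-column lookup concrete, here is a minimal Python sketch that reads the body of a rectangular matrix and looks up pairs of symbols. It is illustrative only: the names parse_rectangular, lookup, and SAMPLE are hypothetical, the sample is the eight-symbol excerpt shown above with the ellipses removed, and the sketch handles only the matrix itself, not the heading or the auxiliary data block.

```python
def parse_rectangular(text):
    """Read a rectangular matrix body into a {(row, column): value} dictionary."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    columns = rows[0]  # the first non-blank line lists the column symbols
    matrix = {}
    for row in rows[1:]:
        symbol, values = row[0], row[1:]
        for column, value in zip(columns, values):
            matrix[(symbol, column)] = int(value)
    return matrix


def lookup(matrix, a, b):
    """Relationships not defined in the matrix are assigned a value of 0."""
    return matrix.get((a, b), 0)


SAMPLE = """\
   A  B  C  D  E  F  G  H

A  4 -2  0 -2 -1 -2  0 -2
B -2  6 -3  6  2 -3 -1 -1
C  0 -3  9 -3 -4 -2 -3 -3
D -2  6 -3  6  2 -3 -1 -1
E -1  2 -4  2  5 -3 -2  0
F -2 -3 -2 -3 -3  6 -3 -1
G  0 -1 -3 -1 -2 -3  6 -2
H -2 -1 -3 -1  0 -1 -2  8
"""

matrix = parse_rectangular(SAMPLE)
print(lookup(matrix, "D", "D"), lookup(matrix, "C", "D"), lookup(matrix, "D", "C"))
```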

Notice that the values are identical at the C-D comparison and at the D-C comparison: -3. Previous versions of the package supported only triangular forms of scoring matrices to eliminate this repetition. However, to make publicly available scoring matrices, which are in a rectangular format, easier to use, GCG now supports rectangular-format scoring matrices instead of triangular ones. See "Converting Scoring Matrices" later in this section for converting pre-Version 9 scoring matrices to the new format.

Equals-form scoring matrices. The second form of GCG-format scoring matrix supported is "equals" form, so named because within the matrix, each pairwise comparison equals a value. For instance, in the example below, an A-A symbol comparison is assigned, or equals, a value of 4.

AA=      4      AB=     -2      AD=     -2      AE=     -1      AF=     -2

AH=     -2      AI=     -1      AK=     -1      AL=     -1      AM=     -1

AN=     -2      AP=     -1      AQ=     -1      AR=     -1      AS=      1

AW=     -3      AX=     -1      AY=     -2      AZ=     -1      BB=      6

BC=     -1      BD=      6      BE=      2      BF=     -3      BG=     -1

...

All relationships that are not explicitly defined in the matrix are assigned a value of 0. Every comparison value is separated from every other value by at least one blank space. Blank lines are tolerated. Some people find the equals form of scoring matrix easier to read than the rectangular form.
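The "equals" form can be read with a short regular expression, as in the Python sketch below. It is illustrative only; the names PAIR, parse_equals, and SAMPLE are hypothetical, and the sample repeats a few lines of the excerpt above. The sketch assumes that each comparison is symmetric, so that a value given for AB also applies to BA; this appendix does not state that explicitly.

```python
import re

# Two symbols, an equals sign, optional blank space, and an integer value.
PAIR = re.compile(r"(\S)(\S)=\s*(-?\d+)")


def parse_equals(text):
    """Read equals-form text into a {(symbol, symbol): value} dictionary."""
    matrix = {}
    for a, b, value in PAIR.findall(text):
        matrix[(a, b)] = int(value)
        matrix[(b, a)] = int(value)  # assumption: comparisons are symmetric
    return matrix


SAMPLE = """\
AA=      4      AB=     -2      AD=     -2      AE=     -1      AF=     -2

BC=     -1      BD=      6      BE=      2      BF=     -3      BG=     -1
"""

matrix = parse_equals(SAMPLE)
print(matrix[("A", "A")], matrix[("B", "A")], matrix.get(("A", "C"), 0))
```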

Auxiliary Data Block: Setting Gap Creation and Extension Penalties

You can specify gap creation and gap extension penalties within a scoring matrix to ensure that programs reading the scoring matrix use those values as defaults. If you do not specify these penalties, the program calculates reasonable defaults based on the values in the matrix.

Gap creation and gap extension penalties must follow a specific format within a scoring matrix. These penalties must appear in an auxiliary data block, which appears after the dividing line with the two adjacent periods (..) and before the line of sequence symbols in the scoring matrix, as shown below:

..

 GAP_CREATE 12

 GAP_EXTEND 4

   A  B  C  D  E  F  G  H ...

If you create your own scoring matrix, or if you modify an existing one, you must maintain this format for specifying gap creation and extension penalties.

Note that even though gap creation and extension penalties may be set within a scoring matrix, you can override them on the command line. To do so, use the parameters -GAPweight and -LENgthweight on the command line when you run a program that uses scoring matrices.
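If you maintain local scoring matrices, a quick way to confirm that the auxiliary data block is in the expected place is to read it back. The Python sketch below is illustrative only; the name read_gap_penalties is hypothetical, the sample file is invented, and the sketch assumes the penalties are written as integers, as in the example above.

```python
def read_gap_penalties(text):
    """Return (gap_create, gap_extend) found after the ".." divider.

    Either value is None when the file does not set it.
    """
    _, divider, body = text.partition("..")
    if not divider:
        body = text  # no divider found: scan the whole text
    penalties = {"GAP_CREATE": None, "GAP_EXTEND": None}
    for line in body.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in penalties:
            penalties[fields[0]] = int(fields[1])
    return penalties["GAP_CREATE"], penalties["GAP_EXTEND"]


SAMPLE = """\
Invented heading for an illustrative matrix ..

 GAP_CREATE 12

 GAP_EXTEND 4

   A  B  C  D
A  4 -2  0 -2
"""

print(read_gap_penalties(SAMPLE))
```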

Suggestions

Creating new scoring matrices

Use the CompTable program to create scoring matrices. You also can use a text editor to create a scoring matrix; if you do so, use the Reformat program with the command-line parameter -COMparison to rewrite the file into GCG format. Both CompTable and Reformat round the values in the matrix to the nearest integer.

Modifying existing scoring matrices

Several programs may use the same default scoring matrix. However, although the matrices may be identical, the default matrix for each program is contained in a separate file. This allows you to modify a local version of the matrix for one program without affecting the matrix used by another program.

If you make modifications to a matrix, use the Reformat program with the command-line parameter -COMparison to rewrite your scoring matrix data file into GCG format.

Converting scoring matrices

Converting pre-Version 9 scoring matrices to the new format

In Version 9 all scoring matrices provided with the package in GenRunData and GenMoreData are already converted to the new format. However, you must convert all of the scoring matrices in your personal directories, including your personal directory with the logical name MyData, to the new rectangular format. When you do so, you will need to specify the scoring matrix as either nucleotide or protein. GCG programs will not accept pre-Version 9 scoring matrices, and they will display the following error message if you try to use one:

*** ERROR, READSCOREMAT cannot read the scoring matrix in the file

 "filename"!

If this is a scoring matrix created before Version 9,

try converting it with "% reformat /OLDCMPformat /PROtein" or

                       "% reformat /OLDCMPformat /NUCleotide"

To convert pre-Version 9 scoring matrices to the new format, type

% Reformat -OLDCMPformat -NUCleotide scoring_matrix

% Reformat -OLDCMPformat -PROtein scoring_matrix

Converting scoring matrices to make them more readable

GCG programs can accept two forms of GCG-format scoring matrix files: rectangular and "equals." There is no difference in analysis or performance between the forms. However, some people find "equals" format easier to read, and the package provides a way to convert between the two forms.

To convert rectangular scoring matrices to the more readable "equals" format, type

% Reformat -COMParison -EQUALSformat scoring_matrix

To convert "equals" format scoring matrices to rectangular format, type

% Reformat -COMParison scoring_matrix

Converting BLAST-format scoring matrices to GCG-format

GCG also works with native BLAST-formatted scoring matrices. Although converting BLAST-formatted scoring matrices to GCG-format is unnecessary, you may find it useful to do so. GCG-formatted scoring matrices allow you to specify gap creation and extension penalties within the scoring matrix file.

To convert BLAST-formatted scoring matrices to GCG-format, type the command that matches your matrix type (nucleotide or protein)

% Reformat -COMParison -NUCleotide scoring_matrix

% Reformat -COMParison -PROtein scoring_matrix

PROTEIN ANALYSIS DATA FILES

Function

These data files enable programs to locate motifs in protein sequences and to make predictions about peptide isolation, secondary structure, hydrophobicity, and antigenicity.

Programs that use these tables

PeptideSort, Isoelectric, PepPlot, HelicalWheel, CoilScan, SPScan, and HTHScan.

Default data file

Program         Default data file       Function

PeptideSort     aminoacid.dat           amino acid residue properties

                extinctcoef.dat         extinction coefficients for amino acids

                isoelectric.dat         residue-specific pK values for the prediction of a peptide's isoelectric point

Isoelectric     isoelectric.dat         residue-specific pK values for the prediction of a peptide's isoelectric point

PepPlot         pepplot.dat             residue-specific values for the prediction of protein secondary structure, hydrophobicity, and helical hydrophobic moment

                ges.dat                 residue-specific values for identifying nonpolar transbilayer helices

                garnier.dat             residue-specific values for secondary structure prediction using the method of Garnier

HelicalWheel    helicalwheel.dat        residue-specific attributes for the display of a peptide sequence as a helical wheel

CoilScan        mtidkcoils.dat          weight matrix of amino acid coiled-coil propensities

SPScan          speuk.dat               weight matrix for eukaryotic signal peptides

                spgpos.dat              weight matrix for Gram-positive bacterial signal peptides

                spgneg.dat              weight matrix for Gram-negative bacterial signal peptides

HTHScan         htharac.dat             weight matrix for AraC family H-T-Hs

                hthlysr.dat             weight matrix for LysR family H-T-Hs

                hthhomeobox.dat         weight matrix for Homeobox family H-T-Hs

Alternative data files

 CoilScan     mtkcoils.dat

Format

All data files consist of an optional documentary heading, a dividing line with two adjacent periods (..), and the data. The exact column for each field on a line does not matter; only the order of the fields is important. Each field should be separated from all other fields on the same line by at least one blank space.
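
The layout described above (optional heading, a dividing line containing "..", then whitespace-separated fields) can be read with a few lines of code. The following Python sketch is not part of the package; it assumes that the first line containing two adjacent periods is the divider and that a file without one is all data. The function name read_gcg_data and the sample table are invented for illustration.

```python
def read_gcg_data(text):
    """Split a GCG-style data file into (heading_lines, data_rows)."""
    lines = text.splitlines()
    # Assumption: the first line containing ".." is the dividing line.
    for index, line in enumerate(lines):
        if ".." in line:
            heading, body = lines[:index], lines[index + 1:]
            break
    else:
        # Assumption: with no divider present, treat every line as data.
        heading, body = [], lines
    # Column position does not matter, only field order, so split on
    # any run of blanks and skip empty lines.
    rows = [line.split() for line in body if line.strip()]
    return heading, rows


# Hypothetical file contents, invented for this example.
SAMPLE = """\
Hypothetical residue table, invented for this example.

Symbol  Value  ..

A    1.8
R   -4.5
   N     -3.5
"""

heading, rows = read_gcg_data(SAMPLE)
for fields in rows:
    print(fields)
```

Because the fields are split on runs of blanks, the indented third data line is read the same way as the others.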

PROSITE

Function

You can search protein sequences for motifs that are represented in the PROSITE Dictionary of Protein Sites and Patterns.

Programs that use this file

Motifs

Default data file

prosite.patterns

Alternative data files

None.

Format

The format of GCG pattern files is described in the documentation for programs that use these files.

The exact column for each field on a line does not matter; only the order of the fields is important. Each field should be separated from the other fields on the same line by at least one blank space. Blank lines are tolerated. Most GCG programs ignore information to the right of an exclamation mark (!), so you can use these marks to create comments within the data. You cannot edit prosite.patterns unless your text editor can handle very large records.

Heading: This data file has an optional documentary heading, followed by a dividing line with two adjacent periods (..).

Name: The first field on each line contains the name of the motif; the name should have no more than 132 characters. Motifs prefixed by a semicolon ( ; ) are short patterns that are expected to occur in most protein sequences by chance alone. Such frequently found patterns are not displayed by the Motifs program unless you run Motifs with the command-line parameter -FREquent. Only one motif should appear per line.

Offset: The name is followed by an offset number, which tells Motifs where to mark the sequence when the motif expression is found.

Pattern: Patterns should be shorter than 350 characters. They may contain any alphabetic amino acid character. See Appendix III of the Program Manual for a complete list of supported sequence symbols.

For a complete description of the syntax in which motifs are represented, see the topic DEFINING PATTERNS in Motifs in the Program Manual.

Note that some motifs require multiple patterns to identify them. If this is so, these patterns will have the same name and must appear on adjacent lines.

PDoc: The fourth field gives the name of the PROSITE abstract for the pattern. You can copy this file to your directory with the Fetch command, or you can display it with the TypeData command.
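
The four fields described above (name, offset, pattern, PDoc) can be picked apart as in the following Python sketch. It is an illustration only, not the code Motifs uses. It assumes that the pattern contains no blanks or exclamation marks, and that a leading semicolon may or may not be followed by a blank. The names MotifEntry and parse_pattern_line, and the sample lines, are hypothetical.

```python
from typing import NamedTuple, Optional


class MotifEntry(NamedTuple):
    name: str
    offset: int
    pattern: str
    pdoc: str
    frequent: bool  # True when the line was prefixed by a semicolon


def parse_pattern_line(line: str) -> Optional[MotifEntry]:
    """Parse one data line; return None for blank or comment-only lines."""
    # Text to the right of "!" is treated as a comment.
    text = line.split("!", 1)[0].strip()
    frequent = text.startswith(";")
    fields = text.lstrip(";").split()
    if len(fields) < 4:
        return None
    name, offset, pattern, pdoc = fields[:4]
    return MotifEntry(name, int(offset), pattern, pdoc, frequent)


# Hypothetical entries, invented for this example.
SAMPLE_LINES = [
    "Example_Site     1  RGD    Pdoc_example   ! comment text",
    ";Short_Site      1  NG     Pdoc_short",
    "",
]

entries = [e for e in map(parse_pattern_line, SAMPLE_LINES) if e is not None]

# Hide frequent patterns unless asked for them, in the spirit of -FREquent.
show_frequent = False
visible = [e for e in entries if show_frequent or not e.frequent]
for entry in visible:
    print(entry.name, entry.offset, entry.pattern, entry.pdoc)
```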

Suggestions

prosite.seqcat contains a short description of each motif in prosite.patterns. Use the Fetch command to copy the prosite.seqcat file to your directory or use the TypeData command to view the file online.

The use of Motifs is so straightforward that there are few occasions when you will need to modify this file.

Acknowledgments

Dr. Amos Bairoch of the University of Geneva publishes and maintains the PROSITE Dictionary of Protein Sites and Patterns. PROSITE is distributed by the European Bioinformatics Institute in Cambridge, England.

PROFILES

Function

This database contains validated profiles derived from the motifs in the PROSITE Dictionary of Protein Sites and Patterns.

Programs that use this file

ProfileScan

Default data file

profilescan.fil

Alternative data file

oldprofilescan.fil

Format

Heading: The optional heading documents the contents of each column. A divider of two adjacent periods (..) separates the heading from the profiles.

Name: The first column contains the location and name of each profile (see SUGGESTIONS below). These names correspond to the names of the patterns in the prosite.patterns file. The profile name must contain fewer than 255 characters.

High and Intrst: By default, ProfileScan reports only alignments with normalized scores greater than the HIGH value. If you add the -INTEResting parameter to the command line, ProfileScan will report alignments that score higher than the INTRST value.

Gap and Len: These values specify, respectively, the gap creation and extension penalties used to align the motif profile to the query sequence.

A, B, C, AVE, and SD: These values specify the parameters for length-dependent normalization of the alignment scores. See ProfileSearch in the Program Manual for a description of the derivation of these values and their use in normalizing the alignment scores.
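
The reporting rule for the High and Intrst values can be written as a single comparison. The Python sketch below only restates the rule given above; it is not ProfileScan's implementation, and the function name and threshold values are invented for illustration.

```python
def is_reported(score, high, intrst, interesting=False):
    """Return True if an alignment with this normalized score is reported.

    By default the score must exceed HIGH; with the -INTEResting
    parameter it only has to exceed INTRST.
    """
    threshold = intrst if interesting else high
    return score > threshold


# Hypothetical thresholds and score, invented for this example.
HIGH, INTRST = 8.0, 5.5
print(is_reported(6.2, HIGH, INTRST))
print(is_reported(6.2, HIGH, INTRST, interesting=True))
```

With these made-up values, a score of 6.2 is reported only when the interesting threshold is in effect.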

Suggestions

Individual profile files are maintained in the directory with the logical name ProfileDir. To view a profile's documentation, use the Fetch command to copy a profile file to your directory (for example, % fetch apple.prf), or use the TypeData command to view the file online.

Acknowledgments

Dr. Michael Gribskov of the San Diego Supercomputer Center prepared and validated these profiles. Dr. Amos Bairoch of the University of Geneva publishes and maintains the PROSITE Dictionary of Protein Sites and Patterns.

VERSION 2.0 PROFILES

Function

This database contains validated profiles derived from the motifs in the PROSITE Dictionary of Protein Sites and Patterns. Profiles are a special kind of scoring matrix used by several different programs. The addition of MEME and MotifSearch to GCG required the introduction of a new format of profile that allows multiple profiles to be kept in one file.

Programs that use these files

MEME generates version 2.0 profiles, while MotifSearch is intended to process them. ProfileSearch, ProfileGap and ProfileSegments can all read ONLY THE FIRST profile from a version 2.0 file.

Default data file

Not Applicable

Alternative data file

Not Applicable

Format

Heading: The file should begin with a line containing either "!!AAPROFILE 2.0" or "!!NAPROFILE 2.0". Thereafter, you may include any information you like, concluding the heading section with a divider of two adjacent periods (..).

Auxiliary Data Block: The ADB begins with a line having nothing but a "{", and ends with a line having nothing but a "}". These MUST appear in the first column of their respective lines.

The ADB must contain four parsable data lines. The first gives the Length of the profile (sometimes thought of as the width), in the form "Length: <value>". The next two lines control the gap creation and extension penalties for the profile, and the fourth gives the labels of the columns used in the profiles. The column labels should be separated by blank spaces. The first label should always be "Cons" (for Consensus), and it must appear at the beginning of the line, with no indentation.

Here is an example of a simple ADB, with some of the column labels replaced by an ellipsis:

  Length: 9

  Gap: 1.00              Len: 1.00

  GapRatio: 0.0          LenRatio: 0.0

Cons   A      C      D      E      F    . . .      W      Y   Gap  Len

The ADB may contain any number of "Comment" lines, indicated by a "!" in the first column.

Profile: The profile itself is made up of rows of log-odds values, with each row corresponding to a position in the profile and (with three exceptions) each column corresponding to a valid symbol for that position. The exceptions are the first column, which contains a letter identifying the consensus symbol for the row, and the last two columns, which give the multiplying factor for the gap creation and extension penalties for the row. (Note that MEME's output profiles are always ungapped, and thus will always have 100, the maximum value, in the last two columns.) The last row in a profile does NOT correspond to a position in the profile. Instead, it contains counts for the number of appearances of each letter at any position in the sequences from which the profile was derived. This information is not used by any programs at this time, but it nonetheless must be there. Note that this dummy row is NOT included in the Length count given in the Auxiliary Data Block.
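
To make the layout concrete, the following Python sketch reads a version 2.0 profile file into memory. It is an illustration under several assumptions, not the reader used by MotifSearch: it assumes the profile rows follow the closing "}" of each Auxiliary Data Block, that each profile has Length position rows plus one counts row, and it ignores the gap penalty lines. The function name parse_profiles, the nucleotide column labels, and the "*" marker in the sample counts row are all invented for this example.

```python
HEADERS = ("!!AAPROFILE 2.0", "!!NAPROFILE 2.0")


def parse_profiles(text):
    """Return a list of dicts, one per profile found in the text."""
    lines = text.splitlines()
    if not lines or not any(tag in lines[0] for tag in HEADERS):
        raise ValueError("missing version 2.0 profile header line")
    # Skip the heading: data starts after the first line containing "..".
    i = next(n for n, line in enumerate(lines) if ".." in line) + 1
    profiles = []
    while i < len(lines):
        if lines[i].rstrip() != "{":  # "{" must sit in the first column
            i += 1
            continue
        i += 1
        length, labels = None, None
        while i < len(lines) and lines[i].rstrip() != "}":
            line = lines[i]
            if line.startswith("!"):
                pass  # comment line inside the ADB
            elif line.strip().startswith("Length:"):
                length = int(line.split(":", 1)[1])
            elif line.startswith("Cons"):  # labels line is not indented
                labels = line.split()
            i += 1
        i += 1  # step past the closing "}"
        if length is None or labels is None:
            raise ValueError("ADB lacks a Length or column label line")
        # Assumption: Length position rows, then one counts row.
        rows = []
        while i < len(lines) and len(rows) < length + 1:
            if lines[i].strip():
                rows.append(lines[i].split())
            i += 1
        profiles.append({"length": length, "labels": labels,
                         "positions": rows[:length], "counts": rows[length:]})
    return profiles


# Hypothetical file, invented for this example.
SAMPLE = """\
!!NAPROFILE 2.0
Hypothetical two-position profile, invented for this example ..
{
  Length: 2
  Gap: 1.00              Len: 1.00
  GapRatio: 0.0          LenRatio: 0.0
Cons   A      C      G      T   Gap  Len
}
 A     12     -5     -3     -7   100  100
 T     -6     -4     -8     11   100  100
 *      5      1      2      4     0    0
"""

profiles = parse_profiles(SAMPLE)
print(profiles[0]["length"], len(profiles[0]["positions"]))
```

Keeping the counts row apart from the position rows mirrors the rule that the dummy row is not included in the Length count.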

Printed: May 27, 2005 11:36

Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.