Sequence Symbols

Accelrys GCG (GCG) programs allow all upper- and lowercase letters, periods (.), asterisks (*), tildes (~), ampersands (&), and at signs (@) in biological sequences. Nucleotide symbols, their complements, and the standard one-letter amino acid symbols are shown below in separate lists. The meanings of the symbols & and @ have not been assigned at this writing (October 1996).

GCG supports two gap characters: the period (.) and the tilde (~). GCG programs run from the command line or from the Main List mode of SeqLab treat the two gap characters identically in input sequences. GCG programs run from the Editor mode of SeqLab remove any tilde gap characters from the right end of each input sequence before performing their analyses.
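The two behaviors described above can be sketched in a few lines of Python. This is only an illustration, not GCG code: the function names are hypothetical, and rewriting tildes as periods is just one possible way to treat the two gap characters identically.

```python
def strip_right_tildes(seq: str) -> str:
    """Mimic the Editor-mode behavior: drop tilde gaps from the right end only.

    Periods and any tildes elsewhere in the sequence are left untouched.
    """
    return seq.rstrip("~")


def normalize_gaps(seq: str) -> str:
    """Treat both gap characters alike by rewriting tildes as periods.

    Assumption: collapsing to the period is an arbitrary illustrative choice.
    """
    return seq.replace("~", ".")


if __name__ == "__main__":
    print(strip_right_tildes("~~ACG.T~~~"))  # ~~ACG.T
    print(normalize_gaps("~A.C~"))           # .A.C.
```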

In the future, programs run from either the command line or from SeqLab may differentiate the two gap characters in their analyses. The period gap character will increasingly be used as a space holder that may represent a missing character in a sequence. For example, the period gap character may represent a missed base call in a contig alignment in fragment assembly. The tilde gap character will increasingly be used as a simple place holder that never represents an actual character in a sequence. For example, two tildes may be used in a translated sequence to align each codon in a nucleotide sequence with its corresponding single-letter amino acid symbol. As another example, gaps at the ends of sequences in an alignment may be written as tildes when those gaps are due to differences in input sequence lengths rather than missing characters in the input sequences.

GCG uses the letter codes for amino acids and for nucleotide ambiguity proposed by IUPAC-IUB. These codes are compatible with the codes used by the GenBank and PIR databases.


The meaning of each nucleotide symbol, its complement, and its Cambridge (Staden/Sanger) equivalent are shown below. Cambridge files can be converted into GCG files and vice versa with the programs FromStaden and ToStaden.

IUB/GCG      Meaning     Complement   Staden/Sanger
    A             A             T             A
    C             C             G             C
    G             G             C             G
   T/U            T             A             T
    M           A or C          K             M
    R           A or G          Y             R
    W           A or T          W             W
    S           C or G          S             S
    Y           C or T          R             Y
    K           G or T          M             K
    V        A or C or G        B             V
    H        A or C or T        D             H
    D        A or G or T        H             D
    B        C or G or T        V             B
   X/N     G or A or T or C    X/N            N
   ./~      gap character      ./~            -
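As a minimal sketch, the complement column above can be turned into a Python translation table. The function names are hypothetical, and two choices are assumptions rather than documented GCG behavior: lowercase symbols are complemented to lowercase, and X, N, and the gap characters map to themselves.

```python
# Symbols and complements transcribed from the list above (uppercase only).
_SYMBOLS     = "ACGTUMRWSYKVHDBXN.~"
_COMPLEMENTS = "TGCAAKYWSRMBDHVXN.~"

# Assumption: case is preserved, so the lowercase pairs are added as well.
COMPLEMENT = str.maketrans(
    _SYMBOLS + _SYMBOLS.lower(),
    _COMPLEMENTS + _COMPLEMENTS.lower(),
)


def complement(seq: str) -> str:
    """Complement each symbol; characters not in the table pass through unchanged."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Complement the sequence and reverse it to read the opposite strand."""
    return complement(seq)[::-1]


if __name__ == "__main__":
    print(complement("ACGT"))            # TGCA
    print(reverse_complement("AAGCTY"))  # RAGCTT
```

Because U is complemented to A but A is complemented to T, complementing an RNA sequence twice with this table yields T rather than U.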

The uncertainty and frame ambiguity codes used by Staden are not supported by GCG; FromStaden converts each of them to the lowercase GCG symbol shown below.

        Staden Code          Meaning              GCG
            1               probably C              c
            2               probably T              t
            3               probably A              a
            4               probably G              g
            5                A or C                 m
            6                G or T                 k
            7                A or T                 w
            8                G or C                 s
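The mapping in this list is simple enough to express as a character translation. The following Python sketch is not the FromStaden implementation; it only applies the eight substitutions listed above, and the function name is hypothetical. It assumes every other character passes through unchanged.

```python
# Staden uncertainty codes 1-8 mapped to the lowercase GCG symbols listed above.
STADEN_TO_GCG = str.maketrans("12345678", "ctagmkws")


def staden_codes_to_gcg(seq: str) -> str:
    """Replace Staden codes 1-8 with their lowercase GCG equivalents.

    Assumption: all other characters are copied through as they are.
    """
    return seq.translate(STADEN_TO_GCG)


if __name__ == "__main__":
    print(staden_codes_to_gcg("AC1G5T8"))  # ACcGmTs
```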

Amino Acids

Here is a list of the standard one-letter amino acid codes and their three-letter equivalents. The synonymous codons and their depiction in the IUB codes are shown. Note that the codons following semicolons (;) are not specific enough to define a single amino acid, even though they represent the best possible backtranslation into the IUB codes. You can redefine all of the relationships in this list in a local data file as described in Appendix VII.

Symbol 3-letter  Meaning      Codons                Depiction
  A    Ala       Alanine      GCT,GCC,GCA,GCG         !GCX
  B    Asp,Asn   Aspartic,
                 Asparagine   GAT,GAC,AAT,AAC         !RAY
  C    Cys       Cysteine     TGT,TGC                 !TGY
  D    Asp       Aspartic     GAT,GAC                 !GAY
  E    Glu       Glutamic     GAA,GAG                 !GAR
  F    Phe     Phenylalanine  TTT,TTC                 !TTY
  G    Gly       Glycine      GGT,GGC,GGA,GGG         !GGX
  H    His       Histidine    CAT,CAC                 !CAY
  I    Ile       Isoleucine   ATT,ATC,ATA             !ATH
  K    Lys       Lysine       AAA,AAG                 !AAR
  L    Leu       Leucine      TTG,TTA,CTT,CTC,CTA,CTG !TTR,CTX,YTR;YTX
  M    Met       Methionine   ATG                     !ATG
  N    Asn       Asparagine   AAT,AAC                 !AAY
  P    Pro       Proline      CCT,CCC,CCA,CCG         !CCX
  Q    Gln       Glutamine    CAA,CAG                 !CAR
  R    Arg       Arginine     CGT,CGC,CGA,CGG,AGA,AGG !CGX,AGR,MGR;MGX
  S    Ser       Serine       TCT,TCC,TCA,TCG,AGT,AGC !TCX,AGY;WSX
  T    Thr       Threonine    ACT,ACC,ACA,ACG         !ACX
  V    Val       Valine       GTT,GTC,GTA,GTG         !GTX
  W    Trp       Tryptophan   TGG                     !TGG
  X    Xxx       Unknown                              !XXX
  Y    Tyr       Tyrosine     TAT,TAC                 !TAY
  Z    Glu,Gln   Glutamic,
                 Glutamine    GAA,GAG,CAA,CAG         !SAR
  *    End       Terminator   TAA,TAG,TGA             !TAR,TRA;TRR
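To see why the depictions after a semicolon are not specific, it helps to expand them back into concrete codons. The Python sketch below does this with the ambiguity meanings from the nucleotide list. The names IUB, LEU, and expand are hypothetical and are not part of GCG.

```python
from itertools import product

# Ambiguity codes and the bases they stand for, from the nucleotide list.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}


def expand(depiction: str) -> set[str]:
    """Return every concrete codon matched by an IUB depiction such as 'YTX'."""
    choices = (IUB[symbol] for symbol in depiction.upper())
    return {"".join(bases) for bases in product(*choices)}


# The six leucine codons from the list above.
LEU = {"TTG", "TTA", "CTT", "CTC", "CTA", "CTG"}

if __name__ == "__main__":
    # The depictions before the semicolon cover leucine exactly.
    print(expand("TTR") | expand("CTX") == LEU)  # True
    # The depiction after the semicolon also matches two phenylalanine codons.
    print(sorted(expand("YTX") - LEU))           # ['TTC', 'TTT']
```

The same check on the terminator row shows that TRR matches TGG (tryptophan) in addition to the three stop codons.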

