PRETTYBOX

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

FUNCTION

PrettyBox displays multiple sequence alignments as shaded boxes in PostScript format for printing or displaying with a PostScript-compatible device. PrettyBox optionally calculates a consensus sequence. The program does not create the alignment; it simply displays it.

DESCRIPTION

PrettyBox produces a PostScript file containing a multiple sequence alignment with residues shaded on the basis of agreement to a calculated consensus sequence, allowing you to identify relationships among the sequences. You can use this program with aligned sequences in an MSF (multiple sequence format) file, an RSF (rich sequence format) file, or a list file that specifies two or more single sequence files that have had gaps added to make them all align.

The output from PrettyBox can only be printed on a PostScript printer or viewed with software capable of displaying PostScript images.

EXAMPLE

Here is a session with PrettyBox that displays a shaded alignment from a multiple alignment of the antigenic regions of a group of picornavirus capsid proteins.

% prettybox -CASe

 PRETTYBOX what sequences ? prettybox.msf{*}

           prettybox.msf{fa10}, len: 349

           prettybox.msf{fa12}, len: 349

           prettybox.msf{fo1k}, len: 349

              prettybox.msf{e}, len: 349

            prettybox.msf{p1m}, len: 349

            prettybox.msf{p1s}, len: 349

            prettybox.msf{p2s}, len: 349

            prettybox.msf{p3s}, len: 349

            prettybox.msf{cb3}, len: 349

            prettybox.msf{r14}, len: 349

             prettybox.msf{r2}, len: 349

Print in which orientation:

     l)andscape     p)ortrait

Please select (* L *):

Display a consensus (* No *) ?

Find consensus to what minimum plurality (* 2.00 *) ?

Where should numbers be placed:

     r)ight side     t)op      n)one

Please select (* R *):

What should I call the output PostScript file (* prettybox.ps *) ?

OUTPUT

If you are reading the Program Manual, the PostScript output from this session appears at the end of this program description.

INPUT FILES

PrettyBox accepts multiple (two or more) aligned nucleotide or protein sequences as input. These aligned sequences can be represented in an MSF, RSF or list file. For example, to specify an MSF file, such as the output file from a session with PileUp, use a command like % prettybox pileup.msf{*}. Similarly, you can specify an RSF file, such as the output file from a session with PileUp in SeqLab, using a command like % prettybox pileup.rsf{*}.

RELATED PROGRAMS

Pretty displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it. PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. The SeqLab editor allows you to shade sequences based on agreement to a consensus sequence or according to the results of PlotSimilarity.

RESTRICTIONS

You can use up to 500 sequences, although the total length of all sequences combined must be less than 2,000,000 characters. The maximum number of ranges that can be specified in a mark file (*.mrk) is 100.
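The limits above can be checked before a run. The following is only a sketch: the function name check_limits is hypothetical, and it assumes that gap characters count toward the total length and that aligned sequences all have the same length.

```python
MAX_SEQUENCES = 500
MAX_TOTAL_LENGTH = 2_000_000  # the combined length must stay below this


def check_limits(sequences):
    """Return a list of problems; an empty list means the input fits."""
    problems = []
    if len(sequences) < 2:
        problems.append("need at least two aligned sequences")
    if len(sequences) > MAX_SEQUENCES:
        problems.append(f"{len(sequences)} sequences exceeds {MAX_SEQUENCES}")
    total = sum(len(s) for s in sequences)
    if total >= MAX_TOTAL_LENGTH:
        problems.append(f"total length {total} is not below {MAX_TOTAL_LENGTH}")
    # Assumption: an aligned set has one common length once gaps are added.
    if len({len(s) for s in sequences}) > 1:
        problems.append("sequences differ in length, so they are not aligned")
    return problems


print(check_limits(["AC-G", "ACTG"]))  # an empty list means no problems
```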

ALGORITHM

Calculating and Displaying a Consensus

PrettyBox calculates a consensus for each column of an alignment using the scoring matrix blosum62.cmp for proteins or prettyboxdna.cmp for nucleic acids. You can display the consensus sequence in the output by using the parameter -CONsensus. The consensus symbol for a column is determined in two steps:

1) The program finds the symbol whose comparison to all of the symbols in the column (including itself) yields the greatest number of votes. A vote is cast for each symbol comparison that is greater than or equal to some set threshold value; votes can be either 1.0 or some vote weight assigned to the sequence from which the vote comes.

2) Among the coalition of symbols that voted for the winning symbol, the most common symbol is chosen as the consensus.

If there is no coalition of votes that is larger than all of the other coalitions, or if the largest coalition of votes has fewer votes than the value known as the minimum plurality, then there is no consensus for the column.

The weights for each sequence and the minimum plurality are floating point numbers. The threshold value is an integer.

-THReshold=1

determines the scoring matrix value below which a symbol may not vote for a coalition. PrettyBox chooses a default threshold that is appropriate for the scoring matrix it reads. If you select a different scoring matrix with the -MATRix command-line parameter, the program will adjust the default threshold accordingly. Use -THReshold to specify an alternative threshold if you don't want to accept the default value.

-PLUrality=2.0

defines the number of votes (vote weights) below which there is no consensus.
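The two-step vote described above can be sketched for a single column as follows. This is an illustration, not PrettyBox's own code: the names column_consensus and toy_score are hypothetical, the toy comparison values are invented rather than taken from blosum62.cmp, and the handling of tied coalitions and of ties for the most common symbol is an assumption.

```python
from collections import Counter


def toy_score(a, b):
    """Hypothetical comparison values; not read from blosum62.cmp."""
    if a == b:
        return 4
    groups = [set("ILVM"), set("FYW")]
    return 1 if any(a in g and b in g for g in groups) else -1


def column_consensus(column, score, threshold=1, plurality=2.0, weights=None):
    """Return the consensus symbol for one alignment column, or None."""
    if weights is None:
        weights = [1.0] * len(column)  # a weight of 1.0 is assumed by default

    # Step 1: every distinct symbol is a candidate. A row votes for the
    # candidate when its comparison value reaches the threshold.
    coalitions = {
        candidate: frozenset(
            i for i, sym in enumerate(column) if score(candidate, sym) >= threshold
        )
        for candidate in set(column)
    }

    def votes(rows):
        return sum(weights[i] for i in rows)

    best = max(votes(rows) for rows in coalitions.values())
    winning = {rows for rows in coalitions.values() if votes(rows) == best}

    # No consensus when distinct coalitions tie or the plurality is not met.
    if len(winning) != 1 or best < plurality:
        return None

    # Step 2: the most common symbol inside the winning coalition.
    (rows,) = winning
    tally = Counter()
    for i in rows:
        tally[column[i]] += weights[i]
    top = max(tally.values())
    return min(sym for sym, n in tally.items() if n == top)  # assumed tie-break


print(column_consensus("IIVLK", toy_score))
print(column_consensus("AAK", toy_score, weights=[0.5, 0.5, 1.0]))
```

In the first call the symbols I, V, and L all attract the same four voters, and I is the most common symbol in that coalition. In the second call the two down-weighted A rows only tie with the single K row, so no consensus is reported.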

Vote Weight

If several of your sequences are very similar, you may not want their votes to dominate the consensus for the column. If your input file specification to PrettyBox is a list file, you can assign each sequence a vote weight with the wgt sequence attribute within the list file itself. The vote weight is the vote that each row casts for the consensus. A weight of 1.0 is assumed if no vote weight is specified.

You can assign vote weights to sequences in an MSF file by editing the MSF file and modifying the weight on the name/weight line for each sequence at the top of the file. (See "Using Multiple Sequence Format (MSF) Files" in Section 2, Using Sequence Files and Databases in the User's Guide for a complete description of MSF files.)
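Editing the name/weight lines can also be scripted. The sketch below assumes the usual MSF header layout, in which each sequence has a line of the form "Name: ... Len: ... Check: ... Weight: ..." above the "//" separator; the function name set_msf_weights and the sample Check values are illustrative only.

```python
import re

# Matches a header line such as " Name: fa10   Len: 349  Check: 1111  Weight: 1.00".
NAME_LINE = re.compile(r"^(\s*Name:\s+(\S+)\s.*\bWeight:\s*)(\S+)(\s*)$")


def set_msf_weights(msf_text, new_weights):
    """Return msf_text with the Weight value replaced for the named sequences."""
    out = []
    in_header = True
    for line in msf_text.splitlines(keepends=True):
        if line.strip() == "//":
            in_header = False  # the alignment body follows this separator
        match = NAME_LINE.match(line) if in_header else None
        if match and match.group(2) in new_weights:
            weight = new_weights[match.group(2)]
            line = f"{match.group(1)}{weight:.2f}{match.group(4)}"
        out.append(line)
    return "".join(out)


sample = (
    " Name: fa10   Len:   349  Check: 1111  Weight:  1.00\n"
    " Name: fa12   Len:   349  Check: 2222  Weight:  1.00\n"
    "\n//\n\n"
    "fa10  ACGT\n"
)
print(set_msf_weights(sample, {"fa12": 0.5}))
```

Only lines above the separator are touched, so sequence data that happens to contain the word Weight is left alone.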

You can assign vote weights to sequences in an RSF (rich sequence format) file by modifying the weight attribute for each sequence within SeqLab. See "Using Rich Sequence Format (RSF) Files" in Section 2, Using Sequence Files and Databases in the User's Guide for a complete description of RSF files. See "Viewing Sequence Attribute and Reference Information" in Section 2, Editing Sequences in the SeqLab Guide for more information about modifying the weight attribute for each sequence within an RSF file. (Note: PrettyBox is not available in SeqLab.)

Shading Alignments

PrettyBox uses four levels of shading to indicate the degree of similarity of individual sequence characters to the consensus character at a given position. Normally, the darkest shading (Black) is used for symbols that are exactly the same as the consensus sequence. The next darkest level of shading (Light) is used for symbols whose comparison value is greater than or equal to the average non-identical comparison value in the scoring matrix. (See Appendix VII for more information about scoring matrices.) Symbols whose comparison scores are less than this value but are greater than or equal to 1 are shaded most lightly (Pale). All other sequence symbols are shown with White backgrounds. You can change these match display thresholds with the -PAIr parameter.
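The four shading levels amount to a threshold test against the consensus symbol. A minimal sketch follows, assuming the protein defaults x, 2, 1 for -PAIr; the names shade_level and toy_score are hypothetical, the comparison values are invented for the example, and leaving a column unshaded when it has no consensus is an assumption.

```python
# Hypothetical comparison values; not read from blosum62.cmp.
TOY_SCORES = {("I", "V"): 3, ("I", "A"): 1}


def toy_score(a, b):
    if a == b:
        return 4
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), -1))


def shade_level(symbol, consensus, score, pair=("x", 2, 1)):
    """Return 'Black', 'Light', 'Pale', or 'White' for one symbol."""
    if consensus is None:
        return "White"  # assumption: nothing is shaded without a consensus
    black, light, pale = pair
    value = score(symbol, consensus)
    # 'x' means Black is reserved for symbols identical to the consensus.
    is_black = symbol == consensus if black == "x" else value >= black
    if is_black:
        return "Black"
    if value >= light:
        return "Light"
    if value >= pale:
        return "Pale"
    return "White"


for sym in "IVAK":
    print(sym, shade_level(sym, "I", toy_score))
```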

The -CASe parameter provides another way to indicate agreement to the consensus character. This parameter displays characters that receive the darkest level of shading in uppercase and all other characters in lowercase.

The -IDEntity parameter restricts shading to only those columns in which each sequence has the same character. When both -IDEntity and -CASe are used, only the characters in shaded columns are shown in uppercase.
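The interaction of -CASe and -IDEntity can be pictured with a short sketch. It assumes the default first -PAIr value (x), so that only symbols identical to the consensus count as agreeing; the function name apply_case is hypothetical.

```python
def apply_case(column, consensus, identity=False):
    """Return one column's symbols with -CASe style casing applied."""
    # With identity=True, only unanimous columns keep uppercase symbols.
    unanimous = len({sym.upper() for sym in column}) == 1
    result = []
    for sym in column:
        agrees = consensus is not None and sym.upper() == consensus.upper()
        if identity and not unanimous:
            agrees = False
        result.append(sym.upper() if agrees else sym.lower())
    return "".join(result)


print(apply_case("AAG", "A"))
print(apply_case("AAG", "A", identity=True))
print(apply_case("AAA", "A", identity=True))
```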

CONSIDERATIONS

All output from PrettyBox is in PostScript format. If there is no PostScript compatible device or software available, then you should consider using Pretty to format multiple alignments.

It is possible to obtain unexpected results with certain combinations of formatting parameters such as -WIDth, -BLOcksize, -SPAcing, etc. For example, the right-hand side of the output may be trimmed when the value of -WIDth is too large for a given page size.

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

Minimal Syntax: % prettybox [-INfile=]@pretty.list -Default

Prompted Parameters:

-BEGin=1 -END=349      sets the range of interest

-ORIentation=l         specifies the direction for printing as

                         Landscape (L) or Portrait (P)

-NUMbering=r           sets printing of sequence numbering

                         to Right side (R), Top (T), or None (N)

-CONsensus             generates a consensus sequence

-OUTfile=prettybox.ps  writes to PostScript output file

Local Data Files:

-MATRix=prettyboxdna.cmp  assigns the scoring matrix for nucleotides

-MATRix=blosum62.cmp      assigns the scoring matrix for proteins

-MARk=pretty.mrk          defines regions to be shaded

Optional Parameters:

-PAIr=x,2,1            sets thresholds for identical (x), very similar, and

                         weakly similar comparisons to the consensus,

                         respectively.  Protein defaults are:  x, 2, 1.

                         Nucleic acid defaults are: 1, 1, 1.

-THReshold=1           sets minimum comparison value for symbol to vote in

                         the consensus

-PLUrality=2.0         defines the minimum number of votes for a consensus

                         to exist

-IDEntity              restricts shading and consensus determination to

                         positions of unanimous agreement

-CASe                  shows positions agreeing with the calculated consensus

                         in uppercase

-SIMPlify=simplify.txt simplifies sequences; works like the Simplify program.

-SIMIlar=a             considers similarity in generating a consensus.

                         If 'O' is used, then only identical matches are

                         considered.

-NOOFFset              prevents printing the consensus line offset from the

                         other sequences

-NOHEAder              suppresses printing a header

-SEQName=p             sets sequence names to be Partial (P), Full (F),

                         or None (N)

-ASKstart              asks about the starting numbers for each sequence

-WIDth=50              sets the number of residues per line

-BLOcksize=10          sets the number of residues per block

-SPAcing=1             sets the number of spaces between blocks

-BLAnklines=2          sets the number of blank lines between each group

                         of sequence lines

-FONtsize=10           sets the font size in terms of PostScript numbers

-XMArgin=20            sets the left and right margins in PostScript units

-YMArgin=20            sets the top and bottom margins in PostScript units

-FAT                   uses fat (bold) lettering

-COLor=B,L,P,W         sets the colors (shading intensities) to use for

                         identical, similar, somewhat-similar, and

                         non-similar comparisons to the consensus,

                         respectively.  The available colors, by decreasing

                         order of intensity, are: Black (B), Dark (D),

                         Light (L), Pale (P), and White (W).

-DENsity=f             sets the density of printing to be either Rough (R)

                         or Fine (F).  Rough may photocopy better.  Density

                         only works with the colors Dark, Light, and Pale.
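When PrettyBox is driven from a script, the parameters in the summary above can be assembled into an argument list. The helper below is a hypothetical sketch, not part of the package; it only builds the list and does not run the program.

```python
def prettybox_argv(infile, **params):
    """Build an argument list for prettybox from keyword parameters."""
    argv = ["prettybox", f"-INfile={infile}"]
    for name, value in params.items():
        if value is True:
            argv.append(f"-{name}")  # a switch such as -CONsensus
        elif value is not False and value is not None:
            argv.append(f"-{name}={value}")  # a valued parameter such as -WIDth=60
    return argv


args = prettybox_argv(
    "pileup.msf{*}", CONsensus=True, ORIentation="p", WIDth=60, Default=True
)
print(" ".join(args))
```

Passing such a list to Python's subprocess.run without a shell would keep the braces and asterisk in the sequence specification from being interpreted by the shell.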

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

Shading and the effect of -CASe can be restricted to ranges of sequence positions that are specified in a mark file. The presence of a file in your directory with the same name as your sequence and the filename extension .mrk causes the program to mark each range specified in the file. You can specify a marking file with the parameter -MARk=pretty.mrk. The file pretty.mrk contains a series of up to 100 sequence ranges defining regions of interest.

The parameter -SIMPlify=simplify.txt lets you simplify an alignment by synonymizing groups of related symbols, such as can be done with the program Simplify. Such a simplification would allow you, for instance, to treat all hydrophobic amino acids as equivalent. A public data file called simplify.txt contains useful groupings of amino acid residues.

Here are the default simplifications in the public data file simplify.txt.

A = P,A,G,S,T (neutral, weakly hydrophobic)

D = Q,N,E,D,B,Z (hydrophilic, acid amide)

H = H,K,R (hydrophilic, basic)

I = L,I,V,M (hydrophobic)

F = F,Y,W (hydrophobic, aromatic)

C = C (cross-link forming)

All other characters are unchanged.
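The default groupings listed above behave like a lookup table. The sketch below applies them to an uppercase protein sequence; the names SIMPLIFY_GROUPS and simplify are illustrative, and the format of the simplify.txt file itself is not modeled.

```python
# Representative symbol -> members, copied from the default groupings above.
SIMPLIFY_GROUPS = {
    "A": "PAGST",
    "D": "QNEDBZ",
    "H": "HKR",
    "I": "LIVM",
    "F": "FYW",
    "C": "C",
}

LOOKUP = {
    member: representative
    for representative, members in SIMPLIFY_GROUPS.items()
    for member in members
}


def simplify(sequence):
    """Replace each symbol by its group representative; others are unchanged."""
    return "".join(LOOKUP.get(sym, sym) for sym in sequence)


print(simplify("MKTAY-LQ"))  # IHAAF-ID
```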

If a file named simplify.txt is present in your current working directory, it takes precedence over the public version. You can use Fetch to copy simplify.txt to your local directory and modify it for your own use. The public version of simplify.txt is appropriate only for protein alignments and you must create your own for nucleotide alignments. See the Simplify documentation for additional information.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program's default scoring matrix in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Section 4, Using Data Files in the User's Guide.
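The search order for a matrix named without a directory specification can be sketched as a first-match lookup. This is an illustration only: find_matrix is a hypothetical name, and the caller is assumed to supply the real directories behind the local directory, MyData, GenMoreData, and GenRunData, in that order.

```python
from pathlib import Path


def find_matrix(name, search_dirs):
    """Return the first existing file called name in search_dirs, or None."""
    path = Path(name)
    if path.parent != Path("."):
        # A directory specification was given, so no search takes place.
        return path if path.is_file() else None
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


# Hypothetical directory names standing in for the four locations.
print(find_matrix("mymatrix.cmp", [".", "mydata", "genmoredata", "genrundata"]))
```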

The default scoring matrices are blosum62.cmp for proteins and prettyboxdna.cmp for nucleic acids.

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-ORIentation=L

Specifies whether the direction for printing will be Landscape (L) or Portrait (P).

-NUMbering=R

Sets printing of sequence numbering to Right side (R), Top (T), or None (N).

-CONsensus

Causes the calculated consensus sequence to be displayed.

-MATRix=mymatrix.cmp

Allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData.

For more information see the Local Scoring Matrices section.

-MARk=pretty.mrk

Specifies a mark file that contains ranges of sequence positions to which shading should be restricted. These ranges also restrict the positions that are affected by the -CONsensus, -IDEntity, and -CASe parameters. For more information, see the LOCAL DATA FILES topic of this document.

-PAIr=x,2,1

Determines the similarity levels corresponding to the four levels of shading. The three values associated with -PAIr are the display thresholds for Black, Light, and Pale shading. By default, the match display criterion for Black is symbolic identity. If a number is used for the first value, then symbols whose comparison values are greater than or equal to this numeric threshold will be shaded black. If you need to set the second and third thresholds but still want Black shading to represent identical symbols, use x instead of a number as the first value. The -COLor parameter can be used to specify which of the five available shading "colors" (i.e. Black, Dark, Light, Pale, and White) is associated with each of the four levels of similarity.

-THReshold=1

Determines the scoring matrix value below which a symbol may not vote for a coalition (see the Calculating and Displaying a Consensus section above). PrettyBox chooses a default threshold that is appropriate for the scoring matrix it reads. If you select a different scoring matrix with -MATRix, the program will adjust the default threshold accordingly. Use -THReshold to specify an alternative threshold if you don't want to accept the default value.

-PLUrality=2.0

Defines the number of votes (vote weights) below which there is no consensus (see the Calculating and Displaying a Consensus section above).

-IDEntity

Causes PrettyBox to restrict shading to columns where there is complete agreement among all of the sequences. -IDEntity modifies the effect of the -CASe parameter.

-CASe

Causes PrettyBox to use uppercase with symbols whose comparison value with the consensus symbol is greater than or equal to the first value specified with the -PAIr parameter. When the default value "x" is used with -PAIr, only symbols that are identical to the consensus are displayed in uppercase.

If -CASe is used with -IDEntity, then only the symbols in shaded columns are shown in uppercase.

-SIMPlify=simplify.txt

Causes PrettyBox to reduce the number of symbols in sequences, similar to the function of the program Simplify. See the LOCAL DATA FILES topic for additional information.

The simplify.txt file in the public data directory is only appropriate for simplifying peptide sequences. You must create your own simplify.txt file to define equivalences for nucleic acid simplifications.

-SIMIlar=A

Determines whether or not PrettyBox considers similar as opposed to only identical sequence characters when calculating the consensus sequence. By default, all possible residues are used. If you specify O, then only identical matches are considered when determining consensus characters. Similarities are defined in the scoring matrix (see -MATRix).

-NOOFFset

Prevents printing the consensus sequence offset from the rest of the sequences. By default the consensus sequence is offset slightly downward.

-NOHEAder

Prevents printing an informative header at the top of each page of output. By default the header is displayed.

-SEQName=P

Specifies whether sequence names should be Partial (P), i.e. without the filenames, Full (F), or None (N), i.e. completely omitted.

-ASKstart

Causes PrettyBox to prompt for a number to use for the first residue of each sequence.

-WIDth=50

Specifies the number of sequence symbols to display on each line. You can set the width from 10 to 150 symbols.

Note: The right-hand side of the output may be trimmed when the width is too large for a given size of paper.

-BLOcksize=10

Specifies the number of sequence symbols to put into each block. You can set the blocksize from 2 to 150 symbols.

-SPAcing=1

Sets the number of spaces between blocks of sequence characters. The specified value may be any integer from 0 to 5.
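The combined effect of -WIDth, -BLOcksize, and -SPAcing on one sequence row can be sketched as plain text. The function name layout_row is hypothetical, and letting the last block on a line run short when the width is not a multiple of the block size is an assumption.

```python
def layout_row(sequence, width=50, blocksize=10, spacing=1):
    """Split one aligned sequence into lines of spaced blocks."""
    lines = []
    for start in range(0, len(sequence), width):
        chunk = sequence[start:start + width]  # at most `width` symbols per line
        blocks = [chunk[i:i + blocksize] for i in range(0, len(chunk), blocksize)]
        lines.append((" " * spacing).join(blocks))
    return lines


for line in layout_row("ABCDEFGHIJKL", width=8, blocksize=4, spacing=1):
    print(line)
# ABCD EFGH
# IJKL
```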

-BLAnklines=2

Sets the number of blank lines between each group of sequence lines in the alignment. The specified value may be any integer from 0 to 5.

-FONtsize=10

Sets the font size in PostScript units, where 1 unit = 1/72 of an inch.

-XMArgin=20

Sets the left and right margins in PostScript units.

-YMArgin=20

Sets the top and bottom margins in PostScript units.

-FAT

Specifies a font that has a heavier stroke weight than normal.

-COLor=B,L,P,W

Sets the shading intensity to use for the four levels of shading (i.e. identical, highly similar, weakly similar, and non-matching sequence characters, respectively). The available colors are Black (B), Dark (D), Light (L), Pale (P), and White (W).

With the default settings, successively darker coloring (shading) is used to represent greater similarity to the consensus character. Using this parameter, you can change the correspondence between a given degree of similarity and the level of shading. For example, -COLor=W,L,P,B reverses this correspondence so that lighter shading indicates greater similarity.
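A -COLor value can be read as a mapping from the four similarity levels to color names. In the sketch below the function name parse_color and the level labels are illustrative, and accepting lowercase codes is an assumption.

```python
COLORS = {"B": "Black", "D": "Dark", "L": "Light", "P": "Pale", "W": "White"}
LEVELS = ("identical", "highly similar", "weakly similar", "non-matching")


def parse_color(value="B,L,P,W"):
    """Map each similarity level to the color named in a -COLor value."""
    codes = [code.strip().upper() for code in value.split(",")]
    if len(codes) != len(LEVELS):
        raise ValueError("expected four comma-separated color codes")
    unknown = [code for code in codes if code not in COLORS]
    if unknown:
        raise ValueError(f"unknown color codes: {unknown}")
    return dict(zip(LEVELS, (COLORS[code] for code in codes)))


print(parse_color())           # default: darker means more similar
print(parse_color("W,L,P,B"))  # reversed: lighter means more similar
```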

-DENsity=F

Sets the printing density as either rough (R) or fine (F). Rough may be more suitable for photocopied output. Density only affects the Dark (D), Light (L), and Pale (P) colors.

Printed: May 27, 2005 14:05
