WINDOW

Window makes a table of the frequencies of different sequence patterns within a window as it is moved along a sequence. A pattern is any short sequence like GC or R or ATG. You can plot the output with the program StatPlot.

DESCRIPTION

Window calculates the frequency of patterns within a window of a set length. A pattern is any short sequence such as GC, or R, or ATG. The output is a table of numbers suitable for input to the StatPlot program. The window is moved along the sequence by a shift increment, and the number of observations of the pattern at every window position is measured. The frequency can be reported as a fraction, a percent, or simply a number of observations. You can also ask to see the difference between the number of observations of the pattern and the expected number of observations for a random sequence of identical composition. This expectation can be based either on the composition within the window (local) or on the composition of the whole sequence range (global). Another statistic lets you see the difference in frequency between two patterns. The pattern frequencies measured by Window are for one strand only.

PARAMETERS

You define the window size and the shift increment, which is the distance the window moves between measurements. From a menu of eight possible measures, you may choose up to six; each measure you choose becomes a column in the output table. A measure may be chosen more than once, for example to apply it to several patterns. After you choose the measures, Window prompts you for the pattern to use in each column. Each prompt names the kind of measure and the column number.

EXAMPLE

Here is a session using Window to measure the frequency of C, G, CG, and GC in the sequence gamma.seq. You can see from this experiment whether or not the frequency of the dinucleotide CG correlates well with the content of the nucleotides C and G (it doesn't). The output file from this session with Window is plotted as an example in the program StatPlot.

% window

 WINDOW on what sequence ?  gamma.seq

                  Begin (* 1 *) ?

                End (* 11375 *) ?  500

               Reverse (* No *) ?

 What window size (* 100 *) ?

 What shift increment (* 3 *) ?

 What should I call the output file (* gamma.wdw *) ?

 What functions do you want:

      a) number   of patterns observed
      b) percent  of patterns observed
      c) fraction of patterns observed
      d) number   of observed - expected(local)  patterns
      e) number   of observed - expected(global) patterns
      f) percent  of observed - expected(local)  patterns
      g) percent  of observed - expected(global) patterns
      h) percent difference between two patterns
      q)uit

 Please select up to 6 functions (* ae *):  aaadad

 What is the pattern for the "a" stat in column 1 ?  c

 What is the pattern for the "a" stat in column 2 ?  g

 What is the pattern for the "a" stat in column 3 ?  cg

 What is the pattern for the "d" stat in column 4 ?  cg

 What is the pattern for the "a" stat in column 5 ?  gc

 What is the pattern for the "d" stat in column 6 ?  gc

OUTPUT

Some of the output file is shown below. You can see the data plotted in the figure with the documentation for the StatPlot program.

 WINDOW of: gamma.seq  check: 6474  from: 1  to: 500

 Window: 100  Shift: 3  MatchType: Subset MisMatch: 0

Human fetal beta globins G and A gamma

from Shen, Slightom and Smithies,  Cell 26; 191-203.

Analyzed by Smithies et al. Cell 26; 345-353.

                         October 13, 1998 13:06

Position C(obsrv) G(obsrv) CG(obsrv) CG_ob-ex(l) GC(obsrv) GC_ob-ex(l)  ..

      50   17.000   30.000     1.000      -4.049     4.000      -1.049
      53   19.000   29.000     1.000      -4.455     5.000      -0.455
      56   17.000   30.000     1.000      -4.049     5.000      -0.049
     /////////////////////////////////////////////////////////////////
     443   31.000   14.000     0.000      -4.297     2.000      -2.297
     446   32.000   14.000     0.000      -4.435     2.000      -2.435
     449   32.000   13.000     0.000      -4.118     2.000      -2.118
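
If you want to work with the table outside StatPlot, a short script can read it. The sketch below is in Python and is not part of the GCG package. The function name parse_window_table is hypothetical. It assumes that the column-name line is the first line ending in ".." and that every data line holds one number per column, as in the excerpt above. Any other line, such as the row of slashes that marks omitted lines in this documentation, is skipped.

```python
def parse_window_table(text):
    """Split Window-style output into column names and rows of floats.

    Assumption: the column-name line is the first line ending in "..",
    and data lines have exactly one numeric field per column.
    """
    lines = text.splitlines()
    header_idx = next(i for i, ln in enumerate(lines)
                      if ln.rstrip().endswith(".."))
    columns = lines[header_idx].split()[:-1]   # drop the trailing ".."
    rows = []
    for ln in lines[header_idx + 1:]:
        fields = ln.split()
        if len(fields) != len(columns):
            continue                            # blank or non-data line
        try:
            values = [float(f) for f in fields]
        except ValueError:
            continue                            # not a numeric row
        rows.append(dict(zip(columns, values)))
    return columns, rows


SAMPLE = """ WINDOW of: gamma.seq  check: 6474  from: 1  to: 500

Position C(obsrv) G(obsrv) CG(obsrv) CG_ob-ex(l) GC(obsrv) GC_ob-ex(l)  ..

      50   17.000   30.000     1.000      -4.049     4.000      -1.049
      53   19.000   29.000     1.000      -4.455     5.000      -0.455
"""

columns, rows = parse_window_table(SAMPLE)
print(columns)
print(rows[0]["CG_ob-ex(l)"])
```

Each row is returned as a dictionary keyed by column name, so a column can be pulled out by the label Window wrote for it.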

INPUT FILES

Window accepts a single sequence file as input. The behavior of Window depends on whether the input sequence is protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence is not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

StatPlot plots a set of parallel curves from a table of numbers like the table written by the Window program. The statistics in each column of the table are associated with a position in the analyzed sequence.

RESTRICTIONS

The region of the input sequence to be analyzed may not be more than 175,000 symbols long.

No more than six statistics can be tabulated. The shift increment cannot exceed the window size. Numbering in the Position column is for the forward strand even if the reverse strand is chosen.

Pattern definitions can only contain GCG sequence characters (see Appendix III). We could easily modify Window to find patterns using a pattern definition syntax like that used for FindPatterns. Contact us if you think this is a good idea!

ALGORITHM

Each observation of a pattern is stored in a logical array. This array has a true (pattern observed) or false (pattern not observed) value for every position in the original sequence.

After the observation array is assembled, the incidence of each pattern can be found simply by putting down the window as a mask over the array and counting the observations under the window. The window is moved along the array (sequence) by the set shift increment and the observations are counted again.
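
The two steps above can be sketched in Python. This is an illustration only, not Window's source code, and the function names are hypothetical. It makes three assumptions: patterns are matched exactly (no ambiguity codes), a pattern is counted only if it lies entirely inside the window, and the reported position is the window start plus half the window size, which is consistent with the first position of 50 in the sample output for a window of 100 starting at base 1.

```python
def observation_array(seq, pattern):
    """True at each start position where the pattern occurs exactly."""
    seq, pattern = seq.upper(), pattern.upper()
    k = len(pattern)
    return [seq[i:i + k] == pattern for i in range(len(seq) - k + 1)]


def window_counts(seq, pattern, window, shift):
    """Return (position, count) for each placement of the window."""
    obs = observation_array(seq, pattern)
    k = len(pattern)
    rows = []
    for start in range(0, len(seq) - window + 1, shift):
        # Count only pattern starts that fit wholly inside the window.
        count = sum(obs[start:start + window - k + 1])
        # Assumed convention: 0-based start + half the window size.
        rows.append((start + window // 2, count))
    return rows


print(window_counts("ACGCGT", "CG", window=4, shift=1))
# prints [(2, 1), (3, 2), (4, 1)]
```

Because the observation array is built once, each window position costs only a sum over a slice, no matter how many times the window is moved.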

Window calculates the expected number of observations per window in the following manner. The fraction of each symbol in the pattern is measured, either in the window (local expectation) or in the whole sequence range (global expectation). The product of these fractions, multiplied by the maximum possible number of patterns in the window, is the expected number of observations for the pattern in the window. Four of the measurements (d, e, f, and g in the menu) report the difference between the actual number of observations and the expected number.
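
As a worked illustration of this expectation, here is a minimal Python sketch. The function name expected_count is hypothetical and the code is not taken from Window. It assumes exact symbol counts (no ambiguity codes) and that the maximum possible number of patterns in a window is the window size minus the pattern length plus one. Pass the window's own sequence for a local expectation, or the whole analyzed range for a global one.

```python
def expected_count(composition_seq, pattern, window):
    """Expected observations of `pattern` in a window of size `window`.

    composition_seq supplies the symbol fractions: the window itself
    (local expectation) or the whole sequence range (global expectation).
    """
    composition_seq = composition_seq.upper()
    product = 1.0
    for symbol in pattern.upper():
        product *= composition_seq.count(symbol) / len(composition_seq)
    # Assumption: a pattern of length k can start at window - k + 1 places.
    max_patterns = window - len(pattern) + 1
    return product * max_patterns


# A window of 100 with 17 C and 30 G, as at position 50 in the sample output.
window_seq = "C" * 17 + "G" * 30 + "A" * 53
expected = expected_count(window_seq, "CG", window=100)
print(round(expected, 3))       # prints 5.049
print(round(1 - expected, 3))   # prints -4.049
```

With one CG observed in that window, the difference is 1 - 5.049 = -4.049, which agrees with the CG_ob-ex(l) value shown at position 50 in the sample output. The other rows shown agree under the same assumption.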

The percentage measures are simply the number of observations divided by the maximum possible number of patterns in the window and multiplied by 100.

Fraction measures are the number of observations divided by the maximum possible number of patterns in the window.
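
The percentage and fraction measures can be written out as follows. This is a Python sketch with hypothetical function names. It assumes that the maximum possible number of patterns in a window is the window size minus the pattern length plus one.

```python
def fraction_observed(observed, window, pattern_length):
    """Observations divided by the maximum possible number of patterns."""
    max_patterns = window - pattern_length + 1   # assumed definition
    return observed / max_patterns


def percent_observed(observed, window, pattern_length):
    """The fraction measure multiplied by 100."""
    return 100.0 * fraction_observed(observed, window, pattern_length)


# 17 observations of the single base C in a window of 100
print(round(fraction_observed(17, 100, 1), 3))   # prints 0.17
# 1 observation of the dinucleotide CG in a window of 100 (99 possible)
print(round(percent_observed(1, 100, 2), 3))     # prints 1.01
```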

SUBSET MATCHING

For nucleic acid sequences, the ambiguity codes in Appendix III are searched for subset matches. For instance, if the pattern specified is RR and the sequence contains an AG, an observation is scored at the position of the A. If the pattern specified were AG and the sequence contained an RR, no match would be scored. The sequence symbols must be the same as or a subset of the nucleotides implied by the pattern symbols.
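
The subset rule can be illustrated with a short Python sketch. The table below holds the standard IUPAC nucleotide ambiguity codes, which are assumed here to correspond to the codes in Appendix III. The function name subset_match is hypothetical, and the code is not taken from Window.

```python
# Standard IUPAC nucleotide codes (assumed to match Appendix III).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def subset_match(pattern, segment):
    """True if every sequence symbol is the same as, or a subset of,
    the nucleotides implied by the pattern symbol at that position."""
    if len(pattern) != len(segment):
        return False
    return all(
        set(IUPAC[s]) <= set(IUPAC[p])
        for p, s in zip(pattern.upper(), segment.upper())
    )


print(subset_match("RR", "AG"))   # prints True
print(subset_match("AG", "RR"))   # prints False
```

The comparison is not symmetric: the pattern may be more ambiguous than the sequence, but not the other way around.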

PERFECT MATCHING

If the sequence is a peptide sequence or if you have -PERFect on the command line, Window scores occurrences of patterns by finding perfect examples of the pattern in the sequence.

OVERLAPPING SET MATCHING

If you use the command-line parameter -ALL and your sequence is a nucleic acid sequence, the sequence can be an overlapping set of the pattern instead of only a subset. (In other words, ambiguous bases in the sequence can match bases in the pattern even if the sequence's ambiguous base is not a subset of the pattern's base.) Using the same example as in the SUBSET MATCHING topic, the pattern AG would now match the sequence RR. As another example, the pattern RY would match the sequence MK, because R (A or G) shares A with M (A or C), and Y (C or T) shares T with K (G or T).
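
For comparison with subset matching, here is a Python sketch of the overlapping-set rule. It uses the standard IUPAC nucleotide codes, assumed here to correspond to Appendix III. The function name overlap_match is hypothetical, and the code is not Window's own.

```python
# Standard IUPAC nucleotide codes (assumed to match Appendix III).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def overlap_match(pattern, segment):
    """True if, at every position, the pattern symbol and the sequence
    symbol have at least one nucleotide in common."""
    if len(pattern) != len(segment):
        return False
    return all(
        set(IUPAC[p]) & set(IUPAC[s])
        for p, s in zip(pattern.upper(), segment.upper())
    )


print(overlap_match("AG", "RR"))   # prints True
print(overlap_match("AG", "YY"))   # prints False
```

Under this rule, an ambiguous sequence symbol counts as a potential example of the pattern whenever it could stand for a base that the pattern allows.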

CONSIDERATIONS

The cost of running Window is very low, but the output files can be very large. Window writes one line in the output file for every position of the window. Running Window on a sequence of length 10,000 with a window size of 100, a shift increment of 1, and five measures generates an output file with about 10,000 lines and about 60,000 numbers (the position plus five measures on each line).
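
To estimate the size of an output file before a run, the arithmetic can be sketched in Python. The function name output_size is hypothetical. It assumes that the window must fit entirely within the analyzed range and that each line holds the position plus one number per measure. Header lines are not counted.

```python
def output_size(seq_length, window, shift, n_measures):
    """Return (data lines, numbers) for one run of a sliding window."""
    if window > seq_length:
        return 0, 0
    # One line per window position that fits inside the range.
    n_lines = (seq_length - window) // shift + 1
    # Each line has a position column plus one column per measure.
    return n_lines, n_lines * (n_measures + 1)


print(output_size(10000, 100, 1, 5))   # prints (9901, 59406)
print(output_size(500, 100, 3, 6))     # prints (134, 938)
```

The second call uses the settings of the example session (range 1 to 500, window 100, shift 3, six measures). Its 134 positions run from 50 to 449 in steps of 3, which agrees with the first and last rows of the sample output.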

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.

WINDOW does not support complete command-line control.

Prompted Parameters:

-BEGin=2101 -END=2600       sets the range of interest

-REVerse                    uses the reverse strand

Optional Parameters:

-ALL       makes an overlapping-set search

-MISmatch  allows mismatches between the pattern and the sequence

-PERFect   suppresses ambiguous matches for nucleic acid sequences

LOCAL DATA FILES

None.

PARAMETER REFERENCE

You can set the parameters listed below from the command line.

-ALL

Makes an overlapping-set search for patterns in nucleic acid sequences. If your sequence is rich in ambiguity, you can measure the frequency of potential examples of patterns.

-MISmatch

Allows mismatches between the pattern and the sequence. Window will prompt you for the number of mismatches to allow.

-PERFect

Normally, Window searches for patterns using subset matching in nucleic acids and perfect matching in peptide sequences. You can override the subset default with the command-line parameter -PERFect to suppress all matches between ambiguous base symbols.

Printed: April 5, 2005 15:48

Technical Support: support-us@accelrys.com, support-japan@accelrys.com, or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.