Window makes a table of the frequencies of different sequence patterns within a window as it is moved along a sequence. A pattern is any short sequence like GC or R or ATG. You can plot the output with the program StatPlot.
Window calculates the frequency of patterns within a window of a set length. A pattern is any short sequence such as GC, or R, or ATG. The output is a table of numbers suitable for input to the StatPlot program. The window is moved along the sequence by a shift increment, and the number of observations of the pattern at every window position is measured. The frequency can be reported as a fraction, a percent, or simply a number of observations. You can also ask to see the difference between the number of observations of the pattern and the expected number of observations for a random sequence of identical composition. This expectation can be based either on the composition within the window (local) or on the composition of the whole sequence range (global). Another statistic lets you see the difference in frequency between two patterns. The pattern frequencies measured by Window are for one strand only.
You define the window size and the shift increment. The shift increment is the amount the window is moved between measurements. From a menu of the eight possible measures, you may choose up to six. Each measure you choose makes a column in the output table. After choosing the measurements, you are prompted to enter the pattern you want measured. For each measurement you must designate a pattern when prompted with a question that reminds you of the kind of measurement and the column number.
Here is a session using Window to measure the frequency of C, G, CG, and GC in the sequence gamma.seq. You can see from this experiment whether or not the frequency of the dinucleotide CG correlates well with the content of the nucleotides C and G (it doesn't). The output file from this session with Window is plotted as an example in the program StatPlot.
WINDOW on what sequence ? gamma.seq
Begin (* 1 *) ?
End (* 11375 *) ? 500
Reverse (* No *) ?
What window size (* 100 *) ?
What shift increment (* 3 *) ?
What should I call the output file (* gamma.wdw *) ?
What functions do you want:
a) number of patterns observed
b) percent of patterns observed
c) fraction of patterns observed
d) number of observed - expected(local) patterns
e) number of observed - expected(global) patterns
f) percent of observed - expected(local) patterns
g) percent of observed - expected(global) patterns
h) percent difference between two patterns
Please select up to 6 functions (* ae *): aaadad
What is the pattern for the "a" stat in column 1 ? c
What is the pattern for the "a" stat in column 2 ? g
What is the pattern for the "a" stat in column 3 ? cg
What is the pattern for the "d" stat in column 4 ? cg
What is the pattern for the "a" stat in column 5 ? gc
What is the pattern for the "d" stat in column 6 ? gc
Some of the output file is shown below. You can see the data plotted in the figure with the documentation for the StatPlot program.
WINDOW of: gamma.seq check: 6474 from: 1 to: 500
Window: 100 Shift: 3 MatchType: Subset MisMatch: 0
Human fetal beta globins G and A gamma
from Shen, Slightom and Smithies, Cell 26; 191-203.
Analyzed by Smithies et al. Cell 26; 345-353.
October 13, 1998
Position C(obsrv) G(obsrv) CG(obsrv) CG_ob-ex(l) GC(obsrv) GC_ob-ex(l) ..
50 17.000 30.000 1.000 -4.049 4.000 -1.049
53 19.000 29.000 1.000 -4.455 5.000 -0.455
56 17.000 30.000 1.000 -4.049 5.000 -0.049
443 31.000 14.000 0.000 -4.297 2.000 -2.297
446 32.000 14.000 0.000 -4.435 2.000 -2.435
449 32.000 13.000 0.000 -4.118 2.000 -2.118
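If you want to post-process the table outside StatPlot, rows like those above can be read with a few lines of Python. This is a minimal sketch, not part of the package: `parse_window_table` is a hypothetical helper, and it assumes (from the excerpt shown here only) that the column heading line ends in `..` and that each following non-blank line holds a position followed by one number per column.

```python
SAMPLE = """\
Position C(obsrv) G(obsrv) CG(obsrv) CG_ob-ex(l) GC(obsrv) GC_ob-ex(l) ..
50 17.000 30.000 1.000 -4.049 4.000 -1.049
53 19.000 29.000 1.000 -4.455 5.000 -0.455
"""

def parse_window_table(text):
    """Return (column_names, rows) from Window-style output text.

    Assumption: the heading line ends with '..' and data rows follow it.
    """
    lines = text.splitlines()
    # Everything before the heading line is free-text documentation.
    head = next(i for i, line in enumerate(lines)
                if line.rstrip().endswith(".."))
    names = lines[head].split()[:-1]      # drop the trailing '..'
    rows = []
    for line in lines[head + 1:]:
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        position = int(fields[0])
        values = [float(x) for x in fields[1:]]
        rows.append((position, values))
    return names, rows

names, rows = parse_window_table(SAMPLE)
for position, values in rows:
    print(position, dict(zip(names[1:], values)))
```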
Window accepts a single sequence file as input. The function of Window depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.
StatPlot plots a set of parallel curves from a table of numbers like the table written by the Window program. The statistics in each column of the table are associated with a position in the analyzed sequence.
The region of the input sequence to be analyzed may not be more than 175,000 symbols long.
No more than six statistics can be tabulated. The shift increment cannot exceed the window size. Numbering in the Position column is for the forward strand even if the reverse strand is chosen.
Pattern definitions can only contain GCG sequence characters (see Appendix III). We could easily modify Window to find patterns using a pattern definition syntax like that used for FindPatterns. Contact us if you think this is a good idea!
Each observation of a pattern is stored in a logical array. This array has a true (pattern observed) or false (pattern not observed) value for every position in the original sequence.
After the observation array is assembled, the incidence of each pattern can be found simply by putting down the window as a mask over the array and counting the observations under the window. The window is moved along the array (sequence) by the set shift increment and the observations are counted again.
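The two steps just described (build a logical array, then slide a mask over it) can be sketched as follows. This is an illustration under stated assumptions rather than Window's actual code: the function names are hypothetical, matching is exact string comparison only, and a pattern is counted in a window only if it fits entirely inside it.

```python
def observe(seq, pattern):
    """Logical array with one entry per sequence position.

    Entry i is True when an exact copy of `pattern` starts at position i.
    """
    k = len(pattern)
    return [seq[i:i + k] == pattern for i in range(len(seq))]

def window_counts(obs, pattern_len, window, shift):
    """Count the observations under each position of the window."""
    # Number of start positions at which a whole pattern fits in the window.
    span = window - pattern_len + 1
    counts = []
    for start in range(0, len(obs) - window + 1, shift):
        counts.append(sum(obs[start:start + span]))
    return counts

obs = observe("ACGCGTTCGA", "CG")
print(window_counts(obs, pattern_len=2, window=6, shift=2))  # [2, 1, 1]
```

Building the array once means each additional window position costs only a count over a slice, however the pattern itself was matched.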
Window calculates the expected number of observations per window in the following manner. The fraction of each symbol in the pattern is measured, either in the window (local expectation) or in the whole sequence range (global expectation). The product of the fractions for each symbol in the pattern multiplied by the maximum possible number of patterns in the window is the expected number of observations for the pattern in the window. Four of the measurements report the difference between the actual number of observations and the expected number.
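As a worked check, the calculation described above can be applied to the first row of the sample output (17 C and 30 G in a window of 100, with 1 CG observed). The sketch below is illustrative only; `expected_count` is a hypothetical name, and the maximum possible number of patterns is taken to be window size minus pattern length plus one, an assumption that reproduces the numbers in the sample table.

```python
from math import prod

def expected_count(symbol_counts, total_symbols, pattern, window):
    """Expected observations of `pattern` in one window.

    symbol_counts / total_symbols give the symbol fractions, measured in
    the window (local) or in the whole sequence range (global).
    """
    fractions = [symbol_counts[s] / total_symbols for s in pattern]
    max_patterns = window - len(pattern) + 1   # assumed definition
    return prod(fractions) * max_patterns

# Local expectation for CG in the window at position 50 of the sample.
local = expected_count({"C": 17, "G": 30}, 100, "CG", 100)
print(round(local, 3))        # 5.049
print(round(1 - local, 3))    # -4.049, the CG_ob-ex(l) value in the table
```

For the global expectation, the same function would be given the symbol counts and length of the whole sequence range instead of those of the window.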
The percentage measures are simply the number of observations divided by the maximum possible number of patterns in the window and multiplied by 100.
Fraction measures are the number of observations divided by the maximum possible number of patterns in the window.
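Both measures reduce to one division. In this small sketch the function names are hypothetical, and the maximum possible number of patterns is again assumed to be window size minus pattern length plus one.

```python
def fraction(observed, pattern_len, window):
    """Observations divided by the maximum possible number of patterns."""
    return observed / (window - pattern_len + 1)

def percent(observed, pattern_len, window):
    """The same ratio expressed as a percentage."""
    return 100.0 * fraction(observed, pattern_len, window)

# 17 observations of the single symbol C in a window of 100 symbols.
print(percent(17, 1, 100))            # 17.0
# 4 observations of the dinucleotide GC in the same window (99 possible).
print(round(percent(4, 2, 100), 2))   # 4.04
```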
For nucleic acid sequences, the ambiguity codes in Appendix III are searched for subset matches. For instance, if the pattern specified is RR and the sequence contains an AG, an observation is scored at the position of the A. If the pattern specified were AG and the sequence contained an RR, no match would be scored. The sequence symbols must be the same as or a subset of the nucleotides implied by the pattern symbols.
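The subset rule can be written as a set comparison. The sketch below assumes the standard IUPAC nucleotide ambiguity codes; Appendix III remains the authoritative symbol table, and `subset_match` is a hypothetical name, not a function of the package.

```python
# Assumed symbol table: standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

def subset_match(pattern, site):
    """True if each sequence symbol implies only bases allowed by the pattern symbol."""
    if len(pattern) != len(site):
        return False
    return all(set(IUPAC[s]) <= set(IUPAC[p])
               for p, s in zip(pattern.upper(), site.upper()))

print(subset_match("RR", "AG"))  # True: A and G are each a subset of R
print(subset_match("AG", "RR"))  # False: R is not a subset of A
```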
If the sequence is a peptide sequence or if you have -PERfect on the command line, Window scores occurrences of patterns by finding perfect examples of the pattern in the sequence.
If you use the command-line parameter -ALL and your sequence is a nucleic acid sequence, the sequence can be an overlapping set of the pattern instead of only a subset. (In other words, an ambiguous base in the sequence can match a base in the pattern even if the sequence's ambiguous base is not a subset of the pattern's base, provided the two share at least one nucleotide.) Using the same example as for subset matching, the pattern AG would now match the sequence RR. As another example, the pattern RA would match the sequence MW: R (A or G) shares A with M (A or C), and A is one of the nucleotides implied by W (A or T).
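Overlapping-set matching replaces the subset test with a test for a non-empty intersection. As before, this is a sketch that assumes the standard IUPAC ambiguity codes (see Appendix III for the symbols the package actually uses), and `overlap_match` is a hypothetical name.

```python
# Assumed symbol table: standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

def overlap_match(pattern, site):
    """True if every pattern symbol shares at least one base with the sequence symbol."""
    if len(pattern) != len(site):
        return False
    return all(set(IUPAC[p]) & set(IUPAC[s])
               for p, s in zip(pattern.upper(), site.upper()))

print(overlap_match("AG", "RR"))  # True: R could be A, and R could be G
print(overlap_match("AG", "YY"))  # False: Y (C or T) can never be A
```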
The cost of running Window is very low, but the output files can be very large. You should recognize that Window writes one line in the output file for every position of the window. Running Window on a sequence of length 10,000, with window size 100, shift increment 1, and using five measures will generate an output file with about 10,000 lines and about 60,000 numbers.
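You can estimate the size of the output before a run. The sketch below assumes that a window must fit entirely within the chosen range, which is consistent with the sample session (range 1 to 500, window 100, shift 3, first position 50, last position 449); `output_size` is a hypothetical helper.

```python
def output_size(range_len, window, shift, measures):
    """Return (lines, numbers) expected in the output table."""
    lines = (range_len - window) // shift + 1
    numbers = lines * (measures + 1)   # one Position column plus the measures
    return lines, numbers

print(output_size(10_000, 100, 1, 5))  # (9901, 59406)
print(output_size(500, 100, 3, 6))     # (134, 938)
```

Doubling the shift increment roughly halves the file, so a larger increment is the simplest way to keep output manageable for long sequences.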
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional.
WINDOW does not support complete command-line control.
-BEGin=2101 -END=2600 sets the range of interest
-REVerse uses the reverse strand
-ALL makes an overlapping-set search
-MISmatch allows mismatches between the pattern and the sequence
-PERFect suppresses ambiguous matches for nucleic acid sequences
You can set the parameters listed below from the command line.
-ALL
Makes an overlapping-set search for patterns in nucleic acid sequences. If your sequence is rich in ambiguity, you can measure the frequency of potential examples of patterns.
-MISmatch
Allows mismatches between the pattern and the sequence. Window will prompt you for the number of mismatches to allow.
-PERfect
Normally, Window searches for patterns using subset matching in nucleic acids and perfect matching in peptide sequences. You can override the subset default with the command-line parameter -PERfect to suppress all matches between ambiguous base symbols.
Printed: April 5, 2005 15:48
Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.