FINDPATTERNS+

FindPatterns+ identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal.

DESCRIPTION

Advantages of Plus “+” Programs:

Plus programs can read sequences in a variety of native formats, such as GCG RSF, GCG SSF, GCG MSF, GenBank, EMBL, FastA, SwissProt, PIR, and BSML, without conversion.

Plus programs remove the sequence length restriction of 350,000 bp.

If you do not need these features and want more interactivity, consider running the original version of the program.

FindPatterns+ locates short sequence patterns. If you are trying to find a pattern in a sequence or if you know of a sequence that you think occurs somewhere within a larger one, you can find your place with FindPatterns+. FindPatterns+ can look through large data sets for any short sequence patterns you specify. FindPatterns+ can recognize patterns with some symbols mismatched but not with gaps. It supports the IUPAC-IUB nucleotide ambiguity codes (see Appendix III) for searching through nucleotide sequences.

FindPatterns+ searches both strands of a nucleotide sequence if the patterns you specify are not identical on both strands. If your sequence is a protein, FindPatterns+ searches for a simple symbol match between your pattern and the protein sequence.
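
The both-strand rule above can be illustrated with a minimal Python sketch. It is not the program's own code: the function names are hypothetical and only the unambiguous bases A, C, G, and T are handled. A pattern needs a second-strand search only when it differs from its own reverse complement.

```python
# Complement table for unambiguous bases only (an illustrative simplification).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(pattern: str) -> str:
    """Return the pattern as it reads on the opposite strand."""
    return pattern.upper().translate(COMPLEMENT)[::-1]


def needs_both_strands(pattern: str) -> bool:
    """True when the pattern is not identical on both strands."""
    return reverse_complement(pattern) != pattern.upper()


print(needs_both_strands("GAATTC"))  # False: GAATTC is its own reverse complement
print(needs_both_strands("GGATG"))   # True: the opposite strand reads CATCC
```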

FindPatterns+ names each sequence on the screen as it is searched. The output file shows only sequences where a pattern was found unless you use -show. Five residues from the original sequence are shown on either side of each "find." The word /Rev appears if the reverse of the pattern is found. If you run FindPatterns+ with -names, the output file is written as a list file, which you can use as input to other Accelrys GCG (GCG) programs that support indirect file specifications.

You can specify patterns from a prompt or from a pattern file as described in the PATTERN FILE topic below.

When FindPatterns+ finishes searching for your patterns, it returns to the first prompt in the program, FINDPATTERNS+ in what sequence(s) ? If you simply press <Return> at the prompt, FindPatterns+ stops.

FindPatterns+ writes all of its results in the same output file. FindPatterns+ prints a short summary on your screen and in the output file when the search is completed.

EXAMPLE

Here is a session using FindPatterns+ to determine whether the protein sequence in AAA33535.RSF contains the pattern MAAKI. The pattern is typed at the prompt, and no pattern data file is given:

findpatterns+

FindPatterns+ identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal.

findpatterns+ in what sequence(s) ? AAA33535.RSF

Enter value for patterns ? MAAKI

Enter value for data file ?

What should I call the output file (* <sequence_name>.findpatterns+ *) ?

Input sequences processed  : 1

Total number of finds      : 1

Results written to AAA33535.findpatterns+

OUTPUT

Here is some of the output file:

! FINDPATTERNS on AAA33535.RSF allowing 0 mismatches

! Pattern1 MAAKI December 03, 2004 16:35 ..

AAA33535.RSF (AAA33535) ck: 2829 len: 235 !

Pattern1 MAAKI

1: MAAKI FCLLM

*** SUMMARY ***

Input sequences processed : 1

Total number of finds : 1

If the pattern is a complex expression, it is written above each find along with a simplification of the ambiguous parts of the pattern so that you can see what was actually found. In the example above, MAAKI is the pattern being searched for in the Zea mays protein sequence AAA33535.RSF.

INPUT FILES

FindPatterns+ takes single or multiple sequences as input. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example Genbank:*. The function of FindPatterns+ depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

GCG mapping programs Map, Map+, MapPlot, and MapSort can be used to mark finds in the context of a DNA restriction map. Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. These programs all use the same search algorithm and input data file format as FindPatterns+.

FindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal.

RESTRICTIONS

The restrictions specified using -mincuts and -maxcuts refer to the number of times a pattern is found in a sequence and must be fulfilled on a single strand of a nucleotide sequence for the find to be reported. For instance, if you use -mincuts=2 and -patterns=CCCC with the sequence CCCCGGGG, no finds will be reported: although there are two finds in total, there is only one instance of the pattern on each strand.
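
The CCCCGGGG case can be reproduced with a short Python sketch. The helper names are hypothetical, the pattern is treated as a literal string, and counting overlapping occurrences is an assumption rather than documented program behavior.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_finds(pattern: str, strand: str) -> int:
    """Count occurrences of a literal pattern on one strand (overlaps included)."""
    count = 0
    start = strand.find(pattern)
    while start != -1:
        count += 1
        start = strand.find(pattern, start + 1)
    return count


def finds_per_strand(pattern: str, sequence: str) -> tuple:
    """Return (top-strand finds, bottom-strand finds)."""
    bottom = sequence.translate(COMPLEMENT)[::-1]
    return count_finds(pattern, sequence), count_finds(pattern, bottom)


def passes_mincuts(pattern: str, sequence: str, mincuts: int) -> bool:
    """The minimum must be met on a single strand, not summed over both."""
    return any(n >= mincuts for n in finds_per_strand(pattern, sequence))


print(finds_per_strand("CCCC", "CCCCGGGG"))   # (1, 1)
print(passes_mincuts("CCCC", "CCCCGGGG", 2))  # False
```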

LIST REFINEMENT

The database programs LookUp, Names, StringSearch, FindPatterns+, FastA+, TFastA+, FastX+, TFastX+, SSearch+, and WordSearch can be used for list refinement if you are looking for sequences with something in common. For instance, you could identify human globin nucleotide sequences with LookUp. The output list from LookUp could then be refined further with FindPatterns+ to show only those human globin sequences containing EcoRI sites. If you run FindPatterns+ with -names, you could then do a FastA+ sequence search on the FindPatterns+ list file output to see if a sequence you have is similar to any of these EcoRI-containing human globin sequences.

Adding Lists Together

You can add two lists together by simply appending one of the files to the other. It is better if you use a text editor to modify the heading of the combined list so that the annotation in the list correctly reflects what you have done. Remember to delete the text heading from the second file so that it does not occur in the middle of the list.

Suppressing Items

Suppress any item in a list by typing an exclamation point (!) in front of the item. You can also put comments into a list anywhere on a line by placing an exclamation point before the comment.

DEFINING PATTERNS

FindPatterns+, Map+, MapSort, MapPlot, and Motifs all let you search with ambiguous expressions that match many different sequences. The expressions can include any legal GCG sequence character (see Appendix III). The expressions can also include several non-sequence characters, which are used to specify OR matching, NOT matching, begin and end constraints, and repeat counts. For instance, the expression TAATA(N){20,30}ATG means TAATA, followed by 20 to 30 of any base, followed by ATG. Following is an explanation of the syntax for pattern specification.

Implied Sets and Repeat Counts

Parentheses () enclose one or more symbols that can be repeated some number of times. Braces {} enclose numbers that tell how many times the symbols within the preceding parentheses must be found.

Sometimes, you can leave out part of an expression. If braces appear without preceding parentheses, the numbers in the braces define the number of repeats for the immediately preceding symbol. One or both of the numbers within the braces may be missing. For instance, both the pattern GATG{2,}A and the pattern GATG{2}A mean GAT, followed by G repeated from 2 to 350,000 times, followed by A; the pattern GATG{}A means GAT, followed by G repeated from 0 to 350,000 times, followed by A; the pattern GAT(TG){,2}A means GAT, followed by TG repeated from 0 to 2 times, followed by A; the pattern GAT(TG){2,2}A means GAT, followed by TG repeated exactly 2 times, followed by A. (If the pattern in the parentheses is an OR expression (see below), it cannot be repeated more than 2,000 times.)
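
As a rough illustration of the repeat-count rules, the following Python sketch rewrites the braces into the quantifier syntax of Python's re module. The function names are hypothetical, and the upper limits of 350,000 and 2,000 are simply treated as unbounded. Note that {2} here means "two or more", unlike a regular expression, where it would mean "exactly two".

```python
import re


def translate_repeat(spec: str) -> str:
    """Convert the text inside the braces to a Python regex quantifier."""
    low, _, high = spec.partition(",")
    low = low.strip() or "0"   # a missing lower bound means 0
    high = high.strip()        # a missing upper bound means "no limit" here
    return "{" + low + "," + high + "}"


def repeats_to_regex(pattern: str) -> str:
    """Rewrite every {..} repeat count in a pattern."""
    return re.sub(r"\{([^}]*)\}", lambda m: translate_repeat(m.group(1)), pattern)


print(repeats_to_regex("GATG{2}A"))       # GATG{2,}A
print(repeats_to_regex("GATG{}A"))        # GATG{0,}A
print(repeats_to_regex("GAT(TG){,2}A"))   # GAT(TG){0,2}A
print(repeats_to_regex("GAT(TG){2,2}A"))  # GAT(TG){2,2}A
```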

OR Matching

If you are searching nucleic acids, the ambiguity symbols defined in Appendix III let you define any combination of G, A, T, or C. If you are searching proteins, you can specify any of several symbol choices by enclosing the different choices in parentheses and separating the choices with commas. For instance, RGF(Q,A)S means RGF followed by either Q or A followed by S. The length of each choice need not be the same, and there can be up to 31 different choices within each set of parentheses. The pattern GAT(TG,T,G){1,4}A means GAT followed by any combination of TG, T, or G from 1 to 4 times followed by A. The sequence GATTGGA matches this pattern. There can be several parentheses in a pattern, but parentheses cannot be nested.
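
OR sets map naturally onto regular-expression alternation. The sketch below is an illustration only: the function name is hypothetical, symbols are matched literally (no nucleotide ambiguity expansion), and the check for 31 choices simply mirrors the limit stated above.

```python
import re


def or_sets_to_regex(pattern: str) -> str:
    """Rewrite (X,Y,Z) choice sets as regex alternations."""
    def convert(match):
        choices = match.group(1).split(",")
        if len(choices) > 31:
            raise ValueError("more than 31 choices in one set of parentheses")
        return "(?:" + "|".join(re.escape(c) for c in choices) + ")"
    # [^()]* keeps the match inside one pair of parentheses (no nesting).
    return re.sub(r"\(([^()]*)\)", convert, pattern)


print(or_sets_to_regex("RGF(Q,A)S"))          # RGF(?:Q|A)S
print(or_sets_to_regex("GAT(TG,T,G){1,4}A"))  # GAT(?:TG|T|G){1,4}A
print(bool(re.fullmatch(or_sets_to_regex("GAT(TG,T,G){1,4}A"), "GATTGGA")))  # True
```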

NOT Matching

The pattern GC~CAT means GC, followed by any symbol except C, followed by AT. The pattern GC~(A,T)CC means GC, followed by any symbol except A or T, followed by CC.
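
The two NOT forms correspond to negated character classes in a regular expression. This Python sketch uses a hypothetical function name, assumes each excluded choice is a single symbol, and matches symbols literally without expanding ambiguity codes.

```python
import re


def not_to_regex(pattern: str) -> str:
    """Rewrite ~X and ~(X,Y) as negated character classes."""
    # ~(A,T) becomes [^AT]; each choice is assumed to be one symbol.
    pattern = re.sub(
        r"~\(([^()]*)\)",
        lambda m: "[^" + "".join(m.group(1).split(",")) + "]",
        pattern,
    )
    # ~C becomes [^C].
    return re.sub(r"~([A-Za-z])", r"[^\1]", pattern)


print(not_to_regex("GC~CAT"))      # GC[^C]AT
print(not_to_regex("GC~(A,T)CC"))  # GC[^AT]CC
```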

Begin and End Constraints

The pattern <GACCAT can only be found if it occurs at the beginning of the sequence range being searched. Likewise, the pattern GACCAT> would only be found if it occurs at the end of the sequence range.
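
The begin and end constraints apply to the range being searched, not to the whole sequence. A minimal Python sketch follows, with hypothetical function names and a 1-based, inclusive range chosen for illustration.

```python
import re


def anchors_to_regex(pattern: str) -> str:
    """Rewrite a leading < and a trailing > as regex anchors."""
    regex = pattern
    if regex.startswith("<"):
        regex = "^" + regex[1:]
    if regex.endswith(">"):
        regex = regex[:-1] + r"\Z"
    return regex


def find_in_range(pattern: str, sequence: str, begin: int, end: int) -> list:
    """Return 1-based find positions within sequence[begin..end]."""
    window = sequence[begin - 1:end]
    regex = anchors_to_regex(pattern)
    return [m.start() + begin for m in re.finditer(regex, window)]


seq = "GACCATTTGACCAT"
print(find_in_range("GACCAT", seq, 1, 14))   # [1, 9]
print(find_in_range("<GACCAT", seq, 1, 14))  # [1]
print(find_in_range("GACCAT>", seq, 1, 14))  # [9]
print(find_in_range("<GACCAT", seq, 2, 14))  # []
```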

CONSIDERATIONS

FindPatterns+ will not introduce gaps but it can tolerate mismatches when it is run with -mismatch. Mismatched finds are shown in the output in lowercase.
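
Searching with mismatches but without gaps amounts to comparing the pattern against every window of the same length. The Python sketch below is one way to picture it, not the program's algorithm. The function name is hypothetical, the pattern is a literal string, and lowercasing the whole find is one reading of the sentence above.

```python
def find_with_mismatches(pattern: str, sequence: str, max_mismatch: int) -> list:
    """Return (position, text, mismatches) for each ungapped find."""
    finds = []
    width = len(pattern)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(1 for p, s in zip(pattern, window) if p != s)
        if mismatches <= max_mismatch:
            # Assumption: a find with any mismatch is shown in lowercase.
            shown = window if mismatches == 0 else window.lower()
            finds.append((start + 1, shown, mismatches))
    return finds


print(find_with_mismatches("GAATTC", "GAATTCGACTTC", 1))
# [(1, 'GAATTC', 0), (7, 'gacttc', 1)]
```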

If you are entering patterns from the command line with the -patterns parameter, the comma is assumed to separate different patterns on the command line.

SPECIFYING SEQUENCES

There is information on specifying sets of sequences in Section 2, Using Sequence Files and Databases of the User's Guide.

LARGE DATA SETS

FindPatterns+ is one of the few programs in GCG that can take more than a few minutes to run. Large searches should probably be run in the batch queue. You can specify that this program run at a later time in the batch queue by using -batch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Section 3, Using Programs in the User's Guide. Very large comparisons may exceed the CPU limit set by some systems.

Patterns that start with complicated OR or NOT expressions take longer to search than simple expressions like GATTC.

PATTERN FILE

You can put any patterns you want to search for into a file like the one below. The pattern data files for FindPatterns+ are modeled on the enzyme data files for the mapping programs described in Appendix VII. The pattern names should not have more than eight letters. The offset field is ignored by FindPatterns+, but the field should have an integer number in it to make these files compatible with the files that are read by mapping programs.

The exact column used for each field does not matter, only the order of the fields in the line. You can give several patterns the same name, but put all of the entries for that name on adjacent lines of the file. Blank lines and lines that start with an exclamation point (!) are ignored.

If the overhang field is a period (.) instead of a number, only the top strand of a nucleic acid sequence is searched for the pattern. Any number implies that both strands are to be searched. The value of the overhang number has no significance to FindPatterns+. Here is an example of a pattern data file with BamHI, EcoRI, and promoter patterns:

!!PATTERNS 1.0

An example of a pattern data file for the program FINDPATTERNS+.

Name    Offset  Pattern             Overhang  Documentation  ..

BamHI        1  GGATCC                     0  !

EcoRI        1  GAATTC                     0  !

Promotor     1  TAATA(N){20,30}ATG         0  !
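
The field rules above can be sketched as a small Python reader. This is an illustration under stated assumptions, not the program's parser: the function name and dictionary keys are hypothetical, and the sketch assumes that everything up to and including the first line ending in two periods (..) is heading text.

```python
EXAMPLE = """\
!!PATTERNS 1.0

An example of a pattern data file for the program FINDPATTERNS+.

Name    Offset  Pattern             Overhang  Documentation  ..

BamHI        1  GGATCC                     0  !
EcoRI        1  GAATTC                     0  !
Promotor     1  TAATA(N){20,30}ATG         0  !
"""


def parse_pattern_file(text: str) -> list:
    """Return one dictionary per pattern entry, in file order."""
    lines = text.splitlines()
    # Assumption: the heading ends at the first line that ends with "..".
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            lines = lines[i + 1:]
            break
    entries = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("!"):
            continue  # blank lines and ! lines are ignored
        fields = stripped.split(None, 4)  # only field order matters
        if len(fields) < 4:
            continue  # not a complete entry
        name, offset, pattern, overhang = fields[:4]
        entries.append({
            "name": name,
            "offset": int(offset),               # ignored by FindPatterns+
            "pattern": pattern,
            "top_strand_only": overhang == ".",  # a period means top strand only
            "documentation": fields[4] if len(fields) > 4 else "",
        })
    return entries


for entry in parse_pattern_file(EXAMPLE):
    print(entry["name"], entry["pattern"], entry["top_strand_only"])
```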

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -check to view the summary below and to specify parameters before the program executes. In the syntax summary below, square brackets ([ and ]) enclose parameter values that are optional. For each program parameter, square brackets enclose the type of parameter value specified, the default parameter value, and any aliases (shortened forms of the parameter name). Programs with a plus in the name accept either the full parameter name or a specified alias. If "Type" is "Boolean", then the presence of the parameter on the command line indicates a true condition. A false condition must be stated as parameter=false.

Minimal Syntax: % findpatterns+ [-infile=]value -Default

Minimal Parameters (case-insensitive):

-infile [Type: InFile / Default: EMPTY / Aliases: infile1 in]

The name of the input file

Prompted Parameters (case-insensitive):

-patterns [Type: List / Default: EMPTY / Aliases: pat pattern]

Specifies patterns for which to search.

-data [Type: InFile / Default: EMPTY / Aliases: dat]

The name of the pattern input file.

-outfile [Type: OutFile / Default: '<sequence_name>.findpatterns+' / Aliases: out]

Names the output file.

Optional Parameters (case-insensitive):

-check [Type: Boolean / Default: 'false' / Aliases: che help]

Prints out this usage message.

-default [Type: Boolean / Default: 'false' / Aliases: d def]

Specifies that sensible default values be used for all parameters where possible.

-documentation [Type: Boolean / Default: 'true' / Aliases: doc]

Prints banner at program startup.

-quiet [Type: Boolean / Default: 'false' / Aliases: qui]

Tells application to print only a minimal amount of information.

-doclines [Type: Integer / Default: EMPTY / Aliases: docl]

Specifies number of documentation lines to copy.

-batch [Type: Boolean / Default: 'false']

Submits the job to a batch queue.

-mismatch [Type: Integer / Default: '0' / Aliases: mis]

Allows the specified number of mismatches when searching for your subsequence.

-onestrand [Type: Boolean / Default: 'false' / Aliases: one]

Searches only the top strand of nucleotide sequences.

-sixbase [Type: Boolean / Default: 'false' / Aliases: six]

Searches for patterns with 6 or more residues in pattern.

-circular [Type: Boolean / Default: 'false' / Aliases: cir]

Searches all sequences as if they were circular.

-all [Type: Boolean / Default: 'false']

Does an "overlapping-set" search in nucleotide sequences.

-perfect [Type: Boolean / Default: 'false' / Aliases: perf]

Only looks for perfect matches against your pattern.

-append [Type: Boolean / Default: 'false' / Aliases: app]

Appends the pattern data file to the output file.

-show [Type: Boolean / Default: 'false' / Aliases: sho]

Shows every file searched even if there are no finds.

-terminal [Type: Boolean / Default: 'false' / Aliases: ter]

Writes output to the terminal screen instead of a file.

-monitor [Type: Boolean / Default: 'false' / Aliases: mon quiet]

Prints screen trace showing each file.

-minsitelen [Type: Integer / Default: '0' / Aliases: mins]

Ignores patterns with fewer than the specified number of residues. See also: -sixbase.

-maxresult [Type: Integer / Default: '1000' / Aliases: max maxresults]

Specifies maximum number of results to return for each strand searched.

-mincuts [Type: Integer / Default: '0' / Aliases: minc]

Limits finds to patterns found at least the specified number of times.

-maxcuts [Type: Integer / Default: '2147483647' / Aliases: maxc]

Limits finds to patterns found at most the specified number of times.

-once [Type: Boolean / Default: 'false' / Aliases: onc]

Limits finds to patterns found a maximum of 1 time.

-exclude [Type: String / Default: EMPTY / Aliases: exc]

Excludes patterns found between specified positions. Value is a comma-separated list of index positions, e.g. n1,n2.

-listfile [Type: String / Default: EMPTY / Aliases: lis names name]

Writes names of matching sequences as output to specified list file.

-seqout [Type: String / Default: EMPTY / Aliases: rsf]

Writes matching sequences as output to specified sequence file.

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

FindPatterns+ can read the patterns you want to find from the file pattern.dat in your working directory. If you don't have a file called pattern.dat in your directory, FindPatterns+ asks you to type in the patterns you want to find. If you want to use a pattern data file with a name other than pattern.dat, include -DATa=filename on the command line.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. Shortened forms of the parameter name, aliases, are shown, separated by commas.

-patterns=gaattc,rggay, -pat, -pattern

Specifies the patterns to be found.

-data, -dat

The name of the pattern input file.

-outfile, -out

Names the output file.

-infile, -infile1, -in

The name of the input file.

-mismatch=1, -mis

Causes the program to recognize sites that are like the recognition site but with one (or more) mismatches. If you allow too many mismatches, you may get ridiculous results. The output from most mapping programs distinguishes between real sites and sites with one or more mismatches.

-check, -che, -help

Prints out this usage message.

-default, -d, -def

Specifies that sensible default values be used for all parameters where possible.

-documentation, -doc

Prints banner at program startup.

-quiet, -qui

This parameter is not supported.

-doclines, -docl

Specifies number of documentation lines to copy.

-sixbase, -six

Searches for patterns with 6 or more residues in pattern.

-maxresult, -max, -maxresults

Specifies maximum number of results to return for each strand searched.

-listfile, -lis

Writes names of matching sequences as output to specified list file.

-seqout, -rsf

Annotated sequence output.

-onestrand, -one

Searches only the top strand of nucleotide sequences.

-circular, -cir

Searches past the end of the sequence into the beginning of the sequence as if the molecule were continuous. Patterns that span the origin can only be found if the search is -circular.
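
One simple way to picture a circular search is to extend the sequence with its own first symbols, so that a pattern spanning the origin becomes contiguous. The Python sketch below assumes a literal, non-empty pattern. The function name is hypothetical, and this is not necessarily how the program implements -circular.

```python
def find_circular(pattern: str, sequence: str) -> list:
    """Return 1-based start positions, including finds that span the origin."""
    # Append just enough of the beginning to cover a pattern crossing the end.
    extended = sequence + sequence[:len(pattern) - 1]
    positions = []
    start = extended.find(pattern)
    while start != -1:
        positions.append(start + 1)
        start = extended.find(pattern, start + 1)
    return positions


print("TTCAAAGAA".find("GAATTC"))          # -1: no find in a linear search
print(find_circular("GAATTC", "TTCAAAGAA"))  # [7]: the find wraps past the end
```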

-all

Makes an overlap set map instead of the usual subset map. If your sequence is very ambiguous (as for instance a back-translated sequence would be) and you want to see where restriction sites could be, then you should create an overlap-set map. Overlap-set and subset pattern recognition are discussed in more detail in the Program Manual entry for Window.

-perfect, -perf

Sets the program to look for a perfect alphabetic match between the site and the sequence. Ambiguity codes are normally expanded so that the site RXY would find sequences like ACT or GAC. With this parameter the ambiguity codes are not expanded so the site RXY would only match the sequence RXY. This parameter is not the same as -mismatch=0.
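
The difference between expanded and perfect matching can be sketched in Python using the standard IUPAC nucleotide codes. The function name is hypothetical, only the pattern (not the sequence) is expanded, and the symbol X from the example above is not in this table.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def site_to_regex(site: str, perfect: bool = False) -> str:
    """Build a regex for a site, expanding ambiguity codes unless perfect."""
    if perfect:
        return re.escape(site.upper())  # alphabetic match only
    return "".join("[" + IUPAC[symbol] + "]" for symbol in site.upper())


print(bool(re.fullmatch(site_to_regex("RGGAY"), "AGGAC")))                # True
print(bool(re.fullmatch(site_to_regex("RGGAY", perfect=True), "AGGAC")))  # False
print(bool(re.fullmatch(site_to_regex("RGGAY", perfect=True), "RGGAY")))  # True
```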

-append, -app

Appends the input enzyme data file to your output file.

-show, -sho

Normally, FindPatterns+ shows that a file was searched only if there were one or more finds in the sequence. With -show, FindPatterns+ shows every file searched whether or not a pattern was actually found in it. (-show is equivalent to setting -mincuts=0.)

-terminal, -ter

Writes output on the terminal screen and suppresses the output file query. If you use FindPatterns+ often in this mode, you should assign a logical symbol that runs FindPatterns+ with terminal output as the default. Answering the output file query with term has the same effect on FindPatterns+.

-monitor, -mon

Program monitors its progress on your screen by displaying a screen trace of progress. However, when you use -default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

The parameters below use the terminology of the mapping programs, in which a find is referred to as a cut and a pattern is referred to as a restriction enzyme recognition site.

-minsitelen=6, -mins

Selects only patterns with the specified number or more bases in the recognition site. You can display the sites from any pattern in the enzyme or pattern file that you take the trouble to name individually, but when you use all of the patterns, the program uses all of the patterns whose recognition sites have the specified number or more non-N, non-X bases. The -mincuts, -maxcuts, -once, and -exclude parameters suppress the display of patterns by the number of times the patterns are found in a sequence (abbreviated as cuts).

-mincuts=2, -minc

Excludes patterns that are not found at least two times.

-maxcuts=2, -maxc

Excludes patterns found more than two times.

-once, -onc

Excludes patterns found in your sequence more than once (equivalent to setting both mincuts and maxcuts to one).

-exclude=n1,n2[n3,n4,...], -exc

Excludes patterns found anywhere within one or more ranges of the sequence. If a pattern is found within an excluded range, then the pattern is not displayed. The ranges are defined with sets of two numbers. The numbers are separated by commas. Spaces between numbers are not allowed. The numbers must be integers that fall within the sequence beginning and ending points you have chosen. The range may be circular if the sequence being analyzed is circular. Exclusion is not done if there are any non-numeric characters in the numbers or numbers out of range or if there is an odd number of integers following the parameter.
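
The validation rules for -exclude can be sketched in Python as follows. The function names are hypothetical, an empty list stands for "exclusion is not done", and circular ranges (where the first number is larger than the second) are left out of this sketch.

```python
def parse_exclude(value: str, begin: int, end: int) -> list:
    """Return [(n1, n2), ...], or [] when the value is invalid."""
    parts = value.split(",")
    # An odd count, spaces, or any non-numeric character disables exclusion.
    if len(parts) % 2 != 0 or not all(p.isdigit() for p in parts):
        return []
    numbers = [int(p) for p in parts]
    if any(n < begin or n > end for n in numbers):
        return []  # a number lies outside the chosen sequence range
    return list(zip(numbers[0::2], numbers[1::2]))


def pattern_is_suppressed(find_positions: list, ranges: list) -> bool:
    """A pattern is not displayed if any of its finds lies in an excluded range."""
    return any(lo <= pos <= hi for pos in find_positions for lo, hi in ranges)


ranges = parse_exclude("10,20,50,60", 1, 100)
print(ranges)                                   # [(10, 20), (50, 60)]
print(pattern_is_suppressed([5, 55], ranges))   # True
print(parse_exclude("10,20,50", 1, 100))        # []
```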

-batch, -bat

Submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

-seqout=findpatterns.rsf, -rsf

Writes an RSF (rich sequence format) file containing the input sequences annotated with features generated from the results of FindPatterns+. This RSF file is suitable for input to other GCG programs that support RSF files. In particular, you can use SeqLab to view this features annotation graphically. If you don't specify a file name with this parameter, then the program creates one using FindPatterns+ for the file basename and .rsf for the extension. For more information on RSF files, see "Using Rich Sequence Format (RSF) Files" in Section 2 of the User's Guide. Or, see "Rich Sequence Format (RSF) Files" in Appendix C of the SeqLab Guide.

Printed: June 1, 2005 19:02

Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.