HTHSCAN+

 

Table of Contents

FUNCTION

DESCRIPTION

EXAMPLE

OUTPUT

INPUT FILES

RELATED PROGRAMS

CONSIDERATIONS

ALGORITHM

COMMAND-LINE SUMMARY

LOCAL DATA FILES

PARAMETER REFERENCE


FUNCTION

HTHScan+ scans protein sequences for the presence of helix-turn-helix motifs, indicative of sequence-specific DNA-binding structures often associated with gene regulation.

DESCRIPTION

Advantages of Plus “+” Programs:

* Plus programs can read sequences in a variety of native formats, such as GCG RSF, GCG SSF, GCG MSF, GenBank, EMBL, FastA, SwissProt, PIR, and BSML, without conversion.

* Plus programs remove the sequence length restriction of 350,000 bp.

If you do not need these features and want more interactivity, consider running the original version of the program.

HTHScan+ predicts helix-turn-helix (H-T-H) motifs in protein sequences. For each sequence, HTHScan+ prints a list of possible H-T-H motifs sorted in descending order according to score. Associated with each score is the probability of achieving that score in the target sequence by chance using the given family-specific weight matrix. HTHScan+ has weight matrices for the araC and lysR families of H-T-H motifs and one for homeobox domains.

EXAMPLE

Here is a session with HTHScan+ that was used to find H-T-Hs in the arabinose operon regulatory protein araC sequence from E. coli:

11:41~139> hthscan+
 
HTHScan+ scans protein sequences for the presence of helix-turn-helix motifs, indicative of sequence-specific DNA-binding structures often associated with gene regulation.
 
 
hthscan+ of what sequence(s) ? pir:rgeca
Begin (* 1 *) ?
End (-1 for entire sequence) (* -1 *) ?
Only display H-T-Hs whose score exceeds (* 4.0 *) ?
What should I call the output file (* <sequence_name>.hthscan+ *) ?
Search using weight matrix for which H-T-H family ("arac", "lysr", or "homeobox" (* arac *) ?
 
 
HTHScan of pir:rgeca  December 08, 2004 11:44
 
  Weight matrix: SHARE_MATRIX:htharac.dat
  Minimum score for H-T-Hs (threshold): 4.0
 
 
Input sequences processed                 : 1
Number of sequences with predicted H-T-Hs : 1
 
Results written to RGECA.hthscan+

OUTPUT

Here is the output file:

 

sequence: pir1:RGECA
name: RGECA  check: 4061  from: 1  to: 292
 
   1. 197 IASVAQHVCLSPSRLSHLFR 216
      Score: 39.8
      Probability: 4.031E-12
 
 
*** SUMMARY ***
 
Input sequences processed                 : 1
Number of sequences with predicted H-T-Hs : 1

 

 

The N-terminus->C-terminus direction of the predicted H-T-H is from left to right. The position of the first residue in the H-T-H is shown to the left. The position of the last residue in the H-T-H is shown to the right.

Below the H-T-H display is the score computed for the predicted H-T-H and the probability of random occurrence of that score or better given a sequence whose residue distribution is uniform and whose positions are independent of one another.
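If you want to post-process the results, the output file can be read with a short script. The following is a minimal sketch in Python, assuming the output file follows the layout of the example above (a ranked hit line, then a Score line and a Probability line). The names HthHit and parse_hits are hypothetical and are not part of HTHScan+.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HthHit:
    """One predicted H-T-H motif (hypothetical container, not a GCG type)."""
    rank: int
    start: int          # position of the first residue in the H-T-H
    motif: str          # residues, N-terminus to C-terminus
    end: int            # position of the last residue in the H-T-H
    score: Optional[float] = None
    probability: Optional[float] = None


# Assumed line layouts, taken from the example output file.
HIT_RE = re.compile(r"^\s*(\d+)\.\s+(\d+)\s+([A-Za-z]+)\s+(\d+)\s*$")
SCORE_RE = re.compile(r"^\s*Score:\s*(\S+)")
PROB_RE = re.compile(r"^\s*Probability:\s*(\S+)")


def parse_hits(text: str) -> List[HthHit]:
    """Collect hits in file order; Score and Probability attach to the latest hit."""
    hits: List[HthHit] = []
    for line in text.splitlines():
        m = HIT_RE.match(line)
        if m:
            hits.append(HthHit(int(m[1]), int(m[2]), m[3], int(m[4])))
            continue
        m = SCORE_RE.match(line)
        if m and hits:
            hits[-1].score = float(m[1])
            continue
        m = PROB_RE.match(line)
        if m and hits:
            hits[-1].probability = float(m[1])
    return hits


SAMPLE = """sequence: pir1:RGECA
name: RGECA  check: 4061  from: 1  to: 292

   1. 197 IASVAQHVCLSPSRLSHLFR 216
      Score: 39.8
      Probability: 4.031E-12

*** SUMMARY ***
"""

for hit in parse_hits(SAMPLE):
    print(hit.rank, hit.start, hit.end, hit.motif, hit.score, hit.probability)
```

The sketch ignores the sequence header and summary lines; a file with several sequences would need the "sequence:" lines tracked as well.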

INPUT FILES

The input to HTHScan+ is one or more protein sequences. If HTHScan+ rejects your protein sequence, turn to Appendix VI to see how to change or set the type of a sequence. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenBank:*.

RELATED PROGRAMS

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.

FindPatterns+ and FindPatterns identify sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal.

SPScan+ and SPScan scan protein sequences for the presence of secretory signal peptides (SPs).

CoilScan+ and CoilScan locate coiled-coil segments in protein sequences.

HTHScan scans protein sequences for the presence of helix-turn-helix motifs, indicative of sequence-specific DNA-binding structures often associated with gene regulation.

 

 

CONSIDERATIONS

Because of the way HTHScan+ sorts and stores predicted H-T-H motifs during scanning, no particular ordering is guaranteed among H-T-H motifs that have exactly the same score.

Ambiguity codes (such as B or Z) in protein sequences contribute exactly 0 to the score of the sequence window within which they are found. Therefore, the scores and probabilities associated with any predicted motifs from such a sequence window are likely to differ to varying extents from what they would be otherwise. You shouldn't routinely encounter this problem because ambiguity codes are extremely rare in protein sequences.

ALGORITHM

HTHScan+ uses a log-odds position-weight matrix ("weight matrix") to detect the presence of H-T-H motifs in protein sequences. The weight matrix encodes the H-T-H motif as a set of weights representing the likelihood of each amino acid residue to appear in each position of the motif. The score reported by HTHScan+ for each prediction is a measure of the local goodness of fit between the target sequence and the H-T-H signal represented by the weight matrix. This score is the sum of the weights corresponding to the amino acid residues found in the target sequence at each weight matrix position.
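The scoring step described above can be illustrated with a short Python sketch. It is not the HTHScan+ implementation: the three-position matrix, its weights, and the names make_column, TOY_MATRIX, and scan are hypothetical, chosen only to show how a window score is formed as a sum of per-position weights. Residues missing from a column (for example ambiguity codes such as B or Z) contribute 0, as noted under CONSIDERATIONS.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_column(favored, high=2.0, low=-1.0):
    """Toy weight-matrix column: one favored residue, all others penalized."""
    return {res: (high if res == favored else low) for res in AMINO_ACIDS}


# Hypothetical 3-position matrix; the real H-T-H matrices are longer.
TOY_MATRIX = [make_column("L"), make_column("S"), make_column("P")]


def scan(sequence, matrix, threshold=4.0):
    """Slide the matrix along the sequence and keep windows scoring above threshold.

    Returns (score, first_position, last_position, window) tuples,
    positions 1-based, sorted by descending score.
    """
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the weight of each residue at its matrix position;
        # residues absent from a column (e.g. B, Z) add 0.
        score = sum(col.get(res, 0.0) for col, res in zip(matrix, window))
        if score > threshold:
            hits.append((score, start + 1, start + width, window))
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits


print(scan("MKALSPQ", TOY_MATRIX))
```

In this toy run only the window LSP at positions 4 to 6 scores above the default threshold of 4.0; every other window contains no favored residue and scores below zero.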

The statistical significance of each score is computed as the probability of random occurrence of that score or better in a sequence with the same amino acid residue distribution as the target sequence and whose positions are all independent of each other (Claverie, J.-M. and Audic, S. CABIOS 12(5); 431-439 (1996)).
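One generic way to obtain such a probability is to build the full distribution of window scores by convolving the matrix columns under a background residue distribution, then summing the tail. The Python sketch below shows that idea only; it is not claimed to reproduce the method of the cited paper or the HTHScan+ code. It gives the probability for a single random window, and whether HTHScan+ applies any further correction is not stated here. The two-letter alphabet, the matrix, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def background(sequence):
    """Residue frequencies of the target sequence (the default, non -even case)."""
    counts = Counter(sequence)
    total = len(sequence)
    return {res: n / total for res, n in counts.items()}


def score_distribution(matrix, freqs, precision=3):
    """Distribution of window scores when positions are independent.

    matrix: list of dicts (residue -> weight), one per motif position.
    freqs:  dict (residue -> background probability).
    Scores are rounded to `precision` decimals so equal sums share a key.
    """
    dist = {0.0: 1.0}
    for column in matrix:
        nxt = defaultdict(float)
        for partial, p in dist.items():
            for res, f in freqs.items():
                s = round(partial + column.get(res, 0.0), precision)
                nxt[s] += p * f
        dist = dict(nxt)
    return dist


def tail_probability(dist, score):
    """Probability of observing `score` or better in one random window."""
    return sum(p for s, p in dist.items() if s >= score)


# Hypothetical two-position matrix over a two-letter alphabet.
TOY = [{"A": 1.0, "C": -1.0}, {"A": 1.0, "C": -1.0}]

uniform = {"A": 0.5, "C": 0.5}        # even residue distribution
observed = background("AAAC")         # distribution taken from a target sequence

print(tail_probability(score_distribution(TOY, uniform), 2.0))
print(tail_probability(score_distribution(TOY, observed), 2.0))
```

The two calls differ only in the background: the first assumes an even residue distribution, the second uses the composition of the target sequence, so the same score receives a different probability.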

The weight matrices used by HTHScan+ were prepared using sequence sets taken from Pfam Release 2.0 (Sonnhammer, E.L. et al. Proteins 28; 405-420 (1997)). The Pfam families used were HTH 1 (bacterial regulatory helix-turn-helix proteins, lysR family), HTH 2 (bacterial regulatory helix-turn-helix proteins, araC family), and homeobox (homeobox domain). The log-odds weight matrices were constructed from these sequences with MEME version 2.1 (Bailey, T.L. and Elkan, C. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 28-36 (1994)).

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -check to view the summary below and to specify parameters before the program executes. In the syntax summary below, square brackets ([ and ]) enclose parameter values that are optional. For each program parameter, square brackets enclose the type of parameter value specified, the default parameter value, and any aliases (shortened forms of the parameter name). Programs with a plus in the name accept either the full parameter name or a specified alias. If “Type” is “Boolean”, the presence of the parameter on the command line indicates a true condition. To state a false condition, use parameter=false.

HTHScan+ scans protein sequences for the presence of helix-turn-helix motifs, indicative of sequence-specific DNA-binding structures often associated with gene regulation.

 

Minimal Syntax: % hthscan+ [-infile=]value -Default
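For scripted runs it can help to assemble the command line programmatically. The Python sketch below only builds the argument list from parameters documented on this page; it does not run the program. The helper name build_hthscan_command is hypothetical, and the -name=value spelling for -threshold and -outfile is assumed by analogy with forms such as -family=arac and -numtopscores=3 shown in the parameter reference.

```python
FAMILIES = ("arac", "lysr", "homeobox")


def build_hthscan_command(infile, family="arac", threshold=4.0,
                          outfile=None, numtopscores=None):
    """Return an argument list for a non-interactive hthscan+ run.

    Only parameters listed in the command-line summary are used.
    """
    if family not in FAMILIES:
        raise ValueError(f"family must be one of {FAMILIES}, got {family!r}")
    args = [
        "hthscan+",
        f"-infile={infile}",
        f"-family={family}",
        f"-threshold={threshold}",
    ]
    if outfile is not None:
        args.append(f"-outfile={outfile}")
    if numtopscores is not None:
        args.append(f"-numtopscores={numtopscores}")
    # -default accepts default values for everything not given explicitly.
    args.append("-default")
    return args


print(" ".join(build_hthscan_command("pir:rgeca", numtopscores=3)))
```

The resulting list can be passed to a process launcher of your choice; building a list rather than a single string avoids shell quoting problems with sequence specifications that contain wildcards.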

 

 

Minimal Parameters (case-insensitive):

 

-infile         [Type: InFile / Default: EMPTY / Aliases: infile1 in]

                The name of the input file.

 

Prompted Parameters (case-insensitive):

 

-begin          [Type: Integer / Default: '1' / Aliases: beg]

                First base of interest for each query sequence.

 

-end            [Type: Integer / Default: '-1']

                Last base of interest for each query sequence.

 

-threshold      [Type: Double / Default: '4.0' / Aliases: thresh]

                Sets minimum score for H-T-H detection.

 

-outfile        [Type: OutFile / Default: '<sequence_name>.hthscan+' / Aliases: out outfile1]

                Names the output file.

 

-family         [Type: String / Default: 'arac' / Aliases: fam]

                Specifies weight matrix by H-T-H family ("arac", "lysr", or "homeobox").

 

Optional Parameters (case-insensitive):

 

-check          [Type: Boolean / Default: 'false' / Aliases: che help]

                Prints out this usage message.

 

-default        [Type: Boolean / Default: 'false' / Aliases: d def]

                Specifies that sensible default values be used for all parameters where possible.

 

-documentation  [Type: Boolean / Default: 'true' / Aliases: doc]

                Prints banner at program startup.

 

-quiet          [Type: Boolean / Default: 'false' / Aliases: qui]

                Tells application to print only a minimal amount of information.

 

-data           [Type: String / Default: EMPTY / Aliases: dat]

                Assigns weight matrix.

 

-seqout         [Type: OutFile / Default: EMPTY / Aliases: rsf]

                Annotated sequence output.

 

-numtopscores   [Type: Integer / Default: '-1' / Aliases: numtop maxhits]

                Specifies maximum number of H-T-Hs to report.

 

-even           [Type: Boolean / Default: 'false']

                Assumes even target residue distribution.

 

-probabilities  [Type: Boolean / Default: 'true' / Aliases: prob]

                Compute score probabilities.

 

-verbose        [Type: Boolean / Default: 'false' / Aliases: ver]

                Print more documentation about each sequence to the output file.

 

-monitor        [Type: Boolean / Default: 'false' / Aliases: mon]

                Displays screen trace of progress.

 

-summary        [Type: Boolean / Default: 'true' / Aliases: sum]

                Displays screen summary at end of the program.

 

 

 
 
 

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -data=myfile.dat. For more information see Section 4, Using Data Files in the User's Guide.

If you choose to search for the araC family of H-T-H motifs (the default), HTHScan+ will use the weight matrix file share_misc:htharac.dat. If you choose to search for the lysR family of H-T-H motifs, HTHScan+ will use the weight matrix file share_misc:hthlysr.dat. If you choose to search for the homeobox family of H-T-H motifs, HTHScan+ will use the weight matrix file share_misc:hthhomeobox.dat.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. Shortened forms of the parameter name, aliases, are shown, separated by commas.

-infile, -infile1, -in

 

The name of the input file.

 

-begin, -beg

 

First base of interest for each query sequence.

 

-end

 

Last base of interest for each query sequence.

 

-threshold, -thresh

 

Sets minimum score for H-T-H detection.

 

-outfile, -out, -outfile1

 

Names the output file.

 

-family=arac, -fam

Allows you to specify the weight matrix used by choosing the H-T-H motif family by name. You may specify arac for the araC family of bacterial regulatory proteins (represented by the weight matrix file htharac.dat), lysr for the lysR family of bacterial regulatory proteins (represented by the weight matrix file hthlysr.dat), or homeobox for the homeobox domain, (represented by the weight matrix file hthhomeobox.dat).

-check, -che, -help

 

Prints out this usage message.

 

-default, -def

 

Specifies that sensible default values be used for all parameters where possible.

 

-documentation, -doc

 

Prints banner at program startup.

 

-quiet, -qui

 

This parameter is not supported.

 

-data, -dat

 

Assigns weight matrix.

 

-seqout

 

Annotated sequence output.

 

-probabilities, -prob

 

Compute score probabilities.

 

-numtopscores=3, -numtop

Specifies the maximum number of predicted H-T-H motifs to report for each sequence scanned. For example, if you use -numtopscores=3, HTHScan+ will display no more than three of the highest scoring H-T-Hs predicted for each sequence. Use -numtopscores=1 if you want to see only the highest scoring H-T-H in each sequence. By default, HTHScan+ will display all H-T-Hs predicted for each sequence.
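The effect of this parameter can be pictured as keeping the N best-scoring hits per sequence. The Python sketch below illustrates that selection only; the function name top_hits, the tuple layout, and the scores are hypothetical, and the value -1 is treated as "report everything" to match the default shown in the command-line summary.

```python
import heapq


def top_hits(hits, numtopscores=-1):
    """Return hits sorted by descending score, limited to numtopscores.

    hits: list of (score, label) tuples.
    A negative numtopscores means no limit, mirroring the default of -1.
    """
    if numtopscores < 0:
        return sorted(hits, key=lambda h: h[0], reverse=True)
    return heapq.nlargest(numtopscores, hits, key=lambda h: h[0])


# Made-up scores for illustration only.
example = [(5.2, "a"), (39.8, "b"), (7.1, "c"), (4.4, "d")]

print(top_hits(example, 3))
print(top_hits(example, 1))
print(top_hits(example))
```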

-even, -eve

Tells HTHScan+ to assume that amino acid residues are distributed evenly throughout the length of the target sequence for the purpose of calculating score probabilities. This makes HTHScan+ perform a little faster, because it does not have to compute the actual distribution of residues in each input sequence. However, reliability of the score probability calculations may be adversely affected.

-verbose, -ver

Tells HTHScan+ to print more documentation about each sequence to the output file. The number of lines of documentation printed depends upon the value of the % DocLines global switch described in "Using Global Switches" in Section 3, Using Programs in the User's Guide.

-rsf=hthscan+.rsf

Writes an RSF (rich sequence format) file containing the input sequences annotated with features generated from the results of HTHScan+. This RSF file is suitable for input to other Accelrys GCG (GCG) programs that support RSF files. In particular, you can use SeqLab to view this features annotation graphically. If you don't specify a file name with this parameter, then the program creates one using hthscan+ for the file basename and .rsf for the extension. For more information on RSF files, see "Using Rich Sequence Format (RSF) Files" in Section 2 of the User's Guide. Or, see "Rich Sequence Format (RSF) Files" in Appendix C of the SeqLab Guide.

-monitor=100, -mon

The program monitors its progress by displaying a trace on your screen. However, when you use -default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

The monitor is updated every time the program processes 100 sequences or files. You can use a value after the parameter to set this monitoring interval to some other number.

-summary, -sum

Writes a summary of the program's completion to the screen. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -summary=false.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: May 27, 2005  12:49



Technical Support: support-us@accelrys.com, support-japan@accelrys.com,
or support-eu@accelrys.com

Copyright (c) 1982-2005 Accelrys Inc. All rights reserved.

Licenses and Trademarks: Discovery Studio ®, SeqLab ®, SeqWeb ®, SeqMerge ®, GCG ®, and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

www.accelrys.com/bio