<HTML>
<TITLE>Block Searcher Help</TITLE>
<A NAME="top"></A>
<H1><IMG ALIGN=MIDDLE SRC="/blocks/icons/blocks.xbm">
Block Searcher Help</H1>
<UL>
<LI><A HREF="#INTRO">Introduction</A>
<LI><A HREF="#FORM">WWW form</A>
<LI><A HREF="#EMAIL">Submitting multiple searches</A>
<LI><A HREF="#RESULTS">Interpreting results of a search</A>
<LI><A HREF="#GET">Getting documentation for blocks</A>
<LI><A HREF="#REFS">References</A>
</UL>

<A NAME="INTRO"><H3>Introduction</H3></A>
<BR>
As an aid to detection and verification of protein sequence homology, the 
BLOCKS Searcher compares a protein or DNA sequence to the
<A HREF="/blocks/help/blocks_release.html">current database</A>
of protein blocks. Blocks are short multiply aligned ungapped 
segments corresponding to the most highly conserved regions of proteins. 
<P>
The rationale behind searching a database of blocks is that information from
multiply aligned sequences is present in a concentrated form, reducing
background and increasing sensitivity to distant relationships.  This
information is represented in a position-specific scoring table or "profile"
(<A HREF="#4">4</A>),
in which each column of the alignment is converted to a column of a table
representing the frequency of occurrence of each of the 20 amino acids.  For
searching a database of blocks, the first position of the sequence is aligned
with the first position of the first block, and a score for that amino acid is
obtained from the profile column corresponding to that position.  Scores are
summed over the width of the alignment, and then the block is aligned with the
next position.  This procedure is carried out exhaustively for all positions
of the sequence for all blocks in the database, and the best alignments
between a sequence and entries in the BLOCKS database are noted.  If a
particular block scores highly, it is possible that the sequence is related to
the group of sequences the block represents.  Typically, a group of proteins
has more than one region in common and their relationship is represented as a
series of blocks separated by unaligned regions.  If a second block for a
group also scores highly in the search, the evidence that the sequence is
related to the group is strengthened, and is further strengthened if a third
block also scores highly, and so on.
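<P>
The scanning procedure described above can be sketched in a few lines of
Python. This is only an illustration, not the Blimps implementation: the
function name best_block_alignment, the layout of the profile as one
dictionary per column, the made-up scores, and the score of 0 for a residue
missing from a column are all assumptions chosen for the example.
<PRE>
```python
def best_block_alignment(sequence, profile):
    """Slide a block profile along a sequence; return (best_score, offset).

    profile is a list of dicts, one per block column, mapping an amino
    acid letter to its score (hypothetical layout for this sketch).
    """
    width = len(profile)
    best_score, best_offset = None, None
    # Try every ungapped placement of the block on the sequence.
    for offset in range(len(sequence) - width + 1):
        score = sum(
            column.get(sequence[offset + i], 0)  # assumed 0 if residue unseen
            for i, column in enumerate(profile)
        )
        if best_score is None or score > best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset


# Toy two-column profile (made-up scores, for illustration only).
toy_profile = [{"A": 5, "G": 1}, {"C": 7, "S": 2}]
print(best_block_alignment("GSACG", toy_profile))
```
</PRE>
A real search repeats this for every block in the database and keeps the
best-scoring placements for later statistical evaluation.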

<BR><A HREF="#top">Return to top</A> <P>

<H3><A NAME="FORM"></A>
<A HREF="/blocks/blocks_search.html">WWW form</A></H3>
<UL>
<LI><H4><A NAME="DB">Select the database to search</A></H4>

<P>
<LI><H4><A NAME="EM">Results through email</A></H4>
If you enter an email address, your search will be queued and
your results will be emailed in text format. Your results
will also be available in HTML format via a URL that will remain
active for approximately 4 hours.

<P>
<LI><H4><A NAME="EX">Cutoff expected value</A></H4>
Select an expected value for the search; the default is 5 hits
expected by chance. The expected value is the combined p-value
(<A HREF="#18">18</A>) multiplied by the number of
alignments done between the query sequence and the blocks database
during the search. 

<P>
<LI><H4><A NAME="OU">Amount of output</A></H4>
<I>Summary with alignments:</I> Includes a one-line 
<A HREF="#SUM">summary</A> for each hit plus 
<A HREF="#ALIGN">details</A> for each block included in the hit.  <BR>
<I>Summary only:</I> A one-line
<A HREF="#SUM">summary</A> is printed for each hit.<BR>
<I>GFF format:</I> "Gene Finding Features" format proposed by the 
<A HREF="http://www.sanger.ac.uk/Software/GFF/GFF_Spec.shtml">Sanger Centre</A>.
Because of the limitations of this format, the combined E-value cannot
be displayed. Instead, each block included in a hit is displayed individually
with an E-value for that block alone. Protein queries are reported with
strand "+" and frame "0".
<BR>
<I>Old style results:</I> The original "block sort" post-processing
program is used, with statistics based on percentiles.<BR>
<I>Raw results only:</I> The raw
<A HREF="ftp://ftp.ncbi.nih.gov/repository/blocks/unix/blimps">Blimps</A> 
search results are returned
without any statistics or post-processing to combine blocks from the
same family.<BR>

<P>
<LI><H4><A NAME="TY">Query sequence type</A></H4>
<I>Determine automatically:</I> If the query sequence
contains any occurrence of 'E', 'F', 'I', 'L', 'P' or 'Q',
it is assumed to be protein. Otherwise, if it consists of
more than 85% 'A', 'C', 'G', 'T', 'U' and 'N', it is
assumed to be DNA; if not, it is assumed to be protein.
Problems can arise if the query sequence format is incorrectly
interpreted; for example, the title may be incorrectly combined
with the sequence itself causing a DNA sequence to be
interpreted as protein. Check that the sequence read in is
of the correct type and length before interpreting your 
<A HREF="#HEAD">results</A>.<BR>

<I>Amino acid:</I> Forces the sequence to be treated as protein;
it will not be translated before searching.<BR>

<I>DNA:</I> Forces the sequence to be treated as DNA and translated
before searching. Be sure that the sequence length reported 
is correct.

<P>
Optional search parameters for DNA queries:<BR>
<UL>
<LI><A NAME="ST">Strands to search</A><BR>
Both:<BR>
Forward only:<BR>
Reverse only:<BR>

<P>
<LI><A NAME="GE">Genetic code</A><BR>

</UL>
<LI><H4><A NAME="SQ">Enter query sequence</A></H4>
<P>
</UL>
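<P>
The rule used by the "Determine automatically" option for the query sequence
type can be written out as a short Python sketch. The function name
guess_query_type is hypothetical, and ignoring case and non-letter
characters is an assumption of this example; the Blocks Searcher itself may
handle these details differently.
<PRE>
```python
def guess_query_type(sequence):
    """Return "protein" or "DNA" following the rule described in the text."""
    # Assumption: compare case-insensitively and ignore non-letters.
    residues = [c for c in sequence.upper() if c.isalpha()]
    # These letters are not nucleotide codes, so one occurrence settles it.
    if any(c in "EFILPQ" for c in residues):
        return "protein"
    if not residues:
        return "protein"  # assumption: empty input falls through to protein
    nucleotide_like = sum(c in "ACGTUN" for c in residues)
    # "More than 85%" of the letters must be A, C, G, T, U or N.
    if nucleotide_like / len(residues) > 0.85:
        return "DNA"
    return "protein"


print(guess_query_type("ACGTACGTNN"))
print(guess_query_type("MKVLAAGE"))
```
</PRE>
A title line accidentally read as sequence would add letters such as 'E' or
'L' and flip the first test, which is why the reported type and length
should be checked.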
<BR><A HREF="#top">Return to top</A> <P>

<A NAME="EMAIL"><H3>Submitting multiple searches</H3></A>
If you have a large query sequence or several queries,
here is a <A HREF="bulk_search.txt">sample Perl script</A>
to do bulk searches.<P>

<P>
The Blocks Email Searcher currently runs on a Sun E250 workstation.
It uses an unsophisticated first-in/first-executed queuing scheme and
can complete one typical search of 350 amino
acids about every two minutes. A DNA search takes longer because the
sequence is translated in all six frames. So a 1000 residue DNA query
will take about the same amount of time as six average amino acid
queries, or about 12 minutes, and a contig of 10,000 residues will take
about two hours. Consequently, if you have more than five searches
to do, we ask that you space them at reasonable intervals depending
on the type and size of your sequences so other people can get searches
processed between yours. The Blocks Searcher is least busy on weekends
and between about 20:00 and 04:00 Pacific coast (USA) time. We
appreciate your considerate use of this service.
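<P>
The timing figures above can be reproduced with a rough back-of-the-envelope
model. The sketch below assumes that run time grows linearly with the number
of amino acid residues searched, which the text implies but does not state;
the function name estimated_minutes and the constants are introduced only
for this illustration.
<PRE>
```python
AA_PER_TYPICAL_SEARCH = 350     # residues in a "typical" query (from the text)
MINUTES_PER_TYPICAL_SEARCH = 2  # time quoted for one typical search


def estimated_minutes(length, is_dna=False):
    """Rough run time, assuming time grows linearly with residues searched."""
    # A DNA query is translated in six frames of about length/3 residues each.
    amino_acids = 6 * (length / 3) if is_dna else length
    return amino_acids / AA_PER_TYPICAL_SEARCH * MINUTES_PER_TYPICAL_SEARCH


for n in (1000, 10000):
    print(n, "bases:", round(estimated_minutes(n, is_dna=True)), "minutes")
```
</PRE>
Under these assumptions the estimates come to roughly 11 and 114 minutes,
in line with the "about 12 minutes" and "about two hours" quoted above.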
<BR><A HREF="#top">Return to top</A> <P>

<A NAME="RESULTS"><H3>Interpreting results of a search</H3></A>
<A HREF="blkprob-aa.html">Example using a protein query</A><BR>
<A HREF="blkprob-dna.html">Example using a DNA query</A><BR>
<P>
<A NAME="HEAD"><B>Heading</B></A><BR>
Query=Description line from query sequence<BR>
Size=Number of amino acids for protein query or base pairs for DNA query.
Be sure this number is correct before interpreting your results.<BR>
Blocks searched=Number of blocks searched with query.<BR>
Alignments done=Number of alignments done between query and blocks searched. 
This number is used to determine the expected value for each hit.<BR>
Cutoff expected value=Maximum combined E-value reported.
This is the number of matches expected to be found merely by chance.<BR>
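<P>
The relationship between the combined p-value, the number of alignments done
and the cutoff can be illustrated with a small Python sketch. The function
names and the numbers are made up for the example; only the multiplication
and the default cutoff of 5 come from this page.
<PRE>
```python
def expected_value(combined_p_value, alignments_done):
    """Expected number of chance hits: p-value times alignments done."""
    return combined_p_value * alignments_done


def passes_cutoff(combined_p_value, alignments_done, cutoff=5.0):
    """True if the hit's E-value is less than or equal to the cutoff."""
    return expected_value(combined_p_value, alignments_done) <= cutoff


# Made-up numbers, for illustration only.
print(expected_value(1e-6, 2_000_000))
print(passes_cutoff(1e-6, 2_000_000))
print(passes_cutoff(1e-5, 2_000_000))
```
</PRE>
Because the E-value scales with the number of alignments done, the same
p-value is less significant for a long query or a large database.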
<P>
<A NAME="SUM"><B>Summary</B></A><BR>
One line is printed per hit, where a hit consists of blocks belonging
to a protein family represented in the database of blocks searched with
combined E-value less than or equal to the cutoff.<BR>

<UL>
<LI>
Strand: 1 for protein or forward strand (frames 1,2,3) of DNA query,
and -1 for reverse strand (frames -1,-2,-3) of DNA query<BR>
<LI>
Blocks x of y: x blocks of a total of y for the family are included in the
hit. Blocks are only included if they appear in the correct order within
the query, and are spaced consistently with those observed in the sequences
from which the blocks were made. <BR>
<LI>
Anchor E-value: E-value of the "anchor" block for this hit, which is
the block with the lowest position p-value.<BR>
<LI>
Combined E-value: Combined E-value for all of the blocks in the hit
(<A HREF="#18">18</A>). For single block hits, this is about the same
as the Anchor E-value.<BR>
</UL>
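<P>
The order and spacing requirement for including blocks in a hit can be
sketched as follows. This is a simplified illustration, not the Blocks
Searcher's actual test: the function name consistent_with_family, the data
layout, and the strict minimum/maximum comparison are assumptions, and the
real program may allow some tolerance around the observed distances.
<PRE>
```python
def consistent_with_family(hits, spacing):
    """Check order and spacing of block hits within a query.

    hits:    list of (start, end) query positions, in family block order.
    spacing: list of (min_gap, max_gap) allowed between consecutive
             blocks, as observed in the family (hypothetical layout).
    """
    for (prev, nxt), (min_gap, max_gap) in zip(zip(hits, hits[1:]), spacing):
        gap = nxt[0] - prev[1] - 1  # residues between the two blocks
        # A negative gap means the blocks overlap or are out of order.
        if gap < 0 or not (min_gap <= gap <= max_gap):
            return False
    return True


# Made-up positions: two blocks separated by 10 residues, 5 to 20 allowed.
print(consistent_with_family([(10, 29), (40, 59)], [(5, 20)]))
```
</PRE>
Blocks that fail such a test are not discarded from the report; they appear
under "Other reported alignments" in the details.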

<P>
<A NAME="ALIGN"><B>Details</B></A><BR>
Detailed information is printed for each hit, including alignments with
the most similar sequence in each block.<BR>

<UL>
<LI>
Frame: 0 for protein query, or 1,2,3,-1,-2,-3 for DNA query.<BR>
<LI>
Location: Location of block within query sequence in amino acids for 
protein or base pairs from 5' to 3' on the forward strand for DNA query.
<LI>
Block E-value: Expected number of times the score for the alignment of this
block with the query sequence at this position could be seen by chance
given the search space (number of alignments done)
(<A HREF="#17">17</A>, <A HREF="#11">11</A>).<BR>
<LI>
Bias: If the block has biased composition as determined by
checking the correlation among positions using the
<A HREF="/blocks/biassed_blocks.html">Biased Block Checker</A>,
this fact is noted to the right of the position p-value. If this
is the case, a high-scoring alignment may simply be due to biased
composition of the query sequence.
<LI>
Up to n repeats expected:<BR>
If the family is documented as having repeated domains, this fact
is noted and the locations of possible repeats are printed.<BR>
<LI>
Other reported alignments:<BR>
Alignments of the query with blocks in the family that were not included
in the hit are printed for reference. Blocks are left out of the hit
because they were lower-scoring, because they were in the wrong order,
or because the distance between them and other blocks in the hit is
inconsistent with distances represented in the blocks database.
The distance criterion may exclude blocks from the hit erroneously,
especially for translated searching of sequences containing introns.<BR>
<LI>
Map: Following the list of blocks in the hit, a rough map compares
the blocks seen in family members with those seen in the query.<BR>
AAA represents a block roughly in proportion to its width.<BR>
  : represents the minimum distance between blocks in the database.<BR>
  . represents the maximum distance between blocks in the database.<BR>
&lt; &gt; indicate that the sequence has been truncated to fit the page.<BR>
The query map is aligned on the highest scoring block. Multiple block hits 
that are consistent with the highest scoring block are separated by colons.
Block hits that are not consistent are mapped below.<BR>

<LI>
Alignments: The query sequence is aligned with the family member from
each block in the hit which is most like it; a different family member may be
shown for each block.
The distance between detected blocks is listed as (min, max): for the
database entry followed by the distance in the query. Upper case in the query
indicates at least one occurrence of the residue in that column of the block.
<P>
Note: For searches using DNA queries, "Location" refers to the position
in the query in base pairs from 5' to 3' on the forward strand, 
whereas the map and 
alignments show the translated position in amino acid residues.
</UL>
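<P>
The frame numbers 1,2,3,-1,-2,-3 reported for DNA queries correspond to a
six-frame translation, which can be sketched in Python as below. The sketch
uses the standard genetic code only and assumes that frame -1 begins at the
last base of the query read on the reverse strand; the names six_frames and
translate are hypothetical, and the Blocks Searcher's own numbering
convention should be checked against its output.
<PRE>
```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code only; other codes would need a different table.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Map frame number (1, 2, 3, -1, -2, -3) to its translation."""
    dna = dna.upper().replace("U", "T")
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        # Assumed convention: frame -1 starts at the last base of the query.
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


for frame, protein in sorted(six_frames("ATGGCC").items()):
    print(frame, protein)
```
</PRE>
Each translated frame is then searched like a protein query, which is why
locations are reported in base pairs while maps and alignments use amino
acid positions.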
<P>
<BR><A HREF="#top">Return to top</A> <P>

<A NAME="GET"><H3>Getting documentation for blocks</H3></A>
Following up a potentially interesting hit is often aided by examining the
full set of blocks for a group.
Hits are linked to <A HREF="/blocks-bin/getblock.sh">Get blocks</A>.

<P>
<BR><A HREF="#top">Return to top</A> <P>

<A NAME="REFS"><H3>References</H3></A>
<PRE>
If you find the Blocks Searcher useful, please cite:

Henikoff S, Henikoff JG: Protein family classification based on
searching a database of blocks. Genomics 1994, 19:97-107.
[<A HREF="/blocks/papers/BLOCKSEARCH.ps">Postscript</A> <A HREF="/blocks/papers/BLOCKSEARCH.pdf">PDF</A>]

Other references for this work are:

<A NAME="1"></A>1. Henikoff S, Henikoff JG: Automated assembly of protein blocks for database
searching. Nucleic Acids Res. 1991, 19:6565-6572.
[<A HREF="/blocks/papers/BLOCKMAKER.ps">Postscript</A> <A HREF="/blocks/papers/BLOCKMAKER.pdf">PDF</A>]

<A NAME="2"></A>2. Bairoch A: PROSITE: A dictionary of sites and patterns in proteins. Nucleic
Acids Res. 1992, 20:2013-2018.
[<A HREF="http://www.expasy.ch/prosite/">Prosite page</A>]

<A NAME="3"></A>3. Bairoch A, Boeckmann B: The SWISS-PROT protein sequence data bank. Nucleic
Acids Res. 1992, 20:2019-2022.
[<A HREF="http://www.expasy.ch/sprot/">Swiss-Prot page</A>]

<A NAME="4"></A>4. Henikoff JG, Henikoff S: Using substitution probabilities
to improve position-specific scoring matrices. CABIOS 1996, 12:135-143.
[<A HREF="/blocks/papers/PSEUDO_COUNTS.ps">Postscript</A> <A HREF="/blocks/papers/PSEUDO_COUNTS.pdf">PDF</A>]

<A NAME="5"></A>5. Wallace JC, Henikoff S: PATMAT: a searching and extraction program for
sequence, pattern, and block queries and databases. CABIOS 1992, 8:249-254.

<A NAME="6"></A>6. Henikoff S: Detection of Caenorhabditis transposon homologs in diverse
organisms. New Biol. 1992, 4:382-388.

<A NAME="7"></A>7. Oliver SG et al.: The complete DNA sequence of yeast chromosome III. Nature
1992, 357:38-46.

<A NAME="8"></A>8. Bork P, Ouzounis C, Sander C, Scharf M, Schneider R, Sonnhammer E: What's
in a genome? Nature 1992, 358:287.

<A NAME="9"></A>9. Henikoff S, Henikoff JG: A protein family classification method for
analysis of large DNA sequences, Proc. 27th HICSS 1994, p. 265-274.
[<A HREF="/blocks/papers/BLOCKSEARCH_DNA.ps">Postscript</A> <A HREF="/blocks/papers/BLOCKSEARCH_DNA.pdf">PDF</A>]

<A NAME="10"></A>10. Henikoff S, Henikoff JG: Position-based sequence weights, J. Mol. Biol.
1994, 243:574-578.
[<A HREF="/blocks/papers/SEQUENCE_WEIGHTS.ps">Postscript</A> <A HREF="/blocks/papers/SEQUENCE_WEIGHTS.pdf">PDF</A>]

<A NAME="11"></A>11. Tatusov RL, Altschul SF, Koonin EV: Detection of conserved segments in
proteins: Iterative scanning of sequence databases with alignment blocks,
PNAS 1994, 91:12091-12095.

<A NAME="12"></A>12. Henikoff JG, Henikoff S: Using substitution probabilities to improve
position-specific scoring matrices, CABIOS 1996, 12:135-143.

<A NAME="13"></A>13. Henikoff S, Henikoff JG: Embedding strategies for effective use of
multiple sequence alignment information, 1996, submitted for publication.

<A NAME="14"></A>14. Thompson, JD, Higgins, DG and Gibson, TJ: CLUSTAL W: Improving the
sensitivity of progressive multiple sequence alignment through sequence
weighting, position-specific gap penalties and weight matrix choice,
NAR 1994, 22:4673-4680.
[<A HREF="ftp://ftp.ebi.ac.uk/pub/software">FTP CLUSTAL W</A>]

<A NAME="15"></A>15. Saitou, N and Nei, M: The neighbor-joining method: A new method for
reconstructing phylogenetic trees, Mol. Biol. Evol. 1987, 4:406-425.

<A NAME="16"></A>16. Felsenstein, J: , Cladistics 1989, 5:164-166.
[<A HREF="http://evolution.genetics.washington.edu/phylip.html">Phylip page</A>]

<A NAME="17"></A>17. McLachlan, A.: ,J. Mol. Biol. 1983, 169:15-30.

<A NAME="18"></A>18. Bailey, T.L. and Gribskov, M.: Combining evidence
using p-values: application to sequence homology searches, Bioinformatics
1998, 14:48-54.
[<A HREF="http://www.sdsc.edu/~tbailey/papers/qfast.ps">Postscript</A>]

</PRE>
<BR><A HREF="#top">Return to top</A> <P>