<TITLE>LAMA help</TITLE>

<H1><A HREF="/blocks/help/LAMA/LAMA_YWK.html">
<IMG ALIGN=MIDDLE SRC="/blocks/help/LAMA/llama.2.gif" HEIGHT=145 WIDTH=95></A>
<A HREF="/blocks-bin/LAMA_search.sh">LAMA</A> help</H1>

<UL>
<LI><A HREF="#LAMA">What does LAMA do?</A>
<LI><A HREF="#LAMA_FOR_ME">What can LAMA do for <B>me</B>?</A>
<LI><A HREF="#LAMA_HOW">How does LAMA align blocks?</A>
<LI><A HREF="#LAMA_INPUT">Input for LAMA</A>
	<UL>
	<LI><A HREF="#LAMA_INPUT_CONTENT">Content of input</A>
	<LI><A HREF="#LAMA_INPUT_FORMAT">Format of input</A>
        <LI><A HREF="#LAMA_OUTPUT_OPTIONS">Output options</A>
	</UL>
<LI><A HREF="#LAMA_OUTPUT">Output from LAMA</A>
	<UL>
	<LI><A HREF="#LAMA_OUTPUT_FORMAT">Format and content of output</A>
	<LI><A HREF="#EVALUATING_SCORES">Evaluating LAMA alignment scores</A>
	</UL>
<LI><A HREF="#EXAMPLES">Examples</A>
	<UL>
	<LI><A HREF="#FLAVOPROTEINS">Flavoproteins FAD binding and catalytic sites</A>
	<LI><A HREF="#ST_CD59">Snake toxins and the CD59 extracellular domain</A>
        <LI><A HREF="#IS30">IS30 transposases DNA-binding domain</A>
        <LI><A HREF="#HTH">Hth motifs in the Blocks Database</A>
	</UL>
<LI><A HREF="#SUPPLMNT">Supplements</A>
	<UL>
	<LI><A HREF="LAMA/LAMA.Z_stat.html">Mean and standard deviation for scores expected by chance</A>
	<LI><A HREF="LAMA/LAMA.ZVp.html">Percentile of Z scores expected by chance</A>
	</UL>
<LI><A HREF="#LAMA_CREDITS">Credits and citation</A>
</UL>

<A NAME="LAMA"><H2>What does LAMA do?</H2></A>
LAMA (Local Alignment of Multiple Alignments) is a program for comparing 
protein multiple sequence alignments with each other. The program can search 
databases of such multiple alignments. The search is for sequence 
similarities between conserved regions of protein families.
The method is sensitive, detecting weak sequence 
relationships between protein families. Sequence similarities 
beyond the range of conventional sequence database searches can be 
detected by the method.<P>

<A NAME="LAMA_FOR_ME"><H2>What can LAMA do for <B>me</B>?</H2></A>
LAMA can identify protein families similar to your protein(s) of interest
and protein motifs similar to conserved regions in your protein(s). The
information known about these similar families and motifs can help you
identify the function and structure of your protein and locate critical
conserved regions in your protein(s). This can direct you in
designing experiments to test your hypotheses.<P>

LAMA compares <B>multiple</B> sequence alignments of proteins. 
If you have only a <B>single</B> protein sequence you first need to 
find other members of its family. The protein sequences also need to 
be multiply aligned. The <A HREF="#LAMA_INPUT_CONTENT">Content of input</A>
section explains how to find related sequences and align them.<P>

<A NAME="LAMA_HOW"><H2>How does LAMA align blocks?</H2></A>
The multiple alignments are first transformed into position specific 
scoring matrices (<A HREF="PSSM_def.html">PSSMs</A>). Each column in 
the PSSM corresponds to a position in the 
alignment and has the amino acid distribution of that position. The 
transformation into the PSSM is done with position-based sequence weights 
(<A HREF="/blocks/papers/#SEQUENCE_WEIGHTS.ps">Henikoff & Henikoff, 1994a</A>) 
and odds ratios between the amino acid frequencies 
observed in the multiple alignments and the frequencies expected 
from protein databases 
(<A HREF="/blocks/papers/#BLOCKMAKER.ps">Henikoff & Henikoff, 1995</A>). 
The transformation corrects possible overrepresentation of some 
sequences by sequence weighting and considers the background 
frequencies of the amino acids. 
The method was tested and calibrated with ungapped local multiple alignments 
(blocks) from the 
<A HREF="/blocks/about_blocks.html#blocks">Blocks Database</A>
.<P> 
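The weighting and odds-ratio steps described above can be sketched in Python. 
This is a minimal illustration, not LAMA's own code: the function names, the 
toy block and the uniform background are assumptions made for the example, and 
a real calculation would use amino acid frequencies measured from a protein 
database. The weights follow the position-based scheme, in which each column 
shares its weight equally among the residue types it contains and then equally 
among the sequences carrying each type.<P>

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_based_weights(block):
    """Position-based sequence weights for equal-length aligned segments.

    Assumes the segments contain only the 20 standard amino acids.
    """
    weights = [0.0] * len(block)
    for column in zip(*block):
        counts = Counter(column)
        kinds = len(counts)
        for i, residue in enumerate(column):
            # A residue shared by many sequences adds little weight to each of them.
            weights[i] += 1.0 / (kinds * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


def odds_ratio_columns(block, background=None):
    """One dict per position: weighted observed frequency / background frequency."""
    if background is None:
        # Placeholder assumption: a uniform background, NOT real database frequencies.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    weights = position_based_weights(block)
    pssm = []
    for column in zip(*block):
        freq = dict.fromkeys(AMINO_ACIDS, 0.0)
        for w, residue in zip(weights, column):
            freq[residue] += w
        pssm.append({aa: freq[aa] / background[aa] for aa in AMINO_ACIDS})
    return pssm


# Hypothetical toy block of four aligned segments, five positions wide.
toy_block = ["GYVGS", "GFDGF", "GYDGF", "GYQGG"]
weights = position_based_weights(toy_block)
pssm = odds_ratio_columns(toy_block)
```

In this toy block the third segment, which is the most similar to the others, 
receives the smallest weight (0.2, against about 0.267 for each of the other 
three).<P>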

The matrices are treated as sequences of columns, enabling 
their alignment with one another. To use algorithms developed for
aligning single sequences we need a measure for comparing pairs of 
matrix columns. This corresponds to the substitution matrices 
(PAM, BLOSUM etc.) used in single-sequence alignments. The
measure used in our method to score the similarity between pairs of 
matrix columns is the Pearson correlation coefficient <A NAME="Pr">(r)</A>: 
<IMG SRC = "LAMA/LAMA_r.gif" HEIGHT=69 WIDTH=181>
where A(i) and B(i) are the values of amino acid i in columns A and B, 
respectively, and /A and /B
are the means of the values in columns A and B.
The correlation score ranges from 1 for columns with identical 
amino acid distributions to -1 for columns with opposite 
distributions (in each column only 10 amino acids occur and 
those 10 amino acids are different in the two compared 
columns).<P>
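As a sketch of the column comparison, the Pearson correlation coefficient can 
be computed in a few lines of Python. The function name and the handling of a 
flat column (returning 0) are assumptions made for this example. The two test 
columns reproduce the extreme case described above: 20 amino acids, each column 
using 10 of them, with no amino acid in common.<P>

```python
import math


def column_correlation(col_a, col_b):
    """Pearson correlation coefficient r between two equal-length PSSM columns."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must list the same amino acids")
    n = len(col_a)
    mean_a = sum(col_a) / n
    mean_b = sum(col_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b))
    var_a = sum((a - mean_a) ** 2 for a in col_a)
    var_b = sum((b - mean_b) ** 2 for b in col_b)
    if var_a == 0 or var_b == 0:
        # r is undefined for a flat column; returning 0.0 is a choice made here.
        return 0.0
    return cov / math.sqrt(var_a * var_b)


# Each column uses 10 of the 20 amino acids, and the two sets do not overlap.
first_half = [10.0] * 10 + [0.0] * 10
second_half = [0.0] * 10 + [10.0] * 10

r_identical = column_correlation(first_half, first_half)
r_opposite = column_correlation(first_half, second_half)
```

Here <CODE>r_identical</CODE> evaluates to 1 and <CODE>r_opposite</CODE> to -1, 
the two ends of the range.<P>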

The score of a block-to-block alignment is the sum of the scores from 
comparing the corresponding columns in the two block matrices: <BR>
<IMG SRC = "LAMA/LAMA_algorithm.gif" HEIGHT=477 WIDTH=438>
<PRE>
Local alignment of blocks.
Positions 2 to 7 from block A aligned with positions 4 to 9 from 
block B. A column comparison score, <STRONG>s(Xn*Ym)</STRONG>, is calculated for 
each pair of positions (A2*B4 to A7*B9). The score of the alignment 
of the two segments, <STRONG><I>S</I></STRONG>, is the sum of the column comparison scores.
</PRE>

 The alignment is done using the Smith-Waterman algorithm for optimal 
local alignments. No gaps are allowed since the aligned objects are 
short conserved sequence regions. All alignments above the cutoff score 
are reported for each pair of compared blocks. There may be cases where parts
of one long block are similar to several blocks:
<STRONG><PRE>
	AAAAAAAAAAAAAAAAAAA
	 BBB       CCCCCC
</PRE></STRONG>
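Without gaps, the Smith-Waterman recurrence reduces to following each diagonal 
of the column-score matrix and resetting the running sum whenever it is no 
longer positive. The Python sketch below illustrates this under stated 
assumptions: the function name and the toy score matrix are hypothetical, and 
only the single best segment is returned, whereas LAMA reports every alignment 
above the cutoff.<P>

```python
def best_ungapped_alignment(scores):
    """Best-scoring ungapped local alignment from a matrix of column scores.

    scores[i][j] is the score of column i of block A against column j of block B.
    Returns (total, start_a, start_b, length) with 0-based start positions.
    """
    best = (0.0, 0, 0, 0)
    cols = len(scores[0])
    # prev[j] holds (running sum, length) of the diagonal segment ending at (i-1, j).
    prev = [(0.0, 0)] * cols
    for i, row in enumerate(scores):
        cur = []
        for j in range(cols):
            total, length = prev[j - 1] if j > 0 else (0.0, 0)
            total += row[j]
            length += 1
            if total <= 0:
                # Local alignment reset: never extend a non-positive run.
                total, length = 0.0, 0
            cur.append((total, length))
            if total > best[0]:
                best = (total, i - length + 1, j - length + 1, length)
        prev = cur
    return best


# Hypothetical column scores: block A has 3 positions, block B has 4.
toy_scores = [
    [-0.2, 0.9, -0.1, 0.0],
    [0.1, -0.3, 0.8, -0.2],
    [-0.5, 0.2, -0.1, 0.7],
]
total, start_a, start_b, length = best_ungapped_alignment(toy_scores)
```

For the toy matrix the best segment pairs all three positions of block A with 
positions 2 to 4 of block B (0-based starts 0 and 1).<P>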


<A NAME="LAMA_INPUT"><H2>Input for LAMA</H2></A>

<A NAME="LAMA_INPUT_CONTENT"><H3>Content of input</H3></A>
 LAMA can compare any multiple alignment if it is in the correct format.
<I>However</I>, the 
<A HREF="#Pr">column comparison measure</A> and 
the <A HREF="#SCORE_SIGNIF">significance estimation</A> of the scores
are appropriate for protein sequence blocks - ungapped conserved multiple 
alignments. The use of other types of multiple alignments, such as global 
multiple alignments that include many gaps, may give misleading results.
For example, the resulting alignments may not be optimal, or their 
significance may differ from what the output suggests.
<P>

 If you only have a single protein sequence or want to find more protein
sequences related to yours you can search the sequence databases.
One way to do this on the WWW is using the 
<A HREF="http://www.ncbi.nlm.nih.gov/cgi-bin/BLAST/nph-blast?Jform=0">
BLAST program</A> to search the
<A HREF="http://www.ncbi.nlm.nih.gov/index.html">NCBI</A> sequence databases. 
Links to other search methods can be found at 
the Baylor College of Medicine Human Genome Center
<A HREF="http://dot.imgen.bcm.tmc.edu:9331/seq-search/protein-search.html">
Search Launcher site</A>.<P>

 The <A HREF = "/blocks/make_blocks.html">BlockMaker</A> WWW site can be used
for finding blocks in your group of related protein sequences. There are 
various other methods for making protein multiple sequence alignments. 
Among these are the 
<A HREF="http://meme.sdsc.edu/meme/website/meme-intro.html">
MEME system</A>,
<A HREF = "http://www3.ncbi.nlm.nih.gov:80/htbin-post/Entrez/query?uid=94023958&form=6&db=m&Dopt=r">
Gibbs sampling programs</A>, 
the <A HREF="http://www3.ncbi.nlm.nih.gov:80/htbin-post/Entrez/query?uid=91172743&form=6&db=m&Dopt=r">
MACAW interactive program</A>, and
the <A HREF = "http://www3.ncbi.nlm.nih.gov:80/htbin-post/Entrez/query?uid=95075648&form=6&db=m&Dopt=r">
CLUSTAL-W progressive multiple alignment program</A>. 
Several of these methods are available through the 
<A HREF="http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html">
multiple sequence alignment page</A> 
at the Baylor College of Medicine Human Genome Center.
<P>

Multiple alignments submitted to the program should be of conserved, 
relatively ungapped, protein sequence regions. A few gaps in the 
alignment are acceptable. The more sequences in the alignment, the 
better. In general, avoid alignments with fewer than 4 sequences.
<P>


<A NAME="LAMA_INPUT_FORMAT"><H3>Format of input</H3></A>
LAMA only accepts input in the 
<A HREF="/blocks/blocks_format.html">Block format</A>. Other multiple
alignments can be <A HREF="/blocks/block_formatter.html">
reformatted to the Block format</A>. If you are not sure of your 
multiple alignment or just have a group of <STRONG>related</STRONG>
sequences you can use the 
<A HREF = "/blocks/make_blocks.html">BlockMaker program</A> for 
finding blocks in the sequences. Note that blocks include sequence weights 
to avoid biassed sequence representation.<P>

<A NAME="LAMA_OUTPUT_OPTIONS"><H3>Output options</H3></A>
<UL>
<A NAME="OUTPUT"><LI>Output level</A><BR> 
The <A HREF="#LAMA_OUTPUT">standard output</A> displays pairs of 
blocks with alignment scores above a <A HREF = "#SCORE_CUTOFF">
Z score cutoff</A>. When both target and query blocks are given 
by the user there are options for also seeing the 
<A HREF="#Pr">column scores</A> composing the alignment score 
for <I>every</I> reported alignment and the <A HREF="#LAMA_HOW">PSSMs</A>
of <I>all</I> the compared blocks.<BR>

<A NAME="CUTOFF"><LI>Score cutoff</A><BR>
The default cutoff is a Z score of 5.6. When both target and query 
blocks are given by the user, different cutoffs can be specified. 
Giving a lower value will allow reporting of weaker alignments. 
Alignments with low values can occur by chance between unrelated 
blocks. Raising the cutoff score may exclude potentially genuine 
alignments. The <A HREF = "#EXPECTED">expected number</A> of 
occurrences should be used to <A HREF = "#EVALUATING_SCORES">
evaluate the alignment scores</A>.
</UL>
Some of the <A HREF="#EXAMPLES">examples</A> included in this document
illustrate the use of the options.<P>


<H2><A NAME="LAMA_OUTPUT">Output from LAMA</A></H2>

<H3><A NAME="LAMA_OUTPUT_FORMAT">Content and format of output:</A></H3>
<pre><HR>
LAMA version 1.00 October 96.

Minimal length of reported alignments   4
Score cutoff is 5.6 Z score units (in the top 7.7e-05 percentile of chance scores)


                                            alignment     Z-score  expected number for 
block 1   from:to       block 2   from:to   length                 searching 5000 blocks
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A>   20 :  46 and <A HREF="/blocks-bin/getblock.sh?BL00042#BL00042B">BL00042B</A>    3 :  29 (27) score  39 ( 7.2  1.3e-02) [<A HREF="/blocks-bin/LAMA_show_alignment?/howard/blocks/bin/blocks.dat+BL01063B+2+/howard/blocks/bin/blocks.dat+BL00042B+3+27"><IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment"></A> <A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+20+/howard/blocks/bin/blocks.dat+BL00042B+3+27"><IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos"></A><A HREF="/about_logos.html">?</A>]
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A>    5 :  39 and <A HREF="/blocks-bin/getblock.sh?BL00324#BL00324C">BL00324C</A>    3 :  37 (35) score  27 ( 6.1  1.5e-01) [<A HREF="/blocks-bin/LAMA_show_alignment?/howard/blocks/bin/blocks.dat+BL01063B+5+/howard/blocks/bin/blocks.dat+BL00324C+3+35"><IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment"></A> <A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+5+/howard/blocks/bin/blocks.dat+BL00324C+3+35"><IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos"></A><A HREF="/about_logos.html">?</A>]
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A>   12 :  47 and <A HREF="/blocks-bin/getblock.sh?BL00622#BL00622">BL00622</A>     8 :  43 (36) score  33 ( 8.2  0.0e+00) [<A HREF="/blocks-bin/LAMA_show_alignment?/howard/blocks/bin/blocks.dat+BL01063B+12+/howard/blocks/bin/blocks.dat+BL00622+8+36"><IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment"></A> <A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+12+/howard/blocks/bin/blocks.dat+BL00622+8+36"><IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos"></A><A HREF="/about_logos.html">?</A>]
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A>   10 :  46 and <A HREF="/blocks-bin/getblock.sh?BL00894#BL00894A">BL00894A</A>    1 :  37 (37) score  26 ( 5.7  3.2e-01) [<A HREF="/blocks-bin/LAMA_show_alignment?/howard/blocks/bin/blocks.dat+BL01063B+10+/howard/blocks/bin/blocks.dat+BL00894A+1+37"><IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment"></A> <A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+10+/howard/blocks/bin/blocks.dat+BL00894A+1+37"><IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos"></A><A HREF="/about_logos.html">?</A>]
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A>    4 :  42 and <A HREF="/blocks-bin/getblock.sh?BL01043#BL01043A">BL01043A</A>    2 :  40 (39) score  29 ( 8.1  0.0e+00) [<A HREF="/blocks-bin/LAMA_show_alignment?/howard/blocks/bin/blocks.dat+BL01063B+4+/howard/blocks/bin/blocks.dat+BL01043A+2+39"><IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment"></A> <A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+4+/howard/blocks/bin/blocks.dat+BL01043A+2+39"><IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos"></A><A HREF="/about_logos.html">?</A>]
</pre>
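For scripted processing, report lines of the kind shown above can be split 
into fields with a regular expression. The Python sketch below assumes the 
HTML links have already been stripped, so that each hit is plain text; the 
pattern and the field names are choices made for this example and are not a 
format specification published with LAMA.<P>

```python
import re

# Example layout: "BL01063B   12 :  47 and BL00622     8 :  43 (36) score  33 ( 8.2  0.0e+00)"
HIT_LINE = re.compile(
    r"^(?P<block1>\S+)\s+(?P<from1>\d+)\s*:\s*(?P<to1>\d+)\s+and\s+"
    r"(?P<block2>\S+)\s+(?P<from2>\d+)\s*:\s*(?P<to2>\d+)\s+"
    r"\((?P<length>\d+)\)\s+score\s+(?P<score>-?\d+)\s+"
    r"\(\s*(?P<z>-?\d+(?:\.\d+)?)\s+(?P<expected>[0-9.eE+-]+)\)"
)


def parse_hit(line):
    """Return a dict of fields for one report line, or None if it is not a hit."""
    match = HIT_LINE.match(line.strip())
    if match is None:
        return None
    hit = match.groupdict()
    for key in ("from1", "to1", "from2", "to2", "length", "score"):
        hit[key] = int(hit[key])
    hit["z"] = float(hit["z"])
    hit["expected"] = float(hit["expected"])
    return hit


example = "BL01063B   12 :  47 and BL00622     8 :  43 (36) score  33 ( 8.2  0.0e+00)"
hit = parse_hit(example)
header = parse_hit("Minimal length of reported alignments   4")
```

Header lines such as the minimal-length line do not match the pattern, so 
<CODE>parse_hit</CODE> returns <CODE>None</CODE> for them.<P>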

The program version and execution parameters head the search output. 
Only alignments of at least the <STRONG>minimal length</STRONG> will 
be reported. The significance of very short alignments (fewer than 4 
positions) 
cannot be reliably estimated. Alignments with scores equal to or above 
the <A NAME="SCORE_CUTOFF"><STRONG>score cutoff</STRONG></A> will be reported. 
The score cutoff is specified as a <STRONG>Z score</STRONG>. 
<A NAME="Z_SCORE">Z score</A> is 
the number of standard deviations between the score and the mean score. 
<A NAME="SHUFFLED_SCORES">T</A>he mean score and the standard deviations 
were calculated for the random scores from the alignment of a large number 
of shuffled unbiassed blocks (7 million block pairs; 
see <A HREF="#SUPPLMNT">first supplement</A>).
The <STRONG>Z score</STRONG> is related to the percentile of the score 
in the shuffled blocks scores. This dependence is not linear but sigmoidal
(see <A HREF="#SUPPLMNT">second supplement</A>).<BR>
For each reported alignment the program shows the names of the two 
<STRONG>aligned blocks</STRONG>, 
their <STRONG>position</STRONG> relative to one another,
the <STRONG>alignment length</STRONG>,
the <STRONG>score</STRONG>, 
and the <STRONG>expected number</STRONG>
of such scores when searching a given number of blocks. 
<A NAME="EXPECTED">T</A>he expected number is for chance (random) 
alignments of unbiassed blocks. 
It is calculated from the score percentiles between the shuffled 
unbiassed blocks.
In this example the expected number is for searching 5000 blocks. 
Blocks from the 
<A HREF="/blocks/about_blocks.html#blocks">Blocks Database</A>
and from the
<A HREF="/blocks/about_blocks.html#prints">Prints database</A> 
will be linked to the database entries. The "alignment" link 
(<IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment">) 
shows the alignment of the two blocks. This can also be seen by 
following the "logos" 
(<IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos">) 
link that shows the <A HREF="/blocks/about_logos.html">sequence logos</A> 
of aligned pairs of blocks.
<A HREF="/blocks/about_logos.html">Sequence logos</A> are graphical representations 
of the blocks. 
For example, 
<A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+12+/howard/blocks/bin/blocks.dat+BL00622+8+36">here</A> 
(PostScript viewer required) the logo of block 
<A HREF="/blocks-bin/getblock.sh?BL00622#BL00622">BL00622</A>
is shifted 4 positions relative to the logo of block 
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A>
so that their similar segments (8-43 and 12-47) are aligned. 
Indeed, these segments both contain helix-turn-helix DNA binding motifs.
<P>
When both query and target blocks are provided by the user the 
<A HREF="#OUTPUT">output</A> can also contain the column scores 
of each reported alignment and the <A HREF="#LAMA_HOW">PSSMs</A> 
of every compared block.
<P>
Pay attention to any error or warning messages. Most will probably 
have to do with the <A HREF="#LAMA_INPUT_FORMAT">format of the input</A>.
<P>

<A NAME="EVALUATING_SCORES"><H3>Evaluating LAMA alignment scores</H3></A>
The alignment score is the average of the 
<A HREF="#Pr">column scores</A> in the alignment multiplied by 100. 
Since the column scores have a range of -1 to 1 the alignment score 
will range from -100 to 100. An alignment score of 46 means 
that on average the aligned positions had a correlation coefficient
of 0.46. <I>The significance of the alignment score depends on the 
length of the compared blocks.</I> Alignments between longer blocks 
will tend to be longer and have higher scores. 
The <A HREF="#Z_SCORE">Z score</A> and 
<A HREF="#EXPECTED">expected number</A> let us estimate the 
<A NAME="SCORE_SIGNIF">significance of the scores</A> 
and compare alignments of different lengths. 
The higher the Z score, the less likely the alignment is due
to chance. <I>How unlikely depends on the number of blocks searched.
The more blocks searched, the greater the probability of finding 
high scores by chance.</I> For example, the output of the calibration with the 
<A HREF="#SHUFFLED_SCORES">shuffled blocks</A> contained 7 million 
scores but no alignments with Z scores greater than 8.3. 
Hence an alignment with a Z score equal to or higher than 8.3 
is unlikely to arise by chance in a comparable or smaller number 
of alignments. The expected number shows this directly. 
The expected number is shown for searching 5000 blocks since version 9.1 of the
<A HREF="/blocks/help/blocks_release.html">Blocks Database</A>
contains 3300 blocks. For example, searching this release of 
the Blocks Database and finding an alignment expected to appear 
1.8e-01 times (0.18) suggests that this alignment is not due to chance.
Alignments with expected occurrences of 7.5e-03 or even 0 are almost
certainly genuine (or due to <A HREF="#BIASED_BLOCKS">biassed blocks</A>,
see <A HREF="#TABLE1">below</A>).<BR>

 A relation between two families by a single pair of blocks with a
high Z score is termed a <STRONG>single hit</STRONG>.
However, protein families often have a number of blocks. 
A <STRONG>multiple hit</STRONG> is when two or more block pairs 
from the same two families are similar:
<STRONG><PRE>
                                               multiple hit
     Family 1, blocks 1A, 1B, 1C, 1D.         1A=2B + 1D=2C
     Family 2, blocks 2A, 2B, 2C.
</PRE></STRONG>
We expect the order of the blocks in the hit to be the same in both 
families (in this example 1A -> 1D and 2B -> 2C).<BR>


 Block pairs whose individual Z scores could occur by chance 
can still indicate a genuine relation if they 
are part of a multiple hit. The shuffled blocks scores contained 
no single hit with a Z score above 8.3 and no multiple hits 
with Z scores above 5.6. Hence genuine relationships can also 
be indicated by <I>several</I> alignments whose Z scores are 
<I>individually</I> expected to occur by chance.<P>
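The distinction between single and multiple hits can be sketched in Python as 
a grouping of reported alignments by family pair. Everything specific in this 
sketch is an assumption: the function names, the rule that a family name is a 
block name minus a trailing block letter, the labels, and the hit list, which 
is invented for illustration. The two thresholds are the ones quoted in this 
section. The sketch does not check that the blocks occur in the same order in 
both families.<P>

```python
import re
from collections import defaultdict

SINGLE_HIT_Z = 8.3    # a lone block pair must score above this
MULTIPLE_HIT_Z = 5.6  # block pairs below this are ignored


def family_of(block_name):
    """Assumption: a block name is a family accession plus an optional block letter."""
    return re.sub(r"(?<=\d)[A-Z]$", "", block_name)


def classify_hits(hits):
    """hits: iterable of (block1, block2, z). Returns {(family, family): label}."""
    grouped = defaultdict(list)
    for block1, block2, z in hits:
        if z >= MULTIPLE_HIT_Z:
            families = tuple(sorted((family_of(block1), family_of(block2))))
            grouped[families].append((frozenset((block1, block2)), z))
    labels = {}
    for families, pair_hits in grouped.items():
        distinct_block_pairs = {blocks for blocks, _ in pair_hits}
        best_z = max(z for _, z in pair_hits)
        if len(distinct_block_pairs) >= 2:
            labels[families] = "multiple hit"
        elif best_z > SINGLE_HIT_Z:
            labels[families] = "single hit"
        else:
            labels[families] = "unsupported"
    return labels


# Invented block names and Z scores, for illustration only.
hypothetical_hits = [
    ("FAM001A", "FAM002B", 6.1),
    ("FAM001D", "FAM002C", 5.9),
    ("FAM003A", "FAM004A", 9.0),
    ("FAM005B", "FAM006", 6.0),
    ("FAM007A", "FAM008A", 4.0),
]
labels = classify_hits(hypothetical_hits)
```

The first two invented alignments support each other and form a multiple hit, 
although neither would count as a single hit on its own.<P>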

When comparing blocks against a database the Z score cutoff is set to 5.6, 
corresponding to an expected 0.385 chance occurrences when searching 5000 blocks.
When both query and target blocks are provided other cutoffs can be 
<A HREF="#CUTOFF">chosen</A>.
<P>
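The two significance measures can be written as one-line functions. The 
Python sketch below is illustrative only: the function and parameter names are 
assumptions, and the mean and standard deviation passed to 
<CODE>z_score</CODE> are made-up numbers rather than values from the 
calibration tables. The last line checks the cutoff arithmetic: the report 
header pairs the 5.6 cutoff with 7.7e-05, and treating that value as a fraction 
of chance scores reproduces the 0.385 quoted above for 5000 blocks.<P>

```python
def z_score(score, chance_mean, chance_sd):
    """Number of standard deviations between a score and the mean chance score."""
    return (score - chance_mean) / chance_sd


def expected_number(chance_fraction, blocks_searched=5000):
    """Expected count of chance alignments scoring this high in one search."""
    return chance_fraction * blocks_searched


# Made-up mean and standard deviation, only to show the calculation.
example_z = z_score(30.0, 10.0, 4.0)

# 7.7e-05 is taken from the report header and read as a fraction (assumption).
cutoff_expected = expected_number(7.7e-05)
```

Searching fewer blocks lowers the expected number proportionally, which is why 
the same Z score is more convincing in a smaller search.<P>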

False positive (high score but no relation) and false negative 
(low score but genuine relation) hits are still possible and
biological knowledge and common sense should be used. 
<A NAME="BIASED_BLOCKS">Compositionally</A> 
biassed blocks (consisting of sequence segments rich in a few amino 
acids or short repeats) are a common cause for false positive hits. 
You can check if a block is biassed <A HREF="/blocks/biassed_blocks.html">here</A>.
False negative hits can be caused by misalignment in the blocks.
<P>

<A NAME="TABLE1">E</A>ach entry in the 
<A HREF="/blocks/about_blocks.html#blocks">Blocks Database</A>
version 8.6 (3174 blocks from 858 protein families)
was searched against the other entries in the database. 
All block pairs with Z scores larger than 5.6 were saved. 
Protein families related by more than one saved score were
considered as multiple hits and alignments with Z scores 
above 8.3 as single hits. This resulted in 141 pairs of families. 
Eighty percent of these were 
identified as genuine relationships (true positives) according to the 
family descriptions, by sharing common sequences, or by detailed 
examination. Compositional bias was responsible for another eight percent 
of the high scores. The remaining twelve percent of the high scores could 
not be classified as either genuine or false based on available evidence.<P>


<TABLE BORDER WIDTH=532>
<CAPTION>Distribution of top scoring family pairs</CAPTION>
<TR VALIGN=top><TD>Relation type</TD><TD>Genuine(1)</TD><TD>Biassed<BR>Composition</TD><TD>Unknown</TD><TD><B>Total</B></TD></TR>
<TR VALIGN=top><TD>Multiple block hits - independent(2)</TD><TD ALIGN=right>24</TD><TD ALIGN=right>-</TD><TD ALIGN=right>1</TD><TD ALIGN=right><B>25</B></TD></TR>
<TR VALIGN=top><TD>Multiple block hits - repeats(3)</TD><TD ALIGN=right>11</TD><TD ALIGN=right>6</TD><TD ALIGN=right>9</TD><TD ALIGN=right><B>26</B></TD></TR>
<TR VALIGN=top><TD>Multiple block hits - inner repeats(4)</TD><TD ALIGN=right>15</TD><TD ALIGN=right>4</TD><TD ALIGN=right>2</TD><TD ALIGN=right><B>21</B></TD></TR>
<TR VALIGN=top><TD>Single block hits</TD><TD ALIGN=right>63</TD><TD ALIGN=right>1</TD><TD ALIGN=right>5</TD><TD ALIGN=right><B>69</B></TD></TR>
<TR VALIGN=top><TD><B>Total</B></TD><TD ALIGN=right><B>113</B></TD><TD ALIGN=right><B>11</B></TD><TD ALIGN=right><B>17</B></TD><TD ALIGN=right><B>141</B></TD></TR>
<TR VALIGN=top><TD><B>Fraction</B></TD><TD ALIGN=right><B>80%</B></TD><TD ALIGN=right><B>8%</B></TD><TD ALIGN=right><B>12%</B></TD><TD></TD></TR>
</TABLE>
<BR>
<PRE>
(1) Genuine relations were identified by the families' PROSITE descriptions,
    detailed analysis of the literature or by sharing common sequences 
    (22 of the single and independent-multiple hits).
(2) An independent multiple hit is two different protein families 
    related by two or more different block pairs.
(3) A repeat multiple hit is two different protein families where a 
    block from one family is similar to two or more blocks from the 
    other family.
(4) An inner-repeat multiple hit is a case where the similarities are 
    between blocks from the same family.
</PRE>

<A NAME="EXAMPLES"><H2>Examples</H2></A>
<UL>
<LI><A NAME="FLAVOPROTEINS"><H3>Flavoproteins FAD binding and catalytic sites</H3></A><P>
     A comparison of all the Blocks Database v8.6 entries with each other 
found the following hit between FAD flavoprotein subunits from two 
oxidoreductase enzyme complexes, BL00504 - succinate dehydrogenases 
(Sdh) and fumarate reductases (Frd) and BL00677 - D-amino acid oxidases (DAO):
<PRE>
                                            alignment     Z-score  expected number for
block 1   from:to       block 2   from:to   length                 searching 5000 blocks
BL00504A    2 :  20 and BL00677A    2 :  20 (19) score  51 (10.0  0.0e+00) [<A HREF="/blocks-bin/LAMA_logos?/blocks/help/LAMA/BL00504.dat+BL00504A+2+/blocks/help/LAMA/BL00677.dat+BL00677A+2+19">logos</A> <A HREF="/blocks/about_logos.html">?</A>]
</PRE>
A comparison with a lower cutoff found another hit supporting the first one:
<PRE>
BL00504D    3 :  35 and BL00677D   17 :  49 (33) score  26 ( 5.5  5.1e-01) [<A HREF="/blocks-bin/LAMA_logos?/blocks/help/LAMA/BL00504.dat+BL00504D+3+/blocks/help/LAMA/BL00677.dat+BL00677D+17+33">logos</A> <A HREF="/blocks/about_logos.html">?</A>]
</PRE>

Sequence annotations and a literature search revealed that block BL00504A 
is the FAD-binding site and BL00504D is the active site (Birch-Machin 
<I>et al</I>., 1992) of the Sdh/Frd flavoproteins. Block BL00677A is 
the FAD-binding site of the DAO proteins. The FAD AMP-binding sites in 
both families are beta-alpha-beta ADP-binding folds and were already 
noted as such (Birch-Machin <I>et al</I>., 1992; Schulz <I>et al</I>., 1982). 
This explains the first hit. 
<P>
     The DAO BL00677D block has a conserved histidine important for 
enzymatic activity of pig DAO (Miyano <I>et al</I>., 1991). This histidine 
is aligned with a conserved and essential histidine in the Sdh/Frd 
flavoproteins' catalytic site (Birch-Machin <I>et al</I>., 1992; Schroder 
<I>et al</I>., 1991). Other positions in these aligned regions are also 
similar (column scores 0.31 to 0.98). The dissimilar positions have 
column scores close to zero (0.04 to -0.14). This finding suggests 
that the active site of DAO flavoproteins is in the BL00677D region with 
the conserved histidine as the crucial residue.
<P>
      BLAST and FASTA searches of the SwissProt protein database could 
not identify this similarity. No sequence from one family identified 
any sequence from the other family. Optimal local alignments of all the 
sequence pairs from the two families had scores expected by chance. 
Searching the Blocks Database with the sequences from the two families 
identified the relation between the families with 6 Sdh/Frd flavoproteins 
sequences (multiple hits with 98.1 to 76.2 percentiles of scores with 
shuffled sequence queries and P values of 8.4*10<SUP>-3</SUP> to 1.1*10<SUP>-1</SUP>) but not 
with the other two sequences from that family or any of the sequences 
from the DAO family (single hits with less than 60.0 score percentiles).
<P>
<IMG SRC="LAMA/LAMA_flavoproteins.gif" HEIGHT=555 WIDTH=503>
<PRE>
Suggested catalytic site of DAO flavoproteins.
A, positions 17-49 of DAO flavoproteins (block BL00677D) aligned with
the catalytic region of Sdh/Frd flavoproteins (positions 3-35 of block
BL00504D). The histidines important for the enzymes' catalytic activity
are outlined (the histidine in sequence DHSA_BACSU is misaligned due to
a two aa insertion). The start and end coordinates flank the sequences.
B, the column scores of the alignment.
</PRE>
<P>

Birch-Machin, M. A., Farnsworth, L., Ackrell, B. A., Cochran, B., Jackson, S.,
 Bindoff, L. A., Aitken, A., Diamond, A. G. & Turnbull, D. M. (1992). 
The sequence of the flavoprotein subunit of bovine heart succinate 
dehydrogenase. <I>J. Biol. Chem.</I> <B>267</B>, 11553-11558.<P>
Miyano, M., Fukui, K., Watanabe, F., Takahashi, S., Tada, M., Kanashiro, M. & 
Miyake, Y. (1991). Studies on Phe-228 and Leu-307 recombinant mutants of 
porcine kidney D-amino acid oxidase: expression, purification, and 
characterization. <I>J. Biochemistry</I> <B>109</B>, 171-177.<P>
Schroder, I., Gunsalus, R. P., Ackrell, B. A., Cochran, B. & Cecchini, G. 
(1991). Identification of active site residues of Escherichia coli fumarate 
reductase by site-directed mutagenesis. <I>J. Biol. Chem.</I> <B>266</B>, 
13572-13579.<P>
Schulz, G. E., Schirmer, R. H. & Pai, E. F. (1982). FAD-binding site of 
glutathione reductase. <I>J. Mol. Biol.</I> <B>160</B>, 287-308.
<HR>
<LI><A NAME="ST_CD59"><H3>Snake toxins and the CD59 extracellular domain</H3></A><P>

Conserved regions from snake toxins and the CD59 extracellular domain were found
to be similar to each other. The alignment score is not very striking, and the two families 
seem to be quite dissimilar. What is the connection between snake toxins, small 
extracellular proteins that bind to nerve receptors, and the CD59 domain, a domain 
that is found in one or more copies on GPI-linked cell surface glycoproteins? 
A closer look at the alignment was taken by requesting the column scores.
These scores are shown above the score line for each of the 12 alignment positions 
(8,3 to 19,14):
<PRE>
Column scores for optimal alignment of <A HREF="/blocks-bin/getblock.sh?BL00272#BL00272B">BL00272B</A> and <A HREF="/blocks-bin/getblock.sh?BL00983#BL00983B">BL00983B</A> -
  8, 3   9, 4  10, 5  11, 6  12, 7  13, 8  14, 9  15,10  16,11  17,12  18,13 19,14
 0.262  0.169  0.138  0.286  0.995  1.000  0.368  0.224  0.986 -0.067  1.000 1.000
<A HREF="/blocks-bin/getblock.sh?BL00272#BL00272B">BL00272B</A>    8 :  19 and <A HREF="/blocks-bin/getblock.sh?BL00983#BL00983B">BL00983B</A>    3 :  14 (12) score  53 ( 6.5  6.0e-02) [<A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL00272B+8+/howard/blocks/bin/blocks.dat+BL00983B+3+12">logos</A> <A HREF="/blocks/about_logos.html">?</A>]
</PRE>

 Five of the positions [(12,7), (13,8), (16,11), (18,13) and (19,14)]
have very high column scores (0.986-1.000), 
indicating identical or almost identical amino acid distributions in these
column pairs. The other positions contribute less to the alignment score,
and position (17,12) has a slightly negative score, actually detracting from
the alignment.<P>
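The arithmetic behind this example can be checked directly. The short Python 
sketch below recomputes the alignment score from the 12 column scores listed 
above, using the definition given under Evaluating LAMA alignment scores (mean 
column score multiplied by 100). The function name is hypothetical, and 
rounding to the nearest integer is an assumption about how the program reports 
the value.<P>

```python
# Column scores copied from the BL00272B / BL00983B alignment, keyed by position pair.
column_scores = {
    (8, 3): 0.262, (9, 4): 0.169, (10, 5): 0.138, (11, 6): 0.286,
    (12, 7): 0.995, (13, 8): 1.000, (14, 9): 0.368, (15, 10): 0.224,
    (16, 11): 0.986, (17, 12): -0.067, (18, 13): 1.000, (19, 14): 1.000,
}


def alignment_score(scores):
    """Mean column score multiplied by 100, rounded to the nearest integer."""
    values = list(scores)
    return round(100 * sum(values) / len(values))


score = alignment_score(column_scores.values())

# Position pairs with very high column scores, and those that detract from the total.
strong = [pair for pair, r in column_scores.items() if r >= 0.986]
negative = [pair for pair, r in column_scores.items() if r < 0]
```

The mean of the 12 column scores is about 0.53, which gives the reported score 
of 53; the only negative contribution comes from position pair (17,12).<P>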
Upon requesting to see the PSSMs of the blocks (below) or their aligned 
logos (link to 'logos' above) you will note 
that 3 of the alignment positions contributing to the score are highly 
conserved cysteine residues. This raises the possibility of identical 
patterns of disulphide bonds in both regions. We might give this 
alignment more attention since disulphide bonds are known to be well
conserved even between distantly related sequences. 
 More information can be found by following the block links to the
<A HREF="/blocks">Blocks Database</A> 
entries. Each family is accompanied by its 
<A HREF="http://www.ebi.ac.uk/interpro/">InterPro</A> 
annotation and the multiple alignment of each block can be 
viewed as a graphical 
<A HREF="/blocks/help/about_logos.html">sequence logo</A>. 
The <A HREF="/blocks/help/LAMA/LAMA_cardiotoxin+CD59.JPEG">structures of both proteins</A> 
are known and confirm their relation. (The 
<A HREF="http://www.expasy.ch/sw3d/">SWISS-3DIMAGE</A> 
was the source for these images of the structures.)
<P>
<PRE>
PSSM of <A HREF="/blocks-bin/getblock.sh?BL00272#BL00272B">BL00272B</A>

  |                                       1   1   1   1   1   1   1   1   1   1
  |   1   2   3   4   5   6   7   8   9   0   1   2   3   4   5   6   7   8   9
--+----------------------------------------------------------------------------
A |   0   0   0  13   0   3   0   0   1   0   2   0   0   0   1   0   2   0   0
C |  87  12  12  11   0   0   0   0  21   0   6 100 100   0   0   0   0  99   0
D |   0   0   3   2  11   9   3   2   6   0   0   0   0   3   0  82  10   0   2
E |   0   5   2   8   3   5   2   9   4   6   9   0   0   7   0   5   5   0   0
F |   2   3   9   0   0   2   6   4   2   2   0   0   0   0   0   0   0   0   0
G |   1   1   1   1   3   0  24   8   7   0   0   0   0   2   3   0   1   0   2
H |   0   0   2   0   0   4   2   0   4   0  13   0   0   6   0   0   0   0   0
I |   0   0   4   0   2   1   0  17   7  30   3   0   0   0   6   0   0   0   0
K |   0   3  22   4  30   3   8   5  17   0  24   0   0  16   1   0  36   0   0
L |   0   0   1   0   1   3   8   3   0  14   5   0   0   0   0   0   2   0   0
M |   0   0   0   0  11   2   9   0   3   0   3   0   0   0   0   0   0   0   0
N |   0   0   0   5   2   7   2   2   2   0   2   0   0  16   0  13  22   1  96
P |   6  65   9   2   3  23   6   8   2   5   0   0   0   0   0   0   0   0   0
Q |   0   2   6   0   0   1   0   1   6   0   8   0   0   3   0   0   0   0   0
R |   0   2   4  15   8   2   6   6   2   3   8   0   0  10   0   0  19   0   0
S |   1   4   4   6  13   3   0   4   6   2   1   0   0  19  16   0   0   0   0
T |   3   4  14   5   5   5   1   5   0   4   3   0   0  18  72   0   0   0   0
V |   0   0   3  28   5   1   0  19   1  22   6   0   0   0   0   0   1   0   0
W |   0   0   0   0   0  22   0   0   0   0   0   0   0   0   0   0   0   0   0
X |   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y |   0   0   5   0   3   2  23   7   9  11   7   0   0   0   0   0   2   0   0
- |   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0


PSSM of <A HREF="/blocks-bin/getblock.sh?BL00983#BL00983B">BL00983B</A>

  |                                       1   1   1   1   1
  |   1   2   3   4   5   6   7   8   9   0   1   2   3   4
--+--------------------------------------------------------
A |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
C |   0   0   0   0   0   0  91 100   0   0   0   0 100   0
D |   0   0   0   0   0   0   0   0   0   0  76   0   0   0
E |   0  17  29   0  20   0   0   0   0  42   0   0   0   0
F |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
G |  10   0   0   0   0   0   0   0  10   0   0   0   0   0
H |   0   0   0   0   0  39   0   0   0   0   0   0   0   0
I |  25   0   0   0   0   0   0   0   0   0   0   0   0   0
K |   0   0   0  30   0   0   0   0  28  23   0   0   0   0
L |   0   0  48   0   0   0   0   0   0   9   0 100   0   0
M |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
N |  25  23   0  13   0   0   0   0   0   0  24   0   0 100
P |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q |   0  15   0   0   0  13   0   0  29   0   0   0   0   0
R |   0  12   0  12   0   0   0   0  24   0   0   0   0   0
S |   0  20   0  11   0  18   0   0   9  12   0   0   0   0
T |  23  13   0  23  35  10   0   0   0  14   0   0   0   0
V |  16   0  22  11   0   8   9   0   0   0   0   0   0   0
W |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
X |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y |   0   0   0   0  45  13   0   0   0   0   0   0   0   0
- |   0   0   0   0   0   0   0   0   0   0   0   0   0   0

</PRE>
("X" specifies unknown amino acids.)
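<P>
The similarity between individual columns of the two PSSMs above can be 
illustrated with a short Python sketch. It is not LAMA's own code: the 
use of the Pearson correlation coefficient as the column measure, the 
function names <TT>column</TT> and <TT>pearson</TT>, and the choice of 
column pairs are assumptions made for this example. The column values are 
copied from the two PSSMs above (rows that are not listed are 0).
<PRE>
```python
import math

# The 22 row labels used in the PSSMs above (20 amino acids, X and gap).
ROWS = "ACDEFGHIKLMNPQRSTVWXY-"


def column(nonzero):
    """Expand a dict of non-zero PSSM entries into a full 22-value column."""
    return [float(nonzero.get(row, 0)) for row in ROWS]


def pearson(col_a, col_b):
    """Pearson correlation coefficient of two equally long columns.

    Assumes neither column is constant (otherwise the variance is zero).
    """
    n = len(col_a)
    mean_a = sum(col_a) / n
    mean_b = sum(col_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b))
    var_a = sum((a - mean_a) ** 2 for a in col_a)
    var_b = sum((b - mean_b) ** 2 for b in col_b)
    return cov / math.sqrt(var_a * var_b)


# Columns copied from the PSSM of BL00272B.
bl00272b_col13 = column({"C": 100})
bl00272b_col16 = column({"D": 82, "E": 5, "N": 13})
bl00272b_col19 = column({"D": 2, "G": 2, "N": 96})

# Columns copied from the PSSM of BL00983B.
bl00983b_col8 = column({"C": 100})
bl00983b_col11 = column({"D": 76, "N": 24})
bl00983b_col14 = column({"N": 100})

pairs = [
    ("BL00272B col 13 vs BL00983B col 8", bl00272b_col13, bl00983b_col8),
    ("BL00272B col 16 vs BL00983B col 11", bl00272b_col16, bl00983b_col11),
    ("BL00272B col 19 vs BL00983B col 14", bl00272b_col19, bl00983b_col14),
]
for name, col_a, col_b in pairs:
    print(f"{name}: r = {pearson(col_a, col_b):.3f}")
```
</PRE>
The three pairs compared here are the cysteine, aspartate and asparagine 
rich columns of the two blocks; in each pair the BL00272B column number is 
5 higher than the BL00983B one.<P>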
<HR>
<LI><A NAME="IS30"><H3>IS30 transposases DNA-binding domain</H3></A><P>

        Excision and insertion of bacterial insertion sequence (IS) elements
require the activity of a transposase protein, which is sometimes encoded by
the IS itself. The IS30 transposase family (Dong et al., 1992) is represented by
five blocks in BLOCKS 8.6. A region of 21 positions from the first block
had high scores (Z scores 6.7 to 8.8) only with helix-turn-helix
DNA-binding motifs (hth) from four protein families (see the
<A HREF="#FIGURE_HTH_GRAPH">figure</A> in the
<A HREF="#HTH">next example</A>).
Hth DNA-binding motifs occur in many proteins that bind specific DNA
sequences (Pabo & Sauer, 1992).<P>
        BLAST searches of the SwissProt protein database with the IS30
sequences did not identify any protein with a known hth region. Searching
the Blocks Database with the IS30 sequences gave high scores with hth
blocks for two of the sequences (98.1 and 93.1 percentiles of scores
with shuffled sequence queries (Henikoff & Henikoff, 1994)). The other
two sequences had low scores with hth blocks (30.8 and 18.1 score
percentiles) and higher scores with non-hth blocks. However, the putative
DNA-binding region of each transposase was detected by the method of
Dodd and Egan (1990) as an almost certain hth domain.<P>
        Classification of the first IS30 block as a hth motif is supported by
the finding that the N-terminal region of an IS30 transposase,
containing the putative hth DNA-binding region, binds the IS30 element
(Stalder et al., 1990).<P>

<IMG SRC="LAMA/LAMA_IS30.gif" HEIGHT=129 WIDTH=548>
<PRE>
Hth-like region in IS30 transposases.
Block BL01043A of the IS30 transposases family. The regions similar to
the hth motifs in the block to block searches are underlined. The start
and end coordinates flank the sequences. The diagram shows the suggested
position of the hth motifs found by the hth algorithm (Dodd & Egan, 1990).
The algorithm scores for hth motifs were 5.19 standard deviation
units (SD), corresponding to 100% probability for TRA1_STRSL, 5.95 SD
and 100% for TRA4_BACFR, 4.13 SD and 90% for TRA8_ALCEU, and 5.72 SD and
100% for TRA8_ECOLI.
</PRE>

Dodd, I. B. & Egan, J. B. (1990). Improved detection of helix-turn-helix
DNA-binding motifs in protein sequences. Nucleic Acids Res. 18, 5019-5026.<P>

Dong, Q., Sadouk, A., van der Lelie, D., Taghavi, S., Ferhat, A.,
Nuyten, J. M., Borremans, B., Mergeay, M. & Toussaint, A. (1992).
Cloning and sequencing of IS1086, an Alcaligenes eutrophus insertion
element related to IS30 and IS4351. J. Bacteriol. 174, 8133-8138.<P>

Henikoff, S. & Henikoff, J. G. (1994). Protein family classification
based on searching a database of blocks. Genomics 19, 97-107.<P>

Pabo, C. O. & Sauer, R. T. (1992). Transcription factors: structural
families and principles of DNA recognition. Annu. Rev. Biochem. 61,
1053-1095.<P>

Stalder, R., Caspers, P., Olasz, F. & Arber, W. (1990). The N-terminal
domain of the insertion sequence 30 transposase interacts specifically
with the terminal inverted repeats of the element. J. Biol. Chem. 265,
3757-3762.<P>

<HR>
<LI><A NAME="HTH"><H3>Hth motifs in the Blocks Database</H3></A><P>

     When the entries in the Blocks Database v8.6 were compared with one 
another, all fourteen hth blocks had high scores with two or more 
other hth blocks (<A HREF="#FIGURE_HTH_GRAPH">Figure</A>). 
The two high-scoring non-hth blocks could be distinguished because each 
was related to only a single hth block and had lower scores than those 
between the hth blocks. The hth blocks are from four 
types of protein families: bacterial regulatory proteins, homeobox 
domain proteins, bacterial sigma transcription initiation factors and IS 
transposases. Manual inspection of the Prosite annotation of the protein 
families in the Blocks Database and of the blocks themselves found no 
other hth blocks in the database.<P>
     The hth blocks contained different numbers of sequences, from 4 to 185. 
There was no correlation between the number of sequences in a block and 
its relation to other blocks. This suggests that even blocks with 4-6 
sequences can give a correct representation of conserved protein domains. 
More than 90% of the blocks in the database used had more than four 
sequences. This fraction is increasing with each release (>94% in BLOCKS 
9.0) as the number of new protein sequences is higher than the number of 
new protein families (Green <I>et al</I>., 1993; Koonin <I>et al</I>., 
1995; Koonin <I>et al</I>., 1994).<P>

     Hth blocks illustrate the problem of distinguishing genuine 
relationships from chance ones and suggest a solution. Two of the hth 
blocks (BL00622 and BL01063B) lie below the threshold for detecting 
single-hit relations (Z score >=8.3, bold lines in 
<A HREF="#FIGURE_HTH_GRAPH">Figure</A>). Protein 
families with hth-motifs usually have no other common blocks to support the 
relation between the hth blocks. However, hth motifs are found in several 
protein families. These hth blocks all have high scores with each other, but 
not all these scores are high enough to identify genuine relationships by 
themselves. Nevertheless, blocks with a number of such scores to known hth 
blocks can be identified as hth blocks too. The two non-hth blocks have high 
scores to single hth blocks, and do not form part of the connected graph. An 
analogous strategy is the basis for detecting weak similarities in 
single-sequence alignments using the BLAST3 program (Altschul & Lipman, 1990).
<P>
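The strategy described above can be sketched in Python as follows. This is 
an illustration only, not part of LAMA: the block names, the Z scores and 
the function names are hypothetical. Only the two cutoffs (8.3 and 5.6) and 
the requirement of two or more supporting scores are taken from the text.
<PRE>
```python
SINGLE_HIT_Z = 8.3  # one score at or above this is accepted by itself
MULTI_HIT_Z = 5.6   # lower cutoff, accepted only when several scores agree


def partner_scores(block, z_scores, known_hth):
    """Z scores between `block` and each known hth block it was compared to.

    z_scores maps a (block_a, block_b) tuple to the Z score of that pair.
    """
    found = []
    for (block_a, block_b), z in z_scores.items():
        if block_a == block and block_b in known_hth:
            found.append(z)
        elif block_b == block and block_a in known_hth:
            found.append(z)
    return found


def looks_like_hth(block, z_scores, known_hth, min_links=2):
    """True if one strong score or several moderate scores support `block`."""
    scores = partner_scores(block, z_scores, known_hth)
    strong = [z for z in scores if z >= SINGLE_HIT_Z]
    moderate = [z for z in scores if z >= MULTI_HIT_Z]
    return len(strong) >= 1 or len(moderate) >= min_links


# Hypothetical blocks and scores, for illustration only.
known_hth = {"hth1", "hth2", "hth3"}
z_scores = {
    ("cand_a", "hth1"): 6.1,  # two moderate scores to known hth blocks
    ("cand_a", "hth2"): 7.0,
    ("cand_b", "hth3"): 6.4,  # a single moderate score
    ("cand_c", "hth1"): 9.0,  # a single score above the single-hit cutoff
}

for candidate in ("cand_a", "cand_b", "cand_c"):
    print(candidate, looks_like_hth(candidate, z_scores, known_hth))
```
</PRE>
In this sketch a block with one moderate score to a single hth block 
(<TT>cand_b</TT>) is not accepted, which mirrors the way the two non-hth 
blocks stay outside the connected graph.<P>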

<A NAME="FIGURE_HTH_GRAPH"><IMG SRC="LAMA/LAMA_hth_graph.gif" HEIGHT=444 WIDTH=693></A>
<PRE>
High scores of helix-turn-helix DNA binding blocks.
All 14 hth blocks found in BLOCKS 8.6 and their high scoring relationships 
with each other (true positives) and with other blocks (false positives, 
outward pointing lines). Each block had different sequences except two pairs 
of homeobox blocks that had common sequences (BL00027 with BL00032B and with 
BL00035B). Lines show scores above the 5.6 Z score cutoff. Thick lines 
correspond to scores above the 8.3 Z score cutoff. BRP - bacterial 
regulatory proteins.<P>
</PRE>

Since all the hth blocks are similar to one another, we examined how well 
a single composite hth block would identify other hth blocks. The 
<A HREF="http://www.ncbi.nlm.nih.gov/Complete_Genomes/Ecoli/README">
ecmot database</A> (Koonin et al., 1995) contains such a 
<A HREF="http://www.ncbi.nlm.nih.gov/cgi-bin/Complete_Genomes/mot2html?EC0157">
composite hth block</A>, with 609 sequence segments from many hth families. 
The <A HREF="#EC0157_LOGO">graphical representation</A> 
(<A HREF="/blocks/about_logos.html">logo</A>) of this block 
illustrates the conservation in each of its positions. This and the 
avoidance of particular amino acids at specific positions can also be seen in 
the <A HREF="LAMA/EC0157_.pssm.html">PSSM of block EC0157</A>.
This block had high scores with 18 blocks in Blocks Database v8.6 
(<A HREF="#TABLE2">Table</A>). 
Fourteen of those are the hth blocks discussed above. All the 
hth blocks had high to extremely high scores; the lowest is expected to 
occur by chance 3.2e-3 times.<BR> 
(<A HREF="LAMA/EC0157_.blk">Here</A> you will find block 
EC0157 in a format you can use in a 
<A HREF="/blocks-bin/LAMA_search.sh?LAMA/EC0157_.blk">LAMA search</A>.)<P>

The four blocks at the end of the table have significantly lower scores 
(Z 5.6-6.5). These are non-hth blocks but their similarity to the 
composite hth block can be explained. Two of the blocks are from 
bacterial regulatory protein families, occurring C-terminal to the hth 
motifs. One is a hth-similar region from the araC family (Brunelle & 
Schleif, 1989) and the other corresponds to the 
<A HREF="LAMA/LAMA_lacIs.html">hth helix3 and DNA 
binding hinge helix in the <I>E.coli</I> lac repressor protein</A> (Lewis et 
al., 1996). Another block is from the S3 ribosomal proteins (BL00548A). 
This protein binds RNA, and it is interesting to note the recent report 
of the RNA binding activity by a hth domain (Dubnau & Struhl, 1996). The 
last non-hth block is from L-lactate dehydrogenase (LDH) proteins. LDHs 
do not bind DNA but the 
<A HREF="LAMA/LAMA_LDHs.html">crystal structure of the detected region 
(alpha-2f to Beta-G) is a helix-turn followed by a helix or strand in 
different proteins</A> (Abad Zapatero et al., 1987; Grau et al., 1981; Iwata 
& Ohta, 1993).<P>

<A NAME="EC0157_LOGO">
<IMG SRC="LAMA/EC0157_.PSSM.logo.jpeg" HEIGHT=520 WIDTH=760></A><P>

<A NAME="TABLE2"></A><B>Blocks similar to composite hth block <A HREF="LAMA/EC0157_.blk">EC0157</A></B>
<TABLE BORDER>
<TR VALIGN=top><TH><PRE>Protein family (1)</PRE></TH><TH><PRE>Z score</PRE></TH></TR>
<TR VALIGN=top><TD><PRE>'Homeobox' domain proteins</PRE></TD><TD><PRE>18.4</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>'Homeobox' antennapedia-type proteins</PRE></TD><TD><PRE>13.2</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>'POU' domain proteins</PRE></TD><TD><PRE>11.7</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP crp family</PRE></TD><TD><PRE>12.1</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP gntR family</PRE></TD><TD><PRE>12.4</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP lysR family</PRE></TD><TD><PRE>14.4</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP lacI family (2)</PRE></TD><TD><PRE>11.7</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP luxR family</PRE></TD><TD><PRE>12.4</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP arsR family</PRE></TD><TD><PRE> 8.0</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP deoR family</PRE></TD><TD><PRE> 8.7</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP tetR family</PRE></TD><TD><PRE>14.1</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>Sigma-54 factors family</PRE></TD><TD><PRE> 7.8</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>Sigma-70 factors ECF subfamily</PRE></TD><TD><PRE> 8.3</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>Transposases, IS30 family</PRE></TD><TD><PRE>11.2</PRE></TD></TR>
<TR VALIGN=top></TR>
<TR VALIGN=top><TD><PRE>BRP araC family</PRE></TD><TD><PRE> 6.5</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP lacI family (2)</PRE></TD><TD><PRE> 6.6</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>Ribosomal S3 proteins</PRE></TD><TD><PRE> 5.8</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>L-lactate dehydrogenase family</PRE></TD><TD><PRE> 5.8</PRE></TD></TR>
</TABLE>
<PRE>
(1) The family Blocks Database entry numbers are in the previous figure 
    except for BRP araC family - BL00041, L-lactate dehydrogenase - BL00064D 
    and Ribosomal S3 proteins - BL00548A.
    The non-hth blocks are separated at the end of the table.
(2) Two blocks from the lacI hth family are similar to the composite hth block -
    block BL00356A, the hth region, and block BL00356B, the following
    DNA-binding hinge region.
</PRE>
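<P>
As a small worked example, the Python sketch below sorts the Z scores of 
the table above by the two cutoffs mentioned in this section (8.3 and 5.6). 
Applying the single-hit cutoff to this particular table, the tier names and 
the function name <TT>tier</TT> are assumptions made for illustration; the 
scores and the hth / non-hth split are copied from the table.
<PRE>
```python
SINGLE_HIT_Z = 8.3  # cutoff for single-hit relations, from the text
REPORT_Z = 5.6      # lower cutoff used for the lines in the figure

# (protein family, Z score, is an hth block), copied from the table above.
TABLE = [
    ("'Homeobox' domain proteins", 18.4, True),
    ("'Homeobox' antennapedia-type proteins", 13.2, True),
    ("'POU' domain proteins", 11.7, True),
    ("BRP crp family", 12.1, True),
    ("BRP gntR family", 12.4, True),
    ("BRP lysR family", 14.4, True),
    ("BRP lacI family", 11.7, True),
    ("BRP luxR family", 12.4, True),
    ("BRP arsR family", 8.0, True),
    ("BRP deoR family", 8.7, True),
    ("BRP tetR family", 14.1, True),
    ("Sigma-54 factors family", 7.8, True),
    ("Sigma-70 factors ECF subfamily", 8.3, True),
    ("Transposases, IS30 family", 11.2, True),
    ("BRP araC family", 6.5, False),
    ("BRP lacI family", 6.6, False),
    ("Ribosomal S3 proteins", 5.8, False),
    ("L-lactate dehydrogenase family", 5.8, False),
]


def tier(z):
    """Name the cutoff tier a Z score falls into (tier names are made up)."""
    if z >= SINGLE_HIT_Z:
        return "single-hit"
    if z >= REPORT_Z:
        return "multi-hit only"
    return "not reported"


# hth blocks that would be missed by the single-hit cutoff alone.
hth_below_single_hit = [name for name, z, is_hth in TABLE
                        if is_hth and tier(z) != "single-hit"]
# non-hth blocks that would pass the single-hit cutoff.
non_hth_single_hit = [name for name, z, is_hth in TABLE
                      if not is_hth and tier(z) == "single-hit"]

print(hth_below_single_hit)
print(non_hth_single_hit)
```
</PRE>
With the table values, two hth blocks (arsR and Sigma-54) fall below 8.3 
and none of the non-hth blocks reaches it.<P>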

     Identifying all the hth regions in the Blocks Database illustrates 
the potential of the multiple alignment comparison method as an aid for 
annotating protein-family databases. Besides identifying the function of 
unknown regions, the approach outlined in this example can be useful in 
annotating databases that generate the multiple alignments automatically. 
Multiple alignments of characterized protein motifs (such as the hth, 
nucleotide binding folds or leucine zipper) could be used to identify other 
multiple alignments containing these motifs.<P>

Altschul, S. F. & Lipman, D. J. (1990). Protein database searches for multiple alignments. <I>Proc. Natl. Acad. Sci. USA</I> <B>87</B>, 5509-5513.<P>

Abad Zapatero, C., Griffith, J., Sussman, J. & Rossmann, M. (1987). 
Refined crystal structure of dogfish M4 apo-lactate dehydrogenase. 
<I>J Mol Biol</I> <B>198</B>, 445-467.<P>

Brunelle, A. & Schleif, R. (1989). Determining residue-base interactions 
between AraC protein and araI DNA. <I>J Mol Biol</I> <B>209</B>, 607-622.<P>

Dubnau, J. & Struhl, G. (1996). RNA recognition and translational 
regulation by a homeodomain protein. <I>Nature</I> <B>379</B>, 694-699.<P>

Grau, U., Trommer, W. & Rossmann, M. (1981). Structure of the active 
ternary complex of pig heart lactate dehydrogenase with S-lac-NAD at 2.7 
A resolution. <I>J Mol Biol</I> <B>151</B>, 289-307.<P>

Green, P., Lipman, D., Hillier, L., Waterston, R., States, D. & Claverie, J. M. 
(1993). Ancient conserved regions in new gene sequences and the protein databases. 
<I>Science</I> <B>259</B>, 1711-1716.<P>

Iwata, S. & Ohta, T. (1993). Molecular basis of allosteric activation of 
bacterial L-lactate dehydrogenase. <I>J Mol Biol</I> <B>230</B>, 21-27.<P>

Koonin, E., Tatusov, R. & Rudd, K. (1995). Sequence similarity analysis of 
Escherichia coli proteins: functional and evolutionary implications. 
<I>Proc Natl Acad Sci USA</I> <B>92</B>, 11921-11925.<P>

Koonin, E. V., Bork, P. & Sander, C. (1994). Yeast chromosome III: 
new gene functions. <I>EMBO J.</I> <B>13</B>, 493-503.<P>

Lewis, M., Chang, G., Horton, N. C., Kercher, M. A., Pace, H. C., 
Schumacher, M. A., Brennan, R. G. & Lu, P. (1996). Crystal Structure of 
the Lactose Operon Repressor and Its Complexes with DNA and Inducer. 
<I>Science</I> <B>271</B>, 1247-1254.<P>
<HR>
</UL>


<A NAME="SUPPLMNT"><H2>Supplements</H2></A>
To calibrate the LAMA scores, the 
<A HREF="/blocks/about_blocks.html#blocks">Blocks Database</A>
was purged of <A HREF="#BIASED_BLOCKS">biassed blocks</A>, the PSSMs of 
the remaining blocks were each shuffled and then compared against the 
blocks from the unshuffled database. The best score from each of
the resulting 7 million comparisons was saved. These scores are due to chance
and were used to estimate the significance of alignment scores between blocks.
The mean and variance of chance alignment scores depend on the length of the 
compared blocks. Longer blocks will give longer alignments and higher scores
by chance alone. Grouping the chance scores by the length of the shorter 
block in each comparison gave very similar score distributions. The mean 
and standard deviation of each group were used to transform each score into
a <A HREF="#Z_SCORE">Z score</A>. The percentiles of all these Z scores were
then calculated. These percentiles are used to estimate the 
<A HREF="#EXPECTED">expected number</A> of times each score should appear by 
chance, that is, not due to a genuine relationship.<P>

Following are links to tables with this data. Note that the scores in the
tables are the raw scores of the alignments. The scores shown in the LAMA
output are normalized by dividing the raw score by the alignment length.

<UL>
<LI><A HREF="LAMA/LAMA.Z_stat.html">Mean and standard deviation for scores expected by chance</A>
<LI><A HREF="LAMA/LAMA.ZVp.html">Percentile of Z scores expected by chance</A>
</UL><P>
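The calculation described in this section can be sketched in Python as 
follows. This is a minimal illustration, not LAMA's code: the function 
names and every number in it are hypothetical, and the real means and 
standard deviations are the ones in the linked tables. It assumes that a 
score taken from the LAMA output is first converted back to a raw score 
(multiplied by the alignment length) before it is compared with the 
tabulated chance values.
<PRE>
```python
# Hypothetical calibration values: length of the shorter block mapped to
# the (mean, standard deviation) of the raw chance scores for that length.
CHANCE_STATS = {
    10: (120.0, 15.0),
    20: (200.0, 25.0),
}


def raw_score(output_score, alignment_length):
    """Undo the normalization applied to scores in the LAMA output."""
    return output_score * alignment_length


def z_score(raw, shorter_block_length, stats=CHANCE_STATS):
    """Z score of a raw score relative to the chance scores of its group."""
    mean, sd = stats[shorter_block_length]
    return (raw - mean) / sd


# Hypothetical output score of 16.0 for an alignment of 20 positions,
# where the shorter of the two compared blocks is 20 positions long.
raw = raw_score(16.0, 20)
print(raw, z_score(raw, 20))
```
</PRE>
Looking up the percentile of the resulting Z score in the second table 
then gives the basis for the expected number described above.<P>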

<A NAME="LAMA_CREDITS"><H2>Credits and citation</H2></A>
The multiple alignment comparison method and LAMA program were developed by
<A HREF="/~pietro">Shmuel Pietrokovski</A> 
in the lab of Steve Henikoff at the 
<A HREF="http://www.fhcrc.org">Fred Hutchinson Cancer Research Center</A>, 
<A HREF="http://www.cyberspace.com/bobk/">Seattle</A>.<P>
An article describing the method and its uses<BR>
"<STRONG>Searching Databases of Conserved Sequence Regions by 
 Aligning Protein Multiple-Alignments</STRONG>"<BR>
appeared in
<A HREF="http://www.oup.co.uk/oup/smj/journals/ed/titles/nar/Volume_24/Issue_19/6s0225_gml.abs.html">
Nucleic Acids Research 24(19) 3836-3845 (October 1996)</A>. 
This article should be cited in research using this method.<BR>
<HR>
<A HREF="/blocks">[Blocks home]</A> 
<A HREF="/blocks/blocks_search.html">[Block Searcher]</A>
<A HREF="/blocks/make_blocks.html">[Block Maker]</A>
<A HREF="/blocks-bin/getblock.sh">[Get Blocks]</A>
<A HREF="/blocks/block_formatter.html">[format a block]</A>
<A HREF="/blocks/biassed_blocks.html">[check for biassed blocks]</A>
<A HREF="/blocks-bin/LAMA_search.sh">[LAMA Searcher]</A>
<HR>
Page last modified <MODIFICATION_DATE>January 1997</MODIFICATION_DATE> 
(thanks to Liz G.Wiz for useful comments)

<Address>
<A HREF="/~pietro">Shmuel Pietrokovski</A>
</Address>