<TITLE>LAMA help</TITLE>
<H1><A HREF="/blocks/help/LAMA/LAMA_YWK.html">
<IMG ALIGN=MIDDLE SRC="/blocks/help/LAMA/llama.2.gif" HEIGHT=145 WIDTH=95></A>
<A HREF="/blocks-bin/LAMA_search.sh">LAMA</A> help</H1>
<UL>
<LI><A HREF="#LAMA">What does LAMA do?</A>
<LI><A HREF="#LAMA_FOR_ME">What can LAMA do for <B>me</B>?</A>
<LI><A HREF="#LAMA_HOW">How does LAMA align blocks?</A>
<LI><A HREF="#LAMA_INPUT">Input for LAMA</A>
<UL>
<LI><A HREF="#LAMA_INPUT_CONTENT">Content of input</A>
<LI><A HREF="#LAMA_INPUT_FORMAT">Format of input</A>
<LI><A HREF="#LAMA_OUTPUT_OPTIONS">Output options</A>
</UL>
<LI><A HREF="#LAMA_OUTPUT">Output from LAMA</A>
<UL>
<LI><A HREF="#LAMA_OUTPUT_FORMAT">Format and content of output</A>
<LI><A HREF="#EVALUATING_SCORES">Evaluating LAMA alignment scores</A>
</UL>
<LI><A HREF="#EXAMPLES">Examples</A>
<UL>
<LI><A HREF="#FLAVOPROTEINS">Flavoproteins FAD binding and catalytic sites</A>
<LI><A HREF="#ST_CD59">Snake toxins and the CD59 extracellular domain</A>
<LI><A HREF="#IS30">IS30 transposases DNA-binding domain</A>
<LI><A HREF="#HTH">Hth motifs in the Blocks Database</A>
</UL>
<LI><A HREF="#SUPPLMNT">Supplements</A>
<UL>
<LI><A HREF="LAMA/LAMA.Z_stat.html">Mean and standard deviation for scores expected by chance</A>
<LI><A HREF="LAMA/LAMA.ZVp.html">Percentile of Z scores expected by chance</A>
</UL>
<LI><A HREF="#LAMA_CREDITS">Credits and citation</A>
</UL>
<A NAME="LAMA"><H2>What does LAMA do?</H2></A>
LAMA (Local Alignment of Multiple Alignments) is a program for comparing
protein multiple sequence alignments with each other. The program can search
databases of such multiple alignments. The search is for sequence
similarities between conserved regions of protein families.
The method is sensitive, detecting weak sequence
relationships between protein families. Sequence similarities
beyond the range of conventional sequence database searches can be
detected by the method.<P>
<A NAME="LAMA_FOR_ME"><H2>What can LAMA do for <B>me</B>?</H2></A>
LAMA can identify protein families similar to your protein(s) of interest
and protein motifs similar to conserved regions in your protein(s). The
information known about these similar families and motifs can help you
identify the function and structure of your protein and locate critical
conserved regions in your protein(s). This can direct you in
designing experiments to test your hypotheses.<P>
LAMA compares <B>multiple</B> sequence alignments of proteins.
If you have only a <B>single</B> protein sequence you first need to
find other members of its family. The protein sequences also need to
be multiply aligned. The <A HREF="#LAMA_INPUT_CONTENT">Content of input</A>
section explains how to find related sequences and align them.<P>
<A NAME="LAMA_HOW"><H2>How does LAMA align blocks?</H2></A>
The multiple alignments are first transformed into position-specific
scoring matrices (<A HREF="PSSM_def.html">PSSMs</A>). Each column in
the PSSM corresponds to a position in the
alignment and has the amino acid distribution of that position. The
transformation into the PSSM is done with position-based sequence weights
(<A HREF="/blocks/papers/#SEQUENCE_WEIGHTS.ps">Henikoff & Henikoff, 1994a</A>)
and odds ratios between the amino acid frequencies
observed in the multiple alignments and the frequencies expected
from protein databases
(<A HREF="/blocks/papers/#BLOCKMAKER.ps">Henikoff & Henikoff, 1995</A>).
The transformation corrects possible overrepresentation of some
sequences by sequence weighting and considers the background
frequencies of the amino acids.
The method was tested and calibrated with ungapped local multiple alignments
(blocks) from the
<A HREF="/blocks/about_blocks.html#blocks">Blocks Database</A>.<P>
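The two ingredients named above, position-based sequence weights and odds ratios against background frequencies, can be sketched in Python as follows. This is a minimal illustration and not LAMA's own code: the function names, the toy alignment, and the uniform background of 0.05 per amino acid are assumptions made for the example, and any scaling LAMA applies to the final PSSM values is not reproduced here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_based_weights(sequences):
    """Position-based sequence weights, normalised to sum to 1.

    At each position a sequence receives 1 / (k * n), where k is the number
    of different residues at that position and n is the number of sequences
    sharing this sequence's residue there.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    weights = [0.0] * len(sequences)
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        kinds = len(counts)
        for idx, seq in enumerate(sequences):
            weights[idx] += 1.0 / (kinds * counts[seq[pos]])
    total = sum(weights)
    return [w / total for w in weights]

def odds_ratio_column(sequences, weights, pos, background):
    """Weighted observed frequency divided by background frequency, per amino acid."""
    observed = Counter()
    for seq, weight in zip(sequences, weights):
        observed[seq[pos]] += weight
    return {aa: observed[aa] / background[aa] for aa in background}

# Toy ungapped alignment (hypothetical block of four sequences).
toy_block = ["GYVGS", "GFDGF", "GYDGF", "GYQGG"]
# Placeholder background: every amino acid equally likely (an assumption).
uniform_background = {aa: 0.05 for aa in AMINO_ACIDS}

weights = position_based_weights(toy_block)
first_column = odds_ratio_column(toy_block, weights, 0, uniform_background)
print([round(w, 3) for w in weights])
print(round(first_column["G"], 1))
```

The third toy sequence shares most of its residues with the others and therefore gets the smallest weight, which is the overrepresentation correction described above.<P>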
The matrices are treated as sequences of columns, enabling
their alignment with one another. To use algorithms developed for
aligning single sequences we need a measure for comparing pairs of
matrix columns. This corresponds to the substitution matrices
(PAM, BLOSUM etc.) used in single-sequence alignments. The
measure used in our method to score the similarity between pairs of
matrix columns is the Pearson correlation coefficient <A NAME="Pr">(r)</A>:
<IMG SRC = "LAMA/LAMA_r.gif" HEIGHT=69 WIDTH=181>
where A(i) and B(i) are the values of amino acid i in columns A and B,
respectively, and /A and /B
are the means of the values in columns A and B.
The correlation score ranges from 1 for columns with identical
amino acid distributions to -1 for columns with opposite
distributions (for example, only 10 amino acids occur in each column,
with equal values, and those 10 amino acids are different in the two
compared columns).<P>
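The column comparison measure described above can be sketched in a few lines of Python. This is only an illustration of the Pearson correlation coefficient, not LAMA's own code; the function name and the toy columns are assumptions made for the example.

```python
from math import sqrt

def column_correlation(col_a, col_b):
    """Pearson correlation coefficient r between two PSSM columns."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of entries")
    n = len(col_a)
    mean_a = sum(col_a) / n
    mean_b = sum(col_b) / n
    covariance = sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b))
    spread_a = sum((a - mean_a) ** 2 for a in col_a)
    spread_b = sum((b - mean_b) ** 2 for b in col_b)
    if spread_a == 0 or spread_b == 0:
        raise ValueError("r is undefined for a column whose values are all equal")
    return covariance / sqrt(spread_a * spread_b)

# Toy columns over 20 amino acids (hypothetical values).
conserved = [100] + [0] * 19          # a single amino acid dominates
first_ten = [10] * 10 + [0] * 10      # only the first 10 amino acids occur
last_ten = [0] * 10 + [10] * 10       # only the other 10 amino acids occur

r_identical = column_correlation(conserved, conserved)
r_opposite = column_correlation(first_ten, last_ten)
print(r_identical, r_opposite)
```

The two toy comparisons reproduce the extremes mentioned above: identical columns score 1 and the two non-overlapping columns score -1.<P>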
The score of a block-to-block alignment is the sum of the scores from
comparing the corresponding columns in the two block matrices: <BR>
<IMG SRC = "LAMA/LAMA_algorithm.gif" HEIGHT=477 WIDTH=438>
<PRE>
Local alignment of blocks.
Positions 2 to 7 from block A aligned with positions 4 to 9 from
block B. A column comparison score, <STRONG>s(Xn*Ym)</STRONG>, is calculated for
each pair of positions (A2*B4 to A7*B9). The score of the alignment
of the two segments, <STRONG><I>S</I></STRONG>, is the sum of the column comparison scores.
</PRE>
The alignment is done using the Smith-Waterman algorithm for optimal
local alignments. No gaps are allowed since the aligned objects are
short conserved sequence regions. All alignments above the cutoff score
are reported for each pair of compared blocks. There may be cases where parts
of one long block are similar to several blocks:
<STRONG><PRE>
AAAAAAAAAAAAAAAAAAA
BBB CCCCCC
</PRE></STRONG>
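Because no gaps are allowed, the optimal local alignment reduces to finding the best-scoring run along any diagonal of the column-score matrix. The following Python sketch shows that recurrence under stated assumptions: the function name and the toy score matrix are hypothetical, only the single best alignment is returned (LAMA reports all alignments above the cutoff), and this is not LAMA's own implementation.

```python
def best_ungapped_alignment(scores):
    """Best ungapped local alignment by summed column scores.

    scores[i][j] is the column score for position i of block A against
    position j of block B. Returns (best_sum, start_a, start_b, length)
    with 0-based start positions.
    """
    rows, cols = len(scores), len(scores[0])
    best = (0.0, 0, 0, 0)
    # prev_sum[j] / prev_len[j] describe the run ending at cell (i-1, j-1).
    prev_sum = [0.0] * (cols + 1)
    prev_len = [0] * (cols + 1)
    for i in range(rows):
        cur_sum = [0.0] * (cols + 1)
        cur_len = [0] * (cols + 1)
        for j in range(cols):
            total = prev_sum[j] + scores[i][j]
            if total > 0:
                # Extend the diagonal run; otherwise it is reset to zero.
                cur_sum[j + 1] = total
                cur_len[j + 1] = prev_len[j] + 1
                if total > best[0]:
                    run = cur_len[j + 1]
                    best = (total, i - run + 1, j - run + 1, run)
        prev_sum, prev_len = cur_sum, cur_len
    return best

# Toy matrix (hypothetical values): 8 positions in block A, 10 in block B,
# weakly negative everywhere except one diagonal run of six similar columns.
toy = [[-0.2] * 10 for _ in range(8)]
for k in range(6):
    toy[1 + k][3 + k] = 0.9

total, start_a, start_b, length = best_ungapped_alignment(toy)
print(f"A {start_a + 1}:{start_a + length}  B {start_b + 1}:{start_b + length}  sum {total:.1f}")
```

With these toy values the best run covers positions 2 to 7 of block A and 4 to 9 of block B, the same layout as in the figure caption above.<P>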
<A NAME="LAMA_INPUT"><H2>Input for LAMA</H2></A>
<A NAME="LAMA_INPUT_CONTENT"><H3>Content of input</H3></A>
LAMA can compare any multiple alignment if it is in the correct format.
<I>However</I>, the
<A HREF="#Pr">column comparison measure</A> and
the <A HREF="#SCORE_SIGNIF">significance estimation</A> of the scores
are appropriate for protein sequence blocks - ungapped conserved multiple
alignments. The use of other types of multiple alignments, such as global
multiple alignments that include many gaps, may give misleading results.
For example, the resulting alignments may not be optimal, or their
significance may differ from what the output suggests.
<P>
If you only have a single protein sequence or want to find more protein
sequences related to yours you can search the sequence databases.
One way to do this on the WWW is using the
<A HREF="http://www.ncbi.nlm.nih.gov/cgi-bin/BLAST/nph-blast?Jform=0">
BLAST program</A> to search the
<A HREF="http://www.ncbi.nlm.nih.gov/index.html">NCBI</A> sequence databases.
Links to other search methods can be found at
the Baylor College of Medicine Human Genome Center
<A HREF="http://dot.imgen.bcm.tmc.edu:9331/seq-search/protein-search.html">
Search Launcher site</A>.<P>
The <A HREF = "/blocks/make_blocks.html">BlockMaker</A> WWW site can be used
for finding blocks in your group of related protein sequences. There are
various other methods for making protein multiple sequence alignments.
Among these are the
<A HREF="http://meme.sdsc.edu/meme/website/meme-intro.html">
MEME system</A>,
<A HREF = "http://www3.ncbi.nlm.nih.gov:80/htbin-post/Entrez/query?uid=94023958&form=6&db=m&Dopt=r">
Gibbs sampling programs</A>,
the <A HREF="http://www3.ncbi.nlm.nih.gov:80/htbin-post/Entrez/query?uid=91172743&form=6&db=m&Dopt=r">
MACAW interactive program</A>, and
the <A HREF = "http://www3.ncbi.nlm.nih.gov:80/htbin-post/Entrez/query?uid=95075648&form=6&db=m&Dopt=r">
CLUSTAL-W progressive multiple alignment program</A>.
Several of these methods are available through the
<A HREF="http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html">
multiple sequence alignment page</A>
at the Baylor College of Medicine Human Genome Center.
<P>
Multiple alignments submitted to the program should be of conserved,
relatively ungapped, protein sequence regions. A few gaps in the
alignment are acceptable. The more sequences in the alignment, the
better. In general, avoid alignments with fewer than 4 sequences.
<P>
<A NAME="LAMA_INPUT_FORMAT"><H3>Format of input</H3></A>
LAMA only accepts input in the
<A HREF="/blocks/blocks_format.html">Block format</A>. Other multiple
alignments can be <A HREF="/blocks/block_formatter.html">
reformatted to the Block format</A>. If you are not sure of your
multiple alignment or just have a group of <STRONG>related</STRONG>
sequences you can use the
<A HREF = "/blocks/make_blocks.html">BlockMaker program</A> for
finding blocks in the sequences. Note that blocks include sequence weights
to avoid biassed sequence representation.<P>
<A NAME="LAMA_OUTPUT_OPTIONS"><H3>Output options</H3></A>
<UL>
<LI><A NAME="OUTPUT">Output level</A><BR>
The <A HREF="#LAMA_OUTPUT">standard output</A> displays pairs of
blocks with alignment scores above a <A HREF = "#SCORE_CUTOFF">
Z score cutoff</A>. When both target and query blocks are given
by the user there are options for also seeing the
<A HREF="#Pr">column scores</A> composing the alignment score
for <I>every</I> reported alignment and the <A HREF="#LAMA_HOW">PSSMs</A>
of <I>all</I> the compared blocks.<BR>
<LI><A NAME="CUTOFF">Score cutoff</A><BR>
The default cutoff is a Z score of 5.6. When both target and query
blocks are given by the user, different cutoffs can be specified.
Giving a lower value will allow reporting of weaker alignments.
Alignments with low values can occur by chance between unrelated
blocks. Raising the cutoff score may exclude potentially genuine
alignments. The <A HREF = "#EXPECTED">expected number</A> of
occurrences should be used to <A HREF = "#EVALUATING_SCORES">
evaluate the alignment scores</A>.
</UL>
Some of the <A HREF="#EXAMPLES">examples</A> included in this document
illustrate the use of the options.<P>
<H2><A NAME="LAMA_OUTPUT">Output from LAMA</A></H2>
<H3><A NAME="LAMA_OUTPUT_FORMAT">Format and content of output</A></H3>
<pre><HR>
LAMA version 1.00 October 96.
Minimal length of reported alignments 4
Score cutoff is 5.6 Z score units (in the top 7.7e-05 percentile of chance scores)
alignment Z-score expected number for
block 1 from:to block 2 from:to length searching 5000 blocks
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A> 20 : 46 and <A HREF="/blocks-bin/getblock.sh?BL00042#BL00042B">BL00042B</A> 3 : 29 (27) score 39 ( 7.2 1.3e-02) [<A HREF="/blocks-bin/LAMA_show_alignment?/howard/blocks/bin/blocks.dat+BL01063B+2+/howard/blocks/bin/blocks.dat+BL00042B+3+27"><IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment"></A> <A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+20+/howard/blocks/bin/blocks.dat+BL00042B+3+27"><IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos"></A><A HREF="/about_logos.html">?</A>]
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A> 5 : 39 and <A HREF="/blocks-bin/getblock.sh?BL00324#BL00324C">BL00324C</A> 3 : 37 (35) score 27 ( 6.1 1.5e-01) [<A HREF="/blocks-bin/LAMA_show_alignment?/howard/blocks/bin/blocks.dat+BL01063B+5+/howard/blocks/bin/blocks.dat+BL00324C+3+35"><IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment"></A> <A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+5+/howard/blocks/bin/blocks.dat+BL00324C+3+35"><IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos"></A><A HREF="/about_logos.html">?</A>]
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A> 12 : 47 and <A HREF="/blocks-bin/getblock.sh?BL00622#BL00622">BL00622</A> 8 : 43 (36) score 33 ( 8.2 0.0e+00) [<A HREF="/blocks-bin/LAMA_show_alignment?/howard/blocks/bin/blocks.dat+BL01063B+12+/howard/blocks/bin/blocks.dat+BL00622+8+36"><IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment"></A> <A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+12+/howard/blocks/bin/blocks.dat+BL00622+8+36"><IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos"></A><A HREF="/about_logos.html">?</A>]
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A> 10 : 46 and <A HREF="/blocks-bin/getblock.sh?BL00894#BL00894A">BL00894A</A> 1 : 37 (37) score 26 ( 5.7 3.2e-01) [<A HREF="/blocks-bin/LAMA_show_alignment?/howard/blocks/bin/blocks.dat+BL01063B+10+/howard/blocks/bin/blocks.dat+BL00894A+1+37"><IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment"></A> <A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+10+/howard/blocks/bin/blocks.dat+BL00894A+1+37"><IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos"></A><A HREF="/about_logos.html">?</A>]
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A> 4 : 42 and <A HREF="/blocks-bin/getblock.sh?BL01043#BL01043A">BL01043A</A> 2 : 40 (39) score 29 ( 8.1 0.0e+00) [<A HREF="/blocks-bin/LAMA_show_alignment?/howard/blocks/bin/blocks.dat+BL01063B+4+/howard/blocks/bin/blocks.dat+BL01043A+2+39"><IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment"></A> <A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+4+/howard/blocks/bin/blocks.dat+BL01043A+2+39"><IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos"></A><A HREF="/about_logos.html">?</A>]
</pre>
The program version and execution parameters head the search output.
Only alignments of at least the <STRONG>minimal length</STRONG> will
be reported. The significance of very short alignments (fewer than 4
positions)
cannot be reliably estimated. Alignments with scores equal to or above
the <A NAME="SCORE_CUTOFF"><STRONG>score cutoff</STRONG></A> will be reported.
The score cutoff is specified as a <STRONG>Z score</STRONG>.
<A NAME="Z_SCORE">Z score</A> is
the number of standard deviations between the score and the mean score.
<A NAME="SHUFFLED_SCORES">T</A>he mean score and the standard deviations
were calculated for the random scores from the alignment of a large number
of shuffled unbiassed blocks (7 million block pairs;
see <A HREF="#SUPPLMNT">first supplement</A>).
The <STRONG>Z score</STRONG> is related to the percentile of the score
in the shuffled blocks scores. This dependence is not linear but sigmoidal
(see <A HREF="#SUPPLMNT">second supplement</A>).<BR>
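As a small worked illustration of the Z score definition above, the following Python sketch converts a score into standard deviation units. The function name and the calibration values are hypothetical and chosen only to make the arithmetic easy to follow; the real mean and standard deviation come from the shuffled-block calibration described in the first supplement.

```python
def z_score(score, chance_mean, chance_sd):
    """Number of standard deviations between a score and the mean chance score."""
    if chance_sd <= 0:
        raise ValueError("the standard deviation must be positive")
    return (score - chance_mean) / chance_sd

# Hypothetical calibration values, not taken from the LAMA supplements.
example_z = z_score(30.0, 10.0, 4.0)
print(example_z)  # 5.0
```
<P>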
For each reported alignment the program shows the names of the two
<STRONG>aligned blocks</STRONG>,
their <STRONG>position</STRONG> relative to one another,
the <STRONG>alignment length</STRONG>,
the <STRONG>score</STRONG>,
and the <STRONG>expected number</STRONG>
of such scores when searching a given number of blocks.
<A NAME="EXPECTED">T</A>he expected number is for chance (random)
alignments of unbiassed blocks.
It is calculated from the score percentiles between the shuffled
unbiassed blocks.
In this example the expected number is for searching 5000 blocks.
Blocks from the
<A HREF="/blocks/about_blocks.html#blocks">Blocks Database</A>
and from the
<A HREF="/blocks/about_blocks.html#prints">Prints database</A>
will be linked to the database entries. The "alignment" link
(<IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment">)
shows the alignment of the two blocks. This can also be seen by
following the "logos"
(<IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos">)
link that shows the <A HREF="/blocks/about_logos.html">sequence logos</A>
of aligned pairs of blocks.
<A HREF="/blocks/about_logos.html">Sequence logos</A> are graphical representations
of the blocks.
For example,
<A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+12+/howard/blocks/bin/blocks.dat+BL00622+8+36">here</A>
(PostScript viewer required) the logo of block
<A HREF="/blocks-bin/getblock.sh?BL00622#BL00622">BL00622</A>
is shifted 4 positions relative to the logo of block
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A>
so that their similar segments (8-43 and 12-47) are aligned.
Indeed, these segments both contain helix-turn-helix DNA binding motifs.
<P>
When both query and target blocks are provided by the user the
<A HREF="#OUTPUT">output</A> can also contain the column scores
of each reported alignment and the <A HREF="#LAMA_HOW">PSSMs</A>
of every compared block.
<P>
Pay attention to any error or warning messages. Most will probably
have to do with the <A HREF="#LAMA_INPUT_FORMAT">format of the input</A>.
<P>
<A NAME="EVALUATING_SCORES"><H3>Evaluating LAMA alignment scores</H3></A>
The reported alignment score is the average of the
<A HREF="#Pr">column scores</A> in the alignment multiplied by 100.
Since the column scores have a range of -1 to 1 the alignment score
will range from -100 to 100. An alignment score of 46 means
that on average the aligned positions had a correlation coefficient
of 0.46. <I>The significance of the alignment score depends on the
length of the compared blocks.</I> Alignments between longer blocks
will tend to be longer and have higher scores.
The <A HREF="#Z_SCORE">Z score</A> and
<A HREF="#EXPECTED">expected number</A> let us estimate the
<A NAME="SCORE_SIGNIF">significance of the scores</A>
and compare alignments of different lengths.
The higher the Z score the less likely the alignment is due
to chance. <I>How unlikely depends on the number of blocks searched.
The more blocks searched, the greater the probability of finding high
scores by chance.</I> For example, the output of the calibration with the
<A HREF="#SHUFFLED_SCORES">shuffled blocks</A> contained 7 million
scores but no alignments with Z scores greater than 8.3.
Hence an alignment with a Z score equal to or higher than 8.3
is unlikely to occur by chance in a comparable or smaller number
of alignments. The expected number shows this directly.
The expected number is shown for searching 5000 blocks since version 9.1 of the
<A HREF="/blocks/help/blocks_release.html">Blocks Database</A>
contains 3300 blocks. For example, searching this release of
the Blocks Database and finding an alignment expected to appear
1.8e-01 times (0.18) suggests that this alignment is not due to chance.
Alignments with expected occurrences of 7.5e-03 or even 0 are almost
certainly genuine (or due to <A HREF="#BIASED_BLOCKS">biassed blocks</A>,
see <A HREF="#TABLE1">below</A>).<BR>
A relation between two families by a single pair of blocks with a
high Z score is termed a <STRONG>single hit</STRONG>.
However, protein families often have a number of blocks.
A <STRONG>multiple hit</STRONG> is when two or more block pairs
from the same two families are similar:
<STRONG><PRE>
multiple hit
Family 1, blocks 1A, 1B, 1C, 1D. 1A=2B + 1D=2C
Family 2, blocks 2A, 2B, 2C.
</PRE></STRONG>
We expect the order of the blocks in the hit to be the same in both
families (in this example 1A -> 1D and 2B -> 2C).<BR>
Individual block pairs whose Z scores are likely to occur by chance
can still indicate a genuine relation if they
are in a multiple hit. The shuffled blocks scores contained
no single hit with a Z score above 8.3 and no multiple hits
with Z scores greater than 5.6. Hence genuine relationships can also
be indicated by <I>several</I> alignments whose Z scores are
<I>individually</I> expected to occur by chance.<P>
When comparing blocks against a database the Z score cutoff is set to 5.6,
corresponding to an expected occurrence of 0.385 when searching 5000 blocks.
When both query and target blocks are provided other cutoffs can be
<A HREF="#CUTOFF">chosen</A>.
<P>
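The figures quoted for the default cutoff are consistent with a simple product: the sample output header gives 7.7e-05 as the share of chance scores at or above a Z score of 5.6, and multiplying that value by 5000 blocks gives 0.385. The Python sketch below shows this arithmetic. Treating the header value as a fraction, and the function name itself, are assumptions inferred from these two numbers rather than a description of LAMA's code.

```python
def expected_chance_alignments(tail_fraction, blocks_searched):
    """Expected number of chance alignments at or above a given score."""
    if tail_fraction < 0 or blocks_searched < 0:
        raise ValueError("both arguments must be non-negative")
    return tail_fraction * blocks_searched

# Values taken from the sample output header and the default cutoff text.
expected_at_cutoff = expected_chance_alignments(7.7e-05, 5000)
print(round(expected_at_cutoff, 3))  # 0.385
```

Under the same assumption, searching fewer blocks lowers the expected number proportionally, which is why the significance of a given Z score depends on the size of the search.<P>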
False positive (high score but no relation) and false negative
(low score but genuine relation) hits are still possible and
biological knowledge and common sense should be used.
<A NAME="BIASED_BLOCKS">Compositionally</A>
biassed blocks (consisting of sequence segments rich in a few amino
acids or short repeats) are a common cause for false positive hits.
You can check if a block is biassed <A HREF="/blocks/biassed_blocks.html">here</A>.
False negative hits can be caused by misalignment in the blocks.
<P>
<A NAME="TABLE1">E</A>ach entry in the
<A HREF="/blocks/about_blocks.html#blocks">Blocks Database</A>
version 8.6 (3174 blocks from 858 protein families)
was searched against the other entries in the database.
All block pairs with Z scores larger than 5.6 were saved.
Protein families related by more than one saved score were
considered as multiple hits and alignments with Z scores
above 8.3 as single hits. This resulted in 141 pairs of families.
Eighty percent of these were
identified as genuine relationships (true positives) according to the
family descriptions, by sharing common sequences, or by detailed
examination. Compositional bias was responsible for another eight percent
of the high scores. The remaining twelve percent of the high scores could
not be classified as either genuine or false based on available evidence.<P>
<TABLE BORDER WIDTH=532>
<CAPTION>Distribution of top scoring family pairs</CAPTION>
<TR VALIGN=top><TD>Relation type</TD><TD>Genuine (1)</TD><TD>Biassed<BR>composition</TD><TD>Unknown</TD><TD><B>Total</B></TD></TR>
<TR VALIGN=top><TD>Multiple block hits - independent (2)</TD><TD ALIGN=right>24</TD><TD ALIGN=right>-</TD><TD ALIGN=right>1</TD><TD ALIGN=right><B>25</B></TD></TR>
<TR VALIGN=top><TD>Multiple block hits - repeats (3)</TD><TD ALIGN=right>11</TD><TD ALIGN=right>6</TD><TD ALIGN=right>9</TD><TD ALIGN=right><B>26</B></TD></TR>
<TR VALIGN=top><TD>Multiple block hits - inner repeats (4)</TD><TD ALIGN=right>15</TD><TD ALIGN=right>4</TD><TD ALIGN=right>2</TD><TD ALIGN=right><B>21</B></TD></TR>
<TR VALIGN=top><TD>Single block hits</TD><TD ALIGN=right>63</TD><TD ALIGN=right>1</TD><TD ALIGN=right>5</TD><TD ALIGN=right><B>69</B></TD></TR>
<TR VALIGN=top><TD><B>Total</B></TD><TD ALIGN=right><B>113</B></TD><TD ALIGN=right><B>11</B></TD><TD ALIGN=right><B>17</B></TD><TD ALIGN=right><B>141</B></TD></TR>
<TR VALIGN=top><TD><B>Fraction</B></TD><TD ALIGN=right><B>80%</B></TD><TD ALIGN=right><B>8%</B></TD><TD ALIGN=right><B>12%</B></TD><TD></TD></TR>
</TABLE>
<BR>
<PRE>
(1) Genuine relations were identified by the families' PROSITE descriptions,
detailed analysis of the literature or by sharing common sequences
(22 of the single and independent-multiple hits).
(2) An independent multiple hit is two different protein families
related by two or more different block pairs.
(3) A repeat multiple hit is two different protein families where a
block from one family is similar to two or more blocks from the
other family.
(4) An inner-repeat multiple hit is a case where the similarities are
between blocks from the same family.
</PRE>
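The rule used in the database self-comparison above (save every block pair with a Z score above 5.6, call a family pair a multiple hit when it has more than one saved pair, and a single hit when its one saved pair is above 8.3) can be sketched in Python as follows. This is an illustration, not the procedure's actual code: the function names are hypothetical, and deriving the family name by dropping a trailing block letter is an assumption based on names such as BL00504A. The FAM1 and FAM2 entries are made-up names echoing the multiple hit diagram.

```python
import re
from collections import defaultdict

SAVE_CUTOFF = 5.6        # block pairs above this Z score are saved
SINGLE_HIT_CUTOFF = 8.3  # a lone saved pair must exceed this Z score

def family_of(block_name):
    """Assumption: a block name is a family accession plus an optional letter."""
    return re.sub(r"(?<=\d)[A-Z]$", "", block_name)

def classify_family_pairs(hits):
    """Map each pair of families to 'multiple hit' or 'single hit'.

    hits is an iterable of (block_1, block_2, z_score) tuples.
    Family pairs that satisfy neither rule are left out of the result.
    """
    saved = defaultdict(list)
    for block_1, block_2, z in hits:
        if z > SAVE_CUTOFF:
            pair = tuple(sorted((family_of(block_1), family_of(block_2))))
            saved[pair].append(z)
    result = {}
    for pair, z_scores in saved.items():
        if len(z_scores) > 1:
            result[pair] = "multiple hit"
        elif z_scores[0] > SINGLE_HIT_CUTOFF:
            result[pair] = "single hit"
    return result

example_hits = [
    ("BL00504A", "BL00677A", 10.0),  # Z scores from the flavoprotein example
    ("BL00504D", "BL00677D", 5.5),   # below the save cutoff, so not counted
    ("BL01063B", "BL00622", 8.2),    # one saved pair, not above 8.3
    ("FAM1A", "FAM2B", 6.0),         # hypothetical families
    ("FAM1D", "FAM2C", 5.9),
]
classified = classify_family_pairs(example_hits)
for pair, label in sorted(classified.items()):
    print(pair, label)
```

Note that with the default cutoff the flavoprotein pair counts as a single hit; the supporting second alignment only appears when a lower cutoff is requested.<P>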
<A NAME="EXAMPLES"><H2>Examples</H2></A>
<UL>
<LI><A NAME="FLAVOPROTEINS"><H3>Flavoproteins FAD binding and catalytic sites</H3></A><P>
A comparison of all the Blocks Database v8.6 entries with each other
found the following hit between FAD flavoprotein subunits from two
oxidoreductase enzyme complexes, BL00504 - succinate dehydrogenases
(Sdh) and fumarate reductases (Frd) and BL00677 - D-amino acid oxidases (DAO):
<PRE>
alignment Z-score expected number for
block 1 from:to block 2 from:to length searching 5000 blocks
BL00504A 2 : 20 and BL00677A 2 : 20 (19) score 51 (10.0 0.0e+00) [<A HREF="/blocks-bin/LAMA_logos?/blocks/help/LAMA/BL00504.dat+BL00504A+2+/blocks/help/LAMA/BL00677.dat+BL00677A+2+19">logos</A> <A HREF="/blocks/about_logos.html">?</A>]
</PRE>
A comparison with a lower cutoff found another hit supporting the first one:
<PRE>
BL00504D 3 : 35 and BL00677D 17 : 49 (33) score 26 ( 5.5 5.1e-01) [<A HREF="/blocks-bin/LAMA_logos?/blocks/help/LAMA/BL00504.dat+BL00504D+3+/blocks/help/LAMA/BL00677.dat+BL00677D+17+33">logos</A> <A HREF="/blocks/about_logos.html">?</A>]
</PRE>
Sequence annotations and a literature search revealed that block BL00504A
is the FAD-binding site and BL00504D is the active site (Birch-Machin
<I>et al</I>., 1992) of the Sdh/Frd flavoproteins. Block BL00677A is
the FAD-binding site of the DAO proteins. The FAD AMP-binding sites in
both families are beta-alpha-beta ADP binding folds and were already
noted as such (Birch-Machin <I>et al</I>., 1992; Schulz <I>et al</I>., 1982).
This explains the first hit.
<P>
The DAO BL00677D block has a conserved histidine important for
enzymatic activity of pig DAO (Miyano <I>et al</I>., 1991). This histidine
is aligned with a conserved and essential histidine in the Sdh/Frd
flavoprotein catalytic site (Birch-Machin <I>et al</I>., 1992; Schroder
<I>et al</I>., 1991). Other positions in these aligned regions are also
similar (column scores 0.31 to 0.98). The dissimilar positions have
column scores close to zero (0.04 to -0.14). This finding suggests
that the active site of DAO flavoproteins is in the BL00677D region with
the conserved histidine as the crucial residue.
<P>
BLAST and FASTA searches of the SwissProt protein database could
not identify this similarity. No sequence from one family identified
any sequence from the other family. Optimal local alignments of all the
sequence pairs from the two families had scores expected by chance.
Searching the Blocks Database with the sequences from the two families
identified the relation between the families with 6 Sdh/Frd flavoprotein
sequences (multiple hits with 98.1 to 76.2 percentiles of scores with
shuffled sequence queries and P values of 8.4*10<SUP>-3</SUP> to 1.1*10<SUP>-1</SUP>) but not
with the other two sequences from that family or any of the sequences
from the DAO family (single hits with less than 60.0 score percentiles).
<P>
<IMG SRC="LAMA/LAMA_flavoproteins.gif" HEIGHT=555 WIDTH=503>
<PRE>
Suggested catalytic site of DAO flavoproteins.
A, positions 17-49 of DAO flavoproteins (block BL00677D) aligned with
the catalytic region of Sdh/Frd flavoproteins (positions 3-35 of block
BL00504D). The histidines important for the enzymes' catalytic activity
are outlined (the histidine in sequence DHSA_BACSU is misaligned due to
a two aa insertion). The start and end coordinates flank the sequences.
B, the column scores of the alignment.
</PRE>
<P>
Birch-Machin, M. A., Farnsworth, L., Ackrell, B. A., Cochran, B., Jackson, S.,
Bindoff, L. A., Aitken, A., Diamond, A. G. & Turnbull, D. M. (1992).
The sequence of the flavoprotein subunit of bovine heart succinate
dehydrogenase. <I>J. Biol. Chem.</I> <B>267</B>, 11553-11558.<P>
Miyano, M., Fukui, K., Watanabe, F., Takahashi, S., Tada, M., Kanashiro, M. &
Miyake, Y. (1991). Studies on Phe-228 and Leu-307 recombinant mutants of
porcine kidney D-amino acid oxidase: expression, purification, and
characterization. <I>J. Biochemistry</I> <B>109</B>, 171-177.<P>
Schroder, I., Gunsalus, R. P., Ackrell, B. A., Cochran, B. & Cecchini, G.
(1991). Identification of active site residues of Escherichia coli fumarate
reductase by site-directed mutagenesis. <I>J. Biol. Chem.</I> <B>266</B>,
13572-13579.<P>
Schulz, G. E., Schirmer, R. H. & Pai, E. F. (1982). FAD-binding site of
glutathione reductase. <I>J. Mol. Biol.</I> <B>160</B>, 287-308.
<HR>
<LI><A NAME="ST_CD59"><H3>Snake toxins and the CD59 extracellular domain</H3></A><P>
Conserved regions from snake toxins and the CD59 extracellular domain were found
to be similar to each other. The alignment score is not very striking, and the two families
seem to be quite dissimilar. What is the connection between snake toxins, small
extracellular proteins that bind to nerve receptors, and the CD59 domain, a domain
that is found in one or more copies on GPI-linked cell surface glycoproteins?
A closer look at the alignment was taken by requesting the column scores.
These scores are shown above the score line for each of the 12 alignment positions
(8,3 to 19,14):
<PRE>
Column scores for optimal alignment of <A HREF="/blocks-bin/getblock.sh?BL00272#BL00272B">BL00272B</A> and <A HREF="/blocks-bin/getblock.sh?BL00983#BL00983B">BL00983B</A> -
8, 3 9, 4 10, 5 11, 6 12, 7 13, 8 14, 9 15,10 16,11 17,12 18,13 19,14
0.262 0.169 0.138 0.286 0.995 1.000 0.368 0.224 0.986 -0.067 1.000 1.000
<A HREF="/blocks-bin/getblock.sh?BL00272#BL00272B">BL00272B</A> 8 : 19 and <A HREF="/blocks-bin/getblock.sh?BL00983#BL00983B">BL00983B</A> 3 : 14 (12) score 53 ( 6.5 6.0e-02) [<A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL00272B+8+/howard/blocks/bin/blocks.dat+BL00983B+3+12">logos</A> <A HREF="/blocks/about_logos.html">?</A>]
</PRE>
Five of the positions [(12,7), (13,8), (16,11), (18,13) and (19,14)]
have very high column scores (0.986-1.000),
indicating identical or almost identical amino acid distributions in these
column pairs. The other positions contribute less to the alignment score,
and position (17,12) has a slightly negative score, actually detracting from
the alignment.<P>
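The reported score of 53 can be reproduced from the twelve column scores listed above by averaging them and multiplying by 100. The short Python sketch below does this; the function and variable names are chosen for the illustration only.

```python
# Column scores copied from the BL00272B / BL00983B output above.
column_scores = [0.262, 0.169, 0.138, 0.286, 0.995, 1.000,
                 0.368, 0.224, 0.986, -0.067, 1.000, 1.000]
# Paired positions (block 1, block 2), from 8,3 to 19,14.
positions = list(zip(range(8, 20), range(3, 15)))

def alignment_score(scores):
    """Average column score multiplied by 100."""
    return 100 * sum(scores) / len(scores)

reported = round(alignment_score(column_scores))
lowest_score, lowest_position = min(zip(column_scores, positions))
print(reported)                       # 53
print(lowest_position, lowest_score)  # (17, 12) -0.067
```

The same listing makes it easy to pick out the columns that contribute least to the alignment.<P>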
Upon requesting to see the PSSMs of the blocks (below) or their aligned
logos (link to 'logos' above) you will note
that three of the highest-scoring alignment positions [(12,7), (13,8) and (18,13)]
are highly conserved cysteine residues. This raises the possibility of identical
patterns of disulphide bonds in both regions. This
alignment may deserve more attention, since disulphide bonds are known to be well
conserved even between distantly related sequences.
More information can be found by following the block links to the
<A HREF="/blocks">Blocks Database</A>
entries. Each family is accompanied by its
<A HREF="http://www.ebi.ac.uk/interpro/">InterPro</A>
annotation, and the multiple alignment of each block can be
viewed as a graphical
<A HREF="/blocks/help/about_logos.html">sequence logo</A>.
The <A HREF="/blocks/help/LAMA/LAMA_cardiotoxin+CD59.JPEG">structures of both proteins</A>
are known and confirm their relation. (The
<A HREF="http://www.expasy.ch/sw3d/">SWISS-3DIMAGE</A>
was the source for these images of the structures.)
<P>
<PRE>
PSSM of <A HREF="/blocks-bin/getblock.sh?BL00272#BL00272B">BL00272B</A>
  |                                       1   1   1   1   1   1   1   1   1   1
  |   1   2   3   4   5   6   7   8   9   0   1   2   3   4   5   6   7   8   9
--+----------------------------------------------------------------------------
A |   0   0   0  13   0   3   0   0   1   0   2   0   0   0   1   0   2   0   0
C |  87  12  12  11   0   0   0   0  21   0   6 100 100   0   0   0   0  99   0
D |   0   0   3   2  11   9   3   2   6   0   0   0   0   3   0  82  10   0   2
E |   0   5   2   8   3   5   2   9   4   6   9   0   0   7   0   5   5   0   0
F |   2   3   9   0   0   2   6   4   2   2   0   0   0   0   0   0   0   0   0
G |   1   1   1   1   3   0  24   8   7   0   0   0   0   2   3   0   1   0   2
H |   0   0   2   0   0   4   2   0   4   0  13   0   0   6   0   0   0   0   0
I |   0   0   4   0   2   1   0  17   7  30   3   0   0   0   6   0   0   0   0
K |   0   3  22   4  30   3   8   5  17   0  24   0   0  16   1   0  36   0   0
L |   0   0   1   0   1   3   8   3   0  14   5   0   0   0   0   0   2   0   0
M |   0   0   0   0  11   2   9   0   3   0   3   0   0   0   0   0   0   0   0
N |   0   0   0   5   2   7   2   2   2   0   2   0   0  16   0  13  22   1  96
P |   6  65   9   2   3  23   6   8   2   5   0   0   0   0   0   0   0   0   0
Q |   0   2   6   0   0   1   0   1   6   0   8   0   0   3   0   0   0   0   0
R |   0   2   4  15   8   2   6   6   2   3   8   0   0  10   0   0  19   0   0
S |   1   4   4   6  13   3   0   4   6   2   1   0   0  19  16   0   0   0   0
T |   3   4  14   5   5   5   1   5   0   4   3   0   0  18  72   0   0   0   0
V |   0   0   3  28   5   1   0  19   1  22   6   0   0   0   0   0   1   0   0
W |   0   0   0   0   0  22   0   0   0   0   0   0   0   0   0   0   0   0   0
X |   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y |   0   0   5   0   3   2  23   7   9  11   7   0   0   0   0   0   2   0   0
- |   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
PSSM of <A HREF="/blocks-bin/getblock.sh?BL00983#BL00983B">BL00983B</A>
  |                                       1   1   1   1   1
  |   1   2   3   4   5   6   7   8   9   0   1   2   3   4
--+--------------------------------------------------------
A |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
C |   0   0   0   0   0   0  91 100   0   0   0   0 100   0
D |   0   0   0   0   0   0   0   0   0   0  76   0   0   0
E |   0  17  29   0  20   0   0   0   0  42   0   0   0   0
F |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
G |  10   0   0   0   0   0   0   0  10   0   0   0   0   0
H |   0   0   0   0   0  39   0   0   0   0   0   0   0   0
I |  25   0   0   0   0   0   0   0   0   0   0   0   0   0
K |   0   0   0  30   0   0   0   0  28  23   0   0   0   0
L |   0   0  48   0   0   0   0   0   0   9   0 100   0   0
M |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
N |  25  23   0  13   0   0   0   0   0   0  24   0   0 100
P |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q |   0  15   0   0   0  13   0   0  29   0   0   0   0   0
R |   0  12   0  12   0   0   0   0  24   0   0   0   0   0
S |   0  20   0  11   0  18   0   0   9  12   0   0   0   0
T |  23  13   0  23  35  10   0   0   0  14   0   0   0   0
V |  16   0  22  11   0   8   9   0   0   0   0   0   0   0
W |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
X |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y |   0   0   0   0  45  13   0   0   0   0   0   0   0   0
- |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
</PRE>
("X" specifies unknown amino acids.)
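<P>
The shared cysteine positions can be read off the two PSSMs programmatically.
The Python sketch below uses only the cysteine ("C") rows copied from the tables
above and the alignment coordinates reported by LAMA (BL00272B 8:19 with
BL00983B 3:14). The cutoff of 90 for "highly conserved" and all function and
variable names are assumptions chosen for illustration.
<PRE>
```python
# Cysteine ("C") rows copied from the two PSSMs above.
c_row_bl00272b = [87, 12, 12, 11, 0, 0, 0, 0, 21, 0, 6, 100, 100, 0, 0, 0, 0, 99, 0]
c_row_bl00983b = [0, 0, 0, 0, 0, 0, 91, 100, 0, 0, 0, 0, 100, 0]

# Alignment reported by LAMA: BL00272B 8:19 with BL00983B 3:14 (12 columns).
START_A, START_B, LENGTH = 8, 3, 12

# Illustrative cutoff for "highly conserved" (an assumption, not a LAMA setting).
CONSERVED = 90


def shared_conserved(row_a, row_b, start_a, start_b, length, cutoff=CONSERVED):
    """List aligned column pairs in which both blocks reach the cutoff."""
    shared = []
    for offset in range(length):
        pos_a, pos_b = start_a + offset, start_b + offset
        # Positions are 1-based in the tables; list indices are 0-based.
        if row_a[pos_a - 1] >= cutoff and row_b[pos_b - 1] >= cutoff:
            shared.append((pos_a, pos_b))
    return shared


cysteine_columns = shared_conserved(
    c_row_bl00272b, c_row_bl00983b, START_A, START_B, LENGTH
)
print(cysteine_columns)
```
</PRE>
For the rows above this yields the column pairs (12,7), (13,8) and (18,13),
all of which are among the high-scoring columns of the alignment.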
<HR>
<LI><A NAME="IS30"><H3>IS30 transposases DNA-binding domain</H3></A><P>
Excision and insertion of bacterial insertion sequence elements (IS)
require the activity of a transposase protein sometimes encoded by the
ISs. The IS30 transposase family (Dong et al., 1992) is represented by
five blocks in BLOCKS 8.6. A region of 21 positions from the first block
had high scores (Z scores 6.7 to 8.8) only with helix-turn-helix
DNA-binding motifs (hth) from four protein families (see the
<A HREF="#FIGURE_HTH_GRAPH">figure</A> in the
<A HREF="#HTH">next example</A>).
Hth DNA-binding motifs occur in many proteins that bind specific DNA
sequences (Pabo & Sauer, 1992).<P>
BLAST searches of the SwissProt protein database with the IS30
sequences did not identify any protein with a known hth region. Searching
the Blocks Database with the IS30 sequences gave high scores with hth
blocks for two of the sequences (98.1 and 93.1 percentiles of scores
with shuffled sequence queries (Henikoff & Henikoff, 1994)). The other
two sequences had low scores with hth blocks (30.8 and 18.1 score
percentiles) and higher scores with non-hth blocks. However, each of the
transposases' putative DNA-binding regions was detected by the method of
Dodd and Egan (Dodd & Egan, 1990) as an almost certain hth domain.<P>
Classification of the first IS30 block as a hth motif is supported by
the finding that the N-terminal region of an IS30 transposase,
containing the putative hth DNA-binding region, binds the IS30 element
(Stalder et al., 1990).<P>
<IMG SRC="LAMA/LAMA_IS30.gif" HEIGHT=129 WIDTH=548>
<PRE>
Hth-like region in IS30 transposases.
Block BL01043A of the IS30 transposases family. The regions similar to
the hth motifs in the block to block searches are underlined. The start
and end coordinates flank the sequences. The diagram shows the suggested
position of the hth motifs found by the hth algorithm (Dodd & Egan, 1990).
The algorithm scores for hth motifs were 5.19 standard deviation
units (SD), corresponding to 100% probability for TRA1_STRSL, 5.95 SD
and 100% for TRA4_BACFR, 4.13 SD and 90% for TRA8_ALCEU, and 5.72 SD and
100% for TRA8_ECOLI.
</PRE>
Dodd, I. B. & Egan, J. B. (1990). Improved detection of helix-turn-helix
DNA-binding motifs in protein sequences. Nucl. Acids Res. 18, 5019-5026.<P>
Dong, Q., Sadouk, A., van der Lelie, D., Taghavi, S., Ferhat, A.,
Nuyten, J. M., Borremans, B., Mergeay, M. & Toussaint, A. (1992).
Cloning and sequencing of IS1086, an Alcaligenes eutrophus insertion
element related to IS30 and IS4351. J. Bacteriol. 174, 8133-8138.<P>
Henikoff, S. & Henikoff, J. G. (1994). Protein family classification
based on searching a database of blocks. Genomics 19, 97-107.<P>
Pabo, C. O. & Sauer, R. T. (1992). Transcription factors: structural
families and principles of DNA recognition. Annu. Rev. Biochem. 61,
1053-1095.<P>
Stalder, R., Caspers, P., Olasz, F. & Arber, W. (1990). The N-terminal
domain of the insertion sequence 30 transposase interacts specifically
with the terminal inverted repeats of the element. J. Biol. Chem. 265,
3757-3762.<P>
<HR>
<LI><A NAME="HTH"><H3>Hth motifs in the Blocks Database</H3></A><P>
In comparing the entries in the Blocks Database v8.6 among
themselves, all fourteen hth blocks had high scores with two or more
other hth blocks (<A HREF="#FIGURE_HTH_GRAPH">Figure</A>).
The two high scoring non-hth blocks could be
distinguished because each related to a single hth block only and had lower
scores than those between the hth blocks. The blocks are from four
types of protein families - bacterial regulatory proteins, homeobox
domain proteins, sigma bacterial transcription initiation factors and IS
transposases. Manual inspection of the Prosite annotation of the protein
families in the Blocks Database and of the blocks themselves found no
other hth blocks in the database.<P>
The hth blocks included different numbers of sequences, from 4 to 185.
There was no correlation between the number of sequences in a block and
its relation to other blocks. This suggests that even blocks with 4-6
sequences can give a correct representation of conserved protein domains.
More than 90% of the blocks in the database used had more than four
sequences. This fraction is increasing with each release (>94% in BLOCKS
9.0) as the number of new protein sequences is higher than the number of
new protein families (Green <I>et al</I>., 1993; Koonin <I>et al</I>.,
1995; Koonin <I>et al</I>., 1994).<P>
Hth blocks illustrate the problem of distinguishing genuine
relationships from chance ones and suggest a solution. Two of the hth
blocks (BL00622 and BL01063B) lie below the threshold for detecting
single-hit relations (Z score >=8.3, bold lines in
<A HREF="#FIGURE_HTH_GRAPH">Figure</A>). Protein
families with hth-motifs usually have no other common blocks to support the
relation between the hth blocks. However, hth motifs are found in several
protein families. These hth blocks all have high scores with each other, but
not all these scores are high enough to identify genuine relationships by
themselves. Nevertheless, blocks with a number of such scores to known hth
blocks can be identified as hth blocks too. The two non-hth blocks have high
scores to single hth blocks, and do not form part of the connected graph. An
analogous strategy is the basis for detecting weak similarities in
single-sequence alignments using the BLAST3 program (Altschul & Lipman, 1990).
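<P>
The strategy described above can be sketched in Python as follows. Only the two
Z score cutoffs (5.6 and 8.3) come from this page. The block names, the example
scores, the function name and the rule of requiring at least two links to known
hth blocks are hypothetical and serve only to illustrate the idea.
<PRE>
```python
from collections import defaultdict

SINGLE_HIT_Z = 8.3  # a single score at or above this is accepted on its own
REPORT_Z = 5.6      # scores below this cutoff are not considered


def classify(edges, known_hth, min_links=2):
    """Return the blocks supported as hth by their links to known hth blocks.

    edges: iterable of (block_a, block_b, z_score) tuples.
    min_links: assumed number of sub-threshold links needed for support.
    """
    links = defaultdict(list)
    for a, b, z in edges:
        if z >= REPORT_Z:
            if b in known_hth:
                links[a].append(z)
            if a in known_hth:
                links[b].append(z)
    supported = set()
    for block, scores in links.items():
        if block in known_hth:
            continue
        if max(scores) >= SINGLE_HIT_Z or len(scores) >= min_links:
            supported.add(block)
    return supported


# Hypothetical block names and Z scores, for illustration only.
known = {"HTH_1", "HTH_2", "HTH_3"}
edges = [
    ("QUERY_A", "HTH_1", 6.1),
    ("QUERY_A", "HTH_2", 7.0),  # two sub-threshold links: supported
    ("QUERY_B", "HTH_3", 6.4),  # one sub-threshold link: not supported
    ("QUERY_C", "HTH_1", 9.2),  # one link above the single-hit cutoff
]
print(sorted(classify(edges, known)))
```
</PRE>
In this toy example QUERY_A is supported by two moderate scores and QUERY_C by
one score above the single-hit cutoff, while QUERY_B, with a single moderate
score, is left out, as the two non-hth blocks are in the figure.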
<P>
<A NAME="FIGURE_HTH_GRAPH"><IMG SRC="LAMA/LAMA_hth_graph.gif" HEIGHT=444 WIDTH=693></A>
<PRE>
High scores of helix-turn-helix DNA binding blocks.
All 14 hth blocks found in BLOCKS 8.6 and their high scoring relationships
with each other (true positives) and with other blocks (false positives,
outward pointing lines). Each block had different sequences except two pairs
of homeobox blocks that had common sequences (BL00027 with BL00032B and with
BL00035B). Lines show scores above the 5.6 Z score cutoff. Thick lines
correspond to scores above the 8.3 Z score cutoff. BRP - bacterial
regulatory proteins.<P>
</PRE>
Since all the hth blocks are similar to one another, we examined how well
a single composite hth block would identify other hth blocks. The
<A HREF="http://www.ncbi.nlm.nih.gov/Complete_Genomes/Ecoli/README">
ecmot database</A> (Koonin et al., 1995) contains such a
<A HREF="http://www.ncbi.nlm.nih.gov/cgi-bin/Complete_Genomes/mot2html?EC0157">
composite hth block</A>, with 609 sequence segments from many hth families.
The <A HREF="#EC0157_LOGO">graphical representation</A>
(<A HREF="/blocks/about_logos.html">logo</A>) of this block
illustrates the conservation in each of its positions. This and the
avoidance of particular amino acids at specific positions can also be seen in
the <A HREF="LAMA/EC0157_.pssm.html">PSSM of block EC0157</A>.
This block had high scores with 18 blocks in Blocks Database v8.6
(<A HREF="#TABLE2">Table</A>).
Fourteen of those are the hth blocks discussed above. All the
hth blocks had high to extremely high scores, the lowest one expected to
occur by chance 3.2e-3 times.<BR>
(<A HREF="LAMA/EC0157_.blk">Here</A> you will find block
EC0157 in a format you can use in a
<A HREF="/blocks-bin/LAMA_search.sh?LAMA/EC0157_.blk">LAMA search</A>.)<P>
The four blocks at the end of the table have significantly lower scores
(Z 5.6-6.5). These are non-hth blocks but their similarity to the
composite hth block can be explained. Two of the blocks are from
bacterial regulatory protein families, occurring C-terminal to the hth
motifs. One is a hth-similar region from the araC family (Brunelle &
Schleif, 1989) and the other corresponds to the
<A HREF="LAMA/LAMA_lacIs.html">hth helix3 and DNA
binding hinge helix in the <I>E.coli</I> lac repressor protein</A> (Lewis et
al., 1996). Another block is from the S3 ribosomal proteins (BL00548A).
This protein binds RNA, and it is interesting to note the recent report
of RNA-binding activity by a hth domain (Dubnau & Struhl, 1996). The
last non-hth block is from L-lactate dehydrogenase (LDH) proteins. LDHs
do not bind DNA but the
<A HREF="LAMA/LAMA_LDHs.html">crystal structure of the detected region
(alpha-2f to Beta-G) is a helix-turn followed by a helix or strand in
different proteins</A> (Abad Zapatero et al., 1987; Grau et al., 1981; Iwata
& Ohta, 1993).<P>
<A NAME="EC0157_LOGO">
<IMG SRC="LAMA/EC0157_.PSSM.logo.jpeg" HEIGHT=520 WIDTH=760></A><P>
<A NAME="TABLE2"><B>Blocks similar to composite hth block</B></A> <B><A HREF="LAMA/EC0157_.blk">EC0157</A></B>
<TABLE BORDER>
<TR VALIGN=top><TH><PRE>Protein family (1)</PRE></TH><TH><PRE>Z score</PRE></TH></TR>
<TR VALIGN=top><TD><PRE>'Homeobox' domain proteins</PRE></TD><TD><PRE>18.4</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>'Homeobox' antennapedia-type proteins</PRE></TD><TD><PRE>13.2</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>'POU' domain proteins</PRE></TD><TD><PRE>11.7</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP crp family</PRE></TD><TD><PRE>12.1</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP gntR family</PRE></TD><TD><PRE>12.4</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP lysR family</PRE></TD><TD><PRE>14.4</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP lacI family (2)</PRE></TD><TD><PRE>11.7</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP luxR family</PRE></TD><TD><PRE>12.4</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP arsR family</PRE></TD><TD><PRE> 8.0</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP deoR family</PRE></TD><TD><PRE> 8.7</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP tetR family</PRE></TD><TD><PRE>14.1</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>Sigma-54 factors family</PRE></TD><TD><PRE> 7.8</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>Sigma-70 factors ECF subfamily</PRE></TD><TD><PRE> 8.3</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>Transposases, IS30 family</PRE></TD><TD><PRE>11.2</PRE></TD></TR>
<TR VALIGN=top></TR>
<TR VALIGN=top><TD><PRE>BRP araC family</PRE></TD><TD><PRE> 6.5</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP lacI family (2)</PRE></TD><TD><PRE> 6.6</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>Ribosomal S3 proteins</PRE></TD><TD><PRE> 5.8</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>L-lactate dehydrogenase family</PRE></TD><TD><PRE> 5.8</PRE></TD></TR>
</TABLE>
<PRE>
(1) The family Blocks Database entry numbers are in the previous figure
except for BRP araC family - BL00041, L-lactate dehydrogenase - BL00064D
and Ribosomal protein S3 proteins - BL00548A.
The non-hth blocks are separated at the end of the table.
(2) Two blocks from the lacI hth family are similar to the composite hth block -
block BL00356A, the hth region, and block BL00356B, the following
DNA-binding hinge region.
</PRE>
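The gap between the two groups of the table can be checked with a few lines of
Python. The Z scores are the values printed in the table above; the hth /
non-hth labels follow the table footnotes, and the variable names are
illustrative only.
<PRE>
```python
# (protein family, Z score, is_hth) rows as printed in the table above.
table = [
    ("'Homeobox' domain proteins", 18.4, True),
    ("'Homeobox' antennapedia-type proteins", 13.2, True),
    ("'POU' domain proteins", 11.7, True),
    ("BRP crp family", 12.1, True),
    ("BRP gntR family", 12.4, True),
    ("BRP lysR family", 14.4, True),
    ("BRP lacI family, hth region", 11.7, True),
    ("BRP luxR family", 12.4, True),
    ("BRP arsR family", 8.0, True),
    ("BRP deoR family", 8.7, True),
    ("BRP tetR family", 14.1, True),
    ("Sigma-54 factors family", 7.8, True),
    ("Sigma-70 factors ECF subfamily", 8.3, True),
    ("Transposases, IS30 family", 11.2, True),
    ("BRP araC family", 6.5, False),
    ("BRP lacI family, hinge region", 6.6, False),
    ("Ribosomal S3 proteins", 5.8, False),
    ("L-lactate dehydrogenase family", 5.8, False),
]

hth_scores = [z for _, z, is_hth in table if is_hth]
other_scores = [z for _, z, is_hth in table if not is_hth]

lowest_hth = min(hth_scores)
highest_other = max(other_scores)

print(len(hth_scores), "hth blocks, lowest Z score", lowest_hth)
print(len(other_scores), "non-hth blocks, highest Z score", highest_other)
print("groups separated:", lowest_hth > highest_other)
```
</PRE>
With the values as printed, every hth block scores above every non-hth block,
so any cutoff between the two groups would separate them in this table.<P>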
Identifying all the hth regions in the Blocks Database illustrates
the potential of the multiple alignment comparison method as an aid for
annotating protein-family databases. Besides identifying the function of
unknown regions, the approach outlined in this example can be useful in
annotating databases that generate the multiple alignments automatically.
Multiple alignments of characterized protein motifs (such as the hth,
nucleotide binding folds or leucine zipper) could be used to identify other
multiple alignments containing these motifs.<P>
Altschul, S. F. & Lipman, D. J. (1990). Protein database searches for multiple alignments. <I>Proc. Natl. Acad. Sci. USA</I> <B>87</B>, 5509-5513.<P>
Abad Zapatero, C., Griffith, J., Sussman, J. & Rossmann, M. (1987).
Refined crystal structure of dogfish M4 apo-lactate dehydrogenase.
<I>J Mol Biol</I> <B>198</B>, 445-467.<P>
Brunelle, A. & Schleif, R. (1989). Determining residue-base interactions
between AraC protein and araI DNA. <I>J Mol Biol</I> <B>209</B>, 607-622.<P>
Dubnau, J. & Struhl, G. (1996). RNA recognition and translational
regulation by a homeodomain protein. <I>Nature</I> <B>379</B>, 694-699.<P>
Grau, U., Trommer, W. & Rossmann, M. (1981). Structure of the active
ternary complex of pig heart lactate dehydrogenase with S-lac-NAD at 2.7
A resolution. <I>J Mol Biol</I> <B>151</B>, 289-307.<P>
Green, P., Lipman, D., Hillier, L., Waterston, R., States, D. & Claverie, J. M.
(1993). Ancient conserved regions in new gene sequences and the protein databases.
<I>Science</I> <B>259</B>, 1711-1716.<P>
Iwata, S. & Ohta, T. (1993). Molecular basis of allosteric activation of
bacterial L-lactate dehydrogenase. <I>J Mol Biol</I> <B>230</B>, 21-27.<P>
Koonin, E., Tatusov, R. & Rudd, K. (1995). Sequence similarity analysis of
Escherichia coli proteins: functional and evolutionary implications.
<I>Proc Natl Acad Sci USA</I> <B>92</B>, 11921-11925.<P>
Koonin, E. V., Bork, P. & Sander, C. (1994). Yeast chromosome III:
new gene functions. <I>EMBO J.</I> <B>13</B>, 493-503.<P>
Lewis, M., Chang, G., Horton, N. C., Kercher, M. A., Pace, H. C.,
Schumacher, M. A., Brennan, R. G. & Lu, P. (1996). Crystal Structure of
the Lactose Operon Repressor and Its Complexes with DNA and Inducer.
<I>Science</I> <B>271</B>, 1247-1254.<P>
<HR>
</UL>
<A NAME="SUPPLMNT"><H2>Supplements</H2></A>
To calibrate the LAMA scores, the
<A HREF="/blocks/about_blocks.html#blocks">Blocks Database</A>
was purged of <A HREF="#BIASED_BLOCKS">biassed blocks</A>, the PSSMs of
the remaining blocks were each shuffled and then compared against the
blocks from the unshuffled database. The best score from each of
the resulting 7 million comparisons was saved. These scores are due to chance
and were used to estimate the significance of alignment scores between blocks.
The mean and variance of chance alignment scores depend on the length of the
compared blocks. Longer blocks will give longer alignments and higher scores
by chance alone. Grouping the chance scores by the length of the shorter
block in each comparison gave very similar score distributions. The mean
and standard deviation of each group were used to transform each score into
a <A HREF="#Z_SCORE">Z score</A>. The percentiles of all these Z scores were
then calculated. These percentiles are used to estimate the
<A HREF="#EXPECTED">expected number</A> of times each score would appear by
chance rather than through a genuine relationship.<P>
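The calibration steps can be sketched in Python as shown below. This is not the
LAMA code: the function names, the toy chance scores, the use of the sample
standard deviation and the exact percentile definition are all assumptions made
for illustration.
<PRE>
```python
import statistics
from bisect import bisect_right
from collections import defaultdict


def fit_chance_stats(chance_scores):
    """Group chance scores by shorter-block length; return {length: (mean, sd)}.

    chance_scores: iterable of (shorter_block_length, raw_score) pairs
    obtained from comparisons with shuffled PSSMs.
    """
    by_length = defaultdict(list)
    for length, score in chance_scores:
        by_length[length].append(score)
    return {
        length: (statistics.mean(scores), statistics.stdev(scores))
        for length, scores in by_length.items()
    }


def z_score(raw_score, shorter_length, stats):
    """Transform a raw score using the chance mean and sd for its length group."""
    mean, sd = stats[shorter_length]
    return (raw_score - mean) / sd


def percentile(z, sorted_chance_z):
    """Percentage of chance Z scores that are not larger than z."""
    return 100.0 * bisect_right(sorted_chance_z, z) / len(sorted_chance_z)


# Hypothetical chance scores: (length of the shorter block, raw score).
chance = [(10, 20.0), (10, 24.0), (10, 28.0), (20, 40.0), (20, 50.0), (20, 60.0)]
stats = fit_chance_stats(chance)
chance_z = sorted(z_score(score, length, stats) for length, score in chance)

z = z_score(36.0, 10, stats)
print(round(z, 2), percentile(z, chance_z))
```
</PRE>
In this toy data set the group of length 10 has mean 24 and standard deviation
4, so a raw score of 36 becomes a Z score of 3.0, which lies above all of the
chance Z scores.<P>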
Following are links to tables with this data. Note that the scores in the
tables are the raw scores of the alignments. The scores shown in the LAMA
output are normalized by dividing the raw score by the alignment length.
<UL>
<LI><A HREF="LAMA/LAMA.Z_stat.html">Mean and standard deviation for scores expected by chance</A>
<LI><A HREF="LAMA/LAMA.ZVp.html">Percentile of Z scores expected by chance</A>
</UL><P>
<A NAME="LAMA_CREDITS"><H2>Credits and citation</H2></A>
The multiple alignment comparison method and LAMA program were developed by
<A HREF="/~pietro">Shmuel Pietrokovski</A>
in the lab of Steve Henikoff at the
<A HREF="http://www.fhcrc.org">Fred Hutchinson Cancer Research Center</A>,
<A HREF="http://www.cyberspace.com/bobk/">Seattle</A>.<P>
An article describing the method and its uses<BR>
"<STRONG>Searching Databases of Conserved Sequence Regions by
Aligning Protein Multiple-Alignments</STRONG>"<BR>
appeared in
<A HREF="http://www.oup.co.uk/oup/smj/journals/ed/titles/nar/Volume_24/Issue_19/6s0225_gml.abs.html">
Nucleic Acids Research 24(19) 3836-3845 (October 1996)</A>.
This article should be cited in research using this method.<BR>
<HR>
<A HREF="/blocks">[Blocks home]</A>
<A HREF="/blocks/blocks_search.html">[Block Searcher]</A>
<A HREF="/blocks/make_blocks.html">[Block Maker]</A>
<A HREF="/blocks-bin/getblock.sh">[Get Blocks]</A>
<A HREF="/blocks/block_formatter.html">[format a block]</A>
<A HREF="/blocks/biassed_blocks.html">[check for biassed blocks]</A>
<A HREF="/blocks-bin/LAMA_search.sh">[LAMA Searcher]</A>
<HR>
Page last modified <MODIFICATION_DATE>January 1997</MODIFICATION_DATE>
(thanks to Liz G.Wiz for useful comments)
<Address>
<A HREF="/~pietro">Shmuel Pietrokovski</A>
</Address>