File: LAMA_help.html

<TITLE>LAMA help</TITLE>

<H1><A HREF="/blocks/help/LAMA/LAMA_YWK.html">
<IMG ALIGN=MIDDLE SRC="/blocks/help/LAMA/llama.2.gif" HEIGHT=145 WIDTH=95></A>
<A HREF="/blocks-bin/LAMA_search.sh">LAMA</A> help</H1>

<UL>
<LI><A HREF="#LAMA">What does LAMA do?</A>
<LI><A HREF="#LAMA_FOR_ME">What can LAMA do for <B>me</B>?</A>
<LI><A HREF="#LAMA_HOW">How does LAMA align blocks?</A>
<LI><A HREF="#LAMA_INPUT">Input for LAMA</A>
	<UL>
	<LI><A HREF="#LAMA_INPUT_CONTENT">Content of input</A>
	<LI><A HREF="#LAMA_INPUT_FORMAT">Format of input</A>
        <LI><A HREF="#LAMA_OUTPUT_OPTIONS">Output options</A>
	</UL>
<LI><A HREF="#LAMA_OUTPUT">Output from LAMA</A>
	<UL>
	<LI><A HREF="#LAMA_OUTPUT_FORMAT">Format and content of output</A>
	<LI><A HREF="#EVALUATING_SCORES">Evaluating LAMA alignment scores</A>
	</UL>
<LI><A HREF="#EXAMPLES">Examples</A>
	<UL>
	<LI><A HREF="#FLAVOPROTEINS">Flavoproteins FAD binding and catalytic sites</A>
	<LI><A HREF="#ST_CD59">Snake toxins and the CD59 extracellular domain</A>
        <LI><A HREF="#IS30">IS30 transposases DNA-binding domain</A>
        <LI><A HREF="#HTH">Hth motifs in the Blocks Database</A>
	</UL>
<LI><A HREF="#SUPPLMNT">Supplements</A>
	<UL>
	<LI><A HREF="LAMA/LAMA.Z_stat.html">Mean and standard deviation for scores expected by chance</A>
	<LI><A HREF="LAMA/LAMA.ZVp.html">Percentile of Z scores expected by chance</A>
	</UL>
<LI><A HREF="#LAMA_CREDITS">Credits and citation</A>
</UL>

<A NAME="LAMA"><H2>What does LAMA do?</H2></A>
LAMA (Local Alignment of Multiple Alignments) is a program for comparing 
protein multiple sequence alignments with each other. The program can search 
databases of such multiple alignments. The search is for sequence 
similarities between conserved regions of protein families.
The method is sensitive, detecting weak sequence 
relationships between protein families. Sequence similarities 
beyond the range of conventional sequence database searches can be 
detected by the method.<P>

<A NAME="LAMA_FOR_ME"><H2>What can LAMA do for <B>me</B>?</H2></A>
LAMA can identify protein families similar to your protein(s) of interest
and protein motifs similar to conserved regions in your protein(s). The
information known about these similar families and motifs can help you
identify the function and structure of your protein and locate critical
conserved regions in your protein(s). This can direct you in
designing experiments to test your hypotheses.<P>

LAMA compares <B>multiple</B> sequence alignments of proteins. 
If you have only a <B>single</B> protein sequence you first need to 
find other members of its family. The protein sequences also need to 
be multiply aligned. The <A HREF="#LAMA_INPUT_CONTENT">Content of input</A>
section explains how to find related sequences and align them.<P>

<A NAME="LAMA_HOW"><H2>How does LAMA align blocks?</H2></A>
The multiple alignments are first transformed into position specific 
scoring matrices (<A HREF="PSSM_def.html">PSSMs</A>). Each column in 
the PSSM corresponds to a position in the 
alignment and has the amino acid distribution of that position. The 
transformation into the PSSM is done with position-based sequence weights 
(<A HREF="/blocks/papers/#SEQUENCE_WEIGHTS.ps">Henikoff & Henikoff, 1994a</A>) 
and odds ratios between the amino acid frequencies 
observed in the multiple alignments and the frequencies expected 
from protein databases 
(<A HREF="/blocks/papers/#BLOCKMAKER.ps">Henikoff & Henikoff, 1995</A>). 
The transformation corrects possible overrepresentation of some 
sequences by sequence weighting and considers the background 
frequencies of the amino acids. 
The method was tested and calibrated with ungapped local multiple alignments 
(blocks) from the 
<A HREF="/blocks/about_blocks.html#blocks">Blocks Database</A>
.<P> 
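The transformation described above can be illustrated with a minimal sketch. It assumes position-based weights in their commonly described form (in each column a sequence receives 1/(k*n), where k is the number of different residues in the column and n is the number of sequences sharing that residue) and, purely for illustration, a uniform background frequency. The function names and the toy block are hypothetical, and the PSSMs produced by LAMA itself may be computed and scaled differently.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_based_weights(block):
    """Position-based sequence weights for an ungapped block.

    block is a list of equal-length strings. In each column a sequence
    receives 1 / (k * n), where k is the number of distinct residues in the
    column and n is how many sequences share this sequence's residue.
    The weights are normalised to sum to 1.
    """
    width = len(block[0])
    weights = [0.0] * len(block)
    for pos in range(width):
        counts = Counter(seq[pos] for seq in block)
        k = len(counts)
        for i, seq in enumerate(block):
            weights[i] += 1.0 / (k * counts[seq[pos]])
    total = sum(weights)
    return [w / total for w in weights]


def odds_ratio_pssm(block, background=None):
    """Return one {amino acid: odds ratio} dict per block column."""
    if background is None:
        # Assumption for illustration only: a uniform background.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    weights = position_based_weights(block)
    pssm = []
    for pos in range(len(block[0])):
        freq = {aa: 0.0 for aa in AMINO_ACIDS}
        for w, seq in zip(weights, block):
            freq[seq[pos]] += w  # weighted observed frequency
        pssm.append({aa: freq[aa] / background[aa] for aa in AMINO_ACIDS})
    return pssm


# Toy block: two identical sequences and one divergent sequence.
toy_block = ["AC", "AC", "GC"]
weights = position_based_weights(toy_block)
pssm = odds_ratio_pssm(toy_block)
```

In the toy block the divergent third sequence receives a larger weight than either of the two identical sequences, which is the correction for overrepresentation mentioned above.<P>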

The matrices are treated as sequences of columns, enabling 
their alignment with one another. To use algorithms developed for
aligning single sequences we need a measure for comparing pairs of 
matrix columns. This corresponds to the substitution matrices 
(PAM, BLOSUM etc.) used in single-sequence alignments. The
measure used in our method to score the similarity between pairs of 
matrix columns is the Pearson correlation coefficient <A NAME="Pr">(r)</A>: 
<IMG SRC = "LAMA/LAMA_r.gif" HEIGHT=69 WIDTH=181>
where A(i) and B(i) are the values of amino acid i in columns A and B, 
respectively, and /A and /B
are the means of the values in columns A and B.
The correlation score ranges from 1 for columns with identical 
amino acid distributions to -1 for columns with opposite 
distributions (for example, when only 10 equally weighted amino acids 
occur in each column and those 10 amino acids differ between the two 
compared columns).<P>
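As a minimal sketch of the column comparison measure, the function below computes the Pearson correlation coefficient between two columns of 20 values. The function name is hypothetical, and returning 0.0 for a column with no variation is a choice made for this sketch only, not something stated for LAMA.

```python
import math


def column_correlation(col_a, col_b):
    """Pearson correlation coefficient between two PSSM columns."""
    n = len(col_a)
    mean_a = sum(col_a) / n
    mean_b = sum(col_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b))
    var_a = sum((a - mean_a) ** 2 for a in col_a)
    var_b = sum((b - mean_b) ** 2 for b in col_b)
    if var_a == 0 or var_b == 0:
        # A flat column has no defined correlation; 0.0 is a sketch-only choice.
        return 0.0
    return cov / math.sqrt(var_a * var_b)


# Two "opposite" columns: each has 10 equally weighted amino acids,
# and the two sets of 10 do not overlap.
first_half = [10.0] * 10 + [0.0] * 10
second_half = [0.0] * 10 + [10.0] * 10

same = column_correlation(first_half, first_half)
opposite = column_correlation(first_half, second_half)
```

Here `same` evaluates to 1 and `opposite` to -1, the two ends of the range described above.<P>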

The score of a block-to-block alignment is the sum of the scores from 
comparing the corresponding columns in the two block matrices: <BR>
<IMG SRC = "LAMA/LAMA_algorithm.gif" HEIGHT=477 WIDTH=438>
<PRE>
Local alignment of blocks.
Positions 2 to 7 from block A aligned with positions 4 to 9 from 
block B. A column comparison score, <STRONG>s(Xn*Ym)</STRONG>, is calculated for 
each pair of positions (A2*B4 to A7*B9). The score of the alignment 
of the two segments, <STRONG><I>S</I></STRONG>, is the sum of the column comparison scores.
</PRE>

 The alignment is done using the Smith-Waterman algorithm for optimal 
local alignments. No gaps are allowed since the aligned objects are 
short conserved sequence regions. All alignments above the cutoff score 
are reported for each pair of compared blocks. There may be cases where parts
of one long block are similar to several blocks:
<STRONG><PRE>
	AAAAAAAAAAAAAAAAAAA
	 BBB       CCCCCC
</PRE></STRONG>
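Because no gaps are allowed, every local alignment lies on a single diagonal of the column-score table, so the optimal local alignment can be found by scanning each diagonal for its best-scoring run. The sketch below does this for a precomputed table of column scores. It is an illustration under stated assumptions, not LAMA's code: the function name is hypothetical, it returns only the single best segment (LAMA reports all alignments above the cutoff), and it ranks segments by the raw sum of column scores.

```python
def best_ungapped_local_alignment(scores):
    """Find the highest-scoring ungapped local alignment.

    scores[i][j] is the column score of position i in block A against
    position j in block B. Returns (total, start_a, start_b, length) with
    0-based start positions; total is the sum of the column scores.
    """
    len_a, len_b = len(scores), len(scores[0])
    best = (0.0, 0, 0, 0)
    # Without gaps, an alignment is fixed by its diagonal (offset = j - i).
    for offset in range(-(len_a - 1), len_b):
        i = max(0, -offset)
        j = i + offset
        running, run_start = 0.0, i
        while i < len_a and j < len_b:
            if running <= 0.0:
                # Restart the segment, like the zero floor in Smith-Waterman.
                running, run_start = 0.0, i
            running += scores[i][j]
            if running > best[0]:
                best = (running, run_start, run_start + offset,
                        i - run_start + 1)
            i += 1
            j += 1
    return best


# Toy table mirroring the figure: positions 2-7 of block A match
# positions 4-9 of block B (1-based), everything else scores poorly.
len_a, len_b = 8, 10
toy_scores = [[-0.5] * len_b for _ in range(len_a)]
for i in range(1, 7):
    toy_scores[i][i + 2] = 0.9
total, start_a, start_b, length = best_ungapped_local_alignment(toy_scores)
```

For the toy table the sketch recovers a segment of length 6 starting at 0-based positions 1 and 3, that is, positions 2 to 7 of block A against positions 4 to 9 of block B.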


<A NAME="LAMA_INPUT"><H2>Input for LAMA</H2></A>

<A NAME="LAMA_INPUT_CONTENT"><H3>Content of input</H3></A>
 LAMA can compare any multiple alignment if it is in the correct format.
<I>However</I>, the 
<A HREF="#Pr">column comparison measure</A> and 
the <A HREF="#SCORE_SIGNIF">significance estimation</A> of the scores
are appropriate for protein sequence blocks - ungapped conserved multiple 
alignments. The use of other types of multiple alignments, such as global 
multiple alignments that include many gaps, may give misleading results.
For example, the resulting alignments may not be optimal, or their 
significance may differ from what the output suggests.
<P>

 If you only have a single protein sequence or want to find more protein
sequences related to yours you can search the sequence databases.
One way to do this on the WWW is using the 
<A HREF="http://www.ncbi.nlm.nih.gov/cgi-bin/BLAST/nph-blast?Jform=0">
BLAST program</A> to search the
<A HREF="http://www.ncbi.nlm.nih.gov/index.html">NCBI</A> sequence databases. 
Links to other search methods can be found at 
the Baylor College of Medicine Human Genome Center
<A HREF="http://dot.imgen.bcm.tmc.edu:9331/seq-search/protein-search.html">
Search Launcher site</A>.<P>

 The <A HREF = "/blocks/make_blocks.html">BlockMaker</A> WWW site can be used
for finding blocks in your group of related protein sequences. There are 
various other methods for making protein multiple sequence alignments. 
Among these are the 
<A HREF="http://meme.sdsc.edu/meme/website/meme-intro.html">
MEME system</A>,
<A HREF = "http://www3.ncbi.nlm.nih.gov:80/htbin-post/Entrez/query?uid=94023958&form=6&db=m&Dopt=r">
Gibbs sampling programs</A>, 
the <A HREF="http://www3.ncbi.nlm.nih.gov:80/htbin-post/Entrez/query?uid=91172743&form=6&db=m&Dopt=r">
MACAW interactive program</A>, and
the <A HREF = "http://www3.ncbi.nlm.nih.gov:80/htbin-post/Entrez/query?uid=95075648&form=6&db=m&Dopt=r">
CLUSTAL-W progressive multiple alignment program</A>. 
Several of these methods are available through the 
<A HREF="http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html">
multiple sequence alignment page</A> 
at the Baylor College of Medicine Human Genome Center.
<P>

Multiple alignments submitted to the program should be of conserved, 
relatively ungapped, protein sequence regions. A few gaps in the 
alignment are acceptable. The more sequences in the alignment, the 
better. In general, avoid alignments with fewer than 4 sequences.
<P>


<A NAME="LAMA_INPUT_FORMAT"><H3>Format of input</H3></A>
LAMA only accepts input in the 
<A HREF="/blocks/blocks_format.html">Block format</A>. Other multiple
alignments can be <A HREF="/blocks/block_formatter.html">
reformatted to the Block format</A>. If you are not sure of your 
multiple alignment or just have a group of <STRONG>related</STRONG>
sequences you can use the 
<A HREF = "/blocks/make_blocks.html">BlockMaker program</A> for 
finding blocks in the sequences. Note that blocks include sequence 
weights to avoid biassed sequence representation.<P>

<A NAME="LAMA_OUTPUT_OPTIONS"><H3>Output options</H3></A>
<UL>
<A NAME="OUTPUT"><LI>Output level</A><BR> 
The <A HREF="#LAMA_OUTPUT">standard output</A> displays pairs of 
blocks with alignment scores above a <A HREF = "#SCORE_CUTOFF">
Z score cutoff</A>. When both target and query blocks are given 
by the user there are options for also seeing the 
<A HREF="#Pr">column scores</A> composing the alignment score 
for <I>every</I> reported alignment and the <A HREF="#LAMA_HOW">PSSMs</A>
of <I>all</I> the compared blocks.<BR>

<A NAME="CUTOFF"><LI>Score cutoff</A><BR>
The default cutoff value is 5.6 Z scores. When both target and query 
blocks are given by the user different cutoffs can be specified. 
Giving a lower value will allow reporting of weaker alignments. 
Alignments with low values can occur by chance between unrelated 
blocks. Raising the cutoff score may exclude potentially genuine 
alignments. The <A HREF = "#EXPECTED">expected number</A> of 
occurrences should be used to <A HREF = "#EVALUATING_SCORES">
evaluate the alignment scores</A>.
</UL>
Some of the <A HREF="#EXAMPLES">examples</A> included in this document
illustrate the use of the options.<P>


<H2><A NAME="LAMA_OUTPUT">Output from LAMA</A></H2>

<H3><A NAME="LAMA_OUTPUT_FORMAT">Content and format of output:</A></H3>
<pre><HR>
LAMA version 1.00 October 96.

Minimal length of reported alignments   4
Score cutoff is 5.6 Z score units (in the top 7.7e-05 percentile of chance scores)


                                            alignment     Z-score  expected number for 
block 1   from:to       block 2   from:to   length                 searching 5000 blocks
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A>   20 :  46 and <A HREF="/blocks-bin/getblock.sh?BL00042#BL00042B">BL00042B</A>    3 :  29 (27) score  39 ( 7.2  1.3e-02) [<A HREF="/blocks-bin/LAMA_show_alignment?/howard/blocks/bin/blocks.dat+BL01063B+2+/howard/blocks/bin/blocks.dat+BL00042B+3+27"><IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment"></A> <A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+20+/howard/blocks/bin/blocks.dat+BL00042B+3+27"><IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos"></A><A HREF="/about_logos.html">?</A>]
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A>    5 :  39 and <A HREF="/blocks-bin/getblock.sh?BL00324#BL00324C">BL00324C</A>    3 :  37 (35) score  27 ( 6.1  1.5e-01) [<A HREF="/blocks-bin/LAMA_show_alignment?/howard/blocks/bin/blocks.dat+BL01063B+5+/howard/blocks/bin/blocks.dat+BL00324C+3+35"><IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment"></A> <A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+5+/howard/blocks/bin/blocks.dat+BL00324C+3+35"><IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos"></A><A HREF="/about_logos.html">?</A>]
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A>   12 :  47 and <A HREF="/blocks-bin/getblock.sh?BL00622#BL00622">BL00622</A>     8 :  43 (36) score  33 ( 8.2  0.0e+00) [<A HREF="/blocks-bin/LAMA_show_alignment?/howard/blocks/bin/blocks.dat+BL01063B+12+/howard/blocks/bin/blocks.dat+BL00622+8+36"><IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment"></A> <A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+12+/howard/blocks/bin/blocks.dat+BL00622+8+36"><IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos"></A><A HREF="/about_logos.html">?</A>]
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A>   10 :  46 and <A HREF="/blocks-bin/getblock.sh?BL00894#BL00894A">BL00894A</A>    1 :  37 (37) score  26 ( 5.7  3.2e-01) [<A HREF="/blocks-bin/LAMA_show_alignment?/howard/blocks/bin/blocks.dat+BL01063B+10+/howard/blocks/bin/blocks.dat+BL00894A+1+37"><IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment"></A> <A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+10+/howard/blocks/bin/blocks.dat+BL00894A+1+37"><IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos"></A><A HREF="/about_logos.html">?</A>]
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A>    4 :  42 and <A HREF="/blocks-bin/getblock.sh?BL01043#BL01043A">BL01043A</A>    2 :  40 (39) score  29 ( 8.1  0.0e+00) [<A HREF="/blocks-bin/LAMA_show_alignment?/howard/blocks/bin/blocks.dat+BL01063B+4+/howard/blocks/bin/blocks.dat+BL01043A+2+39"><IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment"></A> <A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+4+/howard/blocks/bin/blocks.dat+BL01043A+2+39"><IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos"></A><A HREF="/about_logos.html">?</A>]
</pre>

The program version and execution parameters head the search output. 
Only alignments at least as long as the <STRONG>minimal length</STRONG> will 
be reported. The significance of very short alignments (fewer than 4 
positions) 
cannot be reliably estimated. Alignments with scores equal to or above 
the <A NAME="SCORE_CUTOFF"><STRONG>score cutoff</STRONG></A> will be reported. 
The score cutoff is specified as a <STRONG>Z score</STRONG>. 
<A NAME="Z_SCORE">Z score</A> is 
the number of standard deviations between the score and the mean score. 
<A NAME="SHUFFLED_SCORES">T</A>he mean score and the standard deviations 
were calculated for the random scores from the alignment of a large number 
of shuffled unbiassed blocks (7 million block pairs; 
see <A HREF="#SUPPLMNT">first supplement</A>).
The <STRONG>Z score</STRONG> is related to the percentile of the score 
in the shuffled blocks scores. This dependence is not linear but sigmoidal
(see <A HREF="#SUPPLMNT">second supplement</A>).<BR>
For each reported alignment the program shows the names of the two 
<STRONG>aligned blocks</STRONG>, 
their <STRONG>position</STRONG> relative to one another,
the <STRONG>alignment length</STRONG>,
the <STRONG>score</STRONG>, 
and the <STRONG>expected number</STRONG>
of such scores when searching a given number of blocks. 
<A NAME="EXPECTED">T</A>he expected number is for chance (random) 
alignments of unbiassed blocks. 
It is calculated from the score percentiles between the shuffled 
unbiassed blocks.
In this example the expected number is for searching 5000 blocks. 
Blocks from the 
<A HREF="/blocks/about_blocks.html#blocks">Blocks Database</A>
and from the
<A HREF="/blocks/about_blocks.html#prints">Prints database</A> 
will be linked to the database entries. The "alignment" link 
(<IMG SRC="/blocks/icons/aligned_blocks.gif" HEIGHT="11" WIDTH="21" ALT="alignment">) 
shows the alignment of the two blocks. This can also be seen by 
following the "logos" 
(<IMG SRC="/blocks/icons/logos.gif" HEIGHT="13" WIDTH="35" ALT="Logos">) 
link that shows the <A HREF="/blocks/about_logos.html">sequence logos</A> 
of aligned pairs of blocks.
<A HREF="/blocks/about_logos.html">Sequence logos</A> are graphical representations 
of the blocks. 
For example, 
<A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL01063B+12+/howard/blocks/bin/blocks.dat+BL00622+8+36">here</A> 
(PostScript viewer required) the logo of block 
<A HREF="/blocks-bin/getblock.sh?BL00622#BL00622">BL00622</A>
is shifted 4 positions relative to the logo of block 
<A HREF="/blocks-bin/getblock.sh?BL01063#BL01063B">BL01063B</A>
so that their similar segments (8-43 and 12-47) are aligned. 
Indeed, these segments both contain helix-turn-helix DNA binding motifs.
<P>
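The Z score definition above amounts to a one-line calculation once the chance mean and standard deviation are known. The sketch below uses a handful of invented chance scores in place of the shuffled-block calibration data, which are not reproduced in this document. The function names are hypothetical, and the population standard deviation is an assumption of the sketch.

```python
import statistics


def chance_statistics(chance_scores):
    """Mean and standard deviation of scores from shuffled blocks."""
    return statistics.mean(chance_scores), statistics.pstdev(chance_scores)


def z_score(score, mean, sd):
    """Number of standard deviations between a score and the chance mean."""
    return (score - mean) / sd


# Invented chance scores, for illustration only (not LAMA calibration data).
toy_chance_scores = [6, 8, 10, 12, 14]
mean, sd = chance_statistics(toy_chance_scores)

z = z_score(30, mean, sd)
passes_cutoff = z >= 5.6  # default cutoff of 5.6 Z score units
```

With these invented numbers a score of 30 lies about 7.07 standard deviations above the chance mean and would pass the default cutoff.<P>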
When both query and target blocks are provided by the user the 
<A HREF="#OUTPUT">output</A> can also contain the column scores 
of each reported alignment and the <A HREF="#LAMA_HOW">PSSMs</A> 
of every compared block.
<P>
Pay attention to any error or warning messages. Most will probably 
have to do with the <A HREF="#LAMA_INPUT_FORMAT">format of the input</A>.
<P>
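For readers who post-process results, the sketch below parses one reported alignment line into its fields. It assumes the HTML links have already been stripped so that the line looks like the plain-text lines shown in this document. The regular expression, the function name and the field names are hypothetical and not part of LAMA.

```python
import re

# Layout assumed from the sample output:
# block1 from : to and block2 from : to (length) score N ( Z  expected)
HIT_PATTERN = re.compile(
    r"^(\S+)\s+(\d+)\s*:\s*(\d+)\s+and\s+(\S+)\s+(\d+)\s*:\s*(\d+)\s+"
    r"\((\d+)\)\s+score\s+(-?\d+)\s+\(\s*(-?[\d.]+)\s+([\d.eE+-]+)\)"
)


def parse_hit(line):
    """Return a dict describing one alignment line, or None if it does not match."""
    m = HIT_PATTERN.match(line.strip())
    if m is None:
        return None
    return {
        "block_1": m.group(1),
        "from_1": int(m.group(2)),
        "to_1": int(m.group(3)),
        "block_2": m.group(4),
        "from_2": int(m.group(5)),
        "to_2": int(m.group(6)),
        "length": int(m.group(7)),
        "score": int(m.group(8)),
        "z_score": float(m.group(9)),
        "expected": float(m.group(10)),
    }


line = "BL00504A    2 :  20 and BL00677A    2 :  20 (19) score  51 (10.0  0.0e+00)"
hit = parse_hit(line)
```

A simple consistency check on a parsed line is that `to - from + 1` equals the alignment length for both blocks.<P>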

<A NAME="EVALUATING_SCORES"><H3>Evaluating LAMA alignment scores</H3></A>
The alignment score is the average of the 
<A HREF="#Pr">column scores</A> in the alignment multiplied by 100. 
Since the column scores have a range of -1 to 1 the alignment score 
will range from -100 to 100. An alignment score of 46 means 
that on average the aligned positions had a correlation coefficient
of 0.46. <I>The significance of the alignment score depends on the 
length of the compared blocks.</I> Alignments between longer blocks 
will tend to be longer and have higher scores. 
The <A HREF="#Z_SCORE">Z score</A> and 
<A HREF="#EXPECTED">expected number</A> let us estimate the 
<A NAME="SCORE_SIGNIF">significance of the scores</A> 
and compare alignments of different lengths. 
The higher the Z score, the less likely the alignment is due
to chance. <I>How unlikely depends on the number of blocks searched.
The more blocks searched, the greater the probability of finding 
high scores by chance.</I> For example, the output of the calibration with the 
<A HREF="#SHUFFLED_SCORES">shuffled blocks</A> contained 7 million 
scores but no alignments with Z scores greater than 8.3. 
Hence an alignment with a Z score equal to or higher than 8.3 
is unlikely by chance in a comparable or smaller number 
of alignments. The expected number shows this directly. 
The expected number is shown for searching 5000 blocks since version 9.1 of the
<A HREF="/blocks/help/blocks_release.html">Blocks Database</A>
contains 3300 blocks. For example, searching this release of 
the Blocks Database and finding an alignment expected to appear 
1.8e-01 times (0.18) suggests that this alignment is not due to chance.
Alignments with expected occurrences of 7.5e-03 or even 0 are almost
certainly genuine (or due to <A HREF="#BIASED_BLOCKS">biassed blocks</A>,
see <A HREF="#TABLE1">below</A>).<BR>

 A relation between two families by a single pair of blocks with a
high Z score is termed a <STRONG>single hit</STRONG>.
However, protein families often have a number of blocks. 
A <STRONG>multiple hit</STRONG> is when two or more block pairs 
from the same two families are similar:
<STRONG><PRE>
                                               multiple hit
     Family 1, blocks 1A, 1B, 1C, 1D.         1A=2B + 1D=2C
     Family 2, blocks 2A, 2B, 2C.
</PRE></STRONG>
We expect the order of the blocks in the hit to be the same in both 
families (in this example 1A -> 1D and 2B -> 2C).<BR>


 Individual block pairs whose Z scores are, by themselves, likely 
by chance can still indicate a genuine relation if they 
are in a multiple hit. The shuffled blocks scores contained 
no single hit with a Z score above 8.3 and no multiple hits 
with Z scores above 5.6. Hence genuine relationships can also 
be indicated by <I>several</I> alignments whose Z scores are 
<I>individually</I> expected to occur by chance.<P>
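The single-hit and multiple-hit rules can be sketched as a small grouping step over reported block pairs, using the 5.6 and 8.3 thresholds given in this document. The sketch assumes that a family name is the block name without its trailing letter (for example BL00504A belongs to BL00504), which matches the block names shown here but is an assumption. The function names are hypothetical, and the sketch does not check that the blocks occur in the same order in both families.

```python
import string
from collections import defaultdict


def family_of(block_name):
    """'BL00504A' -> 'BL00504'; names without a trailing letter are unchanged."""
    return block_name.rstrip(string.ascii_uppercase)


def classify_family_hits(hits, save_cutoff=5.6, single_cutoff=8.3):
    """Group block-pair hits by family pair and label them.

    hits is an iterable of (block_1, block_2, z_score). Family pairs with
    more than one saved score are labelled 'multiple'; pairs with one saved
    score above single_cutoff are labelled 'single'; others are omitted.
    """
    saved = defaultdict(list)
    for block_1, block_2, z in hits:
        if z > save_cutoff:
            key = tuple(sorted((family_of(block_1), family_of(block_2))))
            saved[key].append(z)
    result = {}
    for key, z_values in saved.items():
        if len(z_values) > 1:
            result[key] = "multiple"
        elif z_values[0] > single_cutoff:
            result[key] = "single"
    return result


# The two flavoprotein hits quoted in the Examples section.
flavoprotein_hits = [
    ("BL00504A", "BL00677A", 10.0),
    ("BL00504D", "BL00677D", 5.5),
]
default_result = classify_family_hits(flavoprotein_hits)
relaxed_result = classify_family_hits(flavoprotein_hits, save_cutoff=5.0)
```

With the default thresholds the flavoprotein families are related by a single hit. Lowering the saving cutoff so that the second block pair is kept turns the same relation into a multiple hit.<P>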

When comparing blocks against a database the Z score cutoff is set to 5.6, 
corresponding to an expected 0.385 chance occurrences when searching 5000 blocks.
When both query and target blocks are provided other cutoffs can be 
<A HREF="#CUTOFF">chosen</A>.
<P>
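The figures quoted in this document are consistent with the expected number being the fraction of chance scores at or above a given Z score multiplied by the number of blocks searched: the sample output places the 5.6 cutoff in the top 7.7e-05 of chance scores, and 7.7e-05 times 5000 blocks gives 0.385. The sketch below encodes that reading. It is an inference from these two numbers rather than a description of LAMA's code, and the function name is hypothetical.

```python
def expected_chance_hits(tail_fraction, blocks_searched=5000):
    """Expected number of chance alignments at or above a score.

    tail_fraction is the fraction of chance (shuffled-block) scores that
    reach the score, taken from the calibration percentiles.
    """
    return tail_fraction * blocks_searched


# Value reported in the sample output header for the 5.6 Z score cutoff.
expected_at_cutoff = expected_chance_hits(7.7e-05)
```

Passing a different `blocks_searched` rescales the estimate for a smaller or larger search.<P>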

False positive (high score but no relation) and false negative 
(low score but genuine relation) hits are still possible and
biological knowledge and common sense should be used. 
<A NAME="BIASED_BLOCKS">Compositionally</A> 
biassed blocks (consisting of sequence segments rich in a few amino 
acids or short repeats) are a common cause for false positive hits. 
You can check if a block is biassed <A HREF="/blocks/biassed_blocks.html">here</A>.
False negative hits can be caused by misalignment in the blocks.
<P>

<A NAME="TABLE1">E</A>ach entry in the 
<A HREF="/blocks/about_blocks.html#blocks">Blocks Database</A>
version 8.6 (3174 blocks from 858 protein families)
was searched against the other entries in the database. 
All block pairs with Z scores larger than 5.6 were saved. 
Protein families related by more than one saved score were
considered as multiple hits and alignments with Z scores 
above 8.3 as single hits. This resulted in 141 pairs of families. 
Eighty percent of these were 
identified as genuine relationships (true positives) according to the 
family descriptions, by sharing common sequences, or by detailed 
examination. Compositional bias was responsible for another eight percent 
of the high scores. The remaining twelve percent of the high scores could 
not be classified as either genuine or false based on available evidence.<P>


<TABLE BORDER WIDTH=532>
<CAPTION>Distribution of top scoring family pairs</CAPTION>
<TR VALIGN=top><TD>Relation type</TD><TD>Genuine(1)</TD><TD>Biassed<BR>Composition</TD><TD>Unknown</TD><TD><B>Total</B></TD></TR>
<TR VALIGN=top><TD><PRE>Multiple block hits- independent(2)</PRE></TD><TD><PRE>  24 </PRE></TD><TD><PRE>  -</PRE></TD><TD><PRE>  1 </PRE></TD><TD><PRE><B>  25 </B></PRE></TD></TR>
<TR VALIGN=top><TD><PRE>                   - repeats(3)</PRE></TD><TD><PRE>  11 </PRE></TD><TD><PRE>  6 </PRE></TD><TD><PRE>  9 </PRE></TD><TD><PRE><B>  26 </B></PRE></TD></TR>
<TR VALIGN=top><TD><PRE>                   - inner repeats(4)</PRE></TD><TD><PRE>  15 </PRE></TD><TD><PRE>  4 </PRE></TD><TD><PRE>  2 </PRE></TD><TD><PRE><B>  21 </B></PRE></TD></TR>
<TR VALIGN=top><TD><PRE>Single block hits</PRE></TD><TD><PRE>  63</PRE></TD><TD><PRE>  1</PRE></TD><TD><PRE>  5</PRE></TD><TD><PRE><B>  69</B></PRE></TD></TR>
<TR VALIGN=top><TD><B>Total</B></TD><TD><PRE><B> 113</B></PRE></TD><TD><PRE><B> 11</B></PRE></TD><TD><PRE><B> 17</B></PRE></TD><TD><PRE><B> 141</B></PRE></TD></TR>
<TR VALIGN=top><TD><B>Fraction</B></TD><TD><PRE><B>  80%</B></PRE></TD><TD><PRE><B>  8%</B></PRE></TD><TD><PRE><B> 12%</B></PRE></TD><TD></TD></TR>
</TABLE>
<BR>
<PRE>
(1) Genuine relations were identified by the families' PROSITE descriptions,
    detailed analysis of the literature or by sharing common sequences 
    (22 of the single and independent-multiple hits).
(2) An independent multiple hit is two different protein families 
    related by two or more different block pairs.
(3) A repeat multiple hit is two different protein families where a 
    block from one family is similar to two or more blocks from the 
    other family.
(4) An inner-repeat multiple hit is a case where the similarities are 
    between blocks from the same family.
</PRE>

<A NAME="EXAMPLES"><H2>Examples</H2></A>
<UL>
<LI><A NAME="FLAVOPROTEINS"><H3>Flavoproteins FAD binding and catalytic sites</H3></A><P>
     A comparison of all the Blocks Database v8.6 entries with each other 
found the following hit between FAD flavoprotein subunits from two 
oxidoreductase enzyme complexes, BL00504 - succinate dehydrogenases 
(Sdh) and fumarate reductases (Frd) and BL00677 - D-amino acid oxidases (DAO):
<PRE>
                                            alignment     Z-score  expected number for
block 1   from:to       block 2   from:to   length                 searching 5000 blocks
BL00504A    2 :  20 and BL00677A    2 :  20 (19) score  51 (10.0  0.0e+00) [<A HREF="/blocks-bin/LAMA_logos?/blocks/help/LAMA/BL00504.dat+BL00504A+2+/blocks/help/LAMA/BL00677.dat+BL00677A+2+19">logos</A> <A HREF="/blocks/about_logos.html">?</A>]
</PRE>
A comparison with a lower cutoff found another hit supporting the first one:
<PRE>
BL00504D    3 :  35 and BL00677D   17 :  49 (33) score  26 ( 5.5  5.1e-01) [<A HREF="/blocks-bin/LAMA_logos?/blocks/help/LAMA/BL00504.dat+BL00504D+3+/blocks/help/LAMA/BL00677.dat+BL00677D+17+33">logos</A> <A HREF="/blocks/about_logos.html">?</A>]
</PRE>

Sequence annotations and a literature search revealed that block BL00504A 
is the FAD-binding site and BL00504D is the active site (Birch-Machin 
<I>et al</I>., 1992) of the Sdh/Frd flavoproteins. Block BL00677A is 
the FAD-binding site of the DAO proteins. The FAD AMP-binding sites in 
both families are beta-alpha-beta ADP binding folds and were already 
noted as such (Birch-Machin <I>et al</I>., 1992; Schulz <I>et al</I>., 1982). 
This explains the first hit. 
<P>
     The DAO BL00677D block has a conserved histidine important for 
enzymatic activity of pig DAO (Miyano <I>et al</I>., 1991). This histidine 
is aligned with a conserved and essential histidine in the Sdh/Frd 
flavoproteins catalytic site (Birch-Machin <I>et al</I>., 1992; Schroder 
<I>et al</I>., 1991). Other positions in these aligned regions are also 
similar (column scores 0.31 to 0.98). The dissimilar positions have 
column scores close to zero (0.04 to -0.14). This finding suggests 
that the active site of DAO flavoproteins is in the BL00677D region with 
the conserved histidine as the crucial residue.
<P>
      BLAST and FASTA searches of the SwissProt protein database could 
not identify this similarity. No sequence from one family identified 
any sequence from the other family. Optimal local alignments of all the 
sequence pairs from the two families had scores expected by chance. 
Searching the Blocks Database with the sequences from the two families 
identified the relation between the families with 6 Sdh/Frd flavoprotein 
sequences (multiple hits with 98.1 to 76.2 percentiles of scores with 
shuffled sequence queries and P values of 8.4*10<SUP>-3</SUP> to 1.1*10<SUP>-1</SUP>) but not 
with the other two sequences from that family or any of the sequences 
from the DAO family (single hits with less than 60.0 score percentiles).
<P>
<IMG SRC="LAMA/LAMA_flavoproteins.gif" HEIGHT=555 WIDTH=503>
<PRE>
Suggested catalytic site of DAO flavoproteins.
A, positions 17-49 of DAO flavoproteins (block BL00677D) aligned with
the catalytic region of Sdh/Frd flavoproteins (positions 3-35 of block
BL00504D). The histidines important for the enzymes' catalytic activity
are outlined (the histidine in sequence DHSA_BACSU is misaligned due to
a two aa insertion). The start and end coordinates flank the sequences.
B, the column scores of the alignment.
</PRE>
<P>

Birch-Machin, M. A., Farnsworth, L., Ackrell, B. A., Cochran, B., Jackson, S.,
 Bindoff, L. A., Aitken, A., Diamond, A. G. & Turnbull, D. M. (1992). 
The sequence of the flavoprotein subunit of bovine heart succinate 
dehydrogenase. <I>J. Biol. Chem.</I> <B>267</B>, 11553-11558.<P>
Miyano, M., Fukui, K., Watanabe, F., Takahashi, S., Tada, M., Kanashiro, M. & 
Miyake, Y. (1991). Studies on Phe-228 and Leu-307 recombinant mutants of 
porcine kidney D-amino acid oxidase: expression, purification, and 
characterization. <I>J. Biochemistry</I> <B>109</B>, 171-177.<P>
Schroder, I., Gunsalus, R. P., Ackrell, B. A., Cochran, B. & Cecchini, G. 
(1991). Identification of active site residues of Escherichia coli fumarate 
reductase by site-directed mutagenesis. <I>J. Biol. Chem.</I> <B>266</B>, 
13572-13579.<P>
Schulz, G. E., Schirmer, R. H. & Pai, E. F. (1982). FAD-binding site of 
glutathione reductase. <I>J. Mol. Biol.</I> <B>160</B>, 287-308.
<HR>
<LI><A NAME="ST_CD59"><H3>Snake toxins and the CD59 extracellular domain</H3></A><P>

Conserved regions from snake toxins and the CD59 extracellular domain were found
similar to each other. The alignment score is not very striking but the two families 
seem to be quite dissimilar. What is the connection between snake toxins, small 
extracellular proteins that bind to nerve receptors, and the CD59 domain, a domain 
that is found in one or more copies on GPI-linked cell surface glycoproteins? 
A closer look at the alignment was taken by requesting to see the column scores.
These scores are shown above the score line for each of the 12 alignment positions 
(8,3 to 19,14):
<PRE>
Column scores for optimal alignment of <A HREF="/blocks-bin/getblock.sh?BL00272#BL00272B">BL00272B</A> and <A HREF="/blocks-bin/getblock.sh?BL00983#BL00983B">BL00983B</A> -
  8, 3   9, 4  10, 5  11, 6  12, 7  13, 8  14, 9  15,10  16,11  17,12  18,13 19,14
 0.262  0.169  0.138  0.286  0.995  1.000  0.368  0.224  0.986 -0.067  1.000 1.000
<A HREF="/blocks-bin/getblock.sh?BL00272#BL00272B">BL00272B</A>    8 :  19 and <A HREF="/blocks-bin/getblock.sh?BL00983#BL00983B">BL00983B</A>    3 :  14 (12) score  53 ( 6.5  6.0e-02) [<A HREF="/blocks-bin/LAMA_logos?/howard/blocks/bin/blocks.dat+BL00272B+8+/howard/blocks/bin/blocks.dat+BL00983B+3+12">logos</A> <A HREF="/blocks/about_logos.html">?</A>]
</PRE>

 Five of the positions [(12,7), (13,8), (16,11), (18,13) and (19,14)]
have very high column scores (0.986-1.000), 
indicating identical or almost identical amino acid distributions in these
column pairs. The other positions contribute less to the alignment score,
and position (17,12) has a slightly negative score, actually detracting from
the alignment.<P>
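The score of 53 on the score line can be related to the column scores listed above it. The following Python sketch assumes that the reported score is the mean column score multiplied by 100 and rounded. The factor of 100 and the 0.98 cutoff used here for a "very high" column score are assumptions made for illustration, not documented LAMA parameters.<P>

```python
# Column scores copied from the LAMA output for BL00272B (8-19) and BL00983B (3-14).
column_scores = [0.262, 0.169, 0.138, 0.286, 0.995, 1.000,
                 0.368, 0.224, 0.986, -0.067, 1.000, 1.000]

# Pair each score with its (BL00272B position, BL00983B position).
positions = list(zip(range(8, 20), range(3, 15)))

# Assumption: reported score = 100 * (sum of column scores / alignment length).
raw_score = sum(column_scores)
reported = round(100 * raw_score / len(column_scores))

# Assumption: 0.98 is an illustrative cutoff for "very high" column scores.
HIGH_CUTOFF = 0.98
high = [pos for pos, s in zip(positions, column_scores) if s >= HIGH_CUTOFF]
negative = [pos for pos, s in zip(positions, column_scores) if 0 > s]

print("reported score:", reported)
print("very high columns:", high)
print("negative columns:", negative)
```

Under these assumptions the sketch gives a score of 53, the five high-scoring column pairs named above, and (17,12) as the only negative pair.<P>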
Upon requesting to see the PSSMs of the blocks (below) or their aligned 
logos (link to 'logos' above) you will note 
that 3 of the alignment positions contributing to the score are highly 
conserved cysteine residues. This raises the possibility of identical 
patterns of disulphide bonds in both regions. We might give this 
alignment more attention since disulphide bonds are known to be well
conserved even between distantly related sequences. 
 More information can be found by following the block links to the
<A HREF="/blocks">Blocks Database</A> 
entries. Each family is accompanied by its 
<A HREF="http://www.ebi.ac.uk/interpro/">InterPro</A> 
annotation and the multiple alignment of each block can be 
viewed as a graphical 
<A HREF="/blocks/help/about_logos.html">sequence logo</A>. 
The <A HREF="/blocks/help/LAMA/LAMA_cardiotoxin+CD59.JPEG">structures of both proteins</A> 
are known and confirm their relation. (
<A HREF="http://www.expasy.ch/sw3d/">SWISS-3DIMAGE</A> 
was the source for these images of the structures.)
<P>
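The conserved cysteine positions can also be picked out programmatically. The Python sketch below uses only the "C" rows of the two PSSMs shown in this example and the alignment coordinates from the LAMA output. The cutoff of 90 for a "highly conserved" cysteine is an assumption chosen for illustration.<P>

```python
# "C" (cysteine) rows copied from the PSSMs of BL00272B and BL00983B.
cys_bl00272b = [87, 12, 12, 11, 0, 0, 0, 0, 21, 0, 6, 100, 100, 0, 0, 0, 0, 99, 0]
cys_bl00983b = [0, 0, 0, 0, 0, 0, 91, 100, 0, 0, 0, 0, 100, 0]

# Alignment reported by LAMA: BL00272B 8-19 with BL00983B 3-14 (12 positions).
start_a, start_b, length = 8, 3, 12

# Assumption: a PSSM value of 90 or more marks a highly conserved cysteine.
CYS_CUTOFF = 90

conserved_cys = []
for i in range(length):
    pos_a, pos_b = start_a + i, start_b + i
    # PSSM columns are numbered from 1, Python lists from 0.
    if cys_bl00272b[pos_a - 1] >= CYS_CUTOFF and cys_bl00983b[pos_b - 1] >= CYS_CUTOFF:
        conserved_cys.append((pos_a, pos_b))

print(conserved_cys)
```

With this cutoff the aligned pairs (12,7), (13,8) and (18,13) are reported, which are three of the five high-scoring positions.<P>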
<PRE>
PSSM of <A HREF="/blocks-bin/getblock.sh?BL00272#BL00272B">BL00272B</A>

  |                                       1   1   1   1   1   1   1   1   1   1
  |   1   2   3   4   5   6   7   8   9   0   1   2   3   4   5   6   7   8   9
--+----------------------------------------------------------------------------
A |   0   0   0  13   0   3   0   0   1   0   2   0   0   0   1   0   2   0   0
C |  87  12  12  11   0   0   0   0  21   0   6 100 100   0   0   0   0  99   0
D |   0   0   3   2  11   9   3   2   6   0   0   0   0   3   0  82  10   0   2
E |   0   5   2   8   3   5   2   9   4   6   9   0   0   7   0   5   5   0   0
F |   2   3   9   0   0   2   6   4   2   2   0   0   0   0   0   0   0   0   0
G |   1   1   1   1   3   0  24   8   7   0   0   0   0   2   3   0   1   0   2
H |   0   0   2   0   0   4   2   0   4   0  13   0   0   6   0   0   0   0   0
I |   0   0   4   0   2   1   0  17   7  30   3   0   0   0   6   0   0   0   0
K |   0   3  22   4  30   3   8   5  17   0  24   0   0  16   1   0  36   0   0
L |   0   0   1   0   1   3   8   3   0  14   5   0   0   0   0   0   2   0   0
M |   0   0   0   0  11   2   9   0   3   0   3   0   0   0   0   0   0   0   0
N |   0   0   0   5   2   7   2   2   2   0   2   0   0  16   0  13  22   1  96
P |   6  65   9   2   3  23   6   8   2   5   0   0   0   0   0   0   0   0   0
Q |   0   2   6   0   0   1   0   1   6   0   8   0   0   3   0   0   0   0   0
R |   0   2   4  15   8   2   6   6   2   3   8   0   0  10   0   0  19   0   0
S |   1   4   4   6  13   3   0   4   6   2   1   0   0  19  16   0   0   0   0
T |   3   4  14   5   5   5   1   5   0   4   3   0   0  18  72   0   0   0   0
V |   0   0   3  28   5   1   0  19   1  22   6   0   0   0   0   0   1   0   0
W |   0   0   0   0   0  22   0   0   0   0   0   0   0   0   0   0   0   0   0
X |   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y |   0   0   5   0   3   2  23   7   9  11   7   0   0   0   0   0   2   0   0
- |   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0


PSSM of <A HREF="/blocks-bin/getblock.sh?BL00983#BL00983B">BL00983B</A>

  |                                       1   1   1   1   1
  |   1   2   3   4   5   6   7   8   9   0   1   2   3   4
--+--------------------------------------------------------
A |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
C |   0   0   0   0   0   0  91 100   0   0   0   0 100   0
D |   0   0   0   0   0   0   0   0   0   0  76   0   0   0
E |   0  17  29   0  20   0   0   0   0  42   0   0   0   0
F |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
G |  10   0   0   0   0   0   0   0  10   0   0   0   0   0
H |   0   0   0   0   0  39   0   0   0   0   0   0   0   0
I |  25   0   0   0   0   0   0   0   0   0   0   0   0   0
K |   0   0   0  30   0   0   0   0  28  23   0   0   0   0
L |   0   0  48   0   0   0   0   0   0   9   0 100   0   0
M |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
N |  25  23   0  13   0   0   0   0   0   0  24   0   0 100
P |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q |   0  15   0   0   0  13   0   0  29   0   0   0   0   0
R |   0  12   0  12   0   0   0   0  24   0   0   0   0   0
S |   0  20   0  11   0  18   0   0   9  12   0   0   0   0
T |  23  13   0  23  35  10   0   0   0  14   0   0   0   0
V |  16   0  22  11   0   8   9   0   0   0   0   0   0   0
W |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
X |   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y |   0   0   0   0  45  13   0   0   0   0   0   0   0   0
- |   0   0   0   0   0   0   0   0   0   0   0   0   0   0

</PRE>
("X" specifies unknown amino acids.)
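The column scores can be approximated directly from the PSSMs above. The Python sketch below assumes that a column score is the Pearson correlation coefficient between two PSSM columns, taken over the 20 standard amino acids. This section does not state the formula, so both the formula and the choice of alphabet (leaving out the "X" and "-" rows) are assumptions. The PSSM values shown are rounded, so small differences from the LAMA output are possible.<P>

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: "X" and "-" rows are ignored


def dense(column):
    """Expand a sparse {amino acid: value} column into a 20-element list."""
    return [column.get(aa, 0) for aa in AMINO_ACIDS]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                    * sum((y - mean_y) ** 2 for y in ys))
    return num / den


# Non-zero entries of selected columns, copied from the PSSMs above.
bl00272b_col12 = {"C": 100}
bl00983b_col7 = {"C": 91, "V": 9}
bl00272b_col16 = {"D": 82, "E": 5, "N": 13}
bl00983b_col11 = {"D": 76, "N": 24}

score_12_7 = pearson(dense(bl00272b_col12), dense(bl00983b_col7))
score_16_11 = pearson(dense(bl00272b_col16), dense(bl00983b_col11))

print(round(score_12_7, 3), round(score_16_11, 3))
```

For these two column pairs the sketch gives 0.995 and 0.986, in agreement with the column scores reported for positions (12,7) and (16,11).<P>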
<HR>
<LI><A NAME="IS30"><H3>IS30 transposases DNA-binding domain</H3></A><P>

        Excision and insertion of bacterial insertion sequence elements (IS)
require the activity of a transposase protein sometimes encoded by the
ISs. The IS30 transposase family (Dong et al., 1992) is represented by
five blocks in BLOCKS 8.6. A region of 21 positions from the first block
had high scores (Z scores 6.7 to 8.8) only to helix-turn-helix
DNA-binding motifs (hth) from four protein families (see the
<A HREF="#FIGURE_HTH_GRAPH">figure</A> in the
<A HREF="#HTH">next example</A>).
Hth DNA binding motifs occur in many proteins that bind specific DNA
sequences (Pabo & Sauer, 1992).<P>
        BLAST searches of the SwissProt protein database with the IS30
sequences did not identify any protein with a known hth region. Searching
the Blocks Database with the IS30 sequences gave high scores with hth
blocks for two of the sequences (98.1 and 93.1 percentiles of scores
with shuffled sequence queries (Henikoff & Henikoff, 1994)). The other
two sequences had low scores with hth blocks (30.8 and 18.1 score
percentiles) and higher scores with non-hth blocks. However, each of the
transposases' putative DNA-binding regions was detected by the method of
Dodd and Egan (Dodd & Egan, 1990) as an almost certain hth domain.<P>
        Classification of the first IS30 block as a hth motif is supported by
the finding that the N-terminal region of an IS30 transposase,
containing the putative hth DNA-binding region, binds the IS30 element
(Stalder et al., 1990).<P>

<IMG SRC="LAMA/LAMA_IS30.gif" HEIGHT=129 WIDTH=548>
<PRE>
Hth-like region in IS30 transposases.
Block BL01043A of the IS30 transposases family. The regions similar to
the hth motifs in the block to block searches are underlined. The start
and end coordinates flank the sequences. The diagram shows the suggested
position of the hth motifs found by the hth algorithm (Dodd & Egan, 1990). 
The algorithm scores for hth motifs were 5.19 standard deviation
units (SD), corresponding to 100% probability for TRA1_STRSL, 5.95 SD
and 100% for TRA4_BACFR, 4.13 SD and 90% for TRA8_ALCEU, and 5.72 SD and
100% for TRA8_ECOLI.
</PRE>

Dodd, I. B. & Egan, J. B. (1990). Improved detection of helix-turn-helix
DNA-binding motifs in protein sequences. <I>Nucleic Acids Res.</I> <B>18</B>, 5019-5026.<P>

Dong, Q., Sadouk, A., van der Lelie, D., Taghavi, S., Ferhat, A.,
Nuyten, J. M., Borremans, B., Mergeay, M. & Toussaint, A. (1992).
Cloning and sequencing of IS1086, an Alcaligenes eutrophus insertion
element related to IS30 and IS4351. <I>J. Bacteriol.</I> <B>174</B>, 8133-8138.<P>

Henikoff, S. & Henikoff, J. G. (1994). Protein family classification
based on searching a database of blocks. <I>Genomics</I> <B>19</B>, 97-107.<P>

Pabo, C. O. & Sauer, R. T. (1992). Transcription factors: structural
families and principles of DNA recognition. <I>Annu. Rev. Biochem.</I> <B>61</B>,
1053-1095.<P>

Stalder, R., Caspers, P., Olasz, F. & Arber, W. (1990). The N-terminal
domain of the insertion sequence 30 transposase interacts specifically
with the terminal inverted repeats of the element. <I>J. Biol. Chem.</I> <B>265</B>,
3757-3762.<P>

<HR>
<LI><A NAME="HTH"><H3>Hth motifs in the Blocks Database</H3></A><P>

     In comparing the entries in the Blocks Database v8.6 among 
themselves, all fourteen hth blocks had high scores with two or more 
other hth blocks (<A HREF="#FIGURE_HTH_GRAPH">Figure</A>). 
The two high scoring non-hth blocks could be 
distinguished by relating to a single hth block and having lower scores 
than the ones between the hth blocks. The blocks are from four 
types of protein families - bacterial regulatory proteins, homeobox 
domain proteins, sigma bacterial transcription initiation factors and IS 
transposases. Manual inspection of the Prosite annotation of the protein 
families in the Blocks Database and of the blocks themselves found no 
other hth blocks in the database.<P>
     The hth blocks included different numbers of sequences, from 4 to 185. 
There was no correlation between the number of sequences in a block and 
its relation to other blocks. This suggests that even blocks with 4-6 
sequences can give a correct representation of conserved protein domains. 
More than 90% of the blocks in the database used had more than four 
sequences. This fraction is increasing with each release (>94% in BLOCKS 
9.0) as the number of new protein sequences is higher than the number of 
new protein families (Green <I>et al</I>., 1993; Koonin <I>et al</I>., 
1995; Koonin <I>et al</I>., 1994).<P>

     Hth blocks illustrate the problem of distinguishing genuine 
relationships from chance ones and suggest a solution. Two of the hth 
blocks (BL00622 and BL01063B) lie below the threshold for detecting 
single-hit relations (Z score >=8.3, bold lines in 
<A HREF="#FIGURE_HTH_GRAPH">Figure</A>). Protein 
families with hth-motifs usually have no other common blocks to support the 
relation between the hth blocks. However, hth motifs are found in several 
protein families. These hth blocks all have high scores with each other, but 
not all these scores are high enough to identify genuine relationships by 
themselves. Nevertheless, blocks with a number of such scores to known hth 
blocks can be identified as hth blocks too. The two non-hth blocks have high 
scores to single hth blocks, and do not form part of the connected graph. An 
analogous strategy is the basis for detecting weak similarities in 
single-sequence alignments using the BLAST3 program (Altschul & Lipman, 1990).
<P>

<A NAME="FIGURE_HTH_GRAPH"><IMG SRC="LAMA/LAMA_hth_graph.gif" HEIGHT=444 WIDTH=693></A>
<PRE>
High scores of helix-turn-helix DNA binding blocks.
All 14 hth blocks found in BLOCKS 8.6 and their high scoring relationships 
with each other (true positives) and with other blocks (false positives, 
outward pointing lines). Each block had different sequences except two pairs 
of homeobox blocks that had common sequences (BL00027 with BL00032B and with 
BL00035B). Lines show scores above the 5.6 Z score cutoff. Thick lines 
correspond to scores above the 8.3 Z score cutoff. BRP - bacterial 
regulatory proteins.<P>
</PRE>
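The strategy described above, accepting a block either on one strong score or on several weaker scores to known hth blocks, can be sketched in Python as follows. The 5.6 and 8.3 Z score cutoffs are the ones given in this example. The block names, the Z scores in the sample data and the requirement of at least two links are hypothetical values chosen for illustration, not LAMA output.<P>

```python
WEAK_CUTOFF = 5.6    # Z score cutoff for lines drawn in the figure
STRONG_CUTOFF = 8.3  # Z score cutoff for single-hit relations


def links_to_known(candidate, z_scores, known):
    """Return (block, Z) pairs linking the candidate to known hth blocks."""
    links = []
    for (a, b), z in z_scores.items():
        if candidate not in (a, b):
            continue
        other = b if a == candidate else a
        if other in known and z >= WEAK_CUTOFF:
            links.append((other, z))
    return links


def classify(candidate, z_scores, known, min_links=2):
    """Classify a candidate by its scores to known hth blocks.

    min_links=2 is an assumption mirroring "two or more other hth blocks".
    """
    links = links_to_known(candidate, z_scores, known)
    if any(z >= STRONG_CUTOFF for _, z in links):
        return "single-hit"
    if len(links) >= min_links:
        return "multi-hit"
    return "unsupported"


# Hypothetical block names and scores, for illustration only.
known_hth = {"hth1", "hth2", "hth3"}
z_scores = {
    ("cand_a", "hth1"): 6.1,
    ("cand_a", "hth2"): 7.0,
    ("cand_b", "hth3"): 6.4,
    ("cand_c", "hth1"): 9.2,
}

for name in ("cand_a", "cand_b", "cand_c"):
    print(name, classify(name, z_scores, known_hth))
```

In this toy data cand_a is accepted on two sub-threshold links, cand_c on a single strong link, and cand_b, which relates to only one hth block with a sub-threshold score, is left unsupported.<P>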

Since all the hth blocks are similar to one another, we examined how well 
one composite hth block would identify other hth blocks. The 
<A HREF="http://www.ncbi.nlm.nih.gov/Complete_Genomes/Ecoli/README">
ecmot database</A> (Koonin et al., 1995) contains such a 
<A HREF="http://www.ncbi.nlm.nih.gov/cgi-bin/Complete_Genomes/mot2html?EC0157">
composite hth block</A>, with 609 sequence segments from many hth families. 
The <A HREF="#EC0157_LOGO">graphical representation</A> 
(<A HREF="/blocks/about_logos.html">logo</A>) of this block 
illustrates the conservation in each of its positions. This and the 
avoidance of particular amino acids at specific positions can also be seen in 
the <A HREF="LAMA/EC0157_.pssm.html">PSSM of block EC0157</A>.
This block had high scores with 18 blocks in Blocks Database v8.6 
(<A HREF="#TABLE2">Table</A>). 
Fourteen of those are the hth blocks discussed above. All the 
hth blocks had high to extremely high scores, the lowest one expected to 
occur by chance 3.2e-3 times.<BR> 
(<A HREF="LAMA/EC0157_.blk">Here</A> you will find block 
EC0157 in a format you can use in a 
<A HREF="/blocks-bin/LAMA_search.sh?LAMA/EC0157_.blk">LAMA search</A>.)<P>

The four blocks at the end of the table have significantly lower scores 
(Z 5.6-6.5). These are non-hth blocks but their similarity to the 
composite hth block can be explained. Two of the blocks are from 
bacterial regulatory protein families, occurring C-terminal to the hth 
motifs. One is a hth-similar region from the araC family (Brunelle & 
Schleif, 1989) and the other corresponds to the 
<A HREF="LAMA/LAMA_lacIs.html">hth helix3 and DNA 
binding hinge helix in the <I>E. coli</I> lac repressor protein</A> (Lewis et 
al., 1996). Another block is from the S3 ribosomal proteins (BL00548A). 
This protein binds RNA, and it is interesting to note the recent report 
of the RNA binding activity by a hth domain (Dubnau & Struhl, 1996). The 
last non-hth block is from L-lactate dehydrogenase (LDH) proteins. LDHs 
do not bind DNA but the 
<A HREF="LAMA/LAMA_LDHs.html">crystal structure of the detected region 
(alpha-2f to Beta-G) is a helix-turn followed by a helix or strand in 
different proteins</A> (Abad Zapatero et al., 1987; Grau et al., 1981; Iwata 
& Ohta, 1993).<P>

<A NAME="EC0157_LOGO">
<IMG SRC="LAMA/EC0157_.PSSM.logo.jpeg" HEIGHT=520 WIDTH=760></A><P>

<A NAME="TABLE2"><B>Blocks similar to composite hth block</B></A> <B><A HREF="LAMA/EC0157_.blk">EC0157</A></B>
<TABLE BORDER>
<TR VALIGN=top><TH><PRE>Protein family (1)</PRE></TH><TH><PRE>Z  score</PRE></TH></TR>
<TR VALIGN=top><TD><PRE>'Homeobox' domain proteins</PRE></TD><TD><PRE>18.4</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>'Homeobox' antennapedia-type proteins</PRE></TD><TD><PRE>13.2</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>'POU' domain proteins</PRE></TD><TD><PRE>11.7</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP crp family</PRE></TD><TD><PRE>12.1</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP gntR family</PRE></TD><TD><PRE>12.4</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP lysR family</PRE></TD><TD><PRE>14.4</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP lacI family (2)</PRE></TD><TD><PRE>11.7</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP luxR family</PRE></TD><TD><PRE>12.4</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP arsR family</PRE></TD><TD><PRE> 8.0</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP deoR family</PRE></TD><TD><PRE> 8.7</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP tetR family</PRE></TD><TD><PRE>14.1</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>Sigma-54 factors family</PRE></TD><TD><PRE> 7.8</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>Sigma-70 factors ECF subfamily</PRE></TD><TD><PRE> 8.3</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>Transposases, IS30 family</PRE></TD><TD><PRE>11.2</PRE></TD></TR>
<TR VALIGN=top></TR>
<TR VALIGN=top><TD><PRE>BRP araC family</PRE></TD><TD><PRE> 6.5</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>BRP lacI family (2)</PRE></TD><TD><PRE> 6.6</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>Ribosomal S3 proteins</PRE></TD><TD><PRE> 5.8</PRE></TD></TR>
<TR VALIGN=top><TD><PRE>L-lactate dehydrogenase family</PRE></TD><TD><PRE> 5.8</PRE></TD></TR>
</TABLE>
<PRE>
(1) The family Blocks Database entry numbers are in the previous figure 
    except for BRP araC family - BL00041, L-lactate dehydrogenase - BL00064D 
    and Ribosomal S3 proteins - BL00548A.
    The non-hth blocks are separated at the end of the table.
(2) Two blocks from the lacI hth family are similar to the composite hth block -
    block BL00356A, the hth region, and block BL00356B, the following
    DNA-binding hinge region.
</PRE>

     Identifying all the hth regions in the Blocks Database illustrates 
the potential of the multiple alignment comparison method as an aid for 
annotating protein-family databases. Besides identifying the function of 
unknown regions, the approach outlined in this example can be useful in 
annotating databases that generate the multiple alignments automatically. 
Multiple alignments of characterized protein motifs (such as the hth, 
nucleotide binding folds or leucine zipper) could be used to identify other 
multiple alignments containing these motifs.<P>

Altschul, S. F. & Lipman, D. J. (1990). Protein database searches for multiple alignments. <I>Proc. Natl. Acad. Sci. USA</I> <B>87</B>, 5509-5513.<P>

Abad Zapatero, C., Griffith, J., Sussman, J. & Rossmann, M. (1987). 
Refined crystal structure of dogfish M4 apo-lactate dehydrogenase. 
<I>J Mol Biol</I> <B>198</B>, 445-467.<P>

Brunelle, A. & Schleif, R. (1989). Determining residue-base interactions 
between AraC protein and araI DNA. <I>J Mol Biol</I> <B>209</B>, 607-622.<P>

Dubnau, J. & Struhl, G. (1996). RNA recognition and translational 
regulation by a homeodomain protein. <I>Nature</I> <B>379</B>, 694-699.<P>

Grau, U., Trommer, W. & Rossmann, M. (1981). Structure of the active 
ternary complex of pig heart lactate dehydrogenase with S-lac-NAD at 2.7 
A resolution. <I>J Mol Biol</I> <B>151</B>, 289-307.<P>

Green, P., Lipman, D., Hillier, L., Waterston, R., States, D. & Claverie, J. M. 
(1993). Ancient conserved regions in new gene sequences and the protein databases. 
<I>Science</I> <B>259</B>, 1711-1716.<P>

Iwata, S. & Ohta, T. (1993). Molecular basis of allosteric activation of 
bacterial L-lactate dehydrogenase. <I>J Mol Biol</I> <B>230</B>, 21-27.<P>

Koonin, E., Tatusov, R. & Rudd, K. (1995). Sequence similarity analysis of 
Escherichia coli proteins: functional and evolutionary implications. 
<I>Proc Natl Acad Sci USA</I> <B>92</B>, 11921-11925.<P>

Koonin, E. V., Bork, P. & Sander, C. (1994). Yeast chromosome III: 
new gene functions. <I>EMBO J.</I> <B>13</B>, 493-503.<P>

Lewis, M., Chang, G., Horton, N. C., Kercher, M. A., Pace, H. C., 
Schumacher, M. A., Brennan, R. G. & Lu, P. (1996). Crystal Structure of 
the Lactose Operon Repressor and Its Complexes with DNA and Inducer. 
<I>Science</I> <B>271</B>, 1247-1254.<P>
<HR>
</UL>


<A NAME="SUPPLMNT"><H2>Supplements</H2></A>
To calibrate the LAMA scores, the 
<A HREF="/blocks/about_blocks.html#blocks">Blocks Database</A>
was purged of <A HREF="#BIASED_BLOCKS">biassed blocks</A>, the PSSMs of 
the remaining blocks were each shuffled, and the shuffled PSSMs were then compared against the 
blocks from the unshuffled database. The best score from each of
the resulting 7 million comparisons was saved. These scores are due to chance
and were used to estimate the significance of alignment scores between blocks.
The mean and variance of chance alignment scores depend on the length of the 
compared blocks. Longer blocks will give longer alignments and higher scores
by chance alone. Grouping the chance scores by the length of the shorter 
block in each comparison gave very similar score distributions. The mean 
and standard deviation of each group were used to transform each score into
a <A HREF="#Z_SCORE">Z score</A>. The percentiles of all these Z scores were
then calculated. These percentiles are used to estimate the 
<A HREF="#EXPECTED">expected number</A> of times each score should appear by chance, 
that is, not due to a genuine relationship.<P>

Following are links to tables with this data. Note that the scores in the
tables are the raw scores of the alignments. The scores shown in the LAMA
output are normalized by dividing the raw score by the alignment length.

<UL>
<LI><A HREF="LAMA/LAMA.Z_stat.html">Mean and standard deviation for scores expected by chance</A>
<LI><A HREF="LAMA/LAMA.ZVp.html">Percentile of Z scores expected by chance</A>
</UL><P>
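The calibration steps described above (grouping chance scores by the length of the shorter block, computing a mean and standard deviation per group, and converting scores to Z scores and percentiles) can be sketched in Python as follows. This is a minimal illustration, not the LAMA implementation. The function names and the toy data are hypothetical, and the use of the sample standard deviation is an assumption.<P>

```python
import statistics
from collections import defaultdict


def calibrate(chance_scores):
    """Group (shorter_block_length, score) pairs and return
    {length: (mean, standard deviation)} for each group."""
    groups = defaultdict(list)
    for length, score in chance_scores:
        groups[length].append(score)
    # Assumption: sample standard deviation (statistics.stdev).
    return {length: (statistics.mean(s), statistics.stdev(s))
            for length, s in groups.items()}


def z_score(score, length, calibration):
    """Transform a score into a Z score using its length group."""
    mean, sd = calibration[length]
    return (score - mean) / sd


def percentile(z, chance_z_scores):
    """Percentage of chance Z scores that are lower than z."""
    below = sum(1 for c in chance_z_scores if z > c)
    return 100.0 * below / len(chance_z_scores)


# Hypothetical chance scores: (length of shorter block, score).
chance = [(10, 1.0), (10, 2.0), (10, 3.0),
          (20, 4.0), (20, 6.0), (20, 8.0)]
calibration = calibrate(chance)

print(z_score(4.0, 10, calibration))   # score 4.0 against the length-10 group
print(z_score(8.0, 20, calibration))   # score 8.0 against the length-20 group
print(percentile(2.0, [0.0, 1.0, 2.0, 3.0]))
```

Scoring each alignment against its own length group is what makes Z scores from short and long blocks comparable, since longer blocks give higher scores by chance alone.<P>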

<A NAME="LAMA_CREDITS"><H2>Credits and citation</H2></A>
The multiple alignment comparison method and LAMA program were developed by
<A HREF="/~pietro">Shmuel Pietrokovski</A> 
in the lab of Steve Henikoff at the 
<A HREF="http://www.fhcrc.org">Fred Hutchinson Cancer Research Center</A>, 
<A HREF="http://www.cyberspace.com/bobk/">Seattle</A>.<P>
An article describing the method and its uses<BR>
"<STRONG>Searching Databases of Conserved Sequence Regions by 
 Aligning Protein Multiple-Alignments</STRONG>"<BR>
appeared in
<A HREF="http://www.oup.co.uk/oup/smj/journals/ed/titles/nar/Volume_24/Issue_19/6s0225_gml.abs.html">
Nucleic Acids Research 24(19) 3836-3845 (October 1996)</A>. 
This article should be cited in research using this method.<BR>
<HR>
<A HREF="/blocks">[Blocks home]</A> 
<A HREF="/blocks/blocks_search.html">[Block Searcher]</A>
<A HREF="/blocks/make_blocks.html">[Block Maker]</A>
<A HREF="/blocks-bin/getblock.sh">[Get Blocks]</A>
<A HREF="/blocks/block_formatter.html">[format a block]</A>
<A HREF="/blocks/biassed_blocks.html">[check for biassed blocks]</A>
<A HREF="/blocks-bin/LAMA_search.sh">[LAMA Searcher]</A>
<HR>
Page last modified <MODIFICATION_DATE>January 1997</MODIFICATION_DATE> 
(thanks to Liz G.Wiz for useful comments)

<Address>
<A HREF="/~pietro">Shmuel Pietrokovski</A>
</Address>