File: hhsuite-userguide.tex

package info (click to toggle)
hhsuite 3.0~beta3%2Bdfsg-3
  • links: PTS, VCS
  • area: main
  • in suites: buster
  • size: 18,700 kB
  • sloc: cpp: 26,859; perl: 12,238; ansic: 11,057; python: 3,011; makefile: 59; sh: 24
file content (2609 lines) | stat: -rwxr-xr-x 180,206 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
1544
1545
1546
1547
1548
1549
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
1665
1666
1667
1668
1669
1670
1671
1672
1673
1674
1675
1676
1677
1678
1679
1680
1681
1682
1683
1684
1685
1686
1687
1688
1689
1690
1691
1692
1693
1694
1695
1696
1697
1698
1699
1700
1701
1702
1703
1704
1705
1706
1707
1708
1709
1710
1711
1712
1713
1714
1715
1716
1717
1718
1719
1720
1721
1722
1723
1724
1725
1726
1727
1728
1729
1730
1731
1732
1733
1734
1735
1736
1737
1738
1739
1740
1741
1742
1743
1744
1745
1746
1747
1748
1749
1750
1751
1752
1753
1754
1755
1756
1757
1758
1759
1760
1761
1762
1763
1764
1765
1766
1767
1768
1769
1770
1771
1772
1773
1774
1775
1776
1777
1778
1779
1780
1781
1782
1783
1784
1785
1786
1787
1788
1789
1790
1791
1792
1793
1794
1795
1796
1797
1798
1799
1800
1801
1802
1803
1804
1805
1806
1807
1808
1809
1810
1811
1812
1813
1814
1815
1816
1817
1818
1819
1820
1821
1822
1823
1824
1825
1826
1827
1828
1829
1830
1831
1832
1833
1834
1835
1836
1837
1838
1839
1840
1841
1842
1843
1844
1845
1846
1847
1848
1849
1850
1851
1852
1853
1854
1855
1856
1857
1858
1859
1860
1861
1862
1863
1864
1865
1866
1867
1868
1869
1870
1871
1872
1873
1874
1875
1876
1877
1878
1879
1880
1881
1882
1883
1884
1885
1886
1887
1888
1889
1890
1891
1892
1893
1894
1895
1896
1897
1898
1899
1900
1901
1902
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
1921
1922
1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
2026
2027
2028
2029
2030
2031
2032
2033
2034
2035
2036
2037
2038
2039
2040
2041
2042
2043
2044
2045
2046
2047
2048
2049
2050
2051
2052
2053
2054
2055
2056
2057
2058
2059
2060
2061
2062
2063
2064
2065
2066
2067
2068
2069
2070
2071
2072
2073
2074
2075
2076
2077
2078
2079
2080
2081
2082
2083
2084
2085
2086
2087
2088
2089
2090
2091
2092
2093
2094
2095
2096
2097
2098
2099
2100
2101
2102
2103
2104
2105
2106
2107
2108
2109
2110
2111
2112
2113
2114
2115
2116
2117
2118
2119
2120
2121
2122
2123
2124
2125
2126
2127
2128
2129
2130
2131
2132
2133
2134
2135
2136
2137
2138
2139
2140
2141
2142
2143
2144
2145
2146
2147
2148
2149
2150
2151
2152
2153
2154
2155
2156
2157
2158
2159
2160
2161
2162
2163
2164
2165
2166
2167
2168
2169
2170
2171
2172
2173
2174
2175
2176
2177
2178
2179
2180
2181
2182
2183
2184
2185
2186
2187
2188
2189
2190
2191
2192
2193
2194
2195
2196
2197
2198
2199
2200
2201
2202
2203
2204
2205
2206
2207
2208
2209
2210
2211
2212
2213
2214
2215
2216
2217
2218
2219
2220
2221
2222
2223
2224
2225
2226
2227
2228
2229
2230
2231
2232
2233
2234
2235
2236
2237
2238
2239
2240
2241
2242
2243
2244
2245
2246
2247
2248
2249
2250
2251
2252
2253
2254
2255
2256
2257
2258
2259
2260
2261
2262
2263
2264
2265
2266
2267
2268
2269
2270
2271
2272
2273
2274
2275
2276
2277
2278
2279
2280
2281
2282
2283
2284
2285
2286
2287
2288
2289
2290
2291
2292
2293
2294
2295
2296
2297
2298
2299
2300
2301
2302
2303
2304
2305
2306
2307
2308
2309
2310
2311
2312
2313
2314
2315
2316
2317
2318
2319
2320
2321
2322
2323
2324
2325
2326
2327
2328
2329
2330
2331
2332
2333
2334
2335
2336
2337
2338
2339
2340
2341
2342
2343
2344
2345
2346
2347
2348
2349
2350
2351
2352
2353
2354
2355
2356
2357
2358
2359
2360
2361
2362
2363
2364
2365
2366
2367
2368
2369
2370
2371
2372
2373
2374
2375
2376
2377
2378
2379
2380
2381
2382
2383
2384
2385
2386
2387
2388
2389
2390
2391
2392
2393
2394
2395
2396
2397
2398
2399
2400
2401
2402
2403
2404
2405
2406
2407
2408
2409
2410
2411
2412
2413
2414
2415
2416
2417
2418
2419
2420
2421
2422
2423
2424
2425
2426
2427
2428
2429
2430
2431
2432
2433
2434
2435
2436
2437
2438
2439
2440
2441
2442
2443
2444
2445
2446
2447
2448
2449
2450
2451
2452
2453
2454
2455
2456
2457
2458
2459
2460
2461
2462
2463
2464
2465
2466
2467
2468
2469
2470
2471
2472
2473
2474
2475
2476
2477
2478
2479
2480
2481
2482
2483
2484
2485
2486
2487
2488
2489
2490
2491
2492
2493
2494
2495
2496
2497
2498
2499
2500
2501
2502
2503
2504
2505
2506
2507
2508
2509
2510
2511
2512
2513
2514
2515
2516
2517
2518
2519
2520
2521
2522
2523
2524
2525
2526
2527
2528
2529
2530
2531
2532
2533
2534
2535
2536
2537
2538
2539
2540
2541
2542
2543
2544
2545
2546
2547
2548
2549
2550
2551
2552
2553
2554
2555
2556
2557
2558
2559
2560
2561
2562
2563
2564
2565
2566
2567
2568
2569
2570
2571
2572
2573
2574
2575
2576
2577
2578
2579
2580
2581
2582
2583
2584
2585
2586
2587
2588
2589
2590
2591
2592
2593
2594
2595
2596
2597
2598
2599
2600
2601
2602
2603
2604
2605
2606
2607
2608
2609
% ????
% TO DO: 
% Include description of how to visualize MSAs using hhfilter and reformat!

%\documentclass[final,1p,times]{elsarticle}
\documentclass[11pt,a4paper]{article}
\usepackage{url}
\usepackage{cite}
\usepackage{longtable}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{fancyvrb}

\setlength{\parindent}{0mm}
\setlength{\oddsidemargin}{-2mm}
\setlength{\oddsidemargin}{-2mm}
\setlength{\topmargin}{0mm}
\setlength{\textheight}{245mm}
\setlength{\textwidth}{165mm}
\setlength{\headheight}{0mm}
\setlength{\headsep}{0mm}
\setlength{\parskip}{3mm}

\usepackage{bookmark,hyperref}

\hypersetup{
%     bookmarks=true,         % show bookmarks bar?
    unicode=true,          % non-Latin characters in Acrobat’s bookmarks
    pdftoolbar=true,        % show Acrobat’s toolbar?
    pdfmenubar=true,        % show Acrobat’s menu?
    pdffitwindow=false,     % window fit to page when opened
    pdfstartview={FitH},    % fits the width of the page to the window
    pdftitle={HHsuite User Guide},      % title
    pdfauthor={Johannes S\"oding},  % author
    pdfsubject={HHsuite for sensitive sequence searching based on HMM-HMM alignment},   % subject of the document
    pdfnewwindow=true,      % links in new window
    colorlinks=true,        % false: boxed links; true: colored links
    linkcolor=black,        % color of internal links
    citecolor=black,        % color of links to bibliography
%     filecolor=black,        % color of file links
%     urlcolor=black,         % color of external links
}
%\urlstyle{same}            % show urls using the same font as the text

\newcommand{\hhversion}{Version 3.0.0, March 2015} 

%\title{HHsuite for sensitive sequence searching\\[2mm] based on HMM-HMM alignment} 

\begin{document}
\bibliographystyle{unsrt}
%\maketitle

\begin{center}

\vspace{20mm}
 
{\huge \bf HHsuite for sensitive protein sequence\\[2mm] searching based on HMM-HMM alignment}\\[4mm] 

{\Large \bf User Guide}

  \hhversion\\[2mm]
{\copyright  Johannes S\"oding, Markus Meier, Martin Steinegger, Michael Remmert, Andreas Hauser, Christof Angerm\"uller, Armin Meier and Andreas Biegert}\\[2mm]\


{\bf \Large Summary}

\end{center}

\noindent The HHsuite is an open-source software package for sensitive protein sequence searching based on the pairwise alignment of hidden Markov models (HMMs). It contains HHsearch [1] and HHblits [2] among other programs and utilities. HHsearch takes as input a multiple sequence alignment (MSA) or profile HMM and searches a database of HMMs (e.g.\ PDB, Pfam, or InterPro) for homologous proteins. HHsearch is often used for protein structure prediction to detect homologous templates and to build highly accurate query-template pairwise alignments for homology modeling. In the CASP9 competition (2010), a fully automated version of HHpred based on HHsearch and HHblits was ranked best out of 81 servers in template-based structure prediction. HHblits can build high-quality MSAs starting from single sequences or from MSAs. It transforms these into a query HMM and iteratively searches through uniprot20 database by adding significantly similar sequences from the previous search to the updated query HMM for the next search iteration. Compared to PSI-BLAST, HHblits is faster, up to twice as sensitive and produces more accurate alignments. HHblits uses the same HMM-HMM alignment algorithms as HHsearch, but it employs a fast prefilter that reduces the number of database HMMs for which to perform the slow HMM-HMM comparison from tens of millions to a few thousands. 
\\[2mm]

{\bf References:}


\begin{tabular}[t]{ll}
[1] & S\"oding J. (2005)\\ 
    & Protein homology detection by HMM-HMM comparison.\\
    & {\it Bioinformatics} {\bf 21}, 951-960.\\[2mm]

[2] & Remmert M., Biegert A., Hauser A., and S\"oding J. (2011) \\
    & HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment. \\
    & {\it Nat.\  Methods} {\bf 9}, 173-175.\\
\end{tabular}



\newpage

\setlength{\parskip}{0mm}
\tableofcontents
\setlength{\parskip}{2mm}

\newpage

\section{Introduction}
The HHsuite is an open-source software package for highly sensitive sequence searching and sequence alignment. Its two most important programs are HHsearch and HHblits. Both are based on the pairwise comparison of \textit{profile hidden Markov models} (HMMs). 

Profile HMMs are a concise representation of \textit{multiple sequence alignments} (MSAs) \cite{Durbin:2008,Krogh:1994}. Like sequence profiles, they contain for each position in the master sequence the probabilities to observe each of the 20 amino acids in homologous proteins. The amino acid distributions for each column are extrapolated from the homologous sequences in the MSA by adding \textit{pseudocounts} to the amino acid counts observed in the MSA. Unlike sequence profiles, profile HMMs also contain position-specific gap penalties. More precisely, they contain for each position in the master sequence the probability to observe an insertion or a deletion after that position (the log of which corresponds to gap-open penalties) and the probabilities to extend the insertion or deletion (the log of which corresponds to gap-extend penalties). A profile HMM is thus much better suited than a single sequence to find homologous sequences and calculate accurate alignments. By representing both the query sequence and the database sequences by profile HMMs, HHsearch and HHblits are more sensitive for detecting and aligning remotely homologous proteins than methods based on pairwise sequence comparison or profile-sequence comparison.

HHblits can build high-quality multiple sequence alignments (MSAs) starting from a single sequence or from an MSA. Compared to PSI-BLAST \cite{Altschul:1997}, HHblits is faster, finds up to two times more homologous proteins and produces more accurate alignments. It uses an iterative search strategy, adding sequences from significantly similar database HMMs from a previous search iteration to the query HMM for the next search. Because HHblits is based on the pairwise alignment of profile HMMs, it needs its own type of databases that contain multiple sequence alignments and the corresponding profile HMMs instead of single sequences. The HHsuite database uniprot20 is generated regularly by clustering the UniProt database \cite{uniprot:2010} from EBI/SIB/PIR and the nonredundant (nr) database from the NCBI into groups of similar sequences alignable over at least 80 \% of their length and down to $\sim 20 \%$ pairwise sequence identity. This database can be downloaded together with HHblits. HHblits uses the HMM-HMM alignment algorithms in HHsearch, but it employs a fast prefilter (based partly on code from Michael Farrar, \cite{Farrar:2007}) that reduces the number of database HMMs for which to perform the slow HMM-HMM comparison from tens of millions to a few thousands. At the same time, the prefilter is sensitive enough to reduce the sensitivity of HHblits only marginally in comparison to HHsearch. 

By generating highly accurate and diverse MSAs, HHblits can improve almost all downstream sequence analysis methods, such as the prediction of secondary and tertiary structure \cite{Jones:1999, Marks:2011}, of membrane helices, functionally conserved residues, binding pockets, protein interaction interfaces, or short linear motifs. The accuracy of all these methods depends critically on the accuracy and the diversity of the underlying MSAs, as too few or too similar sequences do not add significant information for the predictions. As an example, running the popular PSIPRED secondary structure prediction program \cite{Jones:1999} on MSAs generated by HHblits instead of PSI-BLAST improved the accuracy of PSIPRED significantly even without retraining PSIPRED on the HHblits alignments \cite{Remmert:2011}. 

HHsearch takes as input an MSA (e.g.\ built by HHblits) or a profile HMM and searches a database of HMMs for homologous proteins. The pdb70 database, for instance, consists of profile HMMs for a set of representative sequences from the PDB database \cite{PDB:2004}; the scop70 database has profile HMMs for representative domain sequences from the SCOP database of structural domains \cite{SCOP:2000}; the Pfam \cite{Finn:2010} domain database is a large collection of curated MSAs and profile HMMs for conserved, functionally annotated domains. HHsearch is often used to predict the domain architectures and the functions of domains in proteins by finding similarities to domains in the pdb70, Pfam or other databases. 

In addition to the command line package described here, the interactive web server at \url{http://hhpred.tuebingen.mpg.de} \cite{Soding:2005b, Hildebrand:2009} run HHsearch and HHblits. It offers extended functionality, such as Jalview applets for checking query and template alignments, histogram views of alignments, and building 3D models with MODELLER \cite{Sali:1993}. 

In the CASP9 competition (Critical Assessment of Techniques for Protein Structure Prediction) in 2010, a fully automated version of HHpred based on HHsearch and HHblits was ranked best out of the 81 servers in template-based structure prediction, the category most relevant for biological applications, while having an average response time of minutes instead of days like most other servers \cite{Mariani:2011} (\url{http://predictioncenter.org/casp9/groups_analysis.cgi?type=server&tbm=on}). 

Other popular programs for sensitive, iterative protein sequence searching are PSI-BLAST \cite{Altschul:1997} and HMMER (\url{http://hmmer.org/}). Since they are based on profile-to-sequence and HMM-to-sequence comparison, respectively, they have the advantage over HHblits and HHsearch of being able to search raw sequence databases.


\section{Installation of the HHsuite and its databases}

The HHsuite source code, executable RPM and DPKG packages for most Linux 64 bit platforms, MAC OS X, and BSD Unix, utility scripts in Perl/Python:
\begin{verbatim}
  https://github.com/soedinglab/HHsuite
\end{verbatim}


Databases can be downloaded at:
\begin{verbatim}
  http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/
\end{verbatim}

\subsection{Supported platforms and hardware requirements} \label{installation}

HHsuite needs a 64 bit system to read files larger than 4 GB. HHsuite has been extensively tested on 64 bit Linux, in particular Arch Linux, Debian, Ubuntu, Scientific Linux (SL), and Red Hat. We have done limited testing under BSD, MAX OS X, and CygWin. We plan to offer a 64 bit Windows version compiled under MinGW in the future.

HHsuite typically requires at least 4GB of main memory. Memory requirements can be estimated as follows:
%\textrm{Memory req.} = \textrm{Query\_length} \times \textrm{max\_db\_seq\_length} \times \textrm{num\_threads} \times (4 (SSE) or 8 (AVX2)) \trextrm{B} + \textrm{Query\_length} \times \textrm{max\_db\_seq\_length} \times \textrm{num\_threads} \times 8 \trextrm{B} + 1 \textrm{GB} 

\begin{equation}
\begin{split}
\textrm{Memory req.} & = \textrm{query\_length} \times \textrm{max\_db\_seq\_length} \times \textrm{num\_threads} \times \textrm{(4 (SSE) or 8 (AVX2))} \textrm{B} \\
 & + \textrm{query\_length} \times \textrm{max\_db\_seq\_length} \times \textrm{num\_threads} \times 8 \textrm{B} + 1 \textrm{GB} 
\end{split}
\end{equation}
Here, num\_threads is the number of threads specified with option \verb`-cpu <int>`.

Due to the high number of random file accesses by HHblits, it is important for good efficiency that the databases used by HHblits are placed in RAM disk or on a local solid state drive (SSD). See section \ref{efficiency} for details.


\subsubsection*{Support for SSSE3 instruction set} 

HHblits needs to run on CPUs supporting at least the SSSE3 (Supplemental Streaming SIMD Extensions 3) instruction set. (See \url{https://en.wikipedia.org/wiki/SSSE3} for an explanation of SSSE3.) Starting with the Woodcrest Xeons in 2006, all Intel CPUs support SSSE3. By default, the HHsuite binaries are compiled using the most evolved supported instruction set on the computer used for compiling.

However, you may limit the used instruction set to SSSE3 with the cmake compile flag \verb`-DHAVE_SSSE3=1`
The precompiled standard binaries are compiled with this flag. 

\subsection{Installation from source code} 

\newcounter{mycounter}  
\newenvironment{enum}
 {\begin{list}{\arabic{mycounter}.~~}{\usecounter{mycounter} \labelsep=0em \labelwidth=0em \leftmargin=0em \itemindent=0em}}
 {\end{list}}


\begin{enum}

\item Download the sources from \url{https://github.com/soedinglab/hh-suite/releases}, for example
\begin{verbatim}
$ mkdir ~/programs/hh/
$ cd ~/programs/hh/
$ wget https://github.com/soedinglab/hh-suite/releases/.../hhsuite-3.0-beta.1-Source.tar.gz 
\end{verbatim}
\vspace{2mm}


\item Then unzip and untar the file
\begin{verbatim}
$ tar -xzvf hhsuite-3.0-beta.1-Source.tar.gz
\end{verbatim}
This will unpack the sources to \verb`hhsuite-<VERSION>`.
\vspace{2mm}


\item Compilation: Run make in the source directory:
\begin{verbatim}
$ cd hhsuite-<VERSION>/
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -G "Unix Makefiles" \\
    -DCMAKE_INSTALL_PREFIX=${INSTALL_BASE_DIR} ..
$ make
$ make install
\end{verbatim}

Set \verb`${INSTALL_BASE_DIR}` to the absolute path of the base directory where you want install HHsuite.
For example, to install into \verb`/usr/local/hhsuite`:
\begin{verbatim}
$ cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -G "Unix Makefiles" \\
    -DCMAKE_INSTALL_PREFIX=/usr/local/hhsuite ..
\end{verbatim}

The HHsuite binaries will then be put into \verb`/usr/local/hhsuite/bin`.

\vspace{2mm}

\item Set HHLIB and paths: In your shell, set the environment variable HHLIB to \verb`${INSTALL_BASE_DIR}`, 
e.g, for bash, zsh, or ksh,
\begin{verbatim}
$ export HHLIB=${INSTALL_BASE_DIR}
\end{verbatim}
and, for csh or tcsh: \verb`$ setenv HHLIB=${INSTALL_BASE_DIR}`. 
HHsearch and HHblits look for the column state library file \verb`cs219.lib`
and the context library file \verb`context_data.crf` in \verb`$HHLIB/data/`. The HHsuite
python/perl scripts also read HHLIB (via file \verb`scripts/HHPaths.pm`) to locate HHsuite binaries and data files.

Put the location of your HHsuite binaries and scripts into your search path:
\begin{verbatim}
$ export PATH=$PATH:$HHLIB/bin:$HHLIB/scripts
\end{verbatim}

To avoid typing these commands every time you open a new shell, you may add the following lines to the \verb`.bashrc`, \verb`.kshrc`, \verb`.cshrc` or equivalent file in your home directory that is executed every time a shell is started:
\begin{verbatim}
export HHLIB=/usr/local/
PATH=$PATH:$HHLIB/bin>:$HHLIB/scripts
alias hhblits='hhblits -d <path_to/uniprot20>'
\end{verbatim}
The last line defines a default database for hhblits. 
\vspace{2mm}

\end{enum}

\subsection{Package installation}

\subsubsection*{Installation under x86 64bit Linux with the red hat package manager RPM}

If you use a RPM based distribution like Scientific Linux (SL), Red Hat Enterprise Linux (RHEL) or CentOS we provide precompiled x86\_64 packages for
Version 6.x, which might also work on Version 5.x and other RPM based distros
like SuSE.

\begin{enum}
\item Download and install:\\[-6mm]
\begin{verbatim}
$ cd <path>  # wherever you want to install HHsuite
$ wget https://github.com/soedinglab/hh-suite/releases/.../hhsuite-3.0-beta.1-Linux.rpm
$ rpm -hvU hhsuite-3.0-beta.1-Linux.rpm
\end{verbatim}
\vspace{2mm}

\item Set paths: To allow the HHsuite perl scripts to find the binaries, set the HHLIB variable to your hh directory 
and put the location of your HHsuite binaries and scripts into your search path:\\[-6mm]
\begin{verbatim}
$ export HHLIB=<path>/hhsuite-<version>
$ export PATH=$PATH:$HHLIB/bin:$HHLIB/scripts
\end{verbatim}

To avoid typing these commands every time you open a new shell, you may add the following lines to the \verb`.bashrc`, \verb`.kshrc`, \verb`.cshrc` or equivalent file in your home directory that is executed every time a shell is started:
\begin{verbatim}
export HHLIB=<path>/hhsuite-<version>
PATH=$PATH:$HHLIB/bin:$HHLIB/scripts
alias hhblits='hhblits -d <path_to/uniprot20>'
\end{verbatim}
The last line defines a default database for hhblits. 

\end{enum}

\subsubsection*{Installation under x86 64bit Linux with the Debian package manager DPKG}

To follow.

%\verb`  $ dpkg --install ~/<download_dir>/HHsuite-latest.deb`
%Then follow instructions under point 5 and 6 of the first subsection.

\subsubsection*{Installation under x86 64bit Max OS X}

To follow.


\subsubsection*{Installation under x86 64bit BSD Unix}

To follow.


\subsection{HHsuite databases} \label{hhblits_dbs}
The following HHsuite databases, which can be searched by HHblits and HHsearch, 
can be downloaded at \url{http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/}: 
\small 
\begin{verbatim}
 1 uniprot20    based on UniProt db from EBI/SIB/PIR, clustered to 20 % seq. identity
 3 pdb70        representatives from PDB (70% max. sequence identity), updated weekly
 4 scop70       representatives from SCOP (70% max. sequence identity)
 5 pfamA        Pfam A database from Sanger Inst., http://www.sanger.ac.uk/Software/Pfam/
\end{verbatim} 
\normalsize

The databases consist of up to eight files, which all start with the name of the database, followed by different extensions:
\begin{verbatim}
<dbname>_cs219.ffdata   packed file with column-state sequences for prefiltering
<dbname>_cs219.ffindex  index file for packed column-state sequence file
<dbname>_a3m.ffdata     packed file with MSAs in A3M format
<dbname>_a3m.ffindex    index file for packed A3M file
<dbname>_hhm.ffdata     packed file with HHM-formatted HMMs
<dbname>_hhm.ffindex    index file for packed HHM file
\end{verbatim}

The packed files \verb`<dbname>_cs219.ffdata`, \verb`<dbname>_hhm.ffdata` and \verb`<dbname>_a3m.ffdata` contain simply the concatenated A3M MSAs and HHMs, respectively, with a \verb`\0` character at the beginning of each file. They are therefore human-readable and are parsable for specific MSAs or models using tools such as grep or search functions in text editors (which however should be able to ignore the \verb`\0` character). The \verb`.ffindex` files contain indices to provide fast access to these packed files.

To get started, download the uniprot20 database files. For example:
\begin{verbatim}
$ cd /home/soeding/hh  % change to HHsuite directory
$ mkdir databases; cd databases
$ wget http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/uniprot20_2015_06.tgz
$ tar -xzvf uniprot20_2015_06.tgz
\end{verbatim}

Note that, in order to generate multiple sequence alignments (MSAs) by \emph{iterative} sequence searching using HHblits, you need to search the uniprot20, since since only this databases cover essentially all of the sequence 
universe. The pdb70, pfamA, and pdb70 are not appropriate to build MSAs by iterative searches.

Uniprot20 is obtained by clustering UniProt \cite{uniprot:2010} database from NCBI. Clusters contain sequences that need to be almost full-length (80\%) alignable and typically have pairwise sequence identities down to 20\%-30\%. The clustering is done by mmseqs (to be published), a very fast algorithm for all-against-all sequence comparison and clustering developed in our group. Sequences in each cluster are globally aligned into an MSA (using ClustalOmega \cite{Higgins:2011}). The clusters in uniprot20 thus treat all member sequences equally.
% and the annotatation of the cluster is a non-redundant sum of all annotations found in the cluster member sequences.
You need this type of database to build MSAs using iterative HHblits searches.

In the pdb70 and scop70 databases, each master sequence from the original sequence database is represented by an MSA. The MSAs are built by HHblits searches starting with the master sequence as a query. The MSAs and HMMs typically carry the name and annotation of the master sequence. In contrast to the clusters in the uniprot20 database, sequences can in principle occur in several MSAs. These homologous sequences merely serve to contribute evolutionary information to the master sequence. As the sequences in this type of database do not cover the entire sequence space, they are not suited for iterative searches. See section \ref{building_dbs} for how to build your own databases.

{\bf Note on efficiency}: Always try to keep the HHsuite databases on a local SSD drive of the computer running HHblits/HHsearch, if possible. Otherwise file access times on your (shared) remote file server of local HD drive might limit the performance of HHblits/HHsearch. For explanations, see section \ref{efficiency}.


\subsection{Running HHblits efficiently on a computer cluster} \label{efficiency}

When HHblits runs on many cores of a compute cluster and accesses the HHsuite database(s) 
via a central file system, hard disk random access times can quickly become limiting instead 
of the available compute power. To understand why, let us estimate the random file access 
bottleneck at an example. Suppose you want to search the uniprot20 database with HHblits 
on four compute nodes with 32 CPU cores each. You let each job run on a single core, 
which is the most efficient way when processing many jobs. Each HHblits iteration through 
uniprot20 on a modern 3GHz Intel/AMD core takes around, say, 60s on average. After each search iteration, 
HHblits needs to read around $3000$ a3m or hhm files on disk that passed the prefilter. 
The total number of expected file accesses per second is therefore 
\begin{equation}
\textrm{number of file accesses} = 4 \textrm{(nodes)} \times 32 \textrm{(cores)} \times 3000 \textrm{(files)} / 60s = 6400/s
\end{equation}
A typical server hard disk has a latency of around $5\,ms$ and can therefore randomly access 
only 200 files per second. Hence, if all cores need to access the files from the same RAID 
disk array, only 4 CPU cores would already saturate the RAID's disk drives! (You can slightly 
decrease latency by a factor of two by using RAID configurations with striping, but this 
won't help much.)

To efficiently search with HHblits with millions of sequences in parallel on a compute cluster, 
we therefore recommend one of the first two options.

1. The ideal solution is to automatically load the needed HHsuite databases from the 
file system into the virtual {\it RAM drive} of each compute node upon boot-up. This requires 
enough memory in your computer. Software for RAM drives exist on all major operating systems. 
On Linux from kernel 2.4 upwards, a RAM drive is installed by default at \verb`/dev/shm`. When 
memory is low, this software can intelligently swap rarely used files to a hard drive. 

2. Another good option is to hold the HHsuite databases on local solid state drive (SSD) 
of each of the cluster's compute nodes that can accept HHsuite jobs. The local SSDs also 
remove the file access bottleneck, as SSDs have typical random access times below 
$100 \mu s$. Reading the 3000 files from SSD after prefiltering therefore will take below $0.3 s$.

3. Another good option might be to use a RAM disk (or SSD) on your file server to hold the 
HHsuite databases. Then file access time is not problematic anymore, but you need to estimate 
if data throughput might get limiting. The four servers from our previous example would
 need to read data at a rate 
\begin{equation}
\textrm{transfer rate}\, = 4 \, \textrm{(nodes)} \times 32\, \textrm{(cores)} \times 3000\, \textrm{(files)} \times 5\,\textrm{kB}\, \textrm{(av. file size)}\,/ 60 \,s = 256 \,\textrm{Mb}/s
\end{equation}
If you have 16 instead of 4 compute servers or if data transfer from the file server has to be 
shared with many other compute servers, the network transfer may be limiting the performance.

4. A not so good option is to keep the HHsuite databases on local hard disk drives (HDDs). In this 
way, you at least distribute the file accesses to as many HDDs as you have compute nodes. 
But as we have seen, already at around 4 cores per HDD, you will be limited by file access
rather than CPU power.

5. You (or your sysadmin) should at least make sure that your compute nodes have sufficient memory reserves 
and proper cache settings in order to keep the prefiltering database file in cache (size: a few GB).
You can check this by running HHblits on a fresh compute node twice and observing that the 
second run takes much less time to read the prefiltering database file. But this only solves
the problem of having to read the prefiltering file every time hhblits is called. But, as discussed,
the usually more serious bottleneck is reading the a3m and hhm files that passed the prefilter 
via the file system. These can also be cached, but since each time only a small subset of 
all files are read, caching is inefficient for these.


\section{Brief tutorial to HHsuite tools}

\subsection{Overview of programs}

\small 
\begin{verbatim}
hhblits        (Iteratively) search an HHsuite database with a query sequence or MSA
hhsearch       Search an HHsuite database with a query MSA or HMM
hhmake         Build an HMM from an input MSA 
hhfilter       Filter an MSA by max sequence identity, coverage, and other criteria
hhalign        Calculate pairwise alignments etc. for two HMMs/MSAs
hhconsensus    Calculate the consensus sequence for an A3M/FASTA input file

reformat.pl    Reformat one or many MSAs
addss.pl       Add PSIPRED predicted secondary structure to an MSA or HHM file
hhmakemodel.pl Generate MSAs or coarse 3D models from HHsearch or HHblits results	
hhmakemodel.py Generates coarse 3D models from HHsearch or HHblits results and modifies 
               cif files such that they are compatible with MODELLER
hhsuitedb.py   Build HHsuite database with prefiltering, packed MSA/HMM, and index files
splitfasta.pl  Split a multiple-sequence FASTA file into multiple single-sequence files
renumberpdb.pl Generate PDB file with indices renumbered to match input sequence indices
Align.pm       Utility package for local and global sequence-sequence alignment
HHPaths.pm     Configuration file with paths to the PDB, BLAST, PSIPRED etc.

mergeali.pl    Merge MSAs in A3M format according to an MSA of their seed sequences
pdb2fasta.pl   Generate FASTA sequence file from SEQRES records of globbed pdb files
cif2fasta.py	  Generate a FASTA sequence from the pdbx_seq_one_letter_code entry 
               of the entity_poly of globbed cif files
pdbfilter.pl   Generate representative set of PDB/SCOP sequences from pdb2fasta.pl output
pdbfilter.py   Generate representative set of PDB/SCOP sequences from cif2fasta.py output
\end{verbatim} 
\normalsize

%TODO do we really want to add them???
%replaced by hhblits:
%alignhits.pl   Extract an MSA from a BLAST/PSI-BLAST output
%buildali.pl    Build a PSI-BLAST MSA from a sequence or MSA, add sec. structure

Call a program without arguments or with \verb`-h` option to get more detailed explanations.


\subsection{Searching databases of HMMs using HHsearch and HHblits}\label{searching_hm_dbs}

We will use the MSA \verb`query.a3m` in the \verb`data/` subdirectory of the HHsuite as an example query. To search for sequences in the \verb`scop70_1.75` database that are homologous to the query sequence or MSA in \verb`query.a3m`, type

\begin{verbatim}
$ hhsearch  -cpu 4 -i data/query.a3m -d dbs/scop70_1.75 -o data/query.hhr
\end{verbatim}

(The database \verb`scop70_1.75` can be obtained from \url{http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/scop70_1.75.tar.gz}. 
If the input file is an MSA or a single sequence, HHsearch calculates an HMM from it
and then aligns this query HMM to all HMMs in the \verb`scop70_1.75` database using the Viterbi 
algorithm. After the search, 
the most significant HMMs are realigned using the more accurate Maximum Accuracy (MAC) 
algorithm (subsection \ref{MAC}). After the realignment phase, the complete search results consisting of the 
summary hit list and the pairwise query-template alignments are written to the output file, \verb`data/query.hhr`.
The \verb`hhr` result file format was designed to be human readable and easily parsable.

The \verb`-cpu 4` option tells HHsearch to start four openMP threads for searching and realignment. This will typically results in almost fourfold faster execution on computers with four or more cores. Since the management of the threads costs negligible overhead, this option could be given by default through an alias definition of hhsearch and hhblits (see section \ref{installation}). 

\begin{figure}[h]
\begin{center}
\includegraphics[width=0.5 \textwidth]{hhblits-hhsearch.png}
\caption{Benchmark of HHsearch and HHblits on a SCOP20 dataset.}
\label{fig:hhsearch_hhblits_bench}
\end{center}
\end{figure}

The HHblits tool can be used in much the same way as HHsearch. 
It takes the same input data and produces a results file in the same format as HHsearch.
Most of the HHsearch options also work for HHblits, which has
additional options associated with its extended functionality for iterative searches. 
Due to its fast prefilter, HHblits runs between 30 and 3000 times faster than HHsearch
at the cost of only a few percent lower sensitivity (Fig. \ref{fig:hhsearch_hhblits_bench}).

The same search as above is performed here using HHblits instead of HHsearch:
\begin{verbatim}
$ hhblits  -cpu 4 -i data/query.a3m -d dbs/scop70_1.75 -o data/query.hhr -n 1
\end{verbatim}

HHblits first scans the column state sequences in \verb`scop70_1.75_cs219.ffdata` with its fast prefilter. HMMs whose column state sequences pass the prefilter are read from the packed file \verb`scop70_1.75_hhm.ffdata` (using the index file \verb`scop70_1.75_hhm.ffindex`) and are aligned to the query HMM generated from \verb`query.a3m` using the slow Viterbi HMM-HMM alignment algorithm. The search results are written to the default output file \verb`query.hhr`. The option \verb`-n 1` tells HHblits to perform a single search iteration. (The default is 2 iterations.)


\subsection{Generating a multiple sequence alignment using HHblits}\label{msa_hhblits}

To generate an MSA for a sequence or initial MSA in query.a3m, the database to be searched should cover the entire sequence space, such as uniprot20 or nr20. The option \verb`-oa3m <msa_file>` tells HHblits to generate an output MSA from the significant hits, \verb`-n 1` specifies a single search iteration.
\begin{verbatim}
$ hhblits -cpu 4 -i data/query.seq -d dbs/uniprot20 -oa3m query.a3m -n 1
\end{verbatim}

At the end of the search, HHblits reads from the packed database file containing the MSAs the sequences belonging to HMMs with E-value below the threshold. The E-value threshold for inclusion into the MSA can be specified using the \verb`-e <E-value>` option. After the search, \verb`query.a3m` will contain the MSA in A3M format.

We could do a second search iteration, starting with the MSA from the previous search, to add more sequences. Since the MSA generated after the previous search contains more information than the single sequence in \verb`query.seq`, searching with this MSA will probably result in many more homologous database matches.
\begin{verbatim}
$ hhblits -cpu 4 -i query.a3m -d dbs/uniprot20 -oa3m query.a3m -n 1
\end{verbatim}
Instead, we could directly perform two search iterations starting from \verb`query.seq`:
\begin{verbatim}
$ hhblits -cpu 4 -i data/query.seq -d dbs/uniprot20 -oa3m query.a3m -n 2 
\end{verbatim}
See section \ref{2_vs_1+1} for an explanation why the results can be slightly different.

In practice, it is recommended to use between 1 and 4 iterations for building MSAs, depending on the trade-off between reliability and specificity on one side and sensitivity for remotely homologous sequences on the other side. The more search iterations are done, the higher will be the risk of non-homologous sequences or sequence segments entering the MSA and recruiting more of their kind in subsequent iterations. This is particularly problematic when searching with sequences containing short repeats, regions with amino acid compositional bias and, although less dramatic,  with multiple domains. Fortunately, this problem is much less pronounced in hhblits as compared to PSI-BLAST due to hhblits's lower number of iterations, its more robust Maximum Accuracy alignment algorithm, and the higher precision of its HMM-HMM alignments. 

The parameter \verb`mact` (maximum accuracy threshold) lets you choose the trade-off between sensitivity and 
precision. With a low mact-value (e.g.\ \verb`-mact 0.01`) very sensitive, but not 
so precise alignments are generated, whereas a search with a high mact-value (e.g.\ \verb`-mact 0.9`) 
results in shorter but very precise alignments. The default value of mact in HHblits is $0.35$ 
(changed from 0.5 in the beta version). 

To avoid unnecessarily large and diverse MSAs, HHblits stops iterating when the diversity of the query MSA -- measured as number of effective sequences, see section \ref{aliformats} -- grows passed a threshold of 10.0. This threshold can be modified with the \verb`--neffmax <float>` option. See subsection \ref{hhmformat} for a description of how the number of effective sequences is calculated in HHsuite.

To avoid the final MSAs to grow unnecessarily large, by default the database MSAs that are going to be merged with the query MSA into the result MSA are filtered with the active filter options (by default \verb`-id 90` and \verb`-diff 1000`). The \verb`-all` option turns off the filtering of the result MSA. Use this option if you want to get all sequences in the significantly similar uniprot20 clusters: 
\begin{verbatim}
$ hhblits -cpu 4 -i data/query.seq -d dbs/uniprot20 -oa3m query.a3m -all
\end{verbatim}

The A3M format uses small letters to mark inserts and capital letters to designate match and delete columns (see subsection \ref{aliformats}), allowing you to omit gaps aligned to insert columns. The A3M format therefore uses much less space for large alignments than FASTA but looks misaligned to the human eye. Use the \verb`reformat.pl` script to reformat \verb`query.a3m` to other formats, e.g.\ for reformatting the MSA to Clustal and FASTA format, type
\begin{verbatim}
$ reformat.pl a3m clu query.a3m query.clu
$ reformat.pl a3m fas query.a3m query.fas
\end{verbatim}

Next, to add secondary structure information to the MSA we call the script \verb`addss.pl`. For \verb`addss.pl` to work, you have to make sure that the paths to BLAST and PSIPRED in the file \verb`$HHLIB/scripts/HHPaths.pm` are correctly filled in. Then type
\begin{verbatim}
$ addss.pl query.a3m
\end{verbatim}
When the sequence has a SCOP or PDB identifier as first word in its name, the script tries to add the DSSP states as well. Open the \verb`query.a3m` file and check out the two lines that have been added to the MSA. Now you can generate a hidden Markov model (HMM) from this MSA:
\begin{verbatim}
$ hhmake -i query.a3m
\end{verbatim}
The default output file is \verb`query.hhm`. By default, the option \verb`-M first` will 
be used. This means that exactly those columns of 
the MSAs which contain a residue in the query sequence will be assigned to Match 
/ Delete states, the others will be assigned to Insert states. (The query sequence is 
the first sequence not containing secondary structure information.) Alternatively, you 
may want to apply the 50\%-gap rule by typing \verb`-M 50`, which assigns only those columns 
to Insert states which contain more than 50\% gaps. The \verb`-M first` option makes sense 
if your alignment can best be viewed as a seed sequence plus aligned homologs to 
reinforce it with evolutionary information. This is the case in the SCOP and PDB 
versions of our HMM databases, since here MSAs are built around a single seed 
sequence (the one with known structure). On the contrary, when your alignment 
represents an entire family of homologs and no sequence in particular, it is best to 
use the 50\% gap rule. This is the case for Pfam or SMART MSAs, for instance. 
Despite its simplicity, the 50\% gap rule has been shown to perform well in practice.

When calling \verb`hhmake`, you may also apply several filters, such as maximum pairwise 
sequence identity (\verb`-id <int>`), minimum sequence identity with query sequence 
(\verb`-qid <int>`), or minimum coverage with query (\verb`-cov <int>`). But beware 
of reducing the diversity of your MSAs too much, as this will lower the sensitivity to
detect remote homologs.

Previous versions of HHsuite (the 'HHsearch package') included a perl script \verb`buildali.pl` to build MSAs for a query sequence using PSI-BLAST as its search engine. Because HHblits performs better than PSI-BLAST in all aspects that we have tested, we decided to remove this script from HHsuite. It can still be downloaded as part of HHsearch version 1.5.0.

\subsection*{Example: Comparative protein structure modeling using HHblits and MODELLER}

A three-dimensional (3D) structure greatly facilitates the functional characterization of proteins. However, for many proteins there are no experimental structures available, and thus, comparative modeling to known protein structures may provide useful insights. In this method, a 3D structure of a given protein sequence (target) is predicted based on alignments to one or more proteins of known structures (templates). In the following, we demonstrate how to create alignments for a unresolved protein with HHblits and the PDB70 database. We then convert search results from HHblits to build a comparative model using MODELLER (v9.16).

In 1999, Wu et all. reported the genomic sequence and evolutionary analysis of lactate dehydrogenase genes from \textit{Trichomonas vaginalis} (TvLDH). Surprisingly, the corresponding protein sequence was most similar to the malate dehydrogenase (TvMDH) of the same organism implying TvLDH arose from TvMDH by convergent evolution. In the meantime, the structure of TvLDH has been resolved, however, for instructional purposes suppose that there is no 3D structure for the protein. 

To get started we obtain the protein sequence of TvLDH from GeneBank (accession number \href{http://www.ncbi.nlm.nih.gov/nuccore/AF060233}{AF060233.1}). We first copy-paste the protein sequence into a new file named \verb`query.seq`. The content of this file should somewhat look similar to this.

\footnotesize 
\begin{verbatim}
>TvLDH
MSEAAHVLITGAAGQIGYILSHWIASGELYGDRQVYLHLLDIPPAMNRLTALTMELEDCAFPHLAGFVATTDP
KAAFKDIDCAFLVASMPLKPGQVRADLISSNSVIFKNTGEYLSKWAKPSVKVLVIGNPDNTNCEIAMLHAKNL
KPENFSSLSMLDQNRAYYEVASKLGVDVKDVHDIIVWGNHGESMVADLTQATFTKEGKTQKVVDVLDHDYVFD
TFFKKIGHRAWDILEHRGFTSAASPTKAAIQHMKAWLFGTAPGEVLSMGIPVPEGNPYGIKPGVVFSFPCNVD
KEGKIHVVEGFKVNDWLREKLDFTEKDLFHEKEIALNHLAQ
\end{verbatim}
\normalsize

The search results obtained by querying the TvLDH sequence against the PDB70 will be significantly better if we use a MSA instead of a single sequence. For this reason, we first query the protein sequence of TvLDH against the uniprot20 database which covers the whole protein sequence space. By executing

\begin{verbatim}
$ hhblits -i query.seq -d uniprot -oa3m query.a3m  -cpu 4 -n 1
\end{verbatim}

we obtain a MSA in a3m format which contains several sequences that are similar to TvLDH. Now we can use the \verb`query.a3m` and search the PDB70 database for similar protein structures.

\begin{verbatim}
$ hhblits -i query.a3m -o results.hhr -d pdb70 -cpu 4 -n 1
\end{verbatim}

Note that we now output a hhr file \verb`results.hhr` instead of an a3m file. Before we convert the search results to a format that is readable by MODELLER, let us quickly inspect \verb`results.hhr`.

\footnotesize 
\begin{verbatim}
Query         TvLDH
Match_columns 333
No_of_seqs    2547 out of 8557
Neff          11.6151
Searched_HMMs 1566
Date          Tue Aug 16 11:35:02 2016
Command       hhblits -i query.a3m -o results.hhr -d pdb70 -cpu 32 -n 2

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 7MDH_C MALATE DEHYDROGENASE; C 100.0 1.5E-39 1.3E-43  270.1   0.0  326    3-333    31-358 (375)
  2 4UUL_A L-LACTATE DEHYDROGENASE 100.0 1.8E-35 1.6E-39  244.5   0.0  332    1-332     1-332 (341)
  3 4UUP_A MALATE DEHYDROGENASE (E 100.0 2.2E-35 1.9E-39  243.8   0.0  332    1-332     1-332 (341)
  4 4UUM_B L-LACTATE DEHYDROGENASE 100.0 5.7E-35 5.1E-39  241.5   0.0  333    1-333     1-333 (341)
  5 1CIV_A NADP-MALATE DEHYDROGENA 100.0 1.6E-34 1.5E-38  241.3   0.0  326    2-332    40-367 (385)
  6 1Y7T_A Malate dehydrogenase(E. 100.0 3.4E-34 3.1E-38  235.3   0.0  324    1-331     1-325 (327)
  7 4I1I_A Malate dehydrogenase (E 100.0 3.8E-34 3.4E-38  236.4   0.0  319    2-327    22-343 (345)
  8 1BMD_A MALATE DEHYDROGENASE (E  99.9 5.9E-34 5.3E-38  233.9   0.0  324    1-331     1-325 (327)
  9 4H7P_B Malate dehydrogenase (E  99.9 9.9E-34 8.9E-38  233.9   0.0  319    2-327    22-343 (345)
 10 2EWD_B lactate dehydrogenase,   99.9 1.1E-33 9.5E-38  231.1   0.0  306    1-328     1-314 (317)
 
 (...)
\end{verbatim}
\normalsize

We find that there are several templates that have a high similarity to our query. Interestingly, the hit with the most significant E-value score is also malate dehydrogenase. We will use this structure as a basis for our comparative model. In order to build the model we first have to obtain the template structure(s). We can get 7MDH by typing the following commands

\begin{verbatim}
$ mkdir templates
$ cd templates
$ wget http://files.rcsb.org/download/7MDH.cif 
$ cd ..
\end{verbatim}

To convert our search results \verb`results.hhr` into an alignment that is readable by MODELLER we use \verb`hhmakemodel.py`.

\begin{verbatim}
$ python3 hhmakemodel.py results.hhr templates/ TvLDH.pir ./ -m 1
\end{verbatim}

This script takes four positional arguments: the results file in hhr format, the path to the folder containing all templates in cif format, the output pir file, and folder where the processed cif files should be written to. The \verb`-m` flag tells \verb`hhmakemodel.py` to only include the first hit in the pir alignment. The pir file together with processed cifs can be used as an input for MODELLER (please refer to the MODELLER documentation for further help).

\subsection{Visually checking an MSA for corrupted regions}

Iterative search methods such as PSI-BLAST and HHblits may generate alignments containing non-homologous sequence stretches or even large fractions of non-homologous sequences. The cause for this is almost always the overextension of homologous alignments into non-homologous regions. This has been termed \emph{homologous overextension} in \cite{Gonzalez:2010}. This effect occurs particularly in multidomain and repeat proteins. A single overextended alignment in the search results leads to more of such sequences to be included in the profile of the next iteration -- usually all homologous to the first problematic sequence but with even longer non-homologous stretches. Thus, after three or more iterations, large sections of the resulting MSA may be non-homologous to the original query sequence. This risk of homologous overextension is greatly reduced in HHblits in comparison to PSI-BLAST, because fewer iterations are usually necessary for HHblits and because HHblits uses the Maximum Accuracy alignment algorithm (see section \ref{MAC}), which is much less prone to overextend alignments than the Smith-Waterman/Viterbi algorithm used by most other programs including PSI-BLAST. Still, in important cases it is worth to visually check the MSAs of the query and the matched database protein for the presence of corrupted regions containing non-homologous sequence stretches. 

The recommended procedure to visually check an MSA is the following. We first reduce the MSA to a small set of sequences that could still fit into a single window of an alignment viewer:
\begin{verbatim}
$ hhfilter -i query.a3m -o query.fil.a3m -diff 30
\end{verbatim}
The option \verb`-diff` causes hhfilter to select a representative set of at least 30 sequences that best represent the full diversity of the MSA. Sequences that contain non-homologous stretches are therefore usually retained, as they tend to be the most dissimilar to the main sequence cluster. 

Next, we remove all inserts (option \verb`-r`) with respect to the first, master sequence in the MSA. The resulting MSA is sometimes called a master-slave alignment:
\begin{verbatim}
$ reformat.pl -r query.fil.a3m query.fil.fas
\end{verbatim}

This alignment is now very easy to inspect for problematic regions in any viewer that allows to color the residues according to their physico-chemical properties. We can recommend \verb`alnedit` (downloadable from \url{www.eb.tuebingen.mpg.de/?id=421}) or \verb`jalview`, for example:
\begin{verbatim}
$ java -jar ~/bioinfo/alnedit.jar query.fil.fas .
\end{verbatim}
%Here is an example of a corrupted MSA, visualized as described here with alnedit
%\begin{figure}[h]
%\begin{center}
%\includegraphics[width=0.5 \textwidth]{hhblits-hhsearch.png}
%\caption{Corrupted alignment: The master-slave MSA for sequence XXXX generated using three iterations of PSI-BLAST shows ...}
%\label{fig:hhsearch_hhblits_bench}
%\end{center}
%\end{figure}

\subsection{Building customized databases} \label{building_dbs}

It is simple to build custom HHsuite databases using the same tools we use to build the standard HHsuite databases (except uniprot20). An example application is to search for homologs among all proteins of an organism. To build your own HHsuite database from a set of sequences, you first need to generate an MSA with predicted secondary structure for every sequence in the set.

First of all we split the concatenated fasta file to an ffindex with ffindex\_from\_fasta.

\begin{verbatim}
$ ffindex_from_fasta -s <db>_fas.ff{data,index} <db.fas> 
\end{verbatim}

Now, to build an MSA with HHblits for each sequence in \verb`<db>_fas.ff{data,index}`, run\\[-1mm]

\begin{verbatim}
$ mpirun -np <number_threads> ffindex_apply_mpi <db>_fas.ff{data,index} \\
    -i <db>_a3m_wo_ss.ffindex -d <db>_a3m_wo_ss.ffdata -- 
    hhblits -d <path_to/uniprot20> -i stdin -oa3m stdout -n 2 -cpu 1 -v 0
\end{verbatim}

The MSAs are written to the ffindex \verb`<db>_a3m_wo_ss.ff{data,index}`. To be sure that everything went smoothly, check that the number of lines in \verb`<db>_a3m.ffindex` is the same as the number of lines in \verb`<db>_fas.ffindex`. 

The number of HHblits search iterations and the HMM inclusion E-value threshold for HHblits can be changed from their default values (2 and 0.01, respectively) using the \verb`'-n <int>'` and \verb`'-e <float>'` options. A higher number of iterations such as  \verb`'-n 3'` will result in very high sensitivity to discover remotely homologous relationships, but the ranking among homologous proteins will often not reflect their degree of relationship to the query. The reason is that any similarities that are higher than the similarities among the sequences in the query and database MSAs cannot be resolved. 

Now, add PSIPRED-predicted secondary structure and if possible DSSP secondary structure annotation to all MSAs:
\begin{verbatim}
$ mpirun -np <number_threads> ffindex_apply_mpi <db>_a3m_wo_ss.ff{data,index} \\
    -i <db>_a3m.ffindex -d <db>_a3m.ffdata -- addss.pl -v 0 stdin stdout
$ rm <db>_a3m_wo_ss_a3m.ff{data,index}
\end{verbatim}

We also need to generate an HHM file for each MSA file:

\begin{verbatim}
$ mpirun -np <number_threads> ffindex_apply_mpi <db>_a3m.ff{data,index} \\
    -i <db>_hhm.ffindex -d <db>_hhm.ffdata -- hhmake -i stdin -o stdout -v 0
\end{verbatim}

In order to build the ffindex containing the column state sequences for prefiltering for each a3m we run:

\begin{verbatim}
$ OMP_NUM_THREADS=<number_threads> cstranslate -A ${HHLIB}/data/cs219.lib \\
    -D ${HHLIB}/data/context_data.lib -x 0.3 -c 4 -f -i <db>_a3m \\
    -o <db>_cs219 -I a3m -b
$ ffindex_build -as <db>_cs219.ff{data,index}
\end{verbatim}

Next we want to optimize the hmm and a3m ffindices. For this purpose we sort the files in the
data files according to the number of columns in the MSAs. The number of columns can be retrieved
in the third row of \verb`<db>_cs219.ffindex`.

\begin{verbatim}
$ sort -k3 -n <db>_cs219.ffindex | cut -f1 > sorting.dat

$ ffindex_order sorting.dat <db>_hhm.ff{data,index} <db>_hhm_ordered.ff{data,index}
$ mv <db>_hhm_ordered.ffindex <db>_hhm.ffindex
$ mv <db>_hhm_ordered.ffdata <db>_hhm.ffdata

$ ffindex_order sorting.dat <db>_a3m.ff{data,index} <db>_a3m_ordered.ff{data,index}
$ mv <db>_a3m_ordered.ffindex <db>_a3m.ffindex
$ mv <db>_a3m_ordered.ffdata <db>_a3m.ffdata
\end{verbatim}

As with all perl/python scripts and binaries in the HHsuite, a list of additional options can be retrieved by calling the scripts without parameters.

\subsection*{Example: Building a database from the PDB}

%Applications in protein-protein or protein-drug interactions often require our users to find homologous protein sequences from the PDB. 
To make efficient sequence searches in the PDB we provide a precompiled PDB70 database containing PDB sequences clustered at 70 \% sequence identity. However, to find a larger variety of PDB templates, larger databases with more redundancy are required. For this reason, the HHsuite provides tools to build a custom PDB database. In this tutorial, we describe the steps required to build a custom PDB database.

First, download the entire PDB database from \href{http://www.rcsb.org/pdb/static.do?p=download/ftp/index.html}{RSCB} in cif file format (this is the successor of the pdb file format) by executing

\begin{verbatim}
$ rsync --progress -rlpt -v -z --port=33444 rsync.wwpdb.org::ftp/data/structures/
divided/mmCIF <cif_dir>
\end{verbatim}

and unzip the files into a single directory \verb`<all_cifs>`. Then, run \verb`cif2fasta.py` by typing

\begin{verbatim}
$ python3 cif2fasta.py -i <all_cifs> -o pdb100.fas -c <num_cores> -p pdb_filter.dat
\end{verbatim}

The script scans the folder \verb`<all_cifs>` for files with the suffix \verb`*.cif` and write each sequence and its associated chain identifier annotated in \verb`pdbx_seq_one_letter_code` entry from the \verb`entity_poly` table to the fasta file \verb`pdb100.fas`. By specifying the optional \verb`-p` flag, \verb`cif2fasta.py` creates an additional file \verb`pdb_filter.dat` which is required by \verb`pdbfilter.py` in a later step. Note that \verb`cif2fasta.py` by default removes sequences which are shorter than 30 residues and/or comprise only the residue 'X'. 

If you wish to exhaustively search the PDB, skip the following steps and continue with the instructions described in Section \ref{building_dbs}. However, to increase the speed of database searches, e.g. on systems with limited resources, you can reduce the number of sequences by clustering them with MMSeqs and selecting representative sequences with \verb`pdbfilter.py`. To cluster the sequences of \verb`pdb100.fas` at a sequence identity of \verb`X` and a coverage of \verb`Y` run MMSeqs using these options.

\begin{verbatim}
$ mmseqs createdb pdb100.fas <clu_dir>/pdb100
$ mmseqs clusteringworkflow <clu_dir>/pdb100 <clu_dir>/pdbXX_clu 
/tmp/clustering -c Y --min-seq-id X
$ mmseqs createtsv <clu_dir>/pdb100 <clu_dir>/pdb100 <clu_dir>/pdbXX_clu 
<clu_dir>/pdbXX_clu.tsv
\end{verbatim}

MMSeqs yields tab separated file \verb`pdbXX_clu.tsv` which contains cluster assignments for all sequences. Representative sequences are selected by \verb`pdbfilter.py` which chooses up to three sequences for each cluster by identifying the ones having either the highest resolution ($\mathring{A}$), the largest R-free value or the largest "completeness"\footnote{We compute the completeness of a protein structure by dividing the number of residues that are found in the ATOM section by the total number of residues declared in the pdbx\_seq\_one\_letter\_code entry of the entity\_poly table.}. 

\begin{verbatim}
$ python3 pdbfilter.py pdb100.fas pdbXX_clu.tsv pdb_filter.dat pdbXX.fas 
-i pdb70_to_include.dat -r pdb70_to_remove.dat
\end{verbatim}

\verb`pdbfilter.py` takes the original fasta file (\verb`pdb100.fas`) and the annotation file \verb`pdb_filter.dat` which both were created by \verb`cif2fasta.py`, and the cluster assignments from MMSeqs (\verb`pdb70_clu.tsv`) as input and outputs the final \verb`pdbXX.fas`. Use this fasta file to complete the creation of your database (see Section \ref{building_dbs}).

\subsection{Modifying or extending existing databases}

Assume you have a list of filenames in \verb`files.dat` you want to remove from an HHsuite database.

\begin{verbatim}
$ ffindex_modify -s -u -f files.dat <db>_a3m.ffindex
$ ffindex_modify -s -u -f files.dat <db>_hhm.ffindex
$ ffindex_modify -s -u -f files.dat <db>_cs219.ffindex
\end{verbatim}

This deletes the file entries from the ffindex files, however the files are still in the ffdata file.
This way HHblits and HHsuite won't be able to use them. If you want to get rid of them in the ffdata file
you may re-optimize the databases.

\begin{verbatim}
$ ffindex_build -as <db>_cs219_ordered.ff{data,index} \\
    -i <db>_cs219.ffindex -d <db>_cs219.ffdata
$ mv <db>_c219_ordered.ffindex <db>_cs219.ffindex
$ mv <db>_cs219_ordered.ffdata <db>_cs219.ffdata

$ sort -k3 -n <db>_cs219.ffindex | cut -f1 > sorting.dat

$ ffindex_order sorting.dat <db>_hhm.ff{data,index} <db>_hhm_ordered.ff{data,index}
$ mv <db>_hhm_ordered.ffindex <db>_hhm.ffindex
$ mv <db>_hhm_ordered.ffdata <db>_hhm.ffdata

$ ffindex_order sorting.dat <db>_a3m.ff{data,index} <db>_a3m_ordered.ff{data,index}
$ mv <db>_a3m_ordered.ffindex <db>_a3m.ffindex
$ mv <db>_a3m_ordered.ffdata <db>_a3m.ffdata
\end{verbatim}

If you want to check your HHsuite database you may run:
\begin{verbatim}
$ hhsuitedb.py -o <db> --cpu 1
\end{verbatim}

\verb`hhsuitedb.py` supports the flag \verb`--force` that tries to fix the database.
This may lead to the deletion of MSAs from the database.

Additionally you may add a3m, hhm and cs219 files with globular expressions to an existing database.
Please keep in mind that related a3m, hhm and cs219 files should have the same name. To make it easier
you may add just a3m files and the corresponding hhm and cs219 files will be calculated by the script.
\begin{verbatim}
$ hhsuitedb.py -o <db> --cpu 1 --ia3m=<a3m_glob> --ics219=<cs219_glob> --ihhm=<hhm_glob>
\end{verbatim}

\section{Frequently asked questions}

\subsection{How do I report a bug? \label{report-bug}}
If you think you found a bug, PLEASE report it to us. But we would be grateful if you could first make sure that the problem is not due to wrong usage or wrong file formats. Please reread the relevant sections of the user guide and check for possible warnings. Rerun the command with verbose option \verb`-v 3` or \verb`-v 4` to see if you can find where things go awry. If the problem persists, we will need everything to reproduce the bug on our machines. Please send us (1) the input file, (2) the database name and version, or, if it is not a standard database, the link to your ftp server to download the database files, (3) the output files, best with -v 3 or -v 4 option, (4) the command and its screen output, (5) the operating system and CPU description on which the bug occurred. Under Linux, please send the output of 
\small
\begin{verbatim}
$ echo $HHLIB $PATH; uname -a; lsb_release -a; head -n 25 /proc/cpuinfo; ulimit -a; free
\end{verbatim}
\normalsize
to \url{soeding@mpibpc.mpg.de}. Thanks a lot!

\subsection{What is HMM-HMM comparison and why is it so powerful?}
When searching for remote homologs, it is wise to make use of as much information about the query and database proteins as possible in order to better distinguish true from false positives and to produce optimal alignments. This is the reason why sequence-sequence comparison is inferior to profile-sequence comparison. Sequence profiles contain for each column of a multiple alignment the frequencies of the 20 amino acids. They therefore contain detailed information about the conservation of each residue position, i.e., how important each position is for defining other members of the protein family, and about the preferred amino acids. Profile Hidden Markov Models (HMMs) are similar to simple sequence profiles, but in addition to the amino acid frequencies in the columns of a multiple sequence alignment they contain information about the frequency of inserts and deletions at each column. Using profile HMMs in place of simple sequence profiles should therefore further improve sensitivity. Using HMMs both on the query and the database side greatly enhances the sensitivity/selectivity and alignment quality over sequence-profile based methods such as PSI-BLAST. HHsearch is the first software to employ HMM-HMM comparison and HHblits is the first profile-profile comparison method that is fast enough to do iterative searches to build MSAs. 

\subsection{When can the HHsuite be useful for me?}
Sequence search methods such as BLAST, FASTA, or PSI-BLAST are of prime importance for biological research because functional information of a protein or gene can be inferred from homologous proteins or genes identified in a sequence search. But quite often no significant relationship to a protein of known function can be established. This is certainly the case for the most interesting group of proteins, those for which no ortholog has yet been studied. In cases where conventional sequence search methods fail, HHblits and HHsearch quite often allow to make inferences from more remotely homologous relationships. HHblits builds better MSAs, with which more remote homologs can then be found using HHsearch or HHblits, e.g.\ by searching the PDB or domain databases such as Pfam. If the relationship is so remote that no common function can be assumed, one can often still derive hypotheses about possible mechanisms, active site positions and residues, or the class of substrate bound \cite{Todd:2001, Pawlowski:2000}. When a homologous protein with known structure can be identified, its stucture can be used as a template to model the 3D structure of the protein of interest \cite{Rychlewski:1998}, since even protein domains that shared a common ancestor some 3 billion years ago mostly have similar 3D structures \cite{Kinch:2002,Soding:2006a,Alva:2010}. The 3D model may then help to generate hypotheses to guide experiments. 

\subsection{What does homology mean and why is it important?}
Two protein sequences are homologous to each other if they descended from a common ancestor sequence. Generally, homologous proteins (or protein fragments) have similar structure because structures diverge much more slowly than their sequences \cite{Chothia:1986}. Depending on the degree of divergence between the sequences, the proteins may also have similar cellular functions, ligands, protein interaction partners, or enzymatic mechanisms \cite{Todd:2001}. On the contrary, proteins that have a similar structure by convergence (i.e., by chance) are said to be analogous. They don't generally share similar functions or biochemical mechanisms and are therefore much less helpful for making inferences. HHsearch and HHblits are tools for homology detection and as such do not normally detect analogous relationships \cite{Alva:2010,Remmert:2010}.


\subsection{How can I verify if a database match is homologous?}
Here is a list of things to check if a database match really is at least locally homologous.
 
{\bf Check probability and E-value}:
HHsearch and HHblits can detect homologous relationships far beyond the twilight zone, i.e., below 20\% sequence identity. Sequence identity is therefore not an appropriate measure of relatedness anymore. The estimated probability of the template to be (at least partly) homologous to your query sequence is the most important criterion to decide whether a template HMM is actually homologous or just a high-scoring chance hit. When it is larger than 95\%, say, the homology is nearly certain. Roughly speaking, one should give a hit serious consideration (i.e., check the other points in this list) whenever (1) the hit has $>50\%$ probability, or (2) it has $>30\%$ probability and is among the top three hits. The E-value is an alternative measure of statistical significance. It tells you how many chance hits with a score better than this would be expected if the database contained only hits unrelated to the query. At E-values below one, matches start to get marginally significant. Contrary to the probability, when calculating the E-value HHsearch and HHblits do not take into account the secondary structure similarity. Therefore, the probability is a more sensitive measure than the E-value.

{\bf Check if homology is biologically suggestive or at least reasonable}:
Does the database hit have a function you would expect also for your query? Does it come from an organism that is likely to contain a homolog of your query protein?

{\bf Check secondary structure similarity}:
If the secondary structure of query and template is very different or you can't see how they could fit together in 3D, then this is a reason to distrust the hit. Note however that if the query alignment contains only a single sequence, the secondary structure prediction is quite unreliable and confidence values are overestimated.

{\bf Check relationship among top hits}: 
If several of the top hits are homologous to each other, (e.g.\ when they are members of the same SCOP superfamily), then this will considerably reduce the chances of all of them being chance hits, especially if these related hits are themselves not very similar to each other. Searching the SCOP database is very useful precisely for this reason, since the SCOP family identifier (e.g.\ a.118.8.2) allows to tell immediately if two templates are likely homologs.

{\bf Check for possible conserved motifs}:
Most homologous pairs of alignments will have at least one (semi-)conserved motif in common. You can identify such putative (semi-)conserved motifs by the agglomeration of three or more well-matching columns (marked with a '\verb`|`' sign between the aligned HMMs) occurring within a few residues, as well as by matching consensus sequences. Some false positive hits have decent scores due to a similar amino acid composition of the template. In these cases, the alignments tend to be long and to lack conserved motifs.

{\bf Check residues and role of conserved motifs}: 
If you can identify possible conserved motifs, are the corresponding conserved template residues involved in binding or enzymatic function?

{\bf Check query and template alignments}: 
A corrupted query or template alignment is the main source of high-scoring false positives. The two most common sources of corruption in an alignment are (1) non-homologous sequences, especially repetitive or low-complexity sequences in the alignment, and (2) non-homologous fragments at the ends of the aligned database sequences. Check the query and template MSAs in an alignment viewer such as Jalview or ALNEDIT.

{\bf Realign with other parameters}: 
change the alignment parameters. Choose global instead of local mode, for instance, if you expect your query to be globally homologous to the putative homolog. Try to improve the probability by changing the values for minimum coverage or minimum sequence identity. You can also run the query HMM against other databases.

{\bf Build the query and/or database MSAs more aggressively}:
If your query (or template) MSA is not diverse enough, you could increase sensitivity substantially by trying to include more remotely homologous sequences into the MSA. Try using our HHsenser web server at \url{http://toolkit.tuebingen.mpg.de/hhsenser} \cite{Soding:2006b}. Check the HHsenser alignment manually using an alignment editor. Have non-homologous sequences or sequence segments been accidentally included? You can also try to build a more diverse MSA manually: Inspect the HHblits results after the first iteration and consider including hits above the E-value inclusion threshold of 0.001, based on biological plausibility, relatedness of the organism, a reasonable looking alignment, or just guessing. Then start the second HHblits search iteration HHblits with this manually enriched alignment. 

{\bf Try out other tools}:
Try other tools (e.g.\ for profile-profil comparison) and servers for remote homology detection and structure prediction. 
A list of servers can be found in \cite{Mariani:2011} and \cite{Battey:2007}.

{\bf Verify predictions experimentally}: 
The ultimate confirmation of a homologous relationship or structural model is, of course, the experimental verification of some of its key predictions, such as validating the binding to certain ligands by binding assays, measuring biochemical activity, or comparing the knock-out phenotype with the one obtained when the putative functional residues are mutated.


\subsection{What does the maximum accuracy alignment algorithm do?} \label{MAC}
HHblits and HHsearch use a better alignment algorithm than the quick and 
standard Viterbi method to generate the final HMM-HMM alignments. Both realign
all displayed alignments in a second stage using the more accurate Maximum Accuracy 
(MAC) algorithm \cite{Durbin:2008,Biegert:2008}. The Viterbi algorithm is employed 
for searching and ranking the matches. The realignment step is parallelized 
(\verb`-cpu <int>`) and typically takes a few seconds only.    

Please note: Using different alignment algorithms for scoring and aligning has the 
disadvantage that the pairwise alignments that are displayed are not always very similar to 
those that are used to calculate the scores. This can lead to confusing results 
where alignments of only one or a few residues length may have obtained significant
E-values. In such cases, run the search again with the \verb`-norealign` option, which will 
skip the MAC-realignment step. This will allow you to check if the Viterbi alignments 
are valid at all, which they will probably not be. The length of the MAC alignments 
can therefore give you additional information to decide if a match is valid. In order
to avoid confusion for users of our HHpred server \cite{Soding:2005b, Hildebrand:2009}, 
the \verb`-norealign` option is the default there, whereas for you pros who dare to use 
the command line package, realigning is done by default.

The posterior probability threshold is controlled with the \verb`-mact` option. 
This parameter controls the alignment algorithm's greediness. More precisely, the 
MAC algorithm finds the alignment that maximizes the sum of posterior probabilities 
minus mact for each aligned pair. Global alignments are generated with -mact 0, 
whereas -mact 0.5 will produce quite conservative local alignments. 

The -global and -local options now refer to both the Viterbi search stage as 
well as the MAC realignment stage. With -global (-local), the posterior probability 
matrix will be calculated for global (local) alignment. When -global is used in 
conjunction with -realign, the mact parameter is automatically set to 0 in order to 
produce global alignments. In other words, both following two commands will give 
global alignments:
\begin{verbatim}
$ hhsearch -i <query> -d <db> -realign -mact 0
$ hhsearch -i <query> -d <db> -realign -global
\end{verbatim}

The first version uses \emph{local} Viterbi to search and then uses MAC to realign the 
proteins globally (since mact is 0) on a \emph{local} posterior probability matrix. The 
second version uses \emph{global} Viterbi to search and then realigns globally (since mact 
is automatically set to 0) on a \emph{global} posterior matrix. To detect and align remote 
homologs, for which sometimes only parts of the sequence are conserved, the first 
version is clearly better. It is also more robust. If you expect to find globally 
alignable sequence homologs, the second option might be preferable. In that case, 
it is recommended to run both versions and compare the results. 

\subsection{How is the MSA diversity Neff calculated?} \label{Neff}
The number of effective sequences of the full alignment, which appears as NEFF in the header of each hhm file, is the average of local values \verb`Neff_M(i)` over all alignment positions $i$. The values \verb`Neff_M(i)` are given in the main model section of the hhm model files (subsection \ref{hhmformat}). They quantify the \emph{local} diversity of the alignment in a region around position $i$. More precisely, \verb`Neff_M(i)` measures the diversity of subalignment $Ali_M(i)$ that contains all sequences that have a residue at column $i$ of the full alignment. The subalignment contains all columns for which at least 90\% of these sequences have no end gap. End gaps are gaps to the left of the first residue or to the right of the last residue. The latter condition ensures that the sequences in the subalignment $Ali_M(i)$ cover most of the columns in it. The number of effective sequences in the subalignment $Ali_M(i)$ is exp of the average sequence entropy over all columns of the subalignment. Hence, \verb`Neff_M` is bounded by 0 from below and 20 from above. In practice, it is bounded by the entropy of a column with background amino acid distribution $f_a$: $N_\mathrm{eff} < \sum_{a=1}^{20} f_a \log f_a \approx 16$. Similarly, \verb`Neff_I(i)` gives the diversity of the subalignment $Ali_I(i)$ of all sequences that have an insert at position $i$, and \verb`Neff_D(i)` refers to the diversity of subaligment $Ali_D(i)$ of all sequences that have a Delete (a gap) at position $i$ of the full alignment. 



\subsection{On how to use HHsuite tools}

{\bf Do HHsearch and HHblits work fine with multi-domain sequences?}
HHblits and HHsearch have been designed to work with multi-domain queries. However, the chances for false positives entering the query alginment during the HHblits iterations is greater for multi-domain proteins. For long sequences, it may therefore be of advantage to first search the PDB or the SCOP domain database and then to cut the query sequence into smaller parts on the basis of the identified structural domains. Pfam or CDD are - in our opinion - less suitable to determine domain boundaries.

{\bf How do I best build HMMs for integral membrane proteins?}
Despite the biased amino acid composition in their integral membrane helices, membrane proteins behave very well in HMM-HMM comparisons. In many benchmark tests and evolutionary studies we have not noticed any higher rates of false positives than with cytosolic proteins. 

{\bf How do I reconcile overlapping and conflicting domain predictions}, for example when domain A is predicted from residues 2-50 with 98\% probability and domain B from 2-200 with 95\% probability? The probability that a pair of residues is correctly aligned is the product of the probability for the database match to be homologous (given by the values in the \verb`Probab` column of the hit list) times the posterior probability of the residue pair to be correctly aligned given the database match is correct in the first place. The posterior probabilities are specified by the confidence numbers in the last line of the alignment blocks (0 corresponds approximately to 0-10\%, 9 to 90-100\%). Therefore, an obvious solution would be to prune the alignments in the overlapping region such that the sum of total probabilities is maximized. There is no script yet that does this automatically.

{\bf How can I build a phylogenetic tree for HMMs?}
I would use a similarity measure like the raw score per alignment length. You might also add the secondary structure score to the raw score with some weight. Whereas probabilities, E-values, and P-values are useful for deciding whether a match is a reliable homolog or not, they are not suitable for measuring similarities because they strongly depend on the length of the alignment, roughly like $\mathrm{P-value} \propto \exp(\lambda \times \mathrm{average\_similarity} \times \mathrm{length})$, with some constant $\lambda$ of order 1. The probability has an even more complex dependence on length. Also, the ``Similarity'' given above the alignment blocks is a simple substitution matrix score per column between the query and template master sequences and does not capture the evolutionary information contained in the MSAs (see subsection \ref{aliblocksformat}). One note of caution: Large, diverse MSAs are usually more sensitive to find homologs than narrower ones. Therefore, I would limit the diversity of all HMMs to some reasonable number (perhaps around 5 or 7, depending on how far diverged your HMMs are). This filtering can be done using \verb`$ hhfilter -i <MSA.a3m> -o <MSA_filt.a3m> -neff 5`. 

{\bf Can I weight some residues more than others during the HMM-HMM alignment?} For example to select templates for homology modeling, it could make sense to weight residues around the known functional site more strongly. This might also positively impact the query-template alignment itself. Well, directly weighting query or template match states is not possible. What you can do with a bit of file fiddling, however, may even be better: You can modify the amount of pseudocounts that hhblits, hhsearch, or hhalign will automatically add to the various match state columns in the query or template profile HMM. This is equivalent to assuming a higher conservation for these positions, since pseudocounts are basically mutations that are added to the observed counts in the training sequences to generalize to more remotely related sequences. In HHsuite, the more diverse the sequence neighborhood around a match state column $i$ is, as measured by the number of effective sequences \verb`Neff_M(i)` (see subsection \ref{Neff}), the fewer pseudocounts are added. By manually increasing \verb`Neff_M(i)` for a match state column you increase the assumed conservation of this position in the protein. The \verb`Neff_M(i)` values are found in the hhm-formatted HMM model files in the eighth column of the second line of each match state block in units of $0.001$ (see subsection \ref{hhmformat}). You could, for example, simply increase \verb`Neff_M(i)` values by a factor of two for important, conserved positions. Setting \verb`Neff_M(i)` to 99999 will reduce pseudocounts to effectively zero. A word of caution is advised: HMMs built from a diverse set of training sequences already contain quantitative information about the degree of conservation for each position.


{\bf Don't I need to calibrate my query or database HMMs anymore?}
No. If you don't specify otherwise, the two parameters of the extreme-value distribution for the query are estimated by a neural network from the lengths and diversities ($N_\mathrm{eff}$)) of query and database HMMs that was trained on a large set of example queries-template pairs, in an approach similar to the one used in \cite{Sadreyev:2008}. However, the old calibration is still available as an option in HHsearch.

{\bf Should I use the \verb`-global` option to build MSAs} for an HHsuite database if I am intersted in global alignments to these database HMMs?
Never use \verb`-global` when building MSAs by iterative searches with HHblits. Remember that global alignments are very greedy alignments, where alignments stretch to the ends of either query of database HMM, no matter what. Global alignments will therefore often lead to non-homologous segments getting included in the MSA, which is catastrophic, as these false segments will lead to many false positive matches when searching with this MSA. But you may use \verb`-global` when searching (with a single iteration) for global matches of your query HMMs through your customized HHsuite database, for example.

{\bf How many iterations of hhblits should I use to search my customized genome database?} One iteration! Doing more than one search iterations in general only makes sense in order to build up a diverse MSA of homologous sequences by searching through the uniprot20 or nr20 databases. 



\subsection{About HHsuite databases}


{\bf How can I retrieve database A3M or HHM files from my HHsuite database?}
For reasons of efficiency, the MSA A3M and HHM model files are packed into two files, \verb`<db>_a3m.ffdata` and \verb`<db>_hhm.ffdata`, each with its own index file. The contained files are easy to extract from the packed files. For example, to dump \verb`d1aa7a_.a3m` in \verb`scop70` to standard output, type 
\begin{verbatim}
$ ffindex_get scop70_a3m.ffdata scop70_a3m.ffindex d1aa7a_.a3m
\end{verbatim}
You may write the extracted file to a separate file by appending \verb`'> d1aa7a_.a3m'` to the above command.
If you want to extract all files in the ffindex data structures to directory \verb`tmp/scop70`, type
\begin{verbatim}
$ ffindex_unpack scop70_a3m.ffdata scop70_a3m.ffindex tmp/scop70/ .
\end{verbatim}

{\bf How can I build my own UniProt database for HHblits?}
The procedure to cluster the nr or UniProt databases is more complicated than building a database for a genome or for the sequences in the pdb. As its first step it involves clustering these huge databases down to 20\%-30\% sequence identity. The clustering is done in our lab using a new method, kClust (Hauser M, Mayer CE, and S\"oding J., to be published), available under GPL at \url{ftp://ftp://toolkit.lmb.uni-muenchen.de/kClust/}. We will add all scripts to build these HHsuite databases to the HHsuite in due time. These scripts generate A3M files, HHM files, and consensus sequences. Because of the large number of files to generate, these scripts need to be run on a computer cluster and this would require considerable computer savvyness. 

{\bf Will you offer an HHsuite database that includes environmental sequences?}
We have not seen significant improvements in remote homology detection through inclusion of environmental sequences. A problem with these sequences is their low quality, in particular chimeric sequences from wrong assemblies. This can lead to corrupted profiles in iterative searches. We therefore have no immediate plans to provide a UniProt+env or nr+env for hhblits. If you think you really need the env sequences, you have to use our old \verb`buildali.pl` script or PSI-BLAST for the time being. An alternative is to use kClust to cluster the database yourself. See the previous question.

{\bf How do I read the sequence/HMM headers in the pdb70 databases?}
The pdb70 database is clustered to 70\% maximum pairwise sequence identity to reduce redundancy. The PDB identifiers at the end of the description line of sequence $X$ lists the sequences with lower experimental resolution that were removed for having higher than 70\% sequence identity with PDB sequence $X$. The asterisks indicate that a multi-atom ligand is bound in the structure. 


\subsection{About unexpected HHsuite behavior and troubleshooting}

{\bf Why do I get different results when reversing the roles of query and template?} When taking A as the query, the pairwise alignment and significance scores with template B in the database can be different than when B is the query and A the template. By default hhblits, hhsearch, and hhalign add context-specific pseudocounts to the query MSA or HMM whereas the much faster substitution matrix pseudocounts are added to the database HMMs/MSAs for reasons of speed. If you want to combine the significance estimates from the forward and reverse comparisons, we recommend to take the geometric mean of Probability, P-value and E-value. Since the less diverse HMM or MSA will profit more from the context-specific pseudocounts than the more diverse HMM or MSA, it might be even better to take the estimate from the comparison in which the HMM/MSA with the smaller diversity (measured as number of effective sequences, NEFF) is used as the query. 

{\bf Why do I sometimes get the same database hit twice with different probabilities?} Each line in the summary hit list refers to an alignment with an HMM in the database, not to the database HMM itself. Sometimes, alternative (suboptimal) alignments covering a different part of either the query or the database HMM may appear in the hit list. Usually, both the optimal and the alternative alignments are correct, in particular when the query or the database HMMs represent repeat proteins. 

{\bf Why do I get different results using hhblits and hhsearch} when I search with the same query through the same database? There are two reasons. First, some hits that hhsearch shows might not have passed the prefilter in hhblits. The option \verb`-prepre_smax_thresh <bits>` lets you modify the minimum score threshold of the first, gapless alignment prefilter (default is 10 bits). Option \verb`-pre_evalue_thresh <E-value>` sets the maximum E-value threshold for the second, gapped alignment prefilter (default is 1000). Second, while the probabilities and P-values of hhblits and hhsearch should be the identical for the same matches, the E-values are only similar. The reason is that hhblits heuristically combines the E-values of the second prefilter with the E-values from the full HMM-HMM Viterbi alignment into total E-values \cite{Remmert:2011}. These E-values are slightly better in distinguishing true from false hits, because they combine the partly independent information from two comparisons.

{\label{2_vs_1+1} \bf Why do I get different results when I perform $2$ and $1+1$ iterations with HHblits?} When you do a single iteration and start a second iteration with the A3M output file from the first iteration, the alignments should be nearly the same as if you do two iterations (\verb`-n 2`) Why not exactly the same? Because when doing multiple iterations in a single hhblits run, alignments for each hit are only calculated once. When the same database HMM is found in later search iterations, the ``frozen'', first alignment is used instead of recomputing it. This increases robustness with respect to homologous overextension. However, the significance values can be slightly different between the cases with$2$ and $1+1$ iterations. The reason is that the HHblits P-value is calculated by a weighted combination of the P-value of the Viterbi pairwise HMM alignment and the prefilter E-value. The prefilter E-values differ in the two cases described, since in the first case the E-value is calculated for the first prefilter search (before the first full search) and in the second case it is calculated for the second prefilter search, performed with the query MSA obtained \emph{after} the first search iteration. 

{\bf Why do I get a segmentation fault when calling hhblits on our new machine?}
Often, segmentation faults can be caused by too little available memory. HHblits needs memory to read the entire column state file into memory for prefiltering (\verb`*.cs219`). For Viterbi HMM-HMM alignment it needs about 
\begin{equation}
\textrm{Memory Viterbi} = \textrm{Query\_length} \times \textrm{max\_db\_seq\_length} \times (\textrm{num\_threads}+1) \times 5 \textrm{B} \,. 
\end{equation}
Here, num\_threads is the number of threads specified with option \verb`-cpu <int>`. For realignment, HHblits needs in addition 
\begin{equation}
\textrm{Memory Realing} = \textrm{Query\_length} \times \textrm{max\_db\_seq\_length} \times (\textrm{num\_threads}+1) \times 8 \textrm{B} \,, 
\end{equation}
but the amount of memory for realignment can be limited using the \verb`-maxmem <GB>` option, which will cause the realignment to skip the longest sequences. Under Linux, you can check your memory limit settings using \verb`ulimit -a`. Make sure you have at least 4GB per thread. Try 
\begin{verbatim}
$ ulimit -m 4000000 -v 4000000 -s 8192 -f 4000000
\end{verbatim}
Also try using \verb`-cpu 1 -maxmem 1` to reduce memory requirements. If it still does not work, follow the tipps in subsection \ref{report-bug} to report your problems to us.

{\bf I obtain different results for the same query sequence on the HHpred web server and on my local version using HHblits and HHsearch.}
\begin{itemize}
\item Check the exact command line calls for HHblits and HHsearch from the log file accessible on the results page of the server. If any of the commands differ, rerun your command line version with the same options. 
\item Check if the query multiple sequence alignments on the server and in your versions contain the same number of sequences and the same diversity. If not, try to find out why (see previous point, for example.)
\item For some of the alignments with differing scores, check if the db aligment on the server is the same as the one in your local database. You can download the db MSA by clicking the alignment logo above the query-template pairwise profile-profile alignment.
\end{itemize}

{\bf I find an alignment of my query and a template with the exact same sequence, but they are aligned incorrectly.} I find a gap inserted at residue position 92 in query and position 99 in template:

\begin{verbatim}
Q ss_pred             CcEEEeeCCCC-CCCeEEEEEECCeeEEEEEEECCCCcEEEEEEeeCCHHHHHHHHhhCCCccceeEEEecccCCCCCCe
Q Pr_1             81 GAFLIRESESA-PGDFSLSVKFGNDVQHFKVLRDGAGKYFLWVVKFNSLNELVDYHRSTSVSRNQQIFLRDIEQVPQQPT  159 (217)
Q Consensus        81 ~~~~~~~~~~~-~~~~~~s~~~~~~~~~~~i~~~~~~~~~~~~~~f~s~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  159 (217)
                      |.+++|.+... ++.+.+ +..++.++|+++.....+.|+.+...|.++..++.++...+....................
T Consensus        81 G~flvR~s~~~~~~~~~l-v~~~~~v~h~~i~~~~~g~~~~~~~~f~sl~~Lv~~y~~~~~~~~~~~~l~~~~~~~~~~~  159 (217)
T 1gri_A           81 GAFLIRESESAPGDFSLS-VKFGNDVQHFKVLRDGAGKYFLWVVKFNSLNELVDYHRSTSVSRNQQIFLRDIEQVPQQPT  159 (217)
T ss_dssp             TCEEEEECSSSTTCEEEE-EEETTEEEEEEEEECSSSCEESSSCEESSHHHHHHHHHHSCSSSSTTCCCCBCCCCCCCCC
T ss_pred             CeEEEEECCCCCCCcEEE-EECCCceeEEEEEECCCCcEEEeeEecCCHHHHHHHhhcCCCccccceeccccccccCCce
\end{verbatim}

The seemingly incorrect alignment is explained by the fact that hhsearch / hhblits aligns profile HMMs, not sequences directly. In this example, the query and template multiple sequence alignments (MSAs) from which the profile HMMs were built are \emph{not} identical, as one can see for instance from the differing Consensus annotation strings and different predicted secondary structure strings. The underying MSAs probably contain the exact same alignment ambiguity in this region, with some member sequences being shifted by one positions between positions 92 and 99  relative to the master sequence of the MSA.


\section{HHsearch/HHblits output: hit list and pairwise alignments}\label{outformat}

\subsection{Summary hit list}

Let's do a search with the human PIP49/FAM69B protein, for which we generated an MSA in \verb`query.a3m` with two iterations of HHblits in subsection \ref{msa_hhblits}:

\scriptsize
\begin{verbatim}
Search results will be written to query.hhr
query.a3m is in A2M, A3M or FASTA format
Read query.a3m with 272 sequences
Alignment in query.a3m contains 431 match states
149 out of 270 sequences passed filter (up to 91% position-dependent max pairwise sequence identity)
Effective number of sequences exp(entropy) = 5.2 
.................................................. 1000 HMMs searched
.................................................. 2000 HMMs searched
.................................................. 3000 HMMs searched
.................................................. 4000 HMMs searched
.................................................. 5000 HMMs searched
.................................................. 6000 HMMs searched
.................................................. 7000 HMMs searched
.................................................. 8000 HMMs searched
.................................................. 9000 HMMs searched
.................................................. 10000 HMMs searched
.................................................. 11000 HMMs searched
.................................................. 12000 HMMs searched
.................................................. 13000 HMMs searched
....................................
Realigning 183 query-template alignments with maximum accuracy (MAC) algorithm ...

Query         sp|Q5VUD6|FA69B_HUMAN Protein FAM69B OS=Homo sapiens GN=FAM69B PE=2 SV=3
Match_columns 431
No_of_seqs    149 out of 272
Neff          5.2 
Searched_HMMs 13730
Date          Wed Jan  4 17:44:24 2012
Command       hhsearch -i query.a3m -d /cluster/user/soeding/databases/scop -cpu 18 

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 d1qpca_ d.144.1.7 (A:) Lymphoc  99.7 4.5E-17 3.2E-21  154.3  10.2   99  203-320    56-157 (272)
  2 d1jpaa_ d.144.1.7 (A:) ephb2 r  99.7 4.3E-17 3.1E-21  156.8   8.8   99  203-321    75-177 (299)
  3 d1uwha_ d.144.1.7 (A:) B-Raf k  99.7 5.1E-17 3.7E-21  154.8   7.7  100  203-322    52-154 (276)
  4 d1opja_ d.144.1.7 (A:) Abelson  99.7 6.2E-17 4.5E-21  154.8   8.3  100  203-321    61-164 (287)
  5 d1mp8a_ d.144.1.7 (A:) Focal a  99.6 9.9E-17 7.2E-21  151.3   8.6  100  203-322    56-158 (273)
  6 d1sm2a_ d.144.1.7 (A:) Tyrosin  99.6 1.2E-16 8.8E-21  150.3   8.8   99  203-321    48-150 (263)
  7 d1u59a_ d.144.1.7 (A:) Tyrosin  99.6 2.4E-16 1.7E-20  150.9   9.5   99  203-321    57-158 (285)
  8 d1xbba_ d.144.1.7 (A:) Tyrosin  99.6 2.2E-16 1.6E-20  150.2   8.6   97  203-320    56-155 (277)
  9 d1vjya_ d.144.1.7 (A:) Type I   99.6 2.6E-16 1.9E-20  151.3   8.8   98  204-320    46-156 (303)
 10 d1mqba_ d.144.1.7 (A:) epha2 r  99.6 4.4E-16 3.2E-20  148.0   8.7  193  203-422    57-272 (283)
...
 64 d1j7la_ d.144.1.6 (A:) Type II  97.3 0.00014   1E-08   65.0   6.3   33  292-324   184-216 (263)
 65 d1nd4a_ d.144.1.6 (A:) Aminogl  96.7  0.0012 8.5E-08   58.5   6.6   31  292-322   176-206 (255)
 66 d1nw1a_ d.144.1.8 (A:) Choline  96.6  0.0011 7.8E-08   63.9   5.8   37  203-239    92-128 (395)
 67 d2pula1 d.144.1.6 (A:5-396) Me  95.6  0.0071 5.2E-07   58.3   6.4   32  290-322   222-253 (392)
 68 d1a4pa_ a.39.1.2 (A:) Calcycli  91.7    0.12 8.9E-06   40.0   5.4   62  140-202    18-80  (92)
 69 d1ksoa_ a.39.1.2 (A:) Calcycli  91.2    0.17 1.2E-05   39.5   5.8   56  147-203    28-83  (93)
 70 d1e8aa_ a.39.1.2 (A:) Calcycli  90.5    0.23 1.7E-05   38.3   6.0   56  147-203    27-82  (87)
...
175 d1qxpa2 a.39.1.8 (A:515-702) C  23.7      29  0.0021   28.8   3.8   49  137-197    69-118 (188)
176 d1tuza_ a.39.1.7 (A:) Diacylgl  23.5      55   0.004   25.3   5.3   55  143-201    44-106 (118)
177 d1ggwa_ a.39.1.5 (A:) Cdc4p {F  23.1      26  0.0019   27.0   3.2   66  129-197    35-101 (140)
178 d1topa_ a.39.1.5 (A:) Troponin  22.8      72  0.0052   24.5   6.0   58  140-199    65-123 (162)
179 d1otfa_ d.80.1.1 (A:) 4-oxaloc  22.5      66  0.0048   21.5   5.0   40  267-306    12-53  (59)
180 d1oqpa_ a.39.1.5 (A:) Caltract  22.2      32  0.0023   24.0   3.2   32  165-197     3-34  (77)
181 d1df0a1 a.39.1.8 (A:515-700) C  21.7      43  0.0032   27.0   4.5   51  137-199    67-118 (186)
182 d1zfsa1 a.39.1.2 (A:1-93) Calc  21.1      41   0.003   24.6   3.8   30  170-199     8-38  (93)
183 d1snla_ a.39.1.7 (A:) Nucleobi  20.9      23  0.0016   26.2   2.2   24  174-197    18-41  (99)

Done
\end{verbatim}
\normalsize
 
The summary hit list that is written to the screen shows the best hits from the 
database, ordered by the probability of being a true positive (column 4: 'Prob'). 
The meaning of the columns is the following:
\vspace{5mm}

\renewcommand{\arraystretch}{1.2}

\begin{description}
\item{\verb`No`}: the index of the datbase match.

\item{\verb`Hit`}: the first 30 characters of the name line.

\item{\verb`Prob`}: the Probability of template to be a true positive.
For the probability of being a true positive, the secondary structure score 
in column \verb`SS` is taken into account, together with the raw score in\
column \verb`Score`. 
True positives are defined to be either globally homologous or they are at least 
homologous in parts, and thereby locally similar in structure. More precisely, 
the latter criterion demands that the MAXSUB score between query and hit is at 
least 0.1. In almost all cases the structural similarity will we be due to a global
OR LOCAL homology between query and template.

\item{\verb`E-value`}:
The E-value gives the average number of false positives ('wrong hits') with a score 
better than the one for the template when scanning the database. It is a measure of 
reliability: E-values near to 0 signify a very reliable hit, an E-value of 10 means 
about 10 wrong hits are expected to be found in the database with a score at least 
this good. Note that E-value and P-value are calculated without taking the secondary 
structure into account!


\item{\verb`P-value`}: 
The P-value is the E-value divided by the number of sequences in the database.
It is the probability that in a \emph{pairwise} comparison a wrong hit will score at least 
this good.

\item{\verb`Score`}: the raw score is computed by the Viterbi HMM-HMM alignment excluding the
secondary structure score. It is the sum of similarities of the aligned profile columns
minus the position-specific gap penalties in bits. The column similarity score is the 
log-sum-of-odds score (base 2) as described in the original HHsearch paper 
(Soding, Bioinformatics 2005). The gap penalties are the log2 of the state transition
probabilities, e.g. from match state to insert or delete to match state.

\item{\verb`SS`}: the secondary structure score.
This score tells you how well the PSIPRED-predicted (3-state) or actual DSSP-determined 
(8-state) secondary structure sequences agree with each other. PSIPRED confidence 
values are used in the scoring, low confidences getting less statistical weight.

\item{\verb`Cols`}: the number of aligned Match columns in the HMM-HMM alignment.

\item{\verb`Query HMM`}: the range of aligned match states from the query HMM.

\item{\verb`Template HMM`}: the range of aligned match states from the database/template HMM and, 
in parenthesis, the number of match states in the database HMM.

\end{description}



\subsection{HMM-HMM pairwise alignments}\label{aliblocksformat}

The output file d1bpya1.hhr contains the same hit list plus the pairwise HMM alignments. One example is give here:

\scriptsize
\begin{verbatim}
No 68 
>d1a4pa_ a.39.1.2 (A:) Calcyclin (S100) {Human (Homo sapiens), P11 s100a10, calpactin [TaxId: 9606]}
Probab=91.65  E-value=0.12  Score=40.00  Aligned_cols=62  Identities=16%  Similarity=0.149  Sum_probs=42.0

Q ss_pred             ccCCCCCCcHHHHHHHHHHHHHhhcccCccHHHHHHHHHhhhhccCCCCcCHHHHHHH-HHHHH
Q sp|Q5VUD6|FA69  140 FDKPTRGTSIKEFREMTLSFLKANLGDLPSLPALVGQVLLMADFNKDNRVSLAEAKSV-WALLQ  202 (431)
Q Consensus       140 ~d~p~~g~s~~eF~emv~~~i~~~lg~~~~l~~L~~~~~~~~d~nk~g~vs~~e~~sl-waLlq  202 (431)
                      ||+..-..|.+||.+++.......++.+.+ ...+..++..+|.|+||+|++.|...+ ..|..
T Consensus        18 yd~ddG~is~~El~~~l~~~~~~~~~~~~~-~~~v~~~~~~~D~n~DG~I~F~EF~~li~~l~~   80 (92)
T d1a4pa_          18 FAGDKGYLTKEDLRVLMEKEFPGFLENQKD-PLAVDKIMKDLDQCRDGKVGFQSFFSLIAGLTI   80 (92)
T ss_dssp             HHGGGCSBCHHHHHHHHHHHCHHHHHHSCC-TTHHHHHHHHHCTTSSSCBCHHHHHHHHHHHHH
T ss_pred             HcCCCCEEcHHHHHHHHHHhccccccccCC-HHHHHHHHHHHhCCCCCCCcHHHHHHHHHHHHH
Confidence            444433449999999998876655554332 234566677899999999999997544 44443
\end{verbatim}\normalsize

This alignment shows an EF hand embedded in a kinase domain in PIP49/FAM69B. The first line, 
which begins with with a ``\verb`>`'', contains the name and description line of the template/database 
HMM. (We use ``template HMM'' and ``matched database HMM'' synonymously.) 
The next line summarizes the main statistics for the alignment: The probability for the query and 
template HMMs to be homologous (\verb`Probab`), the \verb`E-value`, the raw \verb`Score`, and the 
number of aligned columns are repeated from the summary hit list. The \verb`Identities` give the percentage
of aligned residue pairs of the query and the template master sequences that are identical. 
The \verb`Similarity` is the arithmetic mean of the substitution scores between the aligned residue
pairs from the query and template master sequences. The substitution matrix is the same as the
one used to calculate the pseudocounts for the database HMMs, by default the Gonnet matrix. 
(The matrix can be changed with the \verb`-Blosom<XX>` option.) 

The \verb`Sum_probs` value is the sum over the posterior probabilities of all aligned pairs 
of match states. These probabilities are calculated by the Forward-Backward algorithm. 
(They are used by the maximum accuracy algorithm 
which computes the final alignments.) When the template HMM has secondary structure annotation from
DSSP, the \verb`sum_probs` value runs only over aligned pairs for which the template has a 
valid DSSP state, not a \verb`-` sign. A  \verb`-` would indicate that the structural coordinates of
that residue are missing in the template. For homology modelling, this special treatment of templates 
with known structure makes \verb`sum_probs` a useful feature to use for ranking templates.



The pairwise alignment consists of one or more blocks with the following lines:

\small
\begin{verbatim}
Q ss_dssp:      the query secondary structure as determined by DSSP (when available)
Q ss_pred:      the query secondary structure as predicted by PSIPRED (when available)
Q <Q_name>:     the query master sequence
Q Consensus:    the query alignment consensus sequence
\end{verbatim}\normalsize

The predicted secondary structure states are shown in capital letters if the PSIPRED
confidence value is between 0.7 and 1.0, for lower confidence values they are given 
in lower-case letters. With the option {\tt '-ssconf'}, {\tt 'ss\_conf'} lines can 
be added to the alignments which report the PSIPRED confidence values by numbers 
between 0 and 9 (as in versions up to 1.5).

The consensus sequence uses capital letters for well conserved columns and
lower case for partially conserved columns. Unconserved columns are marked by 
a tilde \verb`~`. Roughly speaking, amino acids that occur with $\ge 60\%$ probability 
(before adding pseudocounts) are written as capital letters and amino acids that have 
$\ge 40\%$ probability are written as lower case letters, where gaps are included
in the fraction counts. More precisely, when the gap-corrected amino acid fraction
    \[p_i(a)*N_\mathrm{eff}(i)/(N_\mathrm{eff}+1)\]
is above 0.6 (0.4) an upper (lower) case letter is used for amino acid a.
Here, $p_i(a)$ is the emission probability for a in column i, $N_\mathrm{eff}$ is the effective 
number of sequences in the entire multiple alignment (between 1 and 20) and $N_\mathrm{eff}(i)$ is 
the effective number of sequences in the subalignment consisting of those sequences
that do not have a gap in column i. These percentages increase
approximately inversely proportionally with the fraction of gaps in the column, 
hence a column with only cysteines and 50\% gaps gets a lower case letter.
              
The line in the middle shows the column score between the query and template 
amino acid distributions. It gives a valuable indication for the alignment quality.
\small
\begin{verbatim}
  = : column score below -1.5
  - : column score between -1.5 and -0.5
  . : column score between -0.5 and +0.5
  + : column score between +0.5 and +1.5
  | : column score above   +1.5
\end{verbatim}\normalsize

A unit of column score corresponds approximately to 0.6 bits.
From the column score line the excellent alignment around the conserved 
{\tt 'D.n.DG.i...E'} motif in the turn between two helices is evident. The alignment around the 
gap by contrast scores only a bit better than zero per residue and is
therefore not very reliable.

After the template block, which consists of the following lines, 
\small 
\begin{verbatim}
T Consensus:    the template alignment consensus sequence
T <T_name>:     the template domain sequence
T ss_dssp:      the template secondary structure as determined by DSSP (when available)
T ss_pred:      the template secondary structure as predicted by PSIPRED (when available)
\end{verbatim}\normalsize

The last line in the block ({\tt Confidence}) reports the reliability of the pairwise 
query-template alignment. The confidence values are obtained from the posterior 
probabilities calculated in the Forward-Backward algorithm. A value of 8 indicates
a probability that this pair of HMM columns is correctly aligned between 0.8 and 0.9. 
The {\tt Confidence} line is only displayed when the -realign option is active.


\section{File formats}

\subsection{Multiple sequence alignment formats} \label{aliformats}

Multiple alignments can be read in A2M, A3M, or aligned FASTA format. (Check the -M option for 
using an input format different from the default A3M). You can transform MSAs 
from Clustal or Stockholm format to A3M or aligned FASTA with the \verb`reformat.pl` utility 
supplied in this package. 

To reformat from Clustal format to A3M:
\begin{verbatim}
  $ reformat.pl test.aln test.a3m
\end{verbatim}
or explicitly, if the formats can not be recognized from the extensions:
\begin{verbatim}
  $ reformat.pl clu a3m test.clustal test.a3m
\end{verbatim}
To reformat from Stockholm to aligned FASTA:
\begin{verbatim}
  $ reformat.pl test.sto test.fas
\end{verbatim}


\subsubsection*{Example for aligned FASTA format:}

\scriptsize
\begin{verbatim}
>d1a1x__ b.63.1.1 (-) p13-MTCP1 {Human (Homo sapiens)}
PPDHLWVHQEGIYRDEYQRTWVAVVEE--E--T--SF---------LR----------ARVQQIQVPLG-------DAARPSHLLTS-----QL
>gi|6678257|ref|NP_033363.1|:(7-103) T-cell lymphoma breakpoint 1 [Mus musculus]
HPNRLWIWEKHVYLDEFRRSWLPVVIK--S--N--EK---------FQ----------VILRQEDVTLG-------EAMSPSQLVPY-----EL
>gi|7305557|ref|NP_038800.1|:(8-103) T-cell leukemia/lymphoma 1B, 3 [Mus musculus]
PPRFLVCTRDDIYEDENGRQWVVAKVE--T--S--RSpygsrietcIT----------VHLQHMTTIPQ-------EPTPQQPINNN-----SL
>gi|11415028|ref|NP_068801.1|:(2-106) T-cell lymphoma-1; T-cell lymphoma-1A [Homo sapiens]
HPDRLWAWEKFVYLDEKQHAWLPLTIEikD--R--LQ---------LR----------VLLRREDVVLG-------RPMTPTQIGPS-----LL
>gi|7305561|ref|NP_038804.1|:(7-103) T-cell leukemia/lymphoma 1B, 5 [Mus musculus]
----------GIYEDEHHRVWIAVNVE--T--S--HS---------SHgnrietcvt-VHLQHMTTLPQ-------EPTPQQPINNN-----SL
>gi|7305553|ref|NP_038801.1|:(5-103) T-cell leukemia/lymphoma 1B, 1 [Mus musculus]
LPVYLVSVRLGIYEDEHHRVWIVANVE--TshS--SH---------GN----------RRRTHVTVHLW-------KLIPQQVIPFNplnydFL
>gi|27668591|ref|XP_234504.1|:(7-103) similar to Chain A, Crystal Structure Of Murine Tcl1
-PDRLWLWEKHVYLDEFRRSWLPIVIK--S--N--GK---------FQ----------VIMRQKDVILG-------DSMTPSQLVPY-----EL
>gi|27668589|ref|XP_234503.1|:(9-91) similar to T-cell leukemia/lymphoma 1B, 5;
-PHILTLRTHGIYEDEHHRLWVVLDLQ--A--ShlSF---------SN----------RLLIYLTVYLQqgvafplESTPPSPMNLN-----GL
>gi|7305559|ref|NP_038802.1|:(8-102) T-cell leukemia/lymphoma 1B, 4 [Mus musculus] 
PPCFLVCTRDDIYEDEHGRQWVAAKVE--T--S--SH---------SPycskietcvtVHLWQMTTLFQ-------EPSPDSLKTFN-----FL
>gi|7305555|ref|NP_038803.1|:(9-102) T-cell leukemia/lymphoma 1B, 2 [Mus musculus]
---------PGFYEDEHHRLWMVAKLE--T--C--SH---------SPycnkietcvtVHLWQMTRYPQ-------EPAPYNPMNYN-----FL
\end{verbatim}\normalsize

The sequence name and its description must be contained in a single name line beginning 
with the $>$ symbol and followed directly by the sequence name. The residue data is 
contained in one or more lines of arbitrary length following the name line. No empty 
lines should be used. In aligned FASTA the gaps are written with '-' and the n'th 
letter of each sequence (except newlines) is understood to build the n'th column of the 
multiple alignment. 


\subsubsection*{The same alignment in A2M format looks like this:}

\scriptsize
\begin{verbatim}
>d1a1x__ b.63.1.1 (-) p13-MTCP1 {Human (Homo sapiens)}
PPDHLWVHQEGIYRDEYQRTWVAVVEE..E..T..SF.........LR..........ARVQQIQVPLG.......DAARPSHLLTS.....QL
>gi|6678257|ref|NP_033363.1|:(7-103) T-cell lymphoma breakpoint 1 [Mus musculus]
HPNRLWIWEKHVYLDEFRRSWLPVVIK..S..N..EK.........FQ..........VILRQEDVTLG.......EAMSPSQLVPY.....EL
>gi|7305557|ref|NP_038800.1|:(8-103) T-cell leukemia/lymphoma 1B, 3 [Mus musculus]
PPRFLVCTRDDIYEDENGRQWVVAKVE..T..S..RSpygsrietcIT..........VHLQHMTTIPQ.......EPTPQQPINNN.....SL
>gi|11415028|ref|NP_068801.1|:(2-106) T-cell lymphoma-1; T-cell lymphoma-1A [Homo sapiens]
HPDRLWAWEKFVYLDEKQHAWLPLTIEikD..R..LQ.........LR..........VLLRREDVVLG.......RPMTPTQIGPS.....LL
>gi|7305561|ref|NP_038804.1|:(7-103) T-cell leukemia/lymphoma 1B, 5 [Mus musculus]
----------GIYEDEHHRVWIAVNVE..T..S..HS.........SHgnrietcvt.VHLQHMTTLPQ.......EPTPQQPINNN.....SL
>gi|7305553|ref|NP_038801.1|:(5-103) T-cell leukemia/lymphoma 1B, 1 [Mus musculus]
LPVYLVSVRLGIYEDEHHRVWIVANVE..TshS..SH.........GN..........RRRTHVTVHLW.......KLIPQQVIPFNplnydFL
>gi|27668591|ref|XP_234504.1|:(7-103) similar to Chain A, Crystal Structure Of Murine Tcl1
-PDRLWLWEKHVYLDEFRRSWLPIVIK..S..N..GK.........FQ..........VIMRQKDVILG.......DSMTPSQLVPY.....EL
>gi|27668589|ref|XP_234503.1|:(9-91) similar to T-cell leukemia/lymphoma 1B, 5;
-PHILTLRTHGIYEDEHHRLWVVLDLQ..A..ShlSF.........SN..........RLLIYLTVYLQqgvafplESTPPSPMNLN.....GL
>gi|7305559|ref|NP_038802.1|:(8-102) T-cell leukemia/lymphoma 1B, 4 [Mus musculus] 
PPCFLVCTRDDIYEDEHGRQWVAAKVE..T..S..SH.........SPycskietcvtVHLWQMTTLFQ.......EPSPDSLKTFN.....FL
>gi|7305555|ref|NP_038803.1|:(9-102) T-cell leukemia/lymphoma 1B, 2 [Mus musculus]
---------PGFYEDEHHRLWMVAKLE..T..C..SH.........SPycnkietcvtVHLWQMTRYPQ.......EPAPYNPMNYN.....FL
\end{verbatim}\normalsize

A2M format is derived from aligned FASTA format. It looks very similar, but it 
distinguishes between match/delete columns and insert columns. This information is 
important to uniquely specify how an alignment is transformed into an HMM. The 
match/delete columns use upper case letters for residues and the '-' symbol for 
deletions (gaps). The insert columns use lower case letters for the inserted residues. 
Gaps aligned to inserted residues are written as '.' Lines beginning with a hash \# 
symbol will be treated as commentary lines in HHsearch/HHblits (see below).


\subsubsection*{The same alignment in A3M:}

\scriptsize
\begin{verbatim}
>d1a1x__ b.63.1.1 (-) p13-MTCP1 {Human (Homo sapiens)}
PPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQQIQVPLGDAARPSHLLTSQL
>gi|6678257|ref|NP_033363.1|:(7-103) T-cell lymphoma breakpoint 1 [Mus musculus]
HPNRLWIWEKHVYLDEFRRSWLPVVIKSNEKFQVILRQEDVTLGEAMSPSQLVPYEL
>gi|7305557|ref|NP_038800.1|:(8-103) T-cell leukemia/lymphoma 1B, 3 [Mus musculus]
PPRFLVCTRDDIYEDENGRQWVVAKVETSRSpygsrietcITVHLQHMTTIPQEPTPQQPINNNSL
>gi|11415028|ref|NP_068801.1|:(2-106) T-cell lymphoma-1; T-cell lymphoma-1A [Homo sapiens]
HPDRLWAWEKFVYLDEKQHAWLPLTIEikDRLQLRVLLRREDVVLGRPMTPTQIGPSLL
>gi|7305561|ref|NP_038804.1|:(7-103) T-cell leukemia/lymphoma 1B, 5 [Mus musculus]
----------GIYEDEHHRVWIAVNVETSHSSHgnrietcvtVHLQHMTTLPQEPTPQQPINNNSL
>gi|7305553|ref|NP_038801.1|:(5-103) T-cell leukemia/lymphoma 1B, 1 [Mus musculus]
LPVYLVSVRLGIYEDEHHRVWIVANVETshSSHGNRRRTHVTVHLWKLIPQQVIPFNplnydFL
>gi|27668591|ref|XP_234504.1|:(7-103) similar to Chain A, Crystal Structure Of Murine Tcl1
-PDRLWLWEKHVYLDEFRRSWLPIVIKSNGKFQVIMRQKDVILGDSMTPSQLVPYEL
>gi|27668589|ref|XP_234503.1|:(9-91) similar to T-cell leukemia/lymphoma 1B, 5;
-PHILTLRTHGIYEDEHHRLWVVLDLQAShlSFSNRLLIYLTVYLQqgvafplESTPPSPMNLNGL
>gi|7305559|ref|NP_038802.1|:(8-102) T-cell leukemia/lymphoma 1B, 4 [Mus musculus] 
PPCFLVCTRDDIYEDEHGRQWVAAKVETSSHSPycskietcvtVHLWQMTTLFQEPSPDSLKTFNFL
>gi|7305555|ref|NP_038803.1|:(9-102) T-cell leukemia/lymphoma 1B, 2 [Mus musculus]
---------PGFYEDEHHRLWMVAKLETCSHSPycnkietcvtVHLWQMTRYPQEPAPYNPMNYNFL
\end{verbatim}\normalsize

The A3M format is a condensed version of A2M format. It is obtained by omitting all '.' 
symbols from A2M format. Hence residues emitted by Match states of the HMM are in upper 
case, residues emitted by Insert states are in lower case and deletions are written '-'.
A3M-formatted alignments can be reformatted to other formats like FASTA or A2M with 
the \verb`reformat.pl` utility:
\begin{verbatim}
  reformat.pl test.a3m test.a2m
\end{verbatim}
Lines beginning with a hash \# symbol will be treated as commentary lines in HHsearch/HHblits
(see below). Please note that A3M, though very practical and space-efficient, 
is not a standard format, and the name A3M is our personal invention.

\subsubsection*{Secondary structure information in A3M/A2M or FASTA MSAs for HHsearch/HHblits}

The alignments read in by HHblits, HHsearch or HHmake can also contain secondary structure 
information. This information can be included in sequences with special names, 
like in this A3M file:

\scriptsize
\begin{verbatim}
>ss_dssp
CCSEEEEEETTEEEETTSCEEEEEEEECSSCEEEEEECCCCCCCSCCCHHHHTTCSSCSEEEEETTTEEEETTSC
>aa_dssp
PPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQQIQVPLGDAARPSHLLTSQLPLMWQLYPEERYMDNNSR
>aa_pred 
PPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQQIQVPLGDAARPSHLLTSQLPLMWQLYPEERYMDNNSR
>ss_pred 
CCCEEEEECCCEECCCCCEEEEEEEEECCCCCCEEEEEEECCCCCCCCCCCCCCCCCCCEEEECCCCCEECCCCC
>ss_conf 
987689961870104587078999970578640132153103788788777774424614787217702035631
>d1a1x__ b.63.1.1 (-) p13-MTCP1 {Human (Homo sapiens)}
PPDHLWVHQEGIYRDEYQRTWVAVVEEETSFLRARVQQIQVPLGDAARPSHLLTSQLPLMWQLYPEERYMDNNSR
>gi|6678257|ref|NP_033363.1|:(7-103) T-cell lymphoma breakpoint 1 [Mus musculus]
HPNRLWIWEKHVYLDEFRRSWLPVVIKSNEKFQVILRQEDVTLGEAMSPSQLVPYELPLMWQLYPKDRYRSCDSM
>gi|7305557|ref|NP_038800.1|:(8-103) T-cell leukemia/lymphoma 1B, 3 [Mus musculus]
PPRFLVCTRDDIYEDENGRQWVVAKVETSRSpygsrietcITVHLQHMTTIPQEPTPQQPINNNSLPTMWRLESMNTYTGTDGT
>gi|11415028|ref|NP_068801.1|:(2-106) T-cell lymphoma-1; T-cell lymphoma-1A [Homo sapiens]
HPDRLWAWEKFVYLDEKQHAWLPLTIEikDRLQLRVLLRREDVVLGRPMTPTQIGPSLLPIMWQLYPDGRYRSSDSS
\end{verbatim}\normalsize

The sequence with name \verb`>ss_dssp` contains the 8-state DSSP-determined secondary
structure. \verb`>aa_dssp` and \verb`>aa_pred` contain the same residues as the query 
sequence (\verb`>d1a1x__` in this case). They are optional and used merely to check whether the 
secondary structure states have correctly been assigned to the alignment. \verb`>ss_pred` 
contains the 3-state secondary structure predicted by PSIPRED, and \verb`>ss_conf`
 contains the corresponding confidence values. The query sequence is the first sequence that 
does not start with a special name. It is not marked explicitly.


\subsubsection*{Name lines in alignments}

If you would like to create HMMs from alignments with a specified name which differ 
from the name of the first sequence, you can do so by adding name lines to 
your FASTA, A2M, or A3M alignment:

\scriptsize
\begin{verbatim}
#PF02043 Bac_chlorC:  Bacteriochlorophyll C binding protein
>ss_pred
CCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHCC
>ss_conf
9863234658887677777750019999999998878886403445886666775655576678777660667633039
>CSMA_CHLAU/1-79
ATRGWFSESSAQVAQIGDIMFQGHWQWVSNALQATAAAVDNINRNAYPGVSRSGSGEGAFSSSPSNGFRPKRIRSRFNR
>CSMA_CHLPH/2-51
NGGGVFTDILAASGRIFEVMVEGHWATVGYLFDSLGKGVSRINQNAYGNM-----------------------------
...
\end{verbatim}\normalsize

When creating an HMM from an A3M file with hhmake, the first word of the name line is 
used as the name and file name of the HMM (PF02043 in this case). The following is an 
optional description. The descriptions will appear in the hit list and alignment section 
of the search results. The name lines can be arbitrarily long and there can be any number of 
name/description lines included, marked by a '\#' as the first character in the line. 
Note that name lines are read by HHmake but are not a part of the standard definition
of the FASTA or A2M format.
 

\subsection{HHsearch/HHblits model format (hhm-format)} \label{hhmformat}

HMMs can be read by HHsearch/HHblits in HHsuite's own hhm format.
Performance is severely reduced when using HMMER-format, because HMMER adds more pseudocounts than HHsearch/HHblits
that we cannot remove anymore. \emph{We therefore strongly advise against using models from HMMER with HHsuite.} 
We recommend to get the original MSA files and generate hhm-formatted models using \verb`hhmake`, as described in 
subsection \ref{msa_hhblits}.

If you absolutely need to use HMMER format, you can convert it to hhm format with \verb`hhmake`:
\begin{verbatim}
  $ hhmake -i test.hmm -o test.hhm
\end{verbatim}
This works only for a single HMM per file, not for concatenated HMMs. You 
may add predicted secondary structure to the hmm file with \verb`addss.pl` as for hhm format.

HHsearch/HHblits uses a format HMM that is unchanged since HHsearch version 1.5. 
This is the example of an HHM model file produced by HHmake:

\scriptsize
\begin{verbatim}
HHsearch 1.5
NAME  d1mvfd_ b.129.1.1 (D:) MazE {Escherichia coli}
FAM   b.129.1.1
FILE  d1mvfd_
COM   hhmake1 -i d1mvfd_.a3m -o test.hhm 
DATE  Wed May 14 10:41:06 2011
LENG  44 match states, 44 columns in multiple alignment
FILT  32 out of 35 sequences passed filter (-id 90 -cov 0 -qid 0 -qsc -20.00 -diff 100)
NEFF  4.0 
SEQ
>ss_dssp
CBCEEETTEEEEECCHHHHHHTTCCTTCBEEEEEETTEEEEEEC
>ss_pred
CCCCCCCCCCCCCCHHHHHHHHCCCCCCEEEEEEECCEEEEEEC
>ss_conf
93233467666600578899808998986889874993798739
>Consensus
sxIxKWGNSxAvRlPaxlxxxlxlxxgdxixxxxxxxxivlxPv
>d1mvfd_ b.129.1.1 (D:) MazE {Escherichia coli}
SSVKRWGNSPAVRIPATLMQALNLNIDDEVKIDLVDGKLIIEPV
>gi|10176344|dbj|BAB07439.1|:(1-43) suppressor of ppGpp-regulated growth inhibitor [Bacillus halodurans]
TTIQKWGNSLAVRIPNHYAKHINVTQGSEIELSLgSDQTIILKP-
>gi|50120611|ref|YP_049778.1|:(3-43) suppressor of growth inhibitory protein ChpA [Erwinia carotovora]
-TVKKWGNSPAIRLSSSVMQAFDMTFNDSFDMEIRETEIALIP-
>gi|44064461|gb|EAG93225.1|:(2-42) unknown [environmental sequence]
-SVVKWGSYLAVRLPAELVLELGLKEGDEIDLVKDDGPVRVR--
>gi|31442758|gb|AAP55635.1|:(1-44) PemI-like protein [Pediococcus acidilactici]
TRLAKWGNSKAARIPSQIIKQLKLDDNQDMTITIENGSIVLTPI
>gi|44419085|gb|EAJ13619.1|:(3-43) unknown [environmental sequence]
SAIQKWGNSAAVRLPAVLLEQIDASVGSSLNADVRPDGVLLSP-
>gi|24376549|gb|AAN57947.1|:(3-44) putative cell growth regulatory protein [Streptococcus mutans UA159]
SAINKWGNSSAIRLPKQLVQELQLQTNDVLDYKVSGNKIILEKV
>gi|11344928|gb|AAG34554.1|:(1-44) MazE [Photobacterium profundum]
TQIRKIGNSLGSIIPATFIRQLELAEGAEIDVKTVDGKIVIEPI
>gi|45681193|ref|ZP_00192636.1|:(2-44) COG2336: Growth regulator [Mesorhizobium sp. BNC1]
-TIRKIGNSEGVILPKELLDRHNLKTGDALAIVEEGSDLVLKPV
#
NULL  3706 5728 4211 4064 4839 3729 4763 4308 4069  3323  5509 4640 4464 4937 4285 4423 3815 3783 6325 4665
HMM   A    C    D    E    F    G    H    I    K     L     M    N    P    Q    R    S    T    V    W    Y
      M->M M->I M->D I->M I->I D->M D->D Neff NeffI NeffD
      0    *    *    0    *    0    *    *    *     *
S 1   *    *    *    *    *    *    *    *    *     *     *    *    *    *    *    1012 988  *    *    *   1
      0    *    *    *    *    *    *    2817 0     0

S 2   2307 *    *    *    *    *    *    *    *     *     *    *    *    3178 3009 2179 1546 *    *    *   2
      0    *    *    *    *    *    *    3447 0     0

V 3   *    *    *    *    *    *    *    917  *     3009  *    *    *    *    *    *    *    1530 *    *   3
      0    *    *    *    *    *    *    3447 0     0
  .
  .
  .
V 44  *    *    *    *    *    *    *    1309 *     *     *    *    *    *    *    *    *    745  *    *   44
      0    *    *    0    *    *    *    2533 0     0

//
\end{verbatim}
\normalsize

The first line (\verb`HHsearch 1.5`) gives the format version, which corresponds to the HHsearch version for which this format was first introduced. Newer versions of HHsearch/HHblits may use previous format versions. The \verb`NAME` line gives the name of the HMM and an optional description. The first 30 characters of this field are used in the summary hit list of the search results in hhr format, the full name line is given above the query-template alignments of the search results. The \verb`FAM` line contains the family if the sequence is from SCOP of PFAM (used for calibration). \verb`COM` is the command that was used to generate the file. \verb`NEFF` is the diversity of the alignment, calculated as $\exp$ of the negative entropy averaged over all columns of the alignment. 

The \verb`SEQ` section contains a number of aligned, representative (pseudo) sequences in A3M format and is terminated with a line containing only a \verb`#`. The first sequence represents the DSSP secondary structure (if available, i.e., if contained in the A3M or FASTA alignment from which the HMM model was built), the second and third sequences contain the predicted secondary structure and the corresponding confidence values in the range 0--9 (if available). The fourth sequence is the consensus annotation sequence that is shown in the pairwise query-template alignments in the hhsearch output. The first \emph{real} sequence after the pseudo sequences is the \emph{seed} or \emph{master} sequence from which the alignment was built (\verb`>d1mvfd_`, in our example). If the alignment does not represent a single master sequence but an entire family, as in the case of PFAM alignments for example, the first real sequence may be a consensus sequence calculated for the entire alignment. This master sequence is shown in the pairwise query-template alignments in the hhsearch output. 

The next line specifies the null model frequencies, which are extracted from the selected substitution matrix used to add pseudocounts. Each of the positive integers is equal to 1000 times the negative logarithm of the amino acid frequency (which is between 0 and 1):
\begin{equation}
  -1000 \times \log_2( frequency)
  \label{transformation}
\end{equation}
 After the two annotation lines that specify the order of columns for the emission and transition probabilities that follow, there is a line which is not currently read by HHsearch and that lists the transition frequencies from the begin state to the first Match state, Insert state and Delete state.

 The last block contains two lines for each column of the HMM. The first line starts with the amino acid in the master sequence at that column in the HMM and the column number. Following are 20  positive integers representing the match state amino acid emission frequencies (see eq.\ \ref{transformation}). Asterisks \verb`*` stand for a frequency of 0 (which would otherwise be represented by 99999). Please note that, unlike in HMMER format,  \emph{the emission frequencies do not contain pseudo-counts} in the HHsearch model format. The second line contains the seven transition frequencies (eq.\ \ref{transformation}) coded as in eq.\ \ref{transformation}. The three local diversities, \verb`Neff_M`, \verb`Neff_I`, and \verb`Neff_D` are given in units of 0.001 (see next paragraph). The end of the model is indicated by a line containing only \verb`//`.


\section{Summary of command-line parameters}

This is just a brief summary of command line parameters for the various binaries and
perl scripts as they are displayed by the programs when calling them without 
command line parameters. On the help pages of the HHpred/HHblits web server
\url{http://toolkit.tuebingen.mpg.de} you can find more detailed explanations
about some of the input parameters  ('Parameters' section) and about how to interpret
the output ('Results' section). The FAQ section contains valuable practical hints on topics
such as how to validate marginally significant database matches or how to avoid high-scoring
false positives.


\subsection{{\tt hhblits} -- HMM-HMM-based lighting-fast iterative sequence search}

HHblits is a sensitive, general-purpose, iterative sequence search tool that represents
both query and database sequences by HMMs. You can search HHsuite databases starting
with a single query sequence, a multiple sequence alignment (MSA), or an HMM. HHblits
prints out a ranked list of database HMMs/MSAs and can also generate an MSA by merging
the significant database HMMs/MSAs onto the query MSA.

The binary hhblits\_omp supports the parallelization over several queries with OpenMP.
In this case the input needs to be the basename of an ffindex with MSA's.
The outputs are treated as the basenames for output ffindices.

Assume the ffindex database \verb`/home/user/databases/scop_a3m.ff{data,index}`, the corresponding basename
is \verb`/home/user/databases/scop_a3m`

\small 
\begin{verbatim}
Usage: hhblits -i query [options] 
 -i <file>      input/query: single sequence or multiple sequence alignment (MSA)
                in a3m, a2m, or FASTA format, or HMM in hhm format

<file> may be 'stdin' or 'stdout' throughout.

Options:                                                                        
 -d <name>      database name (e.g. uniprot20_29Feb2012)                        
                Multiple databases may be specified with '-d <db1> -d <db2> ...'
 -n     [1,8]   number of iterations (default=2)                               
 -e     [0,1]   E-value cutoff for inclusion in result alignment (def=0.001)       

Input alignment format:                                                       
 -M a2m         use A2M/A3M (default): upper case = Match; lower case = Insert;
               ' -' = Delete; '.' = gaps aligned to inserts (may be omitted)   
 -M first       use FASTA: columns with residue in 1st sequence are match states
 -M [0,100]     use FASTA: columns with fewer than X% gaps are match states   
 -tags/-notags  do NOT / do neutralize His-, C-myc-, FLAG-tags, and trypsin 
                recognition sequence to background distribution (def=-notags)  

Output options: 
 -o <file>      write results in standard format to file (default=<infile.hhr>)
 -oa3m <file>   write result MSA with significant matches in a3m format
 -opsi <file>   write result MSA of significant matches in PSI-BLAST format
 -ohhm <file>   write HHM file for result MSA of significant matches
 -oalis <name>  write MSAs in A3M format after each iteration
 -add_cons      generate consensus sequence as master sequence of query MSA (default=don't)
 -hide_cons     don't show consensus sequence in alignments (default=show)     
 -hide_pred     don't show predicted 2ndary structure in alignments (default=show)
 -hide_dssp     don't show DSSP 2ndary structure in alignments (default=show)  
 -show_ssconf   show confidences for predicted 2ndary structure in alignments
 -Ofas <file>   write pairwise alignments in FASTA xor A2M (-Oa2m) xor A3M (-Oa3m) format   
 -seq <int>     max. number of query/template sequences displayed (default=1)  
 -aliw <int>    number of columns per line in alignment list (default=80)       
 -p [0,100]     minimum probability in summary and alignment list (default=20)  
 -E [0,inf[     maximum E-value in summary and alignment list (default=1E+06)      
 -Z <int>       maximum number of lines in summary hit list (default=500)        
 -z <int>       minimum number of lines in summary hit list (default=10)        
 -B <int>       maximum number of alignments in alignment list (default=500)     
 -b <int>       minimum number of alignments in alignment list (default=10)     

Prefilter options                                                               
 -noprefilt                disable all filter steps                                        
 -noaddfilter              disable all filter steps (except for fast prefiltering)         
 -maxfilt                  max number of hits allowed to pass 2nd prefilter (default=20000)   
 -min_prefilter_hits       min number of hits to pass prefilter (default=100)               
 -prepre_smax_thresh       min score threshold of ungapped prefilter (default=10)               
 -pre_evalue_thresh        max E-value threshold of Smith-Waterman prefilter score (default=1000.0)
 -pre_bitfactor            prefilter scores are in units of 1 bit / pre_bitfactor (default=4)
 -pre_gap_open             gap open penalty in prefilter Smith-Waterman alignment (default=20)
 -pre_gap_extend           gap extend penalty in prefilter Smith-Waterman alignment (default=4)
 -pre_score_offset         offset on sequence profile scores in prefilter S-W alignment (default=50)

Filter options applied to query MSA, database MSAs, and result MSA              
 -all           show all sequences in result MSA; do not filter result MSA      
 -id   [0,100]  maximum pairwise sequence identity (def=90)
 -diff [0,inf[  filter MSAs by selecting most diverse set of sequences, keeping 
                at least this many seqs in each MSA block of length 50 
                Zero and non-numerical values turn off the filtering. (def=1000) 
 -cov  [0,100]  minimum coverage with master sequence (%) (def=0)             
 -qid  [0,100]  minimum sequence identity with master sequence (%) (def=0)    
 -qsc  [0,100]  minimum score per column with master sequence (default=-20.0)    
 -neff [1,inf]  target diversity of multiple sequence alignment (default=off)   
 -mark          do not filter out sequences marked by ">@"in their name line  

HMM-HMM alignment options:                                                       
 -norealign           do NOT realign displayed hits with MAC algorithm (def=realign)   
 -realign_old_hits    realign hits from previous iterations                          
 -mact [0,1[          posterior prob threshold for MAC realignment controlling greedi- 
                      ness at alignment ends: 0:global >0.1:local (default=0.35)       
 -glob/-loc           use global/local alignment mode for searching/ranking (def=local)
 -realign             realign displayed hits with max. accuracy (MAC) algorithm 
 -realign_max <int>   realign max. <int> hits (default=500)                        
 -ovlp <int>          banded alignment: forbid <ovlp> largest diagonals |i-j| of DP matrix (def=0)
 -alt <int>           show up to this many alternative alignments with raw score > smin(def=4)  
 -smin <float>        minimum raw score for alternative alignments (def=20.0)  
 -shift [-1,1]        profile-profile score offset (def=-0.03)                         
 -corr [0,1]          weight of term for pair correlations (def=0.10)                
 -sc   <int>          amino acid score         (tja: template HMM at column j) (def=1)
              0       = log2 Sum(tja*qia/pa)   (pa: aa background frequencies)    
              1       = log2 Sum(tja*qia/pqa)  (pqa = 1/2*(pa+ta) )               
              2       = log2 Sum(tja*qia/ta)   (ta: av. aa freqs in template)     
              3       = log2 Sum(tja*qia/qa)   (qa: av. aa freqs in query)        
              5       local amino acid composition correction                     
 -ssm {0,..,4}        0:   no ss scoring                                             
                      1,2: ss scoring after or during alignment  [default=2]         
                      3,4: ss scoring after or during alignment, predicted vs. predicted
 -ssw [0,1]           weight of ss score  (def=0.11)                                  
 -ssa [0,1]           ss confusion matrix = (1-ssa)*I + ssa*psipred-confusion-matrix [def=1.00)
 -wg                  use global sequence weighting for realignment!                   

Gap cost options:                                                                
 -gapb [0,inf[  Transition pseudocount admixture (def=1.00)                     
 -gapd [0,inf[  Transition pseudocount admixture for open gap (default=0.15)    
 -gape [0,1.5]  Transition pseudocount admixture for extend gap (def=1.00)      
 -gapf ]0,inf]  factor to increase/reduce gap open penalty for deletes (def=0.60) 
 -gapg ]0,inf]  factor to increase/reduce gap open penalty for inserts (def=0.60) 
 -gaph ]0,inf]  factor to increase/reduce gap extend penalty for deletes(def=0.60)
 -gapi ]0,inf]  factor to increase/reduce gap extend penalty for inserts(def=0.60)
 -egq  [0,inf[  penalty (bits) for end gaps aligned to query residues (def=0.00) 
 -egt  [0,inf[  penalty (bits) for end gaps aligned to template residues (def=0.00)

Pseudocount (pc) options:                                                        
 Context specific hhm pseudocounts:
  -pc_hhm_contxt_mode {0,..,3}   position dependence of pc admixture 'tau' (pc mode, default=2) 
               0: no pseudo counts:    tau = 0                                  
               1: constant             tau = a                                  
               2: diversity-dependent: tau = a/(1+((Neff[i]-1)/b)^c)            
               3: CSBlast admixture:   tau = a(1+b)/(Neff[i]+b)                 
               (Neff[i]: number of effective seqs in local MSA around column i) 
  -pc_hhm_contxt_a  [0,1]        overall pseudocount admixture (def=0.9)                        
  -pc_hhm_contxt_b  [1,inf[      Neff threshold value for mode 2 (def=4.0)                      
  -pc_hhm_contxt_c  [0,3]        extinction exponent c for mode 2 (def=1.0)                     

 Context independent hhm pseudocounts (used for templates; used for query if contxt file is not available):
  -pc_hhm_nocontxt_mode {0,..,3}   position dependence of pc admixture 'tau' (pc mode, default=2) 
               0: no pseudo counts:    tau = 0                                  
               1: constant             tau = a                                  
               2: diversity-dependent: tau = a/(1+((Neff[i]-1)/b)^c)            
               (Neff[i]: number of effective seqs in local MSA around column i) 
  -pc_hhm_nocontxt_a  [0,1]        overall pseudocount admixture (def=1.0)                        
  -pc_hhm_nocontxt_b  [1,inf[      Neff threshold value for mode 2 (def=1.5)                      
  -pc_hhm_nocontxt_c  [0,3]        extinction exponent c for mode 2 (def=1.0)                     

 Context specific prefilter pseudocounts:
  -pc_prefilter_contxt_mode {0,..,3}   position dependence of pc admixture 'tau' (pc mode, default=3) 
               0: no pseudo counts:    tau = 0                                  
               1: constant             tau = a                                  
               2: diversity-dependent: tau = a/(1+((Neff[i]-1)/b)^c)            
               3: CSBlast admixture:   tau = a(1+b)/(Neff[i]+b)                 
               (Neff[i]: number of effective seqs in local MSA around column i) 
  -pc_prefilter_contxt_a  [0,1]        overall pseudocount admixture (def=0.8)                        
  -pc_prefilter_contxt_b  [1,inf[      Neff threshold value for mode 2 (def=2.0)                      
  -pc_prefilter_contxt_c  [0,3]        extinction exponent c for mode 2 (def=1.0)                     

 Context independent prefilter pseudocounts (used if context file is not available):
  -pc_prefilter_nocontxt_mode {0,..,3}   position dependence of pc admixture 'tau' (pc mode, default=2) 
               0: no pseudo counts:    tau = 0                                  
               1: constant             tau = a                                  
               2: diversity-dependent: tau = a/(1+((Neff[i]-1)/b)^c)            
               (Neff[i]: number of effective seqs in local MSA around column i) 
  -pc_prefilter_nocontxt_a  [0,1]        overall pseudocount admixture (def=1.0)                        
  -pc_prefilter_nocontxt_b  [1,inf[      Neff threshold value for mode 2 (def=1.5)                      
  -pc_prefilter_nocontxt_c  [0,3]        extinction exponent c for mode 2 (def=1.0)                     

 Context-specific pseudo-counts:                                                  
  -nocontxt      use substitution-matrix instead of context-specific pseudocounts 
  -contxt <file> context file for computing context-specific pseudocounts (default=${HHLIB}/data/context_data.crf)
  -csw  [0,inf]  weight of central position in cs pseudocount mode (def=1.6)
  -csb  [0,1]    weight decay parameter for positions in cs pc mode (def=0.9)

Other options:                                                                   
 -v <int>       verbose mode: 0:no screen output  1:only warings  2: verbose (def=2)
 -neffmax ]1,20] skip further search iterations when diversity Neff of query MSA 
                becomes larger than neffmax (default=10.0)
 -cpu <int>     number of CPUs to use (for shared memory SMPs) (default=2)      
 -scores <file> write scores for all pairwise comparisons to file               
 -atab   <file> write all alignments in tabular layout to file                   
 -maxres <int>  max number of HMM columns (def=20001)             
 -maxmem [1,inf[ limit memory for realignment (in GB) (def=3.0)          

Example: hhblits -i query.fas -oa3m query.a3m -n 1  
\end{verbatim} 
\normalsize


\subsection{{\tt hhsearch} -- search a database of HMMs with a query MSA or HMM}

\small 
\begin{verbatim}
Usage: hhsearch -i query -d database [options]                       
 -i <file>      input/query multiple sequence alignment (a2m, a3m, FASTA) or HMM

<file> may be 'stdin' or 'stdout' throughout.
Options:                                                                        
 -d <name>      database name (e.g. uniprot20_29Feb2012)                        
                Multiple databases may be specified with '-d <db1> -d <db2> ...'
 -e     [0,1]   E-value cutoff for inclusion in result alignment (def=0.001)       

Input alignment format:                                                       
 -M a2m         use A2M/A3M (default): upper case = Match; lower case = Insert;
               '-' = Delete; '.' = gaps aligned to inserts (may be omitted)   
 -M first       use FASTA: columns with residue in 1st sequence are match states
 -M [0,100]     use FASTA: columns with fewer than X% gaps are match states   
 -tags/-notags  do NOT / do neutralize His-, C-myc-, FLAG-tags, and trypsin 
                recognition sequence to background distribution (def=-notags)  

Output options: 
 -o <file>      write results in standard format to file (default=<infile.hhr>)
 -oa3m <file>   write result MSA with significant matches in a3m format
 -opsi <file>   write result MSA of significant matches in PSI-BLAST format
 -ohhm <file>   write HHM file for result MSA of significant matches
 -add_cons      generate consensus sequence as master sequence of query MSA (default=don't)
 -hide_cons     don't show consensus sequence in alignments (default=show)     
 -hide_pred     don't show predicted 2ndary structure in alignments (default=show)
 -hide_dssp     don't show DSSP 2ndary structure in alignments (default=show)  
 -show_ssconf   show confidences for predicted 2ndary structure in alignments
 -Ofas <file>   write pairwise alignments in FASTA xor A2M (-Oa2m) xor A3M (-Oa3m) format   
 -seq <int>     max. number of query/template sequences displayed (default=1)  
 -aliw <int>    number of columns per line in alignment list (default=80)       
 -p [0,100]     minimum probability in summary and alignment list (default=20)  
 -E [0,inf[     maximum E-value in summary and alignment list (default=1E+06)      
 -Z <int>       maximum number of lines in summary hit list (default=500)        
 -z <int>       minimum number of lines in summary hit list (default=10)        
 -B <int>       maximum number of alignments in alignment list (default=500)     
 -b <int>       minimum number of alignments in alignment list (default=10)     

Filter options applied to query MSA, database MSAs, and result MSA              
 -all           show all sequences in result MSA; do not filter result MSA      
 -id   [0,100]  maximum pairwise sequence identity (def=90)
 -diff [0,inf[  filter MSAs by selecting most diverse set of sequences, keeping 
                at least this many seqs in each MSA block of length 50 
                Zero and non-numerical values turn off the filtering. (def=100) 
 -cov  [0,100]  minimum coverage with master sequence (%) (def=0)             
 -qid  [0,100]  minimum sequence identity with master sequence (%) (def=0)    
 -qsc  [0,100]  minimum score per column with master sequence (default=-20.0)    
 -neff [1,inf]  target diversity of multiple sequence alignment (default=off)   
 -mark          do not filter out sequences marked by ">@"in their name line  

HMM-HMM alignment options:                                                       
 -norealign          do NOT realign displayed hits with MAC algorithm (def=realign)   
 -ovlp <int>         banded alignment: forbid <ovlp> largest diagonals |i-j| of DP matrix (def=0)
 -mact [0,1[         posterior prob threshold for MAC realignment controlling greedi- 
                     ness at alignment ends: 0:global >0.1:local (default=0.35)       
 -glob/-loc          use global/local alignment mode for searching/ranking (def=local)
 -realign            realign displayed hits with max. accuracy (MAC) algorithm 
 -excl <range>       exclude query positions from the alignment, e.g. '1-33,97-168' 
 -realign_max <int>  realign max. <int> hits (default=500)                        
 -alt <int>          show up to this many alternative alignments with raw score > smin(def=4)  
 -smin <float>       minimum raw score for alternative alignments (def=20.0)  
 -shift [-1,1]       profile-profile score offset (def=-0.03)                         
 -corr [0,1]         weight of term for pair correlations (def=0.10)                
 -sc   <int>         amino acid score         (tja: template HMM at column j) (def=1)
        0       = log2 Sum(tja*qia/pa)   (pa: aa background frequencies)    
        1       = log2 Sum(tja*qia/pqa)  (pqa = 1/2*(pa+ta) )               
        2       = log2 Sum(tja*qia/ta)   (ta: av. aa freqs in template)     
        3       = log2 Sum(tja*qia/qa)   (qa: av. aa freqs in query)        
        5       local amino acid composition correction                     
 -ssm {0,..,4}    0:   no ss scoring                                             
                1,2: ss scoring after or during alignment  [default=2]         
                3,4: ss scoring after or during alignment, predicted vs. predicted
 -ssw [0,1]          weight of ss score  (def=0.11)                                  
 -ssa [0,1]          SS substitution matrix = (1-ssa)*I + ssa*full-SS-substition-matrix [def=1.00)
 -wg                 use global sequence weighting for realignment!                   

Gap cost options:                                                                
 -gapb [0,inf[  Transition pseudocount admixture (def=1.00)                     
 -gapd [0,inf[  Transition pseudocount admixture for open gap (default=0.15)    
 -gape [0,1.5]  Transition pseudocount admixture for extend gap (def=1.00)      
 -gapf ]0,inf]  factor to increase/reduce gap open penalty for deletes (def=0.60) 
 -gapg ]0,inf]  factor to increase/reduce gap open penalty for inserts (def=0.60) 
 -gaph ]0,inf]  factor to increase/reduce gap extend penalty for deletes(def=0.60)
 -gapi ]0,inf]  factor to increase/reduce gap extend penalty for inserts(def=0.60)
 -egq  [0,inf[  penalty (bits) for end gaps aligned to query residues (def=0.00) 
 -egt  [0,inf[  penalty (bits) for end gaps aligned to template residues (def=0.00)

Pseudocount (pc) options:                                                        
 Context specific hhm pseudocounts:
  -pc_hhm_contxt_mode {0,..,3}   position dependence of pc admixture 'tau' (pc mode, default=2) 
               0: no pseudo counts:    tau = 0                                  
               1: constant             tau = a                                  
               2: diversity-dependent: tau = a/(1+((Neff[i]-1)/b)^c)            
               3: CSBlast admixture:   tau = a(1+b)/(Neff[i]+b)                 
               (Neff[i]: number of effective seqs in local MSA around column i) 
  -pc_hhm_contxt_a  [0,1]        overall pseudocount admixture (def=0.9)                        
  -pc_hhm_contxt_b  [1,inf[      Neff threshold value for mode 2 (def=4.0)                      
  -pc_hhm_contxt_c  [0,3]        extinction exponent c for mode 2 (def=1.0)                     

 Context independent hhm pseudocounts (used for templates; used for query if contxt file is not available):
  -pc_hhm_nocontxt_mode {0,..,3}   position dependence of pc admixture 'tau' (pc mode, default=2) 
               0: no pseudo counts:    tau = 0                                  
               1: constant             tau = a                                  
               2: diversity-dependent: tau = a/(1+((Neff[i]-1)/b)^c)            
               (Neff[i]: number of effective seqs in local MSA around column i) 
  -pc_hhm_nocontxt_a  [0,1]        overall pseudocount admixture (def=1.0)                        
  -pc_hhm_nocontxt_b  [1,inf[      Neff threshold value for mode 2 (def=1.5)                      
  -pc_hhm_nocontxt_c  [0,3]        extinction exponent c for mode 2 (def=1.0)                     

 Context-specific pseudo-counts:                                                  
  -nocontxt      use substitution-matrix instead of context-specific pseudocounts 
  -contxt <file> context file for computing context-specific pseudocounts (default=${HHLIB}/data/context_data.crf)
  -csw  [0,inf]  weight of central position in cs pseudocount mode (def=1.6)
  -csb  [0,1]    weight decay parameter for positions in cs pc mode (def=0.9)

Other options:                                                                   
 -v <int>       verbose mode: 0:no screen output  1:only warings  2: verbose (def=2)
 -cpu <int>     number of CPUs to use (for shared memory SMPs) (default=2)      
 -scores <file> write scores for all pairwise comparisons to file               
 -atab   <file> write all alignments in tabular layout to file                   
 -maxres <int>  max number of HMM columns (def=20001)             
 -maxmem [1,inf[ limit memory for realignment (in GB) (def=3.0)          

Example: hhsearch -i a.1.1.1.a3m -d scop70_1.71
\end{verbatim} 
\normalsize


\subsection{{\tt hhfilter} -- filter an MSA}

Filter an alignment by maximum pairwise sequence identity, minimum coverage,
minimum sequence identity, or score per column to the first (seed) sequence etc.

\small 
\begin{verbatim}
Usage: hhfilter -i infile -o outfile [options]                  
 -i <file>      read input file in A3M/A2M or FASTA format                 
 -o <file>      write to output file in A3M format                         
 -a <file>      append to output file in A3M format                        

Options:                                                                  
 -v <int>       verbose mode: 0:no screen output  1:only warings  2: verbose
 -id   [0,100]  maximum pairwise sequence identity (%) (def=90)   
 -diff [0,inf[  filter MSA by selecting most diverse set of sequences, keeping 
                at least this many seqs in each MSA block of length 50 (def=0) 
 -cov  [0,100]  minimum coverage with query (%) (def=0) 
 -qid  [0,100]  minimum sequence identity with query (%) (def=0) 
 -qsc  [0,100]  minimum score per column with query  (def=-20.0)
 -neff [1,inf]  target diversity of alignment (default=off)

Input alignment format:                                                    
 -M a2m         use A2M/A3M (default): upper case = Match; lower case = Insert;
                '-' = Delete; '.' = gaps aligned to inserts (may be omitted)   
 -M first       use FASTA: columns with residue in 1st sequence are match states
 -M [0,100]     use FASTA: columns with fewer than X% gaps are match states   
                                                                          
Example: hhfilter -id 50 -i d1mvfd_.a2m -o d1mvfd_.fil.a2m          
\end{verbatim} 
\normalsize


\subsection{{\tt hhmake} -- build an HMM from an input alignment}

Build a profile hidden Markov models from an input alignment, formatted 
in A2M, A3M, or FASTA, or convert between HMMER format (.hmm) and 
HHsearch format (.hhm).   

\small 
\begin{verbatim}
Usage: hhmake -i file [options]                                       
 -i <file>     query alignment (A2M, A3M, or FASTA), or query HMM         

Output options:                                                           
 -o <file>     HMM file to be written to  (default=<infile.hhm>)          
 -a <file>     HMM file to be appended to                                 
 -v <int>      verbose mode: 0:no screen output  1:only warings  2: verbose
 -seq <int>    max. number of query/template sequences displayed (def=10)  
               Beware of overflows! All these sequences are stored in memory.
 -cons         make consensus sequence master sequence of query MSA 
 -name <name>  use this name for HMM (default: use name of first sequence)   

Filter query multiple sequence alignment                                     
 -id   [0,100]  maximum pairwise sequence identity (%) (def=90)   
 -diff [0,inf[  filter MSA by selecting most diverse set of sequences, keeping 
                at least this many seqs in each MSA block of length 50 (def=100) 
 -cov  [0,100]  minimum coverage with query (%) (def=0) 
 -qid  [0,100]  minimum sequence identity with query (%) (def=0) 
 -qsc  [0,100]  minimum score per column with query  (def=-20.0)
 -neff [1,inf]  target diversity of alignment (default=off)

Input alignment format:                                                    
 -M a2m        use A2M/A3M (default): upper case = Match; lower case = Insert;
               '-' = Delete; '.' = gaps aligned to inserts (may be omitted)   
 -M first      use FASTA: columns with residue in 1st sequence are match states
 -M [0,100]    use FASTA: columns with fewer than X% gaps are match states   

Pseudocount (pc) options:                                                        
 Context specific hhm pseudocounts:
  -pc_hhm_contxt_mode {0,..,3}   position dependence of pc admixture 'tau' (pc mode, default=0) 
               0: no pseudo counts:    tau = 0                                  
               1: constant             tau = a                                  
               2: diversity-dependent: tau = a/(1+((Neff[i]-1)/b)^c)            
               3: CSBlast admixture:   tau = a(1+b)/(Neff[i]+b)                 
               (Neff[i]: number of effective seqs in local MSA around column i) 
  -pc_hhm_contxt_a  [0,1]        overall pseudocount admixture (def=0.9)                        
  -pc_hhm_contxt_b  [1,inf[      Neff threshold value for mode 2 (def=4.0)                      
  -pc_hhm_contxt_c  [0,3]        extinction exponent c for mode 2 (def=1.0)                     

 Context independent hhm pseudocounts (used for templates; used for query if contxt file is not available):
  -pc_hhm_nocontxt_mode {0,..,3}   position dependence of pc admixture 'tau' (pc mode, default=2) 
               0: no pseudo counts:    tau = 0                                  
               1: constant             tau = a                                  
               2: diversity-dependent: tau = a/(1+((Neff[i]-1)/b)^c)            
               (Neff[i]: number of effective seqs in local MSA around column i) 
  -pc_hhm_nocontxt_a  [0,1]        overall pseudocount admixture (def=1.0)                        
  -pc_hhm_nocontxt_b  [1,inf[      Neff threshold value for mode 2 (def=1.5)                      
  -pc_hhm_nocontxt_c  [0,3]        extinction exponent c for mode 2 (def=1.0)                     

 Context-specific pseudo-counts:                                                  
  -nocontxt      use substitution-matrix instead of context-specific pseudocounts 
  -contxt <file> context file for computing context-specific pseudocounts (default=${HHLIB}/data/context_data.crf)

Example: hhmake -i test.a3m 
\end{verbatim} 
\normalsize



\subsection{{\tt hhalign} -- align a query MSA/HMM to a template MSA/HMM}

Align a query alignment/HMM to a template alignment/HMM by HMM-HMM alignment.
If only one alignment/HMM is given it is compared to itself and the best
off-diagonal alignment plus all further non-overlapping alignments above 
significance threshold are shown. The command also allows to sample
alignments randomly or to print out a list of indices of aligned residue pairs. 

\small 
\begin{verbatim}
Usage: hhalign -i query -t template [options]  
 -i <file>      input/query: single sequence or multiple sequence alignment (MSA)
                in a3m, a2m, or FASTA format, or HMM in hhm format
 -t <file>      input/template: single sequence or multiple sequence alignment (MSA)
                in a3m, a2m, or FASTA format, or HMM in hhm format

Input alignment format:                                                       
 -M a2m         use A2M/A3M (default): upper case = Match; lower case = Insert;
               '-' = Delete; '.' = gaps aligned to inserts (may be omitted)   
 -M first       use FASTA: columns with residue in 1st sequence are match states
 -M [0,100]     use FASTA: columns with fewer than X% gaps are match states   
 -tags/-notags  do NOT / do neutralize His-, C-myc-, FLAG-tags, and trypsin 
                recognition sequence to background distribution (def=-notags)  

Output options: 
 -o <file>      write results in standard format to file (default=<infile.hhr>)
 -oa3m <file>   write query alignment in a3m or PSI-BLAST format (-opsi) to file (default=none)
 -aa3m <file>   append query alignment in a3m (-aa3m) or PSI-BLAST format (-apsi )to file (default=none)
 -Ofas <file>   write pairwise alignments in FASTA xor A2M (-Oa2m) xor A3M (-Oa3m) format   
 -add_cons      generate consensus sequence as master sequence of query MSA (default=don't)
 -hide_cons     don't show consensus sequence in alignments (default=show)     
 -hide_pred     don't show predicted 2ndary structure in alignments (default=show)
 -hide_dssp     don't show DSSP 2ndary structure in alignments (default=show)  
 -show_ssconf   show confidences for predicted 2ndary structure in alignments

Filter options applied to query MSA, template MSA, and result MSA              
 -id   [0,100]  maximum pairwise sequence identity (def=90)
 -diff [0,inf[  filter MSAs by selecting most diverse set of sequences, keeping 
                at least this many seqs in each MSA block of length 50 
                Zero and non-numerical values turn off the filtering. (def=100) 
 -cov  [0,100]  minimum coverage with master sequence (%) (def=0)             
 -qid  [0,100]  minimum sequence identity with master sequence (%) (def=0)    
 -qsc  [0,100]  minimum score per column with master sequence (default=-20.0)    
 -mark          do not filter out sequences marked by ">@"in their name line  

HMM-HMM alignment options:                                                       
 -norealign     do NOT realign displayed hits with MAC algorithm (def=realign)   
 -mact [0,1[    posterior prob threshold for MAC realignment controlling greedi- 
                ness at alignment ends: 0:global >0.1:local (default=0.35)       
 -glob/-loc     use global/local alignment mode for searching/ranking (def=local)

Other options:                                                                   
 -v <int>       verbose mode: 0:no screen output  1:only warings  2: verbose (def=2)

Example: hhalign -i T0187.a3m -t d1hz4a_.hhm -o result.hhr
\end{verbatim} 
\normalsize


\subsection{{\tt reformat.pl} -- reformat one or many alignments}

Read one or many multiple alignments in one format and write them in another format

\small 
\begin{verbatim}
Usage: reformat.pl [informat] [outformat] infile outfile [options] 
  or   reformat.pl [informat] [outformat] 'fileglob' .ext [options] 

Available input formats:
   fas:     aligned fasta; lower and upper case equivalent, '.' and '-' equivalent
   a2m:     aligned fasta; inserts: lower case, matches: upper case, deletes: '-',
            gaps aligned to inserts: '.'
   a3m:     like a2m, but gaps aligned to inserts MAY be omitted
   sto:     Stockholm format; sequences in several blocks with sequence name at 
            beginning of line (HMMER output)
   psi:     format as read by PSI-BLAST using the -B option (like sto with -M first -r)
   clu:     Clustal format; sequences in several blocks with sequence name at beginning 
            of line
Available output formats:
   fas:     aligned fasta; all gaps '-'
   a2m:     aligned fasta; inserts: lower case, matches: upper case, deletes: '-', 
            gaps aligned to inserts: '.'
   a3m:     like a2m, but gaps aligned to inserts are omitted
   sto:     Stockholm format; sequences in just one block, one line per sequence
   psi:     format as read by PSI-BLAST using the -B option 
   clu:     Clustal format
If no input or output format is given the file extension is interpreted as format 
specification ('aln' as 'clu')

Options:
  -v int    verbose mode (0:off, 1:on)
  -num      add number prefix to sequence names: 'name', '1:name' '2:name' etc
  -noss     remove secondary structure sequences (beginning with >ss_)
  -sa       do not remove solvent accessibility sequences (beginning with >sa_)
  -M first  make all columns with residue in first seuqence match columns 
            (default for output format a2m or a3m)
  -M int    make all columns with less than X% gaps match columns 
            (for output format a2m or a3m)
  -r        remove all lower case residues (insert states) 
            (AFTER -M option has been processed)
  -r int    remove all lower case columns with more than X% gaps
  -g ''     suppress all gaps
  -g '-'    write all gaps as '-'
  -uc       write all residues in upper case (AFTER other options have been processed)
  -lc       write all residues in lower case (AFTER other options have been processed)
  -l        number of residues per line (for Clustal, FASTA, A2M, A3M formats) 
            (default=100)
  -d        maximum number of characers in nameline (default=1000)

Examples: reformat.pl 1hjra.a3m 1hjra.a2m  
          (same as reformat.pl a3m a2m 1hjra.a3m 1hjra.a2m)
          reformat.pl test.a3m test.fas -num -r 90
          reformat.pl fas sto '*.fasta' .stockholm
\end{verbatim} 
\normalsize


\subsection{{\tt addss.pl} -- add predicted secondary structure to an MSA or HMM}
Add PSIPRED secondary structure prediction (and DSSP annotation) to a multiple sequence alignment (MSA) 
or HMMER (multi-)model file. 

If the input file is an MSA, the predicted secondary structure and confidence values are added as 
special annotation sequences with names \verb`>ss_pred`, \verb`>ss_conf`, and \verb`>ss_dssp` to the top of the output 
A3M alignment. If no output file is given, the output file will have the same name as the input file, 
except for the extension being replaced by \verb`'.a3m'`. Allowed input formats are A2M/FASTA (default), 
A3M (-a3m), CLUSTAL (-clu), STOCKHOLM (-sto), HMMER (-hmm). Note that in order to add PSIPRED and DSSP annotations corresponding paths
have to be set in \verb`HHPaths.pm`. To add DSSP annotations \verb`addss.pl` first checks a folder of precomputed dssp files which 
can be obtained from \href{http://www.cmbi.ru.nl/dssp.html}{here}. If the dssp file for a particular structure is not available, \verb`addss.pl` tries to 
execute the DSSP binary \verb`mkdssp`. Please keep in mind that only the current DSSP version (2.2.1) supports structures in cif format, and that compiling the software 
may cause problems on several systems. If you have troubles to compile DSSP you may try to delete all occurrences of \verb`"static"` in its Makefile.

If the input file contains HMMER models, records SSPRD and SSCON containing predicted secondary 
structure and confidence values are added to each model. In this case the output file name is 
obligatory and must be different from the input file name.

\small 
\begin{verbatim}
Usage: perl addss.pl <ali file> [<outfile>] [-fas|-a3m|-clu|-sto]  
  or   perl addss.pl <hmm file> <outfile> -hmm  
\end{verbatim} 
\normalsize


\subsection{{\tt hhmakemodel.pl} and {\tt hhmakemodel.py} -- generate MSAs or coarse 3D models from HHsearch results file}

From the top hits in an hhsearch output file (hhr), you can  
\begin{itemize}
\item{generate a MSA (multiple sequence alignment) containing all representative 
template sequences from all selected alignments (options -fas, -a2m, -a3m, -pir)}
\item{generate several concatenated pairwise alignments in AL format (option -al)}
\item{generate several concatenated coarse 3D models in PDB format (option -ts)}
\end{itemize}
In PIR, PDB and AL format, the pdb files are required in order to read the pdb residue numbers 
and ATOM records. The PIR formatted file can be used directly as input to the MODELLER\cite{Sali:1993} 
homology modeling package.

\small 
\begin{verbatim}
 Usage: hhmakemodel.pl [-i] file.hhr [options]

 Options:
  -i   <file.hhr>        results file from hhsearch with hit list and alignments
  -fas <file.fas>        write a FASTA-formatted multiple alignment to file.fas
  -a2m <file.a2m>        write an A2M-formatted multiple alignment to file.a2m
  -a3m <file.a3m>        write an A3M-formatted multiple alignment to file.a3m
  -m   <int> [<int> ...] pick hits with specified indices  (default='-m 1')
  -p   <probability>     minimum probability threshold     
  -e   <E-value>         maximum E-value threshold      
  -q   <query_ali>       use the full-length query sequence in the alignment 
                         (not only the aligned part);
                         the query alignment file must be in HHM, FASTA, A2M,  
                         or A3M format.
  -N                     use query name from hhr filename (default: use same  
                         name as in hhr file)
  -first                 include only first Q or T sequence of each hit in MSA
  -v                     verbose mode

 Options when database matches in hhr file are PDB or SCOP sequences
  -pir <file.pir>        write a PIR-formatted multiple alignment to file.pir 
  -ts  <file.pdb>        write the PDB-formatted models based on *pairwise*  
                         alignments into file.pdb
  -al  <file.al>         write the AL-formatted *pairwise* alignments into file.al
  -d   <pdbdirs>         directories containing the pdb files (for PDB, SCOP, or DALI  
                         sequences)
  -s   <int>             shift the residue indices up/down by an integer           
  -CASP                  formatting for CASP (for -ts, -al options) 
                         (default: LIVEBENCH formatting)
\end{verbatim} 
\normalsize

Analogously, \verb`hhmakemodel.py` generates a MSA containing all representative template sequences from all selected alignments in PIR format. Note that in \verb`hhmakemodel.py` you have to specify the folder containing the *.cif files which make up the alignment. The script will modify the ATOM section of those cif files according to the residue numbering in the alignment. Use these renumbered cifs file together with the PIR alignment output in MODELLER\cite{Sali:1993}.

\small
\begin{verbatim}
usage: hhmakemodel.py [-h] [-v] [-m INT [INT ...]] [-e FLOAT] [-r FLOAT] [-c]
                      FILE DIR FILE DIR

Creates a MODELLER alignment (*.pir) from a HHSearch results file (*.hhr).

positional arguments:
  FILE              results file from HHsearch with hit list and alignment
  DIR               path to the folder containing cif files
  FILE              output file (PIR-formatted multiple alignment)
  DIR               path to the folder where modified cif files should be
                    written to

optional arguments:
  -h, --help        show this help message and exit
  -v, --verbose     verbose mode
  -m INT [INT ...]  pick hits with specified indices (e.g. -m 2 5)
  -e FLOAT          maximum E-Value threshold (e.g. -e 0.001)
  -r FLOAT          residue ratio (filter alignments that have contribute at
                    least residues according to the specified ratio).
  -c                convert non-canonical residues (default = True)
\end{verbatim} 
\normalsize

\subsection{{\tt hhsuitedb.pl} -- Build an HHsuite database }

Builds HHsuite database from a3m formatted MSAs and/or from HMMs (-o).
MSAs and HMMs can also be added (-a) to or removed (-r) from an existing database. 

\small 
\begin{verbatim}

Usage: hhsuitedb.py [options]

Options:
  -h, --help       show this help message and exit
  --ia3m=<GLOB>    Glob for a3m files
  --ics219=<GLOB>  Glob for cs219 files
  --ihhm=<GLOB>    Glob for hhm files
  -o FILE          Output hhsuite database basename
  --cpu=INT        Number of threads that shall used
  --force          Try to fix problems with the database

\end{verbatim} 
\normalsize

\subsection{{\tt multithread.pl} -- Run a command for many files in parallel using multiple threads}

\small 
\begin{verbatim}
 Usage: multithread.pl '<fileglob>' '<command>' [-cpu <int>] [-v {0,1,2}]

 <command> can include symbol 
    $file for the full filename, e.g. /tmp/hh/1c1g_A.a3m, 
    $name the filename without extension, e.g. /tmp/hh/1c1g_A, and 
    $base for the filename without extension and path, e.g. 1c1g_A.

  -cpu <int>  number of threads to launch (default = 8)
  -v {0,1,2}  verbose mode (default = 1)

 Example: multithread.pl '*.a3m' 'hhmake -i $file 1>$name.log 2>>error.log' -cpu 16
\end{verbatim} 
\normalsize

\subsection{{\tt splitfasta.pl} -- Split multi-sequence FASTA file into single-sequence files}
\small 
\begin{verbatim}
Write files into current directory and name each file by the first word after ">" in the name line. 

Usage: splitfasta.pl infile [option]
Option:
-fam       : use family-based name (for SCOP/ASTRAL sequences
-name      : use sequence name as file name (default)
-ext <ext> : extension for sequence files (default=seq)
\end{verbatim} 
\normalsize

\subsection{{\tt renumberpdb.pl} -- Renumber indices in PDB file to match input sequence indices}

\small 
\begin{verbatim}
 The program reads an input sequence in FASTA/A3M which must have a SCOP domain, PDB chain, or DALI 
 domain identifier (e.g. d1hz4a_, 1hz4_A, or 1hz4A_1). It reads the corresponding PDB file from the 
 given pdb directory and  generates a new PDB file by aligning the input sequence to the sequence 
 extracted from the ATOM records of the corresponding pdb chain and renumbering the residues in 
 columns 23-26 according to the position in the input sequence.
 (HETATM records for MSE (selenomethionine) will be interpreted as ATOM records for MET in the 
 alignment. MSE will be changed to MET in the output file.)
 
 Usage:   renumberpdb.pl [options] infile [outfile] 
 Example: renumberpdb.pl d1hz4a_.a3m /cluster/tmp/d1hz4a_.pdb 

 Options: 
   -o <file>    output file (default: <infile_wo_extension>.pdb
   -d <pdbdir>  give directory of pdb files (default=)
   -a [A|B]     filter alternative location identifier (e.g. A or B)
\end{verbatim} 
\normalsize

\subsection{{\tt cif2fasta.py} -- Create a fasta file from cif files}

An example of the usage of \verb`cif2fasta.py` is provided in Section \ref{building_dbs}. 

\small 
\begin{verbatim}
Usage: cif2fasta.py -i cif_folder -o *.fasta -c num_cores -v

cif2fasta.py takes a folder that contains cif files as input and outputs their
sequences into fasta file.

Options:
  -h, --help  show this help message and exit
  -i DIR      input cif folder.
  -o FILE     output fasta file.
  -p FILE     output PDB filter file (optional).
  -s FILE     SCOP annotation.
  -c INT      number of cores (default = 1).
  -l INT      Remove chains with a length < X (default = 30).
  -v INT      Verbose Mode (quiet = 0, full verbosity = 2).
\end{verbatim} 
\normalsize

\subsection{{\tt pdbfilter.py} -- Filter sequences from the PDB (requires MMSeqs)}

An example of the usage of \verb`pdbfilter.py` is provided in Section \ref{building_dbs}. Note that the annotations file which is required to select the proper PDB is created by \verb`cif2fasta.py` using the -p flag.

\small 
\begin{verbatim}
usage: pdbfilter.py [-h] [-i FILE] [-r FILE] [-v] FILE FILE FILE FILE

pdbfilter.py selects from sequence clusters (determined by MMSeqs) the
sequences which have the best resolution, R-free factor and/or completness and
writes them to a fasta file.

positional arguments:
  FILE                  input fasta file (created by cif2fasta.py)
  FILE                  sequence clusters (MMseqs)
  FILE                  annotations file (created by cif2fasta using the -p
                        flag, contains information about resolution, R-free
                        and completness of sequences).
  FILE                  output fasta file

optional arguments:
  -h, --help            show this help message and exit
  -i FILE, --include FILE
                        include PDB chains
  -r FILE, --remove FILE
                        exclude PDB chains
  -v, --verbose         verbose mode
\end{verbatim} 
\normalsize

\section{Selected changes from previous versions}


\subsection{Up to 3.0.0  (March 2015)}

\begin{itemize}
\item Changed building pipeline from gnu make to cmake.
\item SSE-parallelized Viterbi
\item Maximum Accuracy alignment speed-ups
\item Removed prefilter/viterbi block-shading
\item Removed support for old HHM format (HHM format version 1.2 with 1.1)
\item Added support for multiple databases in hhblits
\item Database format changed to \verb`<basename>\_{a3m,hmm,cs219}.ff{data,index}`
\item HHsearch requires now complete HHsuite databases (same as HHblits)
\item Replaced pthreads with openMP
\item openMP parallelized prefilter
\item Added openMP parallelized hhblits\_omp over several queries
\item Added support for compressed databases
\item Adjusted the meaning of -Oa3m/-oa3m in hhblits, hhsearch and hhalign
\item -qhhm no longer supported in hhblits
\item -tc no longer supported in hhalign
\item -index no longer supported in hhalign
\item hhalign is no longer able to print pngs of the alignments
\end{itemize}


\subsection{Up to 2.1.0  (March 2013)}

\begin{itemize}
\item Introduced a new format for hhblits databases: the prefilter flat files 
  *.cs219 containing the column-state sequences for prefiltering are replaced
  by a pair of ffindex files. This leads to a speed up for reading the prefilter 
  db by ~4 seconds and improved scaling of hhblits with the number of cores. 
  For compatibility with older versions of HHsuite, the *.cs219 files will 
  be still be provided with new db versions.

\item Improved sensitivity and alignment accuracy through introduction of a new,
  discriminative method to calculate pseudocounts, described in Angermuller C, 
  Biegert A, and Soding J, Bioinformatics 28: 3240-3247 (2012).

\item Increased speed of hhsearch about 8-fold and of HHblits about 1.5-fold by 
  implementing the HMM-HMM Viterbi alingment in SSE2 SIMD (single input 
  multiple data) instructions 

\item Added option -macins to hhblits, hhsearch and hhalign that controls the
  costs of internal gap positions in the Maximum Accuracy Algorithm. 
  0: gap cost is half the mact value. 1: no gap cost, leads to very gappy 
  alignments.

\item Expanded hhsuite user guide (how to runn hhblits efficiently, how to 
  modify or extend existing databases, how to visually check multiple sequence 
  alignments, troubleshooting)

\item Added options to hhsuitedb.pl for adding (-a) and removing (-r) files 
  from HHsuite database.

\end{itemize}

\subsection{Up to 2.0.16 (January 2013)}

\begin{itemize}
\item Renamed option -nodiff to -all. The name -nodiff is very misleading as
      the diff option is not actually switched off on the query and db alignments.

\item All error messages are now written to stderr (not stdout) and are of the form
  \verb`Error in <program_name>: <error description>`.

\item Added script splitfasta.pl to split multiple-sequence FASTA file in multiple
  single-sequence FASTA files. These are needed to run hhblits in parallel
  using multithread.pl.

\item Implemented -maxmem option specifying maximum memory in GB. This changes the 
  maximum length, up to which database sequences will be realigned using the
  memory-hungry maximum accuracy algorithm. This change will lead to more accurate
  results for longer HMMs (>5000 match states).
\end{itemize}


\subsection{Up to 2.0.2 (January 2012)}
\begin{itemize}

\item The iterative HMM-HMM search method HHblits has been added and the entire package is now called HHsuite. HHblits brings the power of HMM-HMM comparison to mainstream, general-purpose sequence searching and sequence analysis. 

\item Context-specific pseudocounts have been implemented HHsearch, improving its sensitivity and quality of alignments.

\item The speed of HHsearch was further increased through the use of SSE3 instructions.

\item An option \verb`-atab` for writing alignment as a table (with posteriors) to file was introduced

\item HHsearch is now able to read HMMER3 profiles (but should not be used due to a loss of sensitivity).

\item An optional local amino acid compositional bias correction was introduced(options \verb`-sc 5` and \verb`-scwin <window_halfsize>`). No improvements are seen on a standard SCOP single domain benchmark. However, their probably will be an improvement for more realistic sequences containing multiple domains, repeats, and regions of strong compositional bias. 

\item The score shift parameter and mact parameter have been optimized together on the optimization set of HHblits \cite{Remmert:2011}, which resulted in slight improvements of sensitivity and alignment quality. New default values are  \verb`shift -0.03` bits and \verb`-mact 0.30`. 

\item Use of \verb`.hhdefaults` still works but is deprecated. Use an alias in 
your \verb`.bashrc` or equivalent file instead (See Installation).

\item We removed a bug in \verb`-neff <target_diversity>` option that lead to the input MSA being reduced to only the master sequence when the target diversity was higher than the diversity of the input MSA.

\item We removed a bug in addss.pl that could lead to errors for FASTA-formatted MSAs.

\end{itemize}


\subsection{1.6.0 (November 2010)}

\begin{itemize}
 
\item{A new procedure for estimation of P- and E-values has been implemented that
  circumvents the need to calibrate HMMs. Calibration can still be done if 
  desired. By default, however, HHsearch now estimates the lambda and mu 
  parameters of the extreme value distribution (EVD) for each pair of query 
  and database HMMs from the lengths of both HMMs and the diversities of their 
  underlying alignments. Apart from saving the time for calibration, this 
  procedure is more reliable and noise-resistant. This change only applies to 
  the default local search mode. For global searches, nothing has changed. Note
  that E-values in global search mode are unreliable and that sensitivity is 
  reduced. \\
  Old calibrations can still be used:\\
   -calm 0 : use empirical query HMM calibration (old default)\\
   -calm 1 : use empirical db HMM calibration\\
   -calm 2 : use both query and db HMM calibration\\
   -calm 3 : use neural network calibration (new default)
}

\item{Previous versions of HHsearch sometimes showed non-homologous hits with high 
  probabilities by matching long stretches of secondary structure states, 
  in particular long helices, in the absence of any similarity in the amino 
  acid profiles. Capping the SS score by a linear function of the profile score
  now effectively suppresses these spurious high-scoring false positives. 
}

\item{The output format for the query-template alignments has slightly changed.
  A 'Confidence' line at the bottom of each alignment block now reports the 
  posterior probabilities for each alignment column when the -realign option
  is active (which it is by default). These probabilities are calculated in the 
  Forward-Backward algorithm that is needed as input for the Maximum ACcuracy 
  alignment algorithm. Also, the lines 'ss\_conf' with the confidence values 
  for the secondary structure prediction are omitted by default. (They can  
  be displayed with option '-showssconf'). To compensate, secondary structure 
  predictions with confidence values between 7 and 9 are given in capital 
  letters, while for the predictions with values between 0 and 6 lower-case 
  letters are used. 
}

\item{In the hhsearch output file in the header lines before each query-database 
  alignment, the substitution matrix score (without gap penalties) of the query 
  with the database sequence is now reported in bits per column. Also, the sum
  of probabilities for each pair of aligned residues from the MAC algorithm 
  is reported here (0 if no MAC alignment is performed).
}

\item{HHsearch now performs realign with MAC-alignment only around Viterbi-hit.
}

\end{itemize}

\subsection{1.5.0 (August 2007)}

\begin{itemize}

\item{By default, HHsearch realigns all displayed alignments in a second stage 
  using the more accurate Maximum Accuracy (MAC) alignment algorithm 
  (Durbin, Eddy, Krough, Mitchison: Biological sequence analysis, page 95;
  HMM-HMM version: J. S\"oding, unpublished). As before, the Viterbi algorithm 
  is employed for searching and ranking the matches. The realignment step is 
  parallelized (\verb`-cpu <int>`) and typically takes a few seconds only. You can 
  switch off the MAC realignment with the -norealign option. 
     The posterior probability threshold is controlled with the \verb`-mact` 
  option. This parameter controls the alignment algorithm's greediness. More 
  precisely, the MAC algorithm finds the alignment that maximizes the sum of 
  posterior probabilities minus mact for each aligned pair. Global alignments 
  are generated with -mact 0, whereas -mact 0.5 will produce quite conservative
  local alignments. Default value is -mact 0.35, which produces alignments of
  roughly the same length as the Viterbi algorithm. 
     The -global and -local (default) option now refer to both the Viterbi search 
  stage as well as the MAC realignment stage. With -global (-local), the 
  posterior probability matrix will be calculated for global (local) 
  alignment. Note that '-local -mact 0' will produce global alignments from
  a local posterior probability matrix (which is not at all unreasonable).
}
\item{An amino acid compositional bias correction is now performed by default.
  This increases the sensitivity by 25\% at 0.01 errors per query and by 5\% at 
  0.1 errors per query. By recalibrating the Probabilities, the increased 
  selectivity of this new version allows to give higher probabilities for the 
  same P-values. Also, the score offset could be increased from -0.1 bits to 0 
  as a consequence. 
}

\item{The algorithm that filters the set of the most diverse sequences (option 
  -diff) has been improved. Before, it determined the set of the N most 
  diverse sequences. In the case of multi-domain alignments, this could lead 
  to severely underrepresented regions. E.g.\ when the first domain is only 
  covered by a few fairly similar sequences and the second by hundreds of very 
  diverse ones, most or all of the similar ones were removed. The '-diff N' 
  option now filters the most diverse set of sequences, keeping at least N 
  sequences in each block of >50 columns. This generally leads to a total 
  number of sequences that is larger than N. Speed is similar. The default
  is '-diff 100' for hhmake and hhsearch. Speed is similar. Use -diff 0 to 
  switch this filter off.
}
\item{The sensitivity for the -global alignment option has been significantly 
  increased by a more robust statistical treatment. The sensitivity in -global
  mode is now only 0-10\% lower than for the default -local option on a SCOP
  benchmark, i.e., when the query or the templates represent single structural 
  domains. The E-values are now more realistic, although still not as 
  reliable as for -local. The Probabilities were recalibrated.
}
\item{A new binary hhalign has been added. It is similar to hhsearch, but performs
  only pairwise comparisons. It can produce tables of aligned 
  residues, and it can sample alternative alignments stochastically. It uses 
  the MAC algorithm by default. 
}
\item{HHsearch and hhalign can generate query-template multiple alignments in 
  FASTA, A2M, or A3M format with the -ofas, -oa2m, -oa3m options 
}
\item{Returned error values were changed to comply with convention that 0 means no errors:
\begin{enumerate}
   \item{Finished successfully}
   \item{Format error in input files}
   \item{File access error}
   \item{Out of memory }
   \item{Syntax error on command line}
   \item{Internal logic error (please report)}
   \item{Internal numeric error (please report)}
   \item{Other}
\end{enumerate}
}	

\end{itemize}

Is anyone still interested in Mac OSX/PPC or SunOS support? 


\section{License}

The HHsuite software package is distributed under Gnu Public License, Version 3. 

This means that the HHsuite is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program in the file \verb`LICENSE`.  
If not, see \url{http://www.gnu.org/licenses/}.

The hhsuite contains in file \verb`hhprefilter.C` code adapted from Michael 
Farrar (\url{http://sites.google.com/site/farrarmichael/smith-waterman}
and \cite{Farrar:2007}) His code is marked in the file hhprefilter.C. 
The copy right of his code is shown below:

\begin{verbatim}
Copyright 2006, by Michael Farrar.  All rights reserved. The SWSSE2
program and documentation may not be sold or incorporated into a
commercial product, in whole or in part, without written consent of
Michael Farrar.

For further information regarding permission for use or reproduction, 
please contact Michael Farrar at:

    farrar.michael@gmail.com

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY 
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, 
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
\end{verbatim}

\section{Contributors and Contact}

\emph{Johannes S\"oding}: main code base incl.\ perl scripts up to version 2.0, debugging, documentation\\ [1mm]
\emph{Markus Meier}: uniprot-1it database and associated hhblits code, code refactoring, tests and benchmarks\\[1mm]
\emph{Martin Steinegger}: Fast SIMD HMM-HMM Viterbi code and various speed-ups\\[1mm]
\emph{Michael Remmert}: early HHblits version\\[1mm]
\emph{Andy Hauser}: ffindex, speed/memory optimization\\[1mm]
\emph{Christof Angerm\"uller}: context-specific pseudocounts, speed optimizations\\[1mm]
\emph{Armin Meier}: HHblits column state alphabets, tests and benchmarks\\[1mm]
\emph{Andreas Biegert}: HHblits column state alphabet cs219, context-specific pseudocounts\\[1mm]

Contact:\\[2mm]
Johannes S\"oding\\
Max-Planck-Institut für biophysikalische Chemie\\
Computational Biology\\
Am Fassberg 11\\
37077 G\"ottingen\\
http://www.mpibpc.mpg.de/de/soeding\\
soeding@mpibpc.mpg.de
\section{Acknowlegements}

We thank the many users of HHsuite who sent bug reports or gave critical feedback that led to improvements in the software. 
We are grateful to Lazlo Kajan, Stefan M\"oller and the Debian team for making the HHsuite software available as a Debian package. J.S. thanks Andrei Lupas for his invaluable support, encouragement, and feedback during the first five years of the work on HHsuite in his department.

%\bibliographystyle{model3-num-names}
\addcontentsline{toc}{section}{References}
\bibliography{hhsuite-userguide.bib}

 


\vspace{20mm}
\begin{center}
Good luck with your work!

\end{center}


\end{document}