\documentstyle[palatino]{./article}
%UEA \documentstyle[11pt,twoside,implreport]{./article}
%UEA \implreportno{13}
%UEA \impltitle{Pseudoknot: a Float-Intensive Benchmark}
\newlength{\zerowidth}
\settowidth{\zerowidth}{0}
\newcommand{\z}[0]{\hspace*{\zerowidth}}
\newlength{\dotzerowidth}
\settowidth{\dotzerowidth}{.0}
\newcommand{\dz}[0]{\hspace*{\dotzerowidth}}
\newlength{\dotzerozerowidth}
\settowidth{\dotzerozerowidth}{.00}
\newcommand{\dzz}[0]{\hspace*{\dotzerozerowidth}}
\newcommand{\mm}[1]{\multicolumn{2}{c|}{#1}}
\newcommand{\mmm}[1]{\multicolumn{3}{c|}{#1}}
\newcommand{\sysbigloo}[0]{1}
\newcommand{\syscaml}[0]{2}
\newcommand{\syschalmers}[0]{3}
\newcommand{\sysclean}[0]{4}
\newcommand{\syscmucl}[0]{5}
\newcommand{\syserlang}[0]{6}
\newcommand{\sysfacile}[0]{20}
\newcommand{\sysfast}[0]{7}
\newcommand{\sysgambit}[0]{8}
\newcommand{\sysglasgow}[0]{9}
\newcommand{\sysid}[0]{10}
\newcommand{\sysmlworks}[0]{11}
\newcommand{\sysopal}[0]{12}
\newcommand{\syspolyml}[0]{13}
\newcommand{\syssisals}[0]{14}
\newcommand{\syssisalc}[0]{15}
\newcommand{\syssml}[0]{16}
\newcommand{\sysstoffel}[0]{17}
\newcommand{\systrafola}[0]{18}
\newcommand{\sysfloat}[0]{19}
\textheight 240mm
\addtolength{\topmargin}{-25mm}
\addtolength{\oddsidemargin}{-18mm}
\addtolength{\evensidemargin}{-18mm}
\addtolength{\footheight}{10mm}
\textwidth 160mm
\begin{document}
\thispagestyle{empty}
\title{Pseudoknot: a Float-Intensive Benchmark \\ for Functional Compilers (DRAFT)}
\author{
Pieter H. Hartel \thanks{Dept.\ of Computer Systems, Univ.\ of
Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands, e-mail:
pieter@fwi.uva.nl} \and
Marc Feeley \thanks{D\'epart.\ d'informatique et r.o., univ.\ de
Montr\'eal, succursale centre-ville, Montr\'eal H3C 3J7, Canada,
e-mail: feeley@iro.umontreal.ca} \and
Martin Alt \thanks{Informatik, Universit\"at des Saarlandes,
66041 Saarbr\"ucken 11, Germany, e-mail: alt@cs.uni-sb.de} \and
Lennart Augustsson \thanks{Dept.\ of Computer Systems, Chalmers Univ.\ of
Technology, 412 96 G\"oteborg, Sweden, e-mail: augustss@cs.chalmers.se} \and
Peter Baumann \thanks{Dept.\ of Computer Science, Univ.\ of Zurich,
Winterthurerstr.\ 190, 8057 Zurich, Switzerland, e-mail:
baumann@ifi.unizh.ch} \and
Marcel Beemster \thanks{Dept.\ of Computer Systems, Univ.\ of
Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands, e-mail:
beemster@fwi.uva.nl} \and
Emmanuel Chailloux \thanks{LIENS, URA 1327 du CNRS, \'Ecole Normale
Sup\'erieure, 45 rue d'Ulm, 75230 PARIS C\'edex 05, France, e-mail:
Emmanuel.Chailloux@ens.fr} \and
Christine H. Flood \thanks{Laboratory for Computer Science, MIT, 545
Technology Square, Cambridge Massachusetts 02139, USA, e-mail:
chf@lcs.mit.edu} \and
Wolfgang Grieskamp \thanks{Berlin University of Technology,
Franklinstr. 28-29, 10587 Berlin, Germany, e-mail: wg@cs.tu-berlin.de} \and
John H. G.\ van Groningen \thanks{Faculty of Mathematics and Computer
Science, Univ.\ of Nijmegen, Toernooiveld 1, 6525 ED Nijmegen, The
Netherlands, e-mail: johnvg@cs.kun.nl} \and
Kevin Hammond \thanks{Dept.\ of Computing Science, Glasgow University,
17 Lilybank Gardens, Glasgow, G12 8QQ, UK, e-mail: kh@dcs.glasgow.ac.uk} \and
Bogumi\l~Hausman \thanks{Computer Science Lab, Ellemtel Telecom Systems
Labs, Box 1505, S-125 25 \"Alvsj\"o, Sweden, e-mail:
bogdan@erix.ericsson.se} \and
Melody Y. Ivory \thanks{Computer Research Group, Institute for
Scientific Computer Research, Lawrence Livermore National Laboratory,
P. O. Box 808 L-419, Livermore, CA 94550, e-mail: ivory1@llnl.gov} \and
Peter Lee\thanks{Dept.\ of Computer Science, Carnegie Mellon University,
5000 Forbes Avenue Pittsburgh, Pennsylvania 15213, USA, e-mail:
petel@cs.cmu.edu} \and
Xavier Leroy \thanks{INRIA Rocquencourt, projet Cristal,
B.P. 105, 78153 Le Chesnay, France. e-mail: Xavier.Leroy@inria.fr} \and
Sandra Loosemore \thanks{Dept.\ of Computer Science, Yale Univ., New Haven,
Connecticut, e-mail: loosemore-sandra@cs.yale.edu} \and
Niklas R\"ojemo \thanks{Dept.\ of Computer Systems, Chalmers Univ.\ of
Technology, 412 96 G\"oteborg, Sweden, e-mail: rojemo@cs.chalmers.se} \and
Manuel Serrano \thanks{INRIA Rocquencourt, projet Icsla,
B.P. 105, 78153 Le Chesnay, France. e-mail: Manuel.Serrano@inria.fr} \and
Jean-Pierre Talpin \thanks{European Computer-Industry Research Centre,
Arabella Strasse 17, D-81925 Munich, Germany. e-mail: jp@ecrc.de} \and
Jon Thackray \thanks{Harlequin Ltd, Barrington Hall, Barrington,
Cambridge CB2 5RG, England, e-mail: jont@harlequin.co.uk} \and
Pierre Weis \thanks{INRIA Rocquencourt, projet Cristal,
B.P. 105, 78153 Le Chesnay, France. e-mail: Pierre.Weis@inria.fr} \and
Peter Wentworth \thanks{Dept.\ of Computer Science, Rhodes Univ.,
Grahamstown 6140, South Africa, e-mail: cspw@cs.ru.ac.za}
}
\maketitle
\sloppy
\begin{abstract}
Over 20 implementations of different functional languages are
benchmarked using the same program, a floating-point intensive
application taken from molecular biology. The principal aspects
studied are compile time and execution time for the various
implementations. An important consideration is how the program can be
modified and tuned to obtain maximal performance on each language
implementation.
With few exceptions, the compilers take a significant amount of time
to compile this program, though almost all compilers were faster than
the current GNU C compiler. Compilers that generate C or Lisp are
often slower than those that generate native code directly; the cost
of compiling the intermediate form is normally a large fraction of the
total compilation time.
There is no clear distinction between the runtime performance of eager
and lazy implementations when appropriate annotations are used: lazy
implementations have clearly come of age when it comes to implementing
largely strict applications, such as the pseudoknot program. The speed
of C can be approached by some implementations, and is even exceeded
by Sisal on the Cray, but in order to achieve this performance,
special measures such as strictness annotations are required by
non-strict implementations.
\end{abstract}
\section{Introduction}
At the Dagstuhl Workshop on Applications of Functional Programming in
the Real World in May 1994~\cite{Gie94}, several interesting applications of
functional languages were presented. One of these applications, the
pseudoknot problem~\cite{Fee94} had been written in several languages,
including C, Scheme~\cite{Ree91}, Multilisp~\cite{Hal85} and
Miranda\footnote{Miranda is a trademark of Research Software Ltd.}~\cite{Tur85}.
A number
of workshop participants decided to test their compiler technology
using this particular program. One point of comparison is the speed of
compilation and the speed of the compiled program. Another important
point is how the program can be modified and tuned to obtain maximal
performance on each language implementation available. It is also
interesting to compare the performance of typed and untyped languages.
Finally, an interesting question is whether laziness is or is not
beneficial for this application.
The initial benchmarking efforts revealed important differences between
the various compilers. The first impression was that compilation speed
should generally be improved. After the workshop we continued to
work on improving both the compilation and execution speed of the
pseudoknot program. Some researchers not present at Dagstuhl joined the
team, and we present the results as a record of a small-scale but
exciting collaboration with contributions from many parts of the world.
As is the case with any benchmarking work, our results should be taken
with a grain of salt. We are using a realistic program that performs a
useful computation; however, it stresses particular language features
that are probably not representative of the applications for which the
language implementations were intended. Implementations invariably
trade off the performance of some programming features for others in the
quest for the right blend of usability, flexibility, and performance on
``typical'' applications. It is clear that a single benchmark is not a
good way to measure the overall quality of an implementation.
Moreover, the performance of an implementation usually (but not always)
improves with new releases as the implementors repair bugs, add new
features, and modify the compiler. We feel that our choice of
benchmark can be justified by the fact that it is a real world
application, that it had already been translated into C and several
functional languages, and that we wanted to compare a wide range of
languages and implementations. The main results agree with those found
in earlier studies~\cite{Can92,Har92b}.
Section~\ref{sec:languages} briefly characterises the functional
languages that have been used. The pseudoknot application is introduced
in Section~\ref{sec:application}. The compilers and interpreters for
the functional languages are presented in Section~\ref{sec:compilers}.
Section~\ref{sec:translation} describes the translations of the program
from one language into another. The results and conclusions are given
in the last two sections.
\begin{table}
\begin{minipage}{\hsize}
\begin{center}
\begin{tabular}{|l|l|l|l|l|l|l|}
\hline
language & source & ref. & typing & evaluation & order & match \\
\hline
Caml & INRIA &\cite{Wei93} & strong, poly & eager &higher&pattern \\
Clean & Nijmegen &\cite{Pla94} & strong, poly & lazy &higher&pattern \\
Common Lisp & Committee &\cite{Ste90} & dynamic & eager &higher& access \\
Erlang & Ericsson &\cite{Arm93} & dynamic & eager & first&pattern \\
Facile & ECRC &\cite{Tho93a}& strong, poly & eager &higher&pattern \\
Gofer & Yale &\cite{Jon94c}& strong, poly & lazy &higher&pattern \\
Haskell & Committee &\cite{Hud92a}& strong, poly & lazy &higher&pattern \\
ID & MIT &\cite{Nik90a}& strong, poly & eager \footnote{eager, non-strict}
&higher&pattern \\
Miranda & Kent &\cite{Tur85} & strong, poly & lazy &higher&pattern \\
Opal & TU Berlin &\cite{Did94} & strong, param & eager &higher&pattern \\
RUFL & Rhodes &\cite{Wen92} & strong, poly & lazy &higher&pattern \\
Scheme & Committee &\cite{Ree91} & dynamic & eager &higher& access \\
Sisal & LLNL &\cite{McG85} & strong, mono & eager &first & none \\
Standard ML & Committee &\cite{Mil90} & strong, poly & eager &higher&pattern \\
Stoffel & Amsterdam &\cite{Bee92a}& strong, poly & lazy &higher& case \\
Trafola & Saarbr\"ucken &\cite{Alt93} & strong, poly & eager &higher&pattern \\
\hline
ANSI C & Committee &\cite{Ker88} & weak & eager & first& none \\
\hline
\end{tabular}
\end{center}
\end{minipage}
\caption{Language characteristics. For each language its source is given, as
well as a key reference to the language definition. The rest of the
table presents important language characteristics.}
\label{tbl:language}
\end{table}
\section{Languages}
\label{sec:languages}
Details of the languages which were benchmarked may be found in
Table~\ref{tbl:language}. C has been used as a reference language in
order to allow comparison with imperative languages. The first column
of the table gives the name of the language. The second column gives
the source (i.e.\ a University or a Company) if a language has been
developed in one particular place. Some languages were designed by a
committee, which is also shown. The third column of the table gives a
key reference to the language definition.
The last four columns describe some important properties of the
languages. The typing discipline may be {\it strong} (and static),
{\it dynamic}, or {\it weak}; a strong typing discipline may be
monomorphic ({\it mono}), polymorphic ({\it poly}), or parametrically
polymorphic ({\it param}). The evaluation strategy may be {\it
eager}, {\it lazy} or {\it eager} with {\it non-strict}
evaluation~\cite{Nik90a}. The language may be {\it first order} or
{\it higher order}. Accessing components of data structures may be
supported by either pattern-matching on function arguments, local
definitions and/or as part of case expressions ({\em pattern},
{\em case}), by compiler generated access functions to destructure data
({\em access}), or not at all ({\em none}). The reader should consult
the references provided for further details of individual languages.
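To illustrate the difference between the last two access disciplines,
the following Haskell fragment (ours, not part of the benchmark)
defines a small point type and destructures it first by
pattern-matching and then via access functions:
\begin{verbatim}
> data Pt = Pt Float Float Float
>
> -- Pattern-matching style: the components are bound by the pattern.
> pt_add_pat :: Pt -> Pt -> Pt
> pt_add_pat (Pt x1 y1 z1) (Pt x2 y2 z2) = Pt (x1+x2) (y1+y2) (z1+z2)
>
> -- Access-function style: components are extracted by selector
> -- functions, much as a Lisp or Scheme program would do.
> pt_x, pt_y, pt_z :: Pt -> Float
> pt_x (Pt x _ _) = x
> pt_y (Pt _ y _) = y
> pt_z (Pt _ _ z) = z
>
> pt_add_acc :: Pt -> Pt -> Pt
> pt_add_acc a b = Pt (pt_x a + pt_x b) (pt_y a + pt_y b) (pt_z a + pt_z b)
\end{verbatim}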
\section{Application}
\label{sec:application}
The pseudoknot program is derived from a ``real-world'' molecular
biology application that computes the three-dimensional structure of
part of a nucleic acid molecule. The program exhaustively searches a
discrete space of shapes and returns the set of shapes that respect
some structural constraint.
\begin{table}
\begin{center}
\begin{tabular}{|l|r|l|r|}
\hline
\multicolumn{2}{|c}{floating point operations} &
\multicolumn{2}{|c|}{square root and trigonometric functions} \\
\hline
$\times$ & 3,567,672 & & \\
$+$ & 2,798,571 & $\sqrt{\;\;}$ & 69,600 \\
$>$, $\leq$, $<$ & 129,656 & $\arctan$ & 40,184 \\
$-$ & 330,058 & $\cos$ & 40,184 \\
$/$ & 40,184 & $\sin$ & 40,184 \\
\hline
total & 6,866,141 & total & 190,152 \\
\hline
\end{tabular}
\end{center}
\caption{Breakdown of the ``real'' work involved in the pseudoknot
problem as counted by the FAST system. The floating point operations
occurring in the trigonometric functions and the square root are {\em
not} counted separately.}
\label{tbl:float}
\end{table}
The program is heavily floating point oriented. For example, the C
version of the program spends 25\% of its time in the C library
trigonometric and square root routines.
Statistics from the lazy FAST~\cite{Har91} compiler show that with lazy
evaluation the fastest version of the program does about 7~M floating
point operations, which excludes those performed by the 190~K calls to
trigonometric and square root functions. A detailed breakdown of these
statistics is shown in Table~\ref{tbl:float}. The program makes about
1.5~M function calls and allocates about 15~Mbytes of space (the maximum
heap size required is about 30~Kbytes).
Statistics from the eager MLWorks system show that with eager
evaluation the program spends about 50\% of its time performing
floating point operations.
All this suggests that the application can be characterised as
somewhere in between ``numeric'' and ``symbolic''.
The pseudoknot program does not explicitly depend on laziness for its
correct behaviour. The program is thus ideal for comparing lazy and
non-lazy implementations of functional languages. It is relatively easy
to translate the program from a lazy to a strict functional
language. Inserting strictness annotations in a lazy version of the
program is also relatively easy.
The program uses mostly algebraic data types as data structures. Some
of these are rather large. Having to use pattern-matching style
destructuring on data structures with as many as 12 or even
30 components leads to distinctly unreadable code.
With the obvious exception of the C version, all other versions of the
program are purely functional.
The original program is described in detail by Feeley et al.~\cite{Fee94}. The
program used in the present benchmarking effort differs in two ways
from the original.
Firstly, the original program only computed the number of solutions
found during the search. However, it is the location of each atom in
the solutions that is of interest to a biologist, because these
solutions typically need to be screened manually by visualizing them
one after another. The program was thus modified to compute the
location of each atom in the structures that are found. In order to
minimize I/O overhead, a single value is printed: the distance from
the origin to the farthest atom in any solution (this requires that the
position of each atom be computed).
Secondly, the original program did not attempt to exploit laziness in
any way. However, there is an opportunity for laziness in the
computation of the absolute position of atoms. To compute the absolute
position of an atom, it is necessary to transform a 3D vector from one
coordinate system to another by multiplying a 3D vector by a $3 \times
4$ transformation matrix. The reason for this is that the position of
an atom is expressed relative to the nucleotide it is in. Thus the
position of an atom is specified by two values: the 3D position of the
atom in the nucleotide, and the absolute position and orientation of
the nucleotide in space. However, the positions of atoms in structures
that are pruned by the search process are not needed (unless they are
needed to test a constraint). There are thus two formulations of the
pseudoknot program which differ in the way the position computation of
atoms is expressed:
\begin{description}
\item[Late formulation]
The absolute position of an atom is computed each time it is needed;
either for testing a constraint or for computing the distance to the
origin if the atom is in a solution. This formulation may duplicate
work. Because the placement of a nucleotide is shared by all the
structures generated in the subtree that stems from that point in the
search, there can be as many recomputations of a position as there are
nodes in the search tree (which can number in the thousands to
millions).
\item[Early formulation]
The absolute position of an atom is computed as soon as the nucleotide
it is in is placed. If this computation is done lazily, this
formulation avoids the duplication of work because the coordinate
transformation for an atom is done at most once.
\end{description}
The original program used the late formulation. To explore the
benefits of laziness, the program was modified so that it is easy to
obtain the alternative early formulation (only a few lines need to be
commented and uncommented to switch from one formulation to the
other).
The C, Scheme, and Miranda versions of the program support the early
formulation. The Miranda version implicitly uses lazy evaluation. The
Scheme version uses the explicitly lazy {\em delay} and {\em force}
constructs. The C version requires lazy computation to be explicitly
programmed.
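The essential difference between the two formulations can be sketched
in Haskell as follows (a deliberate simplification of ours: the names
are hypothetical, and the real program transforms points with a $3
\times 4$ matrix rather than a bare function):
\begin{verbatim}
> type Pt = (Float, Float, Float)
>
> -- Late formulation: a placed nucleotide keeps its placement and the
> -- relative atom positions; the transformation is re-applied at every
> -- inspection, possibly many times per atom.
> data NucLate = NucLate (Pt -> Pt) [Pt]
>
> atom_positions_late :: NucLate -> [Pt]
> atom_positions_late (NucLate transform atoms) = map transform atoms
>
> -- Early formulation: the absolute positions are computed when the
> -- nucleotide is placed. Under lazy evaluation each element of the
> -- list is transformed at most once, and not at all if it is never
> -- demanded by a constraint test or by a solution.
> data NucEarly = NucEarly [Pt]
>
> place_early :: (Pt -> Pt) -> [Pt] -> NucEarly
> place_early transform atoms = NucEarly (map transform atoms)
\end{verbatim}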
\begin{table}
\begin{center}
\begin{tabular}{|l|l|l|l|l|}
\hline
language &version&source & ref. & FTP / Email \\
\hline
Bigloo & 1.7 &INRIA Rocquencourt &\cite{Ser94a}& Ftp: {\small ftp.inria.fr:} \\
& & & & {\small /INRIA/Projects/icsla/Implementations/} \\
Caml Light & 0.61 &INRIA Rocquencourt &\cite{Ler93} & Ftp: {\small ftp.inria.fr:} \\
& & & & {\small /lang/caml-light/} \\
Caml Gallium & &INRIA Rocquencourt &\cite{Ler92} & Email: {\small Xavier.Leroy@inria.fr} \\
Camloo & 0.2 &INRIA Rocquencourt &\cite{Ser94b}& Ftp: {\small ftp.inria.fr:} \\
& & & & {\small /lang/caml-light/} \\
CeML & 0.22 & LIENS &\cite{Cha92a}& Email: {\small Emmanuel.Chailloux@ens.fr} \\
Clean & 1.0b &Nijmegen &\cite{Sme91} & Ftp: {\small ftp.cs.kun.nl:} \\
& & & & {\small /pub/Clean/} \\
CMU CL & 17e &Carnegie Mellon &\cite{Mac92} & Ftp: {\small lisp-sun1.slisp.cs.cmu.edu:} \\
& & & & {\small /pub/} \\
Erlang & 6.0.4 &Ellemtel AB &\cite{Hau94} & commercial \\
& & & & Email: {\small erlang@erix.ericsson.se} \\
Facile & Antigua & ECRC &\cite{Tho93a}& Email: {\small facile@ecrc.de} \\
FAST & 33 &Southampton/ &\cite{Har91} & Email: {\small pieter@fwi.uva.nl} \\
& &Amsterdam & & \\
Gofer & 2.30a &Yale &\cite{Jon94c}& Ftp: {\small nebula.cs.yale.edu:} \\
& & & & {\small /pub/haskell/gofer/} \\
Haskell & 999.6 &Chalmers &\cite{Aug93c}& Ftp: {\small ftp.cs.chalmers.se:} \\
& & & & {\small /pub/haskell/chalmers/} \\
Haskell & 0.22 &Glasgow &\cite{Pey93c}& Ftp: {\small ftp.dcs.glasgow.ac.uk:} \\
& & & & {\small /pub/haskell/glasgow/} \\
Haskell & 2.1 &Yale &\cite{Yal94} & Ftp: {\small nebula.cs.yale.edu:} \\
& & & & {\small /pub/haskell/yale/} \\
ID &TL0 2.1&MIT/Berkeley &\cite{Nik90a}& Email: {\small chf@lcs.mit.edu} \\
MLWorks & n.a. &Harlequin Ltd. &\cite{Har94x}& commercial \\
Miranda & 2.018 &Research Software Ltd.&\cite{Tur90a}& commercial \\
& & & & Email: {\small mira-request@ukc.ac.uk} \\
Nearly Haskell & Pre rel. &Chalmers &\cite{Roj94} & Ftp: {\small ftp.cs.chalmers.se:} \\
& & & & {\small /pub/haskell/nhc/} \\
Opal & 2.1c &Berlin &\cite{Sch91} & Ftp: {\small ftp.cs.tu-berlin.de:} \\
& & & & {\small /pub/local/uebb/ocs} \\
RUFL & 1.8.4 &Rhodes &\cite{Wen91} & Ftp: {\small cs.ru.ac.za:} \\
& & & & {\small /pub/rufl/} \\
Gambit Scheme& 2.2 &Montr\'eal &\cite{Fee90} & Ftp: {\small ftp.iro.umontreal.ca:} \\
& & & & {\small /pub/parallele/gambit/} \\
Sisal &12.9.2 &LLNL &\cite{Can92c}& Ftp: {\small sisal.llnl.gov} \\
& & & & {\small /pub/sisal} \\
SML/NJ & 1.03z &AT\&T Bell Labs. &\cite{App92} & Ftp: {\small research.att.com:} \\
& & & & {\small /dist/ml/} \\
Poly/ML & 2.05M &Abstract Hardware Ltd.&\cite{Fin92a}& commercial \\
Stoffel & & Amsterdam &\cite{Bee93} & Email: {\small beemster@fwi.uva.nl} \\
Trafola & 1.2 & Saarbr\"ucken &\cite{Alt93} & Email: {\small alt@cs.uni-sb.de} \\
\hline
\end{tabular}
\end{center}
\caption{Compiler details consisting of the name of the compiler and/or language, the
University or Company that built the compiler, one key reference to the
description of the implementation and the address from which the
compiler can be obtained.}
\label{tbl:compiler}
\end{table}
\section{Compilers}
\label{sec:compilers}
An overview of the compilers which were studied may be found in
Table~\ref{tbl:compiler}. The first column gives the name of the
language and/or compiler, the second shows the source of the compiler. A key reference
that describes the compiler is given in the third column. The last
column gives instructions for obtaining the compiler by FTP or email.
\begin{table}
\begin{center}
\begin{tabular}{|l|l*{3}{|r@{\,}r@{\,}r}|}
\hline
C compiler & version & \mmm{compilation} & \mmm{execution} & \mmm{compilation} \\
& & \mmm{complete} & \mmm{complete} & \mmm{stripped} \\
\hline
cc --O & SunOS 4.1.2 & 325\dzz &+& 26\dzz & 3.35 &+& 0.23 & 12.9 &+& 2.4 \\
gcc --O & 2.5.8 & 910\dzz &+& 96\dzz & 2.70 &+& 0.19 & 13.4 &+& 2.3 \\
\hline
ratio & & 0.35 &+& 0.26 & 1.23 &+& 1.15 & \mmm{$\approx 1+1$} \\
\hline
\end{tabular}
\end{center}
\caption{The compilation and execution times (user+system seconds) of
the C version of the pseudoknot program with the two most commonly
used C compilers.}
\label{tbl:c_gcc}
\end{table}
Most of the experiments were performed on Sun SPARC workstations. The
performance of the C version of the program forms the baseline for the
experiments. We have used either Sun's bundled C compiler cc or the GNU C
compiler gcc. The gcc compiler generates slightly faster code for the
pseudoknot program, but it takes about three times as long to compile.
Table~\ref{tbl:c_gcc} gives the detailed results as measured on
machine~\sysfast~(c.f. Table~\ref{tbl:machine}) using gcc version
2.5.8. These results indicate that the C version of the pseudoknot
program is difficult to compile, even for a C compiler. This is due to
the presence of four large routines (consisting of 574, 552, 585 and
541 lines of code, respectively) which together build the conformation
data base of the nucleotides. Stripping the bodies of these four
functions so that they are empty (which still leaves a program of 1073 lines of C
code) reduces the compilation time to about 13 seconds, for cc as well
as gcc. All the functional versions of the program have the same
structure as the C version, so the functional compilers are faced with
the same difficulty of compiling the code that builds the conformation
data base.
An interesting difference between cc and gcc is that over 75\% of the
compilation time for cc is spent in the assembler, whereas gcc spends
99\% of its time in the main compiler pass (\verb=cc1=).
Table~\ref{tbl:option} shows some properties of the compilers and
experiments that have an effect on the runtime performance of the
implementations. In addition to the compile and runtime options
(columns~2 and~3), the fourth column shows the type of garbage
collector used, and the fifth column shows what floating point
precision is used.
The floating point precision must be taken into account when
comparing results, for two reasons. Firstly, using single precision
floating point numbers makes it easier to generate fast code. Secondly,
some architectures implement single precision floating point arithmetic
significantly faster than double precision (e.g.\ the DEC Alpha). However,
on the SPARC that has been used for the majority of the experiments
reported here, there is only a small difference between the speed of single and
double precision arithmetic.
\begin{table}
\begin{minipage}{\hsize}
\begin{center}
\begin{tabular}{|l|l|l|l|l|}
\hline
compiler & compiler options& execution options & collector& float\\
\hline
Bigloo & --unsafe --O4 & & mark-scan& double \\
Caml Light & & & gen. & double \\
Caml Gallium & & & gen. & double \\
Camloo & --unsafe --O4 & & mark-scan& double \\
CeML & --O & & 1-space & single \\
Chalmers & --c --Y--S --H50Mg --Y--A500k --cpp & & 2-space & single \\
Clean & & --nt --s 10k --h ... & 2-sp/ms & double \\
CMU & see text & see text & 2-space & single \\
Erlang BEAM & --fast & --h 600000 & 2-space & double \\
Facile & & & gen. & double \\
FAST & --fcg & --v 1 --h ... --s 400K 1 & 2-space & single \\
Gambit & --h14336 & --h14336 & 2-space & double \\
Glasgow & --O --fvia--C --O2--for--C & +RTS --H1M&gen.~\cite{San93}& single \\
Gofer & --DTIMER & & 2-space & single \\
ID & strict, merge--partitions (tlc: opt) & & none & double \\
MLWorks &
no details available\footnote{MLWorks is not yet available. Compilation
was for maximum optimisation, no debugging or statistics collection
was taking place at runtime.} & & 2-space & double \\
Miranda & & /heap ...; /count & mark-scan& double \\
NHC(HBC) & --H30M & & 2-space & single \\
NHC(NHC) & --h2M & & 1-space & single \\
Opal & opt=full debug=no & & refcount & single \\
Poly/ML & --h 48000 & --h 48000 & gen. & single \\
RUFL & --w & --m300 & mark-scan& double \\
RUFLI & --iw & --m300 --r32000 & mark-scan& double \\
Sisal & --cpp --seq --O --c atan2 --cc=--O & & refcount & double \\
SML/NJ & & & gen. & double \\
Stoffel & --O2 (for C) & & 2-space & double \\
Trafola & --TC --INLINE 1 & --HE 8000000 & 1-space & sgl/dbl\\
Yale & see text & see text & 2-space & single \\
\hline
\end{tabular}
\end{center}
\end{minipage}
\caption{Compiler and execution options. The type of garbage collector
is one of 2-space (non-generational 2-space copying collector); mark-scan; gen.
(generational with two or more spaces); 1-space (mark-scan, one space
compactor); or reference counting. Floating point arithmetic used is
either single or double precision.}
\label{tbl:option}
\end{table}
\section{Translation, annotations and optimisations}
\label{sec:translation}
The pseudoknot program was translated by hand from the Scheme version
to the various other languages. All versions were hand tuned to
exhibit the best performance available with the compiler being used.
The following set of guidelines has been used to make the comparison
as fair as possible:
\begin{enumerate}
\item
Algorithmic changes are forbidden but slight modifications to the code
to better use a particular feature of the language or programming
system are allowed. A modification that would be valid for all
languages involved is not permitted.
\item
Only small changes to the data structures are permitted (e.g.\ a tuple
may be turned into an array or a list).
\item
Annotations are permitted, for example strictness annotations, or
annotations for inlining and specialisation of code.
\item
All changes and annotations should be documented.
\item
Anyone should be able to repeat the experiments. So all sources and
measurement procedures should be made public (by \verb=ftp= somewhere).
\item
All programs must produce the same output (the number 33.7976 to 6
significant figures).
\end{enumerate}
The optimisations and annotations made to obtain best performance with
each of the compilers are discussed in the subsections to follow. Often
we will make reference to particular parts of the program text. As the
program is relatively large it would be difficult to reproduce it in
full here, and in many versions. The reader is invited to consult the
archive that contains most of the versions of the program at
\verb=ftp.fwi.uva.nl=, file \verb=/pub/functional/pseudoknot.tar.Z=.
\subsection{Bigloo}
The Scheme version of the pseudoknot benchmark was changed only
slightly to serve as input to the Bigloo optimising Scheme compiler.
This compiler translates to C, as do many other compilers. In the C
code it produces, the many vectors of floating point constants present
in the conformation data base are not represented as ordinary C
constants but as a large string. This enables the C compiler to compile
the generated C much more quickly. During initialisation at runtime the
string representation is transformed into the required numeric data.
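The effect of this transformation can be sketched in Haskell as
follows (our rendering of the idea; Bigloo performs the equivalent in
the C code that it generates):
\begin{verbatim}
> -- The conformation data base is emitted as one large string rather
> -- than as thousands of literal floating point constants, which are
> -- expensive for a C compiler to process.
> db_string :: String
> db_string = "0.0 1.5 -2.25 3.75"  -- abbreviated; the real table is large
>
> -- At initialisation time the string is decoded once into the numeric
> -- data that the program actually uses.
> conformation_db :: [Double]
> conformation_db = map read (words db_string)
\end{verbatim}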
\subsection{Caml}
Three compilers for the Caml dialect of ML have been benchmarked: Caml
Light, a simple bytecode compiler; Camloo, a Caml to C compiler derived
from the Bigloo optimizing Scheme compiler; and Gallium, an
experimental native-code compiler.
The source code is a straightforward translation to Caml of the SML
version of the benchmark. The only change made to increase performance
was the use of an array instead of a list in the functions
\verb|list_of_atoms| and \verb|most_distant_atoms|. For the Gallium
compiler only, some tuples have been converted into records to guide
the data representation heuristics; this transformation makes no
difference for the other two compilers.
The Caml Gallium compiler uses unboxed double precision floats in
monomorphic parts of the source code. Since the pseudoknot benchmark does
not use polymorphism, all floats are unboxed at all times. This is the
main reason why the Gallium compiler generates faster code than most
of the other compilers.
\subsection{CeML}
CeML is another dialect of Caml. The purpose of the development of CeML
and its compiler is to prove that ML can be compiled into efficient C.
CeML does not tag immediate values and uses a garbage collector with
ambiguous roots which manages its own stack. This is important for the
pseudoknot program because the floats are represented as a full word
(32 bits).
The CeML source of the pseudoknot program differs only in one aspect
from the Camloo source. To avoid rebuilding the tuple \verb=(i,t,n)=
the definition of \verb=get_var= was changed from:
\begin{verbatim}
> fun get_var id ((i,t,n)::lst) = if id = i then (i,t,n)
> else get_var id lst
\end{verbatim}
to:
\begin{verbatim}
> fun get_var id (((i,_,_) as v)::lst) = if id = i then v
> else get_var id lst
\end{verbatim}
The total compilation time consists mainly of the gcc optimised
compilation (--O2) time, particularly for building the data base. If the
program is sliced into different modules and the module corresponding
to the data base is not compiled with the --O2 option, the total
compilation time is about ten minutes and the execution time does not
change.
\subsection{Clean}
The syntax of the Clean 1.0 programming language is similar to the
syntax of Haskell and Miranda. Porting the Miranda program to Clean 1.0
was easy. The guards were changed to Haskell syntax, the first
character of each type name was changed to upper case, and the names of
a few functions were changed. Some patterns on the right hand side of
function definitions were moved to the left hand side or replaced by
case expressions. This was necessary because the current compiler
cannot yet handle all patterns on the right hand side; only tuple and
record patterns are currently allowed there. For example:
\begin{verbatim}
> anticodon_constraint v partial_inst
> | i==33 = ...
> where Var i t n = v
\end{verbatim}
was replaced by:
\begin{verbatim}
> anticodon_constraint v=:(Var i t n) partial_inst
> | i==33 = ...
> where
\end{verbatim}
To achieve good performance the following five changes have been
made. The basic floating point operations and some constants were
inlined. The components of the main algebraic data types \verb=Pt= and
\verb=Tfo= and the integer component of \verb=Var= were annotated as
strict. The function \verb=absolute_pos= was inlined in the function
\verb=atom_pos=. If a node on the right hand side of a definition
occurred in the pattern, the code was changed to prevent rebuilding the
node, for example:
\begin{verbatim}
> atom_pos atom (Var i t n) = absolute_pos (Var i t n) (atom n)
\end{verbatim}
was changed to:
\begin{verbatim}
> atom_pos atom v=:(Var i t n) = absolute_pos v (atom n)
\end{verbatim}
Finally, in the function \verb=pseudoknot_domains= the values of
\verb=rA=, \verb=rC=, \verb=rG=, \verb=rU= and \verb=rCs= were assigned
to local functions as: \verb|ra=rA| etc. The local functions were used
throughout the body of \verb=pseudoknot_domains=. This is necessary
because the current compiler does not yet support constant applicative
forms (CAFs)~\cite{Pey87}.
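A constant applicative form is a top-level definition without
arguments, which a compiler supporting CAFs evaluates at most once and
then shares. The following Haskell fragment (ours, with hypothetical
values) illustrates the sharing that the local bindings recover:
\begin{verbatim}
> -- With CAF support, this top-level constant is evaluated at most
> -- once, however often it is referenced.
> rA :: [Double]
> rA = map sqrt [1..1000]
>
> -- Without CAF support every reference would rebuild the value;
> -- binding it once in a local definition restores the sharing
> -- within the body.
> pseudoknot_domains :: Double
> pseudoknot_domains = sum ra + maximum ra
>   where ra = rA    -- evaluated once, used throughout the body
\end{verbatim}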
\subsection{Chalmers Haskell}
\label{sec:hbc}
The Haskell version of the pseudoknot program is a straightforward
translation of the Miranda version. The latter often destructures and
reconstructs a data item. In the Haskell version care has been taken to
introduce ``as'' patterns thus:
\begin{verbatim}
> atom_pos atom v@(Var i t n) = absolute_pos v (atom n)
\end{verbatim}
It proved necessary to split the program into a number of modules, as
the program as a whole could not be compiled (see
Section~\ref{sec:NHC}). Splitting the program into modules was
straightforward since each of the four functions that initialise the
conformation data base can be separated into its own large module of
about 600 lines. The global data structure definitions are placed in
their own module which is imported by each worker module, and the main
program imports all of these modules.
The Chalmers Haskell compiler (HBC) provides a compiler option (--DSTR)
to annotate all data structures as strict. The late formulation of the
pseudoknot program runs about twice as fast with this option
selected.
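The effect of this option corresponds to annotating every component of
every data type as strict, which in Haskell source (using the
strictness-annotation syntax of later versions of the language) would
look like the following sketch of ours; --DSTR applies the
transformation globally without source changes:
\begin{verbatim}
> -- A lazy data type: each component may be an unevaluated thunk.
> data Pt = Pt Float Float Float
>
> -- The same type with strict components: the fields are forced when
> -- the constructor is applied, so no thunks are built for them.
> data PtStrict = PtStrict !Float !Float !Float
\end{verbatim}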
\subsection{Erlang}
Erlang is a concurrent functional programming language designed for
prototyping and implementing reliable real-time systems~\cite{Arm93}.
The Erlang BEAM compiler~\cite{Hau94} is an efficient and portable
implementation of Erlang where Erlang programs are compiled into C.
However, Erlang is a concurrent language and has explicit
syntax for referring to time (time-outs). The implementation therefore
has to provide a scheduling mechanism and a notion of time, which
introduces a constant overhead to Erlang BEAM execution.
Porting the benchmark from Miranda to Erlang required some minor changes.
Since Erlang is not a higher order language, some higher order function
calls were replaced by calls to an explicit dispatch function
\verb=p_apply/N=; for example \verb=reference/3= is called as:
\begin{verbatim}
> p_apply(reference,Arg1,Arg2,Arg3) -> reference(Arg1,Arg2,Arg3).
\end{verbatim}
when all arguments are known.
Another problem is typing. Some limited typing can be done in Erlang
using guard tests. For example, the following Erlang function types its
arguments as floats:
\begin{verbatim}
> pt_sub({X1,Y1,Z1},{X2,Y2,Z2})
> when float(X1),float(Y1),float(Z1),
> float(X2),float(Y2),float(Z2) ->
> {X1-X2,Y1-Y2,Z1-Z2}.
\end{verbatim}
The type information obtained from guards and extended by compile-time
type analysis is used by the Erlang BEAM compiler to optimize
different arithmetic operations.
Since Erlang BEAM compiles Erlang programs into C, the Erlang BEAM
compilation time presented in Table~\ref{tbl:compilation} includes the
costs both of compiling Erlang to C and of calling gcc (version 2.5.8)
to compile the generated C program. The total compilation time consists
mainly of the gcc compilation time (4751 + 20 seconds). The gcc
compiler was run with the -O2 option, which increases compilation time,
but better performance of the generated code was considered more
important than a shorter compilation time.
The best execution time was achieved by extending the available heap
space to minimize the garbage collection time, and by inlining floating
point operations after compile-time type analysis.
Erlang BEAM provides double precision floating point arithmetic. All
floating point numbers are stored as boxed objects in the heap.
Integers are 32-bit objects, which may be stored directly on the stack.
The relatively poor performance of the Erlang BEAM compiler (reported
in Table~\ref{tbl:execution}) is due to the fact that the pseudoknot
benchmark is float-intensive and thus tests mainly the implementation
of floating point arithmetic. As shown later in Section~\ref{sec:OPAL},
having single precision floating point arithmetic and using an unboxed
representation of floating point data may improve the execution time by
a factor of 4. In the Erlang BEAM compiler giving up the double
precision floating point arithmetic would similarly improve both access
and garbage collection time by storing single-float objects directly on
the stack instead of in the heap.
To support the above claim we did the following simple experiment. We
ran the two functions \verb=tfo_apply/2= and \verb=tfo_combine/2=,
which are very floating-point intensive, using first integers and then
(double precision) floats as input data. The experiment was carried out
on machine~\sysfloat~(c.f. Table~\ref{tbl:machine}). The results showed
that using integers instead of floats improved the Erlang BEAM
execution time of those two functions by a factor of 6. To be sure that
the improvement comes from better data representation and not from
better integer arithmetic provided by the hardware, a simple C program
was written to compare the speed of floating-point and integer
arithmetic. The results showed that the floating-point arithmetic was
as efficient as the integer arithmetic.
Since single precision floats can be represented in Erlang BEAM in the
same way as integers we can expect that giving up the double precision
floating point arithmetic would improve the pseudoknot execution time
to a large extent.
\subsection{Facile}
Facile is a high-level, higher-order programming language for systems
that require a combination of complex data manipulation and concurrent
and distributed computing. It combines Standard ML (SML), with a model
of higher-order concurrent processes based on CCS. Facile is
well-suited for running on loosely connected, physically distributed
systems with distributed memory. It is possible to execute Facile
programs on both local area networks (LANs) and wide area networks
(WANs).
The basic computational model of Facile consists of one or more nodes
or virtual processors, on each of which there are zero or more
processes (light weight threads). Processes execute by evaluating
expressions, and they can communicate values between each other by
synchronising over typed channels. All types of values, including user
defined types, channels, functions and process scripts, are
transmissible.
% Facile enriches Standard ML with primitives for distribution,
% concurrency and communication over typed channels. The additional data
% types provided in the language include node identifiers, process
% scripts and communication channels. All of these are first-class values
% that can be manipulated in an applicative style and, in particular, be
% communicated. New nodes and channels can be created dynamically and
% processes executing a given script can be spawned dynamically on a
% given node. Special care has been given to developing the formal
% foundations of the Facile model. The basic philosophy of the Facile
% project has been to develop a language comprising a basic set of
% primitive constructs that are intuitive for programming, have a clear
% semantic formalisation and have a quite efficient implementation. These
% primitives are powerful enough for defining a variety of abstractions
% commonly employed in concurrent and distributed computing.
The pseudoknot benchmark does not use the distributed capabilities of
the Facile system. Yet it is interesting to see that the cost of
starting up a sequential program in the Facile system is small (26
milliseconds on machine~\sysfacile, c.f.\ Table~\ref{tbl:machine}).
\begin{table}
\begin{center}
\begin{tabular}{|l|r@{\,}r@{\,}r|r@{\,}r@{\,}r|}
\hline
version & \mmm{unoptimised} &\mmm{optimised} \\
\hline
curried & 8.48 &+& 0.51 & 8.38 &+& 0.51 \\
uncurried & 8.03 &+& 0.50 & 7.66 &+& 0.49 \\
\hline
\end{tabular}
\end{center}
\caption{Four versions of the pseudoknot program executed with the
Facile Antigua system, showing execution times (user+system seconds).}
\label{tbl:facile}
\end{table}
We did several experiments with the pseudoknot benchmark program. The
results are shown in Table~\ref{tbl:facile}. Firstly, the Standard ML
version of the pseudoknot program was run unoptimised and unmodified
(i.e.\ with curried function definitions). Secondly, the execution time
was found to be largely unaffected by the following compiler optimisations:
loop unrolling, scope localisation of free variables, hoisting,
contraction, common subexpression elimination, range analysis, and
inlining. Thirdly, uncurrying the function definitions made the program
slightly faster and, fourthly, gave these compiler optimisations more
scope. The Facile compiler is essentially based on SML/NJ,
version 0.93. In Section~\ref{sec:SML} we describe experiments with a
more recent release of the SML/NJ compiler, which is about twice as
fast for the pseudoknot benchmark.
\subsection{FAST}
\label{sec:FAST}
The FAST compiler input language is a subset of Miranda; the port from
Miranda to this subset is trivial. Three changes have been made to
achieve best performance. Firstly, all components of the main algebraic data types
(\verb=pt=, \verb=tfo= and \verb=var=) in the program were annotated as
strict. Secondly, the basic definitions of floating point operations
such as \verb|fadd = (+)| and \verb|fsin = sin| were inlined. Finally,
one construct that the FAST compiler should be able to deal with, but
which it cannot at present handle has been changed.
Functions such as:
\begin{verbatim}
> atom_pos atom (Var i t n) = absolute_pos (Var i t n) (atom n)
\end{verbatim}
were replaced by:
\begin{verbatim}
> atom_pos atom v = absolute_pos v (atom (var2nuc v))
> var2nuc (Var i t n) = n
\end{verbatim}
\begin{table}
\begin{center}
\begin{tabular}{|l|r@{\,}r@{\,}r|l|r@{\,}r@{\,}r|l|}
\hline
strictness & \multicolumn{4}{c|}{late} & \multicolumn{4}{c|}{early} \\
annotations& \mmm{seconds} & Mbytes & \mmm{seconds} & Mbytes \\
\hline
without & 31.8 &+& 0.7 & 89.5 & 30.1 &+& 4.9 & 94.1 \\
with & 11.0 &+& 0.5 & 15.5 & 12.1 &+& 1.2 & 35.3 \\
\hline
\end{tabular}
\end{center}
\caption{Four versions of the pseudoknot program executed with the (lazy)
FAST system, showing execution times (user+system seconds) and heap
allocation (Mbytes).}
\label{tbl:fast}
\end{table}
To explore the possible advantage of laziness, a number of different
versions of the pseudoknot program have been compiled, executed and
analysed using the FAST system. Table~\ref{tbl:fast} shows the total
execution time (user+system seconds) and the total amount of heap
space (Mbytes) consumed by four different versions of the
program. Both the late and early formulations of the program were
tested with and without strictness annotations. The strictness
annotations declare the three primary data types (\verb=pt=,
\verb=tfo= and \verb=var=) of the program strict.
Strictness annotations have a profound effect on the performance of the
pseudoknot program. The user time of the late formulation is reduced by a
factor of 2.9 and the user time of the early formulation is reduced by a
factor of 2.5. These improvements are due to the reduction in heap
allocation.
A less profound effect is caused by the laziness of the program.
There are two differences between the early and late formulations of the
pseudoknot program: the early formulation does fewer computations but it
allocates more heap space.
Regardless of whether strictness annotations are used or not, the
number of floating point multiplications made by the early formulation is
19\% less and the number of floating point additions is 14\% less.
Without strictness annotations the early formulation allocates 5\% more heap
space than the late formulation. The combined effect of allocating a
little more space, but doing less work makes the early formulation 9\%
faster than the late formulation.
With strictness annotations the early formulation allocates 44\% more space
than the late formulation. This time the extra heap allocation more than
cancels the savings in floating point operations. Now the late formulation
is 10\% faster than the early formulation.
\subsection{Gambit}
Gambit is an R4RS Scheme conformant optimizing compiler which supports
the whole numeric tower (complex, real, rational, and integer).
Measurements could not be done on the SPARC because currently the only
fully complete code generator is for M680x0 processors. Since the
original program had been developed with Gambit the source code was
not modified in any way from the distributed source code (\cite{Fee94}
gives the details of the development process and an analysis of the
program when compiled with Gambit).
Both the late and early formulations were tested.
Table~\ref{tbl:gambit} shows the total execution time for the program
when compiled by Gambit 2.2 and gcc 2.0 on
machine~\sysgambit~(c.f. Table~\ref{tbl:machine}). On this benchmark,
the code generated by Gambit executes at about 37\% the speed of C.
This slowdown is due in large part to the fact that Gambit boxes all
floating point numbers. To better evaluate this cost, the two most
numerically intensive functions (\verb=tfo_combine= and
\verb=tfo_apply=) were hand coded in assembler to avoid the boxing of
intermediate floating point values; the input and result of the
functions were still boxed. The execution time of the late formulation
went down by a substantial 34\% to 11.4 seconds, that is 55.6\% the
speed of C. Since the two functions are relatively small and have a
trivial data flow and control flow, a simple ``local'' type analysis
would probably be sufficient for the compiler to generate much better
code for this benchmark.
\begin{table}
\begin{center}
\begin{tabular}{|l|l|l|}
\hline
Compiler & late & early \\
\hline
Gambit 2.2 & 17.30+0.22 & 25.90+0.35 \\
gcc 2.0 & 6.34+0.12 & 9.75+0.15 \\
\hline
\end{tabular}
\end{center}
\caption{Execution times (user+system seconds) for both formulations
of the pseudoknot program when compiled with Gambit and gcc 2.0.}
\label{tbl:gambit}
\end{table}
For both Gambit and gcc, the early formulation is roughly 50\% slower than
the late formulation. Thus, for the pseudoknot data set, the cost of
lazy computation in the early formulation is larger than the saving
in computation it provides. It is conceivable however that the early
formulation can surpass the late formulation if the program is run with a
data set which generates few solutions compared to the number of
shapes searched. In such a case, only a small proportion of the atom
positions will have to be computed.
\subsection{Glasgow Haskell}
\label{sec:ghc}
The Glasgow compiler is a highly optimising Haskell compiler, with
elaborate performance monitoring tools. The pseudoknot program was
subjected to much optimisation on the basis of results from these
tools.
The late formulation of the program (identical to the Chalmers Haskell
program described in Section~\ref{sec:hbc}) ran in 10.2 seconds on
machine~\sysglasgow~(c.f. Table~\ref{tbl:machine}). Based on the raw
time profiling information from this program
(Table~\ref{tbl:glasgow-profile}-a), it is obvious that a few
functions account for a significant percentage of the time used, and
over 80\% of the total space usage. Three of the top four functions by
time (\verb=tfo_combine=, \verb=tfo_apply= and
\verb=tfo_align=) manipulate \verb=TFO=s and \verb=Pt=s, and the
remainder are heavy users of the \verb=Var= structure. Since these
functions can safely be made strict, they are prime candidates for the
use of unboxed types~\cite{Pey91b}: types whose values are
represented literally rather than by pointers to nodes in the heap.
Unboxing is desirable where possible, both because execution is faster
and because the program's space requirements are generally reduced.
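As a sketch of the difference (using GHC's unboxed \verb=Float#= type;
the type names here are invented for illustration):
\begin{verbatim}
-- Boxed: each field is a pointer to a Float node in the heap.
data BoxedPt   = BoxedPt   Float  Float  Float

-- Unboxed: each field is a raw machine float stored directly
-- in the constructor; such fields are necessarily strict.
data UnboxedPt = UnboxedPt Float# Float# Float#
\end{verbatim}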
\begin{table}
\begin{center}
\begin{tabular}{lr}
\begin{tabular}{|l|l|l|}
\hline
{\bf Cost Centre} & \%time & \%alloc \\
\hline
tfo\_combine & 18.0 & 4.7 \\
tfo\_apply & 15.9 & 0.0 \\
p\_o3' & 8.3 & 25.5 \\
tfo\_align & 6.2 & 1.9 \\
dgf\_base & 5.9 & 21.6 \\
get\_var & 5.9 & 0.0 \\
absolute\_pos & 4.7 & 24.1 \\
... & ... & ... \\
\hline
\multicolumn{3}{l}{(a) Original profile (by time)}
\end{tabular}
&
\begin{tabular}{|l|l|l|}
\hline
{\bf Cost Centre} & \%time & \%alloc \\
\hline
get\_var & 11.1 & 0.0 \\
tfo\_combine & 10.5 & 26.5 \\
p\_o3' & 8.5 & 13.0 \\
pseudoknot\_constraint & 7.8 & 9.9 \\
search & 7.8 & 2.6 \\
tfo\_align & 5.2 & 5.6 \\
pt\_phi & 5.2 & 0.0 \\
tfo\_apply & 5.2 & 0.0 \\
... & ... & ... \\
\hline
\multicolumn{3}{l}{(c) Maximum map (by time)}
\end{tabular}
\\
\begin{tabular}{|l|l|l|}
\hline
{\bf Cost Centre} & \%time & \%alloc \\
\hline
tfo\_apply & 11.1 & 0.0 \\
tfo\_combine & 10.5 & 23.2 \\
search & 9.9 & 2.3 \\
pseudoknot\_constraint & 8.2 & 8.7 \\
get\_var & 7.0 & 0.0 \\
var\_most\_distant& 6.4 & 8.7 \\
tfo\_align & 5.8 & 4.9 \\
... & ... & ... \\
\hline
\multicolumn{3}{l}{(b) Strict types (by time)}
\end{tabular}
&
\begin{tabular}{|l|l|l|}
\hline
{\bf Cost Centre} & \%time & \%alloc \\
\hline
tfo\_combine & 9.7 & 26.5 \\
p\_o3' & 7.7 & 13.0 \\
pseudoknot\_constraint & 4.6 & 9.9 \\
mk\_var & 2.0 & 6.6 \\
tfo\_align & 10.2 & 5.6 \\
tfo\_inv\_ortho & 2.6 & 5.6 \\
... & ... & ... \\
\hline
\multicolumn{3}{l}{(d) Maximum map (by allocation)}
\end{tabular}
\end{tabular}
\end{center}
\caption{Time and allocation profile by function, as a percentage of
total time/heap allocations.}
\label{tbl:glasgow-profile}
\end{table}
Unboxing these data structures semi-automatically, and strictifying the
lazy pattern match in the definition of \verb=var_most_distant_atom=,
gives a significant performance improvement of roughly a factor of 3.
This is similar to the improvement obtainable by simply
annotating the relevant data structures to make them strict.
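The strictification can be sketched as follows (hypothetical code: the
names are invented, and this is not the actual definition of
\verb=var_most_distant_atom=):
\begin{verbatim}
-- Hypothetical stand-in for the real computation.
measure :: [Float] -> (Float, Int)
measure xs = (maximum xs, length xs)

-- Lazy pattern: separate suspensions are allocated for d and
-- i, and the pair itself stays unevaluated until demanded.
most_distant :: [Float] -> Float
most_distant xs = let (d, i) = measure xs in d

-- Strictified: the case expression evaluates the pair once
-- and binds its components directly.
most_distant' :: [Float] -> Float
most_distant' xs = case measure xs of (d, i) -> d
\end{verbatim}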
However, further unboxing optimisations are possible if the three uses
of the function composition \verb=maximum . map= are replaced by a new
function \verb=maximum_map=, as shown below. This function maps a
function returning a floating-point number over a list of arguments and
selects the maximum result. The normal Prelude \verb=map= cannot be
used here, because unboxed values cannot be passed to polymorphic
functions: a polymorphic type variable cannot be instantiated to an
unboxed type such as \verb=Float#=.
\begin{verbatim}
> maximum_map :: (a -> Float#) -> [a] -> Float#
> maximum_map f (h:t) = max f t (f h)
>   where
>     max :: (a -> Float#) -> [a] -> Float# -> Float#
>     max f (x:xs) m = max f xs (let fx = f x
>                                in if fx `gtFloat#` m then fx else m)
>     max f []     m = m
\end{verbatim}
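For illustration, a call that would previously have been written as
\verb=maximum (map f xs)= (with a boxed \verb=Float= result) becomes
\verb=maximum_map f xs=: the intermediate list of results is never
built, and the running maximum can be held in a register.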
This optimisation is suggested indirectly by the time profile
(Table~\ref{tbl:glasgow-profile}-b), which shows that the top function
by time is \verb=tfo_apply=, called through \verb=absolute_pos=
within \verb=most_distant_atom=. It seems likely that the three nested
calls to produce the maximum value of a function over a list of values
can be merged to hold the current maximum in a register (an extreme
form of deforestation \cite{Wad90b}). This optimisation reduces the
total execution time to 1.8 seconds user time.
The final time profile (Table~\ref{tbl:glasgow-profile}-c) shows
\verb=get_var= and \verb=p_o3'= jointly using 20\% of the Haskell
execution time with \verb=tfo_combine=, \verb=tfo_align= and
\verb=tfo_apply= accounting for a further 20\%. (The minor differences
in percentage time for \verb=tfo_combine= in
Tables~\ref{tbl:glasgow-profile}-c and \ref{tbl:glasgow-profile}-d are
probably explained by sampling error over such a short run.) While the
first two functions could be optimised to use a non-list data
structure, it is not easy to optimise the latter functions any further.
Since the total execution time is now close to that of C, with a large
fraction of the remaining time spent in the Unix
mathematical library, and since the allocation profile
(Table~\ref{tbl:glasgow-profile}-d) suggests that no further
space gains can be obtained easily, no further
optimisations were attempted. The overall Glasgow Haskell time and space
results are
summarised in Table~\ref{tbl:glasgow-haskell-times}. The heap usage is
the total number of bytes allocated in each case, with the maximum live
data residency after a major garbage collection shown in parentheses.
As with Erlang and other C-based compilers, much of the compilation
time is spent in the C compiler. A native code generator is
available and is faster, but it does not support all of the
optimisations attempted here.
It is planned to incorporate an automatic generalised version of the
deforesting optimisation, the foldr/build transformation~\cite{Gil94},
into the Glasgow Haskell compiler in due course.
\begin{table}
\begin{center}
\begin{tabular}{|l|r@{\,}r@{\,}r|r|l|r@{\,}r@{\,}r|r|l|}
\hline
Version & \multicolumn{5}{c|}{late} & \multicolumn{5}{c|}{early} \\
& \mmm{seconds} & Mbytes& (Residency) & \mmm{seconds} & Mbytes & (Residency) \\
\hline
Original & 10.0 &+& 0.2 & 36.8 & (55K) & 11.0 &+& 0.3 & 57.6 & (419K) \\
Strict Types & 3.5 &+& 0.3 & 10.1 & (53K) & 3.4 &+& 0.2 & 30.6 & (215K) \\
Maximum Map & 1.8 &+& 0.1 & 7.6 & (46K) & 2.9 &+& 0.1 & 29.3 & (215K) \\
\hline
\end{tabular}
\end{center}
\caption{Time and Heap Usage of three pseudoknot variants compiled for
machine~\sysglasgow~by the Glasgow Haskell compiler.}
\label{tbl:glasgow-haskell-times}
\end{table}
\subsection{Gofer}
The Gofer version of the program was derived very quickly from the
Miranda version supplied. The only significant changes were needed to
accommodate the different notations for infix operators and comments.
The Gofer primitive \verb=atan2= function was used in place of the
\verb=math_atan2= function in the Miranda code. Using \verb=math_atan2=
instead of the primitive \verb=atan2= adds only 4\% to the total
execution time.
Gofer does not perform any significant optimizations, nor does it provide
a mechanism for annotating data type or function definitions to
indicate, for example, strictness properties. No attempt was made to
optimize the program to run faster or more efficiently in Gofer.
\subsection{ID}
ID is an eager, non-strict, mostly functional, implicitly parallel
language. Only the purely functional subset of ID was used in this
benchmark. The compiler used in this study was developed jointly
by MIT and Berkeley. The first part of the compiler generates
dataflow graphs from ID source code; the Berkeley back end compiles
these dataflow graphs into TL0, the language of an abstract
multi-threaded machine~\cite{Gol94}; and TL0 is in turn compiled into
C. All of the other ID compilers target dataflow machines. The
Berkeley back end creates separate files for each ID function, which
contributes to the long compile time for ID.
The port of the pseudoknot benchmark from Haskell to ID was trivial.
To improve performance, four changes were made: a number of the
lists were changed to arrays; the recursive functions \verb=get_var=
and \verb=search= were changed to be iterative using ID while loops;
the local function \verb=generate= within \verb=p_03'= was replaced by
an ID array comprehension; deforestation was applied (by hand) to the
function composition \verb=maximum . map=, as with the Glasgow
compiler (Section~\ref{sec:ghc}). These changes cut the execution
time roughly in half and also reduced the space required for
execution. Compiling for parallel machines incurs overheads that must
be paid even on sequential machines. We compiled with --O2 and linked
all of the executables with --static to improve performance.
\subsection{MLWorks}
The initial SML version of the pseudoknot program is not quite
Standard ML: it uses a pervasive list length function and an
overloaded print function, neither of which is part of the
pervasive environment provided by the language. These problems were
corrected.
A list length function was provided in the obvious manner (i.e.\ tail
recursive with an accumulator). The two uses of the overloaded print
function were replaced by calls to the relevant functions
(\verb=output= for the string and an MLWorks real-printing function
for the real).
An initial profile of a run of this program suggests that slightly
over 50\% of the run time is taken up by three functions,
\verb=tfo_combine=, \verb=tfo_align= and \verb=tfo_apply=. These do
little more than floating point arithmetic and trigonometry,
which means that for half of the time little more than the floating
point capability of the implementation is being tested. Some of that
capability is provided by a runtime library implemented on top of
\verb=libc=.
ML is a strict language. Thus, if a function is defined before a
conditional but only used in one branch of it, a closure must still be
built for that function before the conditional is evaluated. Building
a closure, in MLWorks, involves allocating heap space and copying in
values. Moving the function definition inside the branch that uses it
avoids this unnecessary closure building (which a lazy implementation
would avoid anyway): for example, a local function used only in the
`then' branch can be defined inside that branch rather than before the
conditional. Small savings were achieved by modifying the function
\verb=pseudoknot_constraint= to build \verb=dist= only when required.
The function \verb=try_assignments= was also modified to be tail
recursive.
\subsection{NHC}
\label{sec:NHC}
The name ``NHC'' comes from the description ``Nearly a Haskell
Compiler''. Speed is not the main goal of NHC; instead, NHC is written
so that it can recompile itself in less than 3 Mbytes of memory.
Floating point performance is completely ignored, and all arithmetic
operations are treated as user defined functions.
Two versions of the NHC compiler have been used. The first is NHC
compiled by HBC, denoted NHC(HBC); this is a relatively fast, but
space hungry, version of the compiler. The second is NHC
compiled by itself, denoted NHC(NHC); this version is slower, but
more economical in space.
The version of the pseudoknot program used with the HBC
compiler was changed for NHC in two ways.
Firstly, all six separate modules were combined into one module. This
was done to show that NHC can indeed compile large programs using less
memory than the other Haskell compilers: NHC(NHC) needs an 8 Mbyte heap
to compile the single-module version of the pseudoknot program, and
NHC(HBC) needs 30 Mbytes, whereas HBC could not compile it with
80 Mbytes of heap space, not even when using a single-space garbage
collector.
Secondly, all floating point types were changed from \verb=Float=
to \verb=Double=. This does not increase the precision, since
\verb=Double= is implemented as single precision floating point in
NHC, but it is necessary to satisfy the type checker.
The early formulation of the pseudoknot program is slightly faster
than the late formulation: 170 seconds compared to 175 seconds (both
on machine~\syschalmers, c.f. Table~\ref{tbl:machine}). A possible
explanation is that the early formulation performs fewer arithmetic
operations than the late formulation, and that NHC is particularly
slow at arithmetic.
\subsection{OPAL}
\label{sec:OPAL}
OPAL is based on language constructs similar to those found in most
other functional languages. The port of the pseudoknot program was a
matter of applying regular expressions to the SML and Haskell versions
of the program. Some specialities of OPAL
had to be tackled: (1) OPAL provides only strings and booleans as
built-in data types, so that constant data has to be denoted by
applying conversion functions to strings; and (2) OPAL requires an
explicit type signature for each global function. Furthermore, a static
limit on the number of components of a constructor made it necessary to
split the common atoms of the data type \verb=nuc= into two subrecords.
Only the late formulation of the pseudoknot program has been ported. No
optimizing annotations have been added to the source. OPAL is strict by
default, and the compiler performs inlining of function definitions
across module boundaries automatically. Hence, in the target code, all
floating point operations are inlined at the instruction level.
To measure the costs incurred in allocating floating point numbers in
the heap, a new single precision floating point data type has been
added to the system. These single precision numbers are represented as
``unboxed'' values, which thus do not incur garbage collection costs.
The first two rows of Table~\ref{tbl:opal} show the difference, as
measured on machine~\sysopal~(c.f. Table~\ref{tbl:machine}).
Arithmetic is still performed
with double precision, but the results of the operations are stored
with single precision. The timings suggest that dynamically allocating
floating point numbers slows down the pseudoknot program by about a
factor of $2$. The amount of heap space allocated does not differ much
between the two versions; this is a result of the reference-counting
garbage collection scheme of the OPAL system, which minimizes the total
amount of allocated memory at the expense of some execution speed.
The single precision floating point data type has been further refined
by encapsulating values which do not require dynamic type information.
These special values are not visible on the user level. However, after
automatic unfolding and simplifications performed by the compiler,
intermediate results of floating point operations are now passed
literally to subsequent operations in the resulting code. This is
similar to the way unboxed values are handled by the Glasgow Haskell
compiler~\cite{Pey91b}. Unboxing intermediate results improves the
execution time again by a factor of $2$. The results are shown in the
last row of Table~\ref{tbl:opal}.
The current distributed version 2.1b of the OPAL compilation system
failed to properly compile the benchmark. The reason was that this
compiler version packs all initialization code of a module into a
single C function (C is the target language), which is -- with several
thousand lines of code for the nucleotide conformation data base -- too
large for the C compiler. The compilation scheme has been slightly
modified to fix this design bug. The new scheme will be available in
the next release 2.1c. As with other C-based compilers, more than
50\% of the compilation time for the pseudoknot program is used by the
GNU C compiler. The compilation is particularly slowed down by the
need to check and compile several thousand applications of overloaded
conversion functions from strings to numbers. This overhead is a
result of the fact that numbers cannot be denoted literally in the
OPAL language.
\begin{table}
\begin{center}
\begin{tabular}{|l|r@{\,}r@{\,}r|r|}
\hline
version & \mmm{time(s)} & space(M) \\
\hline
double precision (boxed) & 19.9 &+& 0.26 & 1.10 \\
single precision (unboxed) & 9.7 &+& 0.23 & 0.85 \\
no intermediate type tags & 5.1 &+& 0.14 & 0.85 \\
\hline
\end{tabular}
\end{center}
\caption{Three versions of the pseudoknot program executed with the
OPAL system on machine~\sysopal.}
\label{tbl:opal}
\end{table}
\subsection{RUFL}
RUFL is (almost) a subset of Haskell. It excludes type classes, and
type signatures are mandatory for all top-level functions. It has an
abridged module mechanism.
The source was ported from the modularized Chalmers Haskell version.
The bulk of the effort went into providing the type signatures and
module interface files. The main algebraic data types were renamed to
avoid ambiguity between similarly named types and constructors. One
function (\verb=make_relative_nuc=) was too complex for the compiler
and had to be split into two parts; one of the library primitives was
named differently; and a top-level \verb=main= function performing the
output replaced the corresponding Haskell construct. No annotations
were made (the RUFL system does not support annotating components of
data structures as strict).
The compiler generates either SPARC assembly code or SECD-like
intermediate code, which is interpreted; timings for the interpreted
version are denoted here as RUFLI. In both cases the run-time memory
organization is, rather simplistically, based on fixed-size cells, and
algebraic data types are instantiated as chains of these cells. The
heavy use of high-arity constructors in this program generates long
chains and incurs considerable run-time penalties.
\subsection{Sisal}
Collaborators at Lawrence Livermore National Laboratory and Colorado
State University developed the Sisal parallel functional language with
the objective of obtaining high performance on commercial scalar and
vector multiprocessors. Currently, mature Sisal compilers exist for a
variety of shared memory scalar and vector multiprocessors. The two
most important optimizations of the Sisal compiler are memory
preallocation~\cite{Ran87} and copy elimination~\cite{Gop89}. These
optimizations enable most Sisal programs to execute in place without
copying; thus Sisal programs typically run as fast as equivalent
imperative programs.
We ported the C version of the pseudoknot program to Sisal and made
the following minor changes to increase performance. First, we
converted the nucleotide database to an array-based structure instead
of the record-based structure as in the C code. Sisal supports
records, but the compiler is heavily optimized for arrays and array
operations. Then, we rewrote the distance and transformation
functions as scalar functions to eliminate array allocations and
deallocations in the inner loops. Although array allocations and
deallocations in Sisal are efficient (the cost is constant for arrays
of known size), it is difficult to recoup this cost for small arrays.
Finally, we developed a recursive program that single-threads the
\verb=partial_inst= stack and the list of solutions. To accomplish
this, we changed the code of the function \verb=try= as follows:
\noindent
\begin{tabular}{l@{\hspace{10mm}}l}
C & Sisal revised \\
\hline
& \verb|> let| \\
add new element to stack & \verb|> stack1 := array_addh(stack)| \\
increment stack counter & \verb|> stack2 := pseudoknot_domains(stack1)| \\
pseudoknot\_domains & \verb|> in | \\
decrement stack counter & \verb|> array_remh(stack2)| \\
& \verb|> end let|
\end{tabular}
In principle, this code is identical to the C code. The Sisal compiler
realizes that there is only a single consumer of each stack. It tags
the data structure as mutable and generates code to perform all updates
in place. Consequently, the Sisal code maintains a single stack
structure similar to the C code, eliminating excessive memory usage and
copy operations. The Sisal code runs in approximately 85KB of memory
and achieves execution speeds comparable to the C code.
\subsection{SML/NJ and Poly/ML}
\label{sec:SML}
Both Standard ML of New Jersey and Poly/ML are implementations of
Standard ML as described in the Definition of Standard
ML~\cite{Mil90}. The SML version of the pseudoknot application was
initially derived from the Scheme version of the program (by Marc
Feeley) and then subsequently modified to make better use of SML's
typing facilities. These modifications had no effect on the execution
time of the program, but improved its readability and robustness, and
this was helpful for later experimentation with the code.
For SML/NJ, a signature was used to constrain the program (which was
written as one large SML module) to export only the top-level
\verb|run| function. This had the effect of improving the compilation
time as well as providing the compiler with more opportunities for
automatic inlining and other optimizations. Beyond this, no further
changes were made to the program (though various things were tried, as
explained below).
Initially, the program was compiled with versions 0.75 and 0.93 of the
SML/NJ system. It was expected that the latest version (1.03z) would
show a great improvement, since it employs a representation
analysis~\cite{Ler92}, implemented by Zhong Shao (described in his
Ph.D. thesis~\cite{Sha94}), which would ``unbox'' many of the
floating-point numbers stored in the rather large tuple data
structures. Surprisingly, the first attempt with this version of the
compiler resulted in a slowdown. This was because the
implementation of the representation analysis was not finished. In
particular, unboxing of records of floating-point numbers was not yet
supported, nor were floats being passed to functions in floating-point
registers. When these features were completed (by Shao), the
speed improved by more than a factor of two. Analysis of the compiler
output for the pseudoknot program also showed that excessive
register-spilling was occurring. A small improvement to the spilling
heuristic (again by Shao) led to a further speedup of the code. The
SML/NJ 1.03z compiler does not use FPU load and store instructions for
loading and storing floating-point numbers. It is strongly suspected
that changing this would lead to further large improvements in the
running time.
Like the other functional versions of the pseudoknot program (including
the Haskell and CLEAN versions), the SML version is ``pure'' in the
sense that all of the data structures are immutable. (The SML program
is also evaluated strictly; note, however, that the Haskell and CLEAN
versions also use mostly strict evaluation.) Significant improvements
might be possible if mutable arrays were used, in a manner similar to
the C version of the program. However, one suspects that such a
program would be less reliable and less understandable than the
current version.
A large number of variations on the SML program were tried, in an
attempt to find better performance. The execution time was
surprisingly stable under these changes, and in fact no change made
any significant difference (good or bad) to the execution
speed. Almost all of the functions in the program are presently
curried; uncurrying made a barely measurable improvement. All of the
various ``deforestation'' transformations used by the other
implementations were also tried, and again none made any measurable
improvement (though several led to minor slowdowns). It was recognized
early on that the representation of the ``tfo'' matrix as a large
tuple of values leads to a considerable amount of register spilling in
the code generated by SML/NJ\@. However, changing the representation
to an array of floating-point numbers made no improvement. The current
version of the program also uses the late formulation for position
computation of atoms; changing this to the early formulation greatly
increased both the time and space required. In the end, the original
transcription of the Scheme program, with the signature constraint,
was used for all of the measurements.
The SML/NJ implementation of the pseudoknot program performs better on the DECstation 5000 than
on the SPARC. On the DECstation 5000 it runs at 55\% of the speed of C,
on the SPARC it runs at only 36\% of the speed of C. We suspect that
this is mainly due to memory effects. Previous studies~\cite{Diw94}
have shown that the intensive heap allocation characteristic of SML/NJ
interacts badly with memory subsystems that use a
write-no-allocate cache policy, as is the case on the SPARC; in
contrast, the write-allocate policy coupled with what amounts
to sub-block placement on the DECstation (the cache block size is four
bytes) supports such intensive heap allocation extremely well.
\subsection{Stoffel}
The Stoffel language is an intermediate language designed for studying
code generation from high-level languages to fine-grain parallel
processors (see, for example, Fisher~\cite{Fis84}). Because of this, the
implementation is robust but lacks the sophistication of some of the
other compilers discussed here. The source code that has been used is
the same as that used for the FAST compiler (see Section~\ref{sec:FAST}),
but without the strictness annotations. The intermediate format
used by the FAST compiler has been translated automatically into
Stoffel by the FAST compiler front-end.
The Stoffel compiler is based on a spineless abstract machine and
implements many of the optimizations commonly found in such compilers,
such as strictness and boxing analysis. The implementation uses double
precision floating point numbers, which can also be passed around in
unboxed form.
Most discouraging were the compilation times: the Stoffel compiler
itself takes about 216 seconds to run, while the gcc backend with
--O2 takes more than two hours on machine~\sysstoffel~(c.f.
Table~\ref{tbl:machine}). Most of this time is spent compiling the
function that initialises the floating point number representations.
Both the late and early formulations of the program were run {\it out
of the box}, with no tweaking. The late formulation runs 30\%
faster than the early formulation; the difference is relatively small
because no strictness annotations have been used.
\subsection{Trafola}
The Trafola language and system has been designed as a (pure
functional) transformation language in a compiler construction
project. In addition to the usual functional constructs Trafola offers
nondeterministic pattern matching. This allows for an expressive coding
of algorithms, in particular tree transformations. The first version of
the system and the language were untyped. User defined data types and
Milner/Cardelli type inference were added later. At the moment there
is still only one Trafola compiler and runtime system. This compiler
is able to generate better code from typed programs than from untyped
programs. The runtime system has to support both, at the cost of some
efficiency.
\begin{table}
\begin{center}
\begin{tabular}{|l|r|r@{\,}r@{\,}r|r|}
\hline
floats & compilation (user time) & \mmm{execution} & garbage collection \\
\hline
Single & 6.16 & 162.6 &+& 2.4 & 4\% \\
Double & 6.35 & 186.5 &+& 5.7 & 11\% \\
\hline
\end{tabular}
\end{center}
\caption{Trafola results comparing single and double precision arithmetic.}
\label{tbl:trafola}
\end{table}
The pseudoknot program was transformed from Miranda to typed Trafola
basically by changing syntax. No special Trafola features were
introduced. Basic definitions such as
\verb|fadd = (+)| were inlined and taken from the C~library. The fact
that Trafola was first designed as an untyped language, together with
the provision of powerful pattern matching facilities, results in an
unusual implementation of algebraic data types: each $n$-ary
constructor is represented as an $(n+1)$-tuple, whose first component
is an element of an enumeration type; e.g.\ \verb=Foo a b c= is
implemented as \verb=(Foo,a,b,c)=.
The Trafola version of the pseudoknot program was compiled and
executed using both single and double precision arithmetic. The
results are shown in Table~\ref{tbl:trafola}. Using double precision
arithmetic increases compilation time because the internal
representation of double precision floats is less compact, and because
constant-folding also takes longer for doubles.
At runtime, double precision floats are stored in the heap, whereas
single precision values are stored in the stack. The double
precision version therefore allocates more space and increases both
execution and garbage collection times.
\subsection{Yale Haskell}
The Yale Haskell compiler uses Common Lisp as an intermediate language
and relies on the Common Lisp compiler for translation to machine
code. The version of Yale Haskell tested here uses CMU Common Lisp as
its back end, although it also runs under most other Common Lisp
implementations.
To prepare the pseudoknot program for running in Yale Haskell, all of
the data structures were annotated as strict. (Internally, Yale
Haskell represents the data structures as Common Lisp simple vectors
containing tagged objects.)
The first argument of \verb=get_var= and the arguments of
\verb=make_relative_nuc= have been forced to be strict, because the case
where these arguments are not used is an error. The functions
\verb=is_A=, \verb=is_C=, and \verb=is_G= have been inlined because
they are small; this probably has no effect on the timings. The
functions \verb=atom_pos= and \verb=search= have been inlined to avoid
a higher-order function call. These were the only changes made to the
Chalmers Haskell version of the program; the algorithmic code itself
was unchanged.
The code was compiled with all optimizations enabled at the Haskell
level. At the Lisp level, the same optimizations were used as for
obtaining the CMU Common Lisp results. The program was run with a heap
size of about 14 megabytes.
The late formulation of the pseudoknot program runs more than
twice as fast as the early formulation.
\subsection{CMU Common Lisp}
CMU Common Lisp is a public-domain implementation featuring a
high-quality optimizing compiler. This compiler performs a higher
level of static type checking and type inference than most other Common
Lisp compilers.
In the Common Lisp version of the program, the \verb=pt= and \verb=tfo=
data types were implemented as vectors specialized to hold untagged
\verb=single-float= objects, rather than as general vectors of tagged
objects. Other data types were implemented using the \verb=defstruct=
facility. Type declarations were added to all of the floating-point
arithmetic operations and local variables that hold floating-point
numbers. Otherwise, the code was a straight translation of the program
from Scheme syntax. The resulting code is written in completely
standard Common Lisp and uses no annotations or extensions specific to
the CMU implementation.
The program was compiled with optimizations \verb=(speed 3)=
\verb=(safety 0)= \verb=(debug 0)= \verb=(compilation-speed 0)=. CMU
Lisp's block compilation feature was not used. The program was run with
a heap size of about 14 megabytes.
The difference between the Yale Haskell and Common Lisp results shown in
Table~\ref{tbl:execution} is due partly to the use of tagged versus
untagged arrays, and partly to the overhead of lazy lists in Haskell.
These were the only significant differences between the hand-written
Common Lisp code and the Lisp code produced by the Yale Haskell compiler. The
Haskell code generator could be extended to use untagged arrays for
homogeneous floating-point tuple types as well, but this has not yet
been implemented.
\section{Results}
\label{sec:results}
The pseudoknot program has always been compiled with option settings
that should give fast execution: we have consistently tried to
optimise for execution speed. The compile time and run time options
used are shown in Table~\ref{tbl:option}. To achieve the best
performance, no debugging, run-time checking or profiling code has been
generated. Where a ``--O'' option or a higher optimisation setting could
be used to generate faster code, we have done so.
\begin{table}
\small
\begin{center}
\begin{tabular}{|r|l|l|l|l|l|l|}
\hline
no. &machine & mem. & cache & op. system & processor & C compiler \\
\hline
\sysbigloo &SUN 4/670 & 64 M& 64 K &SunOS 4.1.3. &standard &gcc 2.5.7 \\
\syscaml &DECstat. 5000/200 &32 M&64 K &Ultrix 4.1. &standard &cc --O2 \\
\syschalmers &SUN SPARC 10/30& 32 M&1 M &SunOS 4.1.3. &TMS390Z55 &gcc 2.5.8 \\
\sysclean &SUN 4/670 & 64 M&64 K &SunOS 4.1.2. &Cypress CY605&gcc 2.4.5 \\
\syscmucl &SUN 4/690MP & 64 M&64 K &SunOS 4.1.3 &ROSS 40MHz Super&cc \\
\syserlang &SUN 4/50 & 32 M&64 K &SunOS 4.1.3. &standard &gcc 2.5.8 \\
\sysfast &SUN 4/690 & 64 M&64 K &SunOS 4.1.2. &standard &gcc 2.5.8 \\
\sysgambit &HP9000/385 & 32 M&4K/4K &HP-UX B.09.00&68040 &gcc 2.0 \\
\sysglasgow &SUN SPARC 10/41& 96 M&20K/32K+1M&SunOS 4.1.3.&TMS390Z50 &gcc 2.5.7 \\
\sysid &SUN SPARC 10/41& 96 M& 1 M &SunOS 4.1.3. &standard &gcc 2.5.8 \\
\sysmlworks &SUN 4/330 & 96 M& &SunOS 4.1.1. &standard &gcc 2.5.4 \\
\sysopal &SUN 4/75 & 64 M&64 K &SunOS 4.1.3. &standard &gcc 2.5.8 \\
\syspolyml &SUN 4/690 & 64 M&1 M &SunOS 4.1.3. &standard &gcc 2.5.8 \\
\syssisals &SUN 4/670MP & 64 M&1 M &SunOS 4.1.3. &SUNW, system 600&gcc 2.5.8 \\
\syssisalc &CRAY C90 & 1024 M & none &UNICOS 7.C &custom vector&scc 4.0 \\
\syssml &DECstat. 5000/200& 128 M&64 K &Mach 2.6 &standard &gcc 2.4 \\
\sysstoffel &SUN SPARC 10/41&128 M&20K/32K+1M&Solaris 2.3 &standard &gcc 2.5.8 \\
\systrafola &SUN 4/670 & 64 M&64 K &SunOS 4.1.3. &Cypress CY605&gcc 2.4.5 \\
\sysfloat &SUN SPARC 5 & 32 M&24 K &SunOS 4.1.3. &standard (85 MHz)&gcc 2.5.8 \\
\sysfacile &SUN SPARC 10/41& 64 M&1M &SunOS 4.1.3. &standard &gcc 2.5.7 \\
\hline
\end{tabular}
\end{center}
\normalsize
\caption{Details of the machines and C compilers used to compile and/or
execute the pseudoknot program. The make and type of each machine is
followed by the size of the memory (in Mbytes), the size of the cache
(as a total or as instruction/data + secondary cache size),
the operating system name and version, and the type of processor
used. The last column gives the C compiler that has been used on the
machine.}
\label{tbl:machine}
\end{table}
We have tried to use one and the same machine where possible, but not
all programs could be compiled and/or executed on the same machine,
either because of commercial considerations, or because of the lack of
a SPARC code generator. An overview of the machines used may be found
in Table~\ref{tbl:machine}.
To factor out architectural differences, the C version of the
pseudoknot program has been timed on all the machines involved. The
execution time of the C version serves as the basic unit of time. The
unit ``pseudoknot'' gives the relative speed with respect to C
execution. It is computed as:
\[
\mbox{relative speed} ~ = ~
\frac{100 \times \mbox{C execution time}}{\mbox{absolute time}}
\]
To ``run at 100 knots'' thus means to run at exactly the same speed as
C, and a speed of 1 knot is 1\% of the speed of C.
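For example, the C version takes 2.71 seconds of user time on
machine~\sysfast; an implementation that runs the pseudoknot program
there in 3.7 seconds of user time therefore rates
$100 \times 2.71 / 3.7 \approx 73$ pseudoknots (c.f. the Sisal (SUN)
entry in Table~\ref{tbl:execution}).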
\begin{table}
\begin{center}
\begin{tabular}{|r|l|r|}
\hline
machine & cache size & pseudoknots \\
\hline
\sysfast & 64KB & 73\% \\
\sysstoffel & 20K/32K+1M & 92\% \\
\syssisalc & 36K+1M & 100\% \\
\hline
\end{tabular}
\end{center}
\caption{The relative performance of the Sisal version as a function of
the cache size.}
\label{tbl:cache}
\end{table}
The functional language compilers generate code that requires more
memory than the code generated by the C compilers. Therefore, the
functional implementations require larger caches to accommodate the
same fraction of the code and data as the C implementation. This has a
measurable effect on the pseudoknots. For example, we found the
pseudoknots for the Sisal version of the benchmark program to increase
with the cache size. Table~\ref{tbl:cache} shows the speed of the Sisal
version of the pseudoknot program relative to that of C on three
machines with different cache sizes. We could have made the results
look better by using a machine with a cache larger than 64K for all
experiments. We have chosen not to do so, because using large amounts
of memory is itself an important cost factor.
The ``pseudoknot'' as defined above is a relative measure. In the
long run an absolute measure is also useful, both as a measure of overall
system performance and to demonstrate absolute improvements in functional
compiler technology independently of improvements in the C compilers
(c.f. Dhrystone and Whetstone). This requires choosing
a particular architecture and C compiler as a reference to represent
the absolute 100\% mark. Most execution times have been obtained on
machine~\sysfast~(c.f. Table~\ref{tbl:machine}), using GCC version
2.5.8, so this combination is a convenient choice to
deliver the absolute 100\% pseudoknot. The C execution times for this
machine are 2.71 seconds user time and 0.20 seconds system time.
To measure the times required by the faster programs with a reasonable
degree of accuracy, the programs have been timed in a Unix C-shell loop,
as in \verb=time repeat 10 a.out= or even \verb=time repeat 100 a.out=.
The resulting system and user times, divided by 10 (or 100), are
reported in Tables~\ref{tbl:compilation} and~\ref{tbl:execution},
sorted by relative pseudoknot rating on the user times.
\begin{table}
\begin{minipage}{\hsize}
\begin{center}
\begin{tabular}{|l|c|c|r@{\,}r@{\,}r|rr|r@{\,}r@{\,}r|r@{\,}r@{\,}r|r@{\,}r@{\,}r|}
\hline
compiler &route& mach. & \mmm{time(s)} & \mm{space} &\mmm{C-time(s)}&\multicolumn{6}{c|}{pseudoknots (\%)}\\
& & & user &+& sys & Mb &\footnote{A = Mbytes allocated space; H = Mbytes heap size; R = Mbytes maximum resident set size}
& user &+& sys & \mmm{rel} & \mmm{abs} \\
\hline
Gofer & I &\sysfast & 6.7 &+& 0.7 & 3\dz & A & 2.71 &+& 0.20 & 40.4 &+& 28.6 & 40.4 &+& 28.6 \\
RUFLI & I &\sysfast & 9.1 &+& 1.7 & 1\dz & A & 2.71 &+& 0.20 & 29.8 &+& 11.8 & 29.8 &+& 11.8 \\
Miranda & I &\sysfast & 12.5 &+& 0.8 & 13\dz & A & 2.71 &+& 0.20 & 21.7 &+& 25.0 & 21.7 &+& 25.0 \\
&&&&&&&&&&&&&&&&\\
Caml Light & I &\syscaml & 29.7 &+& 1.1 & 2.3 & A & 2.66 &+& 0.05 & 9.0 &+& 4.5 & 9.1 &+& 18.2 \\
Clean & N &\sysclean & 30\dz &+& 10\dz & 9\dz & A & 2.69 &+& 0.55 & 9.0 &+& 5.5 & 9.0 &+& 2.0 \\
Trafola & I &\sysfast & 31.4 &+& 11.5 & 6\dz & A & 2.71 &+& 0.10 & 8.6 &+& 1.7 & 8.6 &+& 1.7 \\
RUFL & N &\sysfast & 41.6 &+& 8\dz & 3\dz & A & 2.71 &+& 0.20 & 6.5 &+& 2.5 & 6.5 &+& 2.5 \\
Poly/ML & I &\syspolyml & 22.4 &+& 3.1 & 3.3 & A & 1.28 &+& 0.05 & 5.7 &+& 1.6 & 12.1 &+& 6.5 \\
Bigloo & C &\sysbigloo & 56.5 &+& 6.4 & 7.5 & A & 3.00 &+& 0.10 & 5.3 &+& 1.6 & 4.8 &+& 3.1 \\
Gambit & N &\sysgambit & 147\dz &+& 77\dz & 14\dz & H & 6.34 &+& 0.12 & 4.3 &+& 0.2 & 1.8 &+& 0.3 \\
CMU CL & N &\syscmucl & 118\dz &+& 25\dz & 14\dz & H & 4.73 &+& 0.43 & 4.0 &+& 1.7 & 2.3 &+& 0.8 \\
Camloo & S+C &\sysbigloo & 98\dz &+& 17.6 & 4.6 & A & 3.00 &+& 0.10 & 3.1 &+& 0.6 & 2.8 &+& 1.1 \\
Caml Gallium& N &\syscaml & 129\dz &+& 4.4 & 3.7 & A & 2.66 &+& 0.05 & 2.1 &+& 1.1 & 2.1 &+& 4.5 \\
&&&&&&&&&&&&&&&&\\
SML/NJ & N &\syssml & 131\dz &+& 4.4 & 50\dz & A & 1.90 &+& 0.10 & 1.5 &+& 2.3 & 2.1 &+& 4.5 \\
Facile & N &\sysfacile & 123\dz &+& 2.5 & 11.3 & A & 1.71 &+& 0.08 & 1.4 &+& 3.2 & 2.2 &+& 8.0 \\
NHC(HBC) & I &\syschalmers & 122\dz &+& 7.3 & 30\dz & A & 1.56 &+& 0.06 & 1.3 &+& 0.8 & 2.2 &+& 2.7 \\
Sisal (SUN) & N &\syssisals & 112\dz &+& 13.3 & 2.4 & A & 1.40 &+& 0.05 & 1.3 &+& 0.4 & 2.4 &+& 1.5 \\
MLWorks & N &\sysmlworks & 394\dz &+& 19\dz & 14.4 & R & 4.90 &+& 0.00 & 1.2 &+& 0.0 & 0.7 &+& 1.1 \\
&&&&&&&&&&&&&&&&\\
Sisal (CRAY)& N &\syssisalc & 82.5 &+& 15.5 & 24\dz & A & 0.76 &+& 0.02 & 0.9 &+& 0.2 & 3.3 &+& 1.3 \\
Yale & L &\syscmucl & 610\dz &+& 186\dz & 14\dz & H & 4.73 &+& 0.43 & 0.8 &+& 0.2 & 0.4 &+& 0.1 \\
Chalmers & N &\syschalmers & 181\dz &+& 45\dz & 50\dz & A & 1.33 &+& 0.06 & 0.7 &+& 0.1 & 1.5 &+& 0.4 \\
FAST & C &\sysfast & 450\dz &+& 40\dz & 100\dz & A & 2.71 &+& 0.20 & 0.6 &+& 0.5 & 0.6 &+& 0.5 \\
&&&&&&&&&&&&&&&&\\
NHC(NHC) & I &\syschalmers & 560\dz &+& 5.0 & 8.7 & A & 1.56 &+& 0.06 & 0.3 &+& 1.2 & 0.5 &+& 4.0 \\
Glasgow & C &\sysglasgow & 564\dz &+& 30\dz & 47\dz & A & 1.28 &+& 0.05 & 0.2 &+& 0.2 & 0.5 &+& 0.7 \\
Opal & C &\sysopal & 1301\dz &+& 19\dz & 15\dz & A & 3.00 &+& 0.08 & 0.2 &+& 0.4 & 0.2 &+& 1.1 \\
&&&&&&&&&&&&&&&&\\
Erlang BEAM & C &\syserlang & \mmm{$>$ 1 Hour} & 8\dz & A & 3.27 &+& 0.10 & \mmm{---} & \mmm{---} \\
CeML & C &\sysbigloo & \mmm{$>$ 1 Hour} & 35\dz & A & 3.00 &+& 0.10 & \mmm{---} & \mmm{---} \\
ID & C &\sysid & \mmm{$>$ 1 Hour} & 64\dz & A & 2.75 &+& 0.05 & \mmm{---} & \mmm{---} \\
Stoffel & C &\sysstoffel & \mmm{$>$ 2 Hours} & 25\dz & A & 1.28 &+& 0.05 & \mmm{---} & \mmm{---} \\
\hline
cc --O & N &\sysfast & 325\dz &+& 26\dz & 8\dz & A & 2.71 &+& 0.20 & 0.8 &+& 0.8 & 0.8 &+& 0.8 \\
gcc --O & N &\sysfast & 910\dz &+& 97\dz & 21\dz & A & 2.71 &+& 0.20 & 0.3 &+& 0.2 & 0.3 &+& 0.2 \\
\hline
\end{tabular}
\end{center}
\end{minipage}
\caption{Results giving the amount of time (user+system time in
seconds) and space (in Mbytes) required for compilation of the
pseudoknot program. The ``pseudoknots'' give the relative speed with
respect to the execution (not compilation) of the C version as a
percentage.}
\label{tbl:compilation}
\end{table}
\subsection{Compilation}
Table~\ref{tbl:compilation} shows the results of compiling the
programs. The first column of the table shows the name of the compiler
(c.f. Table~\ref{tbl:compiler}). The second column ``route'' indicates
whether the compiler produces native code (``N''), code for a special
interpreter (``I''), or compiles to native code through a portable C
back-end (``C''), Lisp (``L''), or Scheme (``S''), or through a
combination of these back-ends. The third column gives a reference to
the particular machine used for compilation (c.f.
Table~\ref{tbl:machine}). The next two columns give the time and space
required to compile the pseudoknot program. Unless noted otherwise,
the space is the largest amount of space required by the compiler, as
obtained from \verb=ps -v= under the heading SIZE. The column marked
``C-time'' gives the time required to execute the C version of the
pseudoknot program on the same machine as the one used for
compilation. The last two columns ``pseudoknots'' show the relative
and absolute performance with respect to C.
It is possible to distinguish broad groups within the compilers. The
fastest compilers are, unsurprisingly, those that compile to an
intermediate code for an interpreter, and which therefore perform few,
if any, optimisations. NHC is an outlier, perhaps because, unlike the
other interpreters, it is a bootstrapping compiler. With the
exception of Bigloo, which is faster than many native compilers,
compilers that generate C are the slowest compilers. Not only does it
take extra time to produce and parse the C, but C compilers have
particular difficulty compiling code that contains large numbers of
constants. As noted in Section~\ref{sec:compilers}, C compilers also have
difficulty compiling the hand written C version of the pseudoknot
program due to this phenomenon. The faster compilers also generally
allocate less space. This may be because the slower compilers
generally apply more sophisticated (and therefore space-intensive)
optimisations.
\begin{table}
\begin{minipage}{\hsize}
\begin{center}
\begin{tabular}{|l|c|c|c|r@{\,}r@{\,}r|rr|r@{\,}r@{\,}r|r@{\,}r@{\,}r|}
\hline
compiler &route& mach. &\footnote{L = late formulation; E = early formulation}
& \mmm{time(s)} &\mm{space} &\multicolumn{6}{c|}{pseudoknots (\%)}\\
& & & & user &+& sys & Mb & \footnote{A = Mbytes allocated space; H = Mbytes heap size}
& \mmm{rel} & \mmm{abs} \\
\hline
Sisal (CRAY)& C &\syssisalc & L & 0.68 &+& 0.03 & 0.1 & A & 111\dz&+& 96.0 & 398\dz&+& 800\dz\\
&&&&&&&&&&&&&&\\
Sisal (SUN) & C &\sysfast & L & 3.7\z &+& 0.2\z & 0.7 & A & 73.2 &+& 100\dz& 73.2 &+& 100\dz\\
Caml Gallium& N &\syscaml & L & 3.8\z &+& 0.1\z & 0.3 & A & 70.7 &+& 62.5 & 72.1 &+& 250\dz\\
Glasgow & C &\sysfast & L & 3.9\z &+& 0.2\z & 1\dz & A & 69.5 &+& 100\dz& 69.5 &+& 100\dz\\
Opal & C &\sysfast & L & 4.7\z &+& 0.5\z & 0.8 & A & 57.4 &+& 40.0 & 57.4 &+& 40.0 \\
Clean & N &\sysfast & L & 5.1\z &+& 0.8\z & 2.5 & A & 52.7 &+& 25.6 & 52.7 &+& 25.6 \\
&&&&&&&&&&&&&&\\
CMU CL & N &\sysfast & L & 5.8\z &+& 3.3\z & 14\dz & H & 46.7 &+& 6.1 & 46.7 &+& 6.1 \\
Gambit & N &\sysgambit & L & 17.3\z &+& 0.2\z & 14\dz & H & 36.6 &+& 60.0 & 15.7 &+& 100\dz\\
SML/NJ & N &\sysfast & L & 7.6\z &+& 2.8\z & 2.6 & A & 35.9 &+& 7.0 & 35.9 &+& 7.0 \\
MLWorks & N &\sysmlworks& L & 13.7\z &+& 0.9\z & 2.6 & A & 35.8 &+& 0.0 & 19.8 &+& 22.2 \\
CeML & C &\sysfast & L & 8.7\z &+& 0.6\z & 2\dz & A & 31.1 &+& 33.3 & 31.1 &+& 33.3 \\
&&&&&&&&&&&&&&\\
FAST & C &\sysfast & L & 11.0\z &+& 0.5\z & 1\dz & A & 24.6 &+& 40.0 & 24.6 &+& 40.0 \\
Camloo & S+C &\sysfast & L & 11.5\z &+& 1.3\z & 4.9 & A & 23.6 &+& 15.4 & 23.6 &+& 15.4 \\
ID & C &\sysfast & L & 11.6\z &+& 2.9\z & 14\dz & A & 23.4 &+& 6.9 & 23.4 &+& 6.9 \\
Bigloo & C &\sysfast & L & 11.7\z &+& 1.1\z & 4.9 & A & 23.2 &+& 18.2 & 23.2 &+& 18.2 \\
Yale & L &\sysfast & L & 11.9\z &+& 7.2\z & 14\dz & H & 22.8 &+& 2.8 & 22.8 &+& 2.8 \\
Chalmers & N &\sysfast & L & 12.1\z &+& 1.0\z & 3\dz & A & 22.4 &+& 20.0 & 22.4 &+& 20.0 \\
Facile & N &\sysfast & L & 15.5\z &+& 4.3\z & 7.9 & A & 17.5 &+& 4.7 & 17.5 &+& 4.7 \\
Stoffel & C &\sysfast & L & 26.6\z &+& 2.1\z & 5.6 & A & 10.2 &+& 9.5 & 10.2 &+& 9.5 \\
&&&&&&&&&&&&&&\\
Erlang BEAM & C &\sysfast & L & 31.8\z &+& 4.5\z & 11\dz & A & 8.5 &+& 4.4 & 8.5 &+& 4.4 \\
Caml Light & I &\sysfast & L & 53.9\z &+& 7.5\z & 0.3 & A & 5.0 &+& 2.7 & 5.0 &+& 2.7 \\
RUFL & N &\sysfast & L & 87\dzz &+& 2.8\z & 3\dz & A & 3.1 &+& 7.1 & 3.1 &+& 7.1 \\
Poly/ML & I &\syspolyml & E & 54.7\z &+& 8.4\z & 3.3 & A & 2.3 &+& 0.6 & 5.0 &+& 2.4 \\
Trafola & I &\sysfast & L & 124\dzz &+& 6.3\z & 10.7 & A & 2.2 &+& 3.2 & 2.2 &+& 3.2 \\
NHC & I &\sysfast & E & 170\dzz &+& 2.2\z & 2.6 & A & 1.6 &+& 9.1 & 1.6 &+& 9.1 \\
&&&&&&&&&&&&&&\\
Gofer & I &\sysfast & L & 370\dzz &+& 12.0\z & 3\dz & A & 0.7 &+& 1.7 & 0.7 &+& 1.7 \\
RUFLI & I &\sysfast & L & 529\dzz &+& 13.0\z & 4\dz & A & 0.5 &+& 1.5 & 0.5 &+& 1.5 \\
Miranda & I &\sysfast & L &1156\dzz &+& 34.0\z & 13\dz & A & 0.2 &+& 0.6 & 0.2 &+& 0.6 \\
\hline
\end{tabular}
\end{center}
\end{minipage}
\caption{Results giving the amount of time (user+system time in
seconds) and space (in Mbytes) required for execution of the pseudoknot
program. The ``pseudoknots'' give the relative speed with respect to
the execution of the C version as a percentage.}
\label{tbl:execution}
\end{table}
\subsection{Execution}
All programs have been executed a number of times, with different heap
sizes, to optimise for speed. The results reported in
Table~\ref{tbl:execution} correspond to the fastest execution; the
times include garbage collection time.
The first column of the table shows the name of the
compiler/interpreter (c.f. Table~\ref{tbl:compiler}). The second column
``route'' duplicates the ``route'' column from Table~\ref{tbl:compiler}.
The third column gives a reference to the
particular machine used (c.f. Table~\ref{tbl:machine}). The fourth
column specifies whether the late (``L'') or early (``E'') formulation of
the program gives the fastest execution. Columns 5 and 6 give the time
and space required to execute the pseudoknot program. Unless noted
otherwise, the space is the largest amount of space required by the
program, as obtained from \verb=ps -v= under the heading SIZE.
The execution time of the C version of the pseudoknot program for the
machine used may be found in Table~\ref{tbl:compilation}.
Please note that in most cases execution times and compilation times
were measured on different machines. The last two columns ``pseudoknots''
show the relative and absolute performance with respect to C.
Again, broad performance groupings can be distinguished.
The highest (relative) pseudoknots are measured for the Sisal compiler,
whose Cray version is actually faster than the corresponding C.
The next best implementations are the Caml Gallium
compiler (eager) and the Glasgow Haskell compiler (lazy). It is
probably fair to say that the Glasgow Haskell version of the program
has been more heavily optimised than the Caml Gallium version.
As the Clean and
Glasgow Haskell compilers show, if the compiler can exploit strictness
at the right points, the presence of lazy evaluation need not be a
hindrance to high performance. It is also interesting that several
of these compilers compile through C rather than being native compilers.
Clearly, it is possible to compile efficient code without generating
assembler directly. Space usage for these compilers is also generally low:
the compilers have clearly optimised both for time and space.
The Lisp, Scheme and SML compilers generally yield very similar
performance (35--50 pseudoknots). An obvious outlier is the Bigloo
optimising Scheme compiler, whose performance is comparable to most of
the non-strict implementations. The two SML compilers (SML/NJ and
MLWorks) are particularly close in both time and heap usage.
Most of the non-strict compilers (Chalmers, FAST, Id, Stoffel, Yale
and Glasgow Haskell on less optimised code) are grouped in the 10--25
pseudoknot range. These compilers typically offer around 75\% of the
performance of eager implementations such as SML/NJ or Gambit, or 50\%
of the performance of CMU Common Lisp. Even so, this level of performance
has required the exploitation of strictness through unboxing and
similar optimisations. Without these features, on the basis of the
Glasgow results (Section~\ref{sec:ghc}), performance can be estimated
at approximately 8 pseudoknots, or just under a quarter of the typical
performance of a compiler for an eager language. For these compilers
and this application, laziness therefore directly costs a factor of 3,
with a further 50\% probably attributable to the use of different
implementation techniques for predefined functions etc., which are
required due to the possibility of laziness.
The lowest pseudoknot values are found with the interpretive systems,
which are all less than 10 pseudoknots. This is not particularly
surprising. The interpreters (Caml Light, Poly/ML, NHC,
Trafola) which lie in the 1--10 pseudoknot range are, however,
significantly faster than their conventional brethren, which
generally score less than one pseudoknot. Interpreters for strict
languages (Caml Light, Poly/ML, Trafola) do seem, on the whole, to be
faster than interpreters for non-strict languages (NHC, Gofer, RUFLI,
Miranda).
There are two remaining curiosities: the Erlang and RUFL compilers,
which do not fit clearly into the same groups as other similar
compilers. The low performance of the RUFL compiler may reflect its
relative immaturity, as well as the fact that no special priority has
been placed on floating-point performance. Similarly, the low
performance recorded by the Erlang BEAM compiler reflects the fact
that Erlang is a programming language primarily intended for designing
robust, concurrent real time systems, in which floating-point
performance has a low priority.
The set of Caml compilers offers an interesting spectrum: Caml Gallium
is a slow compiler which produces fast code; Caml Light compiles
quickly, but the code it produces runs relatively slowly; and Camloo
is intermediate between the two.
As might be expected, there is, broadly speaking, a strong trade-off
between compilation time and execution speed: compilers that take
longer to compile generally produce faster code. Only the
Clean implementation offers both fast compilation and fast execution.
For the compiled systems there is also a very rough relationship
between execution speed and heap usage: faster implementations
use less heap. There does not, however, seem to be any
correlation between non-strictness and heap usage.
The fact that Sisal is first-order may be significant, though this is
hard to judge since the only other first-order implementation (Erlang)
yields relatively poor performance. Sisal is also the only
monomorphic language studied, so results here must likewise remain
inconclusive. Of the languages studied, Sisal is the only
one that was specifically designed for ``numeric'' rather than
``symbolic'' computations, and clearly the design works well
for this application. Floating-point performance has traditionally
taken second place in functional language implementations, so
we may hope that these results spur other compiler writers to
attempt to duplicate the Sisal results.
The provision of pattern-matching facilities does not appear to have
any effect on code performance: there are fast and slow
pattern-matching compilers. The fastest compilers all use static type
systems, which are known to assist compilation. The dynamically typed
implementations are, however, intermediate in speed, and comparable
with the statically typed SML systems.
\section{Conclusions}
\label{sec:conclusions}
Over 20 compilers for lazy and strict functional languages have been
benchmarked using a single floating point intensive program. The C
version of the program spends 25\% of its time in the C library
trigonometric and square root routines. The problem at hand may thus be
typical only of geometric problems, and the program should not be
construed as a typical numerical scientific application. The program
is nevertheless useful as a benchmark, because the ``real'' work it
does (the trigonometric and floating point calculations) is so clearly
identifiable: everything else the program does should be kept to a
minimum. Not all of the benchmarked implementations of functional
languages are capable of realising this, but some are!
The compilation speed of most implementations (including the two C
compilers) could be improved significantly. Generating C as
intermediate code does not necessarily make the compiler slow, as
demonstrated by the performance of the Bigloo and Camloo compilers,
but generating fast C does often lead to high compilation times.
To achieve good performance from lazy implementations, it proved
necessary to annotate certain frequently used data structures as
strict. The performance of the pseudoknot program is not sensitive to
incorrectly placed strictness annotations. In general, lazy functional
programs are not so well behaved in this respect. Inserting annotations
is a fine art, as demonstrated by the efforts of the Glasgow team. To
make functional languages more useful than they are now, more effort
should clearly go into providing users with simple-to-use and effective
means of analysing and improving the performance of their programs.
Benchmarking a single program can lead to results which cannot easily
be generalised. Special care has been taken to make the comparison as
fair as possible: the pseudoknot program is not an essentially lazy
program or an essentially non-lazy program; the different
implementations use the same algorithm; most of the binaries were timed
on one and the same machine.
Storing floating point numbers as unboxed values has a significant
positive effect on the execution speed of the pseudoknot program. Most
compilers can deal adequately with unboxed single precision floating
point numbers. Few compilers can also implement unboxed double
precision numbers properly. There is thus a need for better methods of
dealing with double precision unboxed floats, although this problem
might become obsolete with the next generation of 64-bit machine
architectures.
There is no clear distinction between the runtime performance of eager
and lazy implementations when appropriate annotations are used: lazy
implementations have clearly come of age when it comes to implementing
largely strict applications, such as the pseudoknot program. The speed
of C can be approached by some implementations, but not without special
measures such as strictness annotations. On the Cray, the Sisal
version is faster than the C version.
\section*{Acknowledgements}
Mark Jones produced the Gofer version of the pseudoknot program.
Will Partain and Jim Mattson performed many of the experiments reported
here for the Glasgow Haskell compiler. The work at Glasgow is
supported by a SOED Research Fellowship from the Royal Society of
Edinburgh, and by the EPSRC AQUA grant.
The work at Nijmegen is supported by STW (Stichting voor de
Technische Wetenschappen, The Netherlands).
Jon-Dean Mountjoy performed some of the experiments for the RUFL
implementation.
The ID version of the pseudoknot program was the result of a group
effort. Credit goes to Jamey Hicks, R. Paul Johnson, Shail Aditya,
Yonald Chery and Andy Shaw.
Zhong Shao made several important changes to the SML/NJ
implementation, and Andrew Appel and David MacQueen provided general
support in the use of this system. David Tarditi performed several of
the SML/NJ experiments.
John T. Feo and Scott Denton of Lawrence Livermore National Laboratory
collaborated on the Sisal version of the pseudoknot program.
\bibliography{refs}
\bibliographystyle{plain}
\end{document}