1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995
|
/* Apophenia's narrative documentation
Copyright (c) 2005--2013 by Ben Klemens. Licensed under the modified GNU GPL v2; see COPYING and COPYING2. */
/** \mainpage Apophenia--the intro
Apophenia is an open statistical library for working with data sets and statistical
models. It provides functions on the same level as those of the typical stats package
(such as OLS, probit, or singular value decomposition) but gives the user more
flexibility to be creative in model-building. The core functions are written in C,
but experience has shown them to be easy to bind to in Python/Julia/Perl/Ruby/&c.
It is written to scale well, to comfortably work with gigabyte data sets, million-step simulations, or
computationally-intensive agent-based models. If you have tried using other open source
tools for computationally demanding work and found that those tools weren't up to the
task, then Apophenia is the library for you.
<h5>The goods</h5>
The library has been growing and improving since 2005, and has been downloaded over 10,000 times. To date, it has over two hundred functions to facilitate statistical computing, such as:
\li OLS and family, discrete choice models like probit and logit, kernel density estimators, and other common models
\li database querying and maintenance utilities
\li moments, percentiles, and other basic stats utilities
\li t-tests, F-tests, et cetera
\li Several optimization methods available for your own new models
\li It does <em>not</em> re-implement basic matrix operations or build yet another database
engine. Instead, it builds upon the excellent <a href="http://www.gnu.org/software/gsl/">GNU
Scientific</a> and <a href="http://www.sqlite.org/">SQLite</a> libraries. MySQL/mariaDB is also supported.
For the full list, click the <a href="globals.html">index</a> link from the header.
<h5><a href="https://github.com/b-k/apophenia/archive/pkg.zip">Download Apophenia here</a>.</h5>
Most users will just want to download the latest packaged version linked from the <a
href="https://github.com/b-k/apophenia/archive/pkg.zip">Download
Apophenia here</a> header.
Those who would like to work on a cutting-edge copy of the source code
can get the latest version by cutting and pasting the following onto
the command line. If you follow this route, be sure to read the development README in the
<tt>apophenia</tt> directory this command will create.
\code
git clone https://github.com/b-k/apophenia.git
\endcode
<!--git clone git://apophenia.git.sourceforge.net/gitroot/apophenia/apophenia
cvs -z3 -d:ext:<i>(your sourceforge login)</i>@cvs.sourceforge.net:/cvsroot/apophenia co -P apophenia
cvs -z3 -d:pserver:anonymous@cvs.sf.net:/cvsroot/apophenia checkout -P apophenia
svn co https://apophenia.svn.sourceforge.net/svnroot/apophenia/trunk/apophenia -->
<h5>The documentation</h5>
To start off, have a look at this \ref gentle "Gentle Introduction" to the library.
<a href="outline.html">The outline</a> gives a more detailed narrative.
The <a href="globals.html">index</a> lists every function in the
library, with detailed reference
information. Notice that the header to every page has a link to the outline and the index.
To really go in depth, download or pick up a copy of <a
href="http://modelingwithdata.org">Modeling with Data</a>, which discusses general
methods for doing statistics in C with the GSL and SQLite, as well as Apophenia
itself. <a href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf"><em>A Useful
Algebraic System of Statistical Models</em></a> (PDF) discusses some of the theoretical
structures underlying the library.
There is a <a href="https://github.com/b-k/apophenia/wiki">wiki</a> with some convenience
functions, tips, and so on.
<h5>Notable features</h5>
Much of what Apophenia does can be done in any typical statistics package. The \ref
apop_data element is much like an R data frame, for example, and there is nothing special
about being able to invert a matrix or take the product of two matrices with a single
function call (\ref apop_matrix_inverse and \ref apop_dot, respectively).
Even more advanced features like Loess smoothing (\ref apop_loess) and the Fisher Exact
Test (\ref apop_test_fisher_exact) are not especially Apophenia-specific. But here are
some things that are noteworthy.
\li The text file parser is flexible and effective. Such data files are typically
called `CSV files', meaning <em>comma-separated values</em>, but the delimiter can be
anything (or even some mix of things), and there is no requirement that text have
"special delimiters". Missing data can be specified by a simple blank or a marker
of your choosing (e.g., <tt>apop_opts.nan_string = "N/A";</tt>). Or there can be
no delimiters, as in the case of fixed-width files. If you are a heavy SQLite user,
Apophenia may be useful to you simply for its \ref apop_text_to_db function.
\li The maximum likelihood system combines a lot of different subsystems into one
form: it will do a few flavors of conjugate gradient search, Nelder-Mead Simplex,
Newton's Method, or Simulated Annealing. You pick the method by a setting attached to
your model. If you want to use a method that requires derivatives and you don't have
a closed-form derivative, the ML subsystem will estimate a numerical gradient for
you. If you would like to do EM-style maximization (all but the first parameter are
fixed, that parameter is optimized, then all but the second parameter are fixed, that
parameter is optimized, ..., looping through dimensions until the change in objective
across cycles is less than <tt>eps</tt>), just add a settings group specifying the
tolerance at which the cycle should stop: <tt>Apop_settings_add_group(your_model,
apop_mle, .dim_cycle_tolerance=eps)</tt>.
\li The Iterative Proportional Fitting algorithm, \ref apop_rake, is best-in-breed,
designed to handle large, sparse matrices.
\li As well as the \ref apop_data structure, Apophenia is built around a model object,
the \ref apop_model. This allows for consistent treatment of distributions, regressions,
simulations, machine learning models, and who knows what other sorts of models you can
dream up. By transforming and combining existing models, it is easy to build complex
models from simple sub-models.
\li For example, the \ref apop_update function does Bayesian updating on any two
well-formed models. If they are on the table of conjugates, that is correctly
handled, and if they are not, an appropriate variant of MCMC produces an empirical
distribution. The output is yet another model, from which you can make random draws,
or which you can use as a prior for another round of Bayesian updating. Outside of
Bayesian updating, the \ref apop_model_metropolis function is good for approximating
other complex models.
\li Of course, it's a C library, meaning that you can build applications using Apophenia
for the data-processing back-end of your program. For example, it is currently used
in production for certain aspects of processing for the U.S. Census Bureau's American
Community Survey.
<h5>Contribute!</h5>
\li Develop a new model object.
\li Contribute your favorite statistical routine.
\li Package Apophenia into an RPM, apt, portage, cygwin package.
\li Report bugs or suggest features.
\li Write bindings for your preferred language. For example, here are early versions of <a
href="http://modelingwithdata.org/arch/00000173.htm"> a Julia
wrapper</a> and <a href="https://r-forge.r-project.org/projects/rapophenia/">an R
wrapper</a> which you could expand upon.
If you're interested, <a href="mailto:fluffmail@f-m.fm">write to the maintainer</a> (Ben Klemens), or join the
<a href="https://github.com/b-k/apophenia">GitHub</a> project.
*/
/** \page eg Some examples
Here are a few pieces of sample code, gathered from elsewhere in the documentation, for testing your installation or to give you a sense of what code with Apophenia's tools looks like. If you'd like more context or explanation, please click through to the page from which the example was taken.
In the documentation for the \ref apop_ols model, a program to read in data and run a regression. You'll need to go to that page for the sample data and further discussion.
From the \ref setup page, an example of gathering data from two processes, saving the input to a database, then doing a later analysis:
\include draw_to_db.c
In the \ref outline section on map/apply, a new \f$t\f$-test on every row, with all operations acting on entire rows rather than individual data points:
\include t_test_by_rows.c
In the documentation for \ref apop_query_to_text, a program to list all the tables in an SQLite database.
\include ls_tables.c
A demonstration of fixing parameters to create a marginal distribution, via \ref apop_model_fix_params
\include fix_params.c
Several uses of the \ref apop_dot function
\include dot_products.c
*/
/** \page setup Setting up
\section cast The supporting cast
To use Apophenia, you will need to have a working C compiler, the GSL (v1.7 or higher) and SQLite installed.
\li Some readers are unfamiliar with modern package managers and common methods for setting up a C development environment; see
<a href="http://modelingwithdata.org/appendix_o.html">Appendix O</a> of <em> Modeling with Data</em> for an introduction.
\li Other pages on this site have a few more notes for \ref windows "Windows" users or \ref mingw users.
\li Install the basics using your package manager. E.g., try
\code
sudo apt-get install make gcc libgsl0-dev libsqlite3-dev
\endcode
or
\code
sudo yum install make gcc gsl-devel libsqlite3x-devel
\endcode
\li <a href="https://github.com/b-k/apophenia/archive/pkg.zip">Download Apophenia here</a>.
\li Once you have the library downloaded, compile it using
\code
tar xvzf apop*tgz && cd apophenia-0.999
./configure && make && sudo make install && make check
\endcode
If you decide not to keep the library on your system, run <tt>sudo make uninstall</tt>
from the source directory to remove it.
\li A \ref makefile will help immensely when you want to compile your program.
\subsection sample_program Sample programs
Here is a sample program so you can test your setup. There is another short,
complete program in the \ref apop_ols entry which runs a simple OLS regression on a
data file. Follow the instructions there to compile and run.
The sample program here is intended to show how one would integrate Apophenia into an existing program. For example, say that you are running a simulation of two different treatments, or say that two sensors are posting data at regular intervals. You need to gather the data in an organized form, and then ask questions of the resulting data set. Below, a thousand draws are made from the two processes and put into a database. Then, the data is pulled out, some simple statistics are compiled, and the data is written to a text file for inspection outside of the program. This program will compile cleanly with the sample \ref makefile.
\include draw_to_db.c
*/
/** \page windows The Windows page
\ref mingw users, see that page.
If you have a choice, <a href="http://www.cygwin.com">Cygwin</a> is strongly recommended. The setup program is
very self-explanatory. As a warning, it will probably take up >300MB on
your system. You should install at least the following programs:
\li autoconf/automake
\li binutils
\li gcc
\li gdb
\li gnuplot -- for plotting data
\li groff -- needed for the man program, below
\li gsl -- the engine that powers Apophenia
\li less -- to read text files
\li libtool -- needed for compiling programs
\li make
\li man -- for reading help files
\li more -- not as good as less but still good to have
If you are missing anything else, the program will probably tell you.
The following are not necessary but are good to have on hand as long as you are going to be using Unix and programming.
\li svn -- to partake in the versioning system
\li emacs -- steep learning curve, but people love it
\li ghostscript (for reading .ps/.pdf files)
\li openssh -- needed for cvs
\li perl, python, ruby -- these are other languages that you might also be interested in
\li tetex -- write up your documentation using the nicest-looking formatter around
\li X11 -- a windowing system
X-Window will give you a nicer environment in which to work. After you start Cygwin, just type <tt>startx</tt> to bring up a more usable, nice-looking terminal (and the ability to do a few thousand other things which are beyond the scope of this documentation). Once you have Cygwin installed and a good terminal running, you can follow along with the remainder of the discussion without modification.
Sqlite3 is difficult to build from scratch, but you can get a packaged version by pointing Cygwin's install program to the Cygwin Ports site: http://cygwinports.dotsrc.org/ .
Second, some older (but still pretty recent) versions of Cygwin have a search.h file which doesn't include the function lsearch(). If this is the case on your system, you will have to update your Cygwin installation.
Finally, windows compilers often spit out lines like:
\code
Info: resolving _gsl_rng_taus by linking to __imp__gsl_rng_taus (auto-import)
\endcode
These lines are indeed just information, and not errors. Feel free to ignore them.
[Thanks to Andrew Felton and Derrick Higgins for their Cygwin debugging efforts.]
*/
/** \page notroot Not root?
If you aren't root, then you will need to create a subdirectory in your home directory in which to install packages. The GSL and SQLite installations will go like this. The key is the <tt>--prefix</tt> addition to the <tt>./configure</tt> command.
\code
export MY_LIBS = src #choose a directory name to be created in your home directory.
tar xvzf pkg.tgz #change pkg.tgz to the appropriate name
cd package_dir #same here.
mkdir $HOME/$MY_LIBS
./configure --prefix $HOME/$MY_LIBS
make
make install #Now you don't have to be root.
echo "export LD_LIBRARY_PATH=$HOME/$MY_LIBS:\$LD_LIBRARY_PATH" >> ~/.bashrc
\endcode
*/
/** \page makefile Makefile
Instead of giving lengthy compiler commands at the command prompt, you can use a Makefile to do most of the work. How to:
\li Copy and paste the following into a file named \c makefile.
\li Change the first line to the name of your program (e.g., if you have written <tt>sample.c</tt>, then the first line will read <tt>PROGNAME=sample</tt>).
\li If your program has multiple <tt>.c</tt> files, add a corresponding <tt>.o</tt> to the currently blank <tt>objects</tt> variable, e.g. <tt>objects=sample2.o sample3.o</tt>
\li One you have a Makefile in the directory, simply type <tt>make</tt> at the command prompt to generate the executable.
\code
PROGNAME = your_program_name_here
objects =
CFLAGS = -g -Wall -O3
LDLIBS = -lapophenia -lgsl -lgslcblas -lsqlite3
$(PROGNAME): $(objects)
\endcode
\li If your system has \c pkg-config, then you can use it for a slightly more robust and readable makefile. Replace the above C and link flags with:
\code
CFLAGS = -g -Wall `pkg-config --cflags apophenia` -O3
LDLIBS = `pkg-config --libs apophenia`
\endcode
The \c pkg-config program will then fill in the appropriate directories and libraries. Pkg-config knows Apophenia depends on the GSL and database libraries, so you need only list the most-dependent library.
\li The -O3 flag is optional, asking the compiler to run its highest level of optimization (for speed).
\li GCC users may need the <tt>--std=gnu99</tt> or <tt>--std=gnu11</tt> flag to use post-1989 C standards.
\li Order matters in the linking list: the files a package depends on should be listed after the package. E.g., since sample.c depends on Apophenia, <tt>gcc sample.c -lapophenia</tt> will work, while <tt>gcc -lapophenia sample.c</tt> is likely to give you errors. Similarly, list <tt>-lapophenia</tt> before <tt>-lgsl</tt>, which comes before <tt>-lgslcblas</tt>.
*/
/** \page designated Designated initializers
Functions so marked in this documentation use standard C designated initializers and compound literals to allow you to omit, call by name, or change the order of inputs. The following examples are all equivalent.
The standard format:
\code
apop_text_to_db("infile.txt", "intable", 0, 1, NULL);
\endcode
Omitted arguments are left at their default vaules:
\code
apop_text_to_db("infile.txt", "intable");
\endcode
You can use the variable's name, if you forget its ordering:
\code
apop_text_to_db("infile.txt", "intable", .has_col_name=1, .has_row_name=0);
\endcode
If an un-named element follows a named element, then that value is given to the next variable in the standard ordering:
\code
apop_text_to_db("infile.txt", "intable", .has_col_name=1, NULL);
\endcode
\li There may be cases where you can not use this form (it relies on a macro, which
may not be available). You can always call the underlying function directly, by adding
\c _base to the name and giving all arguments:
\code
apop_text_to_db_base("infile.txt", "intable", 0, 1, NULL);
\endcode
\li If one of the optional elements is an RNG and you do not provide one, I use one
from \ref apop_rng_get_thread.
\li For exhaustive details on implementation of the above (should you wish to write
new functions that behave like this) see the \ref optionaldetails page.
*/
/** \defgroup global_vars The global variables */
/** \defgroup mle Maximum likelihood estimation */
/** \defgroup command_line "Command line programs" */
/** \page admin The admin page
Just a few page links:
\li <a href=todo.html>The to-do list</a>
\li <a href=bug.html>The known bug list</a>
\li <a href="modules.html">The documentation page list</a>
*/
/** \page outline An outline of the library
ALLBUTTON
Outlineheader preliminaries Getting started
If you are entirely new to Apophenia, \ref gentle "have a look at the Gentle Introduction here".
As well as the information in this outline, there is a separate page covering the details of
\ref setup "setting up a computing environment" and another page with \ref eg "some sample code" for your perusal.
For another concrete example of folding Apophenia into a project, have a look at this \ref sample_program "sample program".
Outlineheader c Some notes on C and Apophenia's use of C utilities.
Outlineheader learning Learning C
<a href="http://modelingwithdata.org">Modeling with Data</a> has a full tutorial for C, oriented at users of standard stats packages. More nuts-and-bolts tutorials are <a href="http://www.google.com/search?hl=es&c2coff=1&q=c+tutorial">in abundance</a>. Some people find pointers to be especially difficult; fortunately, there's a <a href="http://www.youtube.com/watch?v=6pmWojisM_E">claymation cartoon</a> which clarifies everything.
Coding often relies on gathering together many libraries; there is a section at the bottom of this outline linking to references for some libraries upon which Apophenia builds.
endofdiv
Outlineheader usagenotes Usage notes
Here are some notes about the technical details of using the Apophenia library in your development environment.
<b> Header aggregation </b>
There is only one header. Put
\code
#include <apop.h>
\endcode
at the top of your file, and you're done. Everything declared in that file starts with \c apop_ (or \c Apop_).
<b>Linking</b>
You will need to link to the Apophenia library, which involves adding the <tt>-lapophenia</tt> flag to your compiler. Apophenia depends on SQLite3 and the GNU Scientific Library (which depends on a BLAS), so you will probably need something like:
\code
gcc sample.c -lapophenia -lsqlite3 -lgsl -lgslcblas -o run_me -g -Wall -O3
\endcode
Your best bet is to encapsulate this mess in a \ref makefile "Makefile". Even if you are using an IDE and its command-line management tools, see the Makefile page for notes on useful flags.
<b>Debugging</b>
The end of <a href="http://modelingwithdata.org/appendix_o.html">Appendix O</a>
of <em>Modeling with Data</em> offers some GDB macros which can make dealing with
Apophenia from the GDB command line much more pleasant. As per the next section, it
also helps to set <tt>apop_opts.stop_on_warning='v'</tt> or <tt>'w'</tt> when running
under the debugger.
<b>Standards compliance</b>
To the best of our abilities, Apophenia complies to the C standard (ISO/IEC 9899:2011).
<b> Easier calling syntax</b>
Many functions allow optional named arguments to functions. For
example:
\code
apop_vector_distance(v1, v2); //assumes Euclidean distance
apop_vector_distance(v1, .metric='M'); //assumes v2=0, uses Manhattan metric.
\endcode
See the \ref designated page for details of this syntax.
endofdiv
endofdiv
Outlineheader debugging Errors, logging, debugging and stopping
<h5>The \c error element</h5>
The \ref apop_data set and the \ref apop_model both include an element named \c error. It is normally \c 0, indicating no (known) error.
For example, \ref apop_data_copy detects allocation errors and some circular links
(when <tt>Data->more == Data</tt>) and fails in those cases. You could thus use the
function with a form like
\code
apop_data *d = apop_text_to_data("indata");
apop_data *cp = apop_data_copy(d);
if (cp->error) {printf("Couldn't copy the input data; failing.\n"); return 1;}
\endcode
There is sometimes (but not always) benefit to handling specific error codes, which are listed in the documentation of those functions that set the \c error element. E.g.,
\code
apop_data *d = apop_text_to_data("indata");
apop_data *cp = apop_data_copy(d);
if (cp->error == 'a') {printf("Couldn't allocate space for the copy; failing.\n"); return 1;}
if (cp->error == 'c') {printf("Circular link in the data set; failing.\n"); return 2;}
\endcode
<h5>Verbosity level and logging</h5>
The global variable <tt>apop_opts.verbose</tt> determines how many notifications and warnings get printed by Apophenia's warning mechanism:
-1: turn off logging, print nothing (ill-advised) <br>
0: notify only of failures and clear danger <br>
1: warn of technically correct but odd situations that might indicate, e.g., numeric instability <br>
2: debugging-type information; print queries <br>
3: give me everything, such as the state of the data at each iteration of a loop.
These levels are of course subjective, but should give you some idea of where to place the
verbosity level. The default is 1.
The messages are printed to the \c FILE handle at <tt>apop_opts.log_file</tt>. If
this is blank (which happens at startup), then this is set to \c stderr. This is the
typical behavior for a console program. Use
\code
apop_opts.log_file = fopen("mylog", "w");
\endcode
to write to the \c mylog file instead of \c stderr.
As well as the error and warning messages, some functions can also print diagnostics,
using the \ref Apop_notify macro. For example, \ref apop_query and friends will print the
query sent to the database engine iff <tt>apop_opts.verbose >=2</tt> (which is useful
when building complex queries). The diagnostics attempt to follow
the same verbosity scale as the warning messages.
<h5>Stopping</h5>
Warnings and errors never halt processing. It is up to the calling function to decide
whether to stop.
When running the program under a debugger, this is an annoyance: we want to stop as
soon as a problem turns up.
The global variable <tt>apop_opts.stop_on_warning</tt> changes when the system halts:
\c 'n': never halt. If you were using Apophenia to support a user-friendly GUI, for example, you would use this mode.<br>
The default: if the variable is <tt>'\0'</tt> (the default), halt on severe errors, continue on all warnings.<br>
\c 'v': If the verbosity level of the warning is such that the warning would print to screen, then halt;
if the warning message would be filtered out by your verbosity level, continue.<br>
\c 'w': Halt on all errors or warnings, including those below your verbosity threshold.
See the documentation for individual functions for details on how each reports errors to the caller and the level at which warnings are posted.
endofdiv
Outlineheader About SQL, the syntax for querying databases
For a reference, your best bet is the <a href="http://www.sqlite.org/lang.html">Structured Query Language reference</a> for SQLite. For a tutorial; there is an abundance of <a href="http://www.google.com/search?q=sql+tutorial">tutorials online</a>. Here is a nice blog <a href="http://fluff.info/blog/arch/00000118.htm">entry</a> about complementaries between SQL and matrix manipulation packages.
Apophenia currently supports two database engines: SQLite and mySQL/mariaDB. SQLite is the default, because it is simpler and generally more easygoing than mySQL, and supports in-memory databases.
The global <tt>apop_opts.db_engine</tt> is initially \c NUL, indicating no preference
for a database engine. You can explicitly set it:
\code
apop_opts.db_engine='s' //use SQLite
apop_opts.db_engine='m' //use mySQL/mariaDB
\endcode
If \c apop_opts.db_engine is still \c NUL on your first database operation, then I will check
for an environment variable <tt>APOP_DB_ENGINE</tt>, and set
<tt>apop_opts.db_engine='m'</tt> if it is found and matches (case insensitive) \c mariadb or \c mysql.
\code
export APOP_DB_ENGINE=mariadb
apop_text_to_db indata mtab db_for_maria
unset APOP_DB_ENGINE
apop_text_to_db indata stab db_for_sqlite.db
\endcode
Finally, Apophenia provides a few nonstandard SQL functions to facilitate math via database; see \ref db_moments.
endofdiv
Outlineheader threads Threading
Apophenia uses OpenMP for threading. You generally do not need to know how OpenMP works
to use Apophenia, and many points of work will thread without your doing anything.
\li All functions strive to be thread-safe. Part of how this is achieved is that static
variables are marked as thread-local or atomic, as per the C standard. There still
exist compilers that can't implement thread-local or atomic variables, in which case
your safest bet is to set OMP's thread count to one as below (or get a new compiler).
\li Some functions modify their inputs. It is up to you to use those functions in
a thread-safe manner. The \ref apop_matrix_realloc handles states and global variables
correctly in a threaded environment, but if you have two threads resizing the same \c
gsl_matrix at the same time, you're going to have problems.
\li There are few compilers that don't support OpenMP. Clang on MacOS may be the only
current mainstream example as of this writing, and they are hard at work on implementing
it. In the mean time, when compiling on such a system all work will be single-threaded.
\li Set the maximum number of threads to \c N with the environment variable
\code
export OMP_NUM_THREADS N
\endcode
or the C function
\code
#include <omp.h>
omp_set_num_threads(N);
\endcode
Use one of these methods with <tt>N=1</tt> if you want a single-threaded program.
\li \ref apop_map and friends distribute their \c for loop over the input \ref apop_data
set across multiple threads. Therefore, be careful to send thread-unsafe functions to
it only after calling \c omp_set_num_threads(1).
\li There are a few functions, like \ref apop_model_draws, that rely on \ref apop_map, and
therefore also thread by default.
\li The function \ref apop_rng_get_thread retrieves a statically-stored RNG specific
to a given thread. Therefore, if you use that function in the place of a \c gsl_rng,
you can parallelize functions that make random draws.
\li \ref apop_rng_get_thread allocates its store of threads using <tt>apop_opts.rng_seed</tt>,
then incrementing that seed by one. You thus probably have threads with seeds 479901,
479902, 479903, .... [If you have a better way to do it, please feel free to modify the
code to implement your improvement and submit a pull request on Github.]
See <a href="http://modelingwithdata.org/arch/00000175.htm">this tutorial on C
threading</a> if you would like to know more, or are unsure about whether your functions
are thread-safe or not.
endofdiv
Outlineheader mwd The book version
Apophenia co-evolved with <em>Modeling with Data: Tools and Techniques for Statistical Computing</em>. You can read about the book, or download a free PDF copy of the full text, at <a href="http://modelingwithdata.org">modelingwithdata.org</a>.
If you are at this site, there is probably something there for you, including a tutorial on C and general computing form, SQL for data-handing, several chapters of statistics from various perspectives, and more details on working Apophenia.
As with many computer programs, the preferred manner of citing Apophenia is to cite its related book.
Here is a BibTeX-formatted entry giving the relevant information:
\code
@book{klemens:modeling,
title = "Modeling with Data: Tools and Techniques for Statistical Computing",
author="Ben Klemens",
year=2008,
publisher="Princeton University Press"
}
\endcode
The rationale for the \ref apop_model struct, based on an algebraic system of models, is detailed in a <a href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf">U.S. Census Bureau research report</a>.
endofdiv
Outlineheader status What is the status of the code?
[This section last updated 3 August 2014.]
Apophenia was first posted to SourceForge in February 2005, which means that we've had
several years to develop and test the code in real-world applications.
The test suite, including the sample code and solution set for <em>Modeling with Data</em>,
is about 5,500 lines over 135 files. gprof reports that it covers over 90% of the
7,700 lines in Apophenia's code base. A broad rule of thumb for any code base is
that the well-worn parts, in this case functions like \ref apop_data_get and \ref
apop_normal's <tt>log_likelihood</tt>, are likely to be entirely reliable, while the
out-of-the-way functions (maybe the score for the Beta distribution) are worth a bit
of caution. Close to all of the code has been used in production, so all of it was at
least initially tested against real-world data.
It is currently at version 0.999, which is intended to indicate that it is substantially
complete. Of course, a library for scientific computing, or even for that small subset
that is statistics, will never cover all needs and all methods. But as it stands
Apophenia's framework, based on the \ref apop_data and \ref apop_model, is basically
internally consistent, has enough tools that you can get common work done quickly,
and is reasonably fleshed out with a good number of models out of the box.
The \ref apop_data structure is set, and there are enough functions there that you could
use it as a subpackage by itself (especially in tandem with the database functions)
for nontrivial dealings with data.
The \ref apop_model structure is much more ambitious---Apophenia is really intended
to be a novel system for developing models---and its internals can still be improved.
The promise underlying the structure is that you can provide just one item, such as
an RNG or a likelihood function, and the structure will do all of the work to fill in
computationally-intensive methods for everything else; see \ref settingswriting for
the details. Some directions aren't quite there yet (such as RNG -> most other things),
the PMF model needs an internal index for faster lookups, and so on. Readers are invited
to contribute better methods (such as an alternate means of estimating mixture models),
or filling in more of existing models (write a dlog likelihood function for a model
that does not currently have one), or submit new standard models not yet included.
endofdiv
Outlineheader ext How do I write extensions?
It's not a package, so you don't need an API---write your code and <tt>include</tt>
it like any other C code. The system is written to not require a registration or
initialization step to add a new model or other such parts. A new \ref apop_model
has to conform to some rules if it is to play well with \ref apop_estimate,
\ref apop_draw, and so forth. See the notes at \ref modeldetails. Once your new
model or function is working, please post the code or a link to the code on the <a
href="https://github.com/b-k/apophenia/wiki">Apophenia wiki</a>.
endofdiv
endofdiv
Outlineheader dataoverview Data sets
The \ref apop_data structure represents a data set. It joins together a \c gsl_vector, a \c gsl_matrix, an \ref apop_name, and a table of strings. It tries to be lightweight, so you can use it everywhere you would use a \c gsl_matrix or a \c gsl_vector.
If you are viewing the HTML documentation, here is a diagram showing a sample data set with all of the elements in place. Together, they represet a data set where each row is an observation, which includes both numeric and text values, and where each row/column may be named.
\htmlinclude apop_data_fig.html
In a regression, the vector would be the dependent variable, and the other columns
(after factor-izing the text) the independent variables. Or think of the \ref apop_data
set as a partitioned matrix, where the vector is column -1, and the first column of
the matrix is column zero. Here is some code to print the entire matrix, starting at
column -1.
\code
for (int j = 0; j< data->matrix->size1; j++){
printf("%s\t", apop_name_get(data->names, j, 'r'));
for (int i = -1; i< data->matrix->size2; i++)
printf("%g\t", apop_data_get(data, j, i));
printf("\n");
}
\endcode
Most functions assume that the data vector, data matrix, and text have the same row count: \c data->vector->size==data->matrix->size1 and \c data->vector->size==*data->textsize. This means that the \ref apop_name structure doesn't have separate \c vector_names, \c row_names, or \c text_row_names elements: the \c rownames are assumed to apply for all.
See below for notes on easily managing the \c text element and the row/column names.
The \ref apop_data set includes a \c more pointer, which will typically
be \c NULL, but may point to another \ref apop_data set. This is
intended for a main data set and a second or third page with auxiliary
information: estimated parameters on the front page and their
covariance matrix on page two, or predicted data on the front page and
a set of prediction intervals on page two. \c apop_data_copy and \c apop_data_free
will handle all the pages of information. The \c more pointer is not
intended as a linked list for millions of data points---you can probably
find a way to restructure your data to use a single table (perhaps via
\ref apop_data_pack and \ref apop_data_unpack).
There are a great many functions to collate, copy, merge, sort, prune, and otherwise
manipulate the \ref apop_data structure and its components.
\li\ref apop_data_add_named_elmt()
\li\ref apop_data_copy()
\li\ref apop_data_fill()
\li\ref apop_data_memcpy()
\li\ref apop_data_pack()
\li\ref apop_data_rm_columns()
\li\ref apop_data_sort()
\li\ref apop_data_split()
\li\ref apop_data_stack()
\li\ref apop_data_transpose()
\li\ref apop_data_unpack()
\li\ref apop_matrix_copy()
\li\ref apop_matrix_realloc()
\li\ref apop_matrix_rm_columns()
\li\ref apop_matrix_stack()
\li\ref apop_text_add()
\li\ref apop_text_paste()
\li\ref apop_text_to_data()
\li\ref apop_vector_bounded()
\li\ref apop_vector_copy()
\li\ref apop_vector_fill()
\li\ref apop_vector_stack()
\li\ref apop_vector_realloc()
Apophenia builds upon the GSL, but it would be inappropriate to redundantly replicate
the GSL's documentation here. You will find a link to the full GSL documentation at
the end of this outline. Meanwhile, here are prototypes for a few common functions. The GSL's
naming scheme is very consistent, so a simple reminder of the function name may be
sufficient to indicate how they are used.
\li <tt>gsl_matrix_swap_rows (gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>gsl_matrix_swap_columns (gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>gsl_matrix_swap_rowcol (gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>gsl_matrix_transpose_memcpy (gsl_matrix * dest, const gsl_matrix * src)</tt>
\li <tt>gsl_matrix_transpose (gsl_matrix * m) : square matrices only</tt>
\li <tt>gsl_matrix_set_all (gsl_matrix * m, double x)</tt>
\li <tt>gsl_matrix_set_zero (gsl_matrix * m)</tt>
\li <tt>gsl_matrix_set_identity (gsl_matrix * m)</tt>
\li <tt>void gsl_vector_set_all (gsl_vector * v, double x)</tt>
\li <tt>void gsl_vector_set_zero (gsl_vector * v)</tt>
\li <tt>int gsl_vector_set_basis (gsl_vector * v, size_t i)</tt>: set all elements to zero, but set item \f$i\f$ to one.
\li <tt>gsl_vector_reverse (gsl_vector * v)</tt>: reverse the order of your vector's elements
\li <tt>gsl_vector_ptr</tt> and <tt>gsl_matrix_ptr</tt>. To increment an element in a vector use, e.g., <tt>*gsl_vector_ptr(v, 7) += 3;</tt> or <tt>(*gsl_vector_ptr(v, 7))++</tt>.
Outlineheader readin Reading from text files
The \ref apop_text_to_data() function takes in the name of a text file with a grid of data in (comma|tab|pipe|whatever)-delimited format and reads it to a matrix. If there are names in the text file, they are copied in to the data set. See \ref text_format for the full range and details of what can be read in.
If you have any columns of text, then you will need to read in via the database: use
\ref apop_text_to_db() to convert your text file to a database table,
do any database-appropriate cleaning of the input data, then use \ref
apop_query_to_data() or \ref apop_query_to_mixed_data() to pull the data to an \ref apop_data set.
endofdiv
Outlineheader datalloc Alloc/free
\li\ref apop_data_alloc()
\li\ref apop_data_calloc()
\li\ref apop_data_free()
\li\ref apop_text_alloc() : allocate or resize the text part of an \ref apop_data set.
\li\ref apop_text_free()
See also:
\li <tt>gsl_matrix * gsl_matrix_alloc (size_t n1, size_t n2)</tt>
\li <tt>gsl_matrix * gsl_matrix_calloc (size_t n1, size_t n2)</tt>
\li <tt>void gsl_matrix_free (gsl_matrix * m)</tt>
\li <tt>gsl_matrix_memcpy (gsl_matrix * dest, const gsl_matrix * src)</tt>
\li <tt>gsl_vector * gsl_vector_alloc (size_t n)</tt>
\li <tt>gsl_vector * gsl_vector_calloc (size_t n)</tt>
\li <tt>void gsl_vector_free (gsl_vector * v)</tt>
\li <tt>gsl_vector_memcpy (gsl_vector * dest, const gsl_vector * src)</tt>
endofdiv
Outlineheader gslviews Using views
There are several macros for the common task of viewing a single row or column of a \ref
apop_data set.
\code
apop_data *d = apop_query_to_data("select obs1, obs2, obs3 from a_table");
//Get a column using its name
Apop_col_t(d, "obs1", ov);
double obs1_sum = apop_vector_sum(ov);
//Get row zero of the data set's matrix as a vector; get its sum
double first_row_sum = apop_vector_sum(Apop_rv(d, 0));
//Get a row or rows as a standalone one-row apop_data set
apop_data_print(Apop_r(d, 0));
//ten rows starting at row 3:
Apop_data_rows(d, 3, 10, d10);
apop_data_show(d10);
//First column's sum
double first_col_sum = apop_vector_sum(Apop_cv(d, 0));
//Pull a 10x5 submatrix, whose first element is the (2,3)rd
//element of the parent data set's matrix
double sub_sum = apop_matrix_sum(Apop_subm(d, 2,3, 10,5));
\endcode
To make it easier to use the result of these macros as an argument to a function, these macros have abbreviated names.
\li\ref Apop_c
\li\ref Apop_r
\li\ref Apop_cv
\li\ref Apop_rv
\li\ref Apop_cs
\li\ref Apop_rs
\li\ref Apop_subm
A second set of macros have a slightly different syntax, taking the name of the object to be declared as the last argument. These can not be used as expressions such as function arguments.
\li\ref Apop_col_t
\li\ref Apop_row_t
\li\ref Apop_col_tv
\li\ref Apop_row_tv
\li\ref Apop_matrix_row
\li\ref Apop_matrix_col
The view is an automatic variable, not a pointer, and therefore disappears at the end
of the scope in which it is declared. If you want to retain the data after the function
exits, copy it to another vector:
\code
Apop_matrix_row(d->matrix, 2, rowtwo);
gsl_vector *outvector = apop_vector_copy(rowtwo);
\endcode
Curly braces always delimit scope, not just at the end of a function.
These macros work by generating a number of local variables, which you may be able to see in your
debugger. When program evaluation exits a given block, all variables in that block are
erased. Here is some sample code that won't work:
\code
apop_data *outdata;
if (get_odd){
outdata = Apop_r(data, 1);
} else {
outdata = Apop_r(data, 0);
}
apop_data_show(outdata); //breaks: outdata points to out-of-scope variables.
\endcode
For this if/then statement, there are two sets of local variables
generated: one for the \c if block, and one for the \c else block. By the last line,
neither exists. You can get around the problem here by making sure to not put the macro
declaring new variables in a block. E.g.:
\code
apop_data *outdata = Apop_r(data, get_odd ? 1 : 0);
apop_data_show(outdata);
\endcode
This is a general rule about how variables declared in blocks will behave, but because the
macros obscure the variable declarations, it is especially worth watching out for here.
endofdiv
Outlineheader setgetsec Set/get
The set/get functions can act on both names or indices. Sample usages:
\code
double twothree = apop_data_get(data, 2, 3); //just indices
apop_data_set(data, .rowname="A row", .colname="this column", .val=13);
double AIC = apop_data_get(your_model->info, .rowname="AIC");
\endcode
\li\ref apop_data_get()
\li\ref apop_data_set()
\li\ref apop_data_ptr() : returns a pointer to the element.
See also:
\li <tt>double gsl_matrix_get (const gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>double gsl_vector_get (const gsl_vector * v, size_t i)</tt>
\li <tt>void gsl_matrix_set (gsl_matrix * m, size_t i, size_t j, double x)</tt>
\li <tt>void gsl_vector_set (gsl_vector * v, size_t i, double x)</tt>
\li <tt>double * gsl_matrix_ptr (gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>double * gsl_vector_ptr (gsl_vector * v, size_t i)</tt>
\li <tt>const double * gsl_matrix_const_ptr (const gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>const double * gsl_vector_const_ptr (const gsl_vector * v, size_t i)</tt>
\li <tt>gsl_matrix_get_row (gsl_vector * v, const gsl_matrix * m, size_t i)</tt>
\li <tt>gsl_matrix_get_col (gsl_vector * v, const gsl_matrix * m, size_t j)</tt>
\li <tt>gsl_matrix_set_row (gsl_matrix * m, size_t i, const gsl_vector * v)</tt>
\li <tt>gsl_matrix_set_col (gsl_matrix * m, size_t j, const gsl_vector * v)</tt>
endofdiv
Outlineheader mapplysec Map/apply
\anchor outline_mapply
These functions allow you to send each element of a vector or matrix to a function, either producing a new matrix (map) or transforming the original (apply). The \c ..._sum functions return the sum of the mapped output.
There is an older and a newer set of functions. The older versions, which act on
<tt>gsl_matrix</tt>es or <tt>gsl_vector</tt>s have more verbose names; the newer
versions, which take in an \ref apop_data set, use the \ref designated syntax to add
a few options and a more brief syntax.
You can do many things quickly with these functions.
Get the sum of squares of a vector's elements:
\code
//given apop_data *dataset and gsl_vector *v:
double sum_of_squares = apop_map_sum(dataset, gsl_pow_2);
double sum_of_squares = apop_vector_map_sum(v, gsl_pow_2);
\endcode
Here, we create an index vector [\f$0, 1, 2, ...\f$].
\code
double index(double in, int index){return index;}
apop_data *d = apop_data_alloc(100);
apop_map(d, .fn_di=index, .inplace='y');
\endcode
Given your log likelihood function and a data set where each row of the matrix is an observation, find the total log likelihood:
\code
static double your_log_likelihood_fn(const gsl_vector * in)
{[your math goes here]}
double total_ll = apop_matrix_map_sum(dataset, your_log_likelihood_fn);
\endcode
How many missing elements are there in your data matrix?
\code
static double nan_check(const double in){ return isnan(in);}
int missing_ct = apop_map_sum(in, nan_check, .part='m');
\endcode
Get the mean of the not-NaN elements of a data set:
\code
static double no_nan_val(const double in){ return isnan(in)? 0 : in;}
static double not_nan_check(const double in){ return !isnan(in);}
static double apop_mean_no_nans(apop_data *in){
return apop_map_sum(in, no_nan_val)/apop_map_sum(in, not_nan_check);
}
\endcode
The following program randomly generates a data set where each row is a list of numbers with a different mean. It then finds the \f$t\f$ statistic for each row, and the confidence with which we reject the claim that the statistic is less than or equal to zero.
Notice how the older \ref apop_vector_apply uses file-global variables to pass information into the functions, while the \ref apop_map uses a pointer to the constant parameters to input to the functions.
\include t_test_by_rows.c
One more toy example, demonstrating the use of \ref apop_map and \ref apop_map_sum :
\include apop_map_row.c
\li\ref apop_map()
\li\ref apop_map_sum()
\li\ref apop_matrix_apply()
\li\ref apop_matrix_map()
\li\ref apop_matrix_map_all_sum()
\li\ref apop_matrix_map_sum()
\li\ref apop_vector_apply()
\li\ref apop_vector_map()
\li\ref apop_vector_map_sum()
endofdiv
Outlineheader matrixmathtwo Basic Math
\li\ref apop_vector_exp : exponentiate every element of a vector
\li\ref apop_vector_log : take the log of every element of a vector
\li\ref apop_vector_log10 : take the log (base 10) of every element of a vector
\li\ref apop_vector_distance : find the distance between two vectors via various metrics
\li\ref apop_vector_normalize : scale/shift a matrix to have mean zero, sum to one, et cetera
\li\ref apop_matrix_normalize : apply \ref apop_vector_normalize to every column or row of a matrix
See also:
\li <tt>int gsl_matrix_add (gsl_matrix * a, const gsl_matrix * b)</tt>
\li <tt>int gsl_matrix_sub (gsl_matrix * a, const gsl_matrix * b)</tt>
\li <tt>int gsl_matrix_mul_elements (gsl_matrix * a, const gsl_matrix * b)</tt>
\li <tt>int gsl_matrix_div_elements (gsl_matrix * a, const gsl_matrix * b)</tt>
\li <tt>int gsl_matrix_scale (gsl_matrix * a, const double x)</tt>
\li <tt>int gsl_matrix_add_constant (gsl_matrix * a, const double x)</tt>
\li <tt>gsl_vector_add (gsl_vector * a, const gsl_vector * b)</tt>
\li <tt>gsl_vector_sub (gsl_vector * a, const gsl_vector * b)</tt>
\li <tt>gsl_vector_mul (gsl_vector * a, const gsl_vector * b)</tt>
\li <tt>gsl_vector_div (gsl_vector * a, const gsl_vector * b)</tt>
\li <tt>gsl_vector_scale (gsl_vector * a, const double x)</tt>
\li <tt>gsl_vector_add_constant (gsl_vector * a, const double x)</tt>
endofdiv
Outlineheader matrixmath Matrix math
\li\ref apop_dot(): matrix \f$\cdot\f$ matrix, matrix \f$\cdot\f$ vector, or vector \f$\cdot\f$ matrix
\li\ref apop_matrix_determinant
\li\ref apop_matrix_inverse
\li\ref apop_det_and_inv(): find determinant and inverse at the same time
See the GSL documentation for voluminous further options.
endofdiv
Outlineheader sumstats Summary stats
\li\ref apop_data_summarize ()
\li\ref apop_vector_moving_average()
\li\ref apop_vector_percentiles()
See also:
\li <tt>double gsl_matrix_max (const gsl_matrix * m)</tt>
\li <tt>double gsl_matrix_min (const gsl_matrix * m)</tt>
\li <tt>void gsl_matrix_minmax (const gsl_matrix * m, double * min_out, double * max_out)</tt>
\li <tt>void gsl_matrix_max_index (const gsl_matrix * m, size_t * imax, size_t * jmax)</tt>
\li <tt>void gsl_matrix_min_index (const gsl_matrix * m, size_t * imin, size_t * jmin)</tt>
\li <tt>void gsl_matrix_minmax_index (const gsl_matrix * m, size_t * imin, size_t * jmin, size_t * imax, size_t * jmax)</tt>
\li <tt>gsl_vector_max (const gsl_vector * v)</tt>
\li <tt>gsl_vector_min (const gsl_vector * v)</tt>
\li <tt>gsl_vector_minmax (const gsl_vector * v, double * min_out, double * max_out)</tt>
\li <tt>gsl_vector_max_index (const gsl_vector * v)</tt>
\li <tt>gsl_vector_min_index (const gsl_vector * v)</tt>
\li <tt>gsl_vector_minmax_index (const gsl_vector * v, size_t * imin, size_t * imax)</tt>
endofdiv
Outlineheader moments Moments
For most of these, you can add a weights vector for weighted mean/var/cov/..., such as
<tt>apop_vector_mean(d->vector, .weights=d->weights)</tt>
\li\ref apop_mean(): the first three with short names operate on a vector.
\li\ref apop_sum()
\li\ref apop_var()
\li\ref apop_matrix_sum ()
\li\ref apop_data_correlation ()
\li\ref apop_data_covariance ()
\li\ref apop_data_summarize ()
\li\ref apop_matrix_mean ()
\li\ref apop_matrix_mean_and_var ()
\li\ref apop_vector_correlation ()
\li\ref apop_vector_cov ()
\li\ref apop_vector_kurtosis ()
\li\ref apop_vector_kurtosis_pop ()
\li\ref apop_vector_mean()
\li\ref apop_vector_skew()
\li\ref apop_vector_skew_pop()
\li\ref apop_vector_sum()
\li\ref apop_vector_var()
\li\ref apop_vector_var_m ()
endofdiv
Outlineheader convsec Conversion among types
There are no functions provided to convert from \ref apop_data to the constituent
elements, because you don't need a function.
If you need an individual element, you can just use its pointer to retrieve it:
\code
apop_data *d = apop_text_to_mixed_data("vmw", "select result, age, "
"income, sampleweight from data");
double avg_result = apop_vector_mean(d->vector, .weights=d->weights);
\endcode
In the other direction, you can use compound literals to wrap an \ref apop_data struct
around a loose vector or matrix:
\code
//Given:
gsl_vector *v;
gsl_matrix *m;
// Then this form wraps the elements into \ref apop_data structs. Note that
// these are not pointers: they're automatically allocated and therefore
// the extra memory use for the wrapper is cleaned up on exit from scope.
apop_data *dv = &(apop_data){.vector=v};
apop_data *dm = &(apop_data){.matrix=m};
apop_data *v_dot_m = apop_dot(dv, dm);
//Here is a macro to hide C's ugliness:
#define As_data(...) (&(apop_data){__VA_ARGS__})
apop_data *v_dot_m2 = apop_dot(As_data(.vector=v), As_data(.matrix=m));
//The wrapped object is an automatically-allocated structure pointing to the
//original data. If it needs to persist or be separate from the original,
//make a copy:
apop_data *dm_copy = apop_data_copy(As_data(.vector=v, .matrix=m));
\endcode
\li\ref apop_array_to_vector() : <tt>double*</tt>\f$\to\f$ <tt>gsl_vector</tt>
\li\ref apop_data_fill() : <tt>double*</tt>\f$\to\f$ \ref apop_data. See also \ref apop_data_falloc
\li\ref apop_text_to_data() : delimited text file\f$\to\f$ \ref apop_data
\li\ref apop_text_to_db() : delimited text file\f$\to\f$ database
\li\ref apop_vector_to_matrix()
endofdiv
Outlineheader names Name handling
If you generate your data set from the database via \ref apop_query_to_data (or
\ref apop_query_to_text or \ref apop_query_to_mixed_data) then column names appear
as expected. Set <tt>apop_opts.db_name_column</tt> to the name of a column in your
query result to use that column name for row names.
Sample uses, given \ref apop_data set <tt>d</tt>:
\code
int row_name_count = d->names->rowct
int col_name_count = d->names->colct
int text_name_count = d->names->textct
//Manually add a name:
apop_name_add(d->names, "the vector", 'v');
apop_name_add(d->names, "row 0", 'r');
apop_name_add(d->names, "row 1", 'r');
apop_name_add(d->names, "row 2", 'r');
apop_name_add(d->names, "numeric column 0", 'c');
apop_name_add(d->names, "text column 0", 't');
apop_name_add(d->names, "The name of the data set.", 'h');
//or append several names at once
apop_data_add_names(d, 'c', "numeric column 1", "numeric column 2", "numeric column 3");
//point to element i from:
char *rowname_i = d->names->row[i];
char *colname_i = d->names->col[i];
char *textname_i = d->names->text[i];
//The vector also has a name:
char *vname = d->names->vector;
\endcode
\li\ref apop_name_add() : add one name
\li\ref apop_data_add_names() : add a sequence of names at once
\li\ref apop_name_stack() : copy the contents of one name list to another
\li\ref apop_name_find() : find the row/col number for a given name.
\li\ref apop_name_print() : print the \ref apop_name struct, for diagnostic purposes.
endofdiv
Outlineheader textsec Text data
The \ref apop_data set includes a grid of strings, <tt>text</tt>, for holding text data.
Text should be encoded in UTF-8. US ASCII is a subset of UTF-8, so that's OK too.
There are a few simple forms for handling the \c text element of an \c apop_data set, which handle the tedium of memory-handling for you.
\li Use \ref apop_text_alloc to allocate the block of text. It is actually a realloc function, which you can use to resize an existing block without leaks.
\li Use \ref apop_text_add to add text elements. It replaces any existing text in the given slot without memory leaks.
\li The number of rows of text data in <tt>tdata</tt> is
<tt>tdata->textsize[0]</tt>;
the number of columns is <tt>tdata->textsize[1]</tt>.
\li Refer to individual elements using the usual 2-D array notation, <tt>tdata->text[row][col]</tt>.
\li <tt>x[0]</tt> can always be written as <tt>*x</tt>, which may save some typing. The number of rows is <tt>*tdata->textsize</tt>. If you have a single column of text data (i.e., all data is in column zero), then item \c i is <tt>*tdata->text[i]</tt>. If you know you have exactly one cell of text, then its value is <tt>**tdata->text</tt>.
\li After \ref apop_text_alloc, all elements are the empty string <tt>""</tt>, which
you can check via <tt>if (!strlen(dataset->text[i][j])) printf("<blank>")</tt> or
<tt>if (!*dataset->text[i][j]) printf("<blank>")</tt>. For the sake of efficiency
when dealing with large, sparse data sets, all blank cells point to <em>the same</em>
static empty string, meaning that freeing cells must be done with care. Your best bet
is to rely on \ref apop_text_add, \ref apop_text_alloc, and \ref apop_text_free to do
the memory management for you.
Here is a sample program that uses these forms, plus a few text-handling functions.
\include eg/text_demo.c
\li\ref apop_data_transpose() : also transposes the text data. Say that you use
<tt>dataset = apop_query_to_text("select onecolumn from data");</tt> then you have a
sequence of strings, <tt>d->text[0][0], d->text[1][0], </tt>.... After <tt>apop_data
*dt = apop_data_transpose(dataset)</tt>, you will have a single list of strings,
<tt>dt->text[0]</tt>, which is often useful as input to list-of-strings handling
functions.
\li\ref apop_query_to_text()
\li\ref apop_text_alloc() : allocate or resize the text part of an \ref apop_data set.
\li\ref apop_text_add(): replace a single cell of the text grid with new text.
\li\ref apop_text_paste() : convert a table of little strings into one long string.
\li\ref apop_text_unique_elements() : get a sorted list of unique elements for one column of text.
\li\ref apop_text_free() : you may never need this, because \ref apop_data_free calls it.
\li\ref apop_regex() : friendlier front-end for POSIX-standard regular expression searching and pulling matches into a \ref apop_data set.
\li\ref apop_text_paste(): produce a single string from a grid of text
endofdiv
Outlineheader fact Generating factors
\em Factor is jargon for a numbered category. Number-crunching programs prefer integers over text, so we need a function to produce a one-to-one mapping from text categories into numeric factors.
A \em dummy is a variable that is either one or zero, depending on membership in a given
group. Some methods (typically when the variable is an input or independent variable
in a regression) prefer dummies; some methods (typically for outcome or dependent
variables) prefer factors. The functions that generate factors and dummies will add
an informational page to your \ref apop_data set with a name like <tt>\<categories
for your_column\></tt> listing the conversion from the artificial numeric factor to
the original data. Use \ref apop_data_get_factor_names to get a pointer to that page.
You can use the factor table to translate from numeric categories back to text (though
you probably have the original text column in your data anyway).
This is especially useful if you want text in two data sets to have the same
categories. Generate factors in the first set, then copy the factor list to the second,
then run \ref apop_data_to_factors on the second.
\code
apop_data_to_factors(d1);
d2->more = apop_data_copy(apop_data_get_factor_names(d1));
apop_data_to_factors(d2);
\endcode
\li\ref apop_data_to_dummies()
\li\ref apop_data_to_factors()
\li\ref apop_data_get_factor_names()
\li\ref apop_text_unique_elements()
\li\ref apop_vector_unique_elements()
endofdiv
endofdiv
Outlineheader dbs Databases
These are convenience functions to handle interaction with SQLite or mySQL/mariaDB. They open one and only one database, and handle most of the interaction therewith for you.
You will probably first use \ref apop_text_to_db to pull data into the database, then \ref apop_query to clean the data in the database, and finally \ref apop_query_to_data to pull some subset of the data out for analysis.
See the \ref db_moments page for not-SQL-standard math functions that you can
use when sending queries from Apophenia, such as \c pow, \c stddev, or \c sqrt.
\li \ref apop_text_to_db : Read a text file on disk into the database. Most data analysis projects start with a call to this.
\li \ref apop_data_print : If you include the argument <tt>.output_type='d'</tt>, this prints your \ref apop_data set to the database.
\li \ref apop_query : Manipulate the database, return nothing (e.g., insert rows or create table).
\li \ref apop_db_open : Optional, for when you want to use a database on disk.
\li \ref apop_db_close : If you used \ref apop_db_open, you will need to use this too.
\li \ref apop_table_exists : Check to make sure you aren't reinventing or destroying data. Also, a clean way to drop a table.
\li Apophenia reserves the right to insert temp tables into the opened database. They will all have names beginning with "apop_", so the reader is advised to not use tables with such names, and is free to ignore or delete any such tables that turn up.
\li If you need to deal with two databases, use SQL's <a
href="https://sqlite.org/lang_attach.html"><tt>attach database</tt></a>. By default
with SQLite, Apophenia opens an in-memory database handle. It is a sensible workflow to
use the faster in-memory database as the primary, and then attach an on-disk database
to read in data and write final output tables.
<b>Extracting data from the database</b>
\li\ref apop_db_to_crosstab(): take three columns in the database (row, column, value) and produce a table of values.
\li\ref apop_query_to_data()
\li\ref apop_query_to_float()
\li\ref apop_query_to_mixed_data()
\li\ref apop_query_to_text()
\li\ref apop_query_to_vector()
<b>Writing data to the database</b>
See the print functions below. E.g.
\code
apop_data_print(yourdata, .output_type='d', .output_name="dbtab");
\endcode
Outlineheader cmdline Command-line utilities
A few functions have proven to be useful enough to be worth breaking out into their own programs, for use in scripts or other data analysis from the command line:
\li The \c apop_text_to_db command line utility is a wrapper for the \ref apop_text_to_db command.
\li The \c apop_db_to_crosstab function is a wrapper for the \ref apop_db_to_crosstab function.
\li For fans of Gnuplot, the \c apop_plot_query utility produces a plot from the database. It is especially useful for histograms, whcih are binned via \ref apop_data_to_bins before plotting.
endofdiv
endofdiv
Outlineheader Modesec Models
This segment discusses the use of existing \ref apop_model objects.
If you need to write a new model, see \ref modeldetails.
Outlineheader introtomodels Introduction
Begin with the most common use:
the \c estimate function will estimate the parameters of your model. Just prep the data, select a model, and produce an estimate:
\code
apop_data *data = apop_query_to_data("select outcome, in1, in2, in3 from dataset");
apop_model *the_estimate = apop_estimate(data, apop_probit);
apop_model_print(the_estimate, NULL);
\endcode
Along the way to estimating the parameters, most models also find covariance estimates for
the parameters, calculate statistics like log likelihood, and so on, which the final print statement will show.
The <tt>apop_probit</tt> model that ships with Apophenia is unparameterized:
<tt>apop_probit.parameters==NULL</tt>. The output from the estimation,
<tt>the_estimate</tt>, has the same form as <tt>apop_probit</tt>, but
<tt>the_estimate->parameters</tt> has a meaningful value.
Outlineheader covandstuff More estimation output
A call to \ref apop_estimate produces more than just the estimated parameters. Most will
produce any of a covariance matrix, some hypothesis tests, a list of expected values, log
likelihood, AIC, AIC_c, BIC, et cetera.
First, note that if you don't want all that,
adding to your model an \ref apop_parts_wanted_settings group with its default values (see below on settings groups) signals to
the model that you want only the parameters and to not waste CPU time on covariances,
expected values, et cetera. See the \ref apop_parts_wanted_settings documentation for examples and
further refinements.
\li The actual parameter estimates are in an \ref apop_data set at \c your_model->parameters.
\li A pointer to the \ref apop_data set used for estimation, named \c data.
\li Scalar statistics of the model are listed in the output model's \c info group, and can
be retrieved via a form like
\code
apop_data_get(your_model->info, .rowname="log likelihood");
//or
apop_data_get(your_model->info, .rowname="AIC");
\endcode
\li Covariances of the parameters are a page appended to the parameters; retrieve via
\code
apop_data *cov = apop_data_get_page(your_model->parameters, "<Covariance>");
\endcode
\li If the model calculates it, the table of expected values (typically including
expected value, actual value, and residual) is a page stapled to the main info
page. This is mostly for regression-type models. Retrieve via:
\code
apop_data *predict = apop_data_get_page(your_model->info, "<Predicted>");
\endcode
endofdiv
But we expect much more from a model than just estimating parameters from data.
Continuing the above example where we got an estimated Probit model named \c the_estimate, we can interrogate the estimate in various familiar ways. In each of the following examples, the model object holds enough information that the generic function being called can do its work:
\code
apop_data *expected_value = apop_predict(NULL, the_estimate);
double density_under = apop_cdf(expected_value, the_estimate);
apop_data *draws = apop_model_draws(the_estimate, .count=1000);
\endcode
Apophenia ships with many well-known models for your immediate use, including
probability distributions, such as the \ref apop_normal, \ref apop_poisson, or \ref apop_beta models. The data is assumed to have been drawn from a given distribution and the question is only what distributional parameters best fit; e.g., assume the data is Normally distributed and find the mean and variance: \ref apop_estimate(\c your_data, \ref apop_normal).
There are also linear models like \ref apop_ols, \ref apop_probit, and \ref apop_logit. As in the example, they are on equal footing with the distributions, so nothing keeps you from making random draws from an estimated linear model.
\li If you send a data set with the \c weights vector filled, \ref apop_ols estimates Weighted OLS.
\li If the dependent variable has more than categories, The \ref apop_probit and \ref apop_logit models estimate a multinomial logit or probit.
\li There are separate \ref apop_normal and \ref apop_multivariate_normal functions because the parameter formats are slightly different: the univariate Normal keeps both μ and σ in the vector element of the parameters; the multivariate version uses the vector for the vector of means and the matrix for the Σ matrix. The univariate is so heavily used that it merits a special-case model.
Simulation models seem to not fit this form, but you will see below that if you can write an objective function for the \c p method of the model, you can use the above tools. Notably, you can estimate parameters via maximum likelihood and then give confidence intervals around those parameters.
But some models have to break uniformity, like how a histogram has a list of bins that makes no sense for a Normal distribution. These are held in <em>settings groups</em>, which you will occasionally need to tweak to modify how a model is handled or estimated. The most common example would be for maximum likelihood, eg.
\code
//Probit uses MLE. Redo the estimation using Newton's Method
Apop_settings_add_group(the_estimate, apop_mle, .verbose='y',
.tolerance=1e-4, .method=APOP_RF_NEWTON);
apop_model *re_est = apop_estimate(data, the_estimate);
\endcode
See below for more details on using settings groups.
Outlineheader modelparameterization Parameterizing or initializing a model
The models that ship with Apophenia have the requisite procedures for estimation,
making draws, and so on, but have <tt>parameters==NULL</tt> and <tt>settings==NULL</tt>. The
model is thus, for many purposes, incomplete, and you will need to take some action to
complete the model. As per the examples to follow, there are several possibilities:
\li Estimate it! Almost all models can be sent with a data set as an argument to the
<tt>apop_estimate</tt> function. The input model is unchanged, but the output model
has parameters and settings in place.
\li If your model has a fixed number of numeric parameters, then you can set them with
\ref apop_model_set_parameters.
\li If your model has a variable number of parameters, you can directly set the \c
parameters element via \c apop_data_falloc. For most purposes, you will also need to
set the \c msize1, \c msize2, \c vsize, and \c dsize elements to the size you want. See
the example below.
\li Some models have disparate, non-numeric settings rather than a simple matrix of
parameters. For example, an kernel density estimate needs a model as a kernel and a
base data set, which can be set via \ref apop_model_copy_set.
Here is an example that shows the options for parameterizing a model. After each
parameterization, 20 draws are made and written to a file named draws-[modelname].
\include ../eg/parameterization.c
endofdiv
Where to from here? See the \ref models page for
a list of basic functions that make use of them,
along with a list of the canned models, including popular favorites like
\ref apop_beta, \ref apop_binomial, \ref apop_iv (instrumental variables),
\ref apop_kernel_density, \ref apop_loess, \ref apop_lognormal,
\ref apop_pmf (see the histogram section below), and \ref apop_poisson.
If you need to write a new model, see \ref modeldetails.
endofdiv
Outlineheader mathmethods Model methods
\li\ref apop_estimate() : estimate the parameters of the model with data.
\li\ref apop_predict() : the expected value function.
\li\ref apop_draw() : random draws from an estimated model.
\li\ref apop_p() : the probability of a given data set given the model.
\li\ref apop_log_likelihood() : the log of \ref apop_p
\li\ref apop_score() : the derivative of \ref apop_log_likelihood
\li\ref apop_model_print() : write to screen, file, or database
\li\ref apop_model_copy() : duplicate a model
\li\ref apop_model_set_parameters() : Models ship with no parameters set. Use this to convert a Normal(μ, σ) with unknown μ and σ into a Normal(0, 1), for example.
\li\ref apop_model_free()
\li\ref apop_model_clear(), apop_prep() : remove the parameters from a parameterized model. Used infrequently.
\li\ref apop_model_draws() : many random draws from an estimated model.
endofdiv
Outlineheader Update Filtering & updating
The model structure makes it
easy to generate new models that are variants of prior models. Bayesian updating,
for example, takes in one \ref apop_model that we call the prior, one \ref apop_model
that we call a likelihood, and outputs an \ref apop_model that we call the
posterior. One can produce complex models using simpler transformations as well. For example, to generate
a one-parameter Normal(μ, 1) given the code for for a Normal(μ, σ):
\code
apop_model *N_sigma1 = apop_model_fix_params(apop_model_set_parameters(apop_normal, NAN, 1));
\endcode
This can be used anywhere the original Normal distribution can be. If we need to truncate the distribution in the data space:
\code
//The constraint function.
double over_zero(apop_data *in, apop_model *m){
return apop_data_get(in) > 0;
}
apop_model *trunc = apop_model_dconstrain(.base_model=N_sigma1,
.constraint=over_zero);
\endcode
Chaining together simpler transformations is an easy method to produce
models of arbitrary detail.
\li\ref apop_update() : Bayesian updating
\li\ref apop_model_coordinate_transform() : apply an invertible transformation to the data space
\li\ref apop_model_dconstrain() : constrain the data space of a model to a subspace. E.g., truncate a Normal distribution so \f$x>0\f$.
\li\ref apop_model_fix_params() : hold some parameters constant
\li\ref apop_model_mixture() : a linear combination of models
\li\ref apop_model_stack() : If \f$(p_1, p_2)\f$ has a Normal distribution and \f$p_3\f$ has an independent Poisson distribution, then \f$(p_1, p_2, p_3)\f$ has an <tt>apop_model_stack(apop_normal, apop_poisson)</tt> distribution.
\li\ref apop_model_dcompose() : use the output of one model as a data set for another
endofdiv
Outlineheader modelsettings Settings groups
[For info on specific settings groups and their contents and use, see the \ref settings page.]
Describing a statistical, agent-based, social, or physical model in a standardized form is difficult because every model has significantly different settings. E.g., an MLE requires a method of search (conjugate gradient, simplex, simulated annealing), and a histogram needs the number of bins to be filled with data.
So, the \ref apop_model includes a single list which can hold an arbitrary number of groups of settings, like the search specifications for finding the maximum likelihood, a histogram for making random draws, and options about the model type.
Settings groups are automatically initialized with default values when
needed. If the defaults do no harm, then you don't need to think about
these settings groups at all.
Here is an example where a settings group is worth tweaking: the \ref apop_parts_wanted_settings group indicates which parts
of the auxiliary data you want.
\code
1 apop_model *m = apop_model_copy(apop_ols);
2 Apop_settings_add_group(m, apop_parts_wanted, .covariance='y');
3 apop_model *est = apop_estimate(data, m);
\endcode
Line one establishes the baseline form of the model. Line two adds a settings group
of type \ref apop_parts_wanted_settings to the model. By default other auxiliary items, like the expected values, are set to \c 'n' when using this group, so this specifies that we want covariance and only covariance. Having stated our preferences, line three does the estimation we want.
Notice that the \c _settings ending to the settings group's name isn't written---macros
make it happen. The remaining arguments to \c Apop_settings_add_group (if any) follow
the \ref designated syntax of the form <tt>.setting=value</tt>.
There is an \ref apop_model_copy_set macro that adds a settings group when it is first copied, joining up lines one and two above:
\code
apop_model *m = apop_model_copy_set(apop_ols, apop_parts_wanted, .covariance='y');
\endcode
Settings groups are copied with the model, which facilitates chaining
estimations. Continuing the above example, you could re-estimate to get the predicted
values and covariance via:
\code
Apop_settings_set(est, apop_parts_wanted, predicted, 'y');
apop_model *est2 = apop_estimate(data, est);
\endcode
\li \ref Apop_settings_set, for modifying a single setting, doesn't use the designated initializers format.
\li The \ref settings page lists the settings structures included in Apophenia and their use.
\li Because the settings groups are buried within the model, debugging them can be a
pain. Here is a documented macro for \c gdb that will help you pull a settings group out of a
model for your inspection, to cut and paste into your \c .gdbinit. It shouldn't be too difficult to modify this macro for other debuggers.
\code
define get_group
set $group = ($arg1_settings *) apop_settings_get_grp( $arg0, "$arg1", 0 )
p *$group
end
document get_group
Gets a settings group from a model.
Give the model name and the name of the group, like
get_group my_model apop_mle
and I will set a gdb variable named $group that points to that model,
which you can use like any other pointer. For example, print the contents with
p *$group
The contents of $group are printed to the screen as visible output to this macro.
end
\endcode
For just using a model, that's all of what you need to know. For details on writing a new settings group, see \ref settingswriting .
\li\ref Apop_settings_add_group
\li\ref Apop_settings_set
\li\ref Apop_settings_get get a single element from a settings group.
\li\ref Apop_settings_get_group get the whole settings group.
endofdiv
endofdiv
endofdiv
Outlineheader Test Tests & diagnostics
Just about any hypothesis test consists of a few common steps:
\li specify a statistic
\li State the statistic's hypothesized distribution
\li Find the odds that the statistic would lie within some given range, like <em>greater than zero</em> or <em>near 1.1</em>.
If the statistic is from a common form, like the parameters from an OLS regression, then the commonly-associated \f$t\f$ test is probably included as part of the estimation output, typically as a row in the \c info element of the \ref apop_data set.
Some tests, like ANOVA, produce a statistic using a specialized procedure, so Apophenia includes some functions, like \ref apop_test_anova_independence and \ref apop_test_kolmogorov, to produce the statistic and look up its significance level.
If you are producing a statistic that you know has a common form, like a central limit theorem tells you that your statistic is Normally distributed, then the convenience function \ref apop_test will do the final lookup step of checking where your statistic lies on your chosen distribution.
\li\ref apop_test()
\li\ref apop_paired_t_test()
\li\ref apop_f_test()
\li\ref apop_t_test()
\li\ref apop_test_anova_independence()
\li\ref apop_test_fisher_exact()
\li\ref apop_test_kolmogorov()
\li\ref apop_estimate_coefficient_of_determination()
\li\ref apop_estimate_r_squared()
\li\ref apop_estimate_parameter_tests()
See also the example at the end of \ref gentle.
Outlineheader Mont Monte Carlo methods
\li\ref apop_bootstrap_cov()
\li\ref apop_jackknife_cov()
endofdiv
To give another example of testing, here is a function that used to be a part of Apophenia, but seemed a bit out of place. Here it is as a sample:
\code
// Input: any old vector. Output: 1 - the p-value for a chi-squared
// test to answer the question, "with what confidence can I reject the
// hypothesis that the variance of my data is zero?"
double apop_test_chi_squared_var_not_zero(const gsl_vector *in){
Apop_stopif(!in, return NAN, 0, "input vector is NULL. Doing nothing.");
double sum=0;
gsl_vector *normed = apop_vector_normalize((gsl_vector *)in, .normalization_type='s');
gsl_vector_mul(normed, normed);
for(size_t i=0;i< normed->size;
sum +=gsl_vector_get(normed,i++));
gsl_vector_free(normed);
return gsl_cdf_chisq_P(sum, in->size);
}
\endcode
Or, consider the Rao statistic,
\f${\partial\over \partial\beta}\log L(\beta)'I^{-1}(\beta){\partial\over \partial\beta}\log L(\beta)\f$
where \f$L\f$ is your model's likelihood function and \f$I\f$ its information matrix. In code:
\code
apop_data * infoinv = apop_model_numerical_covariance(data, your_model);
apop_data * score;
apop_score(data, score->vector, your_model);
apop_data * stat = apop_dot(apop_dot(score, infoinv), score);
\endcode
Given the correct assumptions, this is \f$\sim \chi^2_m\f$, where \f$m\f$ is the dimension of \f$\beta\f$, so the odds of a Type I error given the model is:
\code
double p_value = apop_test(stat, "chi squared", beta->size);
\endcode
endofdiv
Outlineheader Histosec Empirical distributions and PMFs (probability mass functions)
The \ref apop_pmf model wraps a \ref apop_data set so it can be read as an empirical
model, with a likelihoood function (equal to the associated weight for observed
values and zero for unobserved values), a random number generator (which
simply makes weighted random draws from the data), and so on. Setting it up is a
model estimation from data like any other, done via \ref apop_estimate(\c your_data,
\ref apop_pmf).
You have the option of cleaning up the data before turning it into a PMF. For example...
\code
apop_data_pmf_compress(your_data); //remove duplicates
apop_data_sort(your_data);
apop_vector_normalize(your_data->weights); //weights sum to one
apop_model *a_pmf = apop_estimate(your_data, apop_pmf);
\endcode
These are largely optional.
\li The CDF is calculated based on the percent of the weights between the zeroth row of the PMF
and the row specified. This generally makes more sense after \ref apop_data_sort.
\li Compression produces a corresponding improvement in efficiency when calculating
CDFs, but is otherwise not necessary.
\li Sorting or normalizing is not necessary for making draws or getting a likelihood or log likelihood.
It is the weights vector that holds the density represented by each row; the rest of the row represents the coordinates of that density. If the input data set has no \c weights segment, then I assume that all rows have equal weight.
Most models have a \c parameters \ref apop_data set that is filled when you call \ref
apop_estimate. For a PMF model, the \c parameters are \c NULL, and the \c data itself
is used for calculation. Therefore, modifying the data post-estimation can break some
internal settings set during estimation. If you modify the data, throw away any existing
PMFs (\c apop_model_free) and re-estimate a new one.
Using \ref apop_data_pmf_compress puts the data into one bin for each unique value in
the data set.
You may instead want bins of fixed with, in the style of a histogram, which you can get via
\ref apop_data_to_bins. It requires a bin specification. If you send a \c NULL binspec,
then the offset is zero and the bin size is big enough to ensure that there are
\f$\sqrt(N)\f$ bins from minimum to maximum. There are other preferred formulæ for
bin widths that minimize MSE, which might be added as a future extension. The binspec
will be added as a page to the data set, named <tt>"<binspec>"</tt>. See the \ref
apop_data_to_bins documentation on how to write a custom bin spec.
There are a few ways of testing the claim that one distribution equals another, typically an empirical PMF versus a smooth theoretical distribution. In both cases, you will need two distributions based on the same binspec.
For example, if you do not have a prior binspec in mind, then you can use the one generated by the first call to the histogram binning function to make sure that the second data set is in sync:
\code
apop_data_to_bins(first_set, NULL);
apop_data_to_bins(second_set, apop_data_get_page(first_set, "<binspec>"));
\endcode
You can use \ref apop_test_kolmogorov or \ref apop_histograms_test_goodness_of_fit to generate the appropriate statistics from the pairs of bins.
Kernel density estimation will produce a smoothed PDF. See \ref apop_kernel_density for details.
Or, use \ref apop_vector_moving_average for a simpler smoothing method.
\li\ref apop_data_pmf_compress() : merge together redundant rows in a data set before calling
\ref apop_estimate(\c your_data, \ref apop_pmf); optional.
\li\ref apop_vector_moving_average() : smooth a vector (e.g., your_pmf->data->weights) via moving average.
\li\ref apop_histograms_test_goodness_of_fit() : goodness-of-fit via \f$\chi^2\f$ statistic
\li\ref apop_test_kolmogorov() : goodness-of-fit via Kolmogorov-Smirnov statistic
endofdiv
Outlineheader Maxi Maximum likelihood methods
This section includes some notes on the maximum likelihood routine. As in the section
on writing models above, if a model has a \c p or \c log_likelihood method but no \c
estimate method, then calling \c apop_estimate(your_data, your_model) executes the
default estimation routine of maximum likelihood.
If you are a not a statistician, then there are a few things you will need to keep in
mind:
\li Physicists, pure mathematicians, and the GSL minimize; economists, statisticians, and
Apophenia maximize. If you are doing a minimization, be sure that your function returns minus the objective
function's value.
\li The overall setup is about estimating the parameters of a model with data. The user
provides a data set and an unparameterized model, and the system tries parameterized
models until one of them is found to be optimal. The data is fixed. The parameters make sense only in the
context of a model, so the optimization tries a series of parameterized models,
searching for the one that is most likely.
In a non-stats setting, you will probably have \c NULL data, and will be optimizing the
parameters internal to the model.
\li Because the unit of analysis is a parameterized model,
not just parameters, you need to have an \ref apop_model
wrapping your objective function. Here is a typical sort of function that one would
maximize; it is Rosenbrock's banana function, \f$(1-x)^2+ s(y - x^2)^2\f$, where the
scaling factor \f$s\f$ is fixed ahead of time, say at 100.
\code
typedef struct {
double scaling;
} coeff_struct;
double banana (double *params, coeff_struct *in){
return (gsl_pow_2(1-params[0]) +
in->scaling*gsl_pow_2(params[1]-gsl_pow_2(params[0])));
}
\endcode
The function returns a single number to be minimized. You will need to write an
\ref apop_model to send to the optimizer, which is a two step process: write a log
likelihood function wrapping the real objective function, and a model that uses that
log likelihood. Here they are:
\code
double ll (apop_data *d, apop_model *in){
return - banana(in->parameters->vector->data, in->more);
}
int main(){
coeff_struct co = {.scaling=100};
apop_model b = {"Bananas!", .log_likelihood= ll, .vsize=2,
.more = &co, .more_size=sizeof(coeff_struct)};
\endcode
The <tt>vsize=2</tt> specified that your parameters are a vector of size two, which
means that <tt>in->parameters->vector->data</tt> is the list of <tt>double</tt>s that you
should send to \c banana. The \c more element of the structure is designed to hold any
arbitrary structure; if you use it, you will also need to use the \c more_size element, as
above.
\li Statisticians want the covariance and basic tests about the parameters. If you just
want the optimal value, then adding this line will shut off all auxiliary calculations:
\code
Apop_settings_add_group(your_model, apop_parts_wanted);
\endcode
See the documentation for \ref apop_parts_wanted_settings for details about how this
works. It can also offer quite the speedup: especially for high-dimensional problems,
finding the covariance matrix without any information can take dozens of evaluations
of the objective funtion for each evaluation that is part of the search itself.
\li MLEs have an especially large number of parameter tweaks that could be made;
see the section on MLE settings above or the \ref apop_mle_settings page.
\li Putting it all together, here is a full program to minimize Rosenbrock's banana
function. There are some extras: it uses two methods (notice how easy it is to re-run an
estimation with an alternate method, but the syntax for modifying a setting differs from
the initialization syntax) and checks that the results are accurate.
\include ../eg/banana.c
\li If you would like to see what the optimizer did, add <tt>.want_path='y'</tt> to the settings group, then get the <tt>path</tt> element from the settings group:
\code
Apop_settings_add_group(your_model, apop_mle, .want_path='y');
apop_model *out = apop_estimate(your_data, your_model);
apop_data_show(Apop_settings_get(out, apop_mle, path));
\endcode
Outlineheader constr Setting Constraints
The problem is that the parameters of a function must not take on certain values, either because the function is undefined for those values or because parameters with certain values would not fit the real-world problem.
The solution is to rewrite the function being maximized such that the function is continuous at the constraint boundary but takes a steep downward slope. The unconstrained maximization routines will be able to search a continuous function but will never return a solution that falls beyond the parameter limits.
If you give it a likelihood function with no regard to constraints plus an array of constraints,
\ref apop_maximum_likelihood will combine them to a function that fits the above description and search accordingly.
A constraint function must do three things:
\li It must check the constraint, and if the constraint does not bind (i.e. the parameter values are OK), then it must return zero.
\li If the constraint does bind, it must return a penalty, that indicates how far off the parameter is from meeting the constraint.
\li if the constraint does bind, it must set a return vector that the likelihood function can take as a valid input. The penalty at this returned value must be zero.
The idea is that if the constraint returns zero, the log likelihood function will
return the log likelihood as usual, and if not, it will return the log likelihood at
the constraint's return vector minus the penalty. To give a concrete example, here
is a constraint function that will ensure that both parameters of a two-dimensional
input are both greater than zero. As with the constraints for many of the models that
ship with Apophenia, it is a wrapper for \ref apop_linear_constraint.
\code
static long double beta_zero_greater_than_x_constraint(apop_data *data, apop_model *v){
//constraint is 0 < beta_2
static apop_data *constraint = NULL;
if (!constraint) constraint= apop_data_falloc((1,1,1), 0, 1);
return apop_linear_constraint(v->parameters->vector, constraint, 1e-3);
}
\endcode
\li\ref apop_linear_constraint()
endofdiv
\li\ref apop_estimate_restart(): Restarting an MLE with different settings can improve results.
\li\ref apop_maximum_likelihood(): Rarely used. If a model has no \c estimate element, just call \ref apop_estimate to run an MLE.
\li\ref apop_model_numerical_covariance()
\li\ref apop_numerical_gradient()
endofdiv
Outlineheader Miss Missing data
\li\ref apop_data_listwise_delete()
\li\ref apop_ml_impute()
endofdiv
Outlineheader Legi Legible output
The output routines handle four sinks for your output. There is a global variable that
you can use for small projects where all data will go to the same place.
\code
apop_opts.output_type = 's'; //Stdout
apop_opts.output_type = 'f'; //named file
apop_opts.output_type = 'p'; //a pipe or already-opened file
apop_opts.output_type = 'd'; //the database
\endcode
You can also set the output type, the name of the output file or table, and other options
via arguments to individual calls to output functions. See \ref apop_prep_output for the list of options.
C makes minimal distinction between pipes and files, so you can set a
pipe or file as output and send all output there until further notice:
\code
apop_opts.output_type = 'p';
apop_opts.output_pipe = popen("gnuplot", "w");
apop_plot_lattice(...); //see https://github.com/b-k/Apophenia/wiki/gnuplot_snippets
fclose(apop_opts.output_pipe);
apop_opts.output_pipe = fopen("newfile", "w");
apop_data_print(set1);
fprintf(apop_opts.output_pipe, "\nNow set 2:\n");
apop_data_print(set2);
\endcode
Continuing the example, you can always override the global data with
a specific request:
\code
apop_vector_print(v, "vectorfile"); //put vectors in a separate file
apop_matrix_print(m, "matrix_table", .output_type = 'd'); //write to the db
apop_matrix_print(m, .output_pipe = stdout); //now show the same matrix on screen
\endcode
I will first look to the input file name, then the input pipe, then the
global \c output_pipe, in that order, to determine to where I should
write. Some combinations (like output type = \c 'd' and only a pipe) don't
make sense, and I'll try to warn you about those.
What if you have too much output and would like to use a pager, like \c less or \c more?
In C and POSIX terminology, you're asking to pipe your output to a paging program. Here is
the form:
\code
FILE *lesspipe = popen("less", "w");
assert(lesspipe);
apop_data_print(your_data_set, .output_pipe=lesspipe);
pclose(lesspipe);
\endcode
\c popen will search your usual program path for \c less, so you don't have to give a full path.
\li\ref apop_data_print()
\li\ref apop_matrix_print()
\li\ref apop_vector_print()
\li\ref apop_data_show() : alias for \ref apop_data_print limited to \c stdout.
The plot functions produce output for Gnuplot (so output type = \c 'd' again does not
make sense). As above, you can pipe directly to Gnuplot or write to a file. Please
consider these to be deprecated, as there is better graphics support in the works.
\li\ref apop_plot_histogram()
endofdiv
Outlineheader moreasst Assorted
A few more descriptive methods:
\li\ref apop_matrix_pca : Principal component analysis
\li\ref apop_anova : One-way or two-way ANOVA tables
\li\ref apop_rake : Iterative proportional fitting on large, sparse tables
General utilities:
\li\ref Apop_stopif : Apophenia's error-handling and warning-printing macro.
\li\ref apop_opts : the global options
\li\ref apop_regex() : friendlier front-end for POSIX-standard regular expression searching and pulling matches into a \ref apop_data set.
\li\ref apop_text_paste(): produce a single string from a grid of text
\li\ref apop_system()
Math utilities:
\li\ref apop_matrix_is_positive_semidefinite()
\li\ref apop_matrix_to_positive_semidefinite()
\li\ref apop_generalized_harmonic()
\li\ref apop_multivariate_gamma()
\li\ref apop_multivariate_lngamma()
\li\ref apop_rng_alloc()
endofdiv
Outlineheader links Further references
For your convenience, here are links to some other libraries you are probably using.
\li <a href="http://www.gnu.org/software/libc/manual/html_node/index.html">The standard C library</a>
\li <a href="http://www.gnu.org/software/gsl/manual/html_node/index.html">The
GSL documentation</a>, and <a href="http://www.gnu.org/software/gsl/manual/html_node/Function-Index.html">its index</a>
\li <a href="http://sqlite.org/lang.html">SQL understood by SQLite</a>
*/
/** \page mingw MinGW
Minimalist GNU for Windows is indeed minimalist: it is not a full POSIX subsystem, and provides no package manager. Therefore, you will have to make some adjustments and install the dependencies yourself.
Matt P. Dziubinski successfully used Apophenia via MinGW; here are his instructions (with edits by BK):
\li get libregex (the ZIP file) from:
http://sourceforge.net/project/showfiles.php?group_id=204414&package_id=306189
\li get libintl (three ZIP files) from:
http://gnuwin32.sourceforge.net/packages/libintl.htm .
download "Binaries", "Dependencies", "Developer files"
\li follow "libintl" steps from:
http://kayalang.org/download/compiling/windows
\li Comment out "alloc.h" in apophenia-0.22/vasprintf/vasnprintf.c
\li Modify \c Makefile, adding -lpthread to AM_CFLAGS (removing -pthread) and -lregex to AM_CFLAGS and LIBS
\li Now compile the main library:
\code
make
\endcode
\li Finally, put one more expected directory in place and install:
\code
mkdir -p -- "/usr/local/Lib/site-packages"
make install
\endcode
\li You will get the usual warning about library paths, and may have to take the specified action:
\code
----------------------------------------------------------------------
Libraries have been installed in:
/usr/local/lib
If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-LLIBDIR' linker flag
See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
\endcode
*/
/** \page optionaldetails Implementation of optional arguments
Optional and named arguments are among the most commonly commented-on features of Apophenia, so this page goes into full detail about the implementation.
To use these features, see the all-you-really-need summary at the \ref designated
page. For a background and rationale, see the blog entry at http://modelingwithdata.org/arch/00000022.htm .
I'll assume you've read both links before continuing.
OK, now that you've read the how-to-use and the discussion of how optional and named arguments can be constructed in C, this page will show how they are done in Apophenia. The level of details should be sufficient to implement them in your own code if you so desire.
There are three components to the process of generating optional arguments as implemented here:
\li Produce a \c struct whose elements match the arguments to the function.
\li Write a wrapper function that takes in the struct, unpacks it, and calls the original function.
\li Write a macro that makes the user think the wrapper function is the real thing.
None of these steps are really rocket science, but there is a huge amount of redundancy.
Apophenia includes some macros that reduce the boilerplate redundancy significantly. There are two layers: the C-standard code, and the script that produces the C-standard code.
We'll begin with the C-standard header file:
\code
#ifdef APOP_NO_VARIADIC
void apop_vector_increment(gsl_vector * v, int i, double amt);
#else
void apop_vector_increment_base(gsl_vector * v, int i, double amt);
apop_varad_declare(void, apop_vector_increment, gsl_vector * v; int i; double amt);
#define apop_vector_increment(...) apop_varad_link(apop_vector_increment, __VA_ARGS__)
#endif
\endcode
First, there is an if/else that allows the system to degrade gracefully
if you are sending C code to a parser like swig, whose goals differ
too much from straight C compilation for this to work. Just set \c
APOP_NO_VARIADIC to produce a plain function with no variadic support.
Else, we begin the above steps. The \c apop_varad_declare line expands to the following:
\code
typedef struct {
gsl_vector * v; int i; double amt ;
} variadic_type_apop_vector_increment;
void variadic_apop_vector_increment(variadic_type_apop_vector_increment varad_in);
\endcode
So there's the ad-hoc struct and the declaration for the wrapper
function. Notice how the arguments to the macro had semicolons, like a
struct declaration, rather than commas, because the macro does indeed
wrap the arguments into a struct.
Here is what the \c apop_varad_link would expand to:
\code
#define apop_vector_increment(...) variadic_apop_increment_base((variadic_type_apop_vector_increment) {__VA_ARGS__})
\endcode
That gives us part three: a macro that lets the user think that they are
making a typical function call with a set of arguments, but wraps what
they type into a struct.
Now for the code file where the function is declared. Again, there is is an \c APOP_NO_VARIADIC wrapper. Inside the interesting part, we find the wrapper function to unpack the struct that comes in.
\code
\#ifdef APOP_NO_VARIADIC
void apop_vector_increment(gsl_vector * v, int i, double amt){
\#else
apop_varad_head( void , apop_vector_increment){
gsl_vector * apop_varad_var(v, NULL);
Apop_assert(v, "You sent me a NULL vector.");
int apop_varad_var(i, 0);
double apop_varad_var(amt, 1);
apop_vector_increment_base(v, i, amt);
}
void apop_vector_increment_base(gsl_vector * v, int i, double amt){
#endif
v->data[i * v->stride] += amt;
}
\endcode
The
\c apop_varad_head macro just reduces redundancy, and will expand to
\code
void variadic_apop_vector_increment (variadic_type_variadic_apop_vector_increment varad_in)
\endcode
The function with this header thus takes in a single struct, and for every variable, there is a line like
\code
double apop_varad_var(amt, 1);
\endcode
which simply expands to:
\code
double amt = varad_in.amt ? varad_in.amt : 1;
\endcode
Thus, the macro declares each not-in-struct variable, and so there will need to be one such declaration line for each argument. Apart from requiring declarations, you can be creative: include sanity checks, post-vary the variables of the inputs, unpack without the macro, and so on. That is, this parent function does all of the bookkeeping, checking, and introductory shunting, so the base function can just do the math. Finally, the introductory section will call the base function.
The setup goes out of its way to leave the \c _base function in the public namespace, so that those who would prefer speed to bounds-checking can simply call that function directly, using standard notation. You could eliminate this feature by just merging the two functions.
<b>The m4 script</b>
The above is all you need to make this work: the varad.h file, and the above structures. But there is still a lot of redundancy, which can't be eliminated by the plain C preprocessor.
Thus, in Apophenia's code base (the one you'll get from checking out the git repository, not the gzipped distribution that has already been post-processed) you will find a pre-preprocessing script that converts a few markers to the above form. Here is the code that will expand to the above C-standard code:
\code
//header file
APOP_VAR_DECLARE void apop_vector_increment(gsl_vector * v, int i, double amt);
//code file
APOP_VAR_HEAD void apop_vector_increment(gsl_vector * v, int i, double amt){
gsl_vector * apop_varad_var(v, NULL);
Apop_assert(v, "You sent me a NULL vector.");
int apop_varad_var(i, 0);
double apop_varad_var(amt, 1);
APOP_VAR_END_HEAD
v->data[i * v->stride] += amt;
}
\endcode
It is obviously much shorter. The declaration line is actually a C-standard declaration with the \c APOP_VAR_DECLARE preface, so you don't have to remember when to use semicolons. The function itself looks like a single function, but there is again a marker before the declaration line, and the introductory material is separated from the main matter by the \c APOP_VAR_END_HEAD line. Done right, drawing a line between the introductory checks or initializations and the main function can really improve readability.
The m4 script inserts a <tt>return function_base(...)</tt> at the end of the header
function, so you don't have to. If you want to call the funtion before the last line, you
can do so explicitly, as in the expansion above, and add a bare <tt>return;</tt> to
guarantee that the call to the base function that the m4 script will insert won't ever be
reached.
One final detail: it is valid to have types with commas in them---function arguments. Because commas get turned to semicolons, and m4 isn't a real parser, there is an exception built in: you will have to replace commas with exclamation marks in the header file (only). E.g.,
\code
APOP_VAR_DECLARE apop_data * f_of_f(apop_data *in, void *param, int n, double (*fn_d)(double ! void * !int));
\endcode
m4 is POSIX standard, so even if you can't read the script, you have the program needed to run it. For example, if you name it \c prep_variadics.m4, then run
\code
m4 prep_variadics.m4 myfile.m4.c > myfile.c
\endcode
*/
/** \page dataprep Data prep rules
There are a lot of ways your data can come in, and we would like to run estimations on a reasonably standardized form.
First, this page will give a little rationale, which you are welcome to skip, and then will present the set of rules.
\section dataones Dealing with the ones column
Most standard regression-type estimations require or generally expect
a constant column. That is, the 0th column of your data is a constant (one), so the first parameter
\f$\beta_1\f$ is slightly special in corresponding to a constant rather than a variable.
However, there are some estimations that do not use the constant column.
\em "Why not implicitly assume the ones column?"
Some stats packages implicitly assume a constant column, which the user never sees. This violates the principle of transparency
upon which Apophenia is based, and is generally annoying. Given a data matrix \f$X\f$ with the estimated parameters \f$\beta\f$,
if the model asserts that the product \f$X\beta\f$ has meaning, then you should be able to calculate that product. With a ones column, a dot product is one line \c apop_dot(x, your_est->parameters, 0, 0)); without a ones column, the problem is left as an unpleasant exercise for the reader.
\subsection datashunt Shunting columns around.
Each regression-type estimation has one dependent variable and several independent. In the end, we want the dependent variable to be in the vector element. However, continuing the \em "lassies faire" tradition, doing major surgery on the data, such as removing a column and moving in all subsequent columns, is more invasive than an estimation should be.
\subsection datarules The rules
So those are the two main considerations in prepping data. Here are the rules, intended to balance those considerations:
\subsubsection datamatic The automatic case
There is one clever trick we can use to resolve both the need for a ones column and for having the dependent column in the vector: given a data set with no vector element and the dependent variable in the first column of the matrix, we can copy the dependent variable into the vector and then replace the first column of the matrix with ones. The result
fits all of the above expectations.
You as a user merely have to send in a \c apop_data set with no vector and a dependent column in the first column.
\subsubsection dataprepped The already-prepped case
If your data has a vector element, then the prep routines won't try to force something to be there. That is, they won't move anything, and won't turn anything into a constant column. If you don't want to use a constant column, or your data has already been prepped by an estimation, then this is what you want.
You as a user just have to send in a \c apop_data set with a filled vector element.
*/
/**
\page gentle A quick overview
This is a "gentle introduction" to the Apophenia library. It is intended
to give you some initial bearings on the typical workflow and the concepts and tricks that
the manual pages assume you have met.
This introduction assumes you already have some familiarity with C, how to compile a program, and how to use a debugger. If you want to install Apophenia now so you can try the samples on this page, see the \ref setup page.
An outline of this overview:
\li Apophenia fills a space between traditional C libraries and stats packages.
\li The \ref apop_data structure represents a data set (of course). Data sets are inherently complex,
but there are many functions that act on \ref apop_data sets to make life easier.
\li The \ref apop_model encapsulates the sort of actions one would take with a model, like estimating model parameters or predicting values based on new inputs.
\li Databases are great, and a perfect fit for the sort of paradigm here. Apophenia
provides functions to make it easy to jump between database tables and \ref apop_data sets.
\par The opening example
Setting aside the more advanced applications and model-building tasks, let us begin with
the workflow of a typical fitting-a-model project using Apophenia's tools:
\li Read the raw data into the database using \ref apop_text_to_db.
\li Use SQL queries handled by \ref apop_query to massage the data as needed.
\li Use \ref apop_query_to_data to pull some of the data into an in-memory \ref apop_data set.
\li Call a model estimation such as \code apop_estimate (data_set, apop_ols)\endcode or \code apop_estimate (data_set, apop_probit)\endcode to fit parameters to the data. This will return an \ref apop_model with parameter estimates.
\li Interrogate the returned estimate, by dumping it to the screen with \ref apop_model_print, sending its parameters and variance-covariance matrices to additional tests (the \c estimate step runs a few for you), or send the model's output to be input to another model.
Here is a concrete example of most of the above steps, which you can compile and run, to be discussed in detail below.
The program:
\include ols.c
To run this, you will need a file named <tt>data</tt> in comma-separated form, so
here is a set sufficient for the demonstration. The first column is the dependent variable; the
remaining columns are the independent:
\code
Y, X_1, X_2, X_3
2,3,4,5
1,2,9,3
4,7,9,0
2,4,8,16
1,4,2,9
9,8,7,6
\endcode
If you saved the code to <tt>sample.c</tt> and don't have a \ref makefile or other
build system, then you can compile it with
\code
gcc sample.c -std=gnu99 -lapophenia -lgsl -lgslcblas -lsqlite3 -o run_me
\endcode
or
\code
clang sample.c -lapophenia -lgsl -lgslcblas -lsqlite3 -o run_me
\endcode
and then run it with <tt>./run_me</tt>. This compile line will work on any system with all the requisite tools,
but for full-time work with this or any other C library, you will probably want to write a \ref makefile .
The results are unremarkable---this is just an ordinary regression on too few data
points---but it does give us some lines of code to dissect.
The first two lines in \c main() make use of a database.
I'll discuss the value of the database step more at the end of this page, but for
now, note that there are several functions, \ref apop_query, and \ref
apop_query_to_data being the ones you will most frequently be using, that will allow you to talk to and pull data from either an SQLite or mySQL/mariaDB database.
\par Designated initializers
Like this line,
\code
apop_text_to_db(.text_file="data", .tabname="d");
\endcode
many Apophenia functions accept named, optional arguments. To give another example,
the \ref apop_data set has the usual row and column numbers, but also row and column
names. So you should be able to refer to a cell by any combination of name or number;
for the data set you read in above, which has column names, all of the following work:
\code
x = apop_data_get(data, 2, 3); //x == 0
x = apop_data_get(data, .row=2, .colname="X_3"); // x == 0
apop_data_set(data, 2, 3, 18);
apop_data_set(data, .colname="X_3", .row=2, .val= 18);
\endcode
Default values mean that the \ref apop_data_get, \ref apop_data_set, and \ref apop_data_ptr functions handle matrices, vectors, and scalars sensibly:
\code
//Let v be a hundred-element vector:
apop_data *v = apop_data_alloc(100);
double x1 = apop_data_get(v, 1);
apop_data_set(v, 2, .val=x1);
//A 100x1 matrix behaves like a vector
apop_data *m = apop_data_alloc(100, 1);
double m1 = apop_data_get(v, 1);
//let s be a scalar stored in a 1x1 apop_data set:
apop_data *v = apop_data_alloc(1);
double *scalar = apop_data_ptr(s);
\endcode
This form may be new to users of less user-friendly C libraries, but it it fully
conforms to the C standard (ISO/IEC 9899:2011). See the \ref designated page for details.
\section apop_data
A lot of real-world data processing is about quotidian annoyances about text versus
numeric data or dealing with missing values, and the the \ref apop_data set and its
many support functions are intended to make data processing in C easy. Some users of
Apophenia use the library only for its \ref apop_data set and associated functions. See
the "data sets" section of the outline page (linked from the header of this page)
for extensive notes on using the structure.
The structure includes seven parts:
\li a vector
\li a matrix
\li a grid of text elements
\li a vector of weights
\li names for everything: row names, a vector name, matrix column names, text names.
\li a link to a second page of data
\li an error marker
This is not a generic and abstract ideal, but is really the sort of mess that data sets look like. For
example, here is some data for a weighted OLS regression. It includes an outcome
variable in the vector, dependent variables in the matrix and text grid,
replicate weights, and column names in bold labeling the variables:
\htmlinclude apop_data_fig.html
Apophenia will generally assume that one row across all of these elements
describes a single observation or data point.
See above for some examples of getting and setting individual elements.
Also, \ref apop_data_get, \ref apop_data_set, and \ref apop_data_ptr consider the vector to be the -1st column,
so using the data set in the figure, <tt>apop_data_get(sample_set, .row=0, .col=-1) == 1</tt>.
\par Reading in data
As per the example above, use \ref apop_text_to_data or \ref apop_text_to_db and then \ref apop_query_to_data.
\par Subsets
There are many macros to get subsets of the data. Each generates what is
considered to be a disposable view: once the variable goes out of scope (by the usual C
rules of scoping), it is no longer valid. However, these structures are all wrappers for pointers
to the base data, so all operations on the data view affect the base data.
\include simple_subsets.c
All of these slicing routines are macros, because they generate several
background variables in the current scope (something a function can't do). Traditional
custom is to put macro names in all caps, like \c APOP_DATA_ROWS, which to modern
sensibilities looks like yelling. The custom has a logic: there are ways to hang
yourself with macros, so it is worth distinguishing them typographically.
The documentation always uses a single capital.
Notice that all of the slicing macros return nothing, so there is nothing to do with one
but put it on a line by itself. This limits the number of misuses.
\par Basic manipulations
The outline page (which you can get to via a link next to the snowflake header at the top of every page on this site) lists a number of other manipulations of data sets, such as
\ref apop_data_listwise_delete for quick-and-dirty removal of observations with <tt>NaN</tt>s,
\ref apop_data_split / \ref apop_data_stack,
or \ref apop_data_sort to sort all elements by a single column.
\par Apply and map
If you have an operation of the form <em>for each element of my data set, call this
function</em>, then you can use \ref apop_map to do it. You could basically do everything you
can do with an apply/map function via a \c for loop, but the apply/map approach is clearer
and more fun. Also, if you set OpenMP's <tt>omp_set_num_threads(N)</tt> for any \c N
greater than 1 (the default on most systems is the number of CPU cores), then the work
of mapping will be split across multiple CPU threads. See the outline \> data sets \>
map/apply section for a number of examples.
\par Text
Text in C is annoying. C already treats strings as pointer-to-characters, so a grid
of text data is a pointer-to-pointer-to-pointer-to-character. The text grid in the
\ref apop_data structure actually takes this form, but functions are provided to do
most or all the pointer work for you. The \ref apop_text_alloc function
is really a realloc function: you can use it to resize the text grid as necessary. The
\ref apop_text_add function will do the pointer work in copying a single string to the
grid. Functions that act on entire data sets, like \ref apop_data_rm_rows, handle the
text part as well.
You have <tt>your_data->textsize[0]</tt> rows and <tt>your_data->textsize[1]</tt> columns. If you are using only the functions to this point, then empty elements are a blank string (<tt>""</tt>), not \c NULL.
For reading individual elements, refer to the \f$(i,j)\f$th text element via <tt>your_data->text[i][j]</tt>.
\par Errors
Many functions will set the <tt>error</tt> element of the \ref apop_data structure being operated on if anything goes wrong. You can use this to halt the program or take corrective action:
\code
apop_data *the_data = apop_query_to_data("select * from d");
if (the_data->error) exit(1);
\endcode
\par The whole structure
Here is a diagram of all of Apophenia's structures and how they
relate. It is taken from this
<a href="http://modelingwithdata.org/pdfs/cheatsheet.pdf">cheat sheet</a> (2 page PDF),
which will be useful to you if only because it lists some of the functions that act on
GSL vectors and matrices that are useful (in fact, essential) but out of the scope of the Apophenia documentation.
\image html structs.png
\image latex structs.png
All of the elements of the \ref apop_data structure are laid out at middle-left. You have
already met the vector, matrix, and weights, which are all a \c gsl_vector or \c gsl_matrix.
The diagram shows the \ref apop_name structure, which has received little mention so far because names
basically take care of themselves. Just use \ref apop_data_add_names to add names to your data
set and \ref apop_name_stack to copy from one data set to another.
The \ref apop_data structure has a \c more element, for when your data is best expressed
in more than one page of data. Use \ref apop_data_add_page, \ref apop_data_rm_page,
and \ref apop_data_get_page. Output routines will sometimes append an extra page of
auxiliary information to a data set, such as pages named <tt>\<Covariance\></tt> or
<tt>\<Factors\></tt>. The angle-brackets indicate a page that describes the data set
but is not a part of it (so an MLE search would ignore that page, for example).
Now let us move up the structure diagram to the \ref apop_model structure.
\section gentle_model apop_model
Even restricting ourselves to the most basic operations, there are a lot of things that
we want to do with our models: estimating the parameters of a model (like the mean and
variance of a Normal distribution) given data, or drawing random numbers, or showing the
expected value, or showing the expected value of one part of the data given fixed values
for the rest of it. The \ref apop_model is intended to encapsulate most of these desires
into one object, so that models can easily be swapped around, modified to create new models,
compared, and so on.
From the figure above, you can see that the \ref apop_model structure is pretty big,
including a number of informational items, key being the \c parameters, \c data, and
\c info elements, a list of settings to be discussed below, and a set of procedures for
many operations. Its contents are not (entirely) arbitrary: the theoretical basis for
what is and is not included in an \ref apop_model, as well as its overall intent, are
described in this <a href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf">research
report</a>.
There are helper functions that will allow you to avoid deailing with the model internals. For example, the \ref apop_estimate helper function means you never have to look at the model's \c estimate method (if it even has one), and you will simply pass the model to a function, as with the above form:
\code
apop_model *est = apop_estimate(data, apop_ols);
\endcode
\li Apophenia ships with a broad set of models, like \ref apop_ols, \ref apop_dirichlet,
\ref apop_loess, and \ref apop_pmf (probability mass function); see the full list on <a href="http://apophenia.info/group__models.html">the models documentation page</a>. You would estimate the
parameters of any of them using the form above, with the appropriate model in the second
slot of the \ref apop_estimate call.
\li The models that ship with Apophenia, like \ref apop_ols, include the procedures and some metadata, but are of course not yet estimated using a data set (i.e., <tt>data == NULL</tt>, <tt>parameters == NULL</tt>). The line above generated a new
model, \c est, which is identical to the base OLS model but has estimated parameters
(and covariances, and basic hypothesis tests, a log likelihood, AIC_c, BIC, et cetera), and a \c data pointer to the \ref apop_data set used for estimation.
\li You will mostly use the models by passing them as inputs to
functions like \ref apop_estimate, \ref apop_draw, or \ref apop_predict; more examples below.
Other than \ref apop_estimate, most require a parameterized model like \c est. After all, it doesn't make sense to
draw from a Normal distribution until you've specified its mean and standard deviation.
\li If you know what the parameters should be, use \ref apop_model_set_parameters. E.g.
\code
apop_model *std_normal = apop_model_set_parameters(apop_normal, 0, 1);
apop_data *a_thousand_normals = apop_model_draws(std_normal, 1000);
apop_model *poisson = apop_model_set_parameters(apop_poisson, 1.5);
apop_data *a_thousand_waits = apop_model_draws(poisson, 1000);
\endcode
\li You can use \ref apop_model_print to print the various elements to screen.
\li You can combine and transform models with functions such as \ref apop_model_fix_params, \ref apop_model_coordinate_transform, or \ref apop_model_mixture. Each of these functions produce a new model, which can be estimated, re-combined, or otherwise used like any other model.
\li Writing your own models won't be covered in this introduction, but it can be pretty easy to
copy and modify the procedures of an existing model to fit your needs. When in doubt, delete a procedure, because any procedures that are missing will have
defaults filled when used by functions like \ref apop_estimate (which uses \ref
apop_maximum_likelihood) or \ref apop_cdf (which uses integration via random draws). See \ref modeldetails for details.
\li There's a simple rule of thumb for remembering the order of the arguments to most of
Apophenia's functions, including \ref apop_estimate : the data always comes first.
\par Settings
Every model, and every method one would apply to a model, is prone to have a list of
settings: how many bins in the histogram, at what tolerance does the maximum likelihood
search end, what are the models being combined in the mixture?
Apophenia organizes settings in <em>settings groups</em>, which are then attached
to models. In the following snippet, we specify a Beta distribution prior. If the
likelihood function were a Binomial distribution, \ref apop_update knows the closed-form
posterior for a Beta-Binomial pair, but in this case it will have to run Markov chain
Monte Carlo. Attach a \ref apop_mcmc_settings group to the prior to specify details
of how the run should work.
For a likelihood, we generate an empirical distribution---a PMF---from an input
data set; you will often use the \ref apop_pmf to turn a data set into an distribution.
When we call \ref apop_update on the last line, it already has all of the above info
on hand.
\code
apop_model *beta = apop_model_set_parameters(apop_beta, 0.5, 0.25);
Apop_settings_add_group(beta, apop_mcmc, .burnin = 0.2, .periods =1e5);
apop_model *my_pmf = apop_estimate(your_data, apop_pmf);
apop_model *posterior = apop_update(.prior= beta, .likelihood = my_pmf);
\endcode
You will encounter model settings often when doing nontrivial work with models. All
can be set using a form like above. See the <a href="http://apophenia.info/group__settings.html">settings documentation page</a>
for the full list of options.
There is a full discussion of using and writing settings groups in the <a href="http://apophenia.info/outline.html">outline page</a> under the Models heading.
\par Databases and models
Returning to the introductory example, you saw that (1) the
library expects you to keep your data in a database, pulling out the
data as needed, and (2) that the workflow is built around
\ref apop_model structures.
Starting with (2),
if a stats package has something called a <em>model</em>, then it is
probably of the form Y = [an additive function of <b>X</b>], such as \f$y = x_1 +
\log(x_2) + x_3^2\f$. Trying new models means trying different
functional forms for the right-hand side, such as including \f$x_1\f$ in
some cases and excluding it in others. Conversely, Apophenia is designed
to facilitate trying new models in the broader sense of switching out a
linear model for a hierarchical, or a Bayesian model for a simulation.
A formula syntax makes little sense over such a broad range of models.
As a result, the right-hand side is not part of
the \ref apop_model. Instead, the data is assumed to be correctly formatted, scaled, or logged
before being passed to the model. This is where part (1), the database,
comes in, because it provides a proxy for the sort of formula specification language above:
\code
apop_data *testme= apop_query_to_data("select y, x1, log(x2), pow(x3, 2) from data");
apop_model *est = apop_estimate(testme, apop_ols);
\endcode
Generating factors and dummies is also considered data prep, not model
internals. See \ref apop_data_to_dummies and \ref apop_data_to_factors.
Now that you have \c est, an estimated model, you can interrogate it. This is really where Apophenia and its encapsulated
model objects shine, because you can do more than just admire the parameter estimates on
the screen: you can take your estimated data set and fill in or generate new data, use it
as an input to the parent distribution of a hierarchical model, et cetera. Some simple
examples:
\code
//If you have a new data set with missing elements (represented by NaN), you can fill in predicted values:
apop_predict(new_data_set, est);
apop_data_show(new_data_set);
//Fill a matrix with random draws.
apop_data *d = apop_model_draws(est, .count=1000);
//How does the AIC_c for this model compare to that of est2?
printf("ΔAIC_c=%g\n", apop_data_get(est->info, .rowname="AIC_c")
- apop_data_get(est2->info, .rowname="AIC_c"));
\endcode
\par Testing
Here is the model for all testing within Apophenia:
\li Calculate a statistic.
\li Describe the distribution of that statistic.
\li Work out how much of the distribution is (above|below|closer to zero than) the statistic.
There are a handful of named tests that produce a known statistic and then compare to a
known distribution, like \ref apop_test_kolmogorov or \ref apop_test_fisher_exact. For
traditional distributions (Normal, \f$t\f$, \f$\chi^2\f$), use the \ref apop_test convenience
function.
But if your model is not from the textbook, then you have the tools to apply the
above three-step process directly. First I'll give an overview of the three steps,
then another working example.
\li Model parameters are a statistic, and you know that <tt>apop_estimate(your_data,
your_model)</tt> will output a model with a <tt>parameters</tt> element.
\li The distribution of a parameter is also a model, so
\ref apop_parameter_model will also return an \ref apop_model.
\li \ref apop_cdf takes in a model and a data point, and returns the area under the data
point.
Defaults for the parameter models are filled in via bootstrapping or resampling, meaning
that if your model's parameters are decidedly off the Normal path, you can still test
claims about the parameters.
The introductory example ran a standard OLS regression, whose output includes some
standard hypothesis tests; to conclude, let us go the long way and replicate those results
via the general \ref apop_parameter_model mechanism. The results here will of course be
identical, but the more general mechanism can be used in situations where the standard
models don't apply.
The first part of this program is identical to the program above. The second
half executes the three steps uses many of the above tricks: one of the inputs to
\ref apop_parameter_model (which row of the parameter set to use) is sent by adding a
settings group, we pull that row into a separate data set using \ref Apop_r, and we
set its vector value by referring to it as the -1st element.
\include ols2.c
Note that the procedure did not assume the model parameters had a certain form. It
queried the model for the distribution of parameter \c x_1, and if the model didn't have
a closed-form answer then a distribution via bootstrap would be provided. Then that model
was queried for its CDF. [The procedure does assume a symmetric distribution. Fixing this
is left as an exercise for the reader.] For a model like OLS, this is entirely overkill,
which is why OLS provides the basic hypothesis tests automatically. But for models
where the distribution of parameters is unknown or has no closed-form solution, this
may be the only recourse.
This introduction has shown you the \ref apop_data set and some of the functions
associated, which might be useful even if you aren't formally doing statistical work but do have to deal with data with real-world elements like column names and mixed
numeric/text values. You've seen how Apophenia encapsulates as many of a model's
characteristics as possible into a single \ref apop_model object, which you can send with
data to functions like \ref apop_estimate, \ref apop_predict, or \ref apop_draw. Once
you've got your data in the right form, you can use this to simply estimate model
parameters, or as an input to later analysis.
*/
/** \page modeldetails Writing new models
The \ref apop_model is intended to provide a consistent expression of <em>any</em>
model that (implicitly or explicitly) expresses a likelihood of data given parameters,
including traditional linear models, textbook distributions, Bayesian hierarchies,
microsimulations, and any combination of the above. The unifying feature is that
all of the models act over some data space and some parameter space (in some cases
one or both is the empty set), and can assign a likelihood for a fixed pair of
parameters and data given the model. This is a very broad requirement, often used
in the statistical literature. For discussion of the theoretical structures, see <a
href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf"><em>A Useful Algebraic System
of Statistical Models</em></a> (PDF).
This page includes:
\li \ref write_likelihoods, giving a quick overview of how to write a new model from scratch.
\li \ref settingswriting, covering the writing of <em>ad hoc</em> structures to hold model- or method-specific details, like the number of periods for burning in an MCMC run or the number of bins in a histogram.
\li \ref vtables, covering the means of writing special-case routines for functions that are not part of the \ref apop_model itself, including the score.
\li \ref modeldataparts, a detailed list of the requirements for the data (non-function) elements of an \ref apop_model.
\li \ref methodsection, a detailed list of requirements for the method (function) elements of an \ref apop_model.
\section write_likelihoods A walkthrough
Users are encouraged to always use models via the helper functions, like
\ref apop_estimate or \ref apop_cdf. The helper functions do some boilerplate error
checking, and are where the defaults are called: if your model has a \c log_likelihood
method but no \c p method, then \ref apop_p will use exp(\c log_likelihood). If you don't
give an \c estimate method, then \c apop_estimate will call \ref apop_maximum_likelihood.
So the game in writing a new model is to write just enough internal methods to give the helper functions what they need.
In the not-uncommon best case, all you need to do is write a log likelihood function.
Here is how one would set up a model that could be estimated using maximum likelihood:
\li Write a likelihood function. Its header will look like this:
\code
long double new_log_likelihood(apop_data *data, apop_model *m);
\endcode
where \c data is the input data, and \c
m is the parametrized model (i.e. your model with a \c parameters element set by the caller).
This function will return the value of the log likelihood function at the given parameters.
\li Is this a constrained optimization? See the outline page under maximum likelihood methods \f$->\f$ Setting constraints on how to set them. Otherwise, no constraints will be assumed.
\li Write the object:
\code
apop_model *your_new_model = &(apop_model){"The Me distribution",
.vsize=n0, .msize1=n1, .msize2=n2, .dsize=nd,
.log_likelihood = new_log_likelihood };
\endcode
\li The first element is the human-language name for your model.
\li the \c vsize, \c msize1, and \c msize2 elements specify the shape of the parameter set. For example, if it's three numbers in the vector, then set <tt>.vsize=3</tt> and omit the matrix sizes. The default model prep routine will call
<tt>new_est->parameters = apop_data_alloc(vsize, msize1, msize2)</tt>.
\li The \c dsize element is the size of one random draw from your model.
\li It's common to have [the number of columns in your data set] parameters; this
count will be filled in if you specify \c -1 for \c vsize, <tt>msize(1|2)</tt>, or
<tt>dsize</tt>. If the allocation is exceptional in a different way, then you will
need to allocate parameters by writing a custom \c prep method for the model.
\li If there are constraints, add an element for those too.
You already have more than enough that something like this will work (the \c dsize is used for random draws):
\code
apop_model *estimated = apop_estimate(your_data, your_new_model);
\endcode
Once that baseline works, you can fill in other elements of the \ref apop_model as needed.
For example, if you are using a maximum likelihood method to estimate parameters, you can get much faster estimates and better covariance estimates by specifying the dlog likelihood function (aka the score):
\code
void apop_new_dlog_likelihood(apop_data *d, gsl_vector *gradient, apop_model *m){
//some algebra here to find df/dp0, df/dp1, df/dp2....
gsl_vector_set(gradient, 0, d_0);
gsl_vector_set(gradient, 1, d_1);
}
\endcode
The score has to be registered (see below) using
\code
apop_score_insert(apop_new_dlog_likelihood, your_new_model);
\endcode
\section settingswriting Writing new settings groups
Your model may need additional settings or auxiliary information to function, which would require associating a model-specific struct with the model.
Before getting into the detail of how to make model-specific groups of settings work, note that there's a lightweight method of storing sundry settings, so in many cases you can bypass all of the following.
The \ref apop_model structure has a \c void pointer named \c more which you can use to
point to a model-specific struct. If \c more_size is larger than zero (i.e. you set
it to <tt>your_model.more_size=sizeof(your_struct)</tt>), then it will be copied via \c
memcpy by \ref apop_model_copy, and freed by \ref apop_model_free. Apophenia's
estimation routines will never impinge on this item, so do what you wish with it.
The remainder of this subsection describes the information you'll have to provide to make
use of the conveniences described to this point: initialization of defaults, smarter
copying and freeing, and adding to an arbitrarily long list of settings groups attached
to a model. You will need four items: a typedef for the structure itself, plus init, copy, and
free functions. This is the sort of boilerplate that will be familiar to users of
object oriented languages in the style of C++ or Java, but it's really a list of
arbitrarily-typed elements, which makes this feel more like LISP. [And being a
reimplementation of an existing feature of LISP, this section will be macro-heavy.]
\li The settings struct will likely go into a header file, so
here is a sample header for a new settings group named \c ysg_settings, with a dataset, its two sizes, and an owner-of-data marker. <tt>ysg</tt> stands for Your Settings Group; replace that substring with your preferred name in every instance to follow.
\code
typedef struct {
int size1, size2;
char *refs;
apop_data *dataset;
} ysg_settings;
Apop_settings_declarations(ysg)
\endcode
The first item is a familiar structure definition. The last line is a macro that declares the
three functions below. This is everything you would
need in a header file, should you need one. These are just declarations; we'll write
the actual init/copy/free functions below.
The structure itself gets the full name, \c ysg_settings. Everything else is a macro, and so you need only specify \c ysg, and the \c _settings part is filled in. Because of these macros, your \c struct name must end in \c _settings.
If you have an especially simple structure, then you can generate the three functions with these three macros in your <tt>.c</tt> file:
\code
Apop_settings_init(ysg, )
Apop_settings_copy(ysg, )
Apop_settings_free(ysg, )
\endcode
These macros generate appropriate functions to do what you'd expect: allocating the
main structure, copying one struct to another, freeing the main structure.
The spaces after the commas indicate that no special code gets added to
the functions that these macros generate.
You'll never call these funtions directly; they are called by \ref Apop_settings_add_group,
\ref apop_model_free, and other model or settings-group handling functions.
Now that initializing/copying/freeing of
the structure itself is handled, the remainder of this section will be about how to
add instructions for the struture internals, like data that is pointed to by the structure elements.
\li For the allocate function, use the above form if everything in your code defaults to zero/\c NULL.
In most cases, though, you will need a new line declaring a default for every element in your structure. There is a macro to help with this too.
These macros will define for your use a structure named \c in, and an output pointer-to-struct named \c out.
Continuing the above example:
\code
Apop_settings_init (ysg,
Apop_assert(in.size1, "I need you to give me a value for size1. Stopping.");
Apop_varad_set(size2, 10);
Apop_varad_set(dataset, apop_data_alloc(out->size1, out->size2));
Apop_varad_set(refs, malloc(sizeof(int)));
*refs=1;
)
\endcode
Now, <tt>Apop_settings_add(a_model, ysg, .size1=100)</tt> would set up a group with a 100-by-10 data set, and set the owner bit to one.
\li Some functions do extensive internal copying, so you will need a copy function even if you don't do any explicit calls to \ref apop_model_copy. The default above simply copies every element in the structure. Pointers are copied, giving you two pointers pointing to the same data. We have to be careful to prevent double-freeing later.
\code
//The elements of the set to copy are all copied, and then make one additional modification:
Apop_settings_copy (ysg,
(*refs)++;
)
\endcode
\li The struct itself is freed by boilerplate code, but add code in the free function
to free data pointed to by pointers in the main struture. The macro defines a
pointer-to-struct named \c in for your use. Continuing the example:
\code
Apop_settings_free (ysg,
if (!(--in->refs)) {
free(in->dataset);
free(in->refs);
}
)
\endcode
With those three macros in place and the header as above, Apophenia will treat your
settings group like any other, and users can use \ref Apop_settings_add_group to
populate it and attach it to any model.
\section vtables Registering new methods in vtables
For any given function (e.g., entropy, the dlog likelihood, Bayesian updating), there is
probably a special case for well-known models like the Normal distribution. Rather than
any procedure that could have a special-case calculation to the \c apop_model struct,
functions may maintain a registry of models and associated special-case procedures.
This subsection will discuss how to add
a function to an existing vtable.
\li See \ref apop_update, \ref apop_score, \ref apop_predict, \ref apop_model_print, and \ref
apop_parameter_model for examples and procedure-specific details.
\li Write a function following the given type definition.
\li Use the associated <tt>_vtable_add</tt> function to add the function and associate it
with the given model. For example, to add a Beta-binomial routine named \c betabinom
to the registry of Bayesian updating routines, use <tt>apop_update_vtable_add(betabinom,
apop_beta, apop_binomial)</tt>.
\li Lookups happen based on a hash that takes into account the elements of the model
that will be used in the calculation. For example, the \c apop_update_hash takes in two
models and calculates the hash based on the address of the prior's \c draw method and
the likelihood's \c log_likelihood or \c p method. Thus, a vtable lookup for new models
that re-use the same methods (at the same addresses in memory) will still find the
same special-case function.
\li If you need to deregister the function, use the associated deregister function,
e.g. <tt>apop_update_vtable_drop(apop_beta, apop_binomial)</tt>. You can guarantee that a method will not be re-added by following up the <tt>_drop</tt> with, e.g., <tt>apop_update_vtable_add(NULL, apop_beta, apop_binomial)</tt>.
\li Calls to <tt>..._vtable_add</tt> are typically placed in the \c prep method of the given model, thus ensuring that the auxiliary functions are registered after the first time the model is sent to \ref apop_estimate.
This overview will not go into detail about setting up a new vtable. Briefly:
\li See the existing setups in vtables.h.
\li Cut/paste one and do a search and replace to change the name to match your desired use.
\li Set the typedef to describe the functions that get added to the vtable.
\li Rewrite the hash function to check the part of the inputs that interest you. For
example, the update vtable associates functions with the \c draw, \c log_likelihood,
and \p methods of the model. A model where these elements are identical but the name
is changed will still match.
\section modeldataparts The data elements
The remainder of this page covers the detailed expectations regarding the elements
of the \ref apop_model structure. I begin with the data (non-function) elements,
and then cover the method (function) elements. Some of the following will be
requirements for all models and some will be advice to authors; I use the accepted
definitions of <a href="http://tools.ietf.org/html/rfc2119">"must", "shall", "may"</a>
and related words.
\subsection datasubsec data
\li Each row should be a single observation.
For example, \ref apop_bootstrap_cov depends on each row being an iid observation to function correctly.
Calculating the Bayesian Information Criterion (BIC) requires knowing
the number of observations in the data, and assumes that row count==observation count.
For complex data, the \ref apop_data_pack and \ref apop_data_unpack functions can help with this.
\li Some functions (bootstrap again, or many uses of \ref apop_kl_divergence) use \ref
apop_draw to use your model's RNG (or a default) to draw a
value, write it to the matrix element of the data set, and then move on to an
estimation or other step. In this case, the data sent in will be entirely in the \c
->matrix element of the \ref apop_data set sent to model methods. Your \c likelihood, \c p, \c cdf, and \c estimate routines
must accept data as a single row of the matrix of the \ref apop_data set for such functions to work.
\li Your routines may accept other data formats, as per contract with the user.
For example, regression-type functions use a function named \c ols_shuffle
to convert a matrix where the first column is the dependent variable to a data
set with dependent variable in the vector and a column of ones in the first
matrix column. By checking for a vector, the prep function knows whether to do
the shuffling or not. Most univariate distributions take each scalar element as
a separate data point; having one data point per row is a special case.
\subsection paramsubsec Parameters, vsize, msize1, msize2
\li The sizes will be used by the \c prep method of the model; see below. Given the model \c m and its elements \c m.vsize, \c m.msize1, \c m.msize2,
functions that need to allocate a parameter set will do so via <tt>apop_data_alloc(m.vsize, m.msize1, m.msize2)</tt>.
\li As a special case, if you set any of \c .vsize, \c .msize1, or \c .msize2
to \c -1, then the default prep method will set that size to the number of columns in
the input data. This is what you want for regression methods, where there is one parameter per independent variable.
\subsection infosubsec Info
\li The first page, named \c <info> is typically a list of scalars. Nothing is guaranteed, but the elements may include:
\li AIC: <a href="https://en.wikipedia.org/wiki/Akaike's_Information_Criterion">Aikake Information Criterion</a>
\li AIC_c: AIC with a finite sample correction. "<b>Generally, we advocate the use of AIC_c when the ratio \f$n/K\f$ is small (say \f$< 40\f$)</b>" [Kenneth P. Burnham, David R. Anderson: <em>Model Selection and Multi-Model Inference</em>, p 66, emphasis in original.]
\li BIC: <a href="https://en.wikipedia.org/wiki/Bayesian_information_criterion">Bayesian Information Criterion</a>
\li R squared
\li R squared adj
\li log likelihood
\li status.
For those elements that require a count of input data, the calculations assume each row in the input \ref apop_data set is a single datum.
Get these via, e.g., <tt>apop_data_get(your_model->info, .rowname="log likelihood")</tt>.
When writing for any arbitrary function, be prepared to handle \c NaN, indicating that the element is not calculated or saved in the info page by the given model.
\li Several routines will include a \c predict table. The table has these rows:
\li row (optional)
\li col (optional)
\li observed
\li predicted
\li residual
For OLS-type estimations, each row corresponds to the row in the original data. For
filling in of missing data, the elements may appear anywhere, so the row/col indices are
essential.
\subsection settingsgroupmention settings, more
In object-oriented jargon, settings groups are the private elements of the data set,
to be pulled out in certain contexts, and ignored in all others. Therefore, there are
no rules about internal use. The \c more element of the \ref apop_model provides a lightweight
means of attaching an arbitrary struct to a model. See \ref settingswriting above for details.
\li As many settings groups of different types as desired can be added to a single \ref apop_model.
\li One \ref apop_model can not hold two settings groups of the same type. Re-additions cause the removal of the previous version of the group.
\section methodsection Methods
\subsection psubsection p, log_likelihood
\li Function headers look like <tt>long double your_p_or_ll(apop_data *d, apop_model *params)</tt>.
\li The inputs are an \ref apop_data set and an \ref apop_model, which should include a filled <tt>->parameters</tt> element.
\li We assume that the parameters have been set, by users via \ref apop_estimate or \ref apop_model_set_parameters, or by \ref apop_maximum_likelihood by its search algorithms. if the parameters are necessary, the function shall check that the parameters are not \c NULL and set the model's \c error element to \c 'p' if they are.
\li Return \c NaN on errors. If an error in the input model is found, the function may set the input model's \c error element to an appropriate \c char value.
\li If observations are assumed to be iid, you can probably use \ref apop_map_sum to write the core of the log likelihood function.
\li If your model includes both \ log_likelihood and \c p methods, it must be the case that <tt>log(p(d, m))</tt> equals <tt>log_likelihood(d, m)</tt> for all \c d and \c m.
\subsection prepsubsection prep
\li Function header looks like <tt>void your_prep(apop_data *data, apop_model *params)</tt>.
\li If \c vsize, \c msize1, or \c msize2 are -1, then the prep function will set them to the width of the input data.
\li If \c dsize is -1, then the prep function shall set it to the width of the input data.
\li If the \c parameters element is not allocated, the function shall allocate it via <tt>apop_data_alloc(vsize, msize1, msize2)</tt> (or equivalent).
\li The model's <tt>data</tt> pointer shall be set to point to the input data.
\li The input data may be modified by the prep routine. For example, the OLS prep routine shuffles a single input matrix as described above under \c data.
\li The \c info element shall be allocated and its title set to "<Info>".
\li The default is \ref apop_model_clear. It does all of the above.
\li The prep routine may initialize any desired settings groups. Unless otherwise
stated, these should not be removed if they are already there, so that users can override defaults by adding a settings group before starting an estimation.
\li If any functions associated with the model need to be added to
a vtable (see above), the registration shall happen here. Registration may also happen elsewhere.
\subsection estimatesubsection estimate
\li Function header looks like <tt> void your_estimate(apop_data * data, apop_model *params)</tt>.
\li Assume that the prep routine has already been run. Notably, this means that parameters have been allocated.
\li Assume that the \c parmaeters hold garbage (as in a \c malloc without a subsequent assignment to the <tt>malloc</tt>-ed space).
\li The function modifies the input model, and returns nothing. Note that this is different from the wrapper function, \ref apop_estimate, which makes a copy of its input model, preps it, and then calls the \c estimate function with the prepeped copy.
\li The function shall set the \c parameters of the input model. For consistency with other models, the estimate should be the maximum likelihood estimate, unless otherwise documented.
\li Additional settings may be set.
\li The model's \c <Info> page may be filled with data. For scalars like log likelihood and AIC, use \ref apop_data_add_named_elmt.
\li Data should not be modified by the \c estimate routine; any changes to the data made by \c estimate must be documented.
\li The default called by \ref apop_estimate is \ref apop_maximum_likelihood.
\li If errors occur during processing, set the model's \c error element to a single character. Documentation should include the list of error characters and their meaning.
\subsection drawsubsection draw
\li Function header looks like <tt>void your_draw(double *out, gsl_rng* r, apop_model *params)</tt>
\li Assume that model \c paramters are set, via \ref apop_estimate or \ref apop_model_set_parameters. The author of the draw method should check that \c parameters are not \c NULL and fill the output with NaNs if necessary parameters are not set.
\li User inputs a pointer-to-<tt>double</tt> of length \c dsize; user is expected to make sure that there is adequate space. User also inputs a \c gsl_rng, already allocated (probably via \ref apop_rng_alloc).
\li The function shall fill the space pointed to by the input pointer with a random draw from the data space, where the likelihood of any given observation is proportional to its likelihood as given by the \c p method. Data shall be reduced to a single vector via \ref apop_data_pack if it is not already a single vector.
\subsection cdfsubsection cdf
\li Function header looks like <tt>long double your_cdf(apop_data *d, apop_model *params)</tt>.
\li Assume that \c paramters are set, via \ref apop_estimate or \ref apop_model_set_parameters. The author of the CDF method should check that \c parameters are not \c NULL and return NaN if necessary parameters are not set.
\li The CDF method must accept data as a single row of data in the \c matrix of the input \ref apop_data set (as per a draw produced using the \c draw method). May accept other formats.
\li Returns the percentage of the likelihood function \f$\leq\f$ the first row of the input data. The definition of \f$\leq\f$ is chosen by the model author.
\li If one is not already present, an \c apop_cdf_settings group may be added to the model. See the \ref apop_cdf function for details of its use.
\subsection constraintsubsection constraint
\li Function header looks like <tt>long double your_constraint(apop_data *data, apop_model *params)</tt>.
\li Assume that \c parameters are set, via \ref apop_estimate, \ref apop_model_set_parameters, or the internals of an MLE search. The author of the constraint method should check that \c parameters are not \c NULL and return NaN if necessary parameters are not set.
\li See \ref apop_linear_constraint for a useful basis and/or example. Many constraints can be written as wrappers for this function.
\li If the constraint is met, then return zero.
\li If the constraint fails, then (1) move the \c parameters in the input model to a
constraint-satisfying value, and (2) return the distance between the input parameters and
what you've moved the parameters to. The choice of within-bounds parameters and distance function is left to the author of the constraint function.
*/
|