1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169 3170 3171 3172 3173 3174 3175 3176 3177 3178 3179 3180 3181 3182 3183 3184 3185 3186 3187 3188 3189 3190 3191 3192 3193 3194 3195 3196 3197 3198 3199 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214 3215 3216 3217 3218 3219 3220 3221 3222 3223 3224 3225 3226 3227 3228 3229 3230 3231 3232 3233 3234 3235 3236 3237 3238 3239 3240 3241 3242 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 3253 3254 3255 3256 3257 3258 3259 3260 3261 3262 3263 3264 3265 3266 3267 3268 3269 3270 3271 3272 3273 3274 3275 3276 3277 3278 3279 3280 3281 3282 3283 3284 3285 3286
|
/* Apophenia's narrative documentation
Copyright (c) 2005--2013 by Ben Klemens. Licensed under the GPLv2; see COPYING. */
/** \mainpage Welcome
Apophenia is an open statistical library for working with data sets and statistical
models. It provides functions on the same level as those of the typical stats package
(such as OLS, Probit, or singular value decomposition) but gives the user more
flexibility to be creative in model-building. The core functions are written in C,
but experience has shown them to be easy to bind to in Python/Julia/Perl/Ruby/&c.
It is written to scale well, to comfortably work with gigabyte data sets, million-step simulations, or
computationally-intensive agent-based models.
<em>The goods</em>
The library has been growing and improving since 2005, and has been downloaded well over 1e4 times. To date, it has over two hundred functions and macros to facilitate statistical computing, such as:
\li OLS and family, discrete choice models like Probit and Logit, kernel density estimators, and other common models.
\li Functions for transforming models (like Normal \f$\rightarrow\f$ truncated Normal)
and combining models (produce the cross-product of that truncated Normal with three others,
or use Bayesian updating to combine that cross-product prior with an OLS likelihood
to produce a posterior distribution over the OLS parameters).
\li Database querying and maintenance utilities.
\li Data manipulation tools for splitting, stacking, sorting, and otherwise shunting data sets.
\li Moments, percentiles, and other basic stats utilities.
\li \f$t\f$-tests, \f$F\f$-tests, et cetera.
\li Several optimization methods available for your own new models.
\li It does <em>not</em> re-implement basic matrix operations or build yet another database
engine. Instead, it builds upon the excellent <a href="http://www.gnu.org/software/gsl/">GNU
Scientific</a> and <a href="http://www.sqlite.org/">SQLite</a> libraries. MySQL/mariaDB is also supported.
For the full list of macros, functions, and prebuilt models, check the <a href="group__all__public.html">index</a>.
<em><a href="https://github.com/b-k/apophenia/archive/pkg.zip">Download Apophenia here</a>.</em>
Most users will want to download the latest packaged version linked from the <a
href="https://github.com/b-k/apophenia/archive/pkg.zip">Download
Apophenia here</a> header. See the \ref setup page for detailed setup instructions,
including how to use your package manager to install the Debian package.
<em>The documentation</em>
To start off, have a look at this \ref gentle "Gentle Introduction" to the library.
<a href="outline.html">The outline</a> gives a more detailed narrative.
The <a href="group__all__public.html">index</a> lists every function in the
library, with detailed reference
information. Notice that the header to every page has a link to the outline and the index.
To really go in depth, download or pick up a copy of <a
href="http://modelingwithdata.org">Modeling with Data</a>, which discusses general
methods for doing statistics in C with the GSL and SQLite, as well as Apophenia
itself. <a href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf"><em>A Useful
Algebraic System of Statistical Models</em></a> (PDF) discusses some of the theoretical
structures underlying the library.
There is a <a href="https://github.com/b-k/apophenia/wiki">wiki</a> with some convenience
functions, tips, and so on.
<em>Notable features</em>
Much of what Apophenia does can be done in any typical statistics package. The \ref
apop_data element is much like an R data frame, for example, and there is nothing special
about being able to invert a matrix or take the product of two matrices with a single
function call (\ref apop_matrix_inverse and \ref apop_dot, respectively).
Even more advanced features like Loess smoothing (\ref apop_loess) and the Fisher Exact
Test (\ref apop_test_fisher_exact) are not especially Apophenia-specific. But here are
some things that are noteworthy.
\li It's a C library! You can build applications using Apophenia for the data-processing
back-end of your program, and not worry about the overhead associated with scripting
languages. For example, it is currently used in production for certain aspects of
processing for the U.S. Census Bureau's American Community Survey. And the numeric
routines in your favorite scripting language typically have a back-end in plain C;
perhaps Apophenia can facilitate writing your next one.
\li The \ref apop_model object allows for consistent treatment of distributions, regressions,
simulations, machine learning models, and who knows what other sorts of models you can
dream up. By transforming and combining existing models, it is easy to build complex
models from simple sub-models.
\li For example, the \ref apop_update function does Bayesian updating on any two
well-formed models. If they are on the table of conjugates, that is correctly
handled, and if they are not, an appropriate variant of MCMC produces an empirical
distribution. The output is yet another model, from which you can make random draws,
or which you can use as a prior for another round of Bayesian updating. Outside of
Bayesian updating, the \ref apop_model_metropolis function is good for approximating
other complex models.
\li The maximum likelihood system combines several subsystems into one
form: it will do a few flavors of conjugate gradient search, Nelder-Mead Simplex,
Newton's Method, or Simulated Annealing. You pick the method by a setting attached to
your model. If you want to use a method that requires derivatives and you don't have
a closed-form derivative, the ML subsystem will estimate a numerical gradient for
you. If you would like to do EM-style maximization (all but the first parameter are
fixed, that parameter is optimized, then all but the second parameter are fixed, that
parameter is optimized, ..., looping through dimensions until the change in objective
across cycles is less than <tt>eps</tt>), add a settings group specifying the
tolerance at which the cycle should stop: <tt>Apop_settings_add_group(your_model,
apop_mle, .dim_cycle_tolerance=eps)</tt>.
\li The Iterative Proportional Fitting algorithm, \ref apop_rake, is best-in-breed,
designed to handle large, sparse matrices.
<em>Contribute!</em>
\li Develop a new model object.
\li Contribute your favorite statistical routine.
\li Package Apophenia into an RPM, portage, cygwin package.
\li Report bugs or suggest features.
\li Write bindings for your preferred language. For example, here are
a <a href="https://github.com/swuecho/apophenia-perl">Perl wrapper</a> and early versions
of <a href="http://modelingwithdata.org/arch/00000173.htm"> a Julia wrapper</a> and <a
href="https://github.com/b-k/Rapophenia/">an R wrapper</a> which you could expand upon.
If you're interested, <a href="mailto:fluffmail@f-m.fm">write to the maintainer</a> (Ben Klemens), or join the
<a href="https://github.com/b-k/apophenia">GitHub</a> project.
*/
/** \page eg Some examples
Here are a few pieces of sample code for testing your installation or to give you a
sense of what code with Apophenia's tools looks like.
<em> Two data streams</em>
The sample program here is intended to show how one would integrate Apophenia into an
existing program. For example, say that you are running a simulation of two different
treatments, or say that two sensors are posting data at regular intervals. The goal is to
gather the data in an organized form, and then ask questions of the resulting data set.
Below, a thousand draws are made from the two processes and put into a database. Then,
the data is pulled out, some simple statistics are compiled, and the data is written to
a text file for inspection outside of the program. This program will compile cleanly
with the sample \ref makefile.
\include draw_to_db.c
<em> Run a regression</em>
See \ref gentle for an example of loading a data set and running a simple regression.
<em> A sequence of t-tests</em>
In \ref mapply "The section on map/apply", a new \f$t\f$-test on every row, with all
operations acting on entire rows rather than individual data points:
\include t_test_by_rows.c
In the documentation for \ref apop_query_to_text, a program to list all the tables in an SQLite database.
\include ls_tables.c
<em>Marginal distribution</em>
A demonstration of fixing parameters to create a marginal distribution, via \ref apop_model_fix_params
\include fix_params.c
*/
/** \page setup Setting up
\section cast The supporting cast
To use Apophenia, you will need to have a working C compiler, the GSL (v1.7 or higher) and SQLite installed. mySQL/mariaDB is optional.
\li Some readers may be unfamiliar with modern package managers and common methods for setting up a C development environment; see
<a href="http://modelingwithdata.org/appendix_o.html">Appendix O</a> of <em> Modeling with Data</em> for an introduction.
\li Other pages in this documentation have a few more notes for \ref windows "Windows" users, including \ref mingw users.
\li Install the basics using your package manager. E.g., try
\code
sudo apt-get install make gcc libgsl0-dev libsqlite3-dev
\endcode
or
\code
sudo yum install make gcc gsl-devel libsqlite3x-devel
\endcode
\li If you use a Debian-based system (including Ubuntu), try the version in Debian's Stretch distribution:
\code
#Add Stretch to your sources list
sudo sed -i '$a
deb http://httpredir.debian.org/debian stretch main
deb-src http://httpredir.debian.org/debian stretch main' /etc/apt/sources.list
sudo apt-get update
#Get Apophenia
sudo apt-get install apophenia-bin
#Optional: remove Stretch from your sources list.
sudo sed -i -e '$d' -e '$d' /etc/apt/sources.list
\endcode
Thanks to Jerome Benoit for getting the library packaged and adjusted to Debian standards.
\li If the Debian package is not for you, <a href="https://github.com/b-k/apophenia/archive/pkg.zip">Download Apophenia here</a>.
Once you have the library downloaded, compile it using
\code
tar xvzf apop*tgz && cd apophenia-0.999
./configure && make && make check && sudo make install
\endcode
If you decide not to keep the library on your system, run <tt>sudo make uninstall</tt>
from the source directory to remove it.
\li If you need to install packages in your home directory because you don't have root
permissions, see the \ref notroot page.
\li A \ref makefile will help immensely when you want to compile your program.
\li You can verify that your setup works by trying some \ref eg "sample programs".
\subpage notroot
\subpage makefile
\subpage windows
\li Those who would like to work on a cutting-edge copy of the source code
can get the latest version by cutting and pasting the following onto
the command line. If you follow this route, be sure to read the development README in the
<tt>apophenia</tt> directory this command will create.
\code
git clone https://github.com/b-k/apophenia.git
\endcode
<!--git clone git://apophenia.git.sourceforge.net/gitroot/apophenia/apophenia
cvs -z3 -d:ext:<i>(your sourceforge login)</i>@cvs.sourceforge.net:/cvsroot/apophenia co -P apophenia
cvs -z3 -d:pserver:anonymous@cvs.sf.net:/cvsroot/apophenia checkout -P apophenia
svn co https://apophenia.svn.sourceforge.net/svnroot/apophenia/trunk/apophenia -->
*/
/** \page windows Windows
\ref mingw users, see that page.
If you have a choice, <a href="http://www.cygwin.com">Cygwin</a> is strongly recommended. The setup program is
very self-explanatory. As a warning, it will probably take up >300MB on
your system. You should install at least the following programs:
\li autoconf/automake
\li binutils
\li gcc
\li gdb
\li gnuplot -- for plotting data
\li groff -- needed for the man program, below
\li gsl -- the engine that powers Apophenia
\li less -- to read text files
\li libtool -- needed for compiling programs
\li make
\li man -- for reading help files
\li more -- not as good as less but still good to have
\li sqlite3 -- a simple database engine, a requisite for Apophenia
If you are missing anything else, the program will probably tell you.
The following are not necessary but are good to have on hand as long as you are going to be using Unix and programming.
\li git -- to partake in the versioning system
\li emacs -- steep learning curve, but people love it
\li ghostscript (for reading .ps/.pdf files)
\li openssh -- needed for git
\li perl, python, ruby -- these are other languages that you might also be interested in
\li tetex -- write up your documentation using the nicest-looking formatter around
\li X11 -- a windowing system
X-Window will give you a nicer environment in which to work. After you start Cygwin, type <tt>startx</tt> to bring up a more usable, nice-looking terminal (and the ability to do a few thousand other things which are beyond the scope of this documentation). Once you have Cygwin installed and a good terminal running, you can follow along with the remainder of the discussion without modification.
Some older versions of Cygwin have a \c search.h file which doesn't include the function <tt>lsearch()</tt>. If this is the case on your system, you will have to update your Cygwin installation.
Finally, windows compilers often spit out lines like:
\code
Info: resolving _gsl_rng_taus by linking to __imp__gsl_rng_taus (auto-import)
\endcode
These lines are indeed just information, and not errors. Feel free to ignore them.
[Thanks to Andrew Felton and Derrick Higgins for their Cygwin debugging efforts.]
\subpage mingw
*/
/** \page notroot Not root?
If you aren't root, then the common procedure for installing a library is to create a subdirectory in your home directory in which to install packages. The key is the <tt>--prefix</tt> addition to the <tt>./configure</tt> command.
\code
export MY_LIBS = myroot #choose a directory name to be created in your home directory.
mkdir $HOME/$MY_LIBS
# From Apophenia's package directory:
./configure --prefix $HOME/$MY_LIBS
make
make install #Now you don't have to be root.
# Adjust your paths so the compiler and the OS can find the library.
# These are environment variables, and they are usually set in the
# shell's startup files. I assume you are using bash here.
echo "export PATH=$HOME/$MY_LIBS/include:\$PATH" >> ~/.bashrc
echo "export CPATH=$HOME/$MY_LIBS/include:\$CPATH" >> ~/.bashrc
echo "export LIBRARY_PATH=$HOME/$MY_LIBS:\$LIBRARY_PATH" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=$HOME/$MY_LIBS:\$LD_LIBRARY_PATH" >> ~/.bashrc
\endcode
Once you have created this local root directory, you can use it to install as many new libraries as desired, and your paths will already be set up to find them.
*/
/** \page makefile Makefile
Instead of giving lengthy compiler commands at the command prompt, you can use a Makefile to do most of the work. How to:
\li Copy and paste the following into a file named \c makefile.
\li Change the first line to the name of your program (e.g., if you have written <tt>census.c</tt>, then the first line will read <tt>PROGNAME=census</tt>).
\li If your program has multiple <tt>.c</tt> files, add a corresponding <tt>.o</tt> to the currently blank <tt>objects</tt> variable, e.g. <tt>objects=sample2.o sample3.o</tt>
\li One you have a Makefile in the directory, simply type <tt>make</tt> at the command prompt to generate the executable.
\code
PROGNAME = your_program_name_here
objects =
CFLAGS = -g -Wall -O3
LDLIBS = -lapophenia -lgsl -lgslcblas -lsqlite3
$(PROGNAME): $(objects)
\endcode
\li If your system has \c pkg-config, then you can use it for a slightly more robust and readable makefile. Replace the above C and link flags with:
\code
CFLAGS = -g -Wall `pkg-config --cflags apophenia` -O3
LDLIBS = `pkg-config --libs apophenia`
\endcode
The \c pkg-config program will then fill in the appropriate directories and libraries. Pkg-config knows Apophenia depends on the GSL and database libraries, so you need only list the most-dependent library.
\li The <tt>-O3</tt> flag is optional, asking the compiler to run its highest level of optimization (for speed).
\li GCC users may need the <tt>--std=gnu99</tt> or <tt>--std=gnu11</tt> flag to use post-1989 C standards.
\li Order matters in the linking list: the files a package depends on should be listed after the package. E.g., since sample.c depends on Apophenia, <tt>gcc sample.c -lapophenia</tt> will work, while <tt>gcc -lapophenia sample.c</tt> is likely to give you errors. Similarly, list <tt>-lapophenia</tt> before <tt>-lgsl</tt>, which comes before <tt>-lgslcblas</tt>.
*/
/** \page designated Designated initializers
Functions so marked in this documentation use standard C designated initializers and compound literals to allow you to omit, call by name, or change the order of inputs. The following examples are all equivalent.
The standard format:
\code
apop_text_to_db("infile.txt", "intable", 0, 1, NULL);
\endcode
Omitted arguments are left at their default vaules:
\code
apop_text_to_db("infile.txt", "intable");
\endcode
You can use the variable's name, if you forget its ordering:
\code
apop_text_to_db("infile.txt", "intable", .has_col_name=1, .has_row_name=0);
\endcode
If an un-named element follows a named element, then that value is given to the next variable in the standard ordering:
\code
apop_text_to_db("infile.txt", "intable", .has_col_name=1, NULL);
\endcode
\li There may be cases where you can not use this form (it relies on a macro, which
may not be available). You can always call the underlying function directly, by adding
\c _base to the name and giving all arguments:
\code
apop_text_to_db_base("infile.txt", "intable", 0, 1, NULL);
\endcode
\li If one of the optional elements is an RNG and you do not provide one, I use one
from \ref apop_rng_get_thread.
*/
/** \page preliminaries Getting started
If you are entirely new to Apophenia, \ref gentle "have a look at the Gentle Introduction here".
As well as the information in this outline, there is a separate page covering the details of
\ref setup "setting up a computing environment" and another page with \ref eg "some sample code" for your perusal.
\subpage gentle
\subpage setup
\subpage eg
\subpage refstatusext
*/
/** \page refstatusext References and extensions
\section mwd The book version
Apophenia co-evolved with <em>Modeling with Data: Tools and Techniques for Statistical Computing</em>. You can read about the book, or download a free PDF copy of the full text, at <a href="http://modelingwithdata.org">modelingwithdata.org</a>.
As with many computer programs, the preferred manner of citing Apophenia is to cite its related book.
Here is a BibTeX-formatted entry giving the relevant information:
\code
@book{klemens:modeling,
title = "Modeling with Data: Tools and Techniques for Statistical Computing",
author="Ben Klemens",
year=2008,
publisher="Princeton University Press"
}
\endcode
The rationale for the \ref apop_model struct, based on an algebraic system of models, is detailed in a <a href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf">U.S. Census Bureau research report</a>:
\code
@techreport{klemens:algebra,
title = "A Useful Algebraic System of Statistical Models",
author="Ben Klemens",
month=jul,
year=2014,
institution="U.S.\ Census Bureau",
number="06"
}
\endcode
\section ext How do I write extensions?
The system is written to not require a registration or initialization step to add a new
model or other such parts. Just write your code and <tt>include</tt> it like any other
C code. A new \ref apop_model has to conform to some rules if it is to play well with
\ref apop_estimate, \ref apop_draw, and so forth. See the notes at \ref modeldetails.
Once your new model or function is working, please post the code or a link to the code
on the <a href="https://github.com/b-k/apophenia/wiki">Apophenia wiki</a>.
\subpage c
\section links Further references
For your convenience, here are links to some other libraries you are probably using.
\li <a href="http://www.gnu.org/software/libc/manual/html_node/index.html">The standard C library</a>
\li <a href="http://www.gnu.org/software/gsl/manual/html_node/index.html">The
GSL documentation</a>, and <a href="http://www.gnu.org/software/gsl/manual/html_node/Function-Index.html">its index</a>
\li <a href="http://sqlite.org/lang.html">SQL understood by SQLite</a>
*/
/** \page outline An outline of the library
The narrative in this section goes into greater detail on how to use the
components of Apophenia. You are encouraged to read \ref gentle first.
This overview begins with the \ref apop_data set, which is the central data structure
used throughout the system. Section \ref dbs covers the use of the database interface,
because there are a lot of things that a database will do better than a matrix
structure like the \ref apop_data struct.
Section \ref modelsec covers statistical models, in the form of the \ref apop_model structure.
This part of the system is built upon the \ref apop_data set to hold parameters, statistics, data sets, and so on.
\ref Histosec covers probability mass functions, which are statistical models
built directly around a data set, where the chance of drawing a given observation is
proportional to how often that observation appears in the source data. There are many
situations where one would want to treat a data set as a probability distribution,
such as using \ref apop_kl_divergence to find the information loss from an observed
data set to a theoretical model fit to that data.
Section \ref testpage covers traditional hypothesis testing, beginning with common
statistics that take an \ref apop_data set or two as input, and continuing on to
generalized hypothesis testing for any \ref apop_model.
Because estimation in the \ref apop_model relies heavily on maximum likelihood
estimation, Apophenia's optimizer subsystem is extensive. \ref maxipage offers
some additional notes on optimization and how it can be used in non-statistical contexts.
\subpage dataoverview
\subpage dbs
\subpage modelsec
\subpage Histosec
\subpage testpage
\subpage maxipage
\subpage moreasst
*/
/** \page c C, SQL and coding utilities
<em>Learning C</em>
<a href="http://modelingwithdata.org">Modeling with Data</a> has a full tutorial for C, oriented at users of standard stats packages. More nuts-and-bolts tutorials are <a href="http://www.google.com/search?hl=es&c2coff=1&q=c+tutorial">in abundance</a>. Some people find pointers to be especially difficult; fortunately, there's a <a href="http://www.youtube.com/watch?v=6pmWojisM_E">claymation cartoon</a> which clarifies everything.
<em>Header aggregation</em>
There is only one header. Put
\code
#include <apop.h>
\endcode
at the top of your file, and you're done. Everything declared in that file starts with \c apop_ or \c Apop_. It also includes
\c assert.h, \c math.h, \c signal.h, and \c string.h.
<em>Linking</em>
You will need to link to the Apophenia library, which involves adding the <tt>-lapophenia</tt> flag to your compiler. Apophenia depends on SQLite3 and the GNU Scientific Library (which depends on a BLAS), so you will probably need something like:
\code
gcc sample.c -lapophenia -lsqlite3 -lgsl -lgslcblas -o run_me -g -Wall -O3
\endcode
Your best bet is to encapsulate this mess in a \ref makefile "Makefile". Even if you are using an IDE and its command-line management tools, see the Makefile page for notes on useful flags.
<em>Standards compliance</em>
To the best of our abilities, Apophenia complies to the C standard (ISO/IEC
9899:2011). As well as relying on the GSL and SQLite, it uses some POSIX function calls,
such as \c strcasecmp and \c popen.
\subpage designated
\section debugging Errors, logging, debugging and stopping
<em>The \c error element</em>
The \ref apop_data set and the \ref apop_model both include an element named \c error. It is normally \c 0, indicating no (known) error.
For example, \ref apop_data_copy detects allocation errors and some circular links
(when <tt>Data->more == Data</tt>) and fails in those cases. You could thus use the
function with a form like
\code
apop_data *d = apop_text_to_data("indata");
apop_data *cp = apop_data_copy(d);
if (cp->error) {printf("Couldn't copy the input data; failing.\n"); return 1;}
\endcode
There is sometimes (but not always) benefit to handling specific error codes, which are listed in the documentation of those functions that set the \c error element. E.g.,
\code
apop_data *d = apop_text_to_data("indata");
apop_data *cp = apop_data_copy(d);
if (cp->error == 'a') {printf("Couldn't allocate space for the copy; failing.\n"); return 1;}
if (cp->error == 'c') {printf("Circular link in the data set; failing.\n"); return 2;}
\endcode
The end of <a href="http://modelingwithdata.org/appendix_o.html">Appendix O</a>
of <em>Modeling with Data</em> offers some GDB macros which can make dealing with
Apophenia from the GDB command line much more pleasant. As discussed below, it
also helps to set <tt>apop_opts.stop_on_warning='v'</tt> or <tt>'w'</tt> when running
under the debugger.
\section verbsec Verbosity level and logging
The global variable <tt>apop_opts.verbose</tt> determines how many notifications and warnings get printed by Apophenia's warning mechanism:
-1: turn off logging, print nothing (ill-advised) <br>
0: notify only of failures and clear danger <br>
1: warn of technically correct but odd situations that might indicate, e.g., numeric instability <br>
2: debugging-type information; print queries <br>
3: give me everything, such as the state of the data at each iteration of a loop.
These levels are of course subjective, but should give you some idea of where to place the
verbosity level. The default is 1.
The messages are printed to the \c FILE* handle at <tt>apop_opts.log_file</tt>. If
this is blank (which happens at startup), then this is set to \c stderr. This is the
typical behavior for a console program. Use
\code
apop_opts.log_file = fopen("mylog", "w");
\endcode
to write to the \c mylog file instead of \c stderr.
As well as the error and warning messages, some functions can also print diagnostics,
using the \ref Apop_notify macro. For example, \ref apop_query and friends will print the
query sent to the database engine iff <tt>apop_opts.verbose >=2</tt> (which is useful
when building complex queries). The diagnostics attempt to follow
the same verbosity scale as the warning messages.
\section Stopping
By default, warnings and errors never halt processing. It is up to the calling function
to decide whether to stop.
When running the program under a debugger, this is an annoyance: we want to stop as
soon as a problem turns up.
The global variable <tt>apop_opts.stop_on_warning</tt> changes when the system halts:
\c 'n': never halt. If you were using Apophenia to support a user-friendly GUI, for example, you would use this mode.<br>
The default: if the variable is <tt>'\0'</tt> (the default), halt on severe errors, continue on all warnings.<br>
\c 'v': If the verbosity level of the warning is such that the warning would print to screen, then halt;
if the warning message would be filtered out by your verbosity level, continue.<br>
\c 'w': Halt on all errors or warnings, including those below your verbosity threshold.
See the documentation for individual functions for details on how each reports errors to the caller and the level at which warnings are posted.
\section Legi Legible output
The output routines handle four sinks for your output. There is a global variable that
you can use for small projects where all data will go to the same place.
\code
apop_opts.output_type = 'f'; //named file
apop_opts.output_type = 'p'; //a pipe or already-opened file
apop_opts.output_type = 'd'; //the database
\endcode
You can also set the output type, the name of the output file or table, and other options
via arguments to individual calls to output functions. See \ref apop_prep_output for the list of options.
C makes minimal distinction between pipes and files, so you can set a
pipe or file as output and send all output there until further notice:
\code
apop_opts.output_type = 'p';
apop_opts.output_pipe = popen("gnuplot", "w");
apop_plot_lattice(...); //see https://github.com/b-k/Apophenia/wiki/gnuplot_snippets
fclose(apop_opts.output_pipe);
apop_opts.output_pipe = fopen("newfile", "w");
apop_data_print(set1);
fprintf(apop_opts.output_pipe, "\nNow set 2:\n");
apop_data_print(set2);
\endcode
Continuing the example, you can always override the global data with
a specific request:
\code
apop_vector_print(v, "vectorfile"); //put vectors in a separate file
apop_matrix_print(m, "matrix_table", .output_type = 'd'); //write to the db
apop_matrix_print(m, .output_pipe = stdout); //now show the same matrix on screen
\endcode
I will first look to the input file name, then the input pipe, then the
global \c output_pipe, in that order, to determine to where I should
write. Some combinations (like output type = \c 'd' and only a pipe) don't
make sense, and I'll try to warn you about those.
What if you have too much output and would like to use a pager, like \c less or \c more?
In C and POSIX terminology, you're asking to pipe your output to a paging program. Here is
the form:
\code
FILE *lesspipe = popen("less", "w");
assert(lesspipe);
apop_data_print(your_data_set, .output_pipe=lesspipe);
pclose(lesspipe);
\endcode
\c popen will search your usual program path for \c less, so you don't have to give a full path.
\li\ref apop_data_print
\li\ref apop_matrix_print
\li\ref apop_vector_print
\section sqlsec About SQL, the syntax for querying databases
For a reference, your best bet is the <a href="http://www.sqlite.org/lang.html">Structured Query Language reference</a> for SQLite. For a tutorial; there is an abundance of <a href="http://www.google.com/search?q=sql+tutorial">tutorials online</a>. Here is a nice blog <a href="http://fluff.info/blog/arch/00000118.htm">entry</a> about complementaries between SQL and matrix manipulation packages.
Apophenia currently supports two database engines: SQLite and mySQL/mariaDB. SQLite is the default, because it is simpler and generally more easygoing than mySQL, and supports in-memory databases.
The global <tt>apop_opts.db_engine</tt> is initially \c NULL, indicating no preference
for a database engine. You can explicitly set it:
\code
apop_opts.db_engine='s' //use SQLite
apop_opts.db_engine='m' //use mySQL/mariaDB
\endcode
If \c apop_opts.db_engine is still \c NUL on your first database operation, then I will check
for an environment variable <tt>APOP_DB_ENGINE</tt>, and set
<tt>apop_opts.db_engine='m'</tt> if it is found and matches (case insensitive) \c mariadb or \c mysql.
\code
export APOP_DB_ENGINE=mariadb
apop_text_to_db indata mtab db_for_maria
unset APOP_DB_ENGINE
apop_text_to_db indata stab db_for_sqlite.db
\endcode
Write \ref apop_data sets to the database using \ref apop_data_print, with <tt>.output_type='d'</tt>.
\li Column names are inserted if there are any. If there are, all dots are converted
to underscores. Otherwise, the columns will be named \c c1, \c c2, \c c3, &c.
\li If \ref apop_opts_type "apop_opts.db_name_column" is not blank (the default is
<tt>"row_name"</tt>), then a so-named column is created, and the row names are placed there.
\li If there are weights, they will be the last column of the table, and the column will be named \c weights.
\li If the table does not exist, create it. Use \ref apop_data_print <tt>(data, "tabname", .output_type='d', .output_append='w')</tt>
to overwrite an existing table or with <tt>.output_append='a'</tt> to append.
Appending is the default. Or,
call \ref apop_table_exists <tt>("tabname", 'd')</tt> to ensure that the table is
removed ahead of time.
\li If your data set has zero data (i.e., is just a list of column names or is entirely
blank), \ref apop_data_print returns without creating anything in the database.
\li Especially if you are using a pre-2007 version of SQLite, there may be a speed
gain to wrapping the call to this function in a begin/commit pair:
\code
apop_query("begin;");
apop_data_print(dataset, .output_name="dbtab", .output_type='d');
apop_query("commit;");
\endcode
Finally, Apophenia provides a few nonstandard SQL functions to facilitate math via database; see \ref db_moments.
\section threads Threading
Apophenia uses OpenMP for threading. You generally do not need to know how OpenMP works
to use Apophenia, and many points of work will thread without your doing anything.
\li All functions strive to be thread-safe. Part of how this is achieved is that static
variables are marked as thread-local or atomic, as per the C standard. There still
exist compilers that can't implement thread-local or atomic variables, in which case
your safest bet is to set OMP's thread count to one as below (or get a new compiler).
\li Some functions modify their inputs. It is up to you to use those functions in
a thread-safe manner. The \ref apop_matrix_realloc handles states and global variables
correctly in a threaded environment, but if you have two threads resizing the same \c
gsl_matrix at the same time, you're going to have problems.
\li There are few compilers that don't support OpenMP. When compiling on such a system all work will be single-threaded.
\li Set the maximum number of threads to \c N with the environment variable
\code
export OMP_NUM_THREADS N
\endcode
or the C function
\code
#include <omp.h>
omp_set_num_threads(N);
\endcode
Use one of these methods with <tt>N=1</tt> if you want a single-threaded program.
You can return later to using all available threads via <tt>omp_set_num_threads(omp_get_num_procs())</tt>.
\li \ref apop_map and friends distribute their \c for loop over the input \ref apop_data
set across multiple threads. Therefore, be careful to send thread-unsafe functions to
it only after calling \c omp_set_num_threads(1).
\li There are a few functions, like \ref apop_model_draws, that rely on \ref apop_map, and
therefore also thread by default.
\li The function \ref apop_rng_get_thread retrieves a statically-stored RNG specific
to a given thread. Therefore, if you use that function in the place of a \c gsl_rng,
you can parallelize functions that make random draws.
\li \ref apop_rng_get_thread allocates its store of threads using <tt>apop_opts.rng_seed</tt>,
then incrementing that seed by one. You thus probably have threads with seeds 479901,
479902, 479903, .... [If you have a better way to do it, please feel free to modify the
code to implement your improvement and submit a pull request on Github.]
See <a href="http://modelingwithdata.org/arch/00000175.htm">this tutorial on C
threading</a> if you would like to know more, or are unsure about whether your functions
are thread-safe or not.
*/
/** \page dataoverview Data sets
The \ref apop_data structure represents a data set. It joins together a \c gsl_vector,
a \c gsl_matrix, an \ref apop_name, and a table of strings. It tries to be lightweight,
so you can use it everywhere you would use a \c gsl_matrix or a \c gsl_vector.
Here is a diagram showing a sample data set with all of the elements in place. Together,
they represent a data set where each row is an observation, which includes both numeric
and text values, and where each row/column may be named.
\htmlinclude apop_data_fig.html
\latexinclude apop_data_fig.tex
In a regression, the vector would be the dependent variable, and the other columns
(after factor-izing the text) the independent variables. Or think of the \ref apop_data
set as a partitioned matrix, where the vector is column -1, and the first column of
the matrix is column zero. Here is some sample code to print the vector and matrix,
starting at column -1 (but you can use \ref apop_data_print to do this).
\code
for (int j = 0; j< data->matrix->size1; j++){
printf("%s\t", apop_name_get(data->names, j, 'r'));
for (int i = -1; i< data->matrix->size2; i++)
printf("%g\t", apop_data_get(data, j, i));
printf("\n");
}
\endcode
Most functions assume that each row represents one observation, so the data vector,
data matrix, and text have the same row count: \c data->vector->size==data->matrix->size1
and \c data->vector->size==*data->textsize. This means that the \ref apop_name structure
doesn't have separate \c vector_names, \c row_names, or \c text_row_names elements:
the \c rownames are assumed to apply for all.
See below for notes on managing the \c text element and the row/column names.
\section pps Pages
The \ref apop_data set includes a \c more pointer, which will typically be \c NULL,
but may point to another \ref apop_data set. This is intended for a main data set
and a second or third page with auxiliary information, such as estimated parameters
on the front page and their covariance matrix on page two, or predicted data on the
front page and a set of prediction intervals on page two.
The \c more pointer is not intended for a linked list for millions of data points. In
such situations, you can often improve efficiency by restructuring your data to use
a single table (perhaps via \ref apop_data_pack and \ref apop_data_unpack).
Most functions, such as \ref apop_data_copy and \ref apop_data_free, will handle all
the pages of information. For example, an optimization search over multi-page parameter
sets would search the space given by all pages.
Pages may also be appended as output or auxiliary information, such as
covariances, and an MLE would not search over these elements. Any page with a name in
XML-ish brackets, such as <tt>\<Covariance\></tt>, is considered information about the
data, not data itself, and therefore ignored by search routines, missing data routines,
et cetera. This is achieved by a rule in \ref apop_data_pack and \ref apop_data_unpack.
Here is a toy example that establishes a baseline data set, adds a page,
modifies it, and then later retrieves it.
\code
apop_data *d = apop_data_alloc(10, 10, 10); //the base data set, a 10-item vector + 10x10 matrix
apop_data *a_new_page = apop_data_add_page(d, apop_data_alloc(2,2), "new 2 x 2 page");
gsl_vector_set_all(a_new_page->matrix, 3);
//later:
apop_data *retrieved = apop_data_get_page(d, "new", 'r'); //'r'=search via regex, not literal match.
apop_data_print(retrieved); //print a 2x2 grid of 3s.
\endcode
\section datafns Functions for using apop_data sets
There are a great many functions to collate, copy, merge, sort, prune, and otherwise
manipulate the \ref apop_data structure and its components.
\li\ref apop_data_add_named_elmt
\li\ref apop_data_copy
\li\ref apop_data_fill
\li\ref apop_data_memcpy
\li\ref apop_data_pack
\li\ref apop_data_rm_columns
\li\ref apop_data_sort
\li\ref apop_data_split
\li\ref apop_data_stack
\li\ref apop_data_transpose : transpose matrices (square or not) and text grids
\li\ref apop_data_unpack
\li\ref apop_matrix_copy
\li\ref apop_matrix_realloc
\li\ref apop_matrix_stack
\li\ref apop_text_set
\li\ref apop_text_paste
\li\ref apop_text_to_data
\li\ref apop_vector_copy
\li\ref apop_vector_fill
\li\ref apop_vector_stack
\li\ref apop_vector_realloc
\li\ref apop_vector_unique_elements
Apophenia builds upon the GSL, but it would be inappropriate to redundantly replicate
the <a href="http://www.gnu.org/software/gsl/manual/html_node/index.html">GSL's documentation</a> here.
Meanwhile, here are prototypes for a few common functions. The GSL's
naming scheme is very consistent, so a simple reminder of the function name may be
sufficient to indicate how they are used.
\li <tt>gsl_matrix_swap_rows (gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>gsl_matrix_swap_columns (gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>gsl_matrix_swap_rowcol (gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>gsl_matrix_transpose_memcpy (gsl_matrix * dest, const gsl_matrix * src)</tt>
\li <tt>gsl_matrix_transpose (gsl_matrix * m)</tt> : square matrices only
\li <tt>gsl_matrix_set_all (gsl_matrix * m, double x)</tt>
\li <tt>gsl_matrix_set_zero (gsl_matrix * m)</tt>
\li <tt>gsl_matrix_set_identity (gsl_matrix * m)</tt>
\li <tt>gsl_matrix_memcpy (gsl_matrix * dest, const gsl_matrix * src)</tt>
\li <tt>void gsl_vector_set_all (gsl_vector * v, double x)</tt>
\li <tt>void gsl_vector_set_zero (gsl_vector * v)</tt>
\li <tt>int gsl_vector_set_basis (gsl_vector * v, size_t i)</tt>: set all elements to zero, but set item \f$i\f$ to one.
\li <tt>gsl_vector_reverse (gsl_vector * v)</tt>: reverse the order of your vector's elements
\li <tt>gsl_vector_ptr</tt> and <tt>gsl_matrix_ptr</tt>. To increment an element in a vector use, e.g., <tt>*gsl_vector_ptr(v, 7) += 3;</tt> or <tt>(*gsl_vector_ptr(v, 7))++</tt>.
\li <tt>gsl_vector_memcpy (gsl_vector * dest, const gsl_vector * src)</tt>
\subsection readin Reading from text files
The \ref apop_text_to_data() function takes in the name of a text file with a grid of data in (comma|tab|pipe|whatever)-delimited format and reads it to a matrix. If there are names in the text file, they are copied in to the data set. See \ref text_format for the full range and details of what can be read in.
If you have any columns of text, then you will need to read in via the database: use
\ref apop_text_to_db() to convert your text file to a database table,
do any database-appropriate cleaning of the input data, then use \ref
apop_query_to_data() or \ref apop_query_to_mixed_data() to pull the data to an \ref apop_data set.
\subpage text_format
\section datalloc Alloc/free
You may not need to use these functions often, given that \ref apop_query_to_data, \ref apop_text_to_data, and many transformation functions will auto-allocate \ref apop_data sets for you.
The \ref apop_data_alloc function allocates a vector, a matrix, or both. After this call, the structure will have blank names, \c NULL \c text element, and \c NULL \c weights. See \ref names for discussion of filling the names. Use \ref apop_text_alloc to allocate the \c text grid. The \c weights are a simple \c gsl_vector, so allocate a 100-unit weights vector via <tt>allocated_data_set->weights = gsl_vector_alloc(100)</tt>.
Examples of use can be found throughout the documentation; for example, see \ref gentle.
\li\ref apop_data_alloc
\li\ref apop_data_calloc
\li\ref apop_data_free
\li\ref apop_text_alloc : allocate or resize the text part of an \ref apop_data set.
\li\ref apop_text_free
See also:
\li <tt>gsl_matrix * gsl_matrix_alloc (size_t n1, size_t n2)</tt>
\li <tt>gsl_matrix * gsl_matrix_calloc (size_t n1, size_t n2)</tt>
\li <tt>void gsl_matrix_free (gsl_matrix * m)</tt>
\li <tt>gsl_vector * gsl_vector_alloc (size_t n)</tt>
\li <tt>gsl_vector * gsl_vector_calloc (size_t n)</tt>
\li <tt>void gsl_vector_free (gsl_vector * v)</tt>
\section gslviews Using views
There are several macros for the common task of viewing a single row or column of a \ref
apop_data set.
\code
apop_data *d = apop_query_to_data("select obs1, obs2, obs3 from a_table");
//Get a column using its name. Note that the generated view, ov, is the
//last item named in the call to the macro.
Apop_col_t(d, "obs1", ov);
double obs1_sum = apop_vector_sum(ov);
//Get row zero of the data set's matrix as a vector; get its sum
double row_zero_sum = apop_vector_sum(Apop_rv(d, 0));
//Get a row or rows as a standalone one-row apop_data set
apop_data_print(Apop_r(d, 0));
//ten rows starting at row 3:
apop_data *d10 = Apop_rs(d, 3, 10);
apop_data_print(d10);
//Column zero's sum
gsl_vector *cv = Apop_cv(d, 0);
double col_zero_sum = apop_vector_sum(cv);
//or one one line:
double col_zero_sum = apop_vector_sum(Apop_cv(d, 0));
//Pull a 10x5 submatrix, whose origin element is the (2,3)rd
//element of the parent data set's matrix
double sub_sum = apop_matrix_sum(Apop_subm(d, 2,3, 10,5));
\endcode
Because these macros can be used as arguments to a function, these macros have abbreviated names to save line space.
\li\ref Apop_r : get row as one-observation \ref apop_data set
\li\ref Apop_c : get column as \ref apop_data set
\li\ref Apop_cv : get column as \ref gsl_vector
\li\ref Apop_rv : get row as \ref gsl_vector
\li\ref Apop_cs : get columns as \ref apop_data set
\li\ref Apop_rs : get rows as \ref apop_data set
\li\ref Apop_mcv : matrix column as vector
\li\ref Apop_mrv : matrix row as vector
\li\ref Apop_subm : get submatrix of a \ref gsl_matrix
A second set of macros have a slightly different syntax, taking the name of the object to be declared as the last argument. These can not be used as expressions such as function arguments.
\li\ref Apop_col_t
\li\ref Apop_row_t
\li\ref Apop_col_tv
\li\ref Apop_row_tv
The view is an automatic variable, not a pointer, and therefore disappears at the end
of the scope in which it is declared. If you want to retain the data after the function
exits, copy it to another vector:
\code
return apop_vector_copy(Apop_rv(d, 2)); //return a gsl_vector copy of row 2
\endcode
Curly braces always delimit scope, not just at the end of a function.
When program evaluation exits a given block, all variables in that block are
erased. Here is some sample code that won't work:
\code
apop_data *outdata;
if (get_odd){
outdata = Apop_r(data, 1);
} else {
outdata = Apop_r(data, 0);
}
apop_data_print(outdata); //breaks: outdata points to out-of-scope variables.
\endcode
For this if/then statement, there are two sets of local variables
generated: one for the \c if block, and one for the \c else block. By the last line,
neither exists. You can get around the problem here by making sure to not put the macro
declaring new variables in a block. E.g.:
\code
apop_data *outdata = Apop_r(data, get_odd ? 1 : 0);
apop_data_print(outdata);
\endcode
\section data_set_get Set/get
First, some examples:
\code
apop_data *d = apop_data_alloc(10, 10, 10);
apop_name_add(d->names, "Zeroth row", 'r');
apop_name_add(d->names, "Zeroth col", 'c');
//set cell at row=8 col=0 to value=27
apop_data_set(d, 8, 0, .val=27);
assert(apop_data_get(d, 8, .colname="Zeroth") == 27);
double *x = apop_data_ptr(d, .col=7, .rowname="Zeroth");
*x = 270;
assert(apop_data_get(d, 0, 7) == 270);
// This is invalid---the value doesn't follow the colname. Use .val=5.
// apop_data_set(d, .row = 3, .colname="Column 8", 5);
// This is OK, to set (3, 8) to 5:
apop_data_set(d, 3, 8, 5);
//apop_data set holding a scalar:
apop_data *s = apop_data_alloc(1);
apop_data_set(s, .val=12);
assert(apop_data_get(s) == 12);
//apop_data set holding a vector:
apop_data *v = apop_data_alloc(12);
for (int i=0; i< 12; i++) apop_data_set(s, i, .val=i*10);
assert(apop_data_get(s,3) == 30);
//This is a common form from pulling from a list of named scalars,
//produced via apop_data_add_named_elmt
double AIC = apop_data_get(your_model->info, .rowname="AIC");
\endcode
\li The versions that take a column/row name use \ref apop_name_find
for the search; see notes there on the name matching rules.
\li For those that take a column number, column -1 is the vector element.
\li For those that take a column name, I will search the vector last---if I don't find the name among the matrix columns, but the name matches the vector name, I return column -1.
\li If you give me both a <tt>.row</tt> and a <tt>.rowname</tt>, I go with the name; similarly for <tt>.col</tt> and
<tt>.colname</tt>.
\li You can give the name of a page, e.g.
\code
double AIC = apop_data_get(data, .rowname="AIC", .col=-1, .page="<Info>");
\endcode
\li Numeric values default to zero, which is how the examples above that treated the \ref apop_data set as a vector or scalar could do so relatively gracefully.
So <tt>apop_data_get(dataset, 1)</tt> gets item (1, 0) from the matrix
element of \c dataset. But as a do-what-I-mean exception, if there is no matrix element
but there is a vector, then this form will get vector element 1. Relying on this DWIM
exception is useful iff you can guarantee that a data set will have only a vector or
a matrix but not both. Otherwise, be explicit and use <tt>apop_data_get(dataset, 1, -1)</tt>.
The \ref apop_data_ptr function follows the lead of \c gsl_vector_ptr and \c
gsl_matrix_ptr, and like those functions, returns a pointer to the appropriate \c double.
For example, to increment the (3,7)th element of an \ref apop_data set:
\code
(*apop_data_ptr(dataset, 3, 7))++;
\endcode
\li\ref apop_data_get
\li\ref apop_data_set
\li\ref apop_data_ptr : returns a pointer to the element.
\li\ref apop_data_get_page : retrieve a named page from a data set. If you only need a few items, you can specify a page name to \c apop_data_(get|set|ptr).
See also:
\li <tt>double gsl_matrix_get (const gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>double gsl_vector_get (const gsl_vector * v, size_t i)</tt>
\li <tt>void gsl_matrix_set (gsl_matrix * m, size_t i, size_t j, double x)</tt>
\li <tt>void gsl_vector_set (gsl_vector * v, size_t i, double x)</tt>
\li <tt>double * gsl_matrix_ptr (gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>double * gsl_vector_ptr (gsl_vector * v, size_t i)</tt>
\li <tt>const double * gsl_matrix_const_ptr (const gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>const double * gsl_vector_const_ptr (const gsl_vector * v, size_t i)</tt>
\li <tt>gsl_matrix_get_row (gsl_vector * v, const gsl_matrix * m, size_t i)</tt>
\li <tt>gsl_matrix_get_col (gsl_vector * v, const gsl_matrix * m, size_t j)</tt>
\li <tt>gsl_matrix_set_row (gsl_matrix * m, size_t i, const gsl_vector * v)</tt>
\li <tt>gsl_matrix_set_col (gsl_matrix * m, size_t j, const gsl_vector * v)</tt>
\section mapply Map/apply
\anchor outline_mapply
These functions allow you to send each element of a vector or matrix to a function, either producing a new matrix (map) or transforming the original (apply). The \c ..._sum functions return the sum of the mapped output.
There are two types, which were developed at different times. The \ref apop_map and
\ref apop_map_sum functions use variadic function inputs to cover a lot of different
types of process depending on the inputs. Other functions with types in their names,
like \ref apop_matrix_map and \ref apop_vector_apply, may be easier to use in some
cases. They use the same routines internally, so use whichever type is convenient.
You can do many things quickly with these functions.
Get the sum of squares of a vector's elements:
\code
//given apop_data *dataset and gsl_vector *v:
double sum_of_squares = apop_map_sum(dataset, gsl_pow_2);
double sum_of_sqvares = apop_vector_map_sum(v, gsl_pow_2);
\endcode
Create an index vector [\f$0, 1, 2, ...\f$].
\code
double index(double in, int index){return index;}
apop_data *d = apop_map(apop_data_alloc(100), .fn_di=index, .inplace='y');
\endcode
Given your log likelihood function, which acts on a \ref apop_data set with only one
row, and a data set where each row of the matrix is an observation, find the total
log likelihood via:
\code
static double your_log_likelihood_fn(apop_data * in)
{[your math goes here]}
double total_ll = apop_map_sum(dataset, .fn_r=your_log_likelihood_fn);
\endcode
How many missing elements are there in your data matrix?
\code
static double nan_check(const double in){ return isnan(in);}
int missing_ct = apop_map_sum(in, nan_check, .part='m');
\endcode
Get the mean of the not-NaN elements of a data set:
\code
static double no_nan_val(const double in){ return isnan(in)? 0 : in;}
static double not_nan_check(const double in){ return !isnan(in);}
static double apop_mean_no_nans(apop_data *in){
return apop_map_sum(in, no_nan_val)/apop_map_sum(in, not_nan_check);
}
\endcode
The following program randomly generates a data set where each row is a list of numbers with a different mean. It then finds the \f$t\f$ statistic for each row, and the confidence with which we reject the claim that the statistic is less than or equal to zero.
Notice how the older \ref apop_vector_apply uses file-global variables to pass information into the functions, while the \ref apop_map uses a pointer to send parameters to the functions.
\include t_test_by_rows.c
One more toy example demonstrating the use of \ref apop_map and \ref apop_map_sum :
\include apop_map_row.c
\li If the number of threads is greater than one, then the matrix will be broken
into chunks and each sent to a different thread. Notice that the GSL is generally
threadsafe, and SQLite is threadsafe conditional on several commonsense caveats that
you'll find in the SQLite documentation. See \ref apop_rng_get_thread() to use the GSL's RNGs in a threaded environment.
\li The \c ...sum functions are convenience functions that call \c ...map and then add up the contents. Thus, you will need to have adequate memory for the allocation of the temp matrix/vector.
\li\ref apop_map
\li\ref apop_map_sum
\li\ref apop_matrix_apply
\li\ref apop_matrix_map
\li\ref apop_matrix_map_all_sum
\li\ref apop_matrix_map_sum
\li\ref apop_vector_apply
\li\ref apop_vector_map
\li\ref apop_vector_map_sum
\section matrixmathtwo Basic Math
\li\ref apop_vector_exp : exponentiate every element of a vector
\li\ref apop_vector_log : take the natural log of every element of a vector
\li\ref apop_vector_log10 : take the log (base 10) of every element of a vector
\li\ref apop_vector_distance : find the distance between two vectors via various metrics
\li\ref apop_vector_normalize : scale/shift a matrix to have mean zero, sum to one, have a range of exactly \f$[0, 1]\f$, et cetera
\li\ref apop_vector_entropy : calculate the entropy of a vector of frequencies or probabilities
See also:
\li <tt>int gsl_matrix_add (gsl_matrix * a, const gsl_matrix * b)</tt>
\li <tt>int gsl_matrix_sub (gsl_matrix * a, const gsl_matrix * b)</tt>
\li <tt>int gsl_matrix_mul_elements (gsl_matrix * a, const gsl_matrix * b)</tt>
\li <tt>int gsl_matrix_div_elements (gsl_matrix * a, const gsl_matrix * b)</tt>
\li <tt>int gsl_matrix_scale (gsl_matrix * a, const double x)</tt>
\li <tt>int gsl_matrix_add_constant (gsl_matrix * a, const double x)</tt>
\li <tt>gsl_vector_add (gsl_vector * a, const gsl_vector * b)</tt>
\li <tt>gsl_vector_sub (gsl_vector * a, const gsl_vector * b)</tt>
\li <tt>gsl_vector_mul (gsl_vector * a, const gsl_vector * b)</tt>
\li <tt>gsl_vector_div (gsl_vector * a, const gsl_vector * b)</tt>
\li <tt>gsl_vector_scale (gsl_vector * a, const double x)</tt>
\li <tt>gsl_vector_add_constant (gsl_vector * a, const double x)</tt>
\section matrixmath Matrix math
\li\ref apop_dot : matrix \f$\cdot\f$ matrix, matrix \f$\cdot\f$ vector, or vector \f$\cdot\f$ matrix
\li\ref apop_matrix_determinant
\li\ref apop_matrix_inverse
\li\ref apop_det_and_inv : find determinant and inverse at the same time
See the GSL documentation for myriad further options.
\section sumstats Summary stats
\li\ref apop_data_summarize
\li\ref apop_vector_moving_average
\li\ref apop_vector_percentiles
\li\ref apop_vector_bounded
See also:
\li <tt>double gsl_matrix_max (const gsl_matrix * m)</tt>
\li <tt>double gsl_matrix_min (const gsl_matrix * m)</tt>
\li <tt>void gsl_matrix_minmax (const gsl_matrix * m, double * min_out, double * max_out)</tt>
\li <tt>void gsl_matrix_max_index (const gsl_matrix * m, size_t * imax, size_t * jmax)</tt>
\li <tt>void gsl_matrix_min_index (const gsl_matrix * m, size_t * imin, size_t * jmin)</tt>
\li <tt>void gsl_matrix_minmax_index (const gsl_matrix * m, size_t * imin, size_t * jmin, size_t * imax, size_t * jmax)</tt>
\li <tt>gsl_vector_max (const gsl_vector * v)</tt>
\li <tt>gsl_vector_min (const gsl_vector * v)</tt>
\li <tt>gsl_vector_minmax (const gsl_vector * v, double * min_out, double * max_out)</tt>
\li <tt>gsl_vector_max_index (const gsl_vector * v)</tt>
\li <tt>gsl_vector_min_index (const gsl_vector * v)</tt>
\li <tt>gsl_vector_minmax_index (const gsl_vector * v, size_t * imin, size_t * imax)</tt>
\section moments Moments
For most of these, you can add a weights vector for weighted mean/var/cov/..., such as
<tt>apop_vector_mean(d->vector, .weights=d->weights)</tt>
\li\ref apop_mean : the first three with short names operate on a vector.
\li\ref apop_sum
\li\ref apop_var
\li\ref apop_matrix_sum
\li\ref apop_data_correlation
\li\ref apop_data_covariance
\li\ref apop_data_summarize
\li\ref apop_matrix_mean
\li\ref apop_matrix_mean_and_var
\li\ref apop_vector_correlation
\li\ref apop_vector_cov
\li\ref apop_vector_kurtosis
\li\ref apop_vector_kurtosis_pop
\li\ref apop_vector_mean
\li\ref apop_vector_skew
\li\ref apop_vector_skew_pop
\li\ref apop_vector_sum
\li\ref apop_vector_var
\li\ref apop_vector_var_m
\section convsec Conversion among types
There are no functions provided to convert from \ref apop_data to the constituent
elements, because you don't need a function.
If you need an individual element, you can use its pointer to retrieve it:
\code
apop_data *d = apop_query_to_mixed_data("vmmw", "select result, age, "
"income, replicate_weight from data");
double avg_result = apop_vector_mean(d->vector, .weights=d->weights);
\endcode
In the other direction, you can use compound literals to wrap an \ref apop_data struct
around a loose vector or matrix:
\code
//Given:
gsl_vector *v;
gsl_matrix *m;
// Then this form wraps the elements into automatically-allocated apop_data structs.
apop_data *dv = &(apop_data){.vector=v};
apop_data *dm = &(apop_data){.matrix=m};
apop_data *v_dot_m = apop_dot(dv, dm);
//Here is a macro to hide C's ugliness:
#define As_data(...) (&(apop_data){__VA_ARGS__})
apop_data *v_dot_m2 = apop_dot(As_data(.vector=v), As_data(.matrix=m));
//The wrapped object is an automatically-allocated structure pointing to the
//original data. If it needs to persist or be separate from the original,
//make a copy:
apop_data *dm_copy = apop_data_copy(As_data(.vector=v, .matrix=m));
\endcode
\li\ref apop_array_to_vector : <tt>double*</tt>\f$\to\f$ <tt>gsl_vector</tt>
\li\ref apop_data_fill : <tt>double*</tt>\f$\to\f$ \ref apop_data
\li\ref apop_data_falloc : macro to allocate and fill a \ref apop_data set
\li\ref apop_text_to_data : delimited text file\f$\to\f$ \ref apop_data
\li\ref apop_text_to_db : delimited text file\f$\to\f$ database table
\li\ref apop_vector_to_matrix
\section names Name handling
If you generate your data set via \ref apop_text_to_data or from the database via
\ref apop_query_to_data (or \ref apop_query_to_text or \ref apop_query_to_mixed_data)
then column names appear as expected. Set <tt>apop_opts.db_name_column</tt> to the
name of a column in your query result to use that column name for row names.
Sample uses, given \ref apop_data set <tt>d</tt>:
\code
int row_name_count = d->names->rowct
int col_name_count = d->names->colct
int text_name_count = d->names->textct
//Manually add names in sequence:
apop_name_add(d->names, "the vector", 'v');
apop_name_add(d->names, "row 0", 'r');
apop_name_add(d->names, "row 1", 'r');
apop_name_add(d->names, "row 2", 'r');
apop_name_add(d->names, "numeric column 0", 'c');
apop_name_add(d->names, "text column 0", 't');
apop_name_add(d->names, "The name of the data set.", 'h');
//or append several names at once
apop_data_add_names(d, 'c', "numeric column 1", "numeric column 2", "numeric column 3");
//point to element i from the row/col/text names:
char *rowname_i = d->names->row[i];
char *colname_i = d->names->col[i];
char *textname_i = d->names->text[i];
//The vector also has a name:
char *vname = d->names->vector;
\endcode
\li\ref apop_name_add : add one name
\li\ref apop_data_add_names : add a sequence of names at once
\li\ref apop_name_stack : copy the contents of one name list to another
\li\ref apop_name_find : find the row/col number for a given name.
\li\ref apop_name_print : print the \ref apop_name struct, for diagnostic purposes.
\section textsec Text data
The \ref apop_data set includes a grid of strings, named <tt>text</tt>, for holding text data.
Text should be encoded in UTF-8. ASCII is a subset of UTF-8, so that's OK too.
There are a few simple forms for handling the \c text element of an \c apop_data set.
\li Use \ref apop_text_alloc to allocate the block of text. It is actually a realloc function, which you can use to resize an existing block without leaks. See the example below.
\li Use \ref apop_text_set to write text elements. It replaces any existing text in the given slot without memory leaks.
\li The number of rows of text data in <tt>tdata</tt> is
<tt>tdata->textsize[0]</tt>;
the number of columns is <tt>tdata->textsize[1]</tt>.
\li Refer to individual elements using the usual 2-D array notation, <tt>tdata->text[row][col]</tt>.
\li <tt>x[0]</tt> can always be written as <tt>*x</tt>, which may save some typing. The number of rows is <tt>*tdata->textsize</tt>. If you have a single column of text data (i.e., all data is in column zero), then item \c i is <tt>*tdata->text[i]</tt>. If you know you have exactly one cell of text, then its value is <tt>**tdata->text</tt>.
\li After \ref apop_text_alloc, all elements are the empty string <tt>""</tt>, which
you can check via
\code
if (!strlen(dataset->text[i][j])) printf("<blank>")
//or
if (!*dataset->text[i][j]) printf("<blank>")
\endcode
For the sake of efficiency when dealing with large, sparse data sets, all blank cells
point to <em>the same</em> static empty string, meaning that freeing cells must be
done with care. Your best bet is to rely on \ref apop_text_set, \ref apop_text_alloc,
and \ref apop_text_free to do the memory management for you.
Here is a sample program that uses these forms, plus a few text-handling functions.
\include eg/text_demo.c
\li\ref apop_data_transpose() : also transposes the text data. Say that you use
<tt>dataset = apop_query_to_text("select onecolumn from data");</tt> then you have a
sequence of strings, <tt>d->text[0][0], d->text[1][0], </tt>.... After <tt>apop_data
*dt = apop_data_transpose(dataset)</tt>, you will have a single list of strings,
<tt>dt->text[0]</tt>, which is often useful as input to list-of-strings handling
functions.
\li\ref apop_query_to_text
\li\ref apop_text_alloc : allocate or resize the text part of an \ref apop_data set.
\li\ref apop_text_set : replace a single cell of the text grid with new text.
\li\ref apop_text_paste : convert a table of strings into one long string.
\li\ref apop_text_unique_elements : get a sorted list of unique elements for one column of text.
\li\ref apop_text_free : you may never need this, because \ref apop_data_free calls it.
\li\ref apop_regex : friendlier front-end for POSIX-standard regular expression
searching; pulls matches into an \ref apop_data set.
\li\ref apop_text_unique_elements
\subsection fact Generating factors
\em Factor is jargon for a numbered category. Number-crunching programs prefer integers over text, so we need a function to produce a one-to-one mapping from text categories into numeric factors.
A \em dummy is a variable that is either one or zero, depending on membership in a given
group. Some methods (typically when the variable is an input or independent variable
in a regression) prefer dummies; some methods (typically for outcome or dependent
variables) prefer factors. The functions that generate factors and dummies will add
an informational page to your \ref apop_data set with a name like <tt>\<categories
for your_column\></tt> listing the conversion from the artificial numeric factor to
the original data. Use \ref apop_data_get_factor_names to get a pointer to that page.
You can use the factor table to translate from numeric categories back to text (though
you probably have the original text column in your data anyway).
Having the factor list in an auxiliary table makes it easy to ensure that multiple
\ref apop_data sets use the same single categorization scheme. Generate factors in the
first set, then copy the factor list to the second, then run \ref apop_data_to_factors
on the second:
\code
apop_data_to_factors(d1);
d2->more = apop_data_copy(apop_data_get_factor_names(d1));
apop_data_to_factors(d2);
\endcode
See the documentation for \ref apop_logit for a sample linear model using a factor dependent variable and dummy independent variable.
\li\ref apop_data_to_dummies
\li\ref apop_data_to_factors
\li\ref apop_data_get_factor_names
*/
/** \page dbs Databases
These are convenience functions to handle interaction with SQLite or mySQL/mariaDB. They open one and only one database, and handle most of the interaction therewith for you.
You will probably first use \ref apop_text_to_db to pull data into the database, then \ref apop_query to clean the data in the database, and finally \ref apop_query_to_data to pull some subset of the data out for analysis.
\li In all cases, your query may be in <tt>printf</tt> form. For example:
\code
char tabname[] = "demographics";
char colname[] = "heights";
int min_height = 175;
apop_query("select %s from %s where %s > %i", colname, tabname, colname, min_height);
\endcode
See the \ref db_moments section below for not-SQL-standard math functions that you can
use when sending queries from Apophenia, such as \c pow, \c stddev, or \c sqrt.
\li \ref apop_text_to_db : Read a text file on disk into the database. Data analysis projects often start with a call to this.
\li \ref apop_data_print : If you include the argument <tt>.output_type='d'</tt>, this prints your \ref apop_data set to the database.
\li \ref apop_query : Manipulate the database, return nothing (e.g., insert rows or create table).
\li \ref apop_db_open : Optional, for when you want to use a database on disk.
\li \ref apop_db_close : A useful (and in some cases, optional) companion to \ref apop_db_open.
\li \ref apop_table_exists : Check to make sure you aren't reinventing or destroying data. Also, a clean way to drop a table.
\li Apophenia reserves the right to insert temp tables into the opened database. They
will all have names beginning with <tt>apop_</tt>, so the reader is advised to not
generate tables with such names, and is free to ignore or delete any such tables that
turn up.
\li If you need to deal with two databases, use SQL's <a
href="https://sqlite.org/lang_attach.html"><tt>attach database</tt></a>. By default
with SQLite, Apophenia opens an in-memory database handle. It is a sensible workflow to
use the faster in-memory database as the primary database, and then attach an on-disk database
to read in data and write final output tables.
\section edftd Extracting data from the database
\li\ref apop_db_to_crosstab : take up to three columns in the database (row, column, value) and produce a table of values.
\li\ref apop_query_to_data
\li\ref apop_query_to_float
\li\ref apop_query_to_mixed_data
\li\ref apop_query_to_text
\li\ref apop_query_to_vector
\section wdttd Writing data to the database
See the print functions at \ref Legi. E.g.
\code
apop_data_print(yourdata, .output_type='d', .output_name="dbtab");
\endcode
\section cmdline Command-line utilities
A few functions have proven to be useful enough to be worth breaking out into their own programs, for use in scripts or other data analysis from the command line:
\li The \c apop_text_to_db command line utility is a wrapper for the \ref apop_text_to_db command.
\li The \c apop_db_to_crosstab function is a wrapper for the \ref apop_db_to_crosstab function.
\section db_moments Database moments (plus pow()!)
SQLite lets users define new functions for use in queries, and Apophenia uses this facility to define a few common functions.
\li <tt>select ran() from table</tt> will produce a new random number between zero and one for every row of the input table, using \c gsl_rng_uniform.
\li The SQL standard includes the <tt>count(x)</tt> and <tt>avg(x)</tt> aggregators,
but statisticians are usually interested in higher moments as well---at least the
variance. Therefore, SQL queries using the Apophenia library may include any of these moments:
\code
select count(x), stddev(x), avg(x), var(x), variance(x), skew(x), kurt(x), kurtosis(x),
std(x), stddev_samp(x), stddev_pop(x), var_samp(x), var_pop(x)
from table
group by whatever
\endcode
<tt>var</tt> and <tt>variance</tt>; <tt>kurt</tt> and <tt>kurtosis</tt> do the same thing; choose the one that sounds better to you.
Kurtosis is the fourth central moment by itself, not adjusted by subtracting three or dividing by variance squared.
<tt>var</tt>, <tt>var_samp</tt>, <tt>stddev</tt> and <tt>stddev_samp</tt> give sample variance/standard deviation; <tt>variance</tt>, <tt>var_pop</tt>, <tt>std</tt> and <tt>stddev_pop</tt> give population standard deviation. The plethora of variants are for mySQL compatibility.
\li The var/skew/kurtosis functions calculate sample moments. If you want the second
population moment, multiply the variance by \f$(n-1)/n\f$; for the third population moment,
multiply the skew by \f$(n-1)(n-2)/n^2\f$. The equation for the unbiased sample kurtosis
as calculated in <a href="http://modelingwithdata.org/pdfs/moments.pdf">Appendix M of
<em>Modeling with Data</em></a> is not quite as easy to adjust.
\li Also provided: wrapper functions for standard math library
functions---<tt>sqrt(x)</tt>, <tt>pow(x,y)</tt>, <tt>exp(x)</tt>, <tt>log(x)</tt>,
and trig functions. They call the standard math library function of the same name
to calculate \f$\sqrt{x}\f$, \f$x^y\f$, \f$e^x\f$, \f$\ln(x)\f$, \f$\sin(x)\f$,
\f$\arcsin(x)\f$, et cetera. For example:
\code
select sqrt(x), pow(x,0.5), exp(x), log(x), log10(x),
sin(x), cos(x), tan(x), asin(x), acos(x), atan(x)
from table
\endcode
\li The <tt>ran()</tt> function calls <tt>gsl_rng_uniform</tt> to produce a uniform
draw between zero and one. It uses the stock of RNGs from \ref apop_rng_get_thread.
Here is a test script using many of the above.
\include db_fns.c
*/
/** \page modelsec Models
See \ref gentle_model for an overview of the intent and basic use of the \ref apop_model struct.
This segment goes into greater detail on the use of existing \ref apop_model objects.
If you need to write a new model, see \ref modeldetails.
The \c estimate function will estimate the parameters of your model. Just prep the data, select a model, and produce an estimate:
\code
apop_data *data = apop_query_to_data("select outcome, in1, in2, in3 from dataset");
apop_model *the_estimate = apop_estimate(data, apop_probit);
apop_model_print(the_estimate);
\endcode
Along the way to estimating the parameters, most models also find covariance estimates for
the parameters, calculate statistics like log likelihood, and so on, which the final print statement will show.
The <tt>apop_probit</tt> model that ships with Apophenia is unparameterized:
<tt>apop_probit->parameters==NULL</tt>. The output from the estimation,
<tt>the_estimate</tt>, has the same form as <tt>apop_probit</tt>, but
<tt>the_estimate->parameters</tt> has a meaningful value.
Apophenia ships with many well-known models for your immediate use, including
probability distributions, such as the \ref apop_normal, \ref apop_poisson, or \ref
apop_beta models. The data is assumed to have been drawn from a given distribution and
the question is only what distributional parameters best fit. For example, given that
the data is Normally distributed, find \f$\mu\f$ and \f$\sigma\f$ via
<tt>apop_estimate(your_data, apop_normal)</tt>.
There are also linear models like \ref apop_ols, \ref apop_probit, and \ref apop_logit. As in the example, they are on equal footing with the distributions, so nothing keeps you from making random draws from an estimated linear model.
\li If you send a data set with the \c weights vector filled, \ref apop_ols estimates Weighted OLS.
\li If the dependent variable has more than two categories, the \ref apop_probit and
\ref apop_logit models estimate a multinomial logit or probit.
\li There are separate \ref apop_normal and \ref apop_multivariate_normal functions
because the parameter formats are slightly different: the univariate Normal keeps both
\f$\mu\f$ and \f$\sigma\f$ in the vector element of the parameters; the multivariate
version uses the vector for the vector of means and the matrix for the \f$\Sigma\f$
matrix. The univariate version is so heavily used that it merits a special-case model.
See the \ref models page for a list of models shipped with Apophenia,
including popular favorites like \ref apop_beta, \ref apop_binomial, \ref apop_iv
(instrumental variables), \ref apop_kernel_density, \ref apop_loess, \ref apop_lognormal,
\ref apop_pmf (see \ref histosec below), and \ref apop_poisson.
Simulation models seem to not fit this form, but you will see below that if you can write an objective function for the \c p method of the model, you can use the above tools. Notably, you can estimate parameters via maximum likelihood and then give confidence intervals around those parameters.
<em>More estimation output</em>
In the \ref apop_model returned by \ref apop_estimate, you will find:
\li The actual parameter estimates are in an \ref apop_data set at \c your_model->parameters.
\li A pointer to the \ref apop_data set used for estimation, named \c data.
\li Scalar statistics of the model listed in the output model's \c info group,
which may include some hypothesis tests, a list of expected values, log likelihood, AIC, AIC_c, BIC, et cetera.
These can be retrieved via a form like
\code
apop_data_get(your_model->info, .rowname="log likelihood");
//or
apop_data_get(your_model->info, .rowname="AIC");
\endcode
If those are not necessary, adding to your model an \ref apop_parts_wanted_settings
group with its default values (see below on settings groups) signals to the model
that you want only the parameters and to not waste possibly significant CPU time
on covariances, expected values, et cetera. See the \ref apop_parts_wanted_settings
documentation for examples and further refinements.
\li In many cases, covariances of the parameters as a page appended to the parameters; retrieve via
\code
apop_data *cov = apop_data_get_page(your_model->parameters, "<Covariance>");
\endcode
\li Typically for regression-type models, the table of expected values (typically including
expected value, actual value, and residual) is a page stapled to the main info
page. Retrieve via:
\code
apop_data *predict = apop_data_get_page(your_model->info, "<Predicted>");
\endcode
See individual model documentation for what is provided by any given model.
<em>Post-estimation uses</em>
But we expect much more from a model than estimating parameters from data.
Continuing the above example where we got an estimated Probit model named \c the_estimate, we can interrogate the estimate in various familiar ways:
\code
apop_data *expected_value = apop_predict(NULL, the_estimate);
double density_under = apop_cdf(expected_value, the_estimate);
apop_data *draws = apop_model_draws(the_estimate, .count=1000);
\endcode
\subpage dataones
\section modelparameterization Parameterizing or initializing a model
The models that ship with Apophenia have the requisite procedures for estimation,
making draws, and so on, but have <tt>parameters==NULL</tt> and <tt>settings==NULL</tt>. The
model is thus, for many purposes, incomplete, and you will need to take some action to
complete the model. As per the examples to follow, there are several possibilities:
\li Estimate it! Almost all models can be sent with a data set as an argument to the
<tt>apop_estimate</tt> function. The input model is unchanged, but the output model
has parameters and settings in place.
\li If your model has a fixed number of numeric parameters, then you can set them with
\ref apop_model_set_parameters.
\li If your model has a variable number of parameters, you can directly set the
\c parameters element via \ref apop_data_falloc. For most purposes, you will also need to
set the \c msize1, \c msize2, \c vsize, and \c dsize elements to the size you want. See
the example below.
\li Some models have disparate, non-numeric settings rather than a simple matrix of
parameters. For example, an kernel density estimate needs a model as a kernel and a
base data set, which can be set via \ref apop_model_set_settings.
Here is an example that shows the options for parameterizing a model. After each
parameterization, 20 draws are made and written to a file named draws-[modelname].
\include ../eg/parameterization.c
\section transformsec Filtering & updating
The model structure makes it easy to generate new models that are variants of prior
models. Bayesian updating, for example, takes in one \ref apop_model that we call the
prior, one \ref apop_model that we call a likelihood, and outputs an \ref apop_model
that we call the posterior. One can produce complex models using simpler transformations
as well. For example, \ref apop_model_fix_params will set the free parameters of
an input model to a fixed value, thus producing a model with fewer parameters. To
transform a Normal(\f$\mu\f$, \f$\sigma\f$) into a one-parameter Normal(\f$\mu\f$, 1):
\code
apop_model *N_sigma1 = apop_model_fix_params(apop_model_set_parameters(apop_normal, NAN, 1));
\endcode
This can be used anywhere the original Normal distribution can be. To give another
example, if we need to truncate the distribution in the data space:
\code
//The constraint function.
double over_zero(apop_data *in, apop_model *m){
return apop_data_get(in) > 0;
}
apop_model *trunc = apop_model_dconstrain(.base_model=N_sigma1,
.constraint=over_zero);
\endcode
Chaining together simpler transformations is an easy method to produce
models of arbitrary detail. In the following example:
\li Nature generated data using a mixture of three Poisson distributions,
with \f$\lambda=2.8\f$, \f$2.0\f$, and \f$1.3\f$.
The resulting model is generated using \ref apop_model_mixture.
\li Not knowing the true distribution, the analyst
models the data with a single Poisson\f$(\lambda)\f$ distribution with a prior on \f$\lambda\f$.
The prior selected is a
truncated Normal(2, 1), generated by sending the stock
\ref apop_normal model to the data-space constraint function \ref apop_dconstrain.
\li The \ref apop_update function takes three arguments: the data set, which comes from
draws from the mixture, the prior, and the likelihood. It produces an output model
which, in this case, is a PMF describing a distribution over \f$\lambda\f$, because
a truncated Normal and a Poisson are not conjugate distributions. Knowing that it is
a PMF, the <tt>->data</tt> element holds a set of draws from the posterior.
\li The analyst would like to present an approximation to the posterior in a simpler form,
and so finds the parameters \f$\mu\f$ and \f$\sigma\f$ of the Normal distribution that
is closest to that posterior.
Here is a program---almost a single line of code---that builds the final approximation to the posterior
model from the subcomponents, including draws from Nature and the analyst's prior
and likelihood:
\include ../eg/transform.c
\section mathmethods Model methods
\li\ref apop_estimate : estimate the parameters of the model with data.
\li\ref apop_predict : the expected value function.
\li\ref apop_draw : random draws from an estimated model.
\li\ref apop_p : the probability of a given data set given the model.
\li\ref apop_log_likelihood : the log of \ref apop_p
\li\ref apop_score : the derivative of \ref apop_log_likelihood
\li\ref apop_model_print : write model components to the screen or a file
\li\ref apop_model_copy : duplicate a model
\li\ref apop_model_set_parameters : Use this to convert a Normal(\f$\mu\f$, \f$\sigma\f$) with unknown \f$\mu\f$ and \f$\sigma\f$ into a Normal(0, 1), for example.
\li\ref apop_model_free
\li\ref apop_model_clear , \ref apop_prep : remove the parameters from a parameterized model. Used infrequently.
\li\ref apop_model_draws : many random draws from an estimated model.
\li\ref apop_update : Bayesian updating
\li\ref apop_model_coordinate_transform : apply an invertible transformation to the data space
\li\ref apop_model_dconstrain : constrain the data space of a model to a subspace. E.g., truncate a Normal distribution so \f$x>0\f$.
\li\ref apop_model_fix_params : hold some parameters constant
\li\ref apop_model_mixture : a linear combination of models
\li\ref apop_model_cross : If \f$(d_1)\f$ has a Normal\f$(\mu, \sigma)\f$ distribution
and \f$d_2\f$ has an independent Poisson\f$(\lambda)\f$ distribution, then \f$(d_1,
d_2)\f$ has an <tt>apop_model_cross(apop_normal, apop_poisson)</tt> distribution with
parameters \f$(\mu, \sigma, \lambda)\f$.
\section modelsettings Settings groups
Describing a statistical, agent-based, social, or physical model in a standardized
form is difficult because every model has significantly different settings. An
MLE requires a method of search (conjugate gradient, simplex, simulated annealing),
and a histogram needs the number of bins to be filled with data.
So, the \ref apop_model includes a single list which can hold an arbitrary number of settings groups, like the search specifications for finding the maximum likelihood, a histogram for making random draws, and options about the model type.
Settings groups are automatically initialized with default values when
needed. If the defaults do no harm, then you don't need to think about
these settings groups at all.
Here is an example where a settings group is worth tweaking: the \ref apop_parts_wanted_settings group indicates which parts
of the auxiliary data you want.
\code
1 apop_model *m = apop_model_copy(apop_ols);
2 Apop_settings_add_group(m, apop_parts_wanted, .covariance='y');
3 apop_model *est = apop_estimate(data, m);
\endcode
Line one establishes the baseline form of the model. Line two adds a settings group
of type \ref apop_parts_wanted_settings to the model. By default other auxiliary items, like the expected values, are set to \c 'n' when using this group, so this specifies that we want covariance and only covariance. Having stated our preferences, line three does the estimation we want.
Notice that the \c _settings ending to the settings group's name isn't written---macros
make it happen. The remaining arguments to \c Apop_settings_add_group (if any) follow
the \ref designated syntax of the form <tt>.setting=value</tt>.
There is an \ref apop_model_copy_set macro that adds a settings group when it is first copied, joining up lines one and two above:
\code
apop_model *m = apop_model_copy_set(apop_ols, apop_parts_wanted, .covariance='y');
\endcode
Settings groups are copied with the model, which facilitates chaining
estimations. Continuing the above example, you could re-estimate to get the predicted
values and covariance via:
\code
Apop_settings_set(est, apop_parts_wanted, predicted, 'y');
apop_model *est2 = apop_estimate(data, est);
\endcode
Maximum likelihood search has many settings that could be modified, and so provides
another common example of using settings groups:
\code
apop_model *the_estimate = apop_estimate(data, apop_probit);
//Redo the Probit's MLE search using Newton's Method:
Apop_settings_add_group(the_estimate, apop_mle, .verbose='y',
.tolerance=1e-4, .method="Newton");
apop_model *re_est = apop_estimate(data, the_estimate);
\endcode
To clarify the distinction between parameters and settings, note that parameters are
estimated from the data, often via a maximum likelihood search. In an ML search,
the method of search, the number of bins in a histogram, or the number of steps in a
simulation would be held fixed as the search iterates over possible parameters (and
if these settings do change, then that is a meta-model that could be encapsulated into another
\ref apop_model). As a consequence, parameters are always numeric, while settings may
be any type.
\li \ref Apop_settings_set, for modifying a single setting, doesn't use the designated initializers format.
\li Because the settings groups are buried within the model, debugging them can be a
pain. Here is a documented macro for \c gdb that will help you pull a settings group out of a
model for your inspection, to cut and paste into your <tt>.gdbinit</tt>. It shouldn't be too difficult to modify this macro for other debuggers.
\code
define get_group
set $group = ($arg1_settings *) apop_settings_get_grp( $arg0, "$arg1", 0 )
p *$group
end
document get_group
Gets a settings group from a model.
Give the model name and the name of the group, like
get_group my_model apop_mle
and I will set a gdb variable named $group that points to that model,
which you can use like any other pointer. For example, print the contents with
p *$group
The contents of $group are printed to the screen as visible output to this macro.
end
\endcode
For using a model, that's all of what you need to know. For details on writing a new settings group, see \ref settingswriting .
\li\ref Apop_settings_add_group
\li\ref Apop_settings_set
\li\ref Apop_settings_get : get a single element from a settings group.
\li\ref Apop_settings_get_group : get the whole settings group.
*/
/** \page dataones Data format for regression-type models
Regression-type estimations typically require a constant column. That is, the 0th
column of the data is a constant (one), so the parameter \f$\beta_0\f$ is
slightly special in corresponding to a constant rather than a variable.
Some stats packages implicitly assume a constant column, which the user never
sees. This violates the principle of transparency upon which Apophenia is based.
Given a data matrix \f$X\f$ with the estimated parameters
\f$\beta\f$, if the model asserts that the product \f$X\beta\f$ has meaning, then you
should be able to easily calculate that product. With a ones column, a dot product is one line:
<tt>apop_dot(x, your_est->parameters)</tt>; without a ones column, one would basically
have to construct one (using \c gsl_matrix_set_all and \c apop_data_stack).
Each regression-type estimation has one dependent variable and several independent. In
the end, we want the dependent variable to be in the vector element. Removing
a column from a <tt>gsl_matrix</tt> and adjusting all subsequent columns is relatively
difficult, because (like most structs built with the aim of very efficient processing) the
struct depends on an equal spacing in memory between each element.
<em> The automatic case</em>
We can resolve both the need for a ones column and for having the dependent column in
the vector at the same time. Given a data set with no vector element and the dependent
variable in the first column of the matrix, we can copy the dependent variable into
the vector and then replace the first column of the matrix with ones. The result fits
all of the above expectations.
You as a user merely have to send in a \c apop_data set with \c NULL vector and a dependent
column in the first column. If the data is coming from the database, then the query
is natural:
\code
apop_data *regression_data = apop_query_to_data("select depvar, indyvar1, indyvar2, indyvar3 from dataset");
apop_model_print(apop_estimate(regression_data, apop_ols));
\endcode
<em> The already-prepped case</em>
If your data has a vector element, then the prep routines won't change anything.
If you don't want to use a constant column, or your data has already been prepped by
an estimation, then this is what you want.
\code
apop_data *regression_data = apop_query_to_mixed_data("vmmm", "select depvar, indyvar1, indvar2, indvar3 from dataset");
apop_model_print(apop_estimate(regression_data, apop_logit));
\endcode
*/
/** \page testpage Tests & diagnostics
Here is the model for all hypothesis testing within Apophenia:
\li Calculate a statistic.
\li Describe the distribution of that statistic.
\li Work out how much of the distribution is (above|below|closer to zero than) the statistic.
There are a handful of named tests that produce a known statistic and then compare to a
known distribution, like \ref apop_test_kolmogorov or \ref apop_test_fisher_exact. For
traditional distributions (Normal, \f$t\f$, \f$\chi^2\f$), use the \ref apop_test convenience
function.
In especially common cases, like the parameters from an OLS regression,
the commonly-associated \f$t\f$ test is included as part of the estimation
output, typically as a row in the \c info element of the output \ref apop_model.
\li\ref apop_test
\li\ref apop_paired_t_test
\li\ref apop_f_test
\li\ref apop_t_test
\li\ref apop_test_anova_independence
\li\ref apop_test_fisher_exact
\li\ref apop_test_kolmogorov
\li\ref apop_estimate_coefficient_of_determination
\li\ref apop_estimate_r_squared
See also these Monte Carlo methods:
\li\ref apop_bootstrap_cov
\li\ref apop_jackknife_cov
To give another example of testing, here is a function that was briefly a part of
Apophenia, but seemed a bit out of place. Here it is as a sample:
\code
// Input: any vector, which will be normalized in place. Output: 1 - the p-value
// for a chi-squared test to answer the question, "with what confidence can I
// reject the hypothesis that the variance of my data is zero?"
double apop_test_chi_squared_var_not_zero(gsl_vector *in){
Apop_stopif(!in, return NAN, 0, "input vector is NULL. Doing nothing.");
apop_vector_normalize(in, .normalization_type='s');
double sum=apop_vector_map_sum(in, gsl_pow_2);
return gsl_cdf_chisq_P(sum, in->size);
}
\endcode
Or, consider the Rao statistic,
\f${\partial\over \partial\beta}\log L(\beta)'I^{-1}(\beta){\partial\over \partial\beta}\log L(\beta)\f$
where \f$L\f$ is a model's likelihood function and \f$I\f$ its information matrix. In code:
\code
apop_data * infoinv = apop_model_numerical_covariance(data, your_model);
apop_data * score = &(apop_data*){.vector=apop_numerical_gradient(data, your_model)};
apop_data * stat = apop_dot(apop_dot(score, infoinv), score);
\endcode
Given the correct assumptions, this is \f$\sim \chi^2_m\f$, where \f$m\f$ is the dimension of \f$\beta\f$, so the odds of a Type I error given the model is:
\code
double p_value = apop_test(stat, "chi squared", beta->size);
\endcode
<em>Generalized parameter tests</em>
But if your model is not from the textbook, then you have the tools to apply the
above three-step process to the parameters of any \ref apop_model.
\li Model parameters are a statistic, and you know that <tt>apop_estimate(your_data,
your_model)</tt> will output a model with a <tt>parameters</tt> element.
\li \ref apop_parameter_model will return an \ref apop_model describing
the distribution of these parameters.
\li We now have the two ingredients to send to \ref apop_cdf, which takes in a model
and a data point and returns the area under the data point.
Defaults for the parameter models are filled in via bootstrapping or resampling, meaning
that if your model's parameters are decidedly off the Normal path, you can still test
claims about the parameters.
The introductory example in \ref gentle ran a standard OLS regression, whose output includes some
standard hypothesis tests; to conclude, let us go the long way and replicate those results
via the general \ref apop_parameter_model mechanism. The results here will of course be
identical, but the more general mechanism can be used in situations where the standard
models don't apply.
The first part of this program is identical to the introductory program, using \c
ss08pdc.csv if you have downloaded it as per the instructions in \ref gentle, or a
simple sample data set if not. The second half executes the three steps uses many
of the above features: one of the inputs to \ref apop_parameter_model (which row of
the parameter set to use) is sent by adding a settings group, we pull that row into
a separate data set using \ref Apop_r, and we set its vector value by referring to it
as the -1st element.
\include ols2.c
Note that the procedure did not assume the model parameters had a certain form. It
queried the model for the distribution of parameter \c agep, and if the model didn't have
a closed-form answer then a distribution via bootstrap would be provided. Then that model
was queried for its CDF. [The procedure does assume a symmetric distribution. Fixing this
is left as an exercise for the reader.] For a model like OLS, this is entirely overkill,
which is why OLS provides the basic hypothesis tests automatically. But for models
where the distribution of parameters is unknown or has no closed-form solution, this
may be the only recourse.
*/
/** \page histosec Empirical distributions and PMFs (probability mass functions)
The \ref apop_pmf model wraps an \ref apop_data set so it can be read as an empirical
model, with a likelihood function (equal to the associated weight for observed
values and zero for unobserved values), a random number generator (which
simply makes weighted random draws from the data), and so on. Setting it up is a
model estimation from data like any other, done via \ref apop_estimate(\c your_data,
\ref apop_pmf).
You have the option of cleaning up the data before turning it into a PMF. For example...
\code
apop_data_pmf_compress(your_data); //remove duplicates
apop_data_sort(your_data);
apop_vector_normalize(your_data->weights); //weights sum to one
apop_model *a_pmf = apop_estimate(your_data, apop_pmf);
\endcode
These are largely optional.
\li The CDF is calculated based on the percent of the weights between the zeroth row of the PMF
and the row specified. This generally makes more sense after \ref apop_data_sort.
\li Compression produces a corresponding improvement in efficiency when first calculating
CDFs, but is otherwise not necessary.
\li Sorting or normalizing is not necessary for making draws or getting a likelihood or log likelihood.
It is the \c weights vector that holds the density represented by each row; the rest of the row represents the coordinates of that density. If the input data set has no \c weights segment, then I assume that all rows have equal weight.
For a PMF model, the \c parameters are \c NULL, and the \c data itself
is used for calculation. Therefore, modifying the data post-estimation can break some
internal settings set during estimation. If you modify the data, throw away any existing
PMFs (via \ref apop_model_free) and re-estimate a new one.
\section histocompare Comparing histograms
Using \ref apop_data_pmf_compress puts the data into one bin for each unique value in
the data set. You may instead want bins of fixed with, in the style of a histogram,
which you can get via \ref apop_data_to_bins. It requires a bin specification. If you
send a \c NULL binspec, then the offset is zero and the bin size is big enough to ensure
that there are \f$\sqrt{N}\f$ bins from minimum to maximum. The binspec will be added
as a page to the data set, named <tt>"<binspec>"</tt>. See the \ref apop_data_to_bins
documentation on how to write a custom bin spec.
There are a few ways of testing the claim that one distribution equals another, typically an empirical PMF versus a smooth theoretical distribution. In both cases, you will need two distributions based on the same binspec.
For example, if you do not have a prior binspec in mind, then you can use the one generated by the first call to the histogram binning function to make sure that the second data set is in sync:
\code
apop_data_to_bins(first_set, NULL);
apop_data_to_bins(second_set, apop_data_get_page(first_set, "<binspec>"));
\endcode
You can use \ref apop_test_kolmogorov or \ref apop_histograms_test_goodness_of_fit to generate the appropriate statistics from the pairs of bins.
Kernel density estimation will produce a smoothed PDF. See \ref apop_kernel_density for details.
Or, use \ref apop_vector_moving_average for a simpler smoothing method.
\li\ref apop_data_pmf_compress() : merge together redundant rows in a data set before calling
\ref apop_estimate(\c your_data, \ref apop_pmf); optional.
\li\ref apop_vector_moving_average() : smooth a vector (e.g., <tt>your_pmf->data->weights</tt>) via moving average.
\li\ref apop_histograms_test_goodness_of_fit() : goodness-of-fit via \f$\chi^2\f$ statistic
\li\ref apop_test_kolmogorov() : goodness-of-fit via Kolmogorov-Smirnov statistic
\li\ref apop_kl_divergence() : measure the information loss from one (typically empirical) distribution to another distribution.
*/
/** \page maxipage Optimization
This section includes some notes on the maximum likelihood routine. As in the section
on writing models above, if a model has a \c p or \c log_likelihood method but no \c
estimate method, then calling \c apop_estimate(your_data, your_model) executes the
default estimation routine of maximum likelihood.
If you are a not a statistician, then there are a few things you will need to keep in
mind:
\li Physicists, pure mathematicians, and the GSL minimize; economists, statisticians, and
Apophenia maximize. If you are doing a minimization, be sure that your function returns minus the objective
function's value.
\li The overall setup is about estimating the parameters of a model with data. The user
provides a data set and an unparameterized model, and the system tries parameterized
models until one of them is found to be optimal. The data is fixed. The optimization
tries a series of parameterized models, searching for the one that is most likely. In
a non-stats setting, you may have \c NULL data.
\li Because the unit of analysis is a parameterized model,
not just parameters, you need to have an \ref apop_model
wrapping your objective function.
This example, to be discussed in detail below, optimizes
Rosenbrock's banana function, \f$(1-x)^2+ s(y - x^2)^2\f$, where the
scaling factor \f$s\f$ is fixed ahead of time, say at 100.
\include ../eg/banana.c
The \c banana function returns a single number to be minimized. You will need to write an
\ref apop_model to send to the optimizer, which is a two step process: write a log
likelihood function wrapping the real objective function (\c ll), and a model that uses that
log likelihood (\c b).
\li The <tt>.vsize=2</tt> part of the declaration of \c b on the second
line of <tt>main()</tt> specified that the model's parameters are a vector of
size two. That is, the list of <tt>double</tt>s to send to \c banana is set in
<tt>in->parameters->vector->data</tt>.
\li The \c more element of the \ref apop_model
structure is designed to hold any arbitrary structure of size \c more_size, which
is useful for models that require additional constants or other settings, like the
<tt>coeff_struct</tt> here. See \ref settingswriting for more on handling model settings.
\li Statisticians want the covariance and basic tests about the parameters.
This line shuts off all auxiliary calculations:
\code
Apop_settings_add_group(your_model, apop_parts_wanted);
\endcode
See the documentation for \ref apop_parts_wanted_settings for details about how this
works. It can also offer quite the speedup: especially for high-dimensional problems,
finding the covariance matrix without any additional information can take dozens of evaluations
of the objective function for each evaluation that is part of the search itself.
\li MLEs have an especially large number of parameter tweaks that could be made;
see the \ref apop_mle_settings page.
\li As a useful diagnostic, you can add a \c NULL \ref apop_data set to the MLE
settings in the <tt>.path</tt> slot, and it will be allocated and filled with the
sequence of points tried by the optimizer.
\li The program has some extras above and beyond the necessary: it uses two methods
(notice how easy it is to re-run an estimation with an alternate method, but the syntax
for modifying a setting differs from the initialization syntax) and checks that the
results are accurate.
\section constr Setting Constraints
The problem is that the parameters of a function must not take on certain values,
either because the function is undefined for those values or because parameters with
certain values would not fit the real-world problem.
If you give the optimizer an unconstrained likelihood function plus a separate constraint
function, \ref apop_maximum_likelihood will combine them to a function that is continuous
at the constraint boundary, but which is guaranteed to never have an optimum outside of the constraint.
A constraint function must do three things:
\li If the constraint does not bind (i.e. the parameter values are OK), then it must return zero.
\li If the constraint does bind, it must return a penalty, that indicates how far off the parameter is from meeting the constraint.
\li If the constraint does bind, it must set a return vector that the likelihood function can take as a valid input. The penalty at this returned value must be zero.
The idea is that if the constraint returns zero, the log likelihood function will
return the log likelihood as usual, and if not, it will return the log likelihood at
the constraint's return vector minus the penalty. To give a concrete example, here
is a constraint function that will ensure that both parameters of a two-dimensional
input are both greater than zero, and that their sum is greater than two. As with the
constraints for many of the models that ship with Apophenia, it is a wrapper for \ref
apop_linear_constraint.
\code
static long double greater_than_zero_constraint(apop_data *data, apop_model *v){
static apop_data *constraint = NULL;
if (!constraint) constraint= apop_data_falloc((3,3,2), 0, 1, 0, //0 < 1x + 0y
0, 0, 1, //0 < 0x + 1y
2, 1, 1); //2 < 1x + 1y
return apop_linear_constraint(v->parameters->vector, constraint, 1e-3);
}
\endcode
\li\ref apop_linear_constraint()
\section simanneal Notes on simulated annealing
For convex optimizations, methods like conjugate gradient search work well, and
for relatively smooth optimizations, the Nelder-Mead simplex algorithm is a good
choice. For situations where the surface being searched may have several local optima
and be otherwise badly behaved, there is simulated annealing.
Simulated annealing is a controlled random walk. As with the other methods, the
system tries a new point, and if it is better, switches. Initially, the system is
allowed to make large jumps, and then with each iteration, the jumps get smaller,
eventually converging. Also, there is some decreasing probability that if the new
point is less likely, it will still be chosen. Simulated annealing is best for
situations where there may be multiple local optima. Early in the random walk, the
system can readily jump from one to another; later it will fine-tune its way toward the
optimum. The number of points tested is determined by the parameters of the simulated
colling program, not the values returned by the likelihood function. If you know your
function is globally convex (as are most standard probability functions), then this
method is overkill.
\section mlfns Useful functions
\li\ref apop_estimate_restart : Restarting an MLE with different settings can improve results.
\li\ref apop_maximum_likelihood : Rarely called directly. If a model has no \c estimate element, call \ref apop_estimate to prep the model and run an MLE.
\li\ref apop_model_numerical_covariance
\li\ref apop_numerical_gradient
*/
/** \page moreasst Assorted
Some functions for missing data:
\li\ref apop_data_listwise_delete
\li\ref apop_ml_impute
A few more descriptive methods:
\li\ref apop_matrix_pca : Principal component analysis
\li\ref apop_anova : One-way or two-way ANOVA tables
\li\ref apop_rake : Iterative proportional fitting on large, sparse tables
General utilities:
\li\ref Apop_stopif : Apophenia's error-handling and warning-printing macro.
\li\ref apop_opts : the global options
\li\ref apop_system : a printf-style wrapper around the standard \c system function.
A few more math utilities:
\li\ref apop_matrix_is_positive_semidefinite
\li\ref apop_matrix_to_positive_semidefinite
\li\ref apop_generalized_harmonic
\li\ref apop_multivariate_gamma
\li\ref apop_multivariate_lngamma
\li\ref apop_rng_alloc
*/
/** \page mingw MinGW
Minimalist GNU for Windows is indeed minimalist: it is not a full POSIX subsystem, and provides no package manager. Therefore, you will have to make some adjustments and install the dependencies yourself.
Matt P. Dziubinski successfully used Apophenia via MinGW; here are his instructions (with edits by BK):
\li get libregex (the ZIP file) from:
http://sourceforge.net/project/showfiles.php?group_id=204414&package_id=306189
\li get libintl (three ZIP files) from:
http://gnuwin32.sourceforge.net/packages/libintl.htm .
download "Binaries", "Dependencies", "Developer files"
\li follow "libintl" steps from:
http://kayalang.org/download/compiling/windows
\li Modify \c Makefile, adding -lpthread to AM_CFLAGS (removing -pthread) and -lregex to AM_CFLAGS and LIBS
\li Now compile the main library:
\code
make
\endcode
\li Finally, put one more expected directory in place and install:
\code
mkdir -p -- "/usr/local/Lib/site-packages"
make install
\endcode
\li You will get the usual warning about library paths, and may have to take the specified action:
\code
----------------------------------------------------------------------
Libraries have been installed in:
/usr/local/lib
If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-LLIBDIR' linker flag
See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
\endcode
*/
/* optionaldetails Implementation of optional arguments [this section ignored by doxygen]
Optional and named arguments are among the most commonly commented-on features of Apophenia, so this page goes into full detail about the implementation.
To use these features, see the all-you-really-need summary at the \ref designated
page. For a background and rationale, see the blog entry at http://modelingwithdata.org/arch/00000022.htm .
I'll assume you've read both links before continuing.
OK, now that you've read the how-to-use and the discussion of how optional and named arguments can be constructed in C, this page will show how they are done in Apophenia. The level of details should be sufficient to implement them in your own code if you so desire.
There are three components to the process of generating optional arguments as implemented here:
\li Produce a \c struct whose elements match the arguments to the function.
\li Write a wrapper function that takes in the struct, unpacks it, and calls the original function.
\li Write a macro that makes the user think the wrapper function is the real thing.
None of these steps are really rocket science, but there is a huge amount of redundancy.
Apophenia includes some macros that reduce the boilerplate redundancy significantly. There are two layers: the C-standard code, and the script that produces the C-standard code.
We'll begin with the C-standard header file:
\code
#ifdef APOP_NO_VARIADIC
void apop_vector_increment(gsl_vector * v, int i, double amt);
#else
void apop_vector_increment_base(gsl_vector * v, int i, double amt);
apop_varad_declare(void, apop_vector_increment, gsl_vector * v; int i; double amt);
#define apop_vector_increment(...) apop_varad_link(apop_vector_increment, __VA_ARGS__)
#endif
\endcode
First, there is an if/else that allows the system to degrade gracefully
if you are sending C code to a parser like swig, whose goals differ
too much from straight C compilation for this to work. Set \c
APOP_NO_VARIADIC to produce a plain function with no variadic support.
Else, we begin the above steps. The \c apop_varad_declare line expands to the following:
\code
typedef struct {
gsl_vector * v; int i; double amt ;
} variadic_type_apop_vector_increment;
void variadic_apop_vector_increment(variadic_type_apop_vector_increment varad_in);
\endcode
So there's the ad-hoc struct and the declaration for the wrapper
function. Notice how the arguments to the macro had semicolons, like a
struct declaration, rather than commas, because the macro does indeed
wrap the arguments into a struct.
Here is what the \c apop_varad_link would expand to:
\code
#define apop_vector_increment(...) variadic_apop_increment_base((variadic_type_apop_vector_increment) {__VA_ARGS__})
\endcode
That gives us part three: a macro that lets the user think that they are
making a typical function call with a set of arguments, but wraps what
they type into a struct.
Now for the code file where the function is declared. Again, there is is an \c APOP_NO_VARIADIC wrapper. Inside the interesting part, we find the wrapper function to unpack the struct that comes in.
\code
\#ifdef APOP_NO_VARIADIC
void apop_vector_increment(gsl_vector * v, int i, double amt){
\#else
apop_varad_head( void , apop_vector_increment){
gsl_vector * apop_varad_var(v, NULL);
Apop_stopif(!v, return, 0, "You sent me a NULL vector.");
int apop_varad_var(i, 0);
double apop_varad_var(amt, 1);
apop_vector_increment_base(v, i, amt);
}
void apop_vector_increment_base(gsl_vector * v, int i, double amt){
#endif
v->data[i * v->stride] += amt;
}
\endcode
The
\c apop_varad_head macro reduces redundancy, and will expand to
\code
void variadic_apop_vector_increment (variadic_type_variadic_apop_vector_increment varad_in)
\endcode
The function with this header thus takes in a single struct, and for every variable, there is a line like
\code
double apop_varad_var(amt, 1);
\endcode
which simply expands to:
\code
double amt = varad_in.amt ? varad_in.amt : 1;
\endcode
Thus, the macro declares each not-in-struct variable, and so there will need to be
one such declaration line for each argument. Apart from requiring declarations, you
can be creative: include sanity checks, post-vary the variables of the inputs, unpack
without the macro, and so on. That is, this parent function does all of the bookkeeping,
checking, and introductory shunting, so the base function can do the math. Finally,
the introductory section will call the base function.
The setup goes out of its way to leave the \c _base function in the public namespace,
so that those who would prefer speed to bounds-checking can simply call that function
directly, using standard notation. You could eliminate this feature by merging
the two functions.
<b>The m4 script</b>
The above is all you need to make this work: the varad.h file, and the above structures. But there is still a lot of redundancy, which can't be eliminated by the plain C preprocessor.
Thus, in Apophenia's code base (the one you'll get from checking out the git repository, not the gzipped distribution that has already been post-processed) you will find a pre-preprocessing script that converts a few markers to the above form. Here is the code that will expand to the above C-standard code:
\code
//header file
APOP_VAR_DECLARE void apop_vector_increment(gsl_vector * v, int i, double amt);
//code file
APOP_VAR_HEAD void apop_vector_increment(gsl_vector * v, int i, double amt){
gsl_vector * apop_varad_var(v, NULL);
Apop_stopif(!v, return, 0, "You sent me a NULL vector.");
int apop_varad_var(i, 0);
double apop_varad_var(amt, 1);
APOP_VAR_END_HEAD
v->data[i * v->stride] += amt;
}
\endcode
It is obviously much shorter. The declaration line is actually a C-standard declaration with the \c APOP_VAR_DECLARE preface, so you don't have to remember when to use semicolons. The function itself looks like a single function, but there is again a marker before the declaration line, and the introductory material is separated from the main matter by the \c APOP_VAR_END_HEAD line. Done right, drawing a line between the introductory checks or initializations and the main function can improve readability.
The m4 script inserts a <tt>return function_base(...)</tt> at the end of the header
function, so you don't have to. If you want to call the function before the last line, you
can do so explicitly, as in the expansion above, and add a bare <tt>return;</tt> to
guarantee that the call to the base function that the m4 script will insert won't ever be
reached.
One final detail: it is valid to have types with commas in them---function arguments. Because commas get turned to semicolons, and m4 isn't a real parser, there is an exception built in: you will have to replace commas with exclamation marks in the header file (only). E.g.,
\code
APOP_VAR_DECLARE apop_data * f_of_f(apop_data *in, void *param, int n, double (*fn_d)(double ! void * !int));
\endcode
m4 is POSIX standard, so even if you can't read the script, you have the program needed to run it. For example, if you name it \c prep_variadics.m4, then run
\code
m4 prep_variadics.m4 myfile.m4.c > myfile.c
\endcode
*/
/**
\page gentle A quick overview
This is a "gentle introduction" to the Apophenia library. It is intended
to give you some initial bearings on the typical workflow and the concepts and tricks that
the manual pages assume you are familiar with.
If you want to install Apophenia now so you can try the samples on this page, see the \ref setup page.
An outline of this overview:
\li Apophenia fills a space between traditional C libraries and stats packages.
\li The \ref apop_data structure represents a data set (of course). Data sets are inherently complex,
but there are many functions that act on \ref apop_data sets to make life easier.
\li The \ref apop_model encapsulates the sort of actions one would take with a model, like estimating model parameters or predicting values based on new inputs.
\li Databases are great, and a perfect fit for the sort of paradigm here. Apophenia
provides functions to make it easy to jump between database tables and \ref apop_data sets.
<em> The opening example</em>
Setting aside the more advanced applications and model-building tasks, let us begin with
the workflow of a typical fitting-a-model project using Apophenia's tools:
\li Read the raw data into the database using \ref apop_text_to_db.
\li Use SQL queries handled by \ref apop_query to massage the data as needed.
\li Use \ref apop_query_to_data to pull some of the data into an in-memory \ref apop_data set.
\li Call a model estimation such as \code apop_estimate (data_set, apop_ols)\endcode or \code apop_estimate (data_set, apop_probit)\endcode to fit parameters to the data. This will return an \ref apop_model with parameter estimates.
\li Interrogate the returned estimate, by dumping it to the screen with \ref apop_model_print, sending its parameters and variance-covariance matrices to additional tests (the \c estimate step runs a few for you), or send the model's output to be input to another model.
Here is an example of most of the above steps which you can compile and run, to be discussed in detail below.
The program relies on the U.S. Census's American Community Survey public use microdata for DC 2008, which you can get from the command line via:
\code
wget https://raw.github.com/rodri363/tea/master/demo/ss08pdc.csv
\endcode
or by pointing your browser to that address and saving the file.
The program:
\code
#include <apop.h>
int main(){
apop_text_to_db(.text_file="ss08pdc.csv", .tabname="dc");
apop_data *data = apop_query_to_data("select log(pincp+10) as log_income, agep, sex "
"from dc where agep+ pincp+sex is not null and pincp>=0");
apop_model *est = apop_estimate(data, apop_ols);
apop_model_print(est);
}
\endcode
If you saved the code to <tt>census.c</tt> and don't have a \ref makefile or other
build system, then you can compile it with
\code
gcc census.c -std=gnu99 -lapophenia -lgsl -lgslcblas -lsqlite3 -o census
\endcode
or
\code
clang census.c -lapophenia -lgsl -lgslcblas -lsqlite3 -o census
\endcode
and then run it with <tt>./census</tt>. This compile line will work on any system with all the requisite tools,
but for full-time work with this or any other C library, you will probably want to write a \ref makefile.
The results are unremarkable---age has a positive effect on income, and sex
(1=male, 2=female) does has a negative effect---but it does give us some lines of
code to dissect.
The first two lines in \c main() make use of a database.
I'll discuss the value of the database step more at the end of this page, but for now,
note that there are several functions, \ref apop_query and \ref apop_query_to_data
being the ones you will most frequently be using, that will allow you to talk to and
pull data from either an SQLite or mySQL/mariaDB database. The database is a natural
place to do data processing like renaming variables, selecting subsets, and transforming values.
<em> Designated initializers</em>
Like this line,
\code
apop_text_to_db(.text_file="data", .tabname="d");
\endcode
many Apophenia functions accept named, optional arguments. To give another example,
the \ref apop_data set has the usual row and column numbers, but also row and column
names. So you should be able to refer to a cell by any combination of name or number;
for the data set you read in above, which has column names, all of the following work:
\code
x = apop_data_get(data, 2, 3); //observation 2, column 3
x = apop_data_get(data, .row=2, .colname="sex"); // same
apop_data_set(data, 2, 3, 1);
apop_data_set(data, .colname="sex", .row=2, .val=1);
\endcode
Default values mean that the \ref apop_data_get, \ref apop_data_set, and \ref apop_data_ptr functions handle matrices, vectors, and scalars sensibly:
\code
//Let v be a hundred-element vector:
apop_data *v = apop_data_alloc(100);
[fill with data here]
double x1 = apop_data_get(v, 10);
apop_data_set(v, 2, .val=x1);
//A 100x1 matrix behaves like a vector
apop_data *m = apop_data_alloc(100, 1);
[fill with data here]
double m1 = apop_data_get(v, 1);
//let s be a scalar stored in a 1x1 apop_data set:
apop_data *v = apop_data_alloc(1);
double *scalar = apop_data_ptr(s);
\endcode
These conveniences may be new to users of less user-friendly C libraries, but it it fully
conforms to the C standard (ISO/IEC 9899:2011). See the \ref designated page for details.
\section apop_data
A lot of real-world data processing is about quotidian annoyances about text versus
numeric data or dealing with missing values, and the \ref apop_data set and its
many support functions are intended to make data processing in C easy. Some users of
Apophenia use the library only for its \ref apop_data set and associated functions. See
\ref dataoverview for extensive notes on using the structure.
The structure includes seven parts:
\li a vector,
\li a matrix,
\li a grid of text elements,
\li a vector of weights,
\li names for everything: row names, a vector name, matrix column names, text names,
\li a link to a second page of data, and
\li an error marker
This is not a generic and abstract ideal, but is the sort of mess that real-world data sets look like. For
example, here is some data for a weighted OLS regression. It includes an outcome
variable in the vector, dependent variables in the matrix and text grid,
replicate weights, and column names in bold labeling the variables:
\htmlinclude apop_data_fig.html
\latexinclude apop_data_fig.tex
Apophenia's functions generally assume that one row across all of these elements
describes a single observation or data point.
See above for some examples of getting and setting individual elements.
Also, \ref apop_data_get, \ref apop_data_set, and \ref apop_data_ptr consider the vector to be the -1st column,
so using the data set in the figure, <tt>apop_data_get(sample_set, .row=0, .col=-1) == 1</tt>.
<em> Reading in data</em>
As per the example above, use \ref apop_text_to_data or \ref apop_text_to_db and then \ref apop_query_to_data.
<em> Subsets</em>
There are many macros to get views of subsets of the data. Each generates a disposable
wrapper around the base data: once the variable goes out of scope, the wrapper
disappears, but modifications made to the data in the view are modifications to the
base data itself.
\include simple_subsets.c
All of these slicing routines are macros, because they generate several
background variables in the current scope (something a function can't do). Traditional
custom is to put macro names in all caps, like \c APOP_DATA_ROWS, which to modern
sensibilities looks like yelling. The custom has a logic: there are ways to hang
yourself with macros, so it is worth distinguishing them typographically.
Apophenia tones it down by capitalizing only the first letter.
<em> Basic manipulations</em>
See \ref dataoverview for a list of many other manipulations of data sets, such as
\ref apop_data_listwise_delete for quick-and-dirty removal of observations with <tt>NaN</tt>s,
\ref apop_data_split / \ref apop_data_stack,
or \ref apop_data_sort to sort all elements by a single column.
<em> Apply and map</em>
If you have an operation of the form <em>for each element of my data set, call this
function</em>, then you can use \ref apop_map to do it. You could basically do everything you
can do with an apply/map function via a \c for loop, but the apply/map approach is clearer
and more fun. Also, if you set OpenMP's <tt>omp_set_num_threads(N)</tt> for any \c N
greater than 1 (the default on most systems is the number of CPU cores), then the work
of mapping will be split across multiple CPU threads. See \ref mapply for a number
of examples.
<em> Text</em>
String handling in C usually requires some tedious pointer and memory handling, but the functions
to put strings into the text grid in the \ref apop_data structure and get them out
again will do the pointer shunting for you. The \ref apop_text_alloc function is
really a realloc function: you can use it to resize the text grid as necessary. The
\ref apop_text_set function will write a single string to the grid, though you may be
using \ref apop_query_to_text or \ref apop_query_to_mixed_data to read in an entire
data set at once. Functions that act on entire data sets, like \ref apop_data_rm_rows,
handle the text part as well.
The text grid for \c your_data has <tt>your_data->textsize[0]</tt> rows and <tt>your_data->textsize[1]</tt> columns. If you are using only the functions to this point, then empty elements are a blank string (<tt>""</tt>), not \c NULL.
For reading individual elements, refer to the \f$(i,j)\f$th text element via <tt>your_data->text[i][j]</tt>.
<em> Errors</em>
Many functions will set the <tt>error</tt> element of the \ref apop_data structure being operated on if anything goes wrong. You can use this to halt the program or take corrective action:
\code
apop_data *the_data = apop_query_to_data("select * from d");
Apop_stopif(!the_data || the_data->error, exit(1), 0, "Trouble querying the data");
\endcode
<em> The whole structure</em>
Here is a diagram of all of Apophenia's structures and how they
relate. It is taken from this
<a href="http://modelingwithdata.org/pdfs/cheatsheet.pdf">cheat sheet</a> on general C and SQL use (2 page PDF).
\image html structs.png width="100%"
\image latex structs.png width=18cm
All of the elements of the \ref apop_data structure are laid out at middle-left. You have
already met the vector, matrix, weights, and text grid.
The diagram shows the \ref apop_name structure, which has received little mention so far because names
basically take care of themselves. A query will bring in column names (and row names if you set <tt>apop_opts.db_name_column</tt>), or use \ref apop_data_add_names to add names to your data
set and \ref apop_name_stack to copy from one data set to another.
The \ref apop_data structure has a \c more element, for when your data is best expressed
in more than one page of data. Use \ref apop_data_add_page, \ref apop_data_rm_page,
and \ref apop_data_get_page. Output routines will sometimes append an extra page of
auxiliary information to a data set, such as pages named <tt>\<Covariance\></tt> or
<tt>\<Factors\></tt>. The angle-brackets indicate a page that describes the data set
but is not a part of it (so an MLE search would ignore that page, for example).
Now let us move up the structure diagram to the \ref apop_model structure.
\section gentle_model apop_model
Even restricting ourselves to the most basic operations, there are a lot of things that
we want to do with our models: use a data set to estimate the parameters of a model (like the mean and
variance of a Normal distribution), or draw random numbers, or show the
expected value, or show the expected value of one part of the data given fixed values
for the rest of it. The \ref apop_model is intended to encapsulate most of these desires
into one object, so that models can easily be swapped around, modified to create new models,
compared, and so on.
From the figure above, you can see that the \ref apop_model structure
includes a number of informational items, key being the \c parameters, \c data, and
\c info elements; a list of settings to be discussed below; and a set of procedures
for many operations. Its contents are not (entirely) arbitrary: the overall intent
and the theoretical basis for what is and is not included in an \ref apop_model are
described in this <a href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf">U.S.
Census Bureau research report</a>.
There are helper functions that will allow you to avoid dealing with the model
internals. For example, the \ref apop_estimate helper function means you never have
to look at the model's \c estimate method (if it even has one), and you will simply
pass the model to a function, as with the above form:
\code
apop_model *est = apop_estimate(data, apop_ols);
\endcode
\li Apophenia ships with a broad set of models, like \ref apop_ols, \ref apop_dirichlet,
\ref apop_loess, and \ref apop_pmf (probability mass function); see the full list on <a href="http://apophenia.info/group__models.html">the models documentation page</a>. You would fit
any of them using \ref apop_estimate call, with the appropriate model as the second input.
\li The models that ship with Apophenia, like \ref apop_ols, include the procedures and some metadata, but are of course not yet estimated using a data set (i.e., <tt>data == NULL</tt>, <tt>parameters == NULL</tt>). The line above generated a new
model, \c est, which is identical to the base OLS model but has estimated parameters
(and covariances, and basic hypothesis tests, a log likelihood, \f$AIC_c\f$, \f$BIC\f$, et cetera), and a \c data pointer to the \ref apop_data set used for estimation.
\li You will mostly use the models by passing them as inputs to
functions like \ref apop_estimate, \ref apop_draw, or \ref apop_predict; more examples below.
Other than \ref apop_estimate, most require a parameterized model like \c est. After all, it doesn't make sense to
draw from a Normal distribution until its mean and standard deviation are specified.
\li If you know what the parameters should be, for most models use \ref apop_model_set_parameters. E.g.
\code
apop_model *std_normal = apop_model_set_parameters(apop_normal, 0, 1);
apop_data *a_thousand_normals = apop_model_draws(std_normal, 1000);
apop_model *poisson = apop_model_set_parameters(apop_poisson, 1.5);
apop_data *a_thousand_waits = apop_model_draws(poisson, 1000);
\endcode
\li You can use \ref apop_model_print to print the various elements to screen.
\li You can combine and transform models with functions such as \ref
apop_model_fix_params, \ref apop_model_coordinate_transform, or \ref
apop_model_mixture. Each of these functions produce a new model, which can be estimated,
re-combined, or otherwise used like any other model.
\code
//A helper function to check whether a data point is nonnegative
double over_zero(apop_data *in, apop_model *m){ return apop_data_get(in) > 0; }
//Generate a truncated Normal distribution by adding a data constraint:
apop_model *truncated_normal= apop_model_dconstrain(.base_model=apop_normal,
.constraint=over_zero);
//Get the cross product of that and a free Normal.
apop_model *cross = apop_model_cross(apop_normal, truncated_normal);
//Given assumed data, estimate the parameters of the cross product.
apop_model *xest = apop_estimate(assumed_data, cross);
//Assuming more data, use the fitted cross product as the prior for a Normal distribution.
apop_model *posterior = apop_update(moredata, .prior=xest, .likelihood=apop_normal);
//Assuming more data, use the posterior as the prior for another updating round.
apop_model *post2 = apop_update(moredata2, .prior=posterior, .likelihood=apop_normal);
\endcode
\li Writing your own models won't be covered in this introduction, but it can be easy to
copy and modify the procedures of an existing model to fit your needs. When in doubt, delete a procedure, because any procedures that are missing will have
defaults filled when used by functions like \ref apop_estimate (which uses \ref
apop_maximum_likelihood) or \ref apop_cdf (which uses integration via random draws). See \ref modeldetails for details.
\li There's a simple rule of thumb for remembering the order of the arguments to most of
Apophenia's functions, including \ref apop_estimate : the data always comes first.
<em> Settings</em>
How many bins are in a histogram? At what tolerance does the maximum likelihood
search end? What are the models being combined in an \ref apop_mixture distribution?
Apophenia organizes settings in <em>settings groups</em>, which are then attached
to models. In the following snippet demonstrating Bayesian updating, we specify a Beta distribution prior. If the
likelihood function were a Binomial distribution, \ref apop_update knows the closed-form
posterior for a Beta-Binomial pair, but in this case, with a PMF as a likelihood,
it will have to run Markov chain Monte Carlo. The \ref apop_mcmc_settings group attached
to the prior specifies details of how the run should work.
For a likelihood, we generate an empirical distribution---a PMF---from an input
data set, via <tt>apop_estimate(your_data, apop_pmf)</tt>.
When we call \ref apop_update on the last line, it already has all of the above info
on hand.
\code
apop_model *beta = apop_model_set_parameters(apop_beta, 0.5, 0.25);
Apop_settings_add_group(beta, apop_mcmc, .burnin = 0.2, .periods =1e5);
apop_model *my_pmf = apop_estimate(your_data, apop_pmf);
apop_model *posterior = apop_update(.prior= beta, .likelihood = my_pmf);
\endcode
<em> Databases and models</em>
Returning to the introductory example, you saw that (1) the
library expects you to keep your data in a database, pulling out the
data as needed, and (2) that the workflow is built around
\ref apop_model structures.
Starting with (2),
if a stats package has something called a <em>model</em>, then it is
probably of the form Y = [an additive function of <b>X</b>], such as \f$y = x_1 +
\log(x_2) + x_3^2\f$. Trying new models means trying different
functional forms for the right-hand side, such as including \f$x_1\f$ in
some cases and excluding it in others. Conversely, Apophenia is designed
to facilitate trying new models in the broader sense of switching out a
linear model for a hierarchical, or a Bayesian model for a simulation.
A formula syntax makes little sense over such a broad range of models.
As a result, the right-hand side is not part of
the \ref apop_model. Instead, the data is assumed to be correctly formatted, scaled, or logged
before being passed to the model. This is where part (1), the database,
comes in, because it provides a proxy for the sort of formula specification language above:
\code
apop_data *testme= apop_query_to_data("select y, x1, log(x2), pow(x3, 2) from data");
apop_model *est = apop_estimate(testme, apop_ols);
\endcode
Generating factors and dummies is also considered data prep, not model
internals. See \ref apop_data_to_dummies and \ref apop_data_to_factors.
Now that you have \c est, an estimated model, you can interrogate it. This is where Apophenia and its encapsulated
model objects shine, because you can do more than just admire the parameter estimates on
the screen: you can take your estimated data set and fill in or generate new data, use it
as an input to the parent distribution of a hierarchical model, et cetera. Some simple
examples:
\code
//If you have a new data set with missing elements (represented by NaN), you can fill in predicted values:
apop_predict(new_data_set, est);
apop_data_print(new_data_set);
//Fill a matrix with random draws.
apop_data *d = apop_model_draws(est, .count=1000);
//How does the AIC_c for this model compare to that of est2?
printf("ΔAIC_c=%g\n", apop_data_get(est->info, .rowname="AIC_c")
- apop_data_get(est2->info, .rowname="AIC_c"));
\endcode
\section gentle_end Conclusion
This introduction has shown you the \ref apop_data set and some of the functions
associated, which might be useful even if you aren't formally doing statistical
work but do have to deal with data with real-world elements like column names and
mixed numeric/text values. You've seen how Apophenia encapsulates many of a model's
characteristics into a single \ref apop_model object, which you can send with data to
functions like \ref apop_estimate, \ref apop_predict, or \ref apop_draw. Once you've
got your data in the right form, you can use this to simply estimate model parameters,
or as an input to later analysis.
What's next?
\li Check out the system for hypothesis testing, both with traditional known
distributions (using \ref apop_test for dealing with Normal-, \f$t\f$-,
\f$\chi^2\f$-distributed statistics); and for the parameters of any model; in \ref testpage.
\li Try your own hand at putting new models into the \ref apop_model framework,
as discussed in \ref modeldetails.
\li For example, have a look at <a href="http://modelingwithdata.org/arch/00000154.htm">this blog</a>
and its subsequent posts, which wrap a microsimulation into an \ref apop_model, so
that its parameters can be estimated and confidence intervals set around them.
\li See the \ref maxipage page for discussion of the many features the optimization
system has. It allows you to use a diverse set of search types on constrained or
unconstrained models.
\li Skim through <a href="http://apophenia.info/group__all__public.html">the full list
of macros and functions</a>---there are hundreds---to get a sense of what else
Apophenia offers.
*/
/** \page modeldetails Writing new models
The \ref apop_model is intended to provide a consistent expression of <em>any</em>
model that (implicitly or explicitly) expresses a likelihood of data given parameters,
including traditional linear models, textbook distributions, Bayesian hierarchies,
microsimulations, and any combination of the above. The unifying feature is that
all of the models act over some data space and some parameter space (in some cases
one or both is the empty set), and can assign a likelihood for a fixed pair of
parameters and data given the model. This is a very broad requirement, often used
in the statistical literature. For discussion of the theoretical structures, see <a
href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf"><em>A Useful Algebraic System
of Statistical Models</em></a> (PDF).
This page is about writing new models from scratch, beginning with basic models and on
up to models with arbitrary internal settings, specific methods of Bayesian updating
using your model as a prior or likelihood, and so on. I assume you have already read
\ref modelsec on using models and have tried a few things with the
canned models that come with Apophenia, so you already know how a user handles basic
estimation, adding a settings group, and so on.
This page includes:
\li \ref write_likelihoods of writing a new model from scratch.
\li \ref settingswriting, covering the writing of <em>ad hoc</em> structures to hold model- or method-specific details, like the number of periods for burning in an MCMC run or the number of bins in a histogram.
\li \ref vtables, covering the means of writing special-case routines for functions that are not part of the \ref apop_model itself, including the score or conjugate prior/likelihood pairs for \ref apop_update.
\li \ref modeldataparts, a detailed list of the requirements for the non-function elements of an \ref apop_model.
\li \ref methodsection, a detailed list of requirements for the function elements of an \ref apop_model.
\section write_likelihoods A walkthrough
Users are encouraged to always use models via the helper functions, like
\ref apop_estimate or \ref apop_cdf. The helper functions do some boilerplate error
checking, and call defaults as needed. For example, if your model has a \c log_likelihood
method but no \c p method, then \ref apop_p will use exp(\c log_likelihood). If you don't
give an \c estimate method, then \c apop_estimate will call \ref apop_maximum_likelihood.
So the game in writing a new model is to write just enough internal methods to give the helper functions what they need.
In the not-uncommon best case, all you need to do is write a log likelihood function or an RNG.
Here is how one would set up a model that could be estimated using maximum likelihood:
\li Write a likelihood function. Its header will look like this:
\code
long double new_log_likelihood(apop_data *data, apop_model *m);
\endcode
where \c data is the input data, and \c m is the parametrized model (i.e. your model
with a \c parameters element already filled in by the caller). This function will
return the value of the log likelihood function at the given parameters.
\li Write the object:
\code
apop_model *your_new_model = &(apop_model){"The Me distribution",
.vsize=n0, .msize1=n1, .msize2=n2, .dsize=nd,
.log_likelihood = new_log_likelihood };
\endcode
\li The first element is the <tt>.name</tt>, a human-language name for your model.
\li The \c vsize, \c msize1, and \c msize2 elements specify the shape of the parameter
set. For example, if there are three numbers in the vector, then set <tt>.vsize=3</tt>
and omit the matrix sizes. The default model prep routine will call
<tt>new_est->parameters = apop_data_alloc(vsize, msize1, msize2)</tt>.
\li The \c dsize element is the size of one random draw from your model.
\li It's common to have [the number of columns in your data set] parameters; this
count will be filled in if you specify \c -1 for \c vsize, <tt>msize(1|2)</tt>, or
<tt>dsize</tt>. If the allocation is exceptional in a different way, then you will
need to allocate parameters by writing a custom \c prep method for the model.
\li Is this a constrained optimization? Add a <tt>.constraint</tt> element for those too. See \ref constr for more.
You already have more than enough that something like this will work (the \c dsize is used for random draws):
\code
apop_model *estimated = apop_estimate(your_data, your_new_model);
\endcode
Once that baseline works, you can fill in other elements of the \ref apop_model as needed.
For example, if you are using a maximum likelihood method to estimate parameters, you can get much faster estimates and better covariance estimates by specifying the dlog likelihood function (aka the score):
\code
void apop_new_dlog_likelihood(apop_data *d, gsl_vector *gradient, apop_model *m){
//do algebra here to find df/dp0, df/dp1, df/dp2....
gsl_vector_set(gradient, 0, d_0);
gsl_vector_set(gradient, 1, d_1);
}
\endcode
The score is not part of the model object, but is registered (see below) using
\code
apop_score_insert(apop_new_dlog_likelihood, your_new_model);
\endcode
\subsection On Threading
Many procedures in Apophenia use OpenMP to thread operations, so assume your functions are running in a threaded environment. If a method can not be threaded, wrap it in an OpenMP critical region. E.g.,
\code
void apop_new_dlog_likelihood(apop_data *d, gsl_vector *gradient, apop_model *m){
#pragma omp critical (newdlog)
{
//un-threadable algebra here
}
gsl_vector_set(gradient, 0, d_0);
gsl_vector_set(gradient, 1, d_1);
}
\endcode
\section settingswriting Writing new settings groups
Your model may need additional settings or auxiliary information to function, which
would require associating a model-specific struct with the model.
A method associated with a model that uses such a struct usually begins with calls like
\code
long double ysg_ll(apop_data *d, apop_model *m){
ysg_settings *sets = apop_settings_get(m, ysg);
...
}
\endcode
These model-specific structs are handled as expected by \ref apop_model_copy and \ref
apop_model_free, and many functions that modify or transform an \ref apop_model try to
handle settings groups as expected. This section describes how to build a settings
group so all these automatic steps happen as expected, and your methods can reliably retrieve settings as needed.
But before getting into the detail of how to make model-specific groups of settings
work, note that there's a lightweight method of storing sundry settings, so in many
cases you can bypass all of the following.
The \ref apop_model structure has a \c void pointer named \c more which you can use to
point to a model-specific struct. If \c more_size is larger than zero (i.e. you set
it to <tt>your_model.more_size=sizeof(your_struct)</tt>), then it will be copied via \c
memcpy by \ref apop_model_copy, and freed by \ref apop_model_free. Apophenia's
routines will never impinge on this item, so do what you wish with it.
The remainder of this subsection describes the information you'll have to provide to make
use of the conveniences described to this point: initialization of defaults, smarter
copying and freeing, and adding to an arbitrarily long list of settings groups attached
to a model. You will need four items: a typedef for the structure itself, plus init, copy, and
free functions. This is the sort of boilerplate that will be familiar to users of
object-oriented languages in the style of C++ or Java, but it's really a list of
arbitrarily-typed elements, which makes this feel more like LISP. [And being a
reimplementation of an existing feature of LISP, this section will be macro-heavy.]
The settings struct will likely go into a header file, so here is a sample header
for a new settings group named \c ysg_settings, with a dataset, two sizes, and
a reference counter. <tt>ysg</tt> stands for Your Settings Group; replace that
substring with your preferred name in every instance to follow.
\code
typedef struct {
int size1, size2;
char *refs;
apop_data *dataset;
} ysg_settings;
Apop_settings_declarations(ysg)
\endcode
The first item is a familiar structure definition. The last line is a macro that declares the
init, copy, and free functions discussed below. This is everything you would
need in a header file, should you need one. These are just declarations; we'll write
the actual init/copy/free functions below.
The structure itself gets the full name, \c ysg_settings. Everything else is a macro
keyed on \c ysg, without the \c _settings part. Because of these macros, your struct
name must end in \c _settings.
If you have an especially simple structure, then you can generate the three functions with these three macros in your <tt>.c</tt> file:
\code
Apop_settings_init(ysg, )
Apop_settings_copy(ysg, )
Apop_settings_free(ysg, )
\endcode
These macros generate appropriate functions to do what you'd expect: allocating the
main structure, copying one struct to another, freeing the main structure.
The spaces after the commas indicate that in these cases no special code gets added to
the functions that these macros expand into.
You'll never call the generated functions directly; they are called by \ref Apop_settings_add_group,
\ref apop_model_free, and other model or settings-group handling functions.
Now that initializing/copying/freeing of
the structure itself is handled, the remainder of this section will be about how to
add instructions for the structure internals, like data that is pointed to by the structure elements.
\li For the allocate function, use the above form if everything in your code defaults to zero/\c NULL.
Otherwise, you will need a new line declaring a default for every element in your structure. There is a macro to help with this too.
These macros will define for your use a structure named \c in, and an output pointer-to-struct named \c out.
Continuing the above example:
\code
Apop_settings_init (ysg,
Apop_stopif(!in.size1, return NULL, 0, "I need you to give me a value for size1.");
Apop_varad_set(size2, 10);
Apop_varad_set(dataset, apop_data_alloc(out->size1, out->size2));
Apop_varad_set(refs, malloc(sizeof(int)));
*refs=1;
)
\endcode
Now, <tt>Apop_settings_add(a_model, ysg, .size1=100)</tt> would set up a group with
a 100-by-10 data set, and set the reference counter allocated and to one.
\li Some functions do extensive internal copying, so you will need a copy function even
if your code has no explicit calls to \ref apop_model_copy. The default above simply
copies every element in the structure. Pointers are copied, giving you two pointers
pointing to the same data. We have to be careful to prevent double-freeing later.
\code
//The elements of the set to copy are all copied by the function's boilerplate,
//and then make one additional modification:
Apop_settings_copy (ysg,
#pragma omp critical (ysg_refs)
(*refs)++;
)
\endcode
\li The struct itself is freed by boilerplate code, but add code in the free function
to free data pointed to by pointers in the main structure. The macro defines a
pointer-to-struct named \c in for your use. Continuing the example:
\code
Apop_settings_free (ysg,
#pragma omp critical (ysg_refs)
if (!(--in->refs)) {
free(in->dataset);
free(in->refs);
}
)
\endcode
With those three macros in place and the header as above, Apophenia will treat your
settings group like any other, and users can use \ref Apop_settings_add_group to
populate it and attach it to any model.
\section vtables Registering new methods in vtables
The settings groups are for adding arbitrary model-specific nouns; vtables are for
adding arbitrary model-specific verbs.
Many functions (e.g., entropy, the dlog likelihood, Bayesian updating) have
special cases for well-known models like the Normal distribution.
Any function may maintain a registry of models and associated special-case procedures, aka a vtable.
Lookups happen based on a hash that takes into account the elements of the model
that will be used in the calculation. For example, the \c apop_update_hash takes in two
models and calculates the hash based on the address of the prior's \c draw method and
the likelihood's \c log_likelihood or \c p method. Thus, a vtable lookup for new models
that re-use the same methods (at the same addresses in memory) will still find the
same special-case function.
If you need to deregister the function, use the associated deregister function,
e.g. <tt>apop_update_vtable_drop(apop_beta, apop_binomial)</tt>. You can guarantee
that a method will not be re-added by following up the <tt>_drop</tt> with, e.g.,
<tt>apop_update_vtable_add(NULL, apop_beta, apop_binomial)</tt>.
The steps for adding a function to an existing vtable:
\li See \ref apop_update, \ref apop_score, \ref apop_predict, \ref apop_model_print, and \ref
apop_parameter_model for examples and procedure-specific details.
\li Write a function following the given type definition, as listed in the function's documentation.
\li Use the associated <tt>_vtable_add</tt> function to add the function and associate it
with the given model. For example, to add a Beta-binomial routine named \c betabinom
to the registry of Bayesian updating routines, use <tt>apop_update_vtable_add(betabinom,
apop_beta, apop_binomial)</tt>.
\li Place a call to <tt>..._vtable_add</tt> in the \c prep method of the given model, thus ensuring that the auxiliary functions are registered after the first time the model is sent to \ref apop_estimate.
The easiest way to set up a new vtable is to copy/paste/modify an existing one. Briefly:
\li See the existing setups in the vtables portion of <tt>apop.h</tt>.
\li Cut/paste one and do a search and replace to change the name to match your desired use.
\li Set the typedef to describe the functions that get added to the vtable.
\li Rewrite the hash function to check the part of the inputs that interest you. For
example, the update vtable associates functions with the \c draw, \c log_likelihood,
and \p methods of the model. A model where these elements are identical
will still match even if other elements are different.
\section modeldataparts The data elements
The remainder of this section covers the detailed expectations regarding the elements
of the \ref apop_model structure. I begin with the data (non-function) elements,
and then cover the method (function) elements. Some of the following will be
requirements for all models and some will be advice to authors; I use the accepted
definitions of <a href="http://tools.ietf.org/html/rfc2119">"must", "shall", "may"</a>
and related words.
\subsection datasubsec Data
\li Each row of the \c data element is treated as a single observation by many functions.
For example, \ref apop_bootstrap_cov depends on each row being an iid observation to function correctly.
Calculating the Bayesian Information Criterion (BIC) requires knowing
the number of observations in the data, and assumes that row count==observation count.
For complex data, the \ref apop_data_pack and \ref apop_data_unpack functions can help with this.
\li Some functions (bootstrap again, or many uses of \ref apop_kl_divergence) use \ref
apop_draw to use your model's RNG (or a default) to draw a
value, write it to the matrix element of the data set, and then move on to an
estimation or other step. In this case, the data sent in will be entirely in the \c
->matrix element of the \ref apop_data set sent to model methods. Your \c likelihood, \c p, \c cdf, and \c estimate routines
must accept data as a single row of the matrix of the \ref apop_data set for such functions to work.
They may accept other formats. Tip: you can use \ref apop_data_pack and \ref apop_data_unpack to convert a structured set to a single row and back again.
\li Your routines may accept other data formats, as per contract with the user.
For example, regression-type functions use a function named \c ols_shuffle
to convert a matrix where the first column is the dependent variable to a data
set with dependent variable in the vector and a column of ones in the first
matrix column.
\subsection paramsubsec Parameters, vsize, msize1, msize2
\li The sizes will be used by the \c prep method of the model; see below. Given the model \c m and its elements \c m.vsize, \c m.msize1, \c m.msize2,
functions that need to allocate a parameter set will do so via <tt>apop_data_alloc(m.vsize, m.msize1, m.msize2)</tt>.
\subsection infosubsec Info
\li The first page, which should be named \c <info>, is typically a list of scalars. Nothing is guaranteed, but the elements may include:
\li AIC: <a href="https://en.wikipedia.org/wiki/Akaike's_Information_Criterion">Aikake Information Criterion</a>
\li AIC_c: AIC with a finite sample correction. ``<em>Generally, we advocate the use of AIC_c when the ratio \f$n/K\f$ is small (say \f$<\f$ 40)</em>'' [Kenneth P. Burnham, David R. Anderson: <em>Model Selection and Multi-Model Inference</em>, p 66, emphasis in original.]
\li BIC: <a href="https://en.wikipedia.org/wiki/Bayesian_information_criterion">Bayesian Information Criterion</a>
\li R squared
\li R squared adj
\li log likelihood
\li status [0=OK, nozero=other].
For those elements that require a count of input data, the calculations assume each row in the input \ref apop_data set is a single datum.
Get these via, e.g., <tt>apop_data_get(your_model->info, .rowname="log likelihood")</tt>.
When writing for any arbitrary function, be prepared to handle \c NaN, indicating that the element is not calculated or saved in the info page by the given model.
For OLS-type estimations, each row corresponds to the row in the original data. For
filling in of missing data, the elements may appear anywhere, so the row/col indices are
essential.
\subsection settingsgroupmention settings, more
In object-oriented jargon, settings groups are the private elements of the data set,
to be pulled out in certain contexts, and ignored in all others. Therefore, there are
no rules about internal use. The \c more element of the \ref apop_model provides a lightweight
means of attaching an arbitrary struct to a model. See \ref settingswriting above for details.
\li As many settings groups of different types as desired can be added to a single \ref apop_model.
\li One \ref apop_model can not hold two settings groups of the same type. Re-additions cause the removal of the previous version of the group.
\li If the \c more pointer points to a structure or value (let it be \c ss), then \c more_size must be set to <tt>sizeof(ss)</tt>.
\section methodsection Methods
\subsection psubsection p, log_likelihood
\li Function headers look like <tt>long double your_p_or_ll(apop_data *d, apop_model *params)</tt>.
\li The inputs are an \ref apop_data set and an \ref apop_model, which should include the elements needed to fully estimate the probability/likelihood (probably a filled <tt>->parameters</tt> element, possibly a settings group added by the user).
\li Assume that the parameters have been set, by users via \ref apop_estimate or \ref apop_model_set_parameters, or by \ref apop_maximum_likelihood by its search algorithms. If the parameters are necessary, the function shall check that the parameters are not \c NULL and set the model's \c error element to \c 'p' if they are missing.
\li Return \c NaN on errors. If an error in the input model is found, the function may set the input model's \c error element to an appropriate \c char value.
\li If your model includes both \c log_likelihood and \c p methods, it must be the case that <tt>log(p(d, m))</tt> equals <tt>log_likelihood(d, m)</tt> for all \c d and \c m. This implies that \c p must return a value \f$\geq 0\f$. Note that \ref apop_maximum_likelihood will accept functions where \c p returns a negative value, but diagonstics that depend on log likelihood like AIC will return NaN.
\li If observations are assumed to be iid, you may be able to use \ref apop_map_sum to write the core of the log likelihood function.
\subsection prepsubsection prep
\li Function header looks like <tt>void your_prep(apop_data *data, apop_model *params)</tt>.
\li Re-prepping a model after it has already been prepped shall have no effect. Where there is ambiguity with the other requirements, this takes precedence.
\li The model's <tt>data</tt> pointer shall be set to point to the input data.
\li The \c info element shall be allocated and its title set to <tt>\<Info\></tt>.
\li If \c vsize, \c msize1, or \c msize2 are -1, then the prep function shall set them to the width of the input data.
\li If \c dsize is -1, then the prep function shall set it to the width of the input data.
\li If the \c parameters element is not allocated, the function shall allocate it via <tt>apop_data_alloc(vsize, msize1, msize2)</tt> (or equivalent).
\li The default is \ref apop_model_clear. It does all of the above.
\li The input data may be modified by the prep routine. For example, the \ref apop_ols prep routine shuffles a single input matrix as described above under \c data, and the \ref apop_pmf prep routine calls \ref apop_data_pmf_compress on the input data.
\li The prep routine may initialize any desired settings groups. Unless otherwise
stated, these should not be removed if they are already there, so that users can override defaults by adding a settings group before starting an estimation.
\li If any functions associated with the model need to be added to
a vtable (see above), the registration shall happen here. Registration may also happen elsewhere.
\subsection estimatesubsection estimate
\li Function header looks like <tt> void your_estimate(apop_data *data, apop_model *params)</tt>.
It modifies the input model, and returns nothing. Note that this is different from the wrapper function, \ref apop_estimate, which makes a copy of its input model, preps it, and then calls the \c estimate function with the prepeped copy.
\li Assume that the prep routine has already been run. Notably, this means that parameters have been allocated.
\li Assume that the \c parameters hold garbage (as in a \c malloc without a subsequent assignment to the <tt>malloc</tt>-ed space).
\li The function shall set the \c parameters of the input model. For consistency with other models, the estimate should be the maximum likelihood estimate, unless otherwise documented.
\li Additional settings may be set.
\li The model's \c <Info> page may be filled with statistics, as discussed at infosubsec. For scalars like log likelihood and AIC, use \ref apop_data_add_named_elmt.
\li Data should not be modified by the \c estimate routine; any changes to the data made by \c estimate must be documented.
\li The default called by \ref apop_estimate is \ref apop_maximum_likelihood.
\li If errors occur during processing, set the model's \c error element to a single character. Documentation should include the list of error characters and their meaning.
\subsection drawsubsection draw
\li Function header looks like <tt>void your_draw(double *out, gsl_rng* r, apop_model *params)</tt>
\li Assume that model \c paramters are set, via \ref apop_estimate or \ref apop_model_set_parameters. The author of the draw method should check that \c parameters are not \c NULL if needed and fill the output with NaNs if necessary parameters are not set.
\li Caller inputs a pointer-to-<tt>double</tt> of length \c dsize; user is expected to make sure that there is adequate space. Caller also inputs a \c gsl_rng, already allocated (probably via \ref apop_rng_alloc, possibly from \ref apop_rng_get_thread).
\li The function shall fill the space pointed to by the input pointer with a random draw from the data space, where the likelihood of any given observation is proportional to its likelihood as given by the \c p method. Data shall be reduced to a single vector via \ref apop_data_pack if it is not already a single vector.
\subsection cdfsubsection cdf
\li Function header looks like <tt>long double your_cdf(apop_data *d, apop_model *params)</tt>.
\li Assume that \c parameters are set, via \ref apop_estimate or \ref apop_model_set_parameters. The author of the CDF method should check that \c parameters are not \c NULL and return NaN if necessary parameters are not set.
\li The CDF method must accept data as a single row of data in the \c matrix of the input \ref apop_data set (as per a draw produced using the \c draw method). May accept other formats.
\li Returns the percentage of the likelihood function \f$\leq\f$ the first row of the input data. The definition of \f$\leq\f$ is chosen by the model author.
\li If one is not already present, an \c apop_cdf_settings group may be added to the model to store temp data. See the \ref apop_cdf function for details.
\subsection constraintsubsection constraint
\li Function header looks like <tt>long double your_constraint(apop_data *data, apop_model *params)</tt>.
\li Assume that \c parameters are set, via \ref apop_estimate, \ref apop_model_set_parameters, or the internals of an MLE search. The author of the constraint method should check that \c parameters are not \c NULL and return NaN if necessary parameters are not set.
\li See \ref apop_linear_constraint for a useful basis and/or example. Many constraints can be written as wrappers for this function.
\li If the constraint is met, then return zero.
\li If the constraint fails, then (1) move the \c parameters in the input model to a
constraint-satisfying value, and (2) return the distance between the input parameters and
what you've moved the parameters to. The choice of within-bounds parameters and distance function is left to the author of the constraint function.
*/
/**\defgroup models
This section is a detailed description of the stock models that ship with Apophenia.
It is a reference. For an explanation of what to do with an \ref apop_model, see \ref modelsec.
The primary questions one has about a model in practice are what format the input data should take and what to expect of an estimated output.
Generally, the input data consists of an \ref apop_data set where each row is a single observation. Details beyond that are listed below.
The output after running \ref apop_estimate to produce a fitted model are generally
found in three places: the vector of the output parameter set, its matrix, or a new
settings group. The basic intuition is that
if the parameters are always a short list of scalars, they are in the vector;
if there exists a situation where they could take matrix form, the parameters will be in the matrix;
if they require more structure than that, they will be a settings group.
If the basic structure of the \ref apop_data set is unfamiliar to you, see \ref
dataoverview, which will discuss the basic means of getting data out of a struct. For
example, the estimated \ref apop_normal distribution has the mean in position zero of
the vector and the standard deviation in position one, so they could be extracted as follows:
\code
apop_data *d = apop_text_to_data("sample data from before")
apop_model *out = apop_estimate(d, apop_normal);
double mu = apop_data_get(out>parameters, 0);
double sigma = apop_data_get(out>parameters, 1);
//What is the p-value of test whose null hypothesis is that μ=3.3?
printf ("pval=%g\n", apop_test(3.3, "normal", mu, sigma);
\endcode
See \ref modelsec for discussion of how to pull settings groups using \ref
Apop_settings_get (for one item) or \ref apop_settings_get_group (for a full settings
group).
*/
|