File: documentation.h

/* Apophenia's narrative documentation
Copyright (c) 2005--2013 by Ben Klemens.  Licensed under the GPLv2; see COPYING.  */

/**  \mainpage Welcome

Apophenia is an open statistical library for working with data sets and statistical
models. It provides functions on the same level as those of the typical stats package
(such as OLS, Probit, or singular value decomposition) but gives the user more
flexibility to be creative in model-building.  The core functions are written in C,
but experience has shown them to be easy to bind to in Python/Julia/Perl/Ruby/&c.

It is written to scale well, to comfortably work with gigabyte data sets, million-step simulations, or
computationally-intensive agent-based models.

<em>The goods</em> 

The library has been growing and improving since 2005, and has been downloaded well over 1e4 times. To date, it has over two hundred functions and macros to facilitate statistical computing, such as:

\li OLS and family, discrete choice models like Probit and Logit, kernel density estimators, and other common models.
\li Functions for transforming models (like Normal \f$\rightarrow\f$ truncated Normal)
    and combining models (produce the cross-product of that truncated Normal with three others,
    or use Bayesian updating to combine that cross-product prior with an OLS likelihood
    to produce a posterior distribution over the OLS parameters).
\li Database querying and maintenance utilities.
\li Data manipulation tools for splitting, stacking, sorting, and otherwise shunting data sets.
\li Moments, percentiles, and other basic stats utilities.
\li \f$t\f$-tests, \f$F\f$-tests, et cetera.
\li Several optimization methods available for your own new models.
\li It does <em>not</em> re-implement basic matrix operations or build yet another database
engine. Instead, it builds upon the excellent <a href="http://www.gnu.org/software/gsl/">GNU
Scientific</a> and <a href="http://www.sqlite.org/">SQLite</a> libraries. MySQL/mariaDB is also supported.

For the full list of macros, functions, and prebuilt models, check the <a href="group__all__public.html">index</a>.

<em><a href="https://github.com/b-k/apophenia/archive/pkg.zip">Download Apophenia here</a>.</em>

Most users will want to download the latest packaged version linked from the <a
href="https://github.com/b-k/apophenia/archive/pkg.zip">Download
Apophenia here</a> header. See the \ref setup page for detailed setup instructions,
including how to use your package manager to install the Debian package.

<em>The documentation</em>

To start off, have a look at this \ref gentle "Gentle Introduction" to the library.

<a href="outline.html">The outline</a> gives a more detailed narrative.

The <a href="group__all__public.html">index</a> lists every function in the
library, with detailed reference
information. Notice that the header to every page has a link to the outline and the index.

To really go in depth, download or pick up a copy of <a
href="http://modelingwithdata.org">Modeling with Data</a>, which discusses general
methods for doing statistics in C with the GSL and SQLite, as well as Apophenia
itself. <a href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf"><em>A Useful
Algebraic System of Statistical Models</em></a> (PDF) discusses some of the theoretical
structures underlying the library.

There is a <a href="https://github.com/b-k/apophenia/wiki">wiki</a> with some convenience
functions, tips, and so on.

<em>Notable features</em>

Much of what Apophenia does can be done in any typical statistics package. The \ref
apop_data element is much like an R data frame, for example, and there is nothing special
about being able to invert a matrix or take the product of two matrices with a single
function call (\ref apop_matrix_inverse and \ref apop_dot, respectively). 
Even more advanced features like Loess smoothing (\ref apop_loess) and the Fisher Exact
Test (\ref apop_test_fisher_exact) are not especially Apophenia-specific. But here are
some things that are noteworthy.

  \li It's a C library! You can build applications using Apophenia for the data-processing
back-end of your program, and not worry about the overhead associated with scripting
languages. For example, it is currently used in production for certain aspects of
processing for the U.S. Census Bureau's American Community Survey. And the numeric
routines in your favorite scripting language typically have a back-end in plain C;
perhaps Apophenia can facilitate writing your next one.
  \li The \ref apop_model object allows for consistent treatment of distributions, regressions,
simulations, machine learning models, and who knows what other sorts of models you can
dream up. By transforming and combining existing models, it is easy to build complex
models from simple sub-models.
  \li For example, the \ref apop_update function does Bayesian updating on any two
well-formed models. If they are on the table of conjugates, that is correctly
handled, and if they are not, an appropriate variant of MCMC produces an empirical
distribution. The output is yet another model, from which you can make random draws,
or which you can use as a prior for another round of Bayesian updating. Outside of
Bayesian updating, the \ref apop_model_metropolis function is good for approximating
other complex models.
  \li The maximum likelihood system combines several subsystems into one
form: it will do a few flavors of conjugate gradient search, Nelder-Mead Simplex,
Newton's Method, or Simulated Annealing. You pick the method by a setting attached to
your model. If you want to use a method that requires derivatives and you don't have
a closed-form derivative, the ML subsystem will estimate a numerical gradient for
you. If you would like to do EM-style maximization (all but the first parameter are
fixed, that parameter is optimized, then all but the second parameter are fixed, that
parameter is optimized, ..., looping through dimensions until the change in objective
across cycles is less than <tt>eps</tt>), add a settings group specifying the
tolerance at which the cycle should stop: <tt>Apop_settings_add_group(your_model,
apop_mle, .dim_cycle_tolerance=eps)</tt>.
  \li The Iterative Proportional Fitting algorithm, \ref apop_rake, is best-in-breed,
designed to handle large, sparse matrices.



<em>Contribute!</em> 

\li Develop a new model object.
\li Contribute your favorite statistical routine.
\li Package Apophenia as an RPM, a Portage ebuild, or a Cygwin package.
\li Report bugs or suggest features.
\li Write bindings for your preferred language. For example, here are 
a <a href="https://github.com/swuecho/apophenia-perl">Perl wrapper</a> and early versions
of <a href="http://modelingwithdata.org/arch/00000173.htm"> a Julia wrapper</a> and <a
href="https://github.com/b-k/Rapophenia/">an R wrapper</a> which you could expand upon.

If you're interested,  <a href="mailto:fluffmail@f-m.fm">write to the maintainer</a> (Ben Klemens), or join the
<a href="https://github.com/b-k/apophenia">GitHub</a> project.
*/

/** \page eg Some examples
Here are a few pieces of sample code for testing your installation or to give you a
sense of what code with Apophenia's tools looks like.

<em> Two data streams</em>

The sample program here is intended to show how one would integrate Apophenia into an
existing program. For example, say that you are running a simulation of two different
treatments, or say that two sensors are posting data at regular intervals. The goal is to
gather the data in an organized form, and then ask questions of the resulting data set.
Below, a thousand draws are made from the two processes and put into a database. Then,
the data is pulled out, some simple statistics are compiled, and the data is written to
a text file for inspection outside of the program.  This program will compile cleanly
with the sample \ref makefile.

\include draw_to_db.c

<em> Run a regression</em>

See \ref gentle for an example of loading a data set and running a simple regression.

<em> A sequence of t-tests</em>

In \ref mapply "the section on map/apply", a program runs a new \f$t\f$-test on every row, with all
operations acting on entire rows rather than individual data points:

\include t_test_by_rows.c

The documentation for \ref apop_query_to_text includes a program to list all the tables in an SQLite database.
\include ls_tables.c

<em>Marginal distribution</em>

A demonstration of fixing parameters to create a marginal distribution, via \ref apop_model_fix_params:
\include fix_params.c
*/


/** \page setup Setting up
\section cast The supporting cast 
To use Apophenia, you will need to have a working C compiler, the GSL (v1.7 or higher), and SQLite installed. MySQL/MariaDB is optional.

\li Some readers may be unfamiliar with modern package managers and common methods for setting up a C development environment; see 
<a href="http://modelingwithdata.org/appendix_o.html">Appendix O</a> of <em> Modeling with Data</em> for an introduction.

\li Other pages in this documentation have a few more notes for \ref windows "Windows" users, including \ref mingw users.

\li Install the basics using your package manager. E.g., try

\code
sudo apt-get install make gcc libgsl0-dev libsqlite3-dev
\endcode

or

\code
sudo yum install make gcc gsl-devel libsqlite3x-devel
\endcode

\li If you use a Debian-based system (including Ubuntu), try the version in Debian's Stretch distribution:

\code
#Add Stretch to your sources list
sudo sed -i '$a
deb http://httpredir.debian.org/debian stretch main
deb-src http://httpredir.debian.org/debian stretch main' /etc/apt/sources.list
sudo apt-get update

#Get Apophenia
sudo apt-get install apophenia-bin

#Optional: remove Stretch from your sources list.
sudo sed -i -e '$d' -e '$d' /etc/apt/sources.list
\endcode

Thanks to Jerome Benoit for getting the library packaged and adjusted to Debian standards.

\li If the Debian package is not for you, <a href="https://github.com/b-k/apophenia/archive/pkg.zip">Download Apophenia here</a>.
Once you have the library downloaded, compile it using

\code
tar xvzf apop*tgz && cd apophenia-0.999
./configure && make && make check && sudo make install
\endcode

If you decide not to keep the library on your system, run <tt>sudo make uninstall</tt>
from the source directory to remove it.

\li If you need to install packages in your home directory because you don't have root
permissions, see the \ref notroot page.

\li A \ref makefile will help immensely when you want to compile your program.

\li You can verify that your setup works by trying some \ref eg "sample programs".

\subpage notroot

\subpage makefile

\subpage windows


\li Those who would like to work on a cutting-edge copy of the source code
can get the latest version by cutting and pasting the following onto
the command line. If you follow this route, be sure to read the development README in the
<tt>apophenia</tt> directory this command will create.

\code
git clone https://github.com/b-k/apophenia.git
\endcode


*/

/** \page windows Windows

\ref mingw users, see that page.

If you have a choice, <a href="http://www.cygwin.com">Cygwin</a> is strongly recommended. The setup program is
very self-explanatory. As a warning, it will probably take up &gt;300MB on
your system. You should install at least the following programs:
\li autoconf/automake
\li binutils
\li gcc
\li gdb
\li gnuplot -- for plotting data
\li groff -- needed for the man program, below
\li gsl -- the engine that powers Apophenia
\li less -- to read text files
\li libtool -- needed for compiling programs
\li make
\li man -- for reading help files
\li more -- not as good as less but still good to have
\li sqlite3 -- a simple database engine, a requisite for Apophenia

If you are missing anything else, the program will probably tell you.
The following are not necessary but are good to have on hand as long as you are going to be using Unix and programming.

\li git -- to partake in the versioning system
\li emacs -- steep learning curve, but people love it
\li ghostscript (for reading .ps/.pdf files)
\li openssh -- needed for git
\li perl, python, ruby -- these are other languages that you might also be interested in
\li tetex -- write up your documentation using the nicest-looking formatter around
\li X11 -- a windowing system

X-Window will give you a nicer environment in which to work.  After you start Cygwin, type <tt>startx</tt> to bring up a more usable, nice-looking terminal (and the ability to do a few thousand other things which are beyond the scope of this documentation).  Once you have Cygwin installed and a good terminal running, you can follow along with the remainder of the discussion without modification.

Some older versions of Cygwin have a \c search.h file which doesn't include the function <tt>lsearch()</tt>.  If this is the case on your system, you will have to update your Cygwin installation.

Finally, Windows compilers often spit out lines like:
\code
Info: resolving _gsl_rng_taus by linking to __imp__gsl_rng_taus (auto-import)
\endcode
These lines are indeed just information, and not errors. Feel free to ignore them.

[Thanks to Andrew Felton and Derrick Higgins for their Cygwin debugging efforts.]

\subpage mingw
*/

/** \page notroot  Not root? 
If you aren't root, then the common procedure for installing a library is to create a subdirectory in your home directory in which to install packages. The key is the <tt>--prefix</tt> addition to the <tt>./configure</tt> command.
\code
export MY_LIBS=myroot   #choose a directory name to be created in your home directory.
mkdir $HOME/$MY_LIBS

# From Apophenia's package directory:
./configure --prefix $HOME/$MY_LIBS
make
make install   #Now you don't have to be root.

# Adjust your paths so the compiler and the OS can find the library.
# These are environment variables, and they are usually set in the 
# shell's startup files. I assume you are using bash here.

echo "export PATH=$HOME/$MY_LIBS/include:\$PATH" >> ~/.bashrc
echo "export CPATH=$HOME/$MY_LIBS/include:\$CPATH" >> ~/.bashrc
echo "export LIBRARY_PATH=$HOME/$MY_LIBS:\$LIBRARY_PATH" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=$HOME/$MY_LIBS:\$LD_LIBRARY_PATH" >> ~/.bashrc
\endcode

Once you have created this local root directory, you can use it to install as many new libraries as desired, and your paths will already be set up to find them.
*/


/** \page makefile Makefile
 
Instead of giving lengthy compiler commands at the command prompt, you can use a Makefile to do most of the work. How to:
\li Copy and paste the following into a file named \c makefile.
\li Change the first line to the name of your program (e.g., if you have written <tt>census.c</tt>, then the first line will read <tt>PROGNAME=census</tt>). 
\li If your program has multiple <tt>.c</tt> files, add a corresponding <tt>.o</tt> to the currently blank <tt>objects</tt> variable, e.g. <tt>objects=sample2.o sample3.o</tt>
\li Once you have a Makefile in the directory, simply type <tt>make</tt> at the command prompt to generate the executable.

 \code
PROGNAME = your_program_name_here
objects =
CFLAGS = -g -Wall -O3
LDLIBS = -lapophenia -lgsl -lgslcblas -lsqlite3

$(PROGNAME): $(objects)
\endcode

\li If your system has \c pkg-config, then you can use it for a slightly more robust and readable makefile. Replace the above C and link flags with:
\code
CFLAGS = -g -Wall `pkg-config --cflags apophenia` -O3
LDLIBS = `pkg-config --libs apophenia`
\endcode
The \c pkg-config program will then fill in the appropriate directories and libraries. Pkg-config knows Apophenia depends on the GSL and database libraries, so you need only list the most-dependent library.

\li The <tt>-O3</tt> flag is optional, asking the compiler to run its highest level of optimization (for speed).

\li GCC users may need the <tt>--std=gnu99</tt> or <tt>--std=gnu11</tt> flag to use post-1989 C standards.

\li Order matters in the linking list: the files a package depends on should be listed after the package. E.g., since sample.c depends on Apophenia, <tt>gcc sample.c -lapophenia</tt> will work, while <tt>gcc -lapophenia sample.c</tt> is likely to give you errors. Similarly, list <tt>-lapophenia</tt> before <tt>-lgsl</tt>, which comes before <tt>-lgslcblas</tt>.

*/

/** \page designated Designated initializers

Functions so marked in this documentation use standard C designated initializers and compound literals to allow you to omit, call by name, or change the order of inputs. The following examples are all equivalent.

The standard format:
\code
apop_text_to_db("infile.txt", "intable", 0, 1, NULL);
\endcode

Omitted arguments are left at their default values:
\code
apop_text_to_db("infile.txt", "intable");
\endcode

You can use the variable's name, if you forget its ordering:
\code
apop_text_to_db("infile.txt", "intable", .has_col_name=1, .has_row_name=0);
\endcode

If an un-named element follows a named element, then that value is given to the next variable in the standard ordering:
\code
apop_text_to_db("infile.txt", "intable", .has_col_name=1, NULL);
\endcode

\li There may be cases where you cannot use this form (it relies on a macro, which
may not be available). You can always call the underlying function directly, by adding
\c _base to the name and giving all arguments:

\code
apop_text_to_db_base("infile.txt", "intable", 0, 1, NULL);
\endcode

\li If one of the optional elements is an RNG and you do not provide one, I use one
from \ref apop_rng_get_thread.
*/


/** \page preliminaries Getting started

If you are entirely new to Apophenia, \ref gentle "have a look at the Gentle Introduction here".

As well as the information in this outline, there is a separate page covering the details of 
 \ref setup "setting up a computing environment" and another page with \ref eg "some sample code" for your perusal.

\subpage gentle

\subpage setup

\subpage eg

\subpage refstatusext
*/

/** \page refstatusext References and extensions

\section mwd The book version

Apophenia co-evolved with <em>Modeling with Data: Tools and Techniques for Statistical Computing</em>. You can read about the book, or download a free PDF copy of the full text, at <a href="http://modelingwithdata.org">modelingwithdata.org</a>.

As with many computer programs, the preferred manner of citing Apophenia is to cite its related book.
Here is a BibTeX-formatted entry giving the relevant information:

\code 
@book{klemens:modeling,
    title = "Modeling with Data: Tools and Techniques for Statistical Computing",
    author="Ben Klemens",
    year=2008,
    publisher="Princeton University Press"
}
\endcode

The rationale for the \ref apop_model struct, based on an algebraic system of models, is detailed in a <a href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf">U.S. Census Bureau research report</a>:

\code 
@techreport{klemens:algebra,
    title = "A Useful Algebraic System of Statistical Models",
    author="Ben Klemens",
    month=jul,
    year=2014,
    institution="U.S.\ Census Bureau",
    number="06"
}
\endcode

\section ext How do I write extensions?

The system is written so that adding a new model or other component requires no
registration or initialization step: just write your code and <tt>include</tt> it like any
other C code. A new \ref apop_model has to conform to some rules if it is to play well with
\ref apop_estimate, \ref apop_draw, and so forth. See the notes at \ref modeldetails.
Once your new model or function is working, please post the code or a link to the code
on the <a href="https://github.com/b-k/apophenia/wiki">Apophenia wiki</a>.

\subpage c

\section links Further references

For your convenience, here are links to some other libraries you are probably using.

\li <a href="http://www.gnu.org/software/libc/manual/html_node/index.html">The standard C library</a>
\li <a href="http://www.gnu.org/software/gsl/manual/html_node/index.html">The
GSL documentation</a>, and <a href="http://www.gnu.org/software/gsl/manual/html_node/Function-Index.html">its index</a>
\li <a href="http://sqlite.org/lang.html">SQL understood by SQLite</a>


*/

/** \page outline An outline of the library

The narrative in this section goes into greater detail on how to use the 
components of Apophenia. You are encouraged to read \ref gentle first.

This overview begins with the \ref apop_data set, which is the central data structure
used throughout the system. Section \ref dbs covers the use of the database interface,
because there are a lot of things that a database will do better than a matrix
structure like the \ref apop_data struct.

Section \ref modelsec covers statistical models, in the form of the \ref apop_model structure.
This part of the system is built upon the \ref apop_data set to hold parameters, statistics, data sets, and so on.

\ref Histosec covers probability mass functions, which are statistical models
built directly around a data set, where the chance of drawing a given observation is
proportional to how often that observation appears in the source data. There are many
situations where one would want to treat a data set as a probability distribution,
such as using \ref apop_kl_divergence to find the information loss from an observed
data set to a theoretical model fit to that data.

Section \ref testpage covers traditional hypothesis testing, beginning with common
statistics that take an \ref apop_data set or two as input, and continuing on to
generalized hypothesis testing for any \ref apop_model.

Because estimation in the \ref apop_model relies heavily on maximum likelihood
estimation, Apophenia's optimizer subsystem is extensive.  \ref maxipage offers
some additional notes on optimization and how it can be used in non-statistical contexts.

\subpage dataoverview

\subpage dbs

\subpage modelsec

\subpage Histosec

\subpage testpage

\subpage maxipage

\subpage moreasst
*/

/** \page c C, SQL and coding utilities
 
<em>Learning C</em>

<a href="http://modelingwithdata.org">Modeling with Data</a> has a full tutorial for C, oriented at users of standard stats packages. More nuts-and-bolts tutorials are <a href="http://www.google.com/search?hl=es&amp;c2coff=1&amp;q=c+tutorial">in abundance</a>.  Some people find pointers to be especially difficult; fortunately, there's a <a href="http://www.youtube.com/watch?v=6pmWojisM_E">claymation cartoon</a> which clarifies everything.

<em>Header aggregation</em>

There is only one header. Put
\code
#include <apop.h>
\endcode
at the top of your file, and you're done. Everything declared in that file starts with \c apop_ or \c Apop_. It also includes 
\c assert.h, \c math.h, \c signal.h, and \c string.h.

<em>Linking</em>

You will need to link to the Apophenia library, which involves adding the <tt>-lapophenia</tt> flag to your compiler. Apophenia depends on SQLite3 and the GNU Scientific Library (which depends on a BLAS), so you will probably need something like:

\code
gcc sample.c -lapophenia -lsqlite3 -lgsl -lgslcblas -o run_me -g -Wall -O3
\endcode

Your best bet is to encapsulate this mess in a \ref makefile "Makefile". Even if you are using an IDE and its command-line management tools, see the Makefile page for notes on useful flags.

<em>Standards compliance</em>

To the best of our abilities, Apophenia complies with the C standard (ISO/IEC
9899:2011). As well as relying on the GSL and SQLite, it uses some POSIX function calls,
such as \c strcasecmp and \c popen.

\subpage designated

\section debugging  Errors, logging, debugging and stopping

<em>The \c error element</em> 

The \ref apop_data set and the \ref apop_model both include an element named \c error. It is normally \c 0, indicating no (known) error. 

For example, \ref apop_data_copy detects allocation errors and some circular links
(when <tt>Data->more == Data</tt>) and fails in those cases. You could thus use the
function with a form like

\code
apop_data *d = apop_text_to_data("indata");
apop_data *cp = apop_data_copy(d);
if (cp->error) {printf("Couldn't copy the input data; failing.\n"); return 1;}
\endcode

There is sometimes (but not always) benefit to handling specific error codes, which are listed in the documentation of those functions that set the \c error element. E.g.,

\code
apop_data *d = apop_text_to_data("indata");
apop_data *cp = apop_data_copy(d);
if (cp->error == 'a') {printf("Couldn't allocate space for the copy; failing.\n"); return 1;}
if (cp->error == 'c') {printf("Circular link in the data set; failing.\n"); return 2;}
\endcode

The end of <a href="http://modelingwithdata.org/appendix_o.html">Appendix O</a>
of <em>Modeling with Data</em> offers some GDB macros which can make dealing with
Apophenia from the GDB command line much more pleasant. As discussed below, it
also helps to set <tt>apop_opts.stop_on_warning='v'</tt> or <tt>'w'</tt> when running
under the debugger.


\section verbsec Verbosity level and logging

The global variable <tt>apop_opts.verbose</tt> determines how many notifications and warnings get printed by Apophenia's warning mechanism:

-1: turn off logging, print nothing (ill-advised) <br>
0: notify only of failures and clear danger <br>
1: warn of technically correct but odd situations that might indicate, e.g., numeric instability <br>
2: debugging-type information; print queries  <br>
3: give me everything, such as the state of the data at each iteration of a loop.

These levels are of course subjective, but should give you some idea of where to place the
verbosity level. The default is 1.
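For example, to see the queries a program sends while developing, you might temporarily raise the level, then restore it:

```c
apop_opts.verbose = 2;  //also print queries sent to the database
//... run the code under scrutiny ...
apop_opts.verbose = 1;  //restore the default
```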

The messages are printed to the \c FILE* handle at <tt>apop_opts.log_file</tt>. If
this is unset (as it is at startup), it is set to \c stderr, the
typical behavior for a console program. Use

\code
apop_opts.log_file = fopen("mylog", "w");
\endcode

to write to the \c mylog file instead of \c stderr.

As well as the error and warning messages, some functions can also print diagnostics,
using the \ref Apop_notify macro.  For example, \ref apop_query and friends will print the
query sent to the database engine iff <tt>apop_opts.verbose >=2</tt> (which is useful
when building complex queries). The diagnostics attempt to follow
the same verbosity scale as the warning messages.

\section stopsec Stopping

By default, warnings and errors never halt processing. It is up to the calling function
to decide whether to stop.

When running the program under a debugger, this is an annoyance: we want to stop as
soon as a problem turns up.

The global variable <tt>apop_opts.stop_on_warning</tt> changes when the system halts:

\c 'n': never halt. If you were using Apophenia to support a user-friendly GUI, for example, you would use this mode.<br>
\c '\0' (the default): halt on severe errors, continue on all warnings.<br>
\c 'v': If the verbosity level of the warning is such that the warning would print to screen, then halt;
if the warning message would be filtered out by your verbosity level, continue.<br>
\c 'w': Halt on all errors or warnings, including those below your verbosity threshold.

See the documentation for individual functions for details on how each reports errors to the caller and the level at which warnings are posted.
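For example, using the values above, a debugging run might begin by setting:

```c
apop_opts.stop_on_warning = 'v';  //halt on any warning that would print at this verbosity
```

Under GDB, you can also flip this at run time without recompiling, via <tt>set var apop_opts.stop_on_warning='v'</tt>.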

\section Legi Legible output

The output routines handle four sinks for your output. There is a global variable that
you can use for small projects where all data will go to the same place.

\code 
apop_opts.output_type = 'f'; //named file
apop_opts.output_type = 'p'; //a pipe or already-opened file
apop_opts.output_type = 'd'; //the database
\endcode

You can also set the output type, the name of the output file or table, and other options
via arguments to individual calls to output functions. See \ref apop_prep_output for the list of options.

C makes minimal distinction between pipes and files, so you can set a
pipe or file as output and send all output there until further notice:

\code
apop_opts.output_type = 'p';
apop_opts.output_pipe = popen("gnuplot", "w");
apop_plot_lattice(...); //see https://github.com/b-k/Apophenia/wiki/gnuplot_snippets
fclose(apop_opts.output_pipe);
apop_opts.output_pipe = fopen("newfile", "w");
apop_data_print(set1);
fprintf(apop_opts.output_pipe, "\nNow set 2:\n");
apop_data_print(set2);
\endcode

Continuing the example, you can always override the global data with
a specific request:
\code
apop_vector_print(v, "vectorfile"); //put vectors in a separate file
apop_matrix_print(m, "matrix_table", .output_type = 'd'); //write to the db
apop_matrix_print(m, .output_pipe = stdout);  //now show the same matrix on screen
\endcode

I will first look to the input file name, then the input pipe, then the
global \c output_pipe, in that order, to determine to where I should
write.  Some combinations (like output type = \c 'd' and only a pipe) don't
make sense, and I'll try to warn you about those. 

What if you have too much output and would like to use a pager, like \c less or \c more?
In C and POSIX terminology, you're asking to pipe your output to a paging program. Here is
the form:
\code
FILE *lesspipe = popen("less", "w");
assert(lesspipe);
apop_data_print(your_data_set, .output_pipe=lesspipe);
pclose(lesspipe);
\endcode
\c popen will search your usual program path for \c less, so you don't have to give a full path.

\li\ref apop_data_print  
\li\ref apop_matrix_print
\li\ref apop_vector_print

\section sqlsec About SQL, the syntax for querying databases

For a reference, your best bet is the <a href="http://www.sqlite.org/lang.html">Structured Query Language reference</a> for SQLite.  For a tutorial, there is an abundance of <a href="http://www.google.com/search?q=sql+tutorial">tutorials online</a>.  Here is a nice blog <a href="http://fluff.info/blog/arch/00000118.htm">entry</a> about the complementarities between SQL and matrix manipulation packages.

Apophenia currently supports two database engines: SQLite and mySQL/mariaDB. SQLite is the default, because it is simpler and generally more easygoing than mySQL, and supports in-memory databases.

The global <tt>apop_opts.db_engine</tt> is initially \c NULL, indicating no preference
for a database engine. You can explicitly set it:

\code
apop_opts.db_engine='s'; //use SQLite
apop_opts.db_engine='m'; //use mySQL/mariaDB
\endcode

If \c apop_opts.db_engine is still \c NULL on your first database operation, then I will check
for an environment variable <tt>APOP_DB_ENGINE</tt>, and set 
<tt>apop_opts.db_engine='m'</tt> if it is found and matches (case insensitive) \c mariadb or \c mysql.

\code
export APOP_DB_ENGINE=mariadb
apop_text_to_db indata mtab db_for_maria

unset APOP_DB_ENGINE
apop_text_to_db indata stab db_for_sqlite.db
\endcode


Write \ref apop_data sets to the database using \ref apop_data_print, with <tt>.output_type='d'</tt>.

\li Column names are inserted if present; any dots in them are converted
    to underscores.  Otherwise, the columns are named \c c1, \c c2, \c c3, &c.
\li If \ref apop_opts_type "apop_opts.db_name_column" is not blank (the default is
    <tt>"row_name"</tt>), then a so-named column is created, and the row names are placed there.
\li If there are weights, they will be the last column of the table, and the column will be named \c weights.
\li  If the table does not exist, it is created. Use \ref apop_data_print <tt>(data, "tabname", .output_type='d', .output_append='w')</tt>
    to overwrite an existing table, or <tt>.output_append='a'</tt> to append.
    Appending is the default. Or,
    call \ref apop_table_exists <tt>("tabname", 'd')</tt> to ensure that the table is
    removed ahead of time.

\li If your data set has zero data (i.e., is just a list of column names or is entirely
    blank), \ref apop_data_print returns without creating anything in the database.
\li Especially if you are using a pre-2007 version of SQLite, there may be a speed
    gain to wrapping the call to this function in a begin/commit pair:

\code
apop_query("begin;");
apop_data_print(dataset, .output_name="dbtab", .output_type='d');
apop_query("commit;");
\endcode


Finally, Apophenia provides a few nonstandard SQL functions to facilitate math via database; see \ref db_moments.


\section threads Threading

Apophenia uses OpenMP for threading. You generally do not need to know how OpenMP works
to use Apophenia, and many points of work will thread without your doing anything.

\li All functions strive to be thread-safe. Part of how this is achieved is that static
variables are marked as thread-local or atomic, as per the C standard. There still
exist compilers that can't implement thread-local or atomic variables, in which case
your safest bet is to set OMP's thread count to one as below (or get a new compiler).

\li Some functions modify their inputs. It is up to you to use those functions in
a thread-safe manner. The \ref apop_matrix_realloc handles states and global variables
correctly in a threaded environment, but if you have two threads resizing the same \c
gsl_matrix at the same time, you're going to have problems.

\li A few compilers don't support OpenMP; when compiling on such a system, all work will be single-threaded.

\li Set the maximum number of threads to \c N with the environment variable

\code
export OMP_NUM_THREADS=N
\endcode

or the C function

\code
#include <omp.h>
omp_set_num_threads(N);
\endcode

Use one of these methods with <tt>N=1</tt> if you want a single-threaded program.
You can return later to using all available threads via <tt>omp_set_num_threads(omp_get_num_procs())</tt>.

\li \ref apop_map and friends distribute their \c for loop over the input \ref apop_data
set across multiple threads. Therefore, be careful to send thread-unsafe functions to
it only after calling \c omp_set_num_threads(1).

\li There are a few functions, like \ref apop_model_draws, that rely on \ref apop_map, and
therefore also thread by default.

\li The function \ref apop_rng_get_thread retrieves a statically-stored RNG specific
to a given thread. Therefore, if you use that function in the place of a \c gsl_rng,
you can parallelize functions that make random draws.

\li \ref apop_rng_get_thread allocates its store of threads using <tt>apop_opts.rng_seed</tt>,
then incrementing that seed by one. You thus probably have threads with seeds 479901,
479902, 479903, .... [If you have a better way to do it, please feel free to modify the
code to implement your improvement and submit a pull request on Github.]

See <a href="http://modelingwithdata.org/arch/00000175.htm">this tutorial on C
threading</a> if you would like to know more, or are unsure about whether your functions
are thread-safe or not.
*/


/** \page dataoverview Data sets

The \ref apop_data structure represents a data set.  It joins together a \c gsl_vector,
a \c gsl_matrix, an \ref apop_name, and a table of strings. It tries to be lightweight,
so you can use it everywhere you would use a \c gsl_matrix or a \c gsl_vector.

Here is a diagram showing a sample data set with all of the elements in place. Together,
they represent a data set where each row is an observation, which includes both numeric
and text values, and where each row/column may be named.

\htmlinclude apop_data_fig.html
\latexinclude apop_data_fig.tex

In a regression, the vector would be the dependent variable, and the other columns
(after factor-izing the text) the independent variables. Or think of the \ref apop_data
set as a partitioned matrix, where the vector is column -1, and the first column of
the matrix is column zero. Here is some sample code to print the vector and matrix,
starting at column -1 (but you can use \ref apop_data_print to do this).

\code
for (int j = 0; j < (int)data->matrix->size1; j++){
    printf("%s\t", apop_name_get(data->names, j, 'r'));
    for (int i = -1; i < (int)data->matrix->size2; i++) //cast: size2 is a size_t, and -1 < (size_t)n is false
        printf("%g\t", apop_data_get(data, j, i));
    printf("\n");
}
\endcode

Most functions assume that each row represents one observation, so the data vector,
data matrix, and text have the same row count: \c data->vector->size==data->matrix->size1
and \c data->vector->size==*data->textsize. This means that the \ref apop_name structure
doesn't have separate \c vector_names, \c row_names, or \c text_row_names elements:
the \c rownames are assumed to apply for all.

See below for notes on managing the \c text element and the row/column names.

\section pps Pages

The \ref apop_data set includes a \c more pointer, which will typically be \c NULL,
but may point to another \ref apop_data set. This is intended for a main data set
and a second or third page with auxiliary information, such as estimated parameters
on the front page and their covariance matrix on page two, or predicted data on the
front page and a set of prediction intervals on page two.

The \c more pointer is not intended for a linked list for millions of data points. In
such situations, you can often improve efficiency by restructuring your data to use
a single table (perhaps via \ref apop_data_pack and \ref apop_data_unpack).

Most functions, such as \ref apop_data_copy and \ref apop_data_free, will handle all
the pages of information. For example, an optimization search over multi-page parameter
sets would search the space given by all pages.

Pages may also be appended as output or auxiliary information, such as
covariances, and an MLE would not search over these elements. Any page with a name in
XML-ish brackets, such as <tt>\<Covariance\></tt>, is considered information about the
data, not data itself, and therefore ignored by search routines, missing data routines,
et cetera. This is achieved by a rule in \ref apop_data_pack and \ref apop_data_unpack.

Here is a toy example that establishes a baseline data set, adds a page,
modifies it, and then later retrieves it.
\code
apop_data *d = apop_data_alloc(10, 10, 10); //the base data set, a 10-item vector + 10x10 matrix
apop_data *a_new_page = apop_data_add_page(d, apop_data_alloc(2,2), "new 2 x 2 page");
gsl_matrix_set_all(a_new_page->matrix, 3);

//later:
apop_data *retrieved = apop_data_get_page(d, "new", 'r'); //'r'=search via regex, not literal match.
apop_data_print(retrieved); //print a 2x2 grid of 3s.
\endcode

\section datafns Functions for using apop_data sets

There are a great many functions to collate, copy, merge, sort, prune, and otherwise
manipulate the \ref apop_data structure and its components.

\li\ref apop_data_add_named_elmt
\li\ref apop_data_copy
\li\ref apop_data_fill
\li\ref apop_data_memcpy
\li\ref apop_data_pack
\li\ref apop_data_rm_columns
\li\ref apop_data_sort
\li\ref apop_data_split
\li\ref apop_data_stack
\li\ref apop_data_transpose : transpose matrices (square or not) and text grids
\li\ref apop_data_unpack
\li\ref apop_matrix_copy
\li\ref apop_matrix_realloc
\li\ref apop_matrix_stack
\li\ref apop_text_set
\li\ref apop_text_paste
\li\ref apop_text_to_data
\li\ref apop_vector_copy
\li\ref apop_vector_fill
\li\ref apop_vector_stack
\li\ref apop_vector_realloc
\li\ref apop_vector_unique_elements

Apophenia builds upon the GSL, but it would be inappropriate to redundantly replicate
the <a href="http://www.gnu.org/software/gsl/manual/html_node/index.html">GSL's documentation</a> here.
Meanwhile, here are prototypes for a few common functions. The GSL's
naming scheme is very consistent, so a simple reminder of the function name may be
sufficient to indicate how they are used.

\li <tt>gsl_matrix_swap_rows (gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>gsl_matrix_swap_columns (gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>gsl_matrix_swap_rowcol (gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>gsl_matrix_transpose_memcpy (gsl_matrix * dest, const gsl_matrix * src)</tt>
\li <tt>gsl_matrix_transpose (gsl_matrix * m)</tt> : square matrices only
\li <tt>gsl_matrix_set_all (gsl_matrix * m, double x)</tt>
\li <tt>gsl_matrix_set_zero (gsl_matrix * m)</tt>
\li <tt>gsl_matrix_set_identity (gsl_matrix * m)</tt>
\li <tt>gsl_matrix_memcpy (gsl_matrix * dest, const gsl_matrix * src)</tt>
\li <tt>void gsl_vector_set_all (gsl_vector * v, double x)</tt>
\li <tt>void gsl_vector_set_zero (gsl_vector * v)</tt>
\li <tt>int gsl_vector_set_basis (gsl_vector * v, size_t i)</tt>: set all elements to zero, but set item \f$i\f$ to one.
\li <tt>gsl_vector_reverse (gsl_vector * v)</tt>: reverse the order of your vector's elements
\li <tt>gsl_vector_ptr</tt> and <tt>gsl_matrix_ptr</tt>. To increment an element in a vector use, e.g., <tt>*gsl_vector_ptr(v, 7) += 3;</tt> or <tt>(*gsl_vector_ptr(v, 7))++</tt>.
\li <tt>gsl_vector_memcpy (gsl_vector * dest, const gsl_vector * src)</tt>

\subsection readin Reading from text files

The \ref apop_text_to_data() function takes in the name of a text file with a grid of data in (comma|tab|pipe|whatever)-delimited format and reads it to a matrix. If there are names in the text file, they are copied in to the data set. See \ref text_format for the full range and details of what can be read in.

If you have any columns of text, then you will need to read in via the database: use
\ref apop_text_to_db() to convert your text file to a database table, 
do any database-appropriate cleaning of the input data, then use \ref
apop_query_to_data() or \ref apop_query_to_mixed_data() to pull the data to an \ref apop_data set.
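That workflow might look like the following sketch (the file, table, and column names are hypothetical; the <tt>"mmt"</tt> type list is intended to request two numeric columns and one text column):

```c
//Pull the text file into a database table, clean it in SQL, then query it out.
apop_text_to_db("survey.csv", "survey");
apop_query("update survey set age = NULL where age = '';"); //database-side cleaning
apop_data *d = apop_query_to_mixed_data("mmt", "select age, weight, name from survey");
```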

\subpage text_format

\section datalloc Alloc/free

You may not need to use these functions often, given that \ref apop_query_to_data, \ref apop_text_to_data, and many transformation functions will auto-allocate \ref apop_data sets for you.

The \ref apop_data_alloc function allocates a vector, a matrix, or both. After this call, the structure will have blank names, \c NULL \c text element, and \c NULL \c weights.  See \ref names for discussion of filling the names. Use \ref apop_text_alloc to allocate the \c text grid. The \c weights are a simple \c gsl_vector, so allocate a 100-unit weights vector via <tt>allocated_data_set->weights = gsl_vector_alloc(100)</tt>.

Examples of use can be found throughout the documentation; for example, see \ref gentle.

\li\ref apop_data_alloc
\li\ref apop_data_calloc
\li\ref apop_data_free
\li\ref apop_text_alloc : allocate or resize the text part of an \ref apop_data set.
\li\ref apop_text_free

See also:

\li <tt>gsl_matrix * gsl_matrix_alloc (size_t n1, size_t n2)</tt>
\li <tt>gsl_matrix * gsl_matrix_calloc (size_t n1, size_t n2)</tt>
\li <tt>void gsl_matrix_free (gsl_matrix * m)</tt>
\li <tt>gsl_vector * gsl_vector_alloc (size_t n)</tt>
\li <tt>gsl_vector * gsl_vector_calloc (size_t n)</tt>
\li <tt>void gsl_vector_free (gsl_vector * v)</tt>


\section gslviews 	Using views

There are several macros for the common task of viewing a single row or column of a \ref
apop_data set.

\code
apop_data *d = apop_query_to_data("select obs1, obs2, obs3 from a_table");

//Get a column using its name. Note that the generated view, ov, is the
//last item named in the call to the macro.
Apop_col_t(d, "obs1", ov);
double obs1_sum = apop_vector_sum(ov);

//Get row zero of the data set's matrix as a vector; get its sum
double row_zero_sum = apop_vector_sum(Apop_rv(d, 0));

//Get a row or rows as a standalone one-row apop_data set
apop_data_print(Apop_r(d, 0));

//ten rows starting at row 3:
apop_data *d10 = Apop_rs(d, 3, 10);
apop_data_print(d10);

//Column zero's sum
gsl_vector *cv = Apop_cv(d, 0);
double col_zero_sum = apop_vector_sum(cv);
//or on one line:
col_zero_sum = apop_vector_sum(Apop_cv(d, 0));

//Pull a 10x5 submatrix, whose origin element is the (2,3)rd
//element of the parent data set's matrix

double sub_sum = apop_matrix_sum(Apop_subm(d, 2,3, 10,5));
\endcode

Because these macros can be used as arguments to a function, they have abbreviated names to save line space.

\li\ref Apop_r : get row as one-observation \ref apop_data set
\li\ref Apop_c : get column as \ref apop_data set
\li\ref Apop_cv : get column as \ref gsl_vector
\li\ref Apop_rv : get row as \ref gsl_vector
\li\ref Apop_cs : get columns as \ref apop_data set
\li\ref Apop_rs : get rows as \ref apop_data set
\li\ref Apop_mcv : matrix column as vector
\li\ref Apop_mrv : matrix row as vector
\li\ref Apop_subm : get submatrix of a \ref gsl_matrix

A second set of macros has a slightly different syntax, taking the name of the object to be declared as the last argument. These cannot be used as expressions such as function arguments.

\li\ref Apop_col_t
\li\ref Apop_row_t
\li\ref Apop_col_tv
\li\ref Apop_row_tv

The view is an automatic variable, not a pointer, and therefore disappears at the end
of the scope in which it is declared. If you want to retain the data after the function
exits, copy it to another vector:

\code
return apop_vector_copy(Apop_rv(d, 2)); //return a gsl_vector copy of row 2
\endcode

Curly braces always delimit scope, not just at the end of a function. 
When program evaluation exits a given block, all variables in that block are
erased. Here is some sample code that won't work:

\code
apop_data *outdata;
if (get_odd){
    outdata = Apop_r(data, 1);
} else {
    outdata = Apop_r(data, 0);
}
apop_data_print(outdata); //breaks: outdata points to out-of-scope variables.
\endcode

For this if/then statement, there are two sets of local variables
generated: one for the \c if block, and one for the \c else block. By the last line,
neither exists. You can get around the problem here by making sure to not put the macro
declaring new variables in a block. E.g.:

\code
apop_data *outdata = Apop_r(data, get_odd ? 1 : 0);
apop_data_print(outdata);
\endcode


\section data_set_get Set/get

First, some examples:

\code
apop_data *d = apop_data_alloc(10, 10, 10);
apop_name_add(d->names, "Zeroth row", 'r');
apop_name_add(d->names, "Zeroth col", 'c');

//set cell at row=8 col=0 to value=27
apop_data_set(d, 8, 0, .val=27);
assert(apop_data_get(d, 8, .colname="Zeroth") == 27);
double *x = apop_data_ptr(d, .col=7, .rowname="Zeroth");
*x = 270;
assert(apop_data_get(d, 0, 7) == 270);

// This is invalid---the value doesn't follow the colname. Use .val=5.
// apop_data_set(d, .row = 3, .colname="Column 8", 5);  

// This is OK, to set (3, 8) to 5:
apop_data_set(d, 3, 8, 5);


//apop_data set holding a scalar:
apop_data *s = apop_data_alloc(1);
apop_data_set(s, .val=12);
assert(apop_data_get(s) == 12);

//apop_data set holding a vector:
apop_data *v = apop_data_alloc(12);
for (int i=0; i< 12; i++) apop_data_set(v, i, .val=i*10);
assert(apop_data_get(v, 3) == 30);

//This is a common form from pulling from a list of named scalars, 
//produced via apop_data_add_named_elmt
double AIC = apop_data_get(your_model->info, .rowname="AIC");
\endcode

\li The versions that take a column/row name use  \ref apop_name_find
    for the search; see notes there on the name matching rules.
\li For those that take a column number, column -1 is the vector element. 
\li For those that take a column name, I will search the vector last---if I don't find the name among the matrix columns, but the name matches the vector name, I return column -1.
\li If you give me both a <tt>.row</tt> and a <tt>.rowname</tt>, I go with the name; similarly for <tt>.col</tt> and
<tt>.colname</tt>.
\li You can give the name of a page, e.g.
\code
double AIC = apop_data_get(data, .rowname="AIC", .col=-1, .page="<Info>");
\endcode

\li Numeric values default to zero, which is how the examples above that treated the \ref apop_data set as a vector or scalar could do so relatively gracefully.
So <tt>apop_data_get(dataset, 1)</tt> gets item (1, 0) from the matrix
element of \c dataset. But as a do-what-I-mean exception, if there is no matrix element
but there is a vector, then this form will get vector element 1. Relying on this DWIM
exception is useful iff you can guarantee that a data set will have only a vector or
a matrix but not both. Otherwise, be explicit and use <tt>apop_data_get(dataset, 1, -1)</tt>.

The \ref apop_data_ptr function follows the lead of \c gsl_vector_ptr and \c
gsl_matrix_ptr, and like those functions, returns a pointer to the appropriate \c double.
For example, to increment the (3,7)th element of an \ref apop_data set:

\code
(*apop_data_ptr(dataset, 3, 7))++;
\endcode

\li\ref apop_data_get
\li\ref apop_data_set
\li\ref apop_data_ptr : returns a pointer to the element.
\li\ref apop_data_get_page : retrieve a named page from a data set. If you only need a few items, you can specify a page name to \c apop_data_(get|set|ptr).

    See also:

\li <tt>double gsl_matrix_get (const gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>double gsl_vector_get (const gsl_vector * v, size_t i)</tt>
\li <tt>void gsl_matrix_set (gsl_matrix * m, size_t i, size_t j, double x)</tt>
\li <tt>void gsl_vector_set (gsl_vector * v, size_t i, double x)</tt>
\li <tt>double * gsl_matrix_ptr (gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>double * gsl_vector_ptr (gsl_vector * v, size_t i)</tt>
\li <tt>const double * gsl_matrix_const_ptr (const gsl_matrix * m, size_t i, size_t j)</tt>
\li <tt>const double * gsl_vector_const_ptr (const gsl_vector * v, size_t i)</tt>
\li <tt>gsl_matrix_get_row (gsl_vector * v, const gsl_matrix * m, size_t i)</tt>
\li <tt>gsl_matrix_get_col (gsl_vector * v, const gsl_matrix * m, size_t j)</tt>
\li <tt>gsl_matrix_set_row (gsl_matrix * m, size_t i, const gsl_vector * v)</tt>
\li <tt>gsl_matrix_set_col (gsl_matrix * m, size_t j, const gsl_vector * v)</tt>


\section mapply   Map/apply

\anchor outline_mapply
These functions allow you to send each element of a vector or matrix to a function, either producing a new matrix (map) or transforming the original (apply).  The \c ..._sum functions return the sum of the mapped output.

There are two types, which were developed at different times. The \ref apop_map and
\ref apop_map_sum functions use variadic function inputs to cover a lot of different
types of process depending on the inputs. Other functions with types in their names,
like \ref apop_matrix_map and \ref apop_vector_apply, may be easier to use in some
cases. They use the same routines internally, so use whichever type is convenient.

You can do many things quickly with these functions.

Get the sum of squares of a vector's elements:

\code
  //given apop_data *dataset and gsl_vector *v:
double sum_of_squares = apop_map_sum(dataset, gsl_pow_2);
double sum_of_squares_v = apop_vector_map_sum(v, gsl_pow_2);
\endcode

Create an index vector [\f$0, 1, 2, ...\f$].

\code
double set_to_index(double in, int index){return index;}
apop_data *d = apop_map(apop_data_alloc(100), .fn_di=set_to_index, .inplace='y');
\endcode

Given your log likelihood function, which acts on a \ref apop_data set with only one
row, and a data set where each row of the matrix is an observation, find the total
log likelihood via:

\code
static double your_log_likelihood_fn(apop_data * in)
     {[your math goes here]}

double total_ll = apop_map_sum(dataset, .fn_r=your_log_likelihood_fn);
\endcode

How many missing elements are there in your data matrix? 

\code
static double nan_check(const double in){ return isnan(in);}

int missing_ct = apop_map_sum(in, nan_check, .part='m');
\endcode

Get the mean of the not-NaN elements of a data set:

\code
static double no_nan_val(const double in){ return isnan(in)? 0 : in;}
static double not_nan_check(const double in){ return !isnan(in);}

static double apop_mean_no_nans(apop_data *in){
    return apop_map_sum(in, no_nan_val)/apop_map_sum(in, not_nan_check);
}
\endcode

The following program randomly generates a data set where each row is a list of numbers with a different mean. It then finds the \f$t\f$ statistic for each row, and the confidence with which we reject the claim that the statistic is less than or equal to zero.

Notice how the older \ref apop_vector_apply uses file-global variables to pass information into the functions, while \ref apop_map uses a pointer to send parameters to the functions.

\include t_test_by_rows.c

One more toy example demonstrating the use of \ref apop_map and \ref apop_map_sum :

\include apop_map_row.c


\li If the number of threads is greater than one, then the matrix will be broken
into chunks and each sent to a different thread. Notice that the GSL is generally
threadsafe, and SQLite is threadsafe conditional on several commonsense caveats that
you'll find in the SQLite documentation. See \ref apop_rng_get_thread() to use the GSL's RNGs in a threaded environment.

\li The \c ...sum functions are convenience functions that call \c ...map and then add up the contents. Thus, you will need to have adequate memory for the allocation of the temp matrix/vector.

\li\ref apop_map
\li\ref apop_map_sum
\li\ref apop_matrix_apply
\li\ref apop_matrix_map
\li\ref apop_matrix_map_all_sum
\li\ref apop_matrix_map_sum
\li\ref apop_vector_apply
\li\ref apop_vector_map
\li\ref apop_vector_map_sum



\section  matrixmathtwo  Basic Math

\li\ref apop_vector_exp : exponentiate every element of a vector
\li\ref apop_vector_log : take the natural log of every element of a vector
\li\ref apop_vector_log10 : take the log (base 10) of every element of a vector
\li\ref apop_vector_distance : find the distance between two vectors via various metrics
\li\ref apop_vector_normalize : scale/shift a vector to have mean zero, sum to one, have a range of exactly \f$[0, 1]\f$, et cetera
\li\ref apop_vector_entropy : calculate the entropy of a vector of frequencies or probabilities

See also:

\li <tt>int gsl_matrix_add (gsl_matrix * a, const gsl_matrix * b)</tt>
\li <tt>int gsl_matrix_sub (gsl_matrix * a, const gsl_matrix * b)</tt>
\li <tt>int gsl_matrix_mul_elements (gsl_matrix * a, const gsl_matrix * b)</tt>
\li <tt>int gsl_matrix_div_elements (gsl_matrix * a, const gsl_matrix * b)</tt>
\li <tt>int gsl_matrix_scale (gsl_matrix * a, const double x)</tt>
\li <tt>int gsl_matrix_add_constant (gsl_matrix * a, const double x)</tt>
\li <tt>gsl_vector_add (gsl_vector * a, const gsl_vector * b)</tt>
\li <tt>gsl_vector_sub (gsl_vector * a, const gsl_vector * b)</tt>
\li <tt>gsl_vector_mul (gsl_vector * a, const gsl_vector * b)</tt>
\li <tt>gsl_vector_div (gsl_vector * a, const gsl_vector * b)</tt>
\li <tt>gsl_vector_scale (gsl_vector * a, const double x)</tt>
\li <tt>gsl_vector_add_constant (gsl_vector * a, const double x)</tt>

            
\section  matrixmath  Matrix math

\li\ref apop_dot : matrix \f$\cdot\f$ matrix, matrix \f$\cdot\f$ vector, or vector \f$\cdot\f$ matrix
\li\ref apop_matrix_determinant
\li\ref apop_matrix_inverse
\li\ref apop_det_and_inv : find determinant and inverse at the same time

See the GSL documentation for myriad further options.


\section  sumstats  Summary stats

\li\ref apop_data_summarize
\li\ref apop_vector_moving_average
\li\ref apop_vector_percentiles
\li\ref apop_vector_bounded

See also:

\li <tt>double gsl_matrix_max (const gsl_matrix * m)</tt>
\li <tt>double gsl_matrix_min (const gsl_matrix * m)</tt>
\li <tt>void gsl_matrix_minmax (const gsl_matrix * m, double * min_out, double * max_out)</tt>
\li <tt>void gsl_matrix_max_index (const gsl_matrix * m, size_t * imax, size_t * jmax)</tt>
\li <tt>void gsl_matrix_min_index (const gsl_matrix * m, size_t * imin, size_t * jmin)</tt>
\li <tt>void gsl_matrix_minmax_index (const gsl_matrix * m, size_t * imin, size_t * jmin, size_t * imax, size_t * jmax)</tt>
\li <tt>gsl_vector_max (const gsl_vector * v)</tt>
\li <tt>gsl_vector_min (const gsl_vector * v)</tt>
\li <tt>gsl_vector_minmax (const gsl_vector * v, double * min_out, double * max_out)</tt>
\li <tt>gsl_vector_max_index (const gsl_vector * v)</tt>
\li <tt>gsl_vector_min_index (const gsl_vector * v)</tt>
\li <tt>gsl_vector_minmax_index (const gsl_vector * v, size_t * imin, size_t * imax)</tt>


\section  moments  Moments

For most of these, you can add a weights vector for weighted mean/var/cov/..., such as
<tt>apop_vector_mean(d->vector, .weights=d->weights)</tt>

\li\ref apop_mean : the first three with short names operate on a vector.
\li\ref apop_sum
\li\ref apop_var
\li\ref apop_matrix_sum
\li\ref apop_data_correlation
\li\ref apop_data_covariance
\li\ref apop_data_summarize
\li\ref apop_matrix_mean
\li\ref apop_matrix_mean_and_var
\li\ref apop_vector_correlation
\li\ref apop_vector_cov
\li\ref apop_vector_kurtosis
\li\ref apop_vector_kurtosis_pop 
\li\ref apop_vector_mean
\li\ref apop_vector_skew
\li\ref apop_vector_skew_pop
\li\ref apop_vector_sum
\li\ref apop_vector_var
\li\ref apop_vector_var_m 


\section convsec   Conversion among types

There are no functions provided to convert from \ref apop_data to the constituent
elements, because you don't need a function.

If you need a constituent element, just refer to it via the struct's pointers:

\code
apop_data *d = apop_query_to_mixed_data("vmmw", "select result, age, "
                                     "income, replicate_weight from data");
double avg_result = apop_vector_mean(d->vector, .weights=d->weights);
\endcode

In the other direction, you can use compound literals to wrap an \ref apop_data struct
around a loose vector or matrix:

\code
//Given:
gsl_vector *v;
gsl_matrix *m;

// Then this form wraps the elements into automatically-allocated apop_data structs.

apop_data *dv = &(apop_data){.vector=v}; 
apop_data *dm = &(apop_data){.matrix=m};

apop_data *v_dot_m = apop_dot(dv, dm);

//Here is a macro to hide C's ugliness:
#define As_data(...) (&(apop_data){__VA_ARGS__})

apop_data *v_dot_m2 = apop_dot(As_data(.vector=v), As_data(.matrix=m));

//The wrapped object is an automatically-allocated structure pointing to the
//original data. If it needs to persist or be separate from the original,
//make a copy:
apop_data *dm_copy = apop_data_copy(As_data(.vector=v, .matrix=m));
\endcode

\li\ref apop_array_to_vector : <tt>double*</tt>\f$\to\f$ <tt>gsl_vector</tt>
\li\ref apop_data_fill : <tt>double*</tt>\f$\to\f$  \ref apop_data
\li\ref apop_data_falloc : macro to allocate and fill a \ref apop_data set
\li\ref apop_text_to_data : delimited text file\f$\to\f$ \ref apop_data
\li\ref apop_text_to_db : delimited text file\f$\to\f$ database table
\li\ref apop_vector_to_matrix


\section names   Name handling

If you generate your data set via \ref apop_text_to_data or from the database via
\ref apop_query_to_data (or \ref apop_query_to_text or \ref apop_query_to_mixed_data)
then column names appear as expected.  Set <tt>apop_opts.db_name_column</tt> to the
name of a column in your query result to use that column name for row names.

Sample uses, given \ref apop_data set <tt>d</tt>:

\code
int row_name_count = d->names->rowct;
int col_name_count = d->names->colct;
int text_name_count = d->names->textct;

//Manually add names in sequence:
apop_name_add(d->names, "the vector", 'v');
apop_name_add(d->names, "row 0", 'r');
apop_name_add(d->names, "row 1", 'r');
apop_name_add(d->names, "row 2", 'r');
apop_name_add(d->names, "numeric column 0", 'c');
apop_name_add(d->names, "text column 0", 't');
apop_name_add(d->names, "The name of the data set.", 'h');

//or append several names at once
apop_data_add_names(d, 'c', "numeric column 1", "numeric column 2", "numeric column 3");

//point to element i from the row/col/text names:

char *rowname_i = d->names->row[i];
char *colname_i = d->names->col[i];
char *textname_i = d->names->text[i];

//The vector also has a name:
char *vname = d->names->vector;
\endcode

\li\ref apop_name_add : add one name
\li\ref apop_data_add_names : add a sequence of names at once
\li\ref apop_name_stack : copy the contents of one name list to another
\li\ref apop_name_find : find the row/col number for a given name.
\li\ref apop_name_print : print the \ref apop_name struct, for diagnostic purposes.


\section textsec   Text data

The \ref apop_data set includes a grid of strings, named <tt>text</tt>, for holding text data. 

Text should be encoded in UTF-8. ASCII is a subset of UTF-8, so that's OK too.

There are a few simple forms for handling the \c text element of an \c apop_data set.

\li Use \ref apop_text_alloc to allocate the block of text. It is actually a realloc function, which you can use to resize an existing block without leaks. See the example below.
\li Use \ref apop_text_set to write text elements. It replaces any existing text in the given slot without memory leaks.
\li The number of rows of text data in <tt>tdata</tt> is
<tt>tdata->textsize[0]</tt>; 
the number of columns is <tt>tdata->textsize[1]</tt>.
\li Refer to individual elements using the usual 2-D array notation, <tt>tdata->text[row][col]</tt>.
\li <tt>x[0]</tt> can always be written as <tt>*x</tt>, which may save some typing. The number of rows is <tt>*tdata->textsize</tt>. If you have a single column of text data (i.e., all data is in column zero), then item \c i is <tt>*tdata->text[i]</tt>. If you know you have exactly one cell of text, then its value is <tt>**tdata->text</tt>.
\li After \ref apop_text_alloc, all elements are the empty string <tt>""</tt>, which
you can check via 
\code
if (!strlen(dataset->text[i][j])) printf("<blank>");
//or
if (!*dataset->text[i][j]) printf("<blank>");
\endcode
For the sake of efficiency when dealing with large, sparse data sets, all blank cells
point to <em>the same</em> static empty string, meaning that freeing cells must be
done with care. Your best bet is to rely on \ref apop_text_set, \ref apop_text_alloc,
and \ref apop_text_free to do the memory management for you.

Here is a sample program that uses these forms, plus a few text-handling functions.

\include eg/text_demo.c

\li\ref apop_data_transpose() : also transposes the text data. Say that you use
<tt>dataset = apop_query_to_text("select onecolumn from data");</tt> then you have a
sequence of strings, <tt>dataset->text[0][0], dataset->text[1][0], </tt>.... After <tt>apop_data
*dt = apop_data_transpose(dataset)</tt>, you will have a single list of strings,
<tt>dt->text[0]</tt>, which is often useful as input to list-of-strings handling
functions.

\li\ref apop_query_to_text
\li\ref apop_text_alloc : allocate or resize the text part of an \ref apop_data set.
\li\ref apop_text_set : replace a single cell of the text grid with new text.
\li\ref apop_text_paste : convert a table of strings into one long string.
\li\ref apop_text_unique_elements : get a sorted list of unique elements for one column of text.
\li\ref apop_text_free : you may never need this, because \ref apop_data_free calls it.
\li\ref apop_regex : friendlier front-end for POSIX-standard regular expression
            searching; pulls matches into an \ref apop_data set.

\subsection fact   Generating factors

\em Factor is jargon for a numbered category. Number-crunching programs prefer integers over text, so we need a function to produce a one-to-one mapping from text categories into numeric factors. 

A \em dummy is a variable that is either one or zero, depending on membership in a given
group. Some methods (typically when the variable is an input or independent variable
in a regression) prefer dummies; some methods (typically for outcome or dependent
variables) prefer factors. The functions that generate factors and dummies will add
an informational page to your \ref apop_data set with a name like <tt>\<categories
for your_column\></tt> listing the conversion from the artificial numeric factor to
the original data. Use \ref apop_data_get_factor_names to get a pointer to that page.

You can use the factor table to translate from numeric categories back to text (though
you probably have the original text column in your data anyway).

Having the factor list in an auxiliary table makes it easy to ensure that multiple
\ref apop_data sets use the same single categorization scheme. Generate factors in the
first set, then copy the factor list to the second, then run \ref apop_data_to_factors
on the second:

\code
apop_data_to_factors(d1);
d2->more = apop_data_copy(apop_data_get_factor_names(d1));
apop_data_to_factors(d2);
\endcode

See the documentation for \ref apop_logit for a sample linear model using a factor dependent variable and dummy independent variable.

\li\ref apop_data_to_dummies
\li\ref apop_data_to_factors
\li\ref apop_data_get_factor_names
*/

/** \page dbs Databases

These are convenience functions to handle interaction with SQLite or mySQL/mariaDB. They open one and only one database, and handle most of the interaction therewith for you.

You will probably first use \ref apop_text_to_db to pull data into the database, then \ref apop_query to clean the data in the database, and finally \ref apop_query_to_data to pull some subset of the data out for analysis.

\li In all cases, your query may be in <tt>printf</tt> form. For example:
\code
char tabname[] = "demographics";
char colname[] = "heights";
int min_height = 175;
apop_query("select %s from %s where %s > %i", colname, tabname, colname, min_height);
\endcode


See the \ref db_moments section below for math functions beyond the SQL standard that you
can use when sending queries from Apophenia, such as \c pow, \c stddev, or \c sqrt.

\li \ref apop_text_to_db : Read a text file on disk into the database. Data analysis projects often start with a call to this.
\li \ref apop_data_print : If you include the argument <tt>.output_type='d'</tt>, this prints your \ref apop_data set to the database.
\li \ref apop_query : Manipulate the database, return nothing (e.g., insert rows or create table).
\li \ref apop_db_open : Optional, for when you want to use a database on disk.
\li \ref apop_db_close : A useful (and in some cases, optional) companion to \ref apop_db_open.
\li \ref apop_table_exists : Check to make sure you aren't reinventing or destroying data. Also, a clean way to drop a table.

\li Apophenia reserves the right to insert temp tables into the opened database. They
will all have names beginning with <tt>apop_</tt>, so the reader is advised to not
generate tables with such names, and is free to ignore or delete any such tables that
turn up.
\li If you need to deal with two databases, use SQL's <a
href="https://sqlite.org/lang_attach.html"><tt>attach database</tt></a>. By default
with SQLite, Apophenia opens an in-memory database handle. It is a sensible workflow to
use the faster in-memory database as the primary database, and then attach an on-disk database
to read in data and write final output tables.

\section edftd Extracting data from the database

\li\ref apop_db_to_crosstab : take up to three columns in the database (row, column, value) and produce a table of values.
\li\ref apop_query_to_data
\li\ref apop_query_to_float
\li\ref apop_query_to_mixed_data
\li\ref apop_query_to_text
\li\ref apop_query_to_vector

\section wdttd Writing data to the database

See the print functions at \ref Legi. E.g.

\code
apop_data_print(yourdata, .output_type='d', .output_name="dbtab");
\endcode

\section cmdline Command-line utilities

A few functions have proven to be useful enough to be worth breaking out into their own programs, for use in scripts or other data analysis from the command line:

\li The \c apop_text_to_db command-line utility is a wrapper for the \ref apop_text_to_db function.
\li The \c apop_db_to_crosstab command-line utility is a wrapper for the \ref apop_db_to_crosstab function.

\section db_moments Database moments (plus pow()!)

SQLite lets users define new functions for use in queries, and Apophenia uses this facility to define a few common functions.

\li <tt>select ran() from table</tt> will produce a new random number between zero and one for every row of the input table, using \c gsl_rng_uniform. 

\li The SQL standard includes the <tt>count(x)</tt> and <tt>avg(x)</tt> aggregators,
but statisticians are usually interested in higher moments as well---at least the
variance. Therefore, SQL queries using the Apophenia library may include any of these moments:

\code
select count(x), stddev(x), avg(x), var(x), variance(x), skew(x), kurt(x), kurtosis(x),
std(x), stddev_samp(x), stddev_pop(x), var_samp(x), var_pop(x)
from table
group by whatever
\endcode

<tt>kurt</tt> and <tt>kurtosis</tt> do the same thing; choose the one that sounds better to you.
Kurtosis is the fourth central moment by itself, not adjusted by subtracting three or dividing by the variance squared.
<tt>var</tt>, <tt>var_samp</tt>, <tt>stddev</tt>, and <tt>stddev_samp</tt> give the sample variance/standard deviation; <tt>variance</tt>, <tt>var_pop</tt>, <tt>std</tt>, and <tt>stddev_pop</tt> give the population versions. The plethora of variants is for mySQL compatibility.

\li The  var/skew/kurtosis functions calculate sample moments. If you want the second
population moment, multiply the variance by \f$(n-1)/n\f$; for the third population moment,
multiply the skew by    \f$(n-1)(n-2)/n^2\f$. The equation for the unbiased sample kurtosis
as calculated in <a href="http://modelingwithdata.org/pdfs/moments.pdf">Appendix M of
<em>Modeling with Data</em></a> is not quite as easy to adjust.

\li Also provided: wrapper functions for standard math library
functions---<tt>sqrt(x)</tt>, <tt>pow(x,y)</tt>, <tt>exp(x)</tt>, <tt>log(x)</tt>,
and trig functions. They call the standard math library function of the same name
to calculate \f$\sqrt{x}\f$, \f$x^y\f$, \f$e^x\f$, \f$\ln(x)\f$, \f$\sin(x)\f$,
\f$\arcsin(x)\f$, et cetera. For example:

\code
select sqrt(x), pow(x,0.5), exp(x), log(x), log10(x),
    sin(x), cos(x), tan(x), asin(x), acos(x), atan(x)
from table
\endcode

\li The <tt>ran()</tt> function calls <tt>gsl_rng_uniform</tt> to produce a uniform
draw between zero and one. It uses the stock of RNGs from \ref apop_rng_get_thread.

Here is a test script using many of the above.

\include db_fns.c
*/


/** \page modelsec Models
See \ref gentle_model for an overview of the intent and basic use of the \ref apop_model struct.

This segment goes into greater detail on the use of existing \ref apop_model objects.
If you need to write a new model, see \ref modeldetails.

The \ref apop_estimate function will estimate the parameters of your model. Just prep the data, select a model, and produce an estimate:

\code
    apop_data *data = apop_query_to_data("select outcome, in1, in2, in3 from dataset");
    apop_model *the_estimate = apop_estimate(data, apop_probit);
    apop_model_print(the_estimate);
\endcode

Along the way to estimating the parameters, most models also find covariance estimates for
the parameters, calculate statistics like log likelihood, and so on, which the final print statement will show.

The <tt>apop_probit</tt> model that ships with Apophenia is unparameterized:
<tt>apop_probit->parameters==NULL</tt>. The output from the estimation,
<tt>the_estimate</tt>, has the same form as <tt>apop_probit</tt>, but
<tt>the_estimate->parameters</tt> has a meaningful value.

Apophenia ships with many well-known models for your immediate use, including
probability distributions, such as the \ref apop_normal, \ref apop_poisson, or \ref
apop_beta models. The data is assumed to have been drawn from a given distribution and
the question is only what distributional parameters best fit. For example, given that
the data is Normally distributed, find \f$\mu\f$ and \f$\sigma\f$ via
<tt>apop_estimate(your_data, apop_normal)</tt>.

There are also linear models like \ref apop_ols, \ref apop_probit, and \ref apop_logit. As in the example, they are on equal footing with the distributions, so nothing keeps you from making random draws from an estimated linear model.

  \li If you send a data set with the \c weights vector filled, \ref apop_ols estimates Weighted OLS.
  \li If the dependent variable has more than two categories, the \ref apop_probit and
\ref apop_logit models estimate a multinomial logit or probit.
  \li There are separate \ref apop_normal and \ref apop_multivariate_normal functions
because the parameter formats are slightly different: the univariate Normal keeps both
\f$\mu\f$ and \f$\sigma\f$ in the vector element of the parameters; the multivariate
version uses the vector for the vector of means and the matrix for the \f$\Sigma\f$
matrix. The univariate version is so heavily used that it merits a special-case model.

See the \ref models page for a list of models shipped with Apophenia,
including popular favorites like \ref apop_beta, \ref apop_binomial, \ref apop_iv
(instrumental variables), \ref apop_kernel_density, \ref apop_loess, \ref apop_lognormal,
\ref apop_pmf (see \ref histosec below), and \ref apop_poisson.

Simulation models may seem not to fit this form, but you will see below that if you can write an objective function for the \c p method of the model, you can use the above tools. Notably, you can estimate parameters via maximum likelihood and then give confidence intervals around those parameters.

<em>More estimation output</em>

In the \ref apop_model returned by \ref apop_estimate, you will find:

  \li The actual parameter estimates are in an \ref apop_data set at \c your_model->parameters.
  \li A pointer to the \ref apop_data set used for estimation, named \c data.
  \li Scalar statistics of the model listed in the output model's \c info group,
which may include some hypothesis tests, a list of expected values, log likelihood, AIC, AIC_c, BIC, et cetera.
These can be retrieved via a form like
\code
apop_data_get(your_model->info, .rowname="log likelihood");
//or
apop_data_get(your_model->info, .rowname="AIC");
\endcode
If those are not necessary, adding to your model an \ref apop_parts_wanted_settings
group with its default values (see below on settings groups) signals to the model
that you want only the parameters and to not waste possibly significant CPU time
on covariances, expected values, et cetera. See the \ref apop_parts_wanted_settings
documentation for examples and further refinements.
  \li In many cases, covariances of the parameters as a page appended to the parameters; retrieve via
\code
apop_data *cov = apop_data_get_page(your_model->parameters, "<Covariance>");
\endcode
  \li Typically for regression-type models, the table of expected values (typically including
expected value, actual value, and residual) is a page stapled to the main info
page. Retrieve via:
\code
apop_data *predict = apop_data_get_page(your_model->info, "<Predicted>");
\endcode

See individual model documentation for what is provided by any given model.

<em>Post-estimation uses</em>

But we expect much more from a model than estimating parameters from data.  

Continuing the above example where we got an estimated Probit model named \c the_estimate, we can interrogate the estimate in various familiar ways:

\code
apop_data *expected_value = apop_predict(NULL, the_estimate);

double density_under =  apop_cdf(expected_value, the_estimate);

apop_data *draws = apop_model_draws(the_estimate, .count=1000);
\endcode

\subpage dataones

\section modelparameterization  Parameterizing or initializing a model

The models that ship with Apophenia have the requisite procedures for estimation,
making draws, and so on, but have <tt>parameters==NULL</tt> and <tt>settings==NULL</tt>. The
model is thus, for many purposes, incomplete, and you will need to take some action to
complete the model. As per the examples to follow, there are several possibilities:

  \li Estimate it! Almost all models can be sent with a data set as an argument to the
<tt>apop_estimate</tt> function. The input model is unchanged, but the output model
has parameters and settings in place.
  \li If your model has a fixed number of numeric parameters, then you can set them with
\ref apop_model_set_parameters.
  \li If your model has a variable number of parameters, you can directly set the 
\c parameters element via \ref apop_data_falloc. For most purposes, you will also need to
set the \c msize1, \c msize2, \c vsize, and \c dsize elements to the size you want. See
the example below.
  \li Some models have disparate, non-numeric settings rather than a simple matrix of
parameters. For example, a kernel density estimate needs a model as a kernel and a
base data set, which can be set via \ref apop_model_set_settings.

Here is an example that shows the options for parameterizing a model. After each
parameterization, 20 draws are made and written to a file named draws-[modelname].

\include ../eg/parameterization.c


\section transformsec Filtering & updating

The model structure makes it easy to generate new models that are variants of prior
models. Bayesian updating, for example, takes in one \ref apop_model that we call the
prior, one \ref apop_model that we call a likelihood, and outputs an \ref apop_model
that we call the posterior. One can produce complex models using simpler transformations
as well. For example, \ref apop_model_fix_params will set the free parameters of
an input model to a fixed value, thus producing a model with fewer parameters. To
transform a Normal(\f$\mu\f$, \f$\sigma\f$) into a one-parameter Normal(\f$\mu\f$, 1):

\code
apop_model *N_sigma1 = apop_model_fix_params(apop_model_set_parameters(apop_normal, NAN, 1));
\endcode

This can be used anywhere the original Normal distribution can be. To give another
example, if we need to truncate the distribution in the data space:

\code
//The constraint function.
double over_zero(apop_data *in, apop_model *m){
    return apop_data_get(in) > 0;
}

apop_model *trunc = apop_model_dconstrain(.base_model=N_sigma1,
                                          .constraint=over_zero);
\endcode

Chaining together simpler transformations is an easy method to produce 
models of arbitrary detail.  In the following example:

\li Nature generated data using a mixture of three Poisson distributions,
with \f$\lambda=2.8\f$, \f$2.0\f$, and \f$1.3\f$.
The resulting model is generated using \ref apop_model_mixture. 
\li Not knowing the true distribution, the analyst 
models the data with a single Poisson\f$(\lambda)\f$ distribution with a prior on \f$\lambda\f$.
The prior selected is a
truncated Normal(2, 1), generated by sending the stock
\ref apop_normal model to the data-space constraint function \ref apop_dconstrain.
\li The \ref apop_update function takes three arguments: the data set, which comes from
draws from the mixture, the prior, and the likelihood. It produces an output model
which, in this case, is a PMF describing a distribution over \f$\lambda\f$, because
a truncated Normal and a Poisson are not conjugate distributions. Knowing that it is
a PMF, the <tt>->data</tt> element holds a set of draws from the posterior.
\li The analyst would like to present an approximation to the posterior in a simpler form,
and so finds the parameters \f$\mu\f$ and \f$\sigma\f$ of the Normal distribution that
is closest to that posterior.

Here is a program---almost a single line of code---that builds the final approximation to the posterior
model from the subcomponents, including draws from Nature and the analyst's prior
and likelihood:

\include ../eg/transform.c

\section mathmethods Model methods

\li\ref apop_estimate : estimate the parameters of the model with data.
\li\ref apop_predict : the expected value function.
\li\ref apop_draw : random draws from an estimated model.
\li\ref apop_p : the probability of a given data set given the model.
\li\ref apop_log_likelihood : the log of \ref apop_p
\li\ref apop_score : the derivative of \ref apop_log_likelihood
\li\ref apop_model_print : write model components to the screen or a file
\li\ref apop_model_copy : duplicate a model
\li\ref apop_model_set_parameters :  Use this to convert a Normal(\f$\mu\f$, \f$\sigma\f$) with unknown \f$\mu\f$ and \f$\sigma\f$ into a Normal(0, 1), for example.
\li\ref apop_model_free
\li\ref apop_model_clear , \ref apop_prep : remove the parameters from a parameterized model. Used infrequently.
\li\ref apop_model_draws : many random draws from an estimated model.



\li\ref apop_update : Bayesian updating
\li\ref apop_model_coordinate_transform : apply an invertible transformation to the data space
\li\ref apop_model_dconstrain : constrain the data space of a model to a subspace. E.g., truncate a Normal distribution so \f$x>0\f$.
\li\ref apop_model_fix_params : hold some parameters constant
\li\ref apop_model_mixture : a linear combination of models
\li\ref apop_model_cross : If \f$(d_1)\f$ has a Normal\f$(\mu, \sigma)\f$ distribution
and \f$d_2\f$ has an independent Poisson\f$(\lambda)\f$ distribution, then \f$(d_1,
d_2)\f$ has an <tt>apop_model_cross(apop_normal, apop_poisson)</tt> distribution with
parameters \f$(\mu, \sigma, \lambda)\f$.


\section modelsettings Settings groups

Describing a statistical, agent-based, social, or physical model in a standardized
form is difficult because every model has significantly different settings. An
MLE requires a method of search (conjugate gradient, simplex, simulated annealing),
while a histogram needs a count of bins before it can be filled with data.

So, the \ref apop_model includes a single list which can hold an arbitrary number of settings groups, like the search specifications for finding the maximum likelihood, a histogram for making random draws, and options about the model type.

Settings groups are automatically initialized with default values when
needed. If the defaults do no harm, then you don't need to think about
these settings groups at all.

Here is an example where a settings group is worth tweaking: the \ref apop_parts_wanted_settings group indicates which parts
of the auxiliary data you want. 

\code
1 apop_model *m = apop_model_copy(apop_ols);
2 Apop_settings_add_group(m, apop_parts_wanted, .covariance='y');
3 apop_model *est = apop_estimate(data, m);
\endcode


Line one establishes the baseline form of the model. Line two adds a settings group
of type \ref apop_parts_wanted_settings to the model. Because the other auxiliary items, like the expected values, default to \c 'n' in this group, this specifies that we want covariance and only covariance. Having stated our preferences, line three does the estimation we want.

Notice that the \c _settings ending to the settings group's name isn't written---macros
make it happen.  The remaining arguments to \c Apop_settings_add_group (if any) follow
the \ref designated syntax of the form <tt>.setting=value</tt>.

There is an \ref apop_model_copy_set macro that adds a settings group when it is first copied, joining up lines one and two above:

\code
apop_model *m = apop_model_copy_set(apop_ols, apop_parts_wanted, .covariance='y');
\endcode

Settings groups are copied with the model, which facilitates chaining
estimations. Continuing the above example, you could re-estimate to get the predicted
values and covariance via:

\code
Apop_settings_set(est, apop_parts_wanted, predicted, 'y');
apop_model *est2 = apop_estimate(data, est);
\endcode

Maximum likelihood search has many settings that could be modified, and so provides
another common example of using settings groups:

\code
apop_model *the_estimate = apop_estimate(data, apop_probit);

//Redo the Probit's MLE search using Newton's Method:
Apop_settings_add_group(the_estimate, apop_mle, .verbose='y', 
                        .tolerance=1e-4, .method="Newton");
apop_model *re_est = apop_estimate(data, the_estimate);
\endcode

To clarify the distinction between parameters and settings, note that parameters are
estimated from the data, often via a maximum likelihood search. In an ML search,
the method of search, the number of bins in a histogram, or the number of steps in a
simulation would be held fixed as the search iterates over possible parameters (and
if these settings do change, then that is a meta-model that could be encapsulated into another
\ref apop_model). As a consequence, parameters are always numeric, while settings may
be any type.

\li \ref Apop_settings_set, for modifying a single setting, doesn't use the designated initializers format.
\li Because the settings groups are buried within the model, debugging them can be a
pain. Here is a documented macro for \c gdb, which you can cut and paste into your <tt>.gdbinit</tt>, that will pull a settings group out of a
model for your inspection. It shouldn't be too difficult to modify this macro for other debuggers.

\code
define get_group
    set $group = ($arg1_settings *) apop_settings_get_grp( $arg0, "$arg1", 0 )
    p *$group
end
document get_group 
Gets a settings group from a model.
Give the model name and the name of the group, like
get_group my_model apop_mle 
and I will set a gdb variable named $group that points to that model, 
which you can use like any other pointer. For example, print the contents with
p *$group
The contents of $group are also printed to the screen as this macro's visible output.
end 
\endcode

That's all you need to know to use a model. For details on writing a new settings group, see \ref settingswriting .

\li\ref Apop_settings_add_group
\li\ref Apop_settings_set
\li\ref Apop_settings_get : get a single element from a settings group.
\li\ref Apop_settings_get_group : get the whole settings group.
*/

/** \page dataones Data format for regression-type models

Regression-type estimations typically require a constant column. That is, the 0th
column of the data is a constant (one), so the parameter \f$\beta_0\f$ is
slightly special in corresponding to a constant rather than a variable.

Some stats packages implicitly assume a constant column, which the user never
sees. This violates the principle of transparency upon which Apophenia is based.
Given a data matrix \f$X\f$ with the estimated parameters
\f$\beta\f$, if the model asserts that the product \f$X\beta\f$ has meaning, then you
should be able to easily calculate that product. With a ones column, a dot product is one line:
<tt>apop_dot(x, your_est->parameters)</tt>; without a ones column, one would basically
have to construct one (using \c gsl_matrix_set_all and \c apop_data_stack).

Each regression-type estimation has one dependent variable and several independent
variables. In the end, we want the dependent variable to be in the vector element. Removing
a column from a <tt>gsl_matrix</tt> and adjusting all subsequent columns is relatively
difficult, because (like most structs built with the aim of very efficient processing) the
struct depends on an equal spacing in memory between each element.

<em> The automatic case</em>

We can resolve both the need for a ones column and for having the dependent column in
the vector at the same time. Given a data set with no vector element and the dependent
variable in the first column of the matrix, we can copy the dependent variable into
the vector and then replace the first column of the matrix with ones. The result fits
all of the above expectations.
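The prep step can be sketched with plain C arrays (a hypothetical illustration; the real routines operate on \ref apop_data sets):

\code
#include <stdio.h>

#define ROWS 3

int main(){
    //a made-up data matrix: dependent variable in column 0
    double m[ROWS][2] = {{5, 1.1}, {7, 2.2}, {9, 3.3}};
    double depvar[ROWS];
    for (int i=0; i < ROWS; i++){
        depvar[i] = m[i][0]; //copy the dependent variable to the vector
        m[i][0] = 1;         //overwrite column 0 with the constant
    }
    printf("%g %g %g\n", depvar[0], depvar[1], depvar[2]);
    printf("%g %g\n", m[0][0], m[0][1]);
}
\endcode

After the loop, the matrix's 0th column is all ones and the dependent variable sits in its own vector, matching the layout described above.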

You as a user merely have to send in an \c apop_data set with a \c NULL vector and the
dependent variable in the first column of the matrix. If the data is coming from the database, then the query
is natural:

\code
apop_data *regression_data = apop_query_to_data("select depvar, indyvar1, indyvar2, indyvar3 from dataset");
apop_model_print(apop_estimate(regression_data, apop_ols));
\endcode

<em> The already-prepped case</em>

If your data has a vector element, then the prep routines won't change anything.
If you don't want to use a constant column, or your data has already been prepped by
an estimation, then this is what you want.

\code
apop_data *regression_data = apop_query_to_mixed_data("vmmm", "select depvar, indyvar1, indvar2, indvar3 from dataset");
apop_model_print(apop_estimate(regression_data, apop_logit));
\endcode
*/


/** \page testpage Tests & diagnostics

Here is the model for all hypothesis testing within Apophenia:

\li Calculate a statistic.
\li Describe the distribution of that statistic.
\li Work out how much of the distribution is (above|below|closer to zero than) the statistic.
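As a stand-alone illustration of the three steps using only the C math library (with a made-up statistic of 1.96 posited to be Normally distributed; the \ref apop_test convenience function wraps this sort of arithmetic):

\code
#include <math.h>
#include <stdio.h>

int main(){
    double stat = 1.96;            //step one: calculate a statistic
    //step two: posit that it is distributed N(0,1)
    //step three: the area of the distribution above the statistic
    double p = erfc(stat/sqrt(2))/2;
    printf("%.3f\n", p);           //prints 0.025
}
\endcode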

There are a handful of named tests that produce a known statistic and then compare to a
known distribution, like \ref apop_test_kolmogorov or \ref apop_test_fisher_exact. For
traditional distributions (Normal, \f$t\f$, \f$\chi^2\f$), use the \ref apop_test convenience
function.

In especially common cases, like the parameters from an OLS regression,
the commonly-associated \f$t\f$ test is included as part of the estimation
output, typically as a row in the \c info element of the output \ref apop_model.


\li\ref apop_test
\li\ref apop_paired_t_test
\li\ref apop_f_test
\li\ref apop_t_test
\li\ref apop_test_anova_independence
\li\ref apop_test_fisher_exact
\li\ref apop_test_kolmogorov
\li\ref apop_estimate_coefficient_of_determination
\li\ref apop_estimate_r_squared

See also these Monte Carlo methods:

\li\ref apop_bootstrap_cov
\li\ref apop_jackknife_cov

To give another example of testing, here is a function that was briefly a part of
Apophenia, but seemed a bit out of place. Here it is as a sample:

\code
// Input: any vector, which will be normalized in place. Output: 1 - the p-value
// for a chi-squared test to answer the question, "with what confidence can I
// reject the hypothesis that the variance of my data is zero?"

double apop_test_chi_squared_var_not_zero(gsl_vector *in){
    Apop_stopif(!in, return NAN, 0, "input vector is NULL. Doing nothing.");
    apop_vector_normalize(in, .normalization_type='s');
    double sum=apop_vector_map_sum(in, gsl_pow_2);
    return gsl_cdf_chisq_P(sum, in->size);
}
\endcode

Or, consider the Rao statistic, 
\f${\partial\over \partial\beta}\log L(\beta)'I^{-1}(\beta){\partial\over \partial\beta}\log L(\beta)\f$
where \f$L\f$ is a model's likelihood function and \f$I\f$ its information matrix. In code:

\code
apop_data * infoinv = apop_model_numerical_covariance(data, your_model);
apop_data * score = &(apop_data){.vector=apop_numerical_gradient(data, your_model)};
apop_data * stat = apop_dot(apop_dot(score, infoinv), score);
\endcode

Given the correct assumptions, this is \f$\sim \chi^2_m\f$, where \f$m\f$ is the dimension of \f$\beta\f$, so the odds of a Type I error given the model is:

\code
double p_value = apop_test(stat, "chi squared", beta->size);
\endcode

<em>Generalized parameter tests</em>

But if your model is not from the textbook, then you have the tools to apply the
above three-step process to the parameters of any \ref apop_model.

\li Model parameters are a statistic, and you know that <tt>apop_estimate(your_data,
        your_model)</tt> will output a model with a <tt>parameters</tt> element.
\li \ref apop_parameter_model will return an \ref apop_model describing 
    the distribution of these parameters.
\li We now have the two ingredients to send to \ref apop_cdf, which takes in a model
and a data point and returns the area of the model's density up to that data point.

Defaults for the parameter models are filled in via bootstrapping or resampling, meaning
that if your model's parameters are decidedly off the Normal path, you can still test
claims about the parameters.

The introductory example in \ref gentle ran a standard OLS regression, whose output includes some
standard hypothesis tests; to conclude, let us go the long way and replicate those results
via the general \ref apop_parameter_model mechanism. The results here will of course be
identical, but the more general mechanism can be used in situations where the standard
models don't apply.

The first part of this program is identical to the introductory program, using \c
ss08pdc.csv if you have downloaded it as per the instructions in \ref gentle, or a
simple sample data set if not. The second half executes the three steps using many
of the above features: one of the inputs to \ref apop_parameter_model (which row of
the parameter set to use) is sent by adding a settings group, we pull that row into
a separate data set using \ref Apop_r, and we set its vector value by referring to it
as the -1st element.

\include ols2.c

Note that the procedure did not assume the model parameters had a certain form. It
queried the model for the distribution of parameter \c agep, and if the model didn't have
a closed-form answer then a distribution via bootstrap would be provided. Then that model
was queried for its CDF. [The procedure does assume a symmetric distribution. Fixing this
is left as an exercise for the reader.] For a model like OLS, this is entirely overkill, 
which is why OLS provides the basic hypothesis tests automatically. But for models
where the distribution of parameters is unknown or has no closed-form solution, this
may be the only recourse.
*/

/** \page histosec Empirical distributions and PMFs (probability mass functions)

The \ref apop_pmf model wraps an \ref apop_data set so it can be read as an empirical
model, with a likelihood function (equal to the associated weight for observed
values and zero for unobserved values), a random number generator (which 
simply makes weighted random draws from the data), and so on.  Setting it up is a
model estimation from data like any other, done via \ref apop_estimate(\c your_data,
\ref apop_pmf).

You have the option of cleaning up the data before turning it into a PMF. For example...

\code
apop_data_pmf_compress(your_data);          //remove duplicates
apop_data_sort(your_data);
apop_vector_normalize(your_data->weights);  //weights sum to one
apop_model *a_pmf = apop_estimate(your_data, apop_pmf);
\endcode

These are largely optional.

\li The CDF is calculated based on the percent of the weights between the zeroth row of the PMF
and the row specified. This generally makes more sense after \ref apop_data_sort.
\li Compression produces a corresponding improvement in efficiency when first calculating
CDFs, but is otherwise not necessary.
\li Sorting or normalizing is not necessary for making draws or getting a likelihood or log likelihood.

It is the \c weights vector that holds the density represented by each row; the rest of the row represents the coordinates of that density. If the input data set has no \c weights segment, then I assume that all rows have equal weight.

For a PMF model, the \c parameters are \c NULL, and the \c data itself
is used for calculation. Therefore, modifying the data post-estimation can break some
internal settings set during estimation. If you modify the data, throw away any existing
PMFs (via \ref apop_model_free) and re-estimate a new one.
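For instance, the random number generator amounts to a weighted draw over the \c weights vector, which can be sketched in plain C (a hypothetical stand-in, not Apophenia's implementation):

\code
#include <stdio.h>

//given normalized weights and a uniform draw in [0, 1), walk the
//cumulative weights and return the index of the selected row.
int weighted_draw(double const *weights, int n, double uniform){
    double cum = 0;
    for (int i=0; i < n; i++){
        cum += weights[i];
        if (uniform < cum) return i;
    }
    return n-1;
}

int main(){
    double w[] = {0.1, 0.2, 0.7};
    printf("%d %d %d\n", weighted_draw(w, 3, 0.05),
                         weighted_draw(w, 3, 0.25),
                         weighted_draw(w, 3, 0.95)); //prints 0 1 2
}
\endcode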

\section histocompare Comparing histograms

Using \ref apop_data_pmf_compress puts the data into one bin for each unique value in
the data set.  You may instead want bins of fixed width, in the style of a histogram,
which you can get via \ref apop_data_to_bins. It requires a bin specification. If you
send a \c NULL binspec, then the offset is zero and the bin size is big enough to ensure
that there are \f$\sqrt{N}\f$ bins from minimum to maximum. The binspec will be added
as a page to the data set, named <tt>"<binspec>"</tt>. See the \ref apop_data_to_bins
documentation on how to write a custom bin spec.
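The default bin arithmetic can be sketched as follows (made-up range and count; see \ref apop_data_to_bins for the real routine and custom binspecs):

\code
#include <math.h>
#include <stdio.h>

int main(){
    double min = 0, max = 10;  //the data's range, assumed here
    int n = 100;               //number of observations
    int bins = sqrt(n);        //a NULL binspec gives sqrt(N) bins
    double width = (max - min)/bins;
    double x = 7.3;            //an observation to place
    printf("bin %d of %d\n", (int)((x - min)/width), bins); //bin 7 of 10
}
\endcode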


There are a few ways of testing the claim that one distribution equals another, typically an empirical PMF versus a smooth theoretical distribution. In each case, you will need two distributions based on the same binspec.

For example, if you do not have a prior binspec in mind, then you can use the one generated by the first call to the histogram binning function to make sure that the second data set is in sync:

\code
apop_data_to_bins(first_set, NULL);
apop_data_to_bins(second_set, apop_data_get_page(first_set, "<binspec>"));
\endcode

You can use \ref apop_test_kolmogorov or \ref apop_histograms_test_goodness_of_fit to generate the appropriate statistics from the pairs of bins.

Kernel density estimation will produce a smoothed PDF. See \ref apop_kernel_density for details.
Or, use \ref apop_vector_moving_average for a simpler smoothing method.


\li\ref apop_data_pmf_compress() : merge together redundant rows in a data set before calling 
                \ref apop_estimate(\c your_data, \ref apop_pmf); optional.
\li\ref apop_vector_moving_average() : smooth a vector (e.g., <tt>your_pmf->data->weights</tt>) via moving average.
\li\ref apop_histograms_test_goodness_of_fit() : goodness-of-fit via \f$\chi^2\f$ statistic
\li\ref apop_test_kolmogorov() : goodness-of-fit via Kolmogorov-Smirnov statistic
\li\ref apop_kl_divergence() : measure the information loss from one (typically empirical) distribution to another distribution.
*/

/** \page maxipage Optimization

This section includes some notes on the maximum likelihood routine. As in the section
on writing models above, if a model has a \c p or \c log_likelihood method but no \c
estimate method, then calling \c apop_estimate(your_data, your_model) executes the
default estimation routine of maximum likelihood.

If you are not a statistician, then there are a few things you will need to keep in
mind:

\li Physicists, pure mathematicians, and the GSL minimize; economists, statisticians, and
Apophenia maximize. If you are doing a minimization, be sure that your function returns minus the objective
function's value.

\li The overall setup is about estimating the parameters of a model with data. The user
provides a data set and an unparameterized model, and the system tries parameterized
models until one of them is found to be optimal. The data is fixed.  The optimization
tries a series of parameterized models, searching for the one that is most likely. In
a non-stats setting, you may have \c NULL data.

\li Because the unit of analysis is a parameterized model,
not just parameters, you need to have an \ref apop_model
wrapping your objective function.

This example, to be discussed in detail below, optimizes 
Rosenbrock's banana function, \f$(1-x)^2+ s(y - x^2)^2\f$, where the
scaling factor \f$s\f$ is fixed ahead of time, say at 100.

\include ../eg/banana.c

The \c banana function returns a single number to be minimized.  You will need to write an
\ref apop_model to send to the optimizer, which is a two step process: write a log
likelihood function wrapping the real objective function (\c ll), and a model that uses that
log likelihood (\c b).

  \li The <tt>.vsize=2</tt> part of the declaration of \c b on the second
line of <tt>main()</tt> specifies that the model's parameters are a vector of
size two.  That is, the list of <tt>double</tt>s to send to \c banana is set in
<tt>in->parameters->vector->data</tt>.  
  \li The \c more element of the \ref apop_model
structure is designed to hold any arbitrary structure of size \c more_size, which
is useful for models that require additional constants or other settings, like the
<tt>coeff_struct</tt> here. See \ref settingswriting for more on handling model settings.
  \li Statisticians want the covariance and basic tests about the parameters. 
This line shuts off all auxiliary calculations:
\code
Apop_settings_add_group(your_model, apop_parts_wanted);
\endcode
See the documentation for \ref apop_parts_wanted_settings for details about how this
works.  It can also offer quite the speedup: especially for high-dimensional problems,
finding the covariance matrix without any additional information can take dozens of evaluations
of the objective function for each evaluation that is part of the search itself.
  \li MLEs have an especially large number of parameter tweaks that could be made;
see the \ref apop_mle_settings page.
  \li As a useful diagnostic, you can add a \c NULL \ref apop_data set to the MLE
settings in the <tt>.path</tt> slot, and it will be allocated and filled with the
sequence of points tried by the optimizer.
  \li The program has some extras above and beyond the necessary: it uses two methods
(notice how easy it is to re-run an estimation with an alternate method, but the syntax
for modifying a setting differs from the initialization syntax) and checks that the
results are accurate.


\section constr Setting Constraints

The problem is that the parameters of a function must not take on certain values,
either because the function is undefined for those values or because parameters with
certain values would not fit the real-world problem.

If you give the optimizer an unconstrained likelihood function plus a separate constraint
function, \ref apop_maximum_likelihood will combine them to a function that is continuous
at the constraint boundary, but which is guaranteed to never have an optimum outside of the constraint.

A constraint function must do three things:
\li If the constraint does not bind (i.e. the parameter values are OK), then it must return zero.
\li If the constraint does bind, it must return a penalty that indicates how far the parameters are from meeting the constraint.
\li If the constraint does bind, it must set a return vector that the likelihood function can take as a valid input. The penalty at this returned value must be zero.

The idea is that if the constraint returns zero, the log likelihood function will
return the log likelihood as usual, and if not, it will return the log likelihood at
the constraint's return vector minus the penalty. To give a concrete example, here
is a constraint function that will ensure that both parameters of a two-dimensional
input are greater than zero, and that their sum is greater than two. As with the
constraints for many of the models that ship with Apophenia, it is a wrapper for \ref
apop_linear_constraint.

\code
static long double greater_than_zero_constraint(apop_data *data, apop_model *v){
    static apop_data *constraint = NULL;
    if (!constraint) constraint= apop_data_falloc((3,3,2), 0,  1, 0,   //0 < 1x + 0y
                                                           0,  0, 1,   //0 < 0x + 1y
                                                           2,  1, 1);  //2 < 1x + 1y
    return apop_linear_constraint(v->parameters->vector, constraint, 1e-3);
}
\endcode
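The same three requirements appear in this stand-alone sketch for a single scalar parameter (hypothetical code, independent of \ref apop_linear_constraint):

\code
#include <stdio.h>

//enforce x > 0: return zero if the constraint doesn't bind; else
//return a penalty and move x to a valid point with zero penalty.
double constrain_positive(double *x, double margin){
    if (*x > 0) return 0;
    double penalty = margin - *x;
    *x = margin;
    return penalty;
}

int main(){
    double x = -2;
    printf("%g\n", constrain_positive(&x, 1e-3)); //penalty: 2.001
    printf("%g\n", x);                            //moved to 0.001
    printf("%g\n", constrain_positive(&x, 1e-3)); //now valid: 0
}
\endcode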

\li\ref apop_linear_constraint()

\section simanneal Notes on simulated annealing

For convex optimizations, methods like conjugate gradient search work well, and
for relatively smooth optimizations, the Nelder-Mead simplex algorithm is a good
choice. For situations where the surface being searched may have several local optima
and be otherwise badly behaved, there is simulated annealing.

Simulated annealing is a controlled random walk.  As with the other methods, the
system tries a new point, and if it is better, switches. Initially, the system is
allowed to make large jumps, and then with each iteration, the jumps get smaller,
eventually converging. Also, there is some decreasing probability that if the new
point is less likely, it will still be chosen. Simulated annealing is best for
situations where there may be multiple local optima. Early in the random walk, the
system can readily jump from one to another; later it will fine-tune its way toward the
optimum. The number of points tested is determined by the parameters of the cooling
schedule, not the values returned by the likelihood function.  If you know your
function is globally convex (as are most standard probability functions), then this
method is overkill.
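The acceptance rule at the heart of the walk is a generic Metropolis-style criterion, sketched here as an illustration of the idea (not Apophenia's internal code):

\code
#include <math.h>
#include <stdio.h>

//a worse point (delta > 0) is accepted with probability exp(-delta/T),
//so jumps to less likely points are common early (high temperature)
//and rare as the system cools.
double accept_prob(double delta, double temperature){
    return delta <= 0 ? 1 : exp(-delta/temperature);
}

int main(){
    printf("%.3f\n", accept_prob(1, 10));  //hot: 0.905
    printf("%.3f\n", accept_prob(1, 0.1)); //cold: 0.000
}
\endcode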


\section mlfns Useful functions

\li\ref apop_estimate_restart : Restarting an MLE with different settings can improve results.
\li\ref apop_maximum_likelihood : Rarely called directly. If a model has no \c estimate element, call \ref apop_estimate to prep the model and run an MLE.
\li\ref apop_model_numerical_covariance
\li\ref apop_numerical_gradient
*/


/** \page moreasst Assorted

Some functions for missing data:

\li\ref apop_data_listwise_delete
\li\ref apop_ml_impute

A few more descriptive methods:

\li\ref apop_matrix_pca : Principal component analysis
\li\ref apop_anova : One-way or two-way ANOVA tables
\li\ref apop_rake : Iterative proportional fitting on large, sparse tables


General utilities:

\li\ref Apop_stopif : Apophenia's error-handling and warning-printing macro. 
\li\ref apop_opts : the global options
\li\ref apop_system : a printf-style wrapper around the standard \c system function.

A few more math utilities:

\li\ref apop_matrix_is_positive_semidefinite
\li\ref apop_matrix_to_positive_semidefinite
\li\ref apop_generalized_harmonic
\li\ref apop_multivariate_gamma
\li\ref apop_multivariate_lngamma
\li\ref apop_rng_alloc

*/

/** \page mingw MinGW

Minimalist GNU for Windows is indeed minimalist: it is not a full POSIX subsystem, and provides no package manager. Therefore, you will have to make some adjustments and install the dependencies yourself.

Matt P. Dziubinski successfully used Apophenia via MinGW; here are his instructions (with edits by BK):

\li get libregex (the ZIP file) from:
http://sourceforge.net/project/showfiles.php?group_id=204414&package_id=306189
\li get libintl (three ZIP files) from:
 http://gnuwin32.sourceforge.net/packages/libintl.htm .
 download "Binaries", "Dependencies", "Developer files"
\li follow "libintl" steps from:
http://kayalang.org/download/compiling/windows

\li Modify \c Makefile, adding -lpthread to AM_CFLAGS (removing -pthread) and -lregex to AM_CFLAGS and LIBS

\li Now compile the main library:
\code
make
\endcode

\li Finally, put one more expected directory in place and install:
\code
mkdir -p -- "/usr/local/Lib/site-packages"
make install
\endcode

\li You will get the usual warning about library paths, and may have to take the specified action:
\code
----------------------------------------------------------------------
Libraries have been installed in:
  /usr/local/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
  - add LIBDIR to the `PATH' environment variable
    during execution
  - add LIBDIR to the `LD_RUN_PATH' environment variable
    during linking
  - use the `-LLIBDIR' linker flag

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
\endcode
*/


/* optionaldetails Implementation of optional arguments  [this section ignored by doxygen]
Optional and named arguments are among the most commonly commented-on features of Apophenia, so this page goes into full detail about the implementation. 

To use these features, see the all-you-really-need summary at the \ref designated
page. For a background and rationale, see the blog entry at http://modelingwithdata.org/arch/00000022.htm . 

I'll assume you've read both links before continuing.

OK, now that you've read the how-to-use summary and the discussion of how optional and named arguments can be constructed in C, this page will show how they are done in Apophenia. The level of detail should be sufficient to implement them in your own code if you so desire.

There are three components to the process of generating optional arguments as implemented here:
\li Produce a \c struct whose elements match the arguments to the function.
\li Write a wrapper function that takes in the struct, unpacks it, and calls the original function.
\li Write a macro that makes the user think the wrapper function is the real thing.

None of these steps are really rocket science, but there is a huge amount of redundancy. 
Apophenia includes some macros that reduce the boilerplate redundancy significantly. There are two layers: the C-standard code, and the script that produces the C-standard code.

We'll begin with the C-standard header file:
\code 
#ifdef APOP_NO_VARIADIC
 void apop_vector_increment(gsl_vector * v, int i, double amt);
#else
 void apop_vector_increment_base(gsl_vector * v, int i, double amt);
 apop_varad_declare(void, apop_vector_increment, gsl_vector * v; int i; double amt);
#define apop_vector_increment(...) apop_varad_link(apop_vector_increment, __VA_ARGS__)
#endif
\endcode

First, there is an if/else that allows the system to degrade gracefully
if you are sending C code to a parser like swig, whose goals differ
too much from straight C compilation for this to work. Set \c
APOP_NO_VARIADIC to produce a plain function with no variadic support.

Else, we begin the above steps. The \c apop_varad_declare line expands to the following:

\code
typedef struct { 
    gsl_vector * v; int i; double amt ; 
} variadic_type_apop_vector_increment; 

void variadic_apop_vector_increment(variadic_type_apop_vector_increment varad_in);
  \endcode

So there's the ad-hoc struct and the declaration for the wrapper
function. Notice how the arguments to the macro had semicolons, like a
struct declaration, rather than commas, because the macro does indeed
wrap the arguments into a struct.

  Here is what the \c apop_varad_link would expand to:
  \code
#define apop_vector_increment(...) variadic_apop_vector_increment((variadic_type_apop_vector_increment) {__VA_ARGS__})
  \endcode
That gives us part three: a macro that lets the user think that they are
making a typical function call with a set of arguments, but wraps what
they type into a struct.

Now for the code file where the function is defined. Again, there is an \c APOP_NO_VARIADIC wrapper. Inside the interesting part, we find the wrapper function to unpack the struct that comes in.

\code
#ifdef APOP_NO_VARIADIC
 void apop_vector_increment(gsl_vector * v, int i, double amt){
#else
apop_varad_head( void , apop_vector_increment){
    gsl_vector * apop_varad_var(v, NULL);
    Apop_stopif(!v, return, 0, "You sent me a NULL vector.");
    int apop_varad_var(i, 0);
    double apop_varad_var(amt, 1);
    apop_vector_increment_base(v, i, amt);
}

 void apop_vector_increment_base(gsl_vector * v, int i, double amt){
#endif
	v->data[i * v->stride]	+= amt;
}
\endcode

The 
\c apop_varad_head macro reduces redundancy, and will expand to
\code
void variadic_apop_vector_increment (variadic_type_apop_vector_increment varad_in)
\endcode

The function with this header thus takes in a single struct, and for every variable, there is a line like
\code
    double apop_varad_var(amt, 1);
\endcode
which simply expands to:
\code
    double amt = varad_in.amt ? varad_in.amt : 1;
\endcode
Thus, the macro pulls each variable out of the struct and declares it locally with
its default, so there must be one such declaration line for each argument. Beyond
these required declarations, you can be creative: include sanity checks, adjust
the inputs, unpack without the macro, and so on. That is, this parent function does all of the bookkeeping,
checking, and introductory shunting, so the base function can focus on the math. Finally,
the introductory section calls the base function.

The setup goes out of its way to leave the \c _base function in the public namespace,
so that those who would prefer speed to bounds-checking can simply call that function
directly, using standard notation. You could eliminate this feature by merging
the two functions.
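Here is the same compound-literal trick reduced to a stand-alone toy, with hypothetical names and no Apophenia macros: a struct of arguments, a base function that fills in a default, and a variadic macro hiding the struct from the caller.

\code
#include <stdio.h>

typedef struct { double base; double increment; } incr_args;

double incr_base(incr_args in){
    double amt = in.increment ? in.increment : 1; //default increment: 1
    return in.base + amt;
}

#define incr(...) incr_base((incr_args){__VA_ARGS__})

int main(){
    printf("%g\n", incr(3));                //increment defaults to 1: prints 4
    printf("%g\n", incr(3, .increment=10)); //prints 13
}
\endcode

Unset struct elements are zero-initialized, which is why a zero value can stand in for "use the default"; this also explains why a deliberate zero argument sometimes needs special handling.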


<b>The m4 script</b>

The above is all you need to make this work: the varad.h file, and the above structures. But there is still a lot of redundancy, which can't be eliminated by the plain C preprocessor.

Thus, in Apophenia's code base (the one you'll get from checking out the git repository, not the gzipped distribution that has already been post-processed) you will find a pre-preprocessing script that converts a few markers to the above form. Here is the code that will expand to the above C-standard code:

\code
//header file
APOP_VAR_DECLARE void apop_vector_increment(gsl_vector * v, int i, double amt);

//code file
APOP_VAR_HEAD void apop_vector_increment(gsl_vector * v, int i, double amt){
    gsl_vector * apop_varad_var(v, NULL);
    Apop_stopif(!v, return, 0, "You sent me a NULL vector.");
    int apop_varad_var(i, 0);
    double apop_varad_var(amt, 1);
APOP_VAR_END_HEAD
	v->data[i * v->stride]	+= amt;
}
\endcode

It is obviously much shorter. The declaration line is actually a C-standard declaration with the \c APOP_VAR_DECLARE preface, so you don't have to remember when to use semicolons. The function itself looks like a single function, but there is again a marker before the declaration line, and the introductory material is separated from the main matter by the \c APOP_VAR_END_HEAD line. Done right, drawing a line between the introductory checks or initializations and the main function can improve readability.

The m4 script inserts a <tt>return function_base(...)</tt> at the end of the header
function, so you don't have to. If you want to call the function before the last line, you
can do so explicitly, as in the expansion above, and add a bare <tt>return;</tt> to
guarantee that the call to the base function that the m4 script will insert won't ever be
reached.

One final detail: some types have commas in them---notably the argument lists of function pointers. Because commas get turned into semicolons, and m4 isn't a real parser, there is an exception built in: you will have to replace those commas with exclamation marks in the header file (only). E.g.,

\code
APOP_VAR_DECLARE apop_data * f_of_f(apop_data *in, void *param, int n, double (*fn_d)(double ! void * !int));
\endcode

m4 is POSIX standard, so even if you can't read the script, you have the program needed to run it. For example, if you name it \c prep_variadics.m4, then run
\code
m4 prep_variadics.m4 myfile.m4.c > myfile.c
\endcode
*/


/**
\page gentle A quick overview

This is a "gentle introduction" to the Apophenia library. It is intended 
to give you some initial bearings on the typical workflow and the concepts and tricks that
the manual pages assume you are familiar with.

If you want to install Apophenia now so you can try the samples on this page, see the \ref setup page.

An outline of this overview:

\li Apophenia fills a space between traditional C libraries and stats packages. 
\li The \ref apop_data structure represents a data set (of course). Data sets are inherently complex,
but there are many functions that act on \ref apop_data sets to make life easier.
\li The \ref apop_model encapsulates the sort of actions one would take with a model, like estimating model parameters or predicting values based on new inputs.
\li Databases are great, and a perfect fit for the sort of paradigm here. Apophenia
provides functions to make it easy to jump between database tables and \ref apop_data sets.

<em> The opening example</em>

Setting aside the more advanced applications and model-building tasks, let us begin with
the workflow of a typical fitting-a-model project using Apophenia's tools:

\li Read the raw data into the database using \ref apop_text_to_db.
\li Use SQL queries handled by \ref apop_query to massage the data as needed.
\li Use \ref apop_query_to_data to pull some of the data into an in-memory \ref apop_data set.
\li Call a model estimation such as \code apop_estimate (data_set, apop_ols)\endcode  or \code apop_estimate (data_set, apop_probit)\endcode to fit parameters to the data. This will return an \ref apop_model with parameter estimates.
\li Interrogate the returned estimate, by dumping it to the screen with \ref apop_model_print, sending its parameters and variance-covariance matrices to additional tests (the \c estimate step runs a few for you), or send the model's output to be input to another model.

Here is an example of most of the above steps which you can compile and run, to be discussed in detail below.

The program relies on the U.S. Census's American Community Survey public use microdata for DC 2008, which you can get from the command line via:

\code
wget https://raw.github.com/rodri363/tea/master/demo/ss08pdc.csv
\endcode
or by pointing your browser to that address and saving the file.

The program:
\code
#include <apop.h>

int main(){
    apop_text_to_db(.text_file="ss08pdc.csv", .tabname="dc");
    apop_data *data = apop_query_to_data("select log(pincp+10) as log_income, agep, sex "
                    "from dc where agep+ pincp+sex is not null and pincp>=0");
    apop_model *est = apop_estimate(data, apop_ols);
    apop_model_print(est);
}
\endcode

If you saved the code to <tt>census.c</tt> and don't have a \ref makefile or other
build system, then you can compile it with

\code
gcc census.c -std=gnu99 -lapophenia -lgsl -lgslcblas -lsqlite3 -o census
\endcode

or 

\code
clang census.c -lapophenia -lgsl -lgslcblas -lsqlite3 -o census
\endcode

and then run it with <tt>./census</tt>. This compile line will work on any system with all the requisite tools,
but for full-time work with this or any other C library, you will probably want to write a \ref makefile.

The results are unremarkable---age has a positive effect on income, and sex
(1=male, 2=female) has a negative effect---but the program does give us some lines of
code to dissect.

The first two lines in \c main() make use of a database.  
I'll discuss the value of the database step more at the end of this page, but for now,
note that several functions---\ref apop_query and \ref apop_query_to_data being the
ones you will use most often---let you talk to and pull data from either an SQLite
or mySQL/mariaDB database. The database is a natural
place to do data processing like renaming variables, selecting subsets, and transforming values.

<em> Designated initializers</em>

Like this line,

\code
apop_text_to_db(.text_file="data", .tabname="d");
\endcode

many Apophenia functions accept named, optional arguments.  To give another example,
the \ref apop_data set has the usual row and column numbers, but also row and column
names. So you should be able to refer to a cell by any combination of name or number;
for the data set you read in above, which has column names, all of the following work:

\code
x = apop_data_get(data, 2, 3); //observation 2, column 3
x = apop_data_get(data, .row=2, .colname="sex"); // same
apop_data_set(data, 2, 3, 1);
apop_data_set(data, .colname="sex", .row=2, .val=1);
\endcode

Default values mean that the \ref apop_data_get, \ref apop_data_set, and \ref apop_data_ptr functions handle matrices, vectors, and scalars sensibly:
\code
//Let v be a hundred-element vector:
apop_data *v = apop_data_alloc(100);
[fill with data here]
double x1 = apop_data_get(v, 10);
apop_data_set(v, 2, .val=x1);

//A 100x1 matrix behaves like a vector
apop_data *m = apop_data_alloc(100, 1);
[fill with data here]
double m1 = apop_data_get(m, 1);

//let s be a scalar stored in a 1x1 apop_data set:
apop_data *s = apop_data_alloc(1);
double *scalar = apop_data_ptr(s);
\endcode

These conveniences may be new to users of less user-friendly C libraries, but they
fully conform to the C standard (ISO/IEC 9899:2011). See the \ref designated page for details.
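The mechanism behind such named, optional arguments can be sketched in a few lines of standard C, with no Apophenia calls; the names here (\c powtab and friends) are hypothetical, for illustration only. A variadic macro wraps the caller's designated initializers in a compound literal, and the receiving function fills in defaults for any field left at zero:

\code
#include <stdio.h>

//Hypothetical example: named, optional arguments via a struct,
//a compound literal, and a variadic macro.
typedef struct { double x; int power; } powtab_args;

double powtab_base(powtab_args in){
    int power = in.power ? in.power : 2;  //default: square the input
    double out = 1;
    for (int i=0; i < power; i++) out *= in.x;
    return out;
}

#define powtab(...) powtab_base((powtab_args){__VA_ARGS__})

int main(){
    printf("%g\n", powtab(.x=3));            //default power: 3^2 = 9
    printf("%g\n", powtab(.x=2, .power=5));  //2^5 = 32
}
\endcode

Fields not named in the compound literal are zero-initialized per the C standard, which is how the wrapper distinguishes "sent" from "not sent" when filling in defaults.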


\section gentle_data apop_data

A lot of real-world data processing is about quotidian annoyances about text versus
numeric data or dealing with missing values, and the \ref apop_data set and its
many support functions are intended to make data processing in C easy. Some users of
Apophenia use the library only for its \ref apop_data set and associated functions. See
\ref dataoverview for extensive notes on using the structure.

The structure includes seven parts:

\li a vector,
\li a matrix,
\li a grid of text elements,
\li a vector of weights,
\li names for everything: row names, a vector name, matrix column names, text names,
\li a link to a second page of data, and
\li an error marker

This is not a generic and abstract ideal, but is the sort of mess that real-world data sets look like. For
example, here is some data for a weighted OLS regression. It includes the outcome
variable in the vector, independent variables in the matrix and text grid,
replicate weights, and column names in bold labeling the variables:

\htmlinclude apop_data_fig.html
\latexinclude apop_data_fig.tex

Apophenia's functions generally assume that one row across all of these elements
describes a single observation or data point.

See above for some examples of getting and setting individual elements.

Also, \ref apop_data_get, \ref apop_data_set, and \ref apop_data_ptr consider the vector to be the -1st column,
so using the data set in the figure, <tt>apop_data_get(sample_set, .row=0, .col=-1) == 1</tt>.

<em> Reading in data</em>

As per the example above, use \ref apop_text_to_data or \ref apop_text_to_db and then \ref apop_query_to_data.

<em> Subsets</em>

There are many macros to get views of subsets of the data. Each generates a disposable
wrapper around the base data: once the variable goes out of scope, the wrapper
disappears, but modifications made to the data in the view are modifications to the
base data itself.

\include simple_subsets.c

All of these slicing routines are macros, because they generate several
background variables in the current scope (something a function can't do). The traditional
custom is to put macro names in all caps, like \c APOP_DATA_ROWS, which to modern
sensibilities looks like yelling. The custom has a logic: there are ways to hang
yourself with macros, so it is worth distinguishing them typographically. 
Apophenia tones it down by capitalizing only the first letter.
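Why a macro is needed at all can be seen in a toy version in plain C (hypothetical \c toy_ names, not Apophenia's internals): a function cannot declare a new variable in its caller's scope, but a macro can, and that new variable is a lightweight view pointing into the base data.

\code
#include <stdio.h>

typedef struct { double *data; int size; } toy_view;

//A function cannot declare a variable in its caller's scope, but a macro can:
#define Toy_subvector(name, arr, offset, len) \
    toy_view name = {.data=(arr)+(offset), .size=(len)}

int main(){
    double base[6] = {0, 1, 2, 3, 4, 5};
    Toy_subvector(middle, base, 2, 3);  //view onto elements 2, 3, 4
    middle.data[0] = 20;                //modifies the base data itself
    printf("%g %d\n", base[2], middle.size);
}
\endcode

Because the view holds a pointer into the base array rather than a copy, writes through the view are writes to the base data, matching the semantics described above.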

<em> Basic manipulations</em>

See \ref dataoverview for a list of many other manipulations of data sets, such as 
\ref apop_data_listwise_delete for quick-and-dirty removal of observations with <tt>NaN</tt>s,
\ref apop_data_split / \ref apop_data_stack,
or \ref apop_data_sort to sort all elements by a single column.

<em> Apply and map</em>

If you have an operation of the form <em>for each element of my data set, call this
function</em>, then you can use \ref apop_map to do it. You could basically do everything you
can do with an apply/map function via a \c for loop, but the apply/map approach is clearer
and more fun. Also, if you call OpenMP's <tt>omp_set_num_threads(N)</tt> with any \c N
greater than 1 (the default on most systems is the number of CPU cores), then the work
of mapping will be split across multiple CPU threads.  See \ref mapply for a number
of examples.

<em> Text</em>

String handling in C usually requires some tedious pointer and memory handling, but the functions 
to put strings into the text grid in the \ref apop_data structure and get them out
again will do the pointer shunting for you.  The \ref apop_text_alloc function is
really a realloc function: you can use it to resize the text grid as necessary. The
\ref apop_text_set function will write a single string to the grid, though you may be
using \ref apop_query_to_text or \ref apop_query_to_mixed_data to read in an entire
data set at once. Functions that act on entire data sets, like \ref apop_data_rm_rows,
handle the text part as well.

The text grid for \c your_data has <tt>your_data->textsize[0]</tt> rows and <tt>your_data->textsize[1]</tt> columns. If you are using only the functions to this point, then empty elements are a blank string (<tt>""</tt>), not \c NULL.
For reading individual elements, refer to the \f$(i,j)\f$th text element via <tt>your_data->text[i][j]</tt>.
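As a sketch of those semantics in self-contained plain C (the \c toy_ functions are stand-ins for illustration, not Apophenia's implementation), a text grid whose empty cells hold blank strings rather than \c NULL might look like:

\code
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char ***text; int textsize[2]; } toy_data;

//Allocate a rows x cols grid whose cells are blank strings, not NULL.
toy_data *toy_text_alloc(int rows, int cols){
    toy_data *d = malloc(sizeof *d);
    d->textsize[0] = rows;
    d->textsize[1] = cols;
    d->text = malloc(rows * sizeof(char**));
    for (int i=0; i < rows; i++){
        d->text[i] = malloc(cols * sizeof(char*));
        for (int j=0; j < cols; j++)
            d->text[i][j] = strdup("");
    }
    return d;
}

//Write a single string to cell (i, j), freeing whatever was there.
void toy_text_set(toy_data *d, int i, int j, char const *s){
    free(d->text[i][j]);
    d->text[i][j] = strdup(s);
}

int main(){
    toy_data *d = toy_text_alloc(2, 2);
    toy_text_set(d, 0, 1, "sex");
    printf("[%s] [%s]\n", d->text[0][1], d->text[1][0]);
}
\endcode

The blank-string convention means reading code can call <tt>strlen</tt> or <tt>strcmp</tt> on any cell without a \c NULL check first.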

<em> Errors</em>

Many functions will set the <tt>error</tt> element of the \ref apop_data structure being operated on if anything goes wrong. You can use this to halt the program or take corrective action:

\code 
apop_data *the_data = apop_query_to_data("select * from d");
Apop_stopif(!the_data || the_data->error, exit(1), 0, "Trouble querying the data");
\endcode 

<em> The whole structure</em>

Here is a diagram of all of Apophenia's structures and how they
relate. It is taken from this
<a href="http://modelingwithdata.org/pdfs/cheatsheet.pdf">cheat sheet</a> on general C and SQL use (2 page PDF).

\image html structs.png  width="100%"
\image latex structs.png width=18cm

All of the elements of the \ref apop_data structure are laid out at middle-left. You have
already met the vector, matrix, weights, and text grid.

The diagram shows the \ref apop_name structure, which has received little mention so far because names
basically take care of themselves. A query will bring in column names (and row names if you set <tt>apop_opts.db_name_column</tt>), or use \ref apop_data_add_names to add names to your data
set and \ref apop_name_stack to copy from one data set to another.

The \ref apop_data structure has a \c more element, for when your data is best expressed
in more than one page of data. Use \ref apop_data_add_page, \ref apop_data_rm_page,
and \ref apop_data_get_page. Output routines will sometimes append an extra page of
auxiliary information to a data set, such as pages named <tt>\<Covariance\></tt> or
<tt>\<Factors\></tt>. The angle-brackets indicate a page that describes the data set
but is not a part of it (so an MLE search would ignore that page, for example).


Now let us move up the structure diagram to the \ref apop_model structure. 

\section gentle_model apop_model

Even restricting ourselves to the most basic operations, there are a lot of things that 
we want to do with our models: use a data set to estimate the parameters of a model (like the mean and
variance of a Normal distribution), or draw random numbers, or show the
expected value, or show the expected value of one part of the data given fixed values
for the rest of it. The \ref apop_model is intended to encapsulate most of these desires
into one object, so that models can easily be swapped around, modified to create new models, 
compared, and so on.

From the figure above, you can see that the \ref apop_model structure 
includes a number of informational items, key being the \c parameters, \c data, and
\c info elements; a list of settings to be discussed below; and a set of procedures
for many operations.  Its contents are not (entirely) arbitrary: the overall intent
and the theoretical basis for what is and is not included in an \ref apop_model are
described in this <a href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf">U.S.
Census Bureau research report</a>.

There are helper functions that will allow you to avoid dealing with the model
internals. For example, the \ref apop_estimate helper function means you never have
to look at the model's \c estimate method (if it even has one), and you will simply
pass the model to a function, as with the above form:

\code
    apop_model *est = apop_estimate(data, apop_ols);
\endcode

\li Apophenia ships with a broad set of models, like \ref apop_ols, \ref apop_dirichlet,
    \ref apop_loess, and \ref apop_pmf (probability mass function); see the full list on <a href="http://apophenia.info/group__models.html">the models documentation page</a>. You would fit
any of them with an \ref apop_estimate call, with the appropriate model as the second input.
\li The models that ship with Apophenia, like \ref apop_ols, include the procedures and some metadata, but are of course not yet estimated using a data set (i.e., <tt>data == NULL</tt>, <tt>parameters == NULL</tt>). The line above generated a new
model, \c est, which is identical to the base OLS model but has estimated parameters
(and covariances, and basic hypothesis tests, a log likelihood, \f$AIC_c\f$, \f$BIC\f$, et cetera), and a \c data pointer to the \ref apop_data set used for estimation. 
\li You will mostly use the models by passing them as inputs to 
functions like \ref apop_estimate, \ref apop_draw, or \ref apop_predict; more examples below.
Other than \ref apop_estimate, most require a parameterized model like \c est. After all, it doesn't make sense to
draw from a Normal distribution until its mean and standard deviation are specified.
\li If you know what the parameters should be, for most models use \ref apop_model_set_parameters. E.g.

\code
apop_model *std_normal = apop_model_set_parameters(apop_normal, 0, 1);
apop_data *a_thousand_normals = apop_model_draws(std_normal, 1000);

apop_model *poisson = apop_model_set_parameters(apop_poisson, 1.5);
apop_data *a_thousand_waits = apop_model_draws(poisson, 1000);
\endcode

\li You can use \ref apop_model_print to print the various elements to screen.
\li You can combine and transform models with functions such as \ref
apop_model_fix_params, \ref apop_model_coordinate_transform, or \ref
apop_model_mixture. Each of these functions produces a new model, which can be estimated,
re-combined, or otherwise used like any other model.

\code
//A helper function to check whether a data point is strictly positive
double over_zero(apop_data *in, apop_model *m){ return apop_data_get(in) > 0; }

//Generate a truncated Normal distribution by adding a data constraint:
apop_model *truncated_normal= apop_model_dconstrain(.base_model=apop_normal,
                                                    .constraint=over_zero);

//Get the cross product of that and a free Normal.
apop_model *cross = apop_model_cross(apop_normal, truncated_normal);

//Given assumed data, estimate the parameters of the cross product.
apop_model *xest = apop_estimate(assumed_data, cross);

//Assuming more data, use the fitted cross product as the prior for a Normal distribution.
apop_model *posterior = apop_update(moredata, .prior=xest, .likelihood=apop_normal);

//Assuming more data, use the posterior as the prior for another updating round.
apop_model *post2 = apop_update(moredata2, .prior=posterior, .likelihood=apop_normal);
\endcode

\li Writing your own models won't be covered in this introduction, but it can be easy to
copy and modify the procedures of an existing model to fit your needs. When in doubt, delete a procedure, because any procedures that are missing will have
defaults filled when used by functions like \ref apop_estimate (which uses \ref
apop_maximum_likelihood) or \ref apop_cdf (which uses integration via random draws). See \ref modeldetails for details.
\li There's a simple rule of thumb for remembering the order of the arguments to most of
Apophenia's functions, including \ref apop_estimate : the data always comes first.

<em> Settings</em>

How many bins are in a histogram? At what tolerance does the maximum likelihood
search end? What are the models being combined in an \ref apop_mixture distribution?

Apophenia organizes settings in <em>settings groups</em>, which are then attached
to models.  In the following snippet demonstrating Bayesian updating, we specify a Beta distribution prior.  If the
likelihood function were a Binomial distribution, \ref apop_update would use the closed-form
posterior for the Beta-Binomial pair, but in this case, with a PMF as the likelihood,
it will have to run Markov chain Monte Carlo. The \ref apop_mcmc_settings group attached
to the prior specifies details of how the run should work.

For a likelihood, we generate an empirical distribution---a PMF---from an input 
data set, via <tt>apop_estimate(your_data, apop_pmf)</tt>.
When we call \ref apop_update on the last line, it already has all of the above info
on hand.

\code
apop_model *beta = apop_model_set_parameters(apop_beta, 0.5, 0.25);
Apop_settings_add_group(beta, apop_mcmc, .burnin = 0.2, .periods =1e5);
apop_model *my_pmf = apop_estimate(your_data, apop_pmf);
apop_model *posterior = apop_update(.prior= beta, .likelihood = my_pmf);
\endcode

<em> Databases and models</em>

Returning to the introductory example, you saw that (1) the
library expects you to keep your data in a database, pulling out the
data as needed, and (2) that the workflow is built around
\ref apop_model structures.

Starting with (2), 
if a stats package has something called a <em>model</em>, then it is
probably of the form Y = [an additive function of <b>X</b>], such as \f$y = x_1 +
\log(x_2) + x_3^2\f$. Trying new models means trying different
functional forms for the right-hand side, such as including \f$x_1\f$ in
some cases and excluding it in others. Conversely, Apophenia is designed 
to facilitate trying new models in the broader sense of switching out a 
linear model for a hierarchical, or a Bayesian model for a simulation. 
A formula syntax makes little sense over such a broad range of models.

As a result, the right-hand side is not part of 
the \ref apop_model. Instead, the data is assumed to be correctly formatted, scaled, or logged
before being passed to the model. This is where part (1), the database,
comes in, because it provides a proxy for the sort of formula specification language above:
 \code
apop_data *testme= apop_query_to_data("select y, x1, log(x2), pow(x3, 2) from data");
apop_model *est = apop_estimate(testme, apop_ols);
\endcode

Generating factors and dummies is also considered data prep, not model
internals. See \ref apop_data_to_dummies and \ref apop_data_to_factors.

Now that you have \c est, an estimated model, you can interrogate it. This is where Apophenia and its encapsulated
model objects shine, because you can do more than just admire the parameter estimates on
the screen: you can take your estimated model and fill in or generate new data, use it
as an input to the parent distribution of a hierarchical model, et cetera. Some simple
examples:

 \code
 //If you have a new data set with missing elements (represented by NaN), you can fill in predicted values:
apop_predict(new_data_set, est);
apop_data_print(new_data_set);

 //Fill a matrix with random draws.
apop_data *d = apop_model_draws(est, .count=1000);

 //How does the AIC_c for this model compare to that of est2?
printf("ΔAIC_c=%g\n", apop_data_get(est->info, .rowname="AIC_c") 
                       - apop_data_get(est2->info, .rowname="AIC_c"));
\endcode

\section gentle_end Conclusion

This introduction has shown you the \ref apop_data set and some of its associated
functions, which might be useful even if you aren't formally doing statistical
work but do have to deal with data that has real-world elements like column names and
mixed numeric/text values. You've seen how Apophenia encapsulates many of a model's
characteristics into a single \ref apop_model object, which you can send with data to
functions like \ref apop_estimate, \ref apop_predict, or \ref apop_draw. Once you've
got your data in the right form, you can use this to simply estimate model parameters,
or as an input to later analysis.

What's next?  
  \li Check out the system for hypothesis testing, both with traditional known
distributions (using \ref apop_test for Normal-, \f$t\f$-, and
\f$\chi^2\f$-distributed statistics) and for the parameters of any model, in \ref testpage.
  \li Try your own hand at putting new models into the \ref apop_model framework,
as discussed in \ref modeldetails.
  \li For example, have a look at <a href="http://modelingwithdata.org/arch/00000154.htm">this blog</a>
and its subsequent posts, which wrap a microsimulation into an \ref apop_model, so
that its parameters can be estimated and confidence intervals set around them.
  \li See the \ref maxipage page for discussion of the many features the optimization
system has. It allows you to use a diverse set of search types on constrained or
unconstrained models.
  \li Skim through <a href="http://apophenia.info/group__all__public.html">the full list
of macros and functions</a>---there are hundreds---to get a sense of what else
Apophenia offers.
*/

/** \page modeldetails Writing new models

The \ref apop_model is intended to provide a consistent expression of <em>any</em>
model that (implicitly or explicitly) expresses a likelihood of data given parameters,
including traditional linear models, textbook distributions, Bayesian hierarchies,
microsimulations, and any combination of the above.  The unifying feature is that
all of the models act over some data space and some parameter space (in some cases
one or both is the empty set), and can assign a likelihood for a fixed pair of
parameters and data given the model. This is a very broad requirement, often used
in the statistical literature.  For discussion of the theoretical structures, see <a
href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf"><em>A Useful Algebraic System
of Statistical Models</em></a> (PDF).

This page is about writing new models from scratch, beginning with basic models and on
up to models with arbitrary internal settings, specific methods of Bayesian updating
using your model as a prior or likelihood, and so on. I assume you have already read
\ref modelsec on using models and have tried a few things with the
canned models that come with Apophenia, so you already know how a user handles basic
estimation, adding a settings group, and so on.

This page includes:

\li \ref write_likelihoods, a walkthrough of writing a new model from scratch.
\li \ref settingswriting, covering the writing of <em>ad hoc</em> structures to hold model- or method-specific details, like the number of periods for burning in an MCMC run or the number of bins in a histogram.
\li \ref vtables, covering the means of writing special-case routines for functions that are not part of the \ref apop_model itself, including the score or conjugate prior/likelihood pairs for \ref apop_update.
\li \ref modeldataparts, a detailed list of the requirements for the non-function elements of an \ref apop_model.
\li \ref methodsection, a detailed list of requirements for the function elements of an \ref apop_model.

\section write_likelihoods A walkthrough

Users are encouraged to always use models via the helper functions, like
\ref apop_estimate or \ref apop_cdf.  The helper functions do some boilerplate error
checking, and call defaults as needed. For example, if your model has a \c log_likelihood
method but no \c p method, then \ref apop_p will use exp(\c log_likelihood). If you don't
give an \c estimate method, then \c apop_estimate will call \ref apop_maximum_likelihood.

So the game in writing a new model is to write just enough internal methods to give the helper functions what they need.
In the not-uncommon best case, all you need to do is write a log likelihood function or an RNG.
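That default-filling logic can be sketched in self-contained plain C (toy types for illustration, not Apophenia's actual internals): the helper checks whether the model supplies a method and falls back to a generic computation otherwise.

\code
#include <math.h>
#include <stdio.h>

//Toy stand-ins for the model struct and the apop_p helper.
typedef struct { double (*p)(double); double (*log_p)(double); } toy_model;

double toy_p(toy_model *m, double x){
    if (m->p)     return m->p(x);          //use the model's own method...
    if (m->log_p) return exp(m->log_p(x)); //...else fall back to exp(log likelihood)
    return NAN;                            //no method at all
}

double my_log_likelihood(double x){ return -x*x; }

int main(){
    toy_model m = {.log_p = my_log_likelihood};  //only a log likelihood given
    printf("%g\n", toy_p(&m, 0));                //falls back to exp(0)
}
\endcode

The real helper functions follow the same pattern with more elaborate fallbacks, such as maximum likelihood search when \c estimate is missing.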

Here is how one would set up a model that could be estimated using maximum likelihood:

\li Write a likelihood function. Its header will look like this:
\code
long double new_log_likelihood(apop_data *data, apop_model *m);
\endcode
where \c data is the input data, and \c m is the parametrized model (i.e. your model
with a \c parameters element already filled in by the caller).  This function will
return the value of the log likelihood function at the given parameters.

\li Write the object:

\code
apop_model *your_new_model = &(apop_model){"The Me distribution", 
            .vsize=n0, .msize1=n1, .msize2=n2, .dsize=nd,
            .log_likelihood = new_log_likelihood };
\endcode

 \li The first element is the <tt>.name</tt>, a human-language name for your model.
 \li The \c vsize, \c msize1, and \c msize2 elements specify the shape of the parameter
set. For example, if there are three numbers in the vector, then set <tt>.vsize=3</tt>
and omit the matrix sizes. The default model prep routine will call
<tt>new_est->parameters = apop_data_alloc(vsize, msize1, msize2)</tt>. 
 \li The \c dsize element is the size of one random draw from your model.
 \li It's common to have [the number of columns in your data set] parameters; this
count will be filled in if you specify \c -1 for \c vsize, <tt>msize(1|2)</tt>, or
<tt>dsize</tt>. If the allocation is exceptional in a different way, then you will
need to allocate parameters by writing a custom \c prep method for the model.
 \li Is this a constrained optimization?  Add a <tt>.constraint</tt> element for those too.  See \ref constr for more.

You already have more than enough for something like this to work (the \c dsize is used for random draws):
\code
apop_model *estimated = apop_estimate(your_data, your_new_model);
\endcode

Once that baseline works, you can fill in other elements of the \ref apop_model as needed.

For example, if you are using a maximum likelihood method to estimate parameters, you can get much faster estimates and better covariance estimates by specifying the dlog likelihood function (aka the score):

\code
void apop_new_dlog_likelihood(apop_data *d, gsl_vector *gradient, apop_model *m){
    //do algebra here to find df/dp0, df/dp1, df/dp2....
    gsl_vector_set(gradient, 0, d_0);
    gsl_vector_set(gradient, 1, d_1);
}
\endcode
The score is not part of the model object, but is registered (see below) using 
\code
apop_score_insert(apop_new_dlog_likelihood, your_new_model);
\endcode

\subsection threading On threading
Many procedures in Apophenia use OpenMP to thread operations, so assume your functions are running in a threaded environment. If a method cannot be threaded, wrap it in an OpenMP critical region. E.g.,

\code

void apop_new_dlog_likelihood(apop_data *d, gsl_vector *gradient, apop_model *m){
    #pragma omp critical (newdlog)
    {
        //un-threadable algebra here
    }
    gsl_vector_set(gradient, 0, d_0);
    gsl_vector_set(gradient, 1, d_1);
}
\endcode

\section settingswriting  Writing new settings groups

Your model may need additional settings or auxiliary information to function, which
would require associating a model-specific struct with the model.
A method associated with a model that uses such a struct usually begins with calls like
\code
long double ysg_ll(apop_data *d, apop_model *m){
    ysg_settings *sets = apop_settings_get(m, ysg);

    ...
}
\endcode

These model-specific structs are handled as expected by \ref apop_model_copy and \ref
apop_model_free, and many functions that modify or transform an \ref apop_model try to
do the right thing with settings groups. This section describes how to build a settings
group so that all these automatic steps work correctly, and your methods can reliably retrieve settings as needed.

But before getting into the detail of how to make model-specific groups of settings
work, note that there's a lightweight method of storing sundry settings, so in many
cases you can bypass all of the following.
The \ref apop_model structure has a \c void pointer named \c more which you can use to
point to a model-specific struct. If \c more_size is larger than zero (i.e. you set
it to <tt>your_model.more_size=sizeof(your_struct)</tt>), then it will be copied via \c
memcpy by \ref apop_model_copy, and freed by \ref apop_model_free. Apophenia's
routines will never impinge on this item, so do what you wish with it.
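Here is a toy version of that convention in self-contained plain C (the \c toy_ names are hypothetical; Apophenia's own copy routine does the analogous work for you): because \c more_size records the byte count, a plain \c memcpy suffices to duplicate the attached struct.

\code
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

//Toy stand-ins for the model and an attached settings struct.
typedef struct { void *more; size_t more_size; } toy_mod;
typedef struct { int draws; double tolerance; } my_extras;

toy_mod toy_mod_copy(toy_mod in){
    toy_mod out = in;
    if (in.more_size){  //a recorded size tells us how many bytes to duplicate
        out.more = malloc(in.more_size);
        memcpy(out.more, in.more, in.more_size);
    }
    return out;
}

int main(){
    my_extras ex = {.draws=1000, .tolerance=1e-5};
    toy_mod m = {.more=&ex, .more_size=sizeof(my_extras)};
    toy_mod m2 = toy_mod_copy(m);
    ((my_extras*)m.more)->draws = 1;  //changing the original...
    printf("%d\n", ((my_extras*)m2.more)->draws);  //...leaves the copy alone
}
\endcode

Note that \c memcpy is a shallow copy: if your struct itself holds pointers, both copies will share the pointed-to data, which is exactly the double-free hazard that the full settings-group machinery described next is designed to manage.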

The remainder of this subsection describes the information you'll have to provide to make
use of the conveniences described to this point: initialization of defaults, smarter
copying and freeing, and adding to an arbitrarily long list of settings groups attached
to a model.  You will need four items: a typedef for the structure itself, plus init, copy, and
free functions.  This is the sort of boilerplate that will be familiar to users of
object-oriented languages in the style of C++ or Java, but it's really a list of
arbitrarily-typed elements, which makes this feel more like LISP. [And being a
reimplementation of an existing feature of LISP, this section will be macro-heavy.]

The settings struct will likely go into a header file, so here is a sample header
for a new settings group named \c ysg_settings, with a dataset, two sizes, and
a reference counter. <tt>ysg</tt> stands for Your Settings Group; replace that
substring with your preferred name in every instance to follow.

\code
typedef struct {
    int size1, size2;
    int *refs;
    apop_data *dataset;
} ysg_settings;

Apop_settings_declarations(ysg)
\endcode

The first item is a familiar structure definition. The last line is a macro that declares the
init, copy, and free functions discussed below. This is everything you would
need in a header file, should you need one. These are just declarations; we'll write
the actual init/copy/free functions below.

The structure itself gets the full name, \c ysg_settings. Everything else is a macro
keyed on \c ysg, without the \c _settings part. Because of these macros, your struct
name must end in \c _settings.

If you have an especially simple structure, then you can generate the three functions with these three macros in your <tt>.c</tt> file:

\code
Apop_settings_init(ysg, )
Apop_settings_copy(ysg, )
Apop_settings_free(ysg, )
\endcode

These macros generate appropriate functions to do what you'd expect: allocating the
main structure, copying one struct to another, freeing the main structure.  
The spaces after the commas indicate that in these cases no special code gets added to
the functions that these macros expand into.

You'll never call the generated functions directly; they are called by \ref Apop_settings_add_group,
\ref apop_model_free, and other model or settings-group handling functions.

Now that initializing/copying/freeing of
the structure itself is handled, the remainder of this section will be about how to
add instructions for the structure internals, like data that is pointed to by the structure elements.

\li For the allocate function, use the above form if everything in your code defaults to zero/\c NULL.  
Otherwise, you will need a new line declaring a default for every element in your structure. There is a macro to help with this too. 
These macros will define for your use a structure named \c in, and an output pointer-to-struct named \c out.
Continuing the above example:
\code
Apop_settings_init (ysg, 
    Apop_stopif(!in.size1, return NULL, 0, "I need you to give me a value for size1.");
    Apop_varad_set(size2, 10);
    Apop_varad_set(dataset, apop_data_alloc(out->size1, out->size2));
    Apop_varad_set(refs, malloc(sizeof(int)));
    *out->refs = 1;
)
\endcode
Now, <tt>Apop_settings_add_group(a_model, ysg, .size1=100)</tt> would set up a group with
a 100-by-10 data set, and allocate the reference counter and set it to one.

\li Some functions do extensive internal copying, so you will need a copy function even
if your code has no explicit calls to \ref apop_model_copy. The default above simply
copies every element in the structure. Pointers are copied, giving you two pointers
pointing to the same data. We have to be careful to prevent double-freeing later.
\code
//The elements of the set to copy are all copied by the function's boilerplate,
//and then make one additional modification:
Apop_settings_copy (ysg,
    #pragma omp critical (ysg_refs)
        (*out->refs)++;
)
\endcode

\li The struct itself is freed by boilerplate code, but add code in the free function
to free data pointed to by pointers in the main structure. The macro defines a
pointer-to-struct named \c in for your use. Continuing the example:
\code
Apop_settings_free (ysg,
    #pragma omp critical (ysg_refs)
        if (!(--*in->refs)) {
            free(in->dataset);
            free(in->refs);
        }
)
\endcode

With those three macros in place and the header as above, Apophenia will treat your
settings group like any other, and users can use \ref Apop_settings_add_group to
populate it and attach it to any model.

\section vtables Registering new methods in vtables

The settings groups are for adding arbitrary model-specific nouns; vtables are for
adding arbitrary model-specific verbs.

Many functions (e.g., entropy, the dlog likelihood, Bayesian updating) have
special cases for well-known models like the Normal distribution. 
Any function may maintain a registry of models and associated special-case procedures, aka a vtable.

Lookups happen based on a hash that takes into account the elements of the model
that will be used in the calculation. For example, the \c apop_update_hash takes in two
models and calculates the hash based on the address of the prior's \c draw method and
the likelihood's \c log_likelihood or \c p method. Thus, a vtable lookup for new models
that re-use the same methods (at the same addresses in memory) will still find the
same special-case function.

If you need to deregister the function, use the associated deregister function,
e.g. <tt>apop_update_vtable_drop(apop_beta, apop_binomial)</tt>. You can guarantee
that a method will not be re-added by following up the <tt>_drop</tt> with, e.g.,
<tt>apop_update_vtable_add(NULL, apop_beta, apop_binomial)</tt>.

The steps for adding a function to an existing vtable:

\li See \ref apop_update, \ref apop_score, \ref apop_predict, \ref apop_model_print, and \ref
apop_parameter_model for examples and procedure-specific details.
\li Write a function following the given type definition, as listed in the function's documentation.
\li Use the associated <tt>_vtable_add</tt> function to add the function and associate it
with the given model. For example, to add a Beta-binomial routine named \c betabinom
to the registry of Bayesian updating routines, use <tt>apop_update_vtable_add(betabinom,
apop_beta, apop_binomial)</tt>.
\li Place a call to <tt>..._vtable_add</tt> in the \c prep method of the given model, so that the auxiliary functions are registered the first time the model is sent to \ref apop_estimate.

The easiest way to set up a new vtable is to copy/paste/modify an existing one. Briefly:

\li See the existing setups in the vtables portion of <tt>apop.h</tt>. 
\li Cut/paste one and do a search and replace to change the name to match your desired use.
\li Set the typedef to describe the functions that get added to the vtable.
\li Rewrite the hash function to check the part of the inputs that interest you. For
example, the update vtable associates functions with the \c draw, \c log_likelihood,
and \c p methods of the model. A model where these elements are identical 
will still match even if other elements are different.

\section modeldataparts The data elements

The remainder of this section covers the detailed expectations regarding the elements
of the \ref apop_model structure. I begin with the data (non-function) elements,
and then cover the method (function) elements. Some of the following will be
requirements for all models and some will be advice to authors; I use the accepted
definitions of <a href="http://tools.ietf.org/html/rfc2119">"must", "shall", "may"</a>
and related words.

\subsection datasubsec Data

\li Each row of the \c data element is treated as a single observation by many functions. 
For example, \ref apop_bootstrap_cov depends on each row being an iid observation to function correctly.
Calculating the Bayesian Information Criterion (BIC) requires knowing
the number of observations in the data, and assumes that row count==observation count.
For complex data, the \ref apop_data_pack and \ref  apop_data_unpack functions can help with this. 

\li Some functions (bootstrap again, or many uses of \ref apop_kl_divergence) use \ref
apop_draw to make draws via your model's RNG (or a default), write each draw to the
  matrix element of the data set, and then move on to an estimation or other step. In
  this case, the data sent in will be entirely in the \c ->matrix element of the \ref
  apop_data set sent to model methods. Your \c likelihood, \c p, \c cdf, and \c estimate
  routines must accept data as a single row of the matrix of the \ref apop_data set for
  such functions to work. They may accept other formats. Tip: you can use \ref
  apop_data_pack and \ref apop_data_unpack to convert a structured set to a single row and back again.

\li Your routines may accept other data formats, as per contract with the user.
    For example, regression-type functions use a function named \c ols_shuffle
    to convert a matrix where the first column is the dependent variable to a data
    set with dependent variable in the vector and a column of ones in the first
    matrix column.

\subsection paramsubsec Parameters, vsize, msize1,  msize2

\li The sizes will be used by the \c prep method of the model; see below. Given the model \c m and its elements \c m.vsize, \c m.msize1, \c m.msize2,
    functions that need to allocate a parameter set will do so via <tt>apop_data_alloc(m.vsize, m.msize1, m.msize2)</tt>. 


\subsection infosubsec Info

\li The first page, which should be named \c &lt;Info&gt;, is typically a list of scalars. Nothing is guaranteed, but the elements may include:

\li AIC: <a href="https://en.wikipedia.org/wiki/Akaike's_Information_Criterion">Akaike Information Criterion</a>
\li AIC_c: AIC with a finite sample correction. ``<em>Generally, we advocate the use of AIC_c when the ratio \f$n/K\f$ is small (say \f$<\f$ 40)</em>'' [Kenneth P. Burnham, David R. Anderson: <em>Model Selection and Multi-Model Inference</em>, p 66, emphasis in original.]
\li BIC: <a href="https://en.wikipedia.org/wiki/Bayesian_information_criterion">Bayesian Information Criterion</a>
\li R squared
\li R squared adj
\li log likelihood
\li status [0=OK, nonzero=other].

For those elements that require a count of input data, the calculations assume each row in the input \ref apop_data set is a single datum.

Get these via, e.g., <tt>apop_data_get(your_model->info, .rowname="log likelihood")</tt>.
When writing code intended to work with any arbitrary model, be prepared to handle \c NaN, which indicates that the given model does not calculate or save that element in the info page.

For OLS-type estimations, each row corresponds to the row in the original data. For
filling in of missing data, the elements may appear anywhere, so the row/col indices are
essential.

\subsection settingsgroupmention settings, more

In object-oriented jargon, settings groups are the private elements of the model,
to be pulled out in certain contexts and ignored in all others. Therefore, there are
no rules about internal use. The \c more element of the \ref apop_model provides a lightweight
means of attaching an arbitrary struct to a model. See \ref settingswriting above for details.

\li As many settings groups of different types as desired can be added to a single \ref apop_model.
\li One \ref apop_model cannot hold two settings groups of the same type. Re-adding a group removes the previous version.
\li If the \c more pointer points to a structure or value (let it be \c ss), then \c more_size must be set to <tt>sizeof(ss)</tt>.


\section methodsection Methods

\subsection psubsection p, log_likelihood

\li Function headers look like  <tt>long double your_p_or_ll(apop_data *d, apop_model *params)</tt>.
\li The inputs are an \ref apop_data set and an \ref apop_model, which should include the elements needed to fully estimate the probability/likelihood (probably a filled <tt>->parameters</tt> element, possibly a settings group added by the user).
\li Assume that the parameters have been set, by users via \ref apop_estimate or \ref apop_model_set_parameters, or by \ref apop_maximum_likelihood by its search algorithms. If the parameters are necessary, the function shall check that the parameters are not \c NULL and set the model's \c error element to \c 'p' if they are missing.
\li Return \c NaN on errors. If an error in the input model is found, the function may set the input model's \c error element to an appropriate \c char value.
\li If your model includes both \c log_likelihood and \c p methods, it must be the case that <tt>log(p(d, m))</tt> equals <tt>log_likelihood(d, m)</tt> for all \c d and \c m. This implies that \c p must return a value \f$\geq 0\f$. Note that \ref apop_maximum_likelihood will accept functions where \c p returns a negative value, but diagnostics that depend on the log likelihood, such as AIC, will return NaN.
\li If observations are assumed to be iid, you may be able to use \ref apop_map_sum to write the core of the log likelihood function.

\subsection prepsubsection prep

\li Function header looks like <tt>void your_prep(apop_data *data, apop_model *params)</tt>.
\li Re-prepping a model after it has already been prepped shall have no effect. Where there is ambiguity with the other requirements, this takes precedence.
\li The model's <tt>data</tt> pointer shall be set to point to the input data.
\li The \c info element shall be allocated and its title set to <tt>\<Info\></tt>.
\li If \c vsize, \c msize1, or \c msize2 are -1, then the prep function shall set them to the width of the input data.
\li If \c dsize is -1, then the prep function shall set it to the width of the input data.
\li If the \c parameters element is not allocated, the function shall allocate it via <tt>apop_data_alloc(vsize, msize1, msize2)</tt> (or equivalent).
\li The default is \ref apop_model_clear. It does all of the above.
\li The input data may be modified by the prep routine. For example, the \ref apop_ols prep routine shuffles a single input matrix as described above under \c data, and the \ref apop_pmf prep routine calls \ref apop_data_pmf_compress on the input data.
\li The prep routine may initialize any desired settings groups. Unless otherwise
stated, these should not be removed if they are already there, so that users can override defaults by adding a settings group before starting an estimation.
\li If any functions associated with the model need to be added to 
a vtable (see above), the registration shall happen here. Registration may also happen elsewhere.

\subsection estimatesubsection estimate

\li Function header looks like  <tt> void your_estimate(apop_data *data, apop_model *params)</tt>.
It modifies the input model, and returns nothing. Note that this is different from the wrapper function, \ref apop_estimate, which makes a copy of its input model, preps it, and then calls the \c estimate function with the prepped copy.
\li Assume that the prep routine has already been run. Notably, this means that parameters have been allocated.
\li Assume that the \c parameters hold garbage (as in a \c malloc without a subsequent assignment to the <tt>malloc</tt>-ed space).
\li The function shall set the \c parameters of the input model. For consistency with other models, the estimate should be the maximum likelihood estimate, unless otherwise documented.
\li Additional settings may be set.
\li The model's \c &lt;Info&gt; page may be filled with statistics, as discussed under \ref infosubsec above. For scalars like log likelihood and AIC, use \ref apop_data_add_named_elmt.
\li Data should not be modified by the \c estimate routine; any changes to the data made by \c estimate must be documented.
\li The default called by \ref apop_estimate is \ref apop_maximum_likelihood.
\li If errors occur during processing, set the model's \c error element to a single character. Documentation should include the list of error characters and their meaning.

\subsection drawsubsection draw

\li Function header looks like <tt>void your_draw(double *out, gsl_rng* r, apop_model *params)</tt>
\li Assume that the model's \c parameters are set, via \ref apop_estimate or \ref apop_model_set_parameters. The author of the draw method should check that \c parameters are not \c NULL if needed and fill the output with NaNs if necessary parameters are not set.
\li Caller inputs a pointer-to-<tt>double</tt> of length \c dsize; user is expected to make sure that there is adequate space. Caller also inputs a \c gsl_rng, already allocated (probably via \ref apop_rng_alloc, possibly from \ref apop_rng_get_thread).
\li The function shall fill the space pointed to by the input pointer with a random draw from the data space, where the likelihood of any given observation is proportional to its likelihood as given by the \c p method. Data shall be reduced to a single vector via \ref apop_data_pack if it is not already a single vector.

\subsection cdfsubsection cdf

\li Function header looks like <tt>long double your_cdf(apop_data *d, apop_model *params)</tt>.
\li Assume that \c parameters are set, via \ref apop_estimate or \ref apop_model_set_parameters. The author of the CDF method should check that \c parameters are not \c NULL and return NaN if necessary parameters are not set.
\li The CDF method must accept data as a single row of data in the \c matrix of the input \ref apop_data set (as per a draw produced using the \c draw method). May accept other formats.
\li Returns the percentage of the likelihood function \f$\leq\f$ the first row of the input data. The definition of \f$\leq\f$ is chosen by the model author.
\li If one is not already present, an \c apop_cdf_settings group may be added to the model to store temp data. See the \ref apop_cdf function for details.

\subsection constraintsubsection constraint

\li Function header looks like <tt>long double your_constraint(apop_data *data, apop_model *params)</tt>.
\li Assume that \c parameters are set, via \ref apop_estimate, \ref apop_model_set_parameters, or the internals of an MLE search. The author of the constraint method should check that \c parameters are not \c NULL and return NaN if necessary parameters are not set.
\li See \ref apop_linear_constraint for a useful basis and/or example. Many constraints can be written as wrappers for this function.
\li If the constraint is met, then return zero.
\li If the constraint fails, then (1) move the \c parameters in the input model to a
constraint-satisfying value, and (2) return the distance between the input parameters and
what you've moved the parameters to. The choice of within-bounds parameters and distance function is left to the author of the constraint function.
*/


/**\defgroup models 

This section is a detailed description of the stock models that ship with Apophenia.
It is a reference. For an explanation of what to do with an \ref apop_model, see \ref modelsec.

The primary questions one has about a model in practice are what format the input data should take and what to expect of an estimated output.

Generally, the input data consists of an \ref apop_data set where each row is a single observation. Details beyond that are listed below.

The output after running \ref apop_estimate to produce a fitted model are generally
found in three places: the vector of the output parameter set, its matrix, or a new
settings group. The basic intuition is that 
if the parameters are always a short list of scalars, they are in the vector;
if there exists a situation where they could take matrix form, the parameters will be in the matrix;
if they require more structure than that, they will be a settings group.

If the basic structure of the \ref apop_data set is unfamiliar to you, see \ref
dataoverview, which will discuss the basic means of getting data out of a struct. For
example, the estimated \ref apop_normal distribution has the mean in position zero of
the vector and the standard deviation in position one, so they could be extracted as follows:

\code
apop_data *d = apop_text_to_data("sample data from before");
apop_model *out = apop_estimate(d, apop_normal);
double mu = apop_data_get(out->parameters, 0);
double sigma = apop_data_get(out->parameters, 1);

//What is the p-value of the test whose null hypothesis is that μ=3.3?
printf("pval=%g\n", apop_test(3.3, "normal", mu, sigma));
\endcode

See \ref modelsec for discussion of how to pull settings groups using \ref
Apop_settings_get (for one item) or \ref apop_settings_get_group (for a full settings
group).
*/