7022 7023 7024 7025 7026 7027 7028 7029 7030 7031 7032 7033 7034 7035 7036 7037 7038 7039 7040 7041 7042 7043 7044 7045 7046 7047 7048 7049 7050 7051 7052 7053 7054 7055 7056 7057 7058 7059 7060 7061 7062 7063 7064 7065 7066 7067 7068 7069 7070 7071 7072 7073 7074 7075 7076 7077 7078 7079 7080 7081 7082 7083 7084 7085 7086 7087 7088 7089 7090 7091 7092 7093 7094 7095 7096 7097 7098 7099 7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115 7116 7117 7118 7119 7120 7121 7122 7123 7124 7125 7126 7127 7128 7129 7130 7131 7132 7133 7134 7135 7136 7137 7138 7139 7140 7141 7142 7143 7144 7145 7146 7147 7148 7149 7150 7151 7152 7153 7154 7155 7156 7157 7158 7159 7160 7161 7162 7163 7164 7165 7166 7167 7168 7169 7170 7171 7172 7173 7174 7175 7176 7177 7178 7179 7180 7181 7182 7183 7184 7185 7186 7187 7188 7189 7190 7191 7192 7193 7194 7195 7196 7197 7198 7199 7200 7201 7202 7203 7204 7205 7206 7207 7208 7209 7210 7211 7212 7213 7214 7215 7216 7217 7218 7219 7220 7221 7222 7223 7224 7225 7226 7227 7228 7229 7230 7231 7232 7233 7234 7235 7236 7237 7238 7239 7240 7241 7242 7243 7244 7245 7246 7247 7248 7249 7250 7251 7252 7253 7254 7255 7256 7257 7258 7259 7260 7261 7262 7263 7264 7265 7266 7267 7268 7269 7270 7271 7272 7273 7274 7275 7276 7277 7278 7279 7280 7281 7282 7283 7284 7285 7286 7287 7288 7289 7290 7291 7292 7293 7294 7295 7296 7297 7298 7299 7300 7301 7302 7303 7304 7305 7306 7307 7308 7309 7310 7311 7312 7313 7314 7315 7316 7317 7318 7319 7320 7321 7322 7323 7324 7325 7326 7327 7328 7329 7330 7331 7332 7333 7334 7335 7336 7337 7338 7339 7340 7341 7342 7343 7344 7345 7346 7347 7348 7349 7350 7351 7352 7353 7354 7355 7356 7357 7358 7359 7360 7361 7362 7363 7364 7365 7366 7367 7368 7369 7370 7371 7372 7373 7374 7375 7376 7377 7378 7379 7380 7381 7382 7383 7384 7385 7386 7387 7388 7389 7390 7391 7392 7393 7394 7395 7396 7397 7398 7399 7400 7401 7402 7403 7404 7405 7406 7407 7408 7409 7410 7411 7412 7413 7414 7415 7416 7417 7418 7419 7420 7421 
7422 7423 7424 7425 7426 7427 7428 7429 7430 7431 7432 7433 7434 7435 7436 7437 7438 7439 7440 7441 7442 7443 7444 7445 7446 7447 7448 7449 7450 7451 7452 7453 7454 7455 7456 7457 7458 7459 7460 7461 7462 7463 7464 7465 7466 7467 7468 7469 7470 7471 7472 7473 7474 7475 7476 7477 7478 7479 7480 7481 7482 7483 7484 7485 7486 7487 7488 7489 7490 7491 7492 7493 7494 7495 7496 7497 7498 7499 7500 7501 7502 7503 7504 7505 7506 7507 7508 7509 7510 7511 7512 7513 7514 7515 7516 7517 7518 7519 7520 7521 7522 7523 7524 7525 7526 7527 7528 7529 7530 7531 7532 7533 7534 7535 7536 7537 7538 7539 7540 7541 7542 7543 7544 7545 7546 7547 7548 7549 7550 7551 7552 7553 7554 7555 7556 7557 7558 7559 7560 7561 7562 7563 7564 7565 7566 7567 7568 7569 7570 7571 7572 7573 7574 7575 7576 7577 7578 7579 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 7590 7591 7592 7593 7594 7595 7596 7597 7598 7599 7600 7601 7602 7603 7604 7605 7606 7607 7608 7609 7610 7611 7612 7613 7614 7615 7616 7617 7618 7619 7620 7621 7622 7623 7624 7625 7626 7627 7628 7629 7630 7631 7632 7633 7634 7635 7636 7637 7638 7639 7640 7641 7642 7643 7644 7645 7646 7647 7648 7649 7650 7651 7652 7653 7654 7655 7656 7657 7658 7659 7660 7661 7662 7663 7664 7665 7666 7667 7668 7669 7670 7671 7672 7673 7674 7675 7676 7677 7678 7679 7680 7681 7682 7683 7684 7685 7686 7687 7688 7689 7690 7691 7692 7693 7694 7695 7696 7697 7698 7699 7700 7701 7702 7703 7704 7705 7706 7707 7708 7709 7710 7711 7712 7713 7714 7715 7716 7717 7718 7719 7720 7721 7722 7723 7724 7725 7726 7727 7728 7729 7730 7731 7732 7733 7734 7735 7736 7737 7738 7739 7740 7741 7742 7743 7744 7745 7746 7747 7748 7749 7750 7751 7752 7753 7754 7755 7756 7757 7758 7759 7760 7761 7762 7763 7764 7765 7766 7767 7768 7769 7770 7771 7772 7773 7774 7775 7776 7777 7778 7779 7780 7781 7782 7783 7784 7785 7786 7787 7788 7789 7790 7791 7792 7793 7794 7795 7796 7797 7798 7799 7800 7801 7802 7803 7804 7805 7806 7807 7808 7809 7810 7811 7812 7813 7814 7815 7816 7817 7818 7819 7820 7821 
7822 7823 7824 7825 7826 7827 7828 7829 7830 7831 7832 7833 7834 7835 7836 7837 7838 7839 7840 7841 7842 7843 7844 7845 7846 7847 7848 7849 7850 7851 7852 7853 7854 7855 7856 7857 7858 7859 7860 7861 7862 7863 7864 7865 7866 7867 7868 7869 7870 7871 7872 7873 7874 7875 7876 7877 7878 7879 7880 7881 7882 7883 7884 7885 7886 7887 7888 7889 7890 7891 7892 7893 7894 7895 7896 7897 7898 7899 7900 7901 7902 7903 7904 7905 7906 7907 7908 7909 7910 7911 7912 7913 7914 7915 7916 7917 7918 7919 7920 7921 7922 7923 7924 7925 7926 7927 7928 7929 7930 7931 7932 7933 7934 7935 7936 7937 7938 7939 7940 7941 7942 7943 7944 7945 7946 7947 7948 7949 7950 7951 7952 7953 7954 7955 7956 7957 7958 7959 7960 7961 7962 7963 7964 7965 7966 7967 7968 7969 7970 7971 7972 7973 7974 7975 7976 7977 7978 7979 7980 7981 7982 7983 7984 7985 7986 7987 7988 7989 7990 7991 7992 7993 7994 7995 7996 7997 7998 7999 8000 8001 8002 8003 8004 8005 8006 8007 8008 8009 8010 8011 8012 8013 8014 8015 8016 8017 8018 8019 8020 8021 8022 8023 8024 8025 8026 8027 8028 8029 8030 8031 8032 8033 8034 8035 8036 8037 8038 8039 8040 8041 8042 8043 8044 8045 8046 8047 8048 8049 8050 8051 8052 8053 8054 8055 8056 8057 8058 8059 8060 8061 8062 8063 8064 8065 8066 8067 8068 8069 8070 8071 8072 8073 8074 8075 8076 8077 8078 8079 8080 8081 8082 8083 8084 8085 8086 8087 8088 8089 8090 8091 8092 8093 8094 8095 8096 8097 8098 8099 8100 8101 8102 8103 8104 8105 8106 8107 8108 8109 8110 8111 8112 8113 8114 8115 8116 8117 8118 8119 8120 8121 8122 8123 8124 8125 8126 8127 8128 8129 8130 8131 8132 8133 8134 8135 8136 8137 8138 8139 8140 8141 8142 8143 8144 8145 8146 8147 8148 8149 8150 8151 8152 8153 8154 8155 8156 8157 8158 8159 8160 8161 8162 8163 8164 8165 8166 8167 8168 8169 8170 8171 8172 8173 8174 8175 8176 8177 8178 8179 8180 8181 8182 8183 8184 8185 8186 8187 8188 8189 8190 8191 8192 8193 8194 8195 8196 8197 8198 8199 8200 8201 8202 8203 8204 8205 8206 8207 8208 8209 8210 8211 8212 8213 8214 8215 8216 8217 8218 8219 8220 8221 
8222 8223 8224 8225 8226 8227 8228 8229 8230 8231 8232 8233 8234 8235 8236 8237 8238 8239 8240 8241 8242 8243 8244 8245 8246 8247 8248 8249 8250 8251 8252 8253 8254 8255 8256 8257 8258 8259 8260 8261 8262 8263 8264 8265 8266 8267 8268 8269 8270 8271 8272 8273 8274 8275 8276 8277 8278 8279 8280 8281 8282 8283 8284 8285 8286 8287 8288 8289 8290 8291 8292 8293 8294 8295 8296 8297 8298 8299 8300 8301 8302 8303 8304 8305 8306 8307 8308 8309 8310 8311 8312 8313 8314 8315 8316 8317 8318 8319 8320 8321 8322 8323 8324 8325 8326 8327 8328 8329 8330 8331 8332 8333 8334 8335 8336 8337 8338 8339 8340 8341 8342 8343 8344 8345 8346 8347 8348 8349 8350 8351 8352 8353 8354 8355 8356 8357 8358 8359 8360 8361 8362 8363 8364 8365 8366 8367 8368 8369 8370 8371 8372 8373 8374 8375 8376 8377 8378 8379 8380 8381 8382 8383 8384 8385 8386 8387 8388 8389 8390
|
1999-11-22 Andrew McCallum <mccallum@justresearch.com>
* Makefile.in (STANDARD_RAINBOW_METHOD_C_FILES): Added dirk.c.
* dirk.c (log_gamma): Cache 100 integer x's.
(bow_dirk_log_kernel): Take vocab size as argument instead of barrel.
(bow_dirk_score): Add exponentiated log-densities, instead of log
densities. Do this by finding the max and subtracting.
1999-11-16 Andrew McCallum <mccallum@justresearch.com>
* cdm.c (cdm_options): New command-line options
"cdm-print-smallest-alphas" and "cdm-print-largest-alphas".
(cdm_parse_opt): Handle them.
(bow_cdm_initialize_ct): New code allows this to be called more than
once. This way you can add new documents (and hence words) and
re-calculate the infogain.
(bow_cdm_ct_set_alphas): Added structure ALPHA_RECORD for printing
largest and smallest alphas. Added, but commented out, code for
smoothing the counts before fitting the Dirichlet, using
log(alpha) in place of alpha, smoothing the alphas. Print the
largest and smallest alphas.
(CDM_SCORE_ANNEAL_TEMPERATURE): New macro, currently defined not to be
used.
(bow_cdm_score): Handle it.
1999-11-10 Andrew McCallum <mccallum@justresearch.com>
* svm_base.c (sqrtf): New macro, necessary on some non-Linux
machines. Bug reported by Chuck Rosenberg.
1999-11-08 Andrew McCallum <mccallum@justresearch.com>
* readme.texi: Add simple usage examples for arrow.
* arrow.c (arrow_serve2): Implement the 'query' command. Change
XML labels from "archer" to "arrow".
(main): Change default number of hits on a query from 1 to 10.
* libbow-desc.texi: Update descriptions.
* svm_base.c: Condition many printf's on the
bow_verbosity_level, so that by default rainbow-stats will still
work.
* array.c (cdocs_iterator_count_for_doc): Replace NAN macro with
arithmetic equivalent.
* barrel.c (barrel_iterator_count_for_doc): Likewise.
* wv.c (bow_wv_weight_sum): New function.
* bow/libbow.h: Declare new function.
* train_dirichlet.c (moment_match_mccallum): Separate
implementation of moment matching that determines the variance by
averaging the variance of all dimensions.
(train_dirichlet_mom_sparse): New function.
* bow/train_dirichlet.h: Declare new function.
* tfidf.c (TFIDF_METHOD): Use
bow_wv_set_weights_to_count_times_idf() instead of
bow_wv_set_weights_to_count(), as is correct for TFIDF. This was
previously corrected in the scoring function.
(bow_tfidf_params_tfidf): Change parameter settings for "tfidf"
method. Previously it was identical to the "tfidf_log_words"
method, now it is identical to the "tfidf_log_occur" method. In
other words, previously it calculated IDF using the number of
times the word occurred in the training data; now it uses the
number of training documents in which the word occurs.
* split.c (bow_split_options): Remove documentation for 'r'
suffix. It's confusing and shouldn't be used unawares.
(bow_split_parse_opt): Add a 'pcr' suffix, but it's not implemented
yet.
(bow_set_doc_types_randomly_by_count_per_class): Count the number of
untagged documents in each class, and if this function is trying
to tag more than are available, simply have this function tag
fewer.
* rainbow.c (bow_print_log_odds_ratio): Handle words that are not
in the vocabulary.
* ddf.c: Implement ddfmm classification method. This method fits
the Dirichlet by moment matching only.
* arrow.c (arrow_serve2): New function. Now call this instead of
arrow_serve. It provides output in XML, like archer does. Only
the rank command is implemented.
1999-11-02 Andrew McCallum <mccallum@justresearch.com>
* int4str.c (bow_int2str): Assert that INDEX argument is
non-negative.
1999-10-28 Gregory C Schohn <gcs@cmu.edu>
* svm_base.c (svm_vpc_merge): fixed bug for svml-basename - all
the docs still need to be output, so that the other data (like
word weights) can be properly extracted.
1999-10-28 Andrew McCallum <mccallum@justresearch.com>
* cdmem.c (cdmem_options): New command-line option
"cdmem-dist-data".
(cdmem_parse_opt): Handle it.
(bow_cdmem_new_vpc_with_weights): Let the command-line option
determine what documents are used to learn the distance metric.
* README-SVM (Outputing data): Added new section describing how to
produce files ready for input into SVM^light.
1999-10-27 Gregory C Schohn <gcs@cmu.edu>
* svm_base.c (svm_vpc_merge): fixed svml bugs
* svm_base.c: fixed outdated documentation for parse info.
* svm_smo.c (smo): fixed a parse error
1999-10-26 Gregory C Schohn <gcs@cmu.edu>
* rainbow.c (rainbow_test): added a line for svms. When svmlight
output is being generated, rainbow_test prints the label (only
works for binary barrels) so that svm_score can append the data
for that example.
* svm_base.c (svm_options[]): removed some of the single character
switches. Added arguments for tsvms & added svml-basename arg.
(svm_permute_data, svm_unpermute_data): added.
(infogain): should have made infogain compatible with sets with
unlabeled data (it ignores those docs with y = 0).
(svm_vpc_merge): added support for using unlabeled docs for
transduction. Also added code to spit out svmlight friendly
files.
(svm_score): added code to write svmlight files.
* svm_trans.c: initial version - pretty much empty now.
* bow/svm.h: added svm_*permute_data declarations & the
transduce_svm declaration.
* svm_al.c (al_svm_test_wrapper): replaced permutation code with
calls to svm_permute_data & svm_unpermute_data.
* svm_smo.c (smo): removed srandom(1) - was only there for
debugging.
* README-SVM (Bugs): removed section about smo being broken (was
fixed).
* Makefile.in: added svm_trans.c (transductive svms) to the
svm_files.
1999-10-25 Andrew McCallum <mccallum@justresearch.com>
* .cvsignore: Add automatically-generated archer files, and a few
others.
1999-10-21 Andrew McCallum <mccallum@justresearch.com>
* barrel.c (bow_barrel_keep_top_words_by_infogain): Don't set the
NUM_WORDS_TO_KEEP to be the WI2IG_SIZE (which is the total number
of words). Set it to the MIN of this and the original
NUM_WORDS_TO_KEEP. Before this fix, no words were ever getting
removed. What a bug! I wonder how long this has been in there?
Reported by Carsten Lanquillon <lanqui@cs.cmu.edu>.
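The barrel.c fix above amounts to clamping the requested count to the vocabulary size instead of overwriting it. A minimal sketch, assuming a hypothetical helper (the real function works on a barrel and its infogain array, which are not shown here):

```c
#include <assert.h>

/* Hypothetical helper, not taken from barrel.c.  Returns how many
   words to keep: the requested count, but never more than the total
   number of words that have an infogain value.  */
int
clamp_num_words_to_keep (int num_words_to_keep, int wi2ig_size)
{
  assert (num_words_to_keep >= 0 && wi2ig_size >= 0);
  /* The bug was the equivalent of "return wi2ig_size;", which keeps
     every word regardless of the request.  */
  return num_words_to_keep < wi2ig_size ? num_words_to_keep : wi2ig_size;
}
```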
1999-10-20 Andrew McCallum <mccallum@justresearch.com>
* ddf.c (bow_ddf_dirichlet_from_doc_word_counts): Only print the
diagnostics for 10 sampled words, not 50.
* bpe.c (bow_bpe_set_cdoc_word_count_from_wi2dvf_weights): Print
the alphas for only 10 sampled words instead of 20.
1999-10-19 Andrew McCallum <mccallum@justresearch.com>
* svm_base.c: Check verbosity level before printing to stdout.
Only print if above bow_progress.
1999-10-19 Gregory C Schohn <gcs@cmu.edu>
* svm_base.c (svm_score): removed cnt variable (useless) & fixed a
typo-bug (sub_model[i] -> barrel).
* svm_smo.c (smo): changed the printf for information of where
opt_pair failed to an fprintf.
1999-10-19 Gregory C Schohn <gcs@justresearch.com>
* Makefile.local (DIST_ALL_FILES): added -DGCSJPRC (turn local
pedantic debugging) to DEFS.
* Makefile.in (ALL_CPPFLAGS): added -Ibow (so that pr_loqo.h is
found by pr_loqo.c even though they aren't in the same directory
[since we can't change pr_loqo.*]).
(DEFS): Changed from _DEFS & now using += instead of the temporary.
* svm_base.c: the epsilon_crit is now /2 for SMO (since the actual
eps is 2x the variable). fixed some printfs.
* svm_loqo.c (build_svm_guts): added code to remember previous KKT
epsilon (even though nobody sets the initial value to anything
different than the macro).
(build_svm_guts): added local define (GCSJPRC) for debugging stuff
which includes stopping the proc & sending mail.
* svm_smo.c: commented #DEBUG. added kcache_ages to appropriate
spots across the file. removed some print statements that weren't
too useful anymore.
(opt_pair): changed an optimality check - used to use (a2+ao2)*eps
to determine if something moved far enough, now just using eps_a
(may not be right, but it's more correct than before) - we need it
to prevent inf. looping.
(opt_pair): Removed some unreachable code in if statements.
(opt_pair): Fixed calculations of bup & blow - they were backwards.
(smo): the threshold, b is now (bup+blow)/2 instead of blow (which
is at most epsilon_crit different).
1999-10-16 Gregory C Schohn <gcs@justresearch.com>
* svm_base.c: Added #ifdef HAVE_LOQO around calls to build_svm_guts
* svm_al.c: Added #ifdef HAVE_LOQO around calls to build_svm_guts
* Makefile.in: Re-enabled svm code. Made the pr_loqo checks look
for ./bow/pr_loqo.h
1999-10-16 Andrew McCallum <mccallum@justresearch.com>
* README-SVM (Obtaining sources): File renamed from README_SVM.
Clarify directions for where to put pr_loqo.h.
1999-10-15 Andrew McCallum <mccallum@justresearch.com>
* Version (BOW_MINOR_VERSION): Changed from 9 to 95.
* bow/libbow.h (BOW_MINOR_VERSION): Changed from 9 to 95.
Bug fixes for distribution.
* .cvsignore: Added rainbow-rank and rainbow-ts.
* Makefile.in: Temporarily disable SVM from rainbow.
(ARCHER_GENERATED_C_FILES): New variable. Remove these files from
those distributed, because they should be generated.
(ARCHER_DIST_FILES): Added archer.c and archer_query.c
* Makefile.in (DEMO_EXECUTABLES): New variable.
(ARCHER_DIST_FILES): Added dirichlet.c.
(DIST_FILES): Added archer.el
* multiclass.c: Comment out unused variables.
Odd assortment of clean-ups.
* bow/libbow.h (bow_random_reset_seed): Declare function.
* train_dirichlet.c (MOMENT_MATCH_ONLY): New macro.
(SPARSE): Change macro value from 0 to 1. This only affects running
train_dirichlet's main() directly.
(main): comment out the printing of the gammaln() tests. New local
variable COUNTS_SIZE, increased from 100 to 10000. Print more
diagnostics at the end.
* readme.texi: Update for new front-ends and fix command-line
options so they work.
* rainbow.c (rainbow_options): Clean up wording in several places.
(rainbow_query): Change behavior of repeated queries.
(bow_print_log_odds_ratio): Add a new FILE* argument. All callers
changed.
* nbshrinkage.c: Allow different lambda hierarchical mixture
weights for different classes.
* mix.c (mix_options): New command-line option for setting the
number of EM iterations.
(mix_new_vpc): Don't allow initial random class_probs to be zero.
* libbow-desc.texi: Update for new front-ends and MSWin.
* lex-gram.c (bow_lexer_gram_open_text_fp): Properly save the
return value of bow_realloc(). This fixes a nasty crash.
* emsimple.c (bow_emsimple_new_vpc_with_weights): Print
diagnostics using odds_ratio.
* dirichlet.c (main): New command-line argument -I. Handle it.
* dice.c (print_usage): Expand help statement.
* ddf.c (ddf_force_large_alphas): New variable.
(bow_ddf_dirichlet_from_doc_word_counts): Handle it.
(ddfla): New method.
* cdmm.c (CDMM_PRINT_ALPHAS_KEY): Change value to not conflict
with the cdm method.
* bpe.c (bpe_prior_alpha): Change default prior "ghost count" from
1 to 0.
(bow_bpe_set_cdoc_word_count_from_wi2dvf_weights): Make the verbosity
work even when the vocabulary size is less than 20.
(bow_bpe_score): Print more information when BOW_PRINT_WORD_SCORES.
Print more digits of precision of BOW_PRINT_WORD_SCORES for
individual words.
* Makefile.local (RAINBOW_METHOD_C_FILES): Move some of these to
the Makefile.in.
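The lex-gram.c fix in this entry concerns a classic mistake: growing a buffer but continuing to use the old pointer. A minimal sketch of the correct pattern follows. It uses the standard realloc() as a stand-in, since the signature and failure behavior of bow_realloc() are not given here, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable text buffer, not a libbow type.  */
struct text_buf
{
  char *data;
  size_t size;
};

/* Grow BUF to at least NEW_SIZE bytes.  Returns 0 on success and -1
   on allocation failure; on failure BUF is left unchanged.  */
int
text_buf_grow (struct text_buf *buf, size_t new_size)
{
  char *p;

  assert (buf != NULL);
  if (new_size <= buf->size)
    return 0;
  /* realloc() may move the block, so its return value must be saved.
     Ignoring it and using the old pointer is the crash described
     above.  */
  p = realloc (buf->data, new_size);
  if (p == NULL)
    return -1;
  buf->data = p;
  buf->size = new_size;
  return 0;
}
```

Assigning to a temporary first also avoids losing the original block when the allocation fails.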
1999-10-15 Gregory C Schohn <gcs@cmu.edu>
* svm_base.c: fixed some preprocessing bugs - also mildly cleaned
up the code...
* svm_smo.c: started fixing bugs (the definition of the error
vector changed - the modified code did not also change in some
spots)- there is still 1 left
* README_SVM: up to date - ready for release.
1999-10-13 Andrew McCallum <mccallum@justresearch.com>
* Makefile.in: Fix errors from last check-in.
1999-10-13 Gregory C Schohn <gcs@cmu.edu>
* Makefile.in (SVM_FILES): Added a check for pr_loqo.[ch] - if
they are there the svm parts of bow are built with it, otherwise
the parts of the code that use pr_loqo are turned off. The
conditional sets necessary files & defines.
(STANDARD_LIBBOW_H_FILES): added bow/svm.h
(STANDARD_LIBBOW_C_FILES): added $(SVM_FILES) (4-5 files,
replacing svm.c)
* configure.in: removed checks for pr_loqo.* - that's now done in
the makefile.
added a check for the fpsetmask macro (which is necessary on at
least freebsd boxes to turn ieee math on).
1999-10-07 Gregory C Schohn <gcs@cmu.edu>
* svm.c: (al_svm_test_wrapper) added printing of documents added
& the # of bound support vectors.
1999-10-06 Dayne Freitag <dayne@tweed.jprc.com>
* tagged.lex: HTML entities now recognized by lex, rather than
function is_entity. Parser returns when label is lexed; return
value indicates word, begin label, or end label.
* opts.c (parse_bow_opt): Conditional removal of code creating the
bow data directory which is not appropriate for DART and FDART.
* labels.c (bow_last_label): New function.
* archer_query_index.c: Major re-write of previous code.
* archer_query_array.c: Major re-write of previous code, much of
which was buggy.
* archer_query.y: Deleted the "term >N term" syntax as
superfluous. Added the "term < term" syntax. Added the "word"
type and "WORD" terminal to distinguish from NUMBER.
* archer_query.c: Code to free allocated structures.
* archer-server.c (archer_query_socket_init): Now releases socket
on failure. Unix socket support. Streamlined code by changing
code under archer_query_serve_one_query. New functions:
archer_query_serve_one_admin_command,
archer_query_serve_admin_commands,
archer_query_server_command_loop,
archer_query_serve_regular_query,
archer_query_server_process_commands. Added security features.
New functions: archer_remote_host_matches_spec,
archer_query_password_ok.
(archer_server_index): Call archer_archive after indexing.
(archer_server_index_with_markup): New function.
(archer_server_query_new): Fixed output. Decomposed, adding new
functions archer_server_print_hitlist and
archer_server_print_hit. Added ndump command and new functions
archer_server_dump_new, archer_server_dump_preamble, and
archer_server_do_dump. Added fields command and new function
archer_server_fields.
* archer_query_execute.c: Fixed many memory leaks, and rewrote
large sections.
* annotation.c (annotation_sarray_reread): New function.
* archer.c: Allow mark-up spoofing of the indexer, batch
incremental indexing, passwords and IP-based client restriction.
New executables (conditionally compiled) DART, FDART, and IDART.
(archer_get_fp_from_filename): New function.
(flush_labels): New function.
(archer_index_term): New function.
(archer_index_label): New function.
(archer_index_filename_flex): Some code removed to above functions.
(archer_index): Changed to prevent re-opening/re-construction of
already opened files and existing data structures (needed for
batch incremental indexing).
1999-10-06 Gregory C Schohn <gcs@cmu.edu>
* svm.c: (svm_vpc_merge) fixed preprocessing bug that caused big
problems when no weighting was being used in conjunction with
pairwise voting.
* also changed all of the options *, to svm-*.
1999-10-05 Andrew McCallum <mccallum@justresearch.com>
* TODO: Remove a few items that were done.
* NEWS: Describe some new features.
* HACKING: Update to remove no-longer-available CVS server
description.
* barrel.c (bow_barrel_keep_top_words_by_infogain): Fix verbosity
message for vocabulary sizes under 5.
1999-10-05 Andrew McCallum <mccallum@justresearch.com>
* docnames.c (bow_map_filenames_from_dir): Fix printf argument.
1999-09-29 Gregory C Schohn <gcs@cmu.edu>
* svm.c: a lot of minor small changes (like printing stuff), no
bug fixes - make sure to suppress the score matrix (which is
hundreds of MB large) if test-in-train is used with active
learning and you don't want it!
1999-09-24 Andrew McCallum <mccallum@justresearch.com>
* dirichlet.c: Added ability to do simple classification. For
example: (echo 2 ; cat
~/research/projects/dicefactory/synth1/bar.counts ) | ./dirichlet
-c 2 18.4738 26.1034 2.49099 2.04999
1999-09-22 Andrew McCallum <mccallum@justresearch.com>
* docnames.c (bow_map_filenames_from_dir): When a directory can't
be opened, simply skip it instead of trying to open it as a file.
(This works around a Linux bug whereby directories seem to
disappear.)
1999-09-20 Kamal Nigam <knigam@zeno.jprc.com>
* maxent.c: New code for options maxent-vary-prior-by-count
maxent-gaussian-prior-no-zero-constraints
maxent-prune-features-by-count
maxent-vary-prior-by-count-linearly.
1999-09-19 Gregory C Schohn <gcs@cmu.edu>
* configure.in: added check for srandom - which was & still is
necessary for libbow.h
* bow/libbow.h: changed the defines for srandom & random so that
both get redefined if one is missing.
1999-09-09 Andrew McCallum <mccallum@justresearch.com>
* train_dirichlet.c: Allow the main() test driver to be compiled
in by simply defining TD_MAIN on the gcc command-line.
* random.c (bow_random_reset_seed): New function.
(bow_random_set_seed): Make it work with the above function.
* multiclass.c (multiclass_iterated_mixture_given_doc_and_cis):
New function.
(multiclass_mixture_given_doc): Allow this to be called with a test
document too.
(multiclass_log_prob_of_classes_given_doc): Add commented-out code to
implement BIC.
(multiclass_explore_cis_greedy1): Bug fixes.
* cdmem.c (cdmem_parse_opt): Allow printing of the accuracy on the
unlabeled set.
(bow_cdmem_class_wi2dvf): New function.
(bow_cdmem_new_vpc_with_weights): Save original document types and
classes. Using new macro SET_CASCADE_TREE_WITH_ALL_DATA, allow
three different options for training the distance function. Allow
multiple rounds of CDM distance metric learning.
* cdm.c: Include <bow/train_dirichlet.h> instead of defining
train_dirichlet() extern here.
(bow_cdm_initialize_ct): Make it safe to call this function more than
once.
(bow_cdm_ct_set_alphas): Define COUNTS as double* instead of unsigned*.
Print the bottom-most word in the cascade tree. Don't assert DV.
* barrel.c (bow_barrel_add_from_text_dir): Instead of crashing
when failing to open a file, simply print warning.
1999-09-03 Gregory C Schohn <gcs@cmu.edu>
* svm.c: checkpoint - some bitrotting code may not work...
* svm.c: updated SMO to work with Keerthi, et al's modifications -
the heuristic is much better, but the running time on 20 newsgroups
is still slower than Thorsten's methods.
* svm.c: added a lot of active learning logging.
1999-08-18 Thomas P. Minka <minka@jprc.com>
* train_dirichlet.c, bow/train_dirichlet.h: Added train_sum_alpha
global variable.
1999-08-18 Andrew McCallum <mccallum@justresearch.com>
* bow/libbow.h: Declare new functions.
* bow/cdm.h: Update function prototypes to match.
* wv.c (bow_wv_copy): New function.
* wi2dvf.c (bow_wi2dvf_set_idf_to_count): New function.
(bow_wi2dvf_dv_hidden): New function.
* wa.c (bow_wa_remove): New function.
* vpc.c (bow_wi2dvf_sum): New function.
(bow_barrel_new_vpc): Move the updating of the CDOC->WORD_COUNT to
earlier in the function.
* rainbow.c (rainbow_test): In the test documents include words
that were previously removed from the training set by, for
example, feature selection.
* heap.c (bow_make_dv_heap_from_wi2dvf_hidden): New function.
* naivebayes.c (bow_naivebayes_pr_wi_ci): Allow m_est_m to be zero,
if set as such explicitly on the command-line.
(bow_naivebayes_total_word_count_for_ci): New function.
* ddf.c: Add ADDITIONAL_COUNT.
* dice.c: Add many command-line options.
* ctdf.c: Add handling for zerotons and unknown words.
* bpe.c: Include bpe_prior_alpha, and various other bug fixes.
* barrel.c: Clean up some verbosity messages.
(bow_barrel_set_idf_to_count_in_train): New function.
* train_dirichlet.c: Change name from gammaln_fast to gammaln, so
this function can be used, depending on the #define.
* cdm.c: Many completions and bug fixes.
* cdmem.c (SET_CASCADE_TREE_WITH_ALL_DATA): New macro.
(bow_cdmem_new_vpc_with_weights): Depending on above macro, use all
labels to set the distance metric.
1999-08-13 Gregory C Schohn <gcs@cmu.edu>
* svm.c: added active learning stuff - has a pretty bad selection
heuristic (subject to pathological cases), had to change
(modularize) different sections of the code...
1999-08-09 Gregory C Schohn <gcs@cmu.edu>
* svm.c: the removal of inconsistent examples (for the
Thorsten-like algorithm) is working, it still needs to be extended
for SMO.
1999-08-01 Gregory C Schohn <gcs@cmu.edu>
* svm.c: 3 bug fixes - one in the lagrange multiplier check, one
in the loop to call pr_loqo (the maximum number of iterations
wasn't increasing when pr_loqo could not converge), & a bug that
caused the equality constraint to fall apart when the working set
size was less than the maximum working set size (actually, rewrote
that block to be way more efficient).
* svm.c: Also played with the kernel cache in a lot of different
ways, a simple (but not too simple) solution that is committed
yields the best times (when the kernel cache is not grossly
smaller than the number of support vectors squared).
1999-07-28 William Morgan <wmorgan@jprc.com>
* archer_query_array.c: added
* archer_query_array.h: added
* archer_query_execute.c: added
* archer_query_execute.h: added
* archer_query_index.c: added
* archer_query_index.h: added
* pv.c (bow_pv_read_next_di_li_pi): removed useless assert()
* archer_query.c: reformatted for better GNU style
* archer_query.h: ditto
* archer.c (mem_error): added, as well as other stuff surrounded
by ARCHER_USE_MCHECK defines for optional memory checking
* archer-server.c (archer_server_query_new): made nquery command
call new query engine. right now this dumps core.
* Makefile.in (ARCHER_C_FILES): added archer_query_ files
(ARCHER_H_FILES): ditto
1999-07-19 Thomas P. Minka <minka@jprc.com>
* train_dirichlet.c (train_dirichlet_nr): Added the option to not
train the sum of alphas, only their ratios.
1999-07-16 Andrew McCallum <mccallum@justresearch.com>
* ddf.c: Debug and add smoothing. It smooths in the same
proportion that adding a pseudo-count of 1 would do for
naivebayes.
1999-07-16 William Morgan <wmorgan@jprc.com>
* Makefile.in: fixed small bug that occasionally caused make to
overwrite archer_query.c
1999-07-16 Gregory C Schohn <gcs@cmu.edu>
* svm.c: new version - about 3 times as fast thanks to re-using
error-cache values for the kkt conditions instead of
re-calculating them each time. A valid cache bitmap was also
added...
* svm.c: made semi-small bug fixes, like precision checks, the
removal of a bogus heuristic check & some small coding bugs...
1999-07-16 William Morgan <wmorgan@jprc.com>
* bow/archer.h: removed ARCHER_MAX_LABEL_PARAMS (cruft)
* flex_mail.lex: modified function naming scheme to work with the
new way archer lex files are handled
* configure.in: added AC_PROG_LEX to configure lexer generator
* archer_query.y: added
* archer_query.lex: added
* archer_query.h: added
* archer_query.c: added
* archer.c (archer_query_hits_matching_sequence): fixed bug that
caused archer to hang
* Makefile.in: added rules for .lex and .y files for archer
* archer-server.c (archer_server_query_new): added
(archer_server_query_hits_matching_sequence): fixed bug that
caused archer to hang
1999-07-15 Jason Reed <jcreed@cyclone.jprc.com>
* archer.c (archer_index_filename_flex): Don't write redundant
wi2pv information.
1999-07-13 Jason Reed <jcreed@cyclone.jprc.com>
* archer-server.c (archer_server_query): Include terms matched in
query results.
* archer-server.c: Added 'hits' command to select a range of hits
to show.
(No need to send N tens of thousands of hits over the
socket when someone searches for 'artificial intelligence' or
something equally general)
1999-07-13 Gregory C Schohn <gcs@cmu.edu>
* svm.c: includes smo, which works fine, but is slower than
thorsten's algorithm... for now.
1999-07-13 Thomas P. Minka <minka@jprc.com>
* train_dirichlet.c (newton_step): Don't try to change an alpha
which is zero, because its gradient will always be negative.
1999-07-12 Thomas P. Minka <minka@jprc.com>
* train_dirichlet.c (gammaln,digamma,trigamma): Changed to a
higher precision algorithm. No effect on Dirichlet fitting,
however.
1999-07-12 Jason Reed <jcreed@cyclone.jprc.com>
* lex-suffixing.c (bow_lexer_suffixing_postprocess_word): Fixed
off-by-one bug, I think.
* archer.c: Does incremental writes in query server. (Does *not*
do pure incremental writes if we are just doing an --index)
Removed label name and document name fields from archer_labels
and archer_docs entries, since they made entries variable length.
* archer-server.c: Added indexing capability. (on second socket)
Always use simple lexer for query lexing, independent of
data lexing.
Do 'xxx' suffixing iff appropriate.
* wi2pv.c: Made incremental.
* int4str.c: Added incremental functions.
* int4word.c: likewise.
* sarray.c: likewise.
* array.c: likewise.
* bow/archer.h: likewise.
* bow/libbow.h: likewise.
1999-07-09 Thomas P. Minka <minka@jprc.com>
* train_dirichlet.c (train_dirichlet_sparse, train_dirichlet_nr,
logProb, moment_match): Added extra_count parameter.
1999-07-08 William Morgan <wmorgan@jprc.com>
* archer-server.c (archer_query_socket_init): added SIGPIPE
handling; archer server mode now no longer crashes as easily
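The entry above does not say how SIGPIPE is handled. One common approach
on POSIX systems, shown here only as a sketch, is to ignore the signal so
that a write to a socket closed by the client fails with EPIPE instead of
killing the server. The helper name ignore_sigpipe is hypothetical and is
not taken from archer-server.c.

```c
#include <signal.h>

/* Hypothetical helper, not the archer-server.c code: ignore SIGPIPE so
   that writing to a closed socket makes write() return -1 with errno
   set to EPIPE instead of terminating the process.  Returns 0 on
   success and -1 on failure. */
int
ignore_sigpipe (void)
{
  return signal (SIGPIPE, SIG_IGN) == SIG_ERR ? -1 : 0;
}
```

A server would call this once before accepting connections and then
check the return value of every write to the socket.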
1999-07-08 Kamal Nigam <knigam@server5.jprc.com>
* Makefile.local (RAINBOW_METHOD_C_FILES): Added emsimple.c.
1999-07-08 Andrew McCallum <mccallum@justresearch.com>
* Makefile.local (RAINBOW_METHOD_C_FILES): Added nbsimple.c
* naivebayes.c (bow_naivebayes_score): Initialize NUM_SCORES to
get rid of GCC warning.
* crossbow.c (crossbow_options): New option --use-vocab-in-file.
(struct crossbow_arg_state): New element VOCAB_MAP.
(crossbow_doc_read): Read the DOC->CIS.
(crossbow_index_multiclass_list): Implement
BOW_PRUNE_VOCAB_BY_OCCUR_COUNT_N.
(crossbow_index): Handle the VOCAB_MAP.
* hem.c (crossbow_hem_em_one_iteration): Print the perplexity
before returning.
1999-07-02 Andrew McCallum <mccallum@justresearch.com>
* ddf.c: Random bug fixes. Added option --ddf-prior-alpha.
1999-07-08 Gregory C Schohn <gcs@cmu.edu>
* tfidf.c (tfidf): changed cdocs->length to ndocs so that only
those documents which could change the df value are considered.
* svm.c: added support for tfidf scoring for each submodel
(could easily be extended to any type of scoring...)
1999-07-02 Gregory C Schohn <gcs@cmu.edu>
* Makefile.in: added rules for svm.c & pr_loqo.c which are
filled in by configure, if they exist...
* configure.in: added a check for pr_loqo.h & pr_loqo.c, if they
are there, the makefile will build libbow with its complete svm
package, otherwise, svm.c is ignored.
* rainbow.c (main): added support for build-and-save &
test-from-saved (allowing the user to build a model,
then reuse it on successive runs). Also added support
for svm.c.
* wi2dvf.c (bow_wi2dvf_add_di_wv): bow_dv_add_di_count_weight
now adds the weight value to the dv.
* naivebayes.c (bow_naivebayes_score): added
naivebayes_score_returns_doc_pr flag so that P(X|C) is returned
instead of P(C|X). Added naivebayes_score_unsorted so that the
array is returned in unsorted (ie. each ith index is for the
ith class).
* bow/naivebayes.h: added naivebayes_score_returns_doc_pr &
naivebayes_score_unsorted globals so that naivebayes.c can be
extended for the fisher kernel in svm.
* svm.c: fixed a couple of typos & changed some outdated code.
* bow/svm.h: Initial check-in.
1999-07-02 William Morgan <wmorgan@jprc.com>
* Makefile.in (ARCHER_C_FILES): added required files for
archer compilation that had been lost previously; added
lex -> c rule
* Makefile.local: removed unnecessary lexing rule (now in
Makefile.in)
* annotation.c: added GPL header
* labels.c: ditto
* server.c: moved to archer-server.c
* tagged.lex: moved to tagged_lex.lex
1999-07-01 Thomas P. Minka <minka@jprc.com>
* train_dirichlet.c: Changed several functions to use the new
sparse iterator scheme.
(moment_match): Uses n_group_by_key instead of n_group.
1999-07-01 Andrew McCallum <mccallum@justresearch.com>
* ddf.c: Fix argument types for train_dirichlet_sparse, and call
with correct types.
* bow/libbow.h (bow_iterator_double): New type.
* barrel.c: Added an iterator for the columns of a barrel that
match a class.
(bow_barrel_iterator_for_ci_new): New function.
* array.c: Added (commented-out) code for an iterator over a cdoc.
* train_dirichlet.c (train_dirichlet_nr): Initialize old_logProb
to 0 to get rid of gcc warning.
* ddf.c: Use new iterator.
* Makefile.in: Drastically rearranged to make different sections
for different libbow front-ends.
* Makefile.local: Updated to handle new Makefile.in organization.
Removed rules for Pete Su's old archer query parser.
* Makefile.preamble: Emptied. I think this file is no longer
necessary.
1999-07-01 Thomas P. Minka <minka@jprc.com>
* train_dirichlet.c (train_dirichlet): Changed to call
train_dirichlet_nr for the work.
(train_dirichlet_sparse): Same as train_dirichlet but for sparse
counts.
(train_dirichlet_nr): General Newton-Raphson with option for
sparse counts.
(moment_match): Changed to save memory. Added option for sparse
counts.
(logProb): Added option for sparse counts.
1999-06-30 William Morgan <wmorgan@jprc.com>
* annotation.c: new file
* bow/archer.h: added annotation function interfaces and structs
* server.c (archer_query_socket_init): added annotation handling
code
(archer_server_query): ditto
* opts.c (bow_options): added ANNOTATION_KEY option
* Makefile.local (ARCHER_C_FILES): added annotation.c
1999-06-29 William Morgan <wmorgan@jprc.com>
* bow/libbow.h: added USE_TAGGED_FLEXER
* bow/archer.h: changed lexer interfaces slightly, and moved a few
things from archer.c
* tagged.lex: created
* flex_mail.lex (flex_mail_get_word_extended): added
* opts.c (bow_options): added FLEX_TAGGED_KEY
* archer.c (archer_index_filename_flex): added tagged flexer
option
(archer_query_hits_matching_wi): fixed small bug
(archer_query_hits_matching_sequence): fixed another small bug
(archer_query_socket_init): removed (to server.c)
(archer_query_server_process_commands): ditto
(archer_query_serve_one_query): ditto
(archer_query_serve): ditto
* Makefile.local (ARCHER_C_FILES): added flex_mail.c,
tagged_lex.c and server.c
* server.c: created. moved all server code from archer.c to here.
1999-06-28 Andrew McCallum <mccallum@justresearch.com>
* ddf.c (bow_ddf_dirichlet_from_doc_word_counts): Make it use the
train_dirichlet_sparse().
* ddf.c: New file.
1999-06-15 Thomas P. Minka <minka@jprc.com>
* train_dirichlet.c (train_dirichlet): Changed to always be
conservative.
1999-06-11 Thomas P. Minka <minka@jprc.com>
* train_dirichlet.c (train_dirichlet): Handles the all zero case
properly.
* train_dirichlet.c (main): Reads count data from stdin.
1999-06-11 Kamal Nigam <knigam@zeno.jprc.com>
* cdm.c (bow_cdm_ct_set_alphas): change asserts to allow 0 alphas
* cdmem.c (bow_cdmem_new_vpc_with_weights): Fix word count = 0
case
1999-06-11 Kamal Nigam <knigam@server6.jprc.com>
* train_dirichlet.c (train_dirichlet): changes from Tom.
* naivebayes.c (bow_naivebayes_score): Fixed memory trashing bug.
* em.c (bow_em_score): cosmetic fixes only.
* cdmem.c (bow_cdmem_new_vpc_with_weights): Fixed invocation of
bow_barrel_score.
* cdm.c (bow_cdm_word_probs_using_ct_alphas): Removed some code at
Andrew's request. Changed assert to allow some roundoff error.
(bow_cdm_score): Fixed memory-trashing bug.
1999-06-10 Kamal Nigam <knigam@zeno.jprc.com>
* bow/libbow.h (bow_barrel_set_weights): checked for null function
* Makefile.preamble (EXTRA_METHOD_C_FILES): added cdmem.c
* cdm.c (bow_cdm_word_probs_using_ct_alphas): added assert to
check for NaN
(bow_cdm_print_word_probs): Removed superfluous exit.
(bow_cdm_new_vpc): Removed diagnostic.
1999-06-10 Andrew McCallum <mccallum@justresearch.com>
* train_dirichlet.c: Use new improved method with iteration. It
also now works for Dirichlet densities of arbitrary size, not just
Betas. (From Tom Minka.)
* cdm.c (bow_cdm_word_probs_using_ct_alphas): Fix the setting of
the bottom-most word in the cascade tree.
* cdm.c: Several bug fixes. Now runs.
* Makefile.preamble (EXTRA_METHOD_C_FILES): Added bpe.c, cdm.c and
train_dirichlet.c.
* train_dirichlet.c (train_dirichlet): For now, don't do newton
iterations.
* rainbow.c (rainbow_unarchive): Handle the case in which the
OUTPUTNAME_FILENAME doesn't exist in the model directory.
* opts.c (parse_bow_opt): Add "dw" as an alias for
"document-then-word".
(MAX_NUM_CHILDREN): Upped from 10 to 100.
(bow_argp_add_child): Fix assertion to complain if we overrun again.
* int4word.c (bow_words_add_occurrences_from_file): New function.
(bow_words_add_occurrences_from_text_dir): Use it.
* info_gain.c (bow_infogain_wa): New function.
* barrel.c (bow_barrel_add_document): Add comment questioning
assert().
* bow/libbow.h: Declare new functions.
* multiclass.c: Changed total_num_mixtures_possible calculation.
Changed palpha from 1.0 to 0.01. Changed malpha from 0 to 1.
Changed pruning class set size from 4 to 3. Print a warning if
the correct class vector was never evaluated.
1999-06-10 Andrew McCallum <mccallum@justresearch.com>
* cdm.c: Implemented but not tested.
1999-06-10 Kamal Nigam <knigam@zeno.jprc.com>
* bow/em.h (bow_em_set_priors_using_class_probs): New prototype.
* cdm.c (bow_cdm_score): Implemented.
1999-06-09 Andrew McCallum <mccallum@justresearch.com>
* lex-suffixing.c (bow_lexer_suffixing_postprocess_word): Before
rewinding to the beginning of the file (in order to lex without
adding suffixes), not only check for two newlines in a row, but
also check for the end of the file, so that files without \n\n and
without a trailing \n get processed both with and without suffixes
added.
1999-05-15 Andrew McCallum <mccallum@justresearch.com>
* multiclass.c: Back off the per-class-set mixture distribution in
a way loosely based on shrinkage. More bug fixes.
1999-05-14 Andrew McCallum <mccallum@justresearch.com>
* multiclass.c: Many bug fixes and enhancements to class set
search.
* lex-japanese.c (bow_lexer_japanese_get_word): Minor bug fixes.
* multiclass.c: Overhauled version with better search in class
vector space.
* Makefile.preamble: Instead of conditioning on "ifdef unix",
condition on "ifndef WIN32", since "unix" wasn't defined on UNIX.
Fix this properly later.
(EXTRA_LIBBOW_H_FILES): Add more.
* Makefile.local (DIST_ALL_FILES): Add extra Makefiles.
1999-06-01 Jason Reed <jcreed@cyclone.jprc.com>
* archer.c (archer_index_filename_old_lex): removed erroneous
redundant code
1999-05-28 Jason Reed <jcreed@cyclone.jprc.com>
* flex_mail.lex: Fixed archer.h include directive
* Makefile.local (flex_mail.o): Added.
(archer): Removed redundant target.
1999-05-27 William Morgan <wmorgan@jprc.com>
* bow/libbow.h (bow_flex_type): new enumeration
* bow/archer.h (BOW_MAX_WORD_LABELS): new #define
(archer_label): new typedef
* pv.c (PV_WRITE_SIZE_INT): fixed off-by-one bug (I think)
(bow_pv_write_size_di_li_pi): new function
(bow_pv_write_next_di_li_pi): likewise
(bow_pv_read_next_di_li_pi): likewise
(bow_pv_add_di_li_pi): likewise
(bow_pv_next_di_li_pi): likewise
* opts.c (bow_options): added flex-mail option
(parse_bow_opt): likewise
* labels.c: new file
* flex_mail.lex: new file
* deflexer.c (bow_flex_option): new variable
* archer.c (archer_label_write): new function
(archer_label_read): new function
(archer_label_free): new function
(archer_label_write): new function
(archer_archive): added label data files
(archer_unarchive): likewise
(archer_index_filename_flex): new function
(archer_query_hits_matching_sequence): added support for labels
(archer_print_all): likewise
* wi2pv.c (bow_wi2pv_wi_next_di_li_pi): new function
(bow_wi2pv_wi_add_di_li_pi): new function
1999-05-13 Andrew McCallum <mccallum@justresearch.com>
* Makefile.local (snapshot-all): Fix typo.
* lex-japanese.c (bow_lexer_japanese_get_word): Better handle
mixed English and Japanese by doing extra handling for the
English: downcase, include only alphabetics and postprocess with
the simple lexer.
* crossbow.c (crossbow_doc_read): Initialize CIS_MIXTURE.
(crossbow_unarchive): Use bow_array_new_with_entry_size_from_data_fp
so that it will work with old models that don't have the
CIS_MIXTURE.
* rainbow.c (rainbow_test): Add error messages warning that -O and
-D are not implemented here.
* bow/crossbow.h (crossbow_doc): Add element CIS_MIXTURE.
* array.c (bow_array_new_with_entry_size_from_data_fp): New
function.
* bow/libbow.h: Declare new function.
* Makefile.preamble: Put file names in EXTRA_* variables.
* Makefile.local (snapshot-all): New target, and support for this
target.
1999-05-06 Kamal Nigam <knigam@zeno.jprc.com>
* maxent.c (maxent_gaussian_prior): New variable for gaussian
prior option.
(maxent_prior_variance): Likewise.
(maxent_halt_accuracy_docs): New variable for new option
--maxent-halt-by-accuracy
(maxent_options,maxent_parse_opt,maxent_newton,
maxent_calculate_accuracy,
bow_maxent_new_vpc_with_weights_doc_then_word,
bow_maxent_new_vpc_with_weights):
Code for new options --maxent-halt-by-accuracy and
--maxent-gaussian-prior and --maxent-prior-variance
1999-05-03 Robert Stockton <rgs@jprc.com>
Updated to run under GNU on win32
* Makefile.preamble [WINNT]: Conditionalized Makefile.preamble to
specify different switches and libraries for Win32. This includes
turning off the "hdb" support which requires unavailable libraries,
and adding -liberty (sic) to enable deprecated features of the gnu
command line parser.
* archer.c (archer_archive, archer_unarchive): updated bow_fopen
calls for data files to specify "binary" mode. This should have no
effect under UNIX, but is necessary for proper operation under win32.
* arrow.c (arrow_archive, arrow_unarchive): likewise
* barrel.c (bow_barrel_new_from_data_file): likewise
* crossbow.c (crossbow_archive, crossbow_unarchive,
crossbow_index_multiclass_list, crossbow_index): likewise
* int4word.c (bow_words_write_to_file, bow_words_read_from_fp,
bow_words_read_from_file, bow_words_reread_from_file): likewise
* rainbow-h.c (hier_barrel_write_to_file, hier_barrel_new_from_file,
set_vocabulary_from_file): likewise
* rainbow.c (rainbow_archive, rainbow_unarchive): likewise
* wi2dvf.c (bow_wi2dvf_write_data_file,
bow_wi2dvf_new_from_data_file): likewise
* wi2pv.c (bow_wi2pv_new, bow_wi2pv_write_to_filename,
bow_wi2pv_new_from_filename, bow_wi2pv_reopen_pv): likewise
* arrow.c [WINNT] (#includes, arrow_socket_init): Conditionalized
socket code to deal with the lack of unix sockets under win32.
* rainbow.c [WINNT] (#includes, rainbow_socket_init): likewise
* lex-gram.c (bow_lexer_gram_get_word): Fixed the BOW_MCHECK option so
that it doesn't try to call "mprobe" if the option is disabled.
Fixed up demo script
* demo-script: Rainbow no longer defaults to --test-set=0.30, so I
explicitly added --test-set=0.34 to the command line. (The value
was changed to "1/3 of the data" to correspond to the print
statements.)
* demo-script [WINNT]: Explicitly invoke perl for "rainbow-stats" so
that the script should run better under win32 (while still working
properly under unix).
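The binary-mode change described in the entry above can be illustrated
with a small wrapper that appends "b" to the fopen mode string. This is a
sketch under assumptions: the names binary_mode and open_data_file are
hypothetical, and the code is not libbow's bow_fopen.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy MODE into OUT, appending "b" unless MODE
   already contains it.  Returns 0 on success, -1 if OUT is too small. */
int
binary_mode (const char *mode, char *out, size_t out_size)
{
  size_t len;
  int has_b;

  assert (mode != NULL && out != NULL);
  len = strlen (mode);
  has_b = strchr (mode, 'b') != NULL;
  if (len + (has_b ? 1 : 2) > out_size)
    return -1;
  strcpy (out, mode);
  if (!has_b)
    strcat (out, "b");
  return 0;
}

/* Hypothetical wrapper: open FILENAME in binary mode.  The "b" flag has
   no effect on POSIX systems but prevents newline translation on
   win32. */
FILE *
open_data_file (const char *filename, const char *mode)
{
  char bmode[8];

  if (binary_mode (mode, bmode, sizeof bmode) != 0)
    return NULL;
  return fopen (filename, bmode);
}
```

Calling open_data_file (name, "r") then behaves like fopen (name, "rb"),
so the same call sites work on both platforms.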
1999-04-29 Andrew McCallum <mccallum@justresearch.com>
* vpc.c (bow_barrel_new_vpc): Reset the word counts in the
document barrel based on the (possibly reduced) vocabulary in
training set, not in the training+test set. This change has no
effect on naive Bayes classification.
1999-04-27 Andrew McCallum <mccallum@justresearch.com>
* multiclass.c: New overhauled version that represents class sets
sparsely.
1999-04-26 Andrew McCallum <mccallum@justresearch.com>
* hem.c (crossbow_hem_create_children_for_node): Copy word counts
into WORDS, not NEW_WORDS, because then
set_new_words_from_perturbed_words() will fill in NEW_WORDS by
perturbing WORDS. This was previously causing children to be set
completely randomly! (Bug reported by Doug Baker.)
* bow/treenode.h (treenode): Add variables for holding LOO data in
a file, but leave them commented out for now.
* Makefile.local (CROSSBOW_C_FILES): Add lex-japanese.c.
* Makefile.preamble (METHOD_C_FILES): Add lex-japanese.c and
maxent.c.
* istext.c (bow_is_text_always_yes): New global variable.
(bow_fp_is_text): Obey it.
* bow/libbow.h (bow_is_text_always_yes): Declare variable.
* treenode.c (bow_treenode_set_leaf_prior_from_new_prior_all):
Handle "/Misc/" nodes.
* hem.c (crossbow_hem_lambdas_from_validation): New static
variable.
(crossbow_hem_options): New command-line option
"hem-lambdas-from-validation".
(crossbow_hem_parse_opt): Handle it.
(crossbow_hem_em_one_iteration): Don't skip docs with validation tag
and use them for setting lambda if
CROSSBOW_HEM_LAMBDAS_FROM_VALIDATION. Skip nodes named "Misc".
Fix the end of the M-step so it works even if vertical word
movement is turned off.
(crossbow_hem_full_em): Set some of the unlabeled documents to
validation documents if CROSSBOW_HEM_LAMBDAS_FROM_VALIDATION.
(crossbow_hem_full_em): Print the leaf prior.
(hem_cluster_method): Fix ordering.
1999-04-08 Kamal Nigam <knigam@server8.jprc.com>
* maxent.c (maxent_options): Added options --maxent-iterations,
--maxent-keep-features-by-mi, --maxent-halt-by-logprob,
--maxent-logprob-constraints, --maxent-smooth-counts,
--maxent-scoring-hack.
(maxent_parse_opt): Likewise
(maxent_calculate_accuracy): takes extra argument; calculates accuracy
or logprob.
(maxent_prune_vocab_by_mutual_information): New function.
(maxent_newton): New function.
(bow_maxent_new_vpc_with_weights_doc_then_word): New function for
document-then-word event model. Code for new features.
(bow_maxent_new_vpc_with_weights): Code for new features.
(bow_maxent_score): Code for scoring hack option.
(bow_method_maxent): changed prior function.
1999-03-30 Andrew McCallum <mccallum@justresearch.com>
Fix --gram-size so it works again.
* opts.c (parse_bow_opt): Use the correct type to sizeof when
allocating a lex for --gram-size.
* lex-gram.c (bow_lexer_gram_open_text_fp): Realloc the LEX,
making it big enough that we don't overrun the buffer malloced by
bow_lexer_next_open_text_fp(). Note that this function passing
SELF->NEXT, and thus uses the wrong SIZEOF_LEX when mallocing the
lex it returns. (In order to be properly object-oriented, we
would need to separate object data from object methods a little
more.)
* crossbow.c (crossbow_classify_docs_in_dirname): Remove unused
local variables.
1999-03-29 Andrew McCallum <mccallum@justresearch.com>
* hem.c (crossbow_hem_em_one_iteration): Calculate WORD_DEPOSIT
and LAMBDA_DEPOSIT separately, and always deposit probability in
the leaves and ancestors, even if we don't expect those
distributions to change as a result. Use or don't use
ANCESTOR_MEMBERSHIP to ensure the right thing here, depending on
CROSSBOW_HEM_VERTICAL_WORD_MOVEMENT. Set NEW_LAMBDAS even if
there is no shrinkage. Always set WORDS from NEW_WORDS,
independent of various CROSSBOW_HEM_* flag settings. Don't ever
use smoothing to set the lambdas.
* crossbow.c (crossbow_classify_doc): Remove DI variable, and get
WV more straightforwardly.
1999-03-29 Kamal Nigam <knigam@server4.jprc.com>
* crossbow.c (crossbow_classify_doc): initialize di to bogus
value. Return a value.
(crossbow_classify_tagged_docs): Count documents tested correctly.
(crossbow_classify_docs_in_dirname): have classify_filename return
a value.
* hem.c (iteration_limit): Removed.
(crossbow_hem_em_one_iteration): Set words from new_words no matter
what. If no shrinkage, use prior.
(crossbow_hem_full_em): If no shrinkage, initialize words using a
prior, and set lambdas all at the leaves. Use
crossbow_hem_max_num_iterations to stop EM iterating.
1999-03-19 Kamal Nigam <knigam@server4.jprc.com>
* vpc.c (bow_barrel_new_vpc): added comment about setting
cdoc->word_count.
* maxent.c (maxent_num_iterations): new variable and code for
option --maxent-iterations.
(maxent_options): Likewise.
(maxent_parse_opt): Likewise.
(bow_maxent_new_vpc_with_weights): Likewise. Also added correct
normalization of step size (1/doc-length) when calculating
delta_i.
1999-03-19 Kamal Nigam <knigam@zeno.jprc.com>
* maxent.c: initial checkin.
1999-03-17 Kamal Nigam <knigam@server4.jprc.com>
* em.c (em_parse_opt): change bow_cdoc_is_model to
bow_cdoc_is_train
1999-03-19 Andrew McCallum <mccallum@justresearch.com>
* lex-suffixing.c (suffixing_snarf_suffix): Only include
alphabetic characters in the prefix. This way they can be
included in queries.
(bow_lexer_suffixing_postprocess_word): Exclude from the general
non-suffixing indexing all lines with suffixes beginning with
"reference"; not just the ones that match "reference" exactly.
This way "Reference-contexts:" will also be excluded.
* hem.c (crossbow_hem_max_num_iterations): New static variable.
(crossbow_hem_options): New command-line option
"hem-max-num-iterations".
(crossbow_hem_parse_opt): Handle it.
(crossbow_hem_cluster): Obey it.
* crossbow.c (crossbow_classify_doc): Correctly index WA->ENTRY
when VERBOSE is 1.
1999-03-18 Andrew McCallum <mccallum@justresearch.com>
* archer.c (archer_index_filename): Close FP before returning 0 if
LEX is NULL. Otherwise we leave too many files open.
* crossbow.c (crossbow_print_matrix): New function.
(crossbow_options): New command-line option "print-matrix".
(crossbow_parse_opt): Handle it.
1999-03-17 Andrew McCallum <mccallum@justresearch.com>
* opts.c (parse_bow_opt) [LEX_SUFFIXING_KEY]: Set the ULEX->NEXT
*after* we do the memcpy so it doesn't get overridden.
* lex-suffixing.c (bow_lexer_suffixing_open_text_fp): Return 0 if
the first line doesn't have a header.
(bow_lexer_suffixing_postprocess_word): Make an initial attempt at
handling multi-line email headers in which the second line begins
with a tab.
(bow_lexer_suffixing_get_word): Explicitly check the DOCUMENT_POSITION
to see if we are at the end of the file.
* treenode.c (bow_treenode_set_lambdas_from_new_lambdas): If macro
MISC_STAYS_FLAT is non-zero, then if the tree contains "/Misc/",
set the lambdas to uniform.
1999-03-17 Kamal Nigam <knigam@zeno.jprc.com>
* bow/naivebayes.h (bow_naivebayes_pr_wi_ci): Made public
* bow/libbow.h (bow_dv_set_di_count_weight): New prototype.
(bow_wi2dvf_set_wi_di_count_weight): New prototype.
* wi2dvf.c (bow_wi2dvf_set_wi_di_count_weight): New function.
Like add_wi_di_count_weight but sets the value directly.
* dv.c (bow_dv_set_di_count_weight): New function. Like
add_di_count_weight, but sets it directly.
* em.c (em_print_correct): Removed. Generalized this to
em_halt_using_accuracy and em_accuracy_docs and em_accuracy_loo,
to allow for arbitrary sets of documents to be used for halting by
accuracy. Added options (--em-halt-using-accuracy and
--em-print-accuracy) and code for this. Also generalized
em-halt-using-perplexity in a similar fashion. Generalized
em_test_accuracy to em_calculate_accuracy.
(em_halt_using_accuracy): Likewise
(em_accuracy_docs): Likewise.
(em_accuracy_loo): Likewise.
(em_options): Likewise.
(em_parse_opt): Likewise.
(bow_em_new_vpc_with_weights): Likewise.
(em_calculate_accuracy): Likewise.
(em_test_accuracy): Likewise.
(bow_em_score): Likewise. Changed for accuracy-loo
1999-03-16 Andrew McCallum <mccallum@justresearch.com>
* treenode.c (bow_treenode_set_words_from_new_words): Change logic
around setting TN->NEW_WORDS_NORMALIZER. Obey MISC_STAYS_FLAT
macro.
* multiclass.c (multiclass_score): New type.
(multiclass_classify_doc): Use it. Sort scores after evaluating
class sets of size one. Potentially use this to prune the set
tried with larger combinations.
(multiclass_classify_doc_into_single_class): New function. Unused.
(multiclass_cis_is_in_top): New function. Unused.
(compare_multiclass_scores): Renamed from nested function.
* hem.c (crossbow_hem_garbage_collection): New static variable.
(crossbow_hem_options): New command-line option
"hem-garbage-collection".
(crossbow_hem_parse_opt): Handle it.
(crossbow_hem_full_em): Add /Misc/ nodes if doing statistical garbage
collection.
1999-03-13 Andrew McCallum <mccallum@justresearch.com>
* multiclass.c (MAX_CLASSES): Increase from 20+1 to 100+1.
(multiclass_log_prob_of_classes_given_doc): Back off MIXTURE if there
is no local data available.
(multiclass_classify_doc): Malloc score instead of alloca'ing, because
it can now be bigger than the stack.
1999-03-12 Andrew McCallum <mccallum@justresearch.com>
* rainbow.c (rainbow_options): Command-line option
"index-printed-barrel" name changed to "index-matrix", to match
"print-matrix". (Reported by Doug Baker.)
1999-03-06 Andrew McCallum <mccallum@justresearch.com>
* bow/treenode.h: Fix typo in #ifndef at top of file. Include
<bow/libbow.h>. Declare new functions.
* bow/libbow.h (BOW_DEFAULT_FILE_FORMAT_VERSION): Bumped from 6 to
7. Crossbow now writes multiclass "CIS" information, and also
explicitly writes the CROSSBOW_CLASSES_COUNT.
* bow/crossbow.h (crossbow_doc): Add elements CIS_SIZE and CIS to
hold multiclass information.
(crossbow_method): Add index and clasify_doc functions.
* treenode.c: Replace free() with bow_free().
(bow_treenode_realloc_words_all): New function.
(bow_treenode_free_loo_and_new_loo): New function.
(bow_treenode_free_loo_and_new_loo_all): New function.
(bow_treenode_set_words_from_new_words): Comment out part of
MISC_STAYS_FLAT code that was zero'ing NEW_WORDS_NORMALIZER. This
was messing up LOO processing, and I didn't understand why it was
done in the first place.
(bow_treenode_set_prior_from_new_prior_all): New function.
(bow_treenode_set_prior_and_extra_from_new_prior_all): New function.
(bow_treenode_pr_wi_loo_local): Add more assertions.
* rainbow.c (rainbow_test): Use bow_cdoc_is_train instead of old
bow_cdoc_is_model.
* hem.c: Use new crossbow_classify_tagged_docs arguments. Add
elements to crossbow_method instances.
* crossbow.c (struct crossbow_arg_state): New member
MULTICLASS_LIST_FILENAME.
(crossbow_doc_write): If BOW_FILE_FORMAT_VERSION >= 7, write the CIS
information.
(crossbow_doc_read): Likewise for reading.
(crossbow_archive): Explicitly write the CROSSBOW_CLASSES_COUNT.
(crossbow_unarchive): If available, read it.
(crossbow_index_filename): New function, split out from
crossbow_index() and its embedded function write_wv_to_fp().
(crossbow_index_multiclass_list): New function.
(crossbow_index): Use crossbow_index_filename.
(crossbow_classify_doc): New function, split out from
crossbow_classify_tagged_docs().
(crossbow_classify_tagged_docs): Use it.
(crossbow_classify_docs_in_dirname): Likewise.
(crossbow_classify): Use the method's TRAIN_CLASSIFIER.
(crossbow_options): New command-line argument "index-multiclass-list".
(crossbow_parse_opt): Handle it.
(main): Don't unarchive if doing crossbow_index_multiclass_list.
* Makefile.local (CROSSBOW_C_FILES): Added multiclass.c.
1999-02-25 Andrew McCallum <mccallum@justresearch.com>
* bow/crossbow.h (crossbow_method): Removed classify function,
added cluster function
* hem.c: Various bow_doc fixes.
(hem_cluster_method, hem_classify_method, hem_fienberg_method): New
method structures.
(_register_method_hem): Implemented.
* crossbow.c (crossbow_classify_all_docs): Deleted function.
(crossbow_classify_tagged_docs): Add new argument FILE *out.
(crossbow_cluster): Use new method function.
(crossbow_classify): Likewise.
(main): Fix for new lexer organization.
* next.c (bow_cdoc_is_train): Renamed from bow_cdoc_is_model. All
callers changed.
* active.c: Use renamed function bow_cdoc_is_train instead of
bow_cdoc_is_model.
* mix.c: Use renamed function bow_cdoc_is_train instead of
bow_cdoc_is_model.
* em.c (em_parse_opt): Use renamed function bow_cdoc_is_train.
* methods.c (bow_method_register_with_name): Add new parameter
SIZE. All callers changed.
* readme.texi: Add required --test-set=0.5
1999-02-23 Sean Slattery <jslttery@rote.learning.cs.cmu.edu>
* rainbow-ac.pl: As with rainbow-pr.pl. Binning is going to work
interestingly here....
* rainbow-pr.pl: Updated so that points are not output in the
middle of a run of predictions with the same confidence.
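The tie handling described for rainbow-pr.pl above can be sketched in C
as follows. The types and the function name pr_points are hypothetical
and are not taken from the Perl script; the sketch only shows the idea of
emitting a precision/recall point at the end of each run of equal
confidences.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types, not taken from rainbow-pr.pl. */
struct prediction { double confidence; int correct; };
struct pr_point { double precision; double recall; };

/* PREDS must be sorted by decreasing confidence.  Writes at most N
   points to OUT and returns how many were written.  A point is emitted
   only after the last prediction of each run of equal confidences, so
   the arbitrary order inside a tie cannot affect the curve. */
size_t
pr_points (const struct prediction *preds, size_t n, size_t total_correct,
           struct pr_point *out)
{
  size_t i, hits = 0, count = 0;

  assert (preds != NULL && out != NULL);
  for (i = 0; i < n; i++)
    {
      if (preds[i].correct)
        hits++;
      if (i + 1 < n && preds[i + 1].confidence == preds[i].confidence)
        continue;               /* still inside a run of ties */
      out[count].precision = (double) hits / (double) (i + 1);
      out[count].recall
        = total_correct ? (double) hits / (double) total_correct : 0.0;
      count++;
    }
  return count;
}
```

For four predictions where the middle two share a confidence, this
yields three points rather than four.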
1999-02-21 Jason Rennie <jrennie@data.jprc.com>
* rainbow.c: don't seg fault if the rainbow_doc_barrel is NULL
1999-02-20 Sean Slattery <jslttery@rote.learning.cs.cmu.edu>
* knn.c (bow_knn_normalise_weights): Cosine normalisation is by
vector length and not by summing. The query normalisation was fine.
* knn.c: Added lots more documentation. Made the weight entry of
each term in each document have the required IDF factored
in. Needed to do this so the cosine normalisation would do the
right thing. Much happier that this implementation is closer to
correct. Also took out the score normalisation at the end.
1999-02-18 Andrew McCallum <mccallum@justresearch.com>
Lexers reorganized. bow_lexer_simple and bow_lexer_indirect
removed. bow_lexer now has NEXT pointer, so all lexers can be
chained. Many lexing parameters now in global variables
bow_lexer_* declared in lex-simple.c; this way when --lex-white
follows --skip-headers, the default DOCUMENT_START_PATTERN in
--lex-white doesn't override the --skip-headers option. Order of
lexers still matters when chaining, like: --lex-white --skip-html
--lex-suffixing.
* Makefile.in (STANDARD_LIBBOW_C_FILES): Remove lex-indirect.c.
Add lex-next.c.
* opts.c (bow_options): New command-line options "shortest-word"
and "lex-alphanum".
(parse_bow_opt): Handle them, and update to new lexer organization.
* lex-suffixing.c: Updated for new lexer organization.
* lex-simple.c: Likewise.
* lex-html.c: Likewise.
* lex-gram.c: Likewise.
* deflexer.c: Likewise.
* rainbow.c: Likewise.
* arrow.c: Likewise.
* lex-indirect.c: Half-way updated, and then deleted in favor of
lex-next.c.
* bow/libbow.h: Updated for new lexer organization.
(bow_realloc_hook): Declared new function pointer.
(bow_realloc): Call new hook instead of malloc's hook.
* bmalloc.c (_bow_realloc_hook): New function.
(bow_realloc_hook): New global variable.
* methods.c (bow_method_register_with_name): When creating the
sarray, allow space of size rainbow_method, not bow_method. This
needs to be fixed further to make any size possible.
* barrel.c (bow_barrel_new_from_data_fp): Use bow_free() instead
of free().
* archer.c (struct archer_arg_state): New element
SCORE_IS_RAW_COUNT.
(archer_index_lines): New function.
(archer_query_hits_matching_sequence): Use new stoplist global
variable.
Only scale if SCORE_IS_RAW_COUNT is non-zero.
(archer_options): New command-line options "index-lines" and
"score-is-raw-count".
(archer_parse_opt): Handle them.
(main): Likewise.
1999-02-12 Andrew McCallum <mccallum@justresearch.com>
* active.c, archer.c, arrow.c, barrel.c, crossbow.c, em.c, evi.c,
goodturing.c, hem.c, kl.c, knn.c, methods.c, mix.c, naivebayes.c,
prind.c, rainbow-h.c, rainbow.c, split.c, tfidf.c, crossbow.h,
em.h, kl.h, knn.h, libbow.h, naivebayes.h, prind.h, tfidf.h,
treenode.h: Separate bow_method from rainbow_method. New
structure rainbow_method. New structure crossbow_method.
1999-02-01 Andrew McCallum <mccallum@justresearch.com>
* hem.c (crossbow_hem_consider_splitting): Fix memory leak by
freeing the GRANDPARENTS array.
1998-12-17 Andrew McCallum <mccallum@justresearch.com>
* naivebayes.c
(bow_naivebayes_set_cdoc_word_count_from_wi2dvf_weights):
Initialize num_words_per_ci[] to zero.
1998-11-29 Andrew McCallum <mccallum@justresearch.com>
* crossbow.c (struct crossbow_arg_state): Add members PRINTING_TAG
and CLASSIFY_FILES_DIRNAME.
(crossbow_classify_tagged_docs): Renamed from
crossbow_classify_test_docs.
(crossbow_classify_docs_in_dirname): New function.
(crossbow_classify): Insist on either restricted or deterministic
horizontal. Handle CLASSIFY_FILES_DIRNAME.
(crossbow_print_doc_names): New function.
(crossbow_options): New command-line options "classify-files" and
"print-doc-names".
(crossbow_parse_opt): Handle them.
* hem.c (SHRINK_WITH_UNIFORM_ONLY): Change from 1 to 0; i.e. set
back to default.
(iteration_limit): Change from 3 to 999.
(crossbow_hem_options): Add command-line options for
"hem-deterministic-horizontal" and "hem-restricted-horizontal".
(crossbow_hem_parse_opt): Handle them.
(crossbow_hem_unlabeled_perplexity): Add a test for -Inf.
(crossbow_hem_em_one_iteration): Handle both labeled and unlabeled
documents. Handle "restricted-horizontal" case. Verbosify how
many documents were incorporated.
(crossbow_hem_cluster): Handle both labeled and unlabeled documents.
(crossbow_hem_full_em): Add "Misc" children. Add commented-out code
for semi-labeled data.
(crossbow_hem_fienberg_treenode): Add #if'ed code for
SHRINK_WITH_UNIFORM_ONLY.
* treenode.c (bow_treenode_add_misc_child_all): New function.
(bow_treenode_set_words_from_new_words): Add special case for
"Misc" nodes.
* split.c (bow_tag_docs): New function.
(bow_tag_change_tags): New function.
* bow/libbow.h (bow_tag_docs): Declare new function.
(bow_tag_change_tags): Likewise.
1998-11-14 Jason Rennie <jrennie@cyclone.jprc.com>
* rainbow.c (bow_print_log_odds_ratio): print to stdout
* rainbow.c (bow_print_log_odds_ratio): new function
(rainbow_parse_opt): add --print-log-odds-ratio option
1998-11-13 Andrew McCallum <mccallum@justresearch.com>
* archer.c (archer_query_server_process_commands): Use different
prefix for pre-fork and post-fork commands, so we don't try to
handle a post-fork as a pre-fork.
(archer_query): Fix bug that caused crash when a +term was the only
term.
1998-11-12 Andrew McCallum <mccallum@justresearch.com>
* archer.c (archer_query): Print commas between query terms.
* lex-suffixing.c (bow_lexer_suffixing_postprocess_word): Skip the
Reference: words the second time through.
1998-11-11 Jason Rennie <jrennie@cyclone.jprc.com>
* rainbow.c (rainbow_query): check rainbow_doc_barrel; free query
word vector
1998-11-09 Andrew McCallum <mccallum@justresearch.com>
* archer.c (archer_docs): Change from a bow_array to a bow_sarray,
so we can keep track of filenames. All relevant functions
changed.
(archer_archive): Flush archer_wi2pv's FP.
(archer_index_filename): New function, moved out of archer_index.
Don't raise an error if the file doesn't exist.
(archer_index): Call it.
(archer_delete_filename): New function.
(archer_query_server_process_commands): New argument
DOING_PRE_FORK_COMMANDS. Add new commands for adding and removing
files.
(archer_query_serve_one_query): Process commands before and after
forking.
* sarray.c (bow_sarray_entry_at_keystr): If the KEYSTR is not
present, don't raise an error, just return NULL.
1998-11-05 Andrew McCallum <mccallum@justresearch.com>
* pv.c (bow_pv_max_sizeof_di_pi): Redefined for smaller number of
bits, since we are now using 7 bits in the IS_MORE bytes, instead
of just 6.
(bow_pv_write_unsigned_int): Updated to write 7 bits in the secondary
"IS_MORE" bytes.
(bow_pv_read_unsigned_int): Likewise for reading.
(PV_WRITE_SIZE_INT): New macro. Unused.
(bow_pv_write_size_di_pi): New function. Unused.
(bow_pv_read_size_di_pi): Likewise.
* bow/archer.h: Remove declaration of functions that are now
defined static in pv.c.
(bow_pe): Add entry BITS_MORE.
* wi2pv.c (bow_wi2pv_print_stats): Rename COUNT_HISTOGRAM to
WORD_COUNT_HISTOGRAM, in anticipation of histogramming the number
of bytes per pv entry also.
* rainbow.c (main): Fix handling of TAG in doc name printing.
* split.c (bow_files_source_type): Added two new types:
bow_files_source_fraction_remaining, and
bow_files_source_number_remaining.
(bow_split_options): Update documentation.
(bow_split_parse_opt): Handle new source types.
(bow_set_doc_types_randomly_by_count): New option
TAKE_PROPORTION_FROM_REMAINING. Revert to pre-1998-10-21 behavior
if this is zero.
(bow_set_doc_types_randomly_by_fraction_remaining): New function.
(bow_set_doc_types): Change the order of operations slightly. Instead
of doing all non-filename options together, do first the options
that specify the number of documents per class, then do the ones
that select the number of documents per class randomly according
to some proportion.
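Note on the pv.c change in the 1998-11-05 entry: the integers are stored in a variable number of bytes, each carrying a flag that says whether more bytes follow. The following is a minimal sketch of that general technique only, assuming 7 payload bits plus one continuation bit in every byte. It is not the actual pv.c on-disk layout, and the names sketch_write_uint and sketch_read_uint are hypothetical.

```c
#include <stddef.h>

/* Hypothetical sketch, not the pv.c format: encode VALUE into BUF using
   7 payload bits per byte, least significant group first.  The high bit
   of a byte is set when another byte follows.  Returns the number of
   bytes written (at most 5 when unsigned int is 32 bits wide). */
size_t
sketch_write_uint (unsigned char *buf, unsigned int value)
{
  size_t n = 0;

  while (value >= 0x80)
    {
      buf[n++] = (unsigned char) ((value & 0x7f) | 0x80);
      value >>= 7;
    }
  buf[n++] = (unsigned char) value;
  return n;
}

/* Decode one value from BUF into *VALUE.  Returns the number of bytes
   consumed.  Assumes BUF holds a complete, well-formed encoding. */
size_t
sketch_read_uint (const unsigned char *buf, unsigned int *value)
{
  size_t n = 0;
  unsigned int shift = 0;
  unsigned int result = 0;
  unsigned char byte;

  do
    {
      byte = buf[n++];
      result |= (unsigned int) (byte & 0x7f) << shift;
      shift += 7;
    }
  while (byte & 0x80);
  *value = result;
  return n;
}
```

Under this assumed layout, small values cost one byte, and each additional byte extends the representable range by 7 bits.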
1998-11-01 Andrew McCallum <mccallum@justresearch.com>
* pv.c (bow_pv_write_unsigned_int): Make it a static inline
function, for speed.
(bow_pv_read_unsigned_int): Likewise.
(bow_pv_write_next_di_pi): Likewise.
(bow_pv_read_next_di_pi): Likewise.
* archer.c (archer_index): Pass just a filename, not a directory
name to bow_wi2pv_new().
(archer_query_hits_matching_sequence): New argument SUFFIX_STRING.
All callers changed. Properly tell the difference between a word
that doesn't appear and a stopword. Make a little more efficient.
(archer_query): Parse things like `title:foo'.
(archer_query_serve_one_query): Process special commands like ,HITS
before any query, not just the first.
(archer_options): Remove "compare" option.
1998-10-21 Andrew McCallum <mccallum@justresearch.com>
* split.c (bow_set_doc_types_randomly_by_count): When determining
how many documents are available, instead of skipping only the
ignore documents, skip all the non-untagged documents. This fixes
a bug whereby there are not enough documents left to label. Don't
arbitrarily add one to the number of documents to be tagged when
the ratio-indicated number is zero. When adding more documents to
the per-class counts to be tagged, make sure there are untagged
documents available in that class.
1998-10-16 Andrew McCallum <mccallum@justresearch.com>
* rainbow-h.c: All these changes actually made on or before Jun 26
1998.
(hier_barrel): Add di4filename. Change CLASS_WORD_COUNT and
DOC_WORD_COUNTS from ints to floats.
(hier_barrel_new): Initialize di4filename.
(hier_barrel_set_class_word_count): Use the CDOC->NORMALIZER instead
of the CDOC->WORD_COUNT.
(hier_barrel_set_class_word_count): New function. Commented out.
(hier_barrel_set_class_barrel_cdoc_normalizer_to_word_count): New
function.
(hier_barrel_total_of_cdoc_word_counts): Function removed.
(hier_barrel_correct_cdoc_prior): Set the CDOC->NORMALIZER in the
class barrel, so we can use them to reset the WORD_COUNT's. Do it
by calling new function.
(hier_barrel_set_class_model_from_new_class_wi2dvf): Call new function.
(hier_barrel_free): Free HBARREL->NEW_CLASS_WI2DVF.
(rainbowh_set_total_num_words): Remove function.
(hier_barrel_prob_wi_in_ci_local): Make some changes for full_em.
(hier_barrel_prob_wi_local): Likewise.
(hier_barrel_calc_lambda_one_iteration): Add #error.
(hier_barrel_all_reset_doc_word_count_from_loo_doc_wi2dvf): New
function.
(hier_barrel_calc_lambdas_all_nodes): Do more updates at top of
while() loop and add comments.
(rainbowh_unarchive): Deal with di4filename.
1998-10-15 Andrew McCallum <mccallum@justresearch.com>
* archer.c (archer_query_serve_one_query): When forking, in the
child, reopen the wi2pv PV FILE* so we get our own lseek()
position.
* bow/archer.h (bow_wi2pv_reopen_pv): Declare new function.
* wi2pv.c (bow_wi2pv_reopen_pv): New function.
* dice.c: New file.
* hem.c (SHRINK_WITH_UNIFORM_ONLY): New macro.
(PRINT_WORD_DISTS): New macro.
(iteration_limit): New variable. (temporary).
(crossbow_hem_full_em): Use SHRINK_WITH_UNIFORM_ONLY and
PRINT_WORD_DISTS. Use iteration_limit.
(crossbow_hem_fienberg_treenode): Use SHRINK_WITH_UNIFORM_ONLY and
PRINT_WORD_DISTS. Also print results with several fixed lambdas.
* crossbow.c (struct crossbow_arg_state): New member
PRINT_FILE_PREFIX.
(main): Initialize new member.
(crossbow_print_word_probabilities): New function (not complete).
(crossbow_options): New option "print-word-probabilities".
(crossbow_parse_opt): Handle it.
* treenode.c (bow_treenode_free_loo): No longer declared static.
(bow_treenode_free_loo_all): New function.
(bow_treenode_set_new_words_to_zero): New function.
(bow_treenode_print_all_word_probabilities_all): New function.
* bow/treenode.h: Declare new functions.
* kl-div.c (main): Handle new command-line argument -l, for "Larry
Wasserman's Loss function". Implement it.
* archer.c (archer_query_hits_matching_sequence): Change scaler.
(archer_query): Make a local copy of ARCHER_ARG_STATE.QUERY_STRING and
replace the trailing newline and carriage return with \0.
* archer.c (archer_query_hits_matching_sequence): Add 1 to avoid
dividing by zero.
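Note on the archer_query change in the 1998-10-15 entry: the trailing newline and carriage return of the query string are overwritten with \0. A minimal sketch of that step, assuming the caller passes a writable buffer (the entry works on a local copy). The helper name sketch_chomp is hypothetical and is not a libbow function.

```c
#include <string.h>

/* Hypothetical helper: overwrite any trailing '\n' or '\r' characters
   in S with '\0', in place.  Returns the new length of S. */
size_t
sketch_chomp (char *s)
{
  size_t len = strlen (s);

  while (len > 0 && (s[len - 1] == '\n' || s[len - 1] == '\r'))
    s[--len] = '\0';
  return len;
}
```

Looping rather than testing only the last character handles both "\n" and "\r\n" line endings, which matters when queries arrive over a socket.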
1998-10-15 Jason Rennie <jrennie@cyclone.jprc.com>
* Makefile.preamble (-DVPC_ONLY): new define
* vpc.c (bow_barrel_new_vpc): set IS_VPC to 1
* rainbow.c (rainbow_index): allow possibility of building only
class barrel in order to reduce memory consumption; add checks to
make sure that rainbow_doc_barrel is not NULL before we try to use it
(rainbow_options): add vpc-only option
* hdb-bow.c (bow_barrel_add_from_hdb): if barrel has is_vpc set,
build a multinomial class barrel instead of a document barrel
* barrel.c (bow_barrel_add_from_text_dir): if barrel has is_vpc
set, build a multinomial class barrel instead of a document barrel
(bow_barrel_new): set IS_VPC to 0
1998-10-14 Andrew McCallum <mccallum@justresearch.com>
* hem.c (crossbow_hem_unlabeled_perplexity): New function.
(crossbow_hem_labeled_perplexity): New function.
(crossbow_hem_place_labeled_data): Clear old data and LOO info before
placing the new. Use new functions.
* archer.c (archer_query_hits_matching_wi): New function.
(archer_query_hits_matching_sequence): Use it. Improve efficiency of
wi2pv searching. Use new bow_wa function.
(archer_query): Remove trailing \n from query terms.
* wa.c (bow_wa_add_to_end): New function.
* bow/libbow.h: Declare new function.
* archer.c (archer_query_hits_matching_sequence): Change the
weight-setting method to encourage documents that have all query
terms to be ranked above documents that have many repetitions of a
few terms.
(archer_query): Fix it so it works with queries that don't have a
+term. Free all the temporary space that was allocated. Print
the +terms as well as the others.
* wi2pv.c (bow_wi2pv_new): Instead of saving an absolute pathname
(and allowing the PV_FILENAME argument to contain '/'s), expect
PV_FILENAME to be just a filename, and save it only. Use
BOW_DATA_DIR as the directory location in which to place the
PV_FILENAME.
(bow_wi2pv_new_from_filename): Support the new behavior of
bow_wi2pv_new. (Although old models are temporarily supported by
use of a hack.)
* naivebayes.c (bow_naivebayes_set_weights): Remove restriction
that we have less than 200 classes.
* archer.c (archer_query): Combine results of query terms more
efficiently. Print the words that occur in each hit. Currently
only works if there is a +term.
1998-10-13 Andrew McCallum <mccallum@justresearch.com>
* archer.c (archer_query_hits_matching_sequence): Remove yyparse
code. Add code to parse +, - and "" in C. Gather results in
various bow_wa arrays, and combine them with the correct
semantics.
* Makefile.local (archer): Remove dependence on query* files.
* wa.c (bow_wa_weight): New function.
(bow_wa_union): New function.
(bow_wa_intersection): New function.
(bow_wa_overlay): New function.
(bow_wa_diff): New function.
* bow/libbow.h: Declare new functions.
* lex-suffixing.c: Whenever calling *lexer_simple_* functions,
pass BOW_DEFAULT_LEXER_SIMPLE instead of SELF.
(HEADER_TWICE): New macro. When this macro is non-zero, read the
headers twice, once adding the suffixes, once not.
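Note on the archer.c and wa.c changes in the 1998-10-13 entry: per-term results are gathered separately and then combined according to the +, - and "" operators. A minimal sketch of two such combinations, assuming each result is just an ascending array of document indices. The real bow_wa arrays also carry weights and their signatures are not shown in this log, so the names sketch_intersection and sketch_diff are hypothetical.

```c
#include <stddef.h>

/* Hypothetical sketch: A and B are ascending arrays of document
   indices.  Write the indices present in both into OUT, which must
   have room for min(NA, NB) entries.  Returns the number written. */
size_t
sketch_intersection (const int *a, size_t na,
                     const int *b, size_t nb, int *out)
{
  size_t i = 0, j = 0, n = 0;

  while (i < na && j < nb)
    {
      if (a[i] < b[j])
        i++;
      else if (a[i] > b[j])
        j++;
      else
        {
          out[n++] = a[i];
          i++;
          j++;
        }
    }
  return n;
}

/* Write the indices of A that do not occur in B into OUT, which must
   have room for NA entries.  Returns the number written. */
size_t
sketch_diff (const int *a, size_t na,
             const int *b, size_t nb, int *out)
{
  size_t i = 0, j = 0, n = 0;

  while (i < na)
    {
      /* Advance B until it reaches or passes the current entry of A. */
      while (j < nb && b[j] < a[i])
        j++;
      if (j >= nb || b[j] != a[i])
        out[n++] = a[i];
      i++;
    }
  return n;
}
```

In this reading, required (+) terms would be combined by intersection and excluded (-) terms removed by difference; both run in a single linear pass because the inputs are sorted.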
1998-10-13 Jason Rennie <jrennie@cyclone.jprc.com>
* configure.in: AC_HAVE_LIBRARY(socket) changed to
AC_CHECK_LIB(socket,main)
1998-10-09 Peter Su <psu@jprc.com>
* configure.in: added check for YACC
* query.h: New file. header file for query demo.
* Makefile.in: Added rules for building query demo
* query_main.c: New file, short demo for parser
* query_lex.l: New file
* query_parse.y: New file for parsing boolean queries
1998-10-08 Jason Rennie <jrennie@cyclone.jprc.com>
* Makefile.in: define ALL_LIBS before Makefile.preamble is
included
* Makefile.in (kl-div): don't use ALL_LIBS; kl-div doesn't depend
on libargp
* Makefile.in: do STANDARD_ trick with LIBBOW_C_FILES and
LIBBOW_H_FILES to exclude HDB stuff from distribution
* configure.in: Add check for -lsocket. Needed for Solaris
* opts.c: add #if HAVE_HDB checks
* rainbow.c: add #if HAVE_HDB checks
* arrow.c: add #if HAVE_HDB checks
* Makefile.in: Remove all mentions of HDB
* int4word.c (bow_words_add_occurrences_from_hdb): moved to
hdb-bow.c
* docnames.c (bow_map_filenames_from_hdb): moved to hdb-bow.c
* hdb-bow.c: new file
* barrel.c (bow_barrel_add_from_hdb): move to hdb-bow.c
* Makefile.preamble: add HDB-specific compilation definitions
1998-10-08 Andrew McCallum <mccallum@jprc.com>
* bow/libbow.h (bow_random_seed): Renamed from bow_split_seed.
* random.c (bow_random_set_seed): Rename bow_split_seed to
bow_random_seed.
* opts.c (parse_bow_opt): Rename bow_split_seed to
bow_random_seed.
* hem.c (crossbow_hem_create_children_for_node): Use
bow_treenode_set_new_words_from_perturbed_words() instead of the
previous direct use of bow_random_double.
(crossbow_hem_classification_perplexity): New function.
(crossbow_hem_em_one_iteration): When calculating perplexity, and
adding the contribution from this leaf, don't add it if
LEAF_MEMBERSHIP[LI] is zero. This avoids adding infinity.
(crossbow_hem_cluster): After perturbing the leaves (which happens
after EM convergence), don't smooth the new WORDS distribution.
Previously called ...set_words_from_new_words_all() with alpha=1.
(crossbow_hem_full_em): When printing "pp", use "classification
perplexity" instead of "perplexity".
1998-10-07 Andrew McCallum <mccallum@jprc.com>
* archer.c (archer_print_word_stats): New function.
(archer_options): New command-line argument "print-word-stats".
(archer_parse_opt): Handle it.
(main): Check to see if WHAT_DOING is NULL, and complain if so.
* wi2pv.c (bow_wi2pv_print_stats): New function.
* bow/archer.h: Declare new function.
* wi2pv.c (bow_wi2pv_new): Convert PV_FILENAME to an absolute
pathname before saving it. This way, an archived model can be
used when running archer from a different CWD from which the index
was built.
* stopwords.c (_bow_builtin_stopwords_shorter): New global
variable.
1998-10-06 Andrew McCallum <mccallum@jprc.com>
* wa.c (bow_wa_add): New function.
(bow_wa_add_wa): New function.
* bow/libbow.h: Declare new functions.
* wi2pv.c (bow_wi2pv_wi_count): New function.
* bow/archer.h: Declare new functions.
* pv.c (bow_pv_next_di_pi): Exchange the order of "end of PV
checking" and "unnext checking" to fix bug whereby BOW_PV_REWIND()
wasn't working.
* archer.c: Make it work as a server over a socket, like arrow.
Separate the query handling into two functions, in preparation for
having a query parser.
* Makefile.local (archer): New target.
* pv.c, wi2pv.c, archer.c, bow/archer.h: New files. Wow, all this
code written in just 4 hours.
1998-09-30 Jason Rennie <jrennie@cyclone.jprc.com>
* configure.in: new compatibility checks
* split.c (bow_split_parse_opt): use strchr instead of index
(STANDARDS suggests strchr)
* em.c, mix.c, active.c: change "//" to "/* */" (some older compilers
don't know about //)
* split.c (bow_split_parse_opt): use hack if index() doesn't exist
* em.c (random01): removed; replaced calls with bow_random_01
(random_double): removed; replaced calls with bow_random_double
* active.c (active_select_stream_kl): use bow_random_double where
appropriate.
* random.c: check for gettimeofday() and RAND_MAX. Use simple
workarounds if not defined
* lex-simple.c (bow_lexer_simple_open_text_fp) [!HAVE_SETENV]:
complain and die.
* bow/libbow.h [__linux__]: redefine assert (re-definition
on SunOS causes problems)
(assert): insert comma between __STRING(expr) and __FILE__
to prevent nastiness. Must have been inadvertently removed earlier.
[!HAVE_STRCHR, !HAVE_STRRCHR]: redefine as index, rindex
[!HAVE_RANDOM, !HAVE_SRANDOM]: redefine as rand, srand.
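Note on the bow/libbow.h fallbacks in the 1998-09-30 entry: missing functions are mapped onto older equivalents with the preprocessor. A minimal sketch of that pattern, assuming the HAVE_* macros normally come from a configure-generated header. They are set by hand here so the fragment stands alone, and the two caller functions are hypothetical, not libbow code.

```c
#include <stdlib.h>
#include <string.h>

/* Assumption: a configure-generated header normally defines the HAVE_*
   macros.  They are set by hand here; HAVE_RANDOM and HAVE_SRANDOM are
   left undefined on purpose to exercise the fallback. */
#define HAVE_STRCHR 1
#define HAVE_STRRCHR 1

#ifndef HAVE_STRCHR
#include <strings.h>
#define strchr index
#endif
#ifndef HAVE_STRRCHR
#include <strings.h>
#define strrchr rindex
#endif
#ifndef HAVE_RANDOM
#define random rand
#endif
#ifndef HAVE_SRANDOM
#define srandom srand
#endif

/* Hypothetical caller: return the text after the first ':' in ARG,
   or NULL when ARG contains no ':'. */
const char *
sketch_after_colon (const char *arg)
{
  const char *colon = strchr (arg, ':');

  return colon ? colon + 1 : NULL;
}

/* Hypothetical caller: seed the generator and draw a value in [0, N).
   N must be positive.  With the fallback active this uses srand/rand. */
long
sketch_draw_below (unsigned int seed, long n)
{
  srandom (seed);
  return random () % n;
}
```

Callers keep using the preferred names (strchr, random) everywhere, and only the one header needs to know which platform provides what.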
1998-09-29 Jason Rennie <jrennie@cyclone.jprc.com>
* hdb.c [!HAVE_STRERROR]: redefine as ""
* configure.in: check for strerror
1998-09-28 Andrew McCallum <mccallum@jprc.com>
* rainbow.c (rainbow_options): Add optional argument to
"print-doc-names" that specifies the document type (tag = train,
test,..) to print.
(rainbow_parse_opt): Handle it.
(main): Likewise.
* bow/libbow.h (bow_str2type): New macro.
* Makefile.in: Undo changes that added crossbow to the snapshot
distribution. It's not yet ready to go out.
(LIBBOW_H_FILES): Remove bow/crossbow.h and bow/treenode.h.
(CROSSBOW_C_FILES): Remove variable.
(OTHER_C_FILES): Remove crossbow.c.
(all): Remove crossbow.
(DIST_FILES): Remove $(CROSSBOW_C_FILES).
1998-09-28 Jason Rennie <jrennie@cyclone.jprc.com>
* knn.c (bow_knn_score): calculate and return class scores (same
as before last modification)
* rainbow.c (rainbow_query): always print class names instead of
relying on cdoc names
1998-09-25 Andrew McCallum <mccallum@jprc.com>
* int4str.c (_str2id): Fix bug whereby returned value may be
negative or zero! (Reported by Jason Rennie.)
* hem.c (crossbow_hem_full_em): Print the train- and test-pp for
the "No Shrinkage" case also.
(LOG_LOSS): New macro.
(crossbow_hem_fienberg_treenode): Conditioned on above macro,
implement lambda-calculation with a log log function.
1998-09-24 Andrew McCallum <mccallum@jprc.com>
* crossbow.c (CLASSES_FROM_DIRS): New macro.
(crossbow_index): In embedded function write_wv_to_fp() extract the
directory name from the FILENAME, and use it to determine the
class index of the DOC. Count the number of classes from the
number of different dir names found in the embedded function.
(crossbow_classify): Temporarily use crossbow_hem_fienberg() instead
of crossbow_hem_full_em().
* hem.c (crossbow_hem_full_em): Instead of printing "Initial Flat
Distribution", print the more precise message "No Shrinkage".
(crossbow_hem_fienberg_treenode): New function.
(crossbow_hem_fienberg): New function.
* treenode.c (bow_treenode_set_lambdas_leaf_only): New function.
(bow_treenode_set_lambdas_leaf_only_all): New function.
* bow/treenode.h: Declare new functions.
* split.c (bow_set_doc_types_randomly_by_count_per_class): Print
the total number of documents tagged.
1998-09-21 Andrew McCallum <mccallum@jprc.com>
* bow/libbow.h (bow_doc_type): Reorder the enumeration so that the
numbers match those of old archived barrels. (Bug reported by
Kamal Nigam and Jason Rennie.)
1998-09-18 Andrew McCallum <mccallum@jprc.com>
* hem.c (crossbow_hem_options): Explain more defaults.
* rainbow.c (rainbow_options): Rename "print-barrel" to
"print-matrix". Leave "print-barrel" as a hidden alias.
1998-09-17 Andrew McCallum <mccallum@jprc.com>
* Makefile.in: Add crossbow to the snapshot distribution.
(OTHER_C_FILES): Add crossbow.c.
(LIBBOW_H_FILES): Add bow/crossbow.h and bow/treenode.h.
(DIST_FILES): Add $(CROSSBOW_C_FILES).
(all): Add crossbow.
* bow/libbow.h (bow_cdoc): Reorder the variables so that the order
of the first three matches that of crossbow_doc.
(bow_doc): Add CLASS and FILENAME elements.
* bow/crossbow.h (crossbow_doc): Reorder the variables so that the
order of the first three matches that of bow_cdoc.
* crossbow.c (main): Call bow_set_doc_types() only if not
indexing.
* hem.c: Turn off MN. Use new bow_doc types instead of old
crossbow tags.
* mn.c (crossbow_hem_em_one_mn_iteration): Don't reset classes
distribution unless CROSSBOW_CLASSES_COUNT is greater than 1.
* rainbow.c: Fix some typos in calls to
bow_set_doc_types_for_barrel().
(rainbow_query): Remove use of knn_k.
* crossbow.c (crossbow_classnames): New global variable.
(crossbow_archive): Write it.
(crossbow_unarchive): Read it.
(crossbow_index): Initialize it.
(struct crossbow_arg_state): Remove test_fraction.
(crossbow_docs_split): Function removed.
(crossbow_classify): Don't call removed function.
(crossbow_options): "test-percentage" option removed.
(crossbow_parse_opt): No longer handle it.
(main): Call bow_set_doc_types() to do the test train splits.
* em.c (bow_em_new_vpc_with_weights): Comment out unused variables
to avoid warning.
* split.c: All function names with bow_set_files_, changed to
bow_set_doc_. All functions that previously took a bow_barrel as
an argument now take only the pieces of the barrel they need---so
that crossbow can use these functions too.
(bow_set_doc_types_for_barrel): New function that works like old
bow_set_doc_types() function.
* bow/libbow.h (bow_set_doc_types): New function declaration.
(bow_set_doc_types_for_barrel): New function declaration.
* rainbow.c (main): Call bow_set_doc_types_for_barrel() instead
of set_doc_types().
1998-09-17 Jason Rennie <jrennie@cyclone.jprc.com>
* knn.c: make KNN_K static (local); need to change bow_knn_score
so that it returns classes instead of k nearest documents
* rainbow.c (rainbow_query): set NUM_HITS_TO_SHOW to KNN_K if
method is "knn"
* knn.c (bow_knn_get_k_best): fixed a bug in document shifting
schema; return document score in bow_score array
(bow_knn_score): clear away lots 'o unnecessary junk; expect
calling function to allocate BSCORES; initialize BSCORES
1998-09-15 Andrew McCallum <mccallum@jprc.com>
* bow/libbow.h (bow_doc): New typedef.
1998-09-14 Andrew McCallum <mccallum@jprc.com>
* hem.c: Replace hem_ with crossbow_hem_.
* Makefile.in (CROSSBOW_C_FILES): Renamed hac.c to hem.c.
* hem.c: File renamed from hac.c.
* crossbow.c: Replaced hac_ with hem_.
* rainbow.c (rainbow_options): Changed "infogain-vector" to
"print-word-infogain". Changed "infogain-pair-vector" to
"print-word-pair-infogain". Changed "weight-vector" to
"print-word-weights". Changed "foilgain-vector" to
"print-word-foilgain".
(main): Call set_doc_types() once at beginning, unless doing
RAINBOW_TESTING, which does its own call of this function.
* libbow.texi (Overview): Added rough notes from Devika
Subramanian.
* bow/libbow.h (BOW_MINOR_VERSION): Updated from 8 to 9.
* treenode.c hac.c crossbow.c bow/treenode.h: Changed treenode_ to
bow_treenode_ in all function names.
1998-09-14 Kamal Nigam <knigam@zeno.jprc.com>
* bow/libbow.h (bow_wv_set_weights_to_count): Changed prototype.
(bow_doc_type): Removed unused doc types and added validation type.
(bow_method): Change prototype so set weights takes two arguments.
(bow_wv_set_weights): Change prototype
(bow_wv_set_weights_to_count): Likewise.
(bow_wv_set_weights_by_event_model): New function.
(bow_cdoc_is_ignored_model): Removed.
(bow_cdoc_is_validation): New function.
* em.c (bow_em_event_model_type): Removed. Superseded by libbow
event models.
(M_EST_M): Removed.
(M_EST_P): Removed.
(WORD_PRIOR_COUNT): Removed.
(bow_em_event_model): Removed.
(bow_em_calculating_perplexity): New global for perplexity calculating
hack.
(binary_pos_ci): New global for binary classification hack.
(em_halt_using_perplexity): New option variable.
(em_perplexity_docs): Likewise.
(em_perplexity_loo): Likewise.
(bow_em_anneal_normalizer): Likewise.
(em_options): Removed option --em-event-model. New code for options
em-halt-using-perplexity em-anneal-normalizer em-print-perplexity.
(em_parse_opt): Likewise.
(bow_cdoc_is_multi_hump_doc): New function.
(bow_cdoc_is_train_or_unlabeled): New function.
(bow_em_perturb_weights): Set the random seed.
(bow_em_perturb_weights): Use the real word_count in normalizer, not
the integral one.
(bow_em_new_vpc_with_weights): num_classes removed. binary_pos_ci
moved to be a global. *_class_map variables removed. Use the
real word_count in normalizer, not the integral one. Remove
infogain pruning; do this in rainbow.c. Change calculation of
word_count for documents to be correct. Do not allow zero class
probs when randomly choosing them for the EM starting point. Add
EM halting when the perplexity of some documents plateaus. Create
dv's for all words in the vocabulary so perplexity calculation
does not get confused if words appear in only the doc barrel and
not the class barrel. Use libbow document event model and
doc-then-word document length. Use bow_em_set_weights (like
bow_naivebayes_set_weights). Calculate and print perplexity if
necessary or requested. Implement normalizer annealing. Add
multi-hump EM more naturally into the code. Don't free the
class_probs; may be used by LOO doc testing. Update call to
bow_em_score.
(em_calculate_perplexity): New function.
(bow_em_test_accuracy): Update call to bow_em_score.
(bow_em_print_log_odds_ratio): Disable while normalizer is holding
word_count info.
(bow_em_set_priors_using_class_probs): Prime each class's count with
1.
(bow_em_pr_wi_ci): New function.
(bow_em_set_weights): New function.
(bow_em_score): New calling prototype to send loo class_probs pointer
as an int. Use bow_em_pr_wi_ci with loo functionality. Add
perplexity calculation hack to return perplexity of each class.
Use libbow event models implicitly through bow_em_pr_wi_ci
(bow_method_em): Change wv_set_weights to explicitly use the event
model.
* naivebayes.c (naivebayes_argp_m_est_m): Remove static so em.c
can see.
(M_EST_M): Removed.
(M_EST_P): Removed.
(bow_naivebayes_pr_wi_ci): Changed loo_wi_count and loo_w_count to
floats to handle the document_then_word event model correctly.
* bow/naivebayes.h (naivebayes_argp_m_est_m): Add prototype so
em.c can use.
* wv.c (bow_wv_set_weights_to_count): Add argument to match method
prototype.
(bow_wv_set_weights_by_event_model): New function.
* split.c (bow_validation_files_source): New variable.
(bow_validation_fraction): New variable.
(bow_validation_number): New variable.
(bow_validation_filename): New variable.
(bow_validation_fancy_counts): New variable.
(bow_split_options): New option --validation-set
(bow_split_parse_opt): New option --validation-set
(bow_split_parse_opt): New format for fancy counts: all ',' as
delimiters.
(bow_set_file_types_randomly_by_count_per_class): Print nice message
to stderr.
(bow_set_file_types_randomly_by_count): Ensure we get at least one
document per class for training.
(bow_set_files_to_type): Update nice message for stderr.
(bow_set_file_types_of_remaining): Print nice message to stderr.
(set_doc_types): handle validation set also.
* rainbow.c (rainbow_test): Added special call for EM method to
correctly handle EM-style LOO scoring.
* next.c (bow_cdoc_is_ignored_model): Removed.
(bow_cdoc_is_validation): New function.
* knn.c (bow_knn_query_set_weights): Added argument to procedure
to be compliant with new prototype for method
* arrow.c (arrow_compare): Add argument to calls to
bow_wv_set_weights_to_count.
1998-09-13 Andrew McCallum <mccallum@jprc.com>
* NEWS: Update with some new features in version 0.9.
* Version (BOW_MINOR_VERSION): Changed from 8 to 9.
* readme.texi: Update list of coming rainbow improvements. Update
web address for more information.
* HACKING: Update GNU URL's and add parens to argp autoconf
command.
* install.texi: Remove unnecessary copyright notice.
* Makefile.in (LIBBOW_H_FILES): Added bow/hdb.h.
1998-09-12 Andrew McCallum <mccallum@jprc.com>
* opts.c (bow_options): Changed the name of command-line option
"split-seed" to "random-seed".
1998-09-08 Jason Rennie <jrennie@cyclone.jprc.com>
* wi2dvf.c (bow_wi2dvf_add_di_text_str): add lexer->close call to
patch memory leak
1998-09-07 Jason Rennie <jrennie@cyclone.jprc.com>
* docnames.c, barrel.c, int4word.c: fix memory leak associated with
HDB.
* Change instances of "char" in function names to "str" for
better description of functionality.
1998-09-04 Andrew McCallum <mccallum@jprc.com>
* opts.c (bow_options): Make the "lex-for-usenet" option hidden,
since it is currently broken.
1998-09-03 Andrew McCallum <mccallum@jprc.com>
* Makefile.in (snapshot): Make ourselves more Y2K compliant :-) by
including the first two digits of the year in the snapshot file
name.
1998-08-24 Jason Rennie <jrennie@cyclone.jprc.com>
* bow/hdb.h: new file
* bow/libbow.h: Added lots of function declarations
* wi2dvf.c (bow_wi2dvf_add_di_text_char): new function
* scan.c (bow_scan_char_for_string): new function
* rainbow.c (rainbow_test_files): Split test_file into test_file,
test_hdb_file and process_wv; call HDB functions when BOW_HDB is
set.
(rainbow_index): Call HDB functions when BOW_HDB is set.
* opts.c (parse_bow_opt): Added --hdb option with key HDB_KEY
* lex-suffixing.c (bow_lexer_suffixing_open_char): new function
(_bow_suffixing_lexer): new field: bow_lexer_suffixing_open_char
* lex-simple.c (bow_lexer_simple_open_char): new function
(_bow_white_lexer, bow_alpha_only_lexer, bow_alpha_lexer):
new field: bow_lexer_simple_open_char
* lex-indirect.c (bow_lexer_indirect_open_char): new function
* lex-html.c (_bow_html_lexer): new field:
bow_lexer_simple_open_char
* lex-gram.c (bow_lexer_gram_open_char): new function
(_bow_gram_lexer): new field: bow_lexer_gram_open_char
* lex-email.c (_bow_email_lexer): new field:
bow_lexer_simple_open_char
* istext.c (bow_char_is_text): new function
* int4word.c (bow_words_add_occurrences_from_hdb): new function
* hdb.c: new file
(hdb_open): new function
(hdb_close): new function
(hdb_put): new function
(hdb_get): new function
(hdb_each): new function
* docnames.c (bow_map_filenames_from_hdb): new function
* barrel.c (bow_barrel_add_from_hdb): new function
* arrow.c (arrow_index): Call HDB functions if BOW_HDB is set
* Makefile.in: Added stuff necessary to compile with HDB code
(Hash DataBase - simple file system-like database)
1998-08-19 Andrew McCallum <mccallum@jprc.com>
* bow/libbow.h: Fix comment descriptions of event models.
* opts.c (bow_event_document_then_word_document_length): New
global variable.
(bow_options): New command-line option
"event-document-then-word-document-length".
(parse_bow_opt): Handle it.
* bow/libbow.h: Declare new global variable.
* naivebayes.c (bow_naivebayes_score): Use
BOW_EVENT_DOCUMENT_THEN_WORD_DOCUMENT_LENGTH. When multiplying by
the number of occurrences in the QUERY_WV, use .weight, not
.count.
* vpc.c (bow_barrel_new_vpc): Use
BOW_EVENT_DOCUMENT_THEN_WORD_DOCUMENT_LENGTH.
1998-08-19 Kamal Nigam <knigam@zeno.jprc.com>
* bow/libbow.h (bow_set_file_types_preprocess): Removed.
(bow_set_all_files_untagged): Removed.
(bow_set_file_types_randomly_by_count_per_class): Removed.
(bow_set_file_types_randomly_by_count): Removed.
(bow_set_file_types_randomly_fraction): Removed.
(bow_set_files_to_type): Removed.
(bow_set_file_types_postprocess): Removed.
(set_doc_types): New prototype.
(bow_ignore_next_wv): Removed.
(bow_ignored_model_next_wv): Removed.
(bow_cdoc_is_unlabeled): New prototype.
* split.c (test_files_count): Removed.
(train_files_count): Removed.
(untagged_files_count): Removed.
(bow_files_source_type): New type.
(bow_split_fancy_count): New type.
(bow_{test,train,unlabeled,ignore}_{files_source,fraction,number,filename,fancy_counts}):
New variables.
(bow_test_set_files_use_basename): New variable from opts.c
(bow_split_options): Code for handling split options.
(bow_split_parse_opt): Likewise.
(bow_split_argp): Likewise.
(bow_split_argp_child): Likewise.
(bow_set_all_files_untagged): Don't use removed global variables.
(bow_set_file_types_randomly_by_count_per_class): Likewise.
(bow_set_file_types_randomly_by_fraction): Likewise.
(bow_set_files_to_type): Likewise.
(bow_set_file_types_preprocess): Removed.
(bow_set_file_types_randomly_by_count): Calculate per class nums by
converting to floats to avoid integer roundoff error.
(bow_set_file_types_postprocess): Removed.
(bow_set_file_types_of_remaining): New function.
(set_doc_types): New function.
(_register_split_args): New function.
* rainbow.c (rainbow_options): Removed options
prind-non-uniform-priors prind-no-foilgain-weight-scaling
prind-no-score-normalization and moved to prind.c. Removed
test-percentage set-test-files testing-files
set-files-use-basename testing-files-use-basename set-train-files
and moved functionality to split.c.
(struct rainbow_arg_state): Likewise.
(rainbow_parse_opt): Likewise.
(rainbow_lisp_setup): Likewise.
(main): Likewise.
(rainbow_test): Removed test/train split code into split.c.
* prind.c (prind_options): New for options. Taken from rainbow.c
(prind_parse_opt): Likewise.
(prind_argp): Likewise.
(prind_argp_child): Likewise.
(_register_method_prind): Added option registration.
* opts.c (bow_test_set_files_use_basename): Moved to split.c.
* next.c (bow_cdoc_is_model): Updated to new document types.
(bow_cdoc_is_test): Likewise.
(bow_cdoc_is_nontest): Likewise.
(bow_cdoc_is_ignore): Likewise.
(bow_cdoc_is_ignored_model): Likewise.
(bow_cdoc_is_unlabeled): New function.
(bow_ignore_next_wv): Removed.
(bow_ignored_model_next_wv): Removed.
* naivebayes.c (naivebayes_options): Changed argument priority.
* mix.c (mix_options): Changed priority number of the options.
(mix_new_vpc): Updated to new document types.
* knn.c (knn_options): Changed priority number of the options.
* em.c (use_unknown_percent): Removed.
(bow_em_binary_model_counts): Removed.
(unknown_percent): Removed.
(not_unknown_num_by_class): Removed.
(bow_em_unlabeled_num): Removed.
(em_no_splitting): Removed.
(bow_em_fancy_counts): Removed.
(em_no_reset_doc_types): Removed.
(em_options): Removed options em-unlabeled-percent
em-labeled-num-by-class em-unlabeled-num em-no-splitting
em-no-reset-doc-types.
(em_parse_opt): Likewise.
(bow_em_new_vpc_with_weights): Remove all code to do train/unlabeled
splits. This is now handled by libbow. Changed call to
bow_ignore_next_wv to be a call to bow_heap_next_wv.
* active.c (active_initial_known): Removed.
(active_num_unlabeled): Removed.
(active_options): Removed options --active-initial-known and
--active-num-unlabeled. Now handled by libbow.
(active_parse_opt): Likewise.
(active_select_qbc): Update document types so unlabeled docs are now
of type bow_doc_unlabeled. Change other document types to the new
ones.
(active_select_weighted_kl): Likewise.
(active_select_dkl): Likewise.
(active_select_vote_entropy): Likewise.
(active_select_uncertain): Likewise.
(active_select_relevant): Likewise.
(active_select_length): Likewise.
(active_select_random): Likewise.
(active_select_stream_ve): Likewise.
(active_select_stream_kl): Likewise.
(active_learn): Likewise. Also removed all code to handle
test/train/ignore split here. Now done in libbow. Don't change
doc types at end of active_learn.
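The roundoff fix noted above for bow_set_file_types_randomly_by_count in split.c can be illustrated with a minimal sketch. The function names and parameters below are hypothetical and are not the libbow code; the sketch only shows why computing a class's share in floating point and rounding once gives a different result from truncating integer division.

```c
#include <assert.h>

/* Hypothetical helpers, not part of libbow: decide how many of
   TOTAL_WANTED documents should come from a class that holds
   CLASS_DOCS of the ALL_DOCS documents in the collection.  */

/* Integer arithmetic: the fraction CLASS_DOCS / ALL_DOCS truncates to
   zero for every class smaller than the whole collection.  */
int
per_class_count_int (int total_wanted, int class_docs, int all_docs)
{
  assert (all_docs > 0);
  return total_wanted * (class_docs / all_docs);
}

/* Floating-point arithmetic: keep the fractional share, then round to
   the nearest integer once, at the end.  */
int
per_class_count_float (int total_wanted, int class_docs, int all_docs)
{
  double share;

  assert (all_docs > 0);
  share = (double) class_docs / (double) all_docs;
  return (int) (total_wanted * share + 0.5);
}
```

With 10 documents wanted and a class holding 30 of 100 documents, the integer version yields 0 while the floating-point version yields 3.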
1998-08-18 Andrew McCallum <mccallum@jprc.com>
* treenode.c (treenode_keywords_print): Instead of conditionally
printing keywords, and sometimes printing "alias" (depending on kl
value), always print the keywords, AND print the kl divergence
value, so that a later post-processing program can decide for
itself which keywords and aliases it wants to use.
1998-08-17 Andrew McCallum <mccallum@jprc.com>
* bow/libbow.h: Declare new split.c functions.
(bow_doc_type): Add bow_doc_untagged.
* rainbow.c (rainbow_options): New command-line options
"set-test-files", "set-train-files", "set-files-use-basename".
Not all are yet handled.
(rainbow_test): Use new file type split.c functions.
(rainbow_test_files): Remove commented out code for building a new
model.
* split.c (bow_set_all_files_untagged): New function.
(bow_set_file_types_preprocess): New function.
(bow_set_file_types_randomly_by_count_per_class): New function.
(bow_set_file_types_randomly_by_count): New function. Takes
over from bow_test_split().
(bow_set_file_types_randomly_by_fraction): New function.
(bow_set_files_to_type): New function. Takes over from
bow_test_set_files().
(bow_test_split, bow_test_split2, bow_test_split3): Functions
removed.
(bow_test_new_heap, bow_heap_next_wv_guts, bow_heap_next_wv,
bow_cdoc_is_*): Functions moved to next.c.
* next.c: New file.
* Makefile.in (LIBBOW_C_FILES): Added next.c.
* random.c (bow_random_set_seed): Moved to this file from split.c.
1998-08-17 Kamal Nigam <knigam@zeno.jprc.com>
* bow/libbow.h (bow_doc_type): Changed values and added an
unlabeled doc type.
* bow/em.h (bow_em_perturb_weights): New prototype.
* em.c: Major overhaul to em.c. Cleaned up code, removed obsolete
functionality, and added some annealing functionality.
(bow_em_stat_method): Removed all but two methods.
(ignored_model_are_false_unknown): Removed.
(use_leave_one_out): Removed.
(use_train_for_stats): Removed.
(use_priors_for_initial_class_probs): Removed.
(bow_em_unlabeled_start_method): New type for how to set the
starting point for EM using the unlabeled documents.
(bow_em_multi_hump_init_method): New type for how to distribute the
labeled documents among the multiple negative humps.
(ignored_model_percent): Removed.
(bow_em_pr_window_size): Removed.
(bow_em_take_best_barrel): Removed.
(bow_em_print_stat_summary): Removed.
(em_score_normalize_log): Removed.
(em_unlabeled_classname): Removed.
(use_even_for_initial_class_probs): Removed.
(em_anneal): New option variable.
(em_temperature): New option variable.
(em_temp_reduction): New option variable.
(em_print_correct): New option variable.
(em_unlabeled_start): New option variable.
(em_multi_hump_init): New option variable.
(em_options): Removed options em-unlabeled-class and
em-even-unlabeled-priors em-score-normalize-log. Added options
em-anneal em-temperature em-temp-reduce em-print-progress
em-unlabeled-start and em-multi-hump-init.
(em_parse_opt): Likewise. Also removed extra stat method types.
(bow_em_new_vpc_with_weights): Removed code for all options removed
from em_options. Only handle stat methods for nb_score and
simple. A general clean-up and re-organization to put related
pieces near each other, and de-spaghetti-ization. Added code to
do some deterministic annealing by a specified temperature
reduction schedule. Added code for option to print accuracy after
each round of em (option em-print-progress). Cleaned up code for
specifying how to treat unlabeled docs when specifying starting
point by adding option em-unlabeled-start. Added option to
initialize the multiple negative humps by either randomly putting
a document in a hump, or spreading a document randomly across
humps (option em-multi-hump-init). Changed document types to be
in accordance with new doc types (model -> bow_doc_train, ignore
-> bow_doc_unlabeled, etc.). Prune the vocab before doing
multi-hump initialization based only on the two-class split.
(bow_em_test_accuracy): New function.
(bow_em_set_priors_using_class_probs): Changes for new doc types.
(bow_em_score): Changes for new doc types. Additions for
deterministic annealing code to handle the temperature correctly.
Remove code for em-score-normalize-log.
Tue Aug 4 14:54:55 1998 Andrew's Laptop Account <mccallum@jprc.com>
* mn.c (hac_em_one_mn_iteration): Re-written.
1998-08-04 Andrew McCallum <mccallum@jprc.com>
* mn.c: New file.
Tue Aug 4 14:35:32 1998 Andrew's Laptop Account <mccallum@jprc.com>
* hac.c (hac_split_kl_threshold): Change default value from 2000
to 0.4.
(hac_hypothesize_grandchildren): Use treenode_children_kl_div()
instead of treenode_children_weighted_kl_div().
(MN): New macro, set to 0. Include "mn.c" if true.
(hac_generates_class): New global variable.
(hac_em_one_iteration): Simply return hac_em_one_mn_iteration if MN.
(hac_cluster): If CROSSBOW_CLASSES_COUNT, print the class distribution.
* crossbow.c (crossbow_index): Always set CROSSBOW_CLASSES_COUNT,
independent of BUILD_HIER_FROM_DIR. Also set classes distribution
to uniform in the root, so that all children will get space
allocated for this distribution.
1998-07-30 Jason Rennie <jrennie@cyclone.jprc.com>
* lex-suffixing.c (suffixing_snarf_suffix): added hack so that
arrow will (hopefully) work on Cora 23k+ document data set
1998-07-28 Jason Rennie <jrennie@cyclone.jprc.com>
* arrow.c (main): Don't load everything into memory at start.
1998-07-22 Jason Rennie <jrennie@cyclone.jprc.com>
* docnames.c (bow_map_filenames_from_dir): Try opening as file if
directory doesn't work.
* docnames.c (bow_map_filenames_from_dir): Give warning and return
if no directory found.
1998-07-17 Andrew McCallum <mccallum@jprc.com>
* hac.c (hac_cluster): Print the iteration number.
(hac_full_em): Don't assert HAC_LOO. Print the word distribution
immediately after depositing the data. Add (commented out) code
for initializing the lambdas to a skewed distribution. Print the
iteration number.
* treenode.c (treenode_pr_wi_loo_local): Handle the case in which
there is no data left after the LOO document is removed.
(treenode_keywords_print): Use treenode_pair_kl_div() instead of
treenode_pair_weighted_kl_div().
* bow/treenode.h: Declare Kamal's new function.
1998-07-15 Kamal Nigam <knigam@zeno.jprc.com>
* hac.c (hac_shrinkage): New variable for shrinkage option.
(hac_loo): New variable for loo option.
(cluster_hac_options): Changes for loo and shrinkage options.
(cluster_hac_parse_opt): Likewise.
(hac_hypothesize_grandchildren): Move code to create children to new
function.
(hac_create_children_for_node): New function.
(hac_perplexity): Changed shrinkage and loo compiler options to
runtime options.
(hac_em_one_iteration): Likewise. Added loo-no shrinkage
option.
(hac_cluster): Also created children of root identically to how other
children are created.
(hac_full_em): Check that loo and shrinkage are in use.
* treenode.c (treenode_log_local_prob_of_wv_loo): New function.
1998-07-15 Andrew McCallum <mccallum@jprc.com>
* crossbow.c (crossbow_filename2di): New global variable.
(crossbow_classes_count): New global variable.
(crossbow_archive): Write CROSSBOW_FILENAME2DI.
(crossbow_unarchive): Read it, and initialize CROSSBOW_CLASSES_COUNT.
(crossbow_index): Initialize CROSSBOW_FILENAME2DI. Set
CROSSBOW_CLASSES_COUNT and create space in the root for CLASSES
distribution.
(crossbow_classify_test_docs): In addition to calculating fraction
correct, also calculate average inverse rank, and average score
difference between top ranked class and correct class.
* bow/crossbow.h: Declare new global variables.
* hac.c (hac_cluster): Don't print newlines before TEMPERATURE.
(hac_full_em): Use alpha=1 when setting leaf prior, instead of 0.
Don't print leaf priors since they don't change. Print top words
if HAC_VERTICAL_WORD_MOVEMENT.
* treenode.c (treenode_new): Set CLASSES_CAPACITY from PARENT if
there is a parent, then also initialize CLASSES to uniform.
(treenode_write): Write the CLASSES information to disk. NOTE: This
changes the file format for the model.
(treenode_new_from_fp): Read in the CLASSES information.
(treenode_set_classes_uniform): New function.
(treenode_set_classes_from_new_classes): New function.
* bow/treenode.h: Declare new functions.
1998-07-14 Andrew McCallum <mccallum@jprc.com>
* crossbow.c (main): Tell the lexer to toss words longer than 20.
* hac.c (hac_hypothesize_grandchildren): Instead of void, return
non-zero int if a split was made.
(hac_consider_splitting): Likewise.
(hac_cluster): Write classification and keywords to separate files for
different iterations.
* treenode.c (treenode_set_words_from_new_words): Renormalize
WORDS distribution after accelerated EM.
(treenode_set_new_words_from_perturbed_words): Add NOISE_WEIGHT
argument.
(treenode_set_new_words_from_perturbed_words_all): Likewise.
(treenode_set_lambdas_from_new_lambdas): Always avoid accelerated EM,
even if USE_ACCELERATED_EM is non-zero.
(treenode_keywords_print): New function.
(treenode_keywords_print_all): New function.
(treenode_pair_kl_div): New function.
(treenode_pair_weighted_kl_div): New function.
* bow/treenode.h: Declare new functions.
1998-07-13 Andrew McCallum <mccallum@jprc.com>
* crossbow.c (crossbow_classify_test_docs): Add SHRINKAGE argument
and use it instead of macro.
(crossbow_classify): Use TEST_FRACTION when splitting.
(crossbow_options): New option "test-percentage".
(crossbow_parse_opt): Handle it.
(main): Initialize it to .3.
* hac.c (hac_em_one_iteration): If ANCESTOR_MEMBERSHIP is zero,
continue. Set the leaf prior before setting word distribution,
not after.
(hac_place_labeled_data): New function.
(hac_full_em): Use it. Move hac_em_one_iteration() to after the
diagnostic printing, not before.
* treenode.c (treenode_set_lambdas_uniform_all): New function.
(treenode_set_words_from_new_words): Don't assert that
TOTAL_WORD_COUNT is greater than zero. If it is zero, set the
distribution to uniform, with ALPHA equal to 1/N.
(treenode_set_leaf_prior_from_new_prior_all):
(treenode_set_lambdas_from_new_lambdas): Likewise.
(treenode_log_local_prob_of_wv): Use log()-prob, not prob!
(treenode_word_probs_print_all): Only print the prior for leaf nodes.
(treenode_children_kl_div): Don't assert that KLDIV is less than 10.
* bow/treenode.h: Declare new functions.
1998-07-09 Andrew McCallum <mccallum@jprc.com>
Let crossbow do classification.
* crossbow.c (SHRINKAGE): New macro.
(crossbow_arg_state): New element BUILD_HIER_FROM_DIR.
(crossbow_docs_split): New function.
(crossbow_doc_is_model): New function.
(crossbow_doc_is_test): New function.
(crossbow_doc_is_ignore): New function.
(crossbow_new_root_from_dir): New function.
(crossbow_index): Initialize DOC.TAG. Obey BUILD_HIER_FROM_DIR.
(crossbow_classify_test_docs): New function.
(crossbow_classify): Filled out.
(crossbow_options): New command-line options "build-hier-from-dir" and
"classify".
(main): Set default BUILD_HIER_FROM_DIR to 0.
* bow/crossbow.h: Declare new functions.
(crossbow_doc): New element CI.
* hac.c (hac_deterministic_horizontal): New variable.
(hac_vertical_word_movement): New variable.
(cluster_hac_options): New command-line option
"hac-no-vertical-word-movement".
(cluster_hac_parse_opt): Handle it.
(hac_perplexity): New function.
(hac_em_one_iteration): Obey HAC_DETERMINISTIC_HORIZONTAL and
HAC_VERTICAL_WORD_MOVEMENT.
(hac_cluster): Don't initialize children of root if there already are
some. Add LOO counts in the initial random E-step.
(hac_full_em): New function.
* treenode.c (treenode_new): Add NAME argument. All callers
changed. Initialize CLASSES to NULL.
(treenode_new_from_fp): Initialize CLASSES to NULL.
(treenode_children_weighted_kl_div): New function.
* bow/treenode.h: Declare new functions.
(treenode): Add new elements CLASSES_CAPACITY, CLASSES, NEW_CLASSES.
* bow/libbow.h (bow_array_next_index): Declare function.
1998-07-09 Jason Rennie <jrennie@cyclone.jprc.com>
* rainbow-stats.pl (read_trial): try to sort class names
1998-07-06 Andrew McCallum <mccallum@jprc.com>
* hac.c (LOO): New macro.
(cluster_hac_options): New command-line option "hac-maximum-depth".
(cluster_hac_parse_opt): Handle it.
(hac_hypothesize_grandchildren): Obey it.
(hac_em_one_iteration): Implement LOO.
* treenode.c (treenode_new): Initialize LOO treenode variables.
(treenode_new_from_fp): Likewise.
(treenode_add_new_loo_for_di_wvi): New function.
(treenode_free_loo): New function.
(treenode_set_loo_from_new_loo): New function.
(treenode_set_words_from_new_words): Set NEW_WORDS_NORMALIZER.
(treenode_pr_wi_loo_local): New function.
(treenode_pr_wi_loo): New function.
(treenode_word_likelihood_ratios): Use shrinkage estimates for word
probabilities instead of the local estimates.
* bow/treenode.h: Declare new functions.
(treenode): Add elements DI_LOO, DI_WVI_LOO, NEW_DI_LOO and
NEW_DI_WVI_LOO.
* configure.in (CFLAGS): Change default to include -Wimplicit
instead of -Wno-implicit.
* bow/crossbow.h (crossbow_convert_log_probs_to_probs): Declare
function.
1998-07-03 Andrew McCallum <mccallum@jprc.com>
* crossbow.c (crossbow_arg_state): New member CLUSTER_OUTPUT_DIR.
(crossbow_leaf_document_probs_print): Calculate cross entropy instead
of probability.
(crossbow_archive): Add DIRNAME argument.
(crossbow_unarchive): Likewise.
(crossbow_classify_all_docs): New function.
(crossbow_options): New command-line option "cluster-output-dir".
(main): Handle it.
* hac.c: Add command-line options for many parameters. Call
crossbow_classify_all_docs().
* treenode.c (USE_ACCELERATED_EM): New macro.
(EM_ACCELERATION): New macro.
(treenode_set_words_from_new_words): Use EM_ACCELERATION.
(treenode_set_lambdas_from_new_lambdas): Likewise.
(treenode_iterate_all_under_node): New function.
(treenode_pr_wi): Make it work for interior nodes of the tree as well
as the leaves.
(treenode_prior): New function.
(treenode_node_count): New function.
(treenode_normalized_word_prob_all_print): New function.
* bow/treenode.h: Declare new functions.
1998-06-27 Andrew McCallum <mccallum@jprc.com>
* bow/libbow.h (bow_fwrite_double): New function.
(bow_fread_double): New function.
* bow/treenode.h: Declare new functions.
(treenode): Change all float's to double's.
* crossbow.c (crossbow_unarchive): Use renamed function
treenode_new_from_fp().
* hac.c (SHRINKAGE): New macro.
(hac_hypothesize_grandchildren): Change KLDIV threshold from 0.0001 to
0.2. Instead of setting new LAMBDAS to uniform, re-divide the
distribution from the new leaf's parent. Use shrinkage.
(hac_em_one_iteration): Use shrinkage.
(hac_consider_splitting): New function.
(cluster_hac): Use it. Initialize by dividing each document
probabilistically among children, instead of by hard assignment.
Print just the word prob distributions, not the likelihood ratios.
* treenode.c (treenode_new): Initialize LAMBDAS and NEW_LAMBDAS to
put all the weight on the new treenode.
(treenode_write): Write the PRIOR and NEW_PRIOR. Everything that was
a float is now a double.
(treenode_new_from_fp): Read them. Likewise.
(treenode_set_lambdas_uniform): New function.
(treenode_set_lambdas_from_new_lambdas): New function.
(treenode_log_local_prob_of_wv): New function.
(treenode_complete_log_prob_of_wv): New function.
(treenode_pr_wi): New function.
(treenode_log_prob_of_wv): Changed to use shrunken pr_wi estimates.
(treenode_word_leaf_likelihood_ratios): Deal with zero word probs.
(treenode_children_kl_div): Likewise.
1998-06-24 Andrew McCallum <mccallum@jprc.com>
* hac.c (hac_branching_factor): Renamed from NUM_CLUSTERS.
(hac_hypothesize_grandchildren): Renamed from
hac_hypothesize_grandchidren, and implemented.
(hac_em_one_iteration): Fix bug in perplexity calculation.
(cluster_hac): Instead of clustering flatly, grow a tree. Print more
diagnostics.
* bow/treenode.h: Declare more functions.
* treenode.c (treenode_new): Fix parenthesization of strlen for
NAME. Set WORDS_CAPACITY from parent.
(treenode_word_likelihood_ratios): Calculate weighted log likelihood
ratio instead of straight log likelihood.
(treenode_word_leaf_likelihood_ratios): New function.
(treenode_word_leaf_odds_ratios): New function.
(treenode_word_leaf_mean_ratios): New function.
(treenode_word_leaf_likelihood_ratios_print): New function.
(treenode_word_leaf_odds_ratios_print): New function.
(treenode_children_kl_div): New function.
(treenode_is_leaf_parent): New function.
* random.c (bow_random_double): Add LOW value after the
multiplication!
* em.c (random_double): Add LOW value after the multiplication!
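The ordering fix recorded above for bow_random_double and random_double can be sketched as follows. This is an illustration under assumptions, not the libbow source: the function names are hypothetical, and rand() stands in for whatever generator the library uses. The point is only that the unit draw is scaled by the width of the range first and LOW is added afterwards.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: map UNIT in [0, 1] onto [LOW, HIGH].
   Scale first, then shift; adding LOW before the multiplication would
   scale LOW as well and move the result outside the range.  */
double
scale_to_range (double unit, double low, double high)
{
  assert (unit >= 0.0 && unit <= 1.0 && low <= high);
  return unit * (high - low) + low;
}

/* Hypothetical stand-in for a uniform draw from [LOW, HIGH].  */
double
random_in_range (double low, double high)
{
  double unit = (double) rand () / (double) RAND_MAX;
  return scale_to_range (unit, low, high);
}
```

For the range [2, 5], a unit draw of 0 maps to 2 and a unit draw of 1 maps to 5; adding LOW before multiplying would instead give 6 and 9.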
1998-06-23 Jason Rennie <jrennie@cyclone.jprc.com>
* barrel.c (bow_barrel_add_from_text_dir): when printing
bow_num_words, allow 8 spaces
1998-06-22 Andrew McCallum <mccallum@jprc.com>
* Makefile.in (LIBBOW_C_FILES): Add random.c.
(CROSSBOW_C_FILES, CROSSBOW_O_FILES): New variables.
(crossbow): New target.
* bow/crossbow.h: Remove treenode functions. Declare new
functions.
* bow/libbow.h (bow_sparray): Declaration for a sparse array, but
it's not implemented yet.
(bow_wv_word_count): Declare new function.
(bow_random_double, bow_random_01): Declare new functions.
* crossbow.c: Move all treenode functions to treenode.c.
(crossbow_convert_log_probs_to_probs): New function.
(crossbow_leaf_document_probs_print): New function.
(crossbow_index): Create the root with space for 10 children.
(crossbow_cluster): Call cluster_hac().
* random.c: Fix typos.
1998-06-19 Andrew McCallum <mccallum@jprc.com>
* random.c: New file.
* wv.c (bow_wv_word_count): New function.
* array.c (bow_array_next_index): New function.
1998-06-18 Kamal Nigam <knigam@kamal.jprc.com>
* cluster-flat.c (cluster_flat): Initial check-in.
1998-06-18 Andrew McCallum <mccallum@jprc.com>
* crossbow.c (treenode_word_likelihood_ratio): New function.
(treenode_word_likelihood_ratio_print_top): New function.
* bow/crossbow.h: Declare many functions.
* crossbow.c (treenode_new): Initialize the WORDS and NEW_WORDS to
zero.
(crossbow_wv_at_di): New function.
(crossbow_load_doc_wvs): New function.
(crossbow_unarchive): Call it. Read wv's from file named "wvs".
(crossbow_index): Get the WV_SEEK_POS from the correct FP! Write wv's
to file named "wvs".
(crossbow_options): New option "cluster".
(crossbow_parse_opt): Handle it.
(main): Call crossbow_unarchive if WHAT_DOING isn't crossbow_index.
* crossbow.c: Move definitions to crossbow.h. Create read and
write an array of crossbow_doc structures.
(crossbow_index): Fix bug that was causing crash.
* crossbow.h: New file.
1998-06-17 Andrew McCallum <mccallum@jprc.com>
* crossbow.c: Completely re-written in preparation for
implementing hierarchical clustering.
* array.c (bow_array_init): Use bow_malloc instead of malloc.
1998-06-09 Jason Rennie <jrennie@cyclone.jprc.com>
* tfidf.c: change 'extern int' to 'int' so that
bow_tfidf_num_hit_documents is defined
1998-06-09 Andrew McCallum <mccallum@jprc.com>
* tfidf.c (bow_tfidf_set_weights): When DOING_LOG_COUNTS, use the
count+1, not just count. This avoids unwanted zeros when
count==1!
* wv.c (bow_wv_set_weights_to_log_count_times_idf): Likewise.
* tfidf.c (bow_tfidf_num_hit_documents): New global variable.
(bow_tfidf_score): Set it.
* bow/tfidf.h: Declare new global variable.
* arrow.c (arrow_query): Print BOW_TFIDF_NUM_HIT_DOCUMENTS.
1998-06-08 Andrew McCallum <mccallum@jprc.com>
* arrow.c (arrow_query): Cast argument to bow_free to avoid
warning.
* tfidf.c (DOING_LOG_COUNTS): New macro.
(bow_tfidf_set_weights): Use it to (conditionally) set weights to the
log(count)*idf, instead of count*idf. This is an attempt to
improve IR for Cora by more strongly insisting on having all query
terms be present.
(bow_tfidf_score): Likewise. Calculate the total number of documents
that had query terms present, and print this number to standard
out as a directive to the Cora CGI script as ",HITCOUNT %d".
* wv.c (bow_wv_set_weights_to_log_count_times_idf): New function.
* libbow.h: Declare new function.
* arrow.c (arrow_query): Use NUM_SUFFIXES in another place where it
should be, instead of the old constant 5. Print the score on each
line, in addition to the filename and hit words.
1998-06-05 Andrew McCallum <mccallum@jprc.com>
* tfidf.c (bow_tfidf_score): Fix bug in query word recording, and
uncomment code.
* tfidf.c (bow_tfidf_score): Only return documents that contain
one or more query terms. Add code for returning the query words
that were found in each document, but comment it out since it
isn't working yet.
* arrow.c (NUM_SUFFIXES): New macro.
(arrow_query): Use it. Print not just the FILENAME, but also the
HITS[I].NAME, which, in the case of TFIDF, should contain the
query words that were found in this document.
1998-06-04 Andrew McCallum <mccallum@jprc.com>
* barrel.c (bow_barrel_new_from_printed_barrel_file): Read word
count as a float, not an integer. Initialize the barrel's wi2dvf
WEIGHT as well as COUNT with the word count.
1998-06-03 Andrew McCallum <mccallum@jprc.com>
* lex-suffixing.c (suffixing_snarf_suffix): Change suffix boundary
from '-' to "xxx". Lowercase the suffix.
* arrow.c (arrow_query): Likewise.
1998-06-01 Andrew McCallum <mccallum@jprc.com>
* STANDARDS: New file.
* rainbow-h.c (hier_barrel): New element LOO_DOC_WI2DVF.
(hier_barrel_new): Initialize it to NULL.
(hier_barrel_init_all_new_class_wi2dvf): If it is NULL, initialize it
to an empty wi2dvf.
(hier_barrel_set_class_model_from_new_class_wi2dvf): Instead of
making a new empty NEW_CLASS_WI2DVF for the next round, just set
it to NULL; hier_barrel_init_all_new_class_wi2dvf() will create a
new one for us, now that it is being called inside the
while()-loop in the function hier_barrel_calc_lambdas_all_nodes().
(hier_barrel_calc_lambda_one_iteration): Calculate the DI in the root
of each document from heap. Then use this to add
E-step-probability-mass to the SOURCE_HBARREL->LOO_DOC_WI2DVF.
(hier_barrel_calc_lambdas_all_nodes): Call
hier_barrel_init_all_new_class_wi2dvf() inside the while loop, so
that the LOO_DOC_WI2DVF gets properly re-initialized.
(hier_barrel_word_probabilities_for_hbarrel): Comment out unused
variable.
(rainbowh_unarchive): Initialize the FILENAME -> DI mapping in the
root's DOC_BARREL. This is used in
hier_barrel_calc_lambda_one_iteration() to calculate ROOT_DI.
1998-06-02 Kamal Nigam <knigam@kamal.jprc.com>
* active.c (active_select_stream_kl): New function.
(active_selection_type): added skl as a method.
(active_parse_opt): Likewise.
(active_learn): Cleaner futzing and restoring of EM options when
calculating test stats. Correctly rebuild barrel for testing
stats.
* em.c (bow_em_new_vpc_with_weights): Don't try to set priors if
doing uniform class priors; this avoids potential badness. If
zero labeled docs, only create a random starting point if we're
not doing 'em-no-splitting'; active learning needs specific
starting points. Create full word-class matrix for all words in
vocabulary; this avoids problems when perturbing the barrel for
zero-occurring words.
* barrel.c (bow_barrel_copy): initialize classnames correctly
1998-06-01 Andrew McCallum <mccallum@jprc.com>
These changes from Sean Slattery <Sean.Slattery@cs.cmu.edu>
* bow/libbow.h (bow_bitvec): Added element BITS_SET.
* bitvec.c (bow_bitvec_set): Keep BITS_SET up to date.
1998-06-01 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c: First attempts at Full EM. It doesn't do
leave-one-document out properly because it doesn't keep track of
how a document's data was distributed among the ancestors.
(rainbowh_options): Added "emfull" as a shrinkage option.
(rainbowh_parse_opt): Handle it.
(struct rainbowh_arg_state): New element FULL_EM.
(hier_barrel_prune_words_not_in_map_and_set_vpc): Remind myself to set
the CDOC->NORMALIZER here.
(hier_barrel_set_local_class_model): Likewise.
(hier_barrel_init_all_new_class_wi2dvf): Create NEW_CLASS_WI2DVF with
space enough for bow_num_words() + 2, not just bow_num_words().
(hier_barrel_set_class_model_from_new_class_wi2dvf): Return
immediately if HBARREL->CLASS_BARREL is NULL, and only do the
replacement if HBARREL->NEW_CLASS_WI2DVF is non-NULL.
(hier_barrel_prob_wi_in_ci_wittenbell): Function gutted.
(old_hier_barrel_prob_wi_in_ci_shrink): Function removed.
(hier_barrel_prob_wi_in_ci_local): Comment out assertion about WEIGHT
equalling COUNT. As a temporary measure, check for negative
counts and make them zero.
(hier_barrel_prob_wi_in_ci_local): As a temporary measure, check for
negative counts and make them zero.
(hier_barrel_prob_wi_local): Comment out assertion about WEIGHT
equalling COUNT. As a temporary measure, check for negative
counts and make them zero.
(hier_barrel_calc_lambda_one_iteration): Attend to
RAINBOWH_ARG_STATE.FULL_EM.
(hier_barrel_all_reset_cdoc_word_count_from_wi2dvf_weights): New
function.
(hier_barrel_calc_lambdas_all_nodes): Print overall perplexity and do
new necessary resets.
(hier_barrel_word_probabilities_for_class): Don't bother trying entire
CLASS_BARREL->WI2DVF->SIZE, because many will have null DV's. Get
local probabilities instead of shrunken ones.
(hier_barrel_word_probabilities_for_hbarrel): New function.
(hier_barrel_print_word_probabilities_for_hbarrel): Use new function.
(hier_barrel_print_all_highest_prob_words): New function.
(_hier_barrel_local_score): When printing scores, print count instead
of weight.
(hier_barrel_set_lambdas): Do full EM if requested by attending to
RAINBOWH_ARG_STATE.FULL_EM.
(main): Initialize RAINBOWH_ARG_STATE.FULL_EM.
* Makefile.local (rainbow-h): Include argp/libargp.a in the
dependencies.
* naivebayes.c
(bow_naivebayes_set_cdoc_word_count_from_wi2dvf_weights): New
function.
* bow/naivebayes.h
(bow_naivebayes_set_cdoc_word_count_from_wi2dvf_weights): Declare
function.
* wa.c (bow_wa_append): Fix boundary condition for array growth.
1998-06-01 Jason Rennie <jrennie@kamal.jprc.com>
* arrow.c (arrow_query): return 0 if query words all have 0 IDF
1998-05-31 Jason Rennie <jrennie@kamal.jprc.com>
* libbow.h (BOW_MAX_WORD_LENGTH): increased to 4096
1998-05-29 Jason Rennie <jrennie@kamal.jprc.com>
* arrow.c (arrow_query): added a linefeed to a verbosify statement
1998-05-29 Jason Rennie <jrennie@data.jprc.com>
* arrow.c (arrow_serve): kill child after serving is done.
(main): prevent zombie children in System V environment.
1998-05-22 Andrew McCallum <mccallum@jprc.com>
* arrow.c (arrow_index): Make default method tfidf instead of
prind.
(arrow_query): Add suffixes "-Title", "-Author", "-Institution",
"-Abstract" to the query.
* lex-suffixing.c: New file.
* Makefile.in (LIBBOW_C_FILES): Added lex-suffixing.c.
* bow/libbow.h (bow_suffixing_lexer, bow_default_lexer_suffixing):
Declare variables.
* bow/tfidf.h (bow_method_tfidf): Declare method.
* deflexer.c (_bow_default_lexer_init): Add suffixing lexer.
* opts.c (bow_options): New command-line option "lex-suffixing".
(parse_bow_opt): Handle it.
* error.c (_bow_error): fflush() before aborting.
* rainbow-rank.pl: Calculate average inverse rank, instead of
average rank.
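The average inverse rank statistic mentioned above for rainbow-rank.pl can be sketched in C as follows. This is a minimal illustration, not a translation of the Perl script: the function name and the input layout (an array of 1-based ranks of the correct class, one per test document) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: RANKS[i] is the 1-based position of the correct
   class in the ranked output for test document I.  Returns the mean of
   1/rank, which is 1.0 when every document is ranked first and
   approaches 0 as the correct class falls further down the list.  */
double
average_inverse_rank (const int *ranks, size_t n)
{
  double sum = 0.0;
  size_t i;

  assert (ranks != NULL && n > 0);
  for (i = 0; i < n; i++)
    {
      assert (ranks[i] >= 1);
      sum += 1.0 / (double) ranks[i];
    }
  return sum / (double) n;
}
```

Unlike the plain average rank, this statistic is bounded and is dominated by how often the correct class lands near the top, so a single very poor ranking changes it only slightly.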
1998-05-06 Andrew McCallum <mccallum@jprc.com>
* split.c (bow_basename): Don't include the `/' at the beginning
of the NUM_COMPONENTS components. This way Reuters experiments,
with a --testing-files file containing only the document numbers
will still work; before this function was returning strings like
"/15232" for the filenames in the CDOCS.
* vpc.c (bow_barrel_set_vpc_priors_by_counting): Avoid crashing
when PRIOR_SUM is zero.
* em.c (random01): Make sure that this doesn't return a number
equal to 0 or 1, just as "Recipes in C" says it should for doing
the Gamma distribution properly. We were getting negative values
somewhere later in the code and suspect this is the culprit.
(bow_em_perturb_weights): Don't check the cdoc->type of the class
CDOCS; they aren't set. Assert that CDOC->WORD_COUNT is
non-negative.
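The random01 requirement recorded above (never return exactly 0 or 1) can be met in several ways; the following is one minimal sketch, not the code in em.c. The function names are hypothetical and rand() is an assumed source of integers. Samplers that take the logarithm of a uniform draw generally need the endpoints excluded, which is the motivation for the open interval.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: map an integer draw R in [0, R_MAX] to the open
   interval (0, 1).  R == 0 gives 1 / (R_MAX + 2), which is above zero;
   R == R_MAX gives (R_MAX + 1) / (R_MAX + 2), which is below one.  */
double
open_unit_interval (long r, long r_max)
{
  assert (r >= 0 && r <= r_max);
  return ((double) r + 1.0) / ((double) r_max + 2.0);
}

/* Hypothetical stand-in for a uniform draw strictly between 0 and 1.  */
double
random01_open (void)
{
  return open_unit_interval ((long) rand (), (long) RAND_MAX);
}
```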
1998-05-05 Andrew McCallum <mccallum@jprc.com>
* Makefile.in (PERL_FILES): Added rainbow-rank.pl. Written by
Andrew Y. Ng <ayn@ai.mit.edu>.
1998-05-04 Andrew McCallum <mccallum@jprc.com>
* em.c (bow_em_perturb_weights): Properly perturb zero-valued
WEIGHTS when doing dirichlet-based perturbation.
1998-04-30 Andrew McCallum <mccallum@jprc.com>
* kl.c (bow_kl_score): Don't insist on a non-empty QUERY_WV. If
the query is empty, just use the class priors. Accomplish this by
setting QUERY_WORD_COUNT equal to 1.
* naivebayes.c (bow_naivebayes_set_weights): Avoid errors when the
barrel is completely empty.
1998-04-28 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c (hier_barrel_prune_words_not_in_map_and_set_vpc):
Return immediately if HBARREL is NULL.
(hier_barrel_print_infogain_hier): Likewise.
(hier_barrel_set_vpc_and_populate_lower_branches): After freeing the
DOC_BARREL's, immediately re-create them, instead of relying on
later code to create them on demand.
(hier_barrel_is_leaf_parent): New function.
(_hier_barrel_set_node_scores): Use it to determine when to avoid
scoring an interior node. The previous test only worked with
uniform-depth trees, and would cause some classes never to win.
(main): Likewise.
* split.c (bow_basename): Fix handling of NUM_COMPONENTS.
* active.c (active_learn): Fix typo:
em_perturb_starting_point_by_variance ->
bow_em_perturb_starting_point.
1998-04-25 Andrew McCallum <mccallum@jprc.com>
* em.c (bow_em_perturb_starting_point): Renamed from
em_perturb_starting_point_by_variance. Changed from int (treated
as binary) to a bow_em_perturb_method (an enum), for none,
gaussian and dirichlet.
(em_options): Expand documentation for "em-perturb-starting-point".
(em_parse_opt): Handle it.
(random01): New function.
(bow_gamma_distribution): New function.
(bow_em_perturb_weights): Use it to implement dirichlet perturbation.
* bow/em.h (bow_em_perturb_method): New typedef.
(bow_em_perturb_starting_point): Renamed from
em_perturb_starting_point_by_variance.
* active.c: Rename em_perturb_starting_point_by_variance to
bow_em_perturb_starting_point. Note that this variable is now an
enum.
1998-04-24 Andrew McCallum <mccallum@jprc.com>
* arrow.c (arrow_arg_state): New element SERVE_WITH_FORKING.
(arrow_query): New argument NUM_HITS_TO_SHOW.
(arrow_socket_init): Add verbosity.
(arrow_serve): Initial code thinking about a forking server.
(main): In query_serving, touch all dv's if we are a forking server.
1998-04-24 Kamal Nigam <knigam@hurricane.jprc.com>
* bow/libbow.h: New prototypes for bow_barrel_copy and split.c
changes.
* active.c (active_selection_type): added stream vote entropy.
(active_stream_epsilon): New option for stream vote entropy.
(active_perturb_after_em): New option.
(active_select_stream_ve): New function.
(active_learn): Only calculate density if needed for dkl. Code for
new option active_perturb_after_em.
* barrel.c (bow_barrel_copy): New function.
* split.c (bow_cdoc_is_nontest): New function.
(bow_cdoc_is_ignore): New function.
(bow_cdoc_is_ignored_model): New function.
(bow_test_next_wv): Converted to use bow_heap_next_wv so returns all
docs.
(bow_nontest_next_wv): Likewise.
(bow_model_next_wv): Likewise.
(bow_ignore_next_wv): Likewise.
(bow_ignored_model_next_wv): Likewise.
1998-04-24 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c (hier_barrel_test): Make the format of the infogain
filename conform to the new standard in the hier/Makefile.
* rainbow-h.c (rainbowh_options): Add new command line argument
"testing-files-use-basename" and have it take an optional numeric
argument to indicate how many path components to take from the end
of the path name.
(rainbowh_parse_opt): Handle it.
* rainbow.c (rainbow_options): Make "testing-files-use-basename"
take an optional numeric argument to indicate how many path
components to take from the end of the path name.
(rainbow_parse_opt): Handle it.
* split.c (bow_basename): Added new argument NUM_COMPONENTS, so we
can grab a certain number of the directory names too.
(bow_test_set_files): Use new argument.
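A minimal sketch of what taking the last NUM_COMPONENTS path components could look like. The function name and behavior below are illustrative assumptions, not the actual bow_basename code in split.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the libbow implementation): return a pointer
   into PATH at the start of its last NUM_COMPONENTS components.  If PATH
   has fewer components than requested, return PATH unchanged. */
const char *
last_path_components (const char *path, int num_components)
{
  const char *p;

  assert (path != NULL && num_components >= 1);
  /* Scan backwards from the end, counting '/' separators. */
  for (p = path + strlen (path); p > path; p--)
    {
      if (p[-1] == '/' && --num_components == 0)
        return p;
    }
  return path;
}
```

With NUM_COMPONENTS of 1 this behaves like a plain basename; larger values keep that many trailing directory names as well.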
1998-04-24 Jason Rennie <jr6b@andrew.cmu.edu>
* arrow.c (arrow_process_commands): New function.
(arrow_serve): Added capability to fork off child servers
(arrow_parse_opt): added option "--query-forking-server"
1998-04-23 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c (hier_barrel_calc_lambda_one_iteration):
Temporarily turn off use of NEW_CLASS_WI2DVF.
(hier_barrel_print_barrel): Use correct function name to print
selected.
1998-04-22 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c (hier_barrel_print_barrel): Avoid crashes, and print
documents only from their leaves.
* barrel.c (bow_barrel_printf_selective): Renamed from
bow_barrel_printf, and added third argument.
(bow_barrel_printf): Now a wrapper around bow_barrel_printf_selected.
* bow/libbow.h: Declare new function.
* rainbow-h.c: Removed calls to
hier_barrel_check_weight_equals_count.
* rainbow-h.c (hier_barrel_for_filename): Make sure
LONGEST_MATCH_CI is initialized.
(hier_barrel_wa_flat_infogain): Initialize MARGINAL_TOTAL to zero.
* Makefile.local (rainbow-h): Depend on rainbow-h.o instead of
rainbow-h.c, so rainbow-h.o gets built.
* bow/libbow.h: Include <mcheck.h> if BOW_MCHECK is defined.
(bow_malloc_hook, bow_free_hook): Global variables declared.
(bow_malloc): Call the hook. Do special things if MCHECK is defined.
(bow_realloc): Call the hook.
(bow_free): Likewise.
(bow_test_split3, bow_ignore_split, bow_argp_add_child): Functions
declared.
* bmalloc.c (bow_malloc_check_all): New function.
(_bow_malloc_hook): New function.
(_bow_free_hook): New function.
(bow_malloc_hook, bow_free_hook): Global variables defined.
* arrow.c: Declare external wicoo functions.
* bow/naivebayes.h: Declare some functions in naivebayes.c.
* naivebayes.c: Declare external simple_good_turing function.
* istext.c (NUM_TEST_CHARS): Change from "static const int" to a
macro.
* rainbow-h.c (hier_barrel): New element NEW_CLASS_WI2DVF.
(hier_barrel_check_weight_equals_count): New function.
(hier_barrel_init_all_new_class_wi2dvf): New function.
(hier_barrel_set_class_model_from_new_class_wi2dvf): New function.
(hier_barrel_set_all_class_model_from_new_class_wi2dvf): New function.
(hier_barrel_add_stats): Add the doc_barrel COUNT to the class_barrel
WEIGHT, instead of adding zero for the weight.
(hier_barrel_ancestor): In third argument return the CI, which in the
returned hbarrel points towards the HBARREL argument.
(hier_barrel_is_ancestor): New function.
(hier_barrel_prob_wi_in_ci_local): Use WEIGHT instead of COUNT to
calculate probabilities.
(hier_barrel_calc_lambda_one_iteration): Check that the weight equals
the count, temporarily as debugging. Prepare to do *full* EM, by
accumulating probability mass of beta's in NEW_CLASS_WI2DVF.
(hier_barrel_calc_lambdas_all_nodes): New function.
(hier_barrel_print_word_probabilities_for_class): Now uses bow_wa, and
takes new argument specifying how many to print.
(hier_barrel_print_all_highest_prob_words): New function.
(hier_barrel_word_probabilities_for_class): New function.
(hier_barrel_print_infogain_flat): Free the WA at end!
(main): Pass new argument to
hier_barrel_print_word_probabilities_for_class.
1998-04-21 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c (hash_elt_t): typedef removed.
(hier_barrel): Remove all elements having to do with Andrew Ng's old
hash.
(hier_barrel_new_from_file): Remove second, conditional fclose(fp).
That is, don't close the fp twice! On RedHat5.0 this crashes,
although it didn't on RedHat4.x.
(HIER_TEST): Preprocessor conditional removed.
* rainbow-h.c (hier_dir_is_leaf): Remove "const" from type of
argument to dir_leaf_select. Avoids warning with new RedHat5.0.
* rainbow-h.c (rainbowh_options): New command line option
"use-vocab-size-hier", and old command line option,
"use-vocab_size" is now an alias for it. Functionality of this
option unchanged. New command line options "use-vocab-size-flat",
"print-word-infogain-flat", "print-word-infogain-hier".
(struct rainbowh_arg_state): New elements VOCAB_SIZE_METHOD and
FORCE_TRIVIAL_HIER.
(rainbowh_parse_opt): Handle new options. Add new shrinkage method
"emtriv", setting FORCE_TRIVIAL_HIER.
(hier_barrel_calc_lambda_one_iteration): Don't print warning about
zero lambda if FORCE_TRIVIAL_HIER.
(hier_barrel_initialize_lambdas): If FORCE_TRIVIAL_HIER, set middle
lambdas to zero, and local and root lambdas to .5.
(hier_barrel_print_infogain_hier): Renamed from
hier_barrel_print_infogain.
(hier_barrel_wa_flat_infogain): New function.
(hier_barrel_print_infogain_flat): New function.
(hier_barrel_test): Change sprintf of filename in which to store the
reduced vocabulary. Now has pattern "infogain%d-%s-%05d", where
string is the vocab size method. Obey VOCAB_SIZE_METHOD.
(main): Change defaults from populate_by_scoring=1 and
hier_structure=neice to populate_by_scoring=0 and
hier_structure=leaf. Print the total number of words using
bow_verbosify instead of fprintf. Implement
RAINBOWH_PRINTING_INFOGAIN_FLAT.
Tue Apr 21 11:17:46 1998 Andrew's Laptop Account <mccallum@jprc.com>
* bow/libbow.h (bow_entropy): Declare function.
(bow_event_models): Document with comments.
(bow_infogain_event_model): Declare global variable.
* info_gain.c (bow_entropy): Change type of first argument from
count[] to count*.
(bow_infogain_per_wi_new): Instead of using an infogain event
model that matches the word probability event model, use whichever
event model is specified by new global variable
BOW_INFOGAIN_EVENT_MODEL. Default is "document event model".
Thus the default is the old behavior before the binomial paper was
written. Before this change, naive Bayes by default used a "word
event model".
* naivebayes.c (bow_naivebayes_set_weights): Fix the setting of
IDF to Pr(w). Consolidate some code that no longer needs to be
separate because we no longer set the DV weights in this function.
* opts.c (bow_infogain_event_model): New global variable.
(bow_options): New option "infogain-event-model".
(parse_bow_opt): Handle it.
* wa.c (bow_wa_sort): Fix bug by which the wrong pointer was
passed to qsort.
(bow_wa_sort_reverse): Likewise.
(bow_wa_fprintf): If N is less than 0, print all entries.
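The wa.c entry above does not say which pointer was wrong. One common form of this mistake is passing qsort the container structure instead of the array it owns; the sketch below assumes that form. All type and function names here are hypothetical, not the bow_wa definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a word/weight pair and its array. */
struct word_score { int wi; float weight; };
struct word_array { int length; struct word_score *entry; };

/* Order entries by weight, largest first. */
static int
compare_weight_descending (const void *a, const void *b)
{
  const struct word_score *x = a;
  const struct word_score *y = b;
  return (x->weight < y->weight) - (x->weight > y->weight);
}

void
word_array_sort (struct word_array *wa)
{
  assert (wa != NULL && wa->entry != NULL);
  /* Pass the element array wa->entry, not wa itself: qsort must be
     given the address of the first element it is to reorder. */
  qsort (wa->entry, (size_t) wa->length, sizeof wa->entry[0],
         compare_weight_descending);
}
```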
Wed Apr 15 11:06:25 1998 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c (hier_barrel_correct_cdoc_prior_recurse): Fix typo.
(rainbowh_query): Add new filename argument to call to
bow_wv_new_from_text_fp().
* mix.c (mix_num_centroids_per_class): Change default from 4 to 2.
(mix_num_iterations): New static variable.
(mix_options): New command-line option "mixture-num-em-iterations".
(mix_parse_opt): Handle it.
(mix_perplexity): New function, (empty).
(mix_new_vpc): Fix bugs.
* rainbow-h.c (hier_barrel_correct_cdoc_prior_recurse): Change the
method of correction. Instead of multiplying by a ratio of the
number of children, multiply by all the priors up the tree to the
root.
(hier_barrel_total_of_cdoc_word_counts): New function (not used).
* Makefile.in (STANDARD_METHOD_C_FILES): Renamed from
METHOD_C_FILES.
(METHOD_C_FILES): Set from $(STANDARD_METHOD_C_FILES).
(DIST_FILES): Use STANDARD_METHOD_C_FILES instead of
METHOD_C_FILES. Thus we exclude methods added in Makefile.preamble.
* rainbow-stats.pl: Changes by Jason Rennie to make it more memory
efficient.
This version in the snapshot for today's date.
* rainbow.c (rainbow_query): Add new filename argument to
bow_wv_new_from_text_fp.
(rainbow_lisp_query): Likewise.
(rainbow_test_files): Likewise.
* arrow.c (arrow_query): Likewise.
(main): Likewise.
* crossbow.c (main): Likewise.
* active.c (active_selection_type): New method "dkl".
(active_alpha): New static variable.
(active_beta): New static variable.
(active_options): New command-line option "active-beta".
(active_parse_opt): Handle it.
(active_parse_opt): Look for "dkl" method.
(active_select_dkl): New function.
(active_doc_barrel_set_entropy): Use active_alpha.
(active_wv_density): Likewise. Make it KL divergence instead of cross
entropy.
(active_learn): Turn on and off cross entropy not just for wkl, but
also for dkl.
Mon Apr 13 14:08:56 1998 Andrew McCallum <mccallum@jprc.com>
* Makefile.in: Include file Makefile.preamble, if it is there.
(%.o:%.c): Move rule down into the Rules section.
* active.c (active_select_weighted_kl): Use new density
calculation to weight the scores by which we select documents for
labeling.
(active_doc_barrel_set_entropy): New function.
(active_doc_barrel_set_pr_w): New function.
(active_wv_density): New function.
(active_doc_barrel_set_density): New function.
(active_learn): Call active_doc_barrel_set_density().
* Makefile.preamble: New file. Moved METHOD_C_FILES appending
line here.
* Makefile.local (METHOD_C_FILES): Remove.
* Makefile.in (METHOD_C_FILES): Remove active.c, em.c and mix.c
from here.
* Makefile.local: Add a line that appends active.c, em.c, and
mix.c to METHOD_C_FILES.
* naivebayes.c (bow_naivebayes_new_odds_ratio_for_ci): Create the
RET with a little more space to avoid memory overrun and crash.
Set the ratio using a weighted likelihood ratio instead of
unweighted.
(bow_naivebayes_print_odds_ratio_for_all_classes): Print the number of
words in each class to stderr.
* rainbow-stats.pl: Include the pruned suffix in the printed
message to STDERR when it is used.
* rainbow.c (rainbow_test): Use BOW_BARREL_CLASSNAME_AT_INDEX on
the RAINBOW_DOC_BARREL, instead of using FILENAME_TO_CLASSNAME on
the CDOC->FILENAME. This works better when there is not a
one-to-one correspondence between classes and documents.
(rainbow_test_files): Count the number of classes with
BOW_BARREL_NUM_CLASSES on the RAINBOW_DOC_BARREL, not the class
barrel. Again this is necessary for methods in which there are
multiple mixtures per class, as in the "mixture" method. Set the
CURRENT_CLASS using BOW_BARREL_CLASSNAME_AT_INDEX instead of
FILENAME_TO_CLASSNAME.
Now --lex-pipe-command's get the fully-qualified path name of the
file being lexed in their environment variable RAINBOW_LEX_FILENAME.
* lex-simple.c (bow_lexer_simple_open_text_fp): New argument
FILENAME. Set the environment variable RAINBOW_LEX_FILENAME
before forking a lex pipe command, so the command can have access
to this filename.
* bow/libbow.h (bow_lexer): Make OPEN_TEXT_FP function take a new
argument, the filename. Add FILENAME as an argument to several
functions.
* int4str.c (bow_int4str_new_from_text_file): Pass FILENAME to lex
creation function.
* int4word.c (bow_words_add_occurrences_from_text_dir): Likewise.
* lex-gram.c (bow_lexer_gram_open_text_fp): New argument FILENAME;
pass it on.
* lex-indirect.c (bow_lexer_indirect_open_text_fp): Likewise.
* wi2dvf.c (bow_wi2dvf_add_di_text_fp): New argument FILENAME.
Pass to lex creation function.
* wv.c (bow_wv_new_from_text_fp): Likewise.
* robin.c (robin_index): Pass FILENAME to lex creation function.
* barrel.c (bow_barrel_add_from_text_dir): Pass FILENAME to
bow_wi2dvf_add_di_text_fp().
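A sketch of how RAINBOW_LEX_FILENAME could be exported before a lex pipe command is started, as the lex-simple.c entry above describes. The helper names are hypothetical, and popen() is an assumed simplification; the entries do not say how the real lexer starts the command.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: publish FILENAME in the environment so that a
   child process started afterwards inherits it.  Returns 0 on success. */
int
set_lex_filename (const char *filename)
{
  assert (filename != NULL);
  return setenv ("RAINBOW_LEX_FILENAME", filename, 1);
}

/* Hypothetical helper: set the variable first, then start COMMAND.
   The order matters because the child's environment is copied when
   the process is created. */
FILE *
open_lex_pipe (const char *command, const char *filename)
{
  assert (command != NULL);
  if (set_lex_filename (filename) != 0)
    return NULL;
  return popen (command, "r");
}
```

A pipe command can then read "$RAINBOW_LEX_FILENAME" to learn which file it is processing.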
Mon Mar 30 18:48:19 1998 Andrew McCallum <mccallum@jprc.com>
* wa.c (bow_wa_new): Malloc ENTRY.
(bow_wa_free): Free it.
(bow_wa_append): Expand ENTRY as necessary.
* bow/libbow.h (bow_wa): Make ENTRY a separate malloc'ed pointer.
Tue Mar 24 11:19:46 1998 Andrew McCallum <mccallum@jprc.com>
Implement variable-depth hierarchies.
* rainbow-h.c (hier_barrel): Remove element CHILDREN_RANKINGS.
Add elements DEPTH and LAMBDAS.
(hier_barrel_new): Initialize LAMBDAS to NULL.
(hier_barrel_set_depth_and_allocate_lambdas): New function.
(hier_barrel_free): Free LAMBDAS.
(EXTRA_CHECKS): Change from 0 to 1.
(hier_barrel_ancestor): New function.
(hier_barrel_prob_wi_in_ci_local): If CI == CI_IN_PARENT_ROOT, then
take probability from root, even if HBARREL is NULL. This
provides a way to use this function to query the root, which was
impossible before.
(hier_barrel_prob_wi_in_ci_shrink): Replace individual calls to
ancestor word distributions with a loop.
(hier_barrel_calc_lambda_one_iteration): Likewise. Return perplexity
instead of log-probability of the data.
(hier_barrel_prob_wi_shrink): Since the body only ever works on the
root, only shrink local and uniform, instead of uniform three
times.
(hier_barrel_initialize_lambdas): Set initial values with loop.
(hier_barrel_calc_lambda): Keep doing EM iterations until perplexity
change is less than 0.1, instead of going until all lambdas remain
constant.
(hier_barrel_calc_lambdas): Don't print "Calculting lambda", just
print the hbarrel name.
(rainbowh_unarchive): Do some post-processing after unarchiving. Set
the depth and allocate lambdas.
(main): Verbosify the total number of words instead of fprintf'ing it.
Mon Mar 23 19:59:33 1998 Andrew McCallum <mccallum@jprc.com>
* naivebayes.c (bow_naivebayes_new_odds_ratio_for_ci): New
function.
(bow_naivebayes_print_odds_ratio_for_all_classes): New function.
(bow_naivebayes_set_weights): Don't assert that we are using one of a
limited number of methods.
(bow_naivebayes_score): Initialize BSCORES[CI].NAME to NULL.
* bow/naivebayes.h: Declare functions.
* bow/libbow.h (bow_ws): New structure.
(bow_wa): New structure.
Declare new bow_wa_* functions.
* Makefile.in (LIBBOW_C_FILES): Added mix.c and wa.c
* wa.c: New file.
* mix.c: New file.
1998-03-23 Kamal Nigam <knigam@yawp.jprc.com>
* em.c (em_no_reset_doc_types): New variable for resetting option.
* split.c (bow_test_split_for_em_binary): Function removed.
(bow_test_split3): New function.
(bow_test_split_for_em_multihump): Function removed.
* em.c (bow_em_binary_model_counts): Code for fancy counts.
(bow_em_fancy_counts): Likewise.
(em_options): Likewise, and for new option --em-no-reset-doc-types.
(em_parse_opt): Likewise, and for specifying number of multiple
negative humps.
(bow_em_new_vpc_with_weights): Changed array lengths to static for
debugging sanity. Added code for fancy counts, multiple neg
humps, and not resetting doc types. Fixed initial assert to be
more allowable.
(bow_em_score): Conditional rescaling so binary scoring works
correctly.
Mon Mar 23 09:57:11 1998 Andrew McCallum <mccallum@jprc.com>
* rainbow.c (rainbow_test): Assert that only one trial was
requested, because there is no point in testing the training
documents twice, and also, otherwise the EM method will mess up.
* rainbow.c (rainbow_options): New command-line option
"test-on-training".
(rainbow_parse_opt): Handle it.
(rainbow_test): Implement it.
(struct rainbow_arg_state): New element TEST_ON_TRAINING.
(main): Initialize it.
In naivebayes, use the WEIGHT field of DV's instead of the COUNT
field.
* naivebayes.c (bow_naivebayes_pr_wi_ci): Calculate this from the
DV's WEIGHT not from its COUNT.
(bow_naivebayes_set_weights): Use WEIGHT instead of COUNT.
(bow_naivebayes_score): Likewise.
* vpc.c (bow_barrel_new_vpc): Add code to update the
CDOC->WORD_COUNT in the DOC_BARREL in order to match the
(potentially) pruned vocabulary. When building the wi2dvf, add
WEIGHT correctly for the different event models.
* wi2dvf.c (bow_wi2dvf_add_di_text_fp): When incorporating the
word that has just been tokenized, add one to both the COUNT and
the WEIGHT, not just the COUNT. Note that this will change the
way the barrel->wi2dvf is built from data, and barrels tokenized
before this change will now be different from barrels tokenized
afterwards. This change is in keeping with naivebayes' use of the
WEIGHT field instead of the COUNT field, although vpc.c has a fix
to make the old-style wi2dvf's work.
* split.c (bow_tmp_word_struct): Add element WEIGHT.
(bow_test_next_wv): Put WEIGHT into the WV to be returned.
Thu Mar 19 10:45:25 1998 Andrew McCallum <mccallum@jprc.com>
* info_gain.c (bow_infogain_per_wi_new_document_event): Renamed
from bow_infogain_per_wi_new.
(bow_infogain_per_wi_new_word_event): New function.
(bow_infogain_per_wi_new): New function.
Sat Mar 7 11:16:56 1998 Andrew McCallum <mccallum@jprc.com>
* dv.c (_bow_dv_index_for_di): In embedded function
grow_if_necessary(), grow if LENGTH is >= SIZE, not just ==.
Also, add 1 to SIZE before multiplying for growth, so that if SIZE
is 1, it will grow. (Reported by Sean Slattery.) These changes
fix bugs that prevented adding to a barrel read from disk.
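The two fixes to grow_if_necessary() can be illustrated with a small growable vector. The structure, function names, and the 3/2 growth factor are assumptions; only the LENGTH >= SIZE test and the "add 1 before multiplying" step come from the entry.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable vector; LENGTH and SIZE follow the entry's
   naming, everything else is illustrative. */
struct int_vec
{
  int length;   /* entries in use */
  int size;     /* entries allocated */
  int *entry;
};

/* Make room for at least one more entry.  Returns 0 on success,
   -1 if realloc fails. */
int
int_vec_grow_if_necessary (struct int_vec *v)
{
  assert (v != NULL);
  /* Test with >=, not ==, so a vector whose LENGTH has already
     passed SIZE still grows. */
  if (v->length >= v->size)
    {
      /* Add 1 before scaling.  With integer arithmetic and an assumed
         factor of 3/2, a SIZE of 1 would otherwise stay at 1. */
      int new_size = ((v->size + 1) * 3) / 2;
      int *p;

      if (new_size <= v->length)
        new_size = v->length + 1;
      p = realloc (v->entry, (size_t) new_size * sizeof *p);
      if (p == NULL)
        return -1;
      v->entry = p;
      v->size = new_size;
    }
  return 0;
}

int
int_vec_append (struct int_vec *v, int value)
{
  if (int_vec_grow_if_necessary (v) != 0)
    return -1;
  v->entry[v->length++] = value;
  return 0;
}
```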
Fri Mar 6 11:07:32 1998 Andrew McCallum <mccallum@jprc.com>
* naivebayes.c (bow_naivebayes_pr_wi_ci): If using the document
event model, force special kind of smoothing.
(bow_naivebayes_score): Change loop over words to be more readable.
Fix bug in condition for when to use the word count in the query.
Properly handle zero-length documents.
Thu Mar 5 12:30:36 1998 Kamal Nigam <knigam@hurricane.jprc.com>
* int4str.c (bow_int4str_new_from_string_file): switched
reading_numbers back to regular.
Thu Mar 5 11:31:55 1998 Andrew McCallum <mccallum@jprc.com>
Implement the "document" event model, i.e. the multi-variate
Bernoulli.
* naivebayes.c (bow_naivebayes_set_weights): If we are using the
document event model, don't check that the sum of P(w|c) over all
words in a single class equals one, because it isn't true.
(bow_naivebayes_score): Change the inner loop over words. If we're
using the document event model, loop over all words in the
vocabulary instead of just the words in the query. When
incorporating the probability of a word that does not occur in the
query document, make it 1-p. Don't use the count of the word when
using the document event model.
* vpc.c (bow_barrel_new_vpc): Assert that the word count is
non-zero before adding a count of 1 when using the document event
model.
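A minimal sketch of scoring under the document (multi-variate Bernoulli) event model described above. The function and its inputs are hypothetical. The likelihood is kept as a plain product so the example stays short; summing logarithms would be the usual choice for a realistic vocabulary.

```c
#include <assert.h>
#include <stddef.h>

/* Likelihood of one document under one class.
   pr_w[wi]    : P(word wi occurs in a document | class)  (hypothetical input)
   present[wi] : nonzero if word wi occurs in the query document
   Word counts are deliberately ignored; only presence matters. */
double
bernoulli_doc_likelihood (const double *pr_w, const int *present,
                          size_t vocab_size)
{
  double likelihood = 1.0;
  size_t wi;

  assert (pr_w != NULL && present != NULL);
  /* Loop over the whole vocabulary, not just the query's words:
     an absent word contributes 1-p, a present word contributes p. */
  for (wi = 0; wi < vocab_size; wi++)
    likelihood *= present[wi] ? pr_w[wi] : 1.0 - pr_w[wi];
  return likelihood;
}
```

Because absent words contribute a factor too, the per-class word probabilities need not sum to one, which is why the check in bow_naivebayes_set_weights is skipped for this model.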
Thu Mar 5 11:18:16 1998 Kamal Nigam <knigam@hurricane.jprc.com>
* .cvsignore: added kl-div
* active.c (active_selection_type): added vote entropy.
(active_parse_opt): Likewise.
(active_no_final_em): New option for not finishing with an EM round.
(active_options): Likewise.
(active_parse_opt): Likewise.
(active_document_entropy): New function.
(active_select_weighted_kl): Correct weighting metric
(active_select_vote_entropy): New function.
(active_learn): prune vocab and set word counts at each round. add
code for active_no_final_em option. turn off cross entropy for
final model building.
* barrel.c (bow_barrel_printf): fixed array bounds bug.
* em.c (bow_em_new_vpc_with_weights): set word counts after pruning
the vocab. Remove buggy conditions when calculating normalizer.
Mon Mar 2 14:51:32 1998 Andrew McCallum <mccallum@jprc.com>
This version used for the ICML'98 hierarchical classification paper.
* rainbow-h.c (hier_barrel_prob_wi_in_ci_local): Don't assert that
LOCAL_COUNT > 0, just return the uniform distribution if it is not.
(hier_barrel_calc_lambda_one_iteration): Don't assert that LAMBDA must
be greater than 0, (because sometimes EM makes it go to zero);
just print a warning if it is not.
(_hier_barrel_set_node_scores): Goto DO_CHILDREN only if we are
pruning, not always whenever SCORE is -FLT_MAX. Remove
bow_error() call for HIER_STRUCTURE!=HIER_LEAF.
(hier_barrel_score): Initialize the root HBARREL->SCORE to 1.
(main): Print the number of words in each top-level branch, and the
total number of words.
Fri Feb 20 13:04:45 1998 Andrew McCallum <mccallum@jprc.com>
* active.c (active_document_entropy): New function.
(active_select_weighted_kl): Use it to weight document selection
weights by e^(-KLdiv).
Thu Feb 19 11:46:14 1998 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c (rainbowh_options): new command-line option
"scoring-pruning".
(struct rainbowh_arg_state): New element SCORING_PRUNING_NUM_CHILDREN.
(rainbowh_parse_opt): Handle it.
(hier_barrel_prob_wi_in_ci): Raise error if we try to use wittenbell
smoothing.
(_hier_barrel_set_node_scores): Pay attention to
SCORING_PRUNING_NUM_CHILDREN.
(main): Initialize SCORING_PRUNING_NUM_CHILDREN.
Fri Feb 20 12:45:30 1998 Kamal Nigam <knigam@server7.jprc.com>
* bow/em.h (em_cross_entropy): made extern
* em.c (em_cross_entropy): made non-static
* active.c (active_selection_type): added new type length.
(active_parse_opt): added new length selection method.
(active_select_qbc): reverted some changes lost from bad conflict
resolution.
(active_select_weighted_kl): fixed bugs to make run.
(active_select_length): New function.
(active_learn): create a dummy vpc barrel before trying to set word
counts. turn off crossentropy for generating test stats.
Wed Feb 18 20:36:38 1998 Andrew McCallum <mccallum@jprc.com>
* em.c (em_cross_entropy): Renamed from em_crossentropy.
(bow_em_score): Make it do cross entropy if above var is non-zero.
Wed Feb 18 20:08:03 1998 Kamal Nigam <knigam@server7.jprc.com>
* active.c (active_parse_opt): new option for new selection method
wkl
* em.c (em_parse_opt): new option for crossentropy scoring
(em_crossentropy): Likewise.
Wed Feb 18 20:24:57 1998 Andrew McCallum <mccallum@jprc.com>
* naivebayes.c (naivebayes_cross_entropy): New static variable,
turned off by default.
(bow_naivebayes_score): Obey it.
* active.c (active_select_weighted_kl): New function.
Wed Feb 18 20:08:03 1998 Kamal Nigam <knigam@server7.jprc.com>
* bow/libbow.h (bow_smoothing): added goodturing.
(bow_cdoc_yes): prototype for new function.
(bow_smoothing_goodturing_k): new global for options.
* bow/em.h (bow_em_pr_struct): New from em.c
* em.c (bow_em_pr_struct): moved to header file.
* naivebayes.c (bow_naivebayes_initialize_goodturing): new
function.
(bow_naivebayes_goodturing_discounts): new variable.
(bow_naivebayes_goodturing_barrel): new variable.
(bow_naivebayes_pr_wi_ci): added option for goodturing smoothing
(bow_naivebayes_set_weights): Likewise.
* split.c (bow_cdoc_yes): new function.
* opts.c (bow_options): added option smoothing-goodturing-k and
added to smoothing-method option.
(parse_bow_opt): Likewise.
* Makefile.in (LIBBOW_C_FILES): add goodturing.c
* active.c (active_select_relevant): fixed so it picks most
relevant docs instead of least relevant.
(active_remap_scores): New function.
(active_parse_opt): New options for remapping scores to probabilities.
(active_pr_print_stat_summary): Likewise.
(active_pr_window_size): Likewise.
(active_remap_scores_pr): Likewise.
(active_learn): add word count calculation. Call remap for new option.
Wed Feb 18 14:04:00 1998 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c (hier_barrel_test): Handle case where
RAINBOWH_ARG_STATE.VOCAB_SIZE is zero. Now we can do tests with
the full, unpruned vocabulary.
Tue Feb 17 17:05:54 1998 Andrew McCallum <mccallum@jprc.com>
* info_gain.c (bow_infogain_per_wi_new): Use correct type in RET
allocation.
Mon Feb 16 17:11:21 1998 Andrew McCallum <mccallum@jprc.com>
* rainbow.c (rainbow_options): New command-line option
"testing-files-use-basename".
(rainbow_parse_opt): Handle it.
(rainbow_test): Remove old document iteration code.
* split.c (bow_test_set_files): Obey
BOW_TEST_SET_FILES_USE_BASENAME.
(BASENAME_ONLY): Macro removed.
* opts.c (bow_test_set_files_use_basename): New global variable.
* bow/libbow.h: Declare new global variable.
Sat Feb 14 09:54:33 1998 Andrew McCallum <mccallum@jprc.com>
This version used for most ICML results.
* rainbow-h.c (hier_barrel): New element CLASS_WORD_COUNT.
(str_match_len): Move it out of hier_barrel_filename_to_classname.
(hier_barrel_for_filename): New function.
(hier_barrel_set_class_word_count): New function.
(hier_barrel_wi_count): New function.
(EXTRA_CHECKS): New macro.
(hier_barrel_prob_wi_in_ci_local): New argument LOO_G1CHILD_CI. This
allows you to leave out a grandchild from the stats.
(hier_barrel_prob_wi_in_ci_shrink): Likewise.
(hier_barrel_prob_wi_local): New argument LOO_CHILD_CI. This allows
you to leave out a child from the stats.
(hier_barrel_prob_wi_shrink): Likewise.
(LOO_CI_IN_PARENT): New macro.
(LOO_PARENT_CI_IN_G1PARENT): New macro.
(hier_barrel_calc_lambda_one_iteration): Use New arguments to prob
functions.
Thu Feb 12 16:12:24 1998 Kamal Nigam <knigam@server7.jprc.com>
* bow/em.h (em_perturb_starting_point_by_variance): Made
externally viewable.
(bow_em_num_em_runs): Likewise.
* rainbow.c (rainbow_test): Added warning message if going into
buggy code.
* em.c (em_perturb_starting_point_by_variance): Made non-static so
active.c can change its value.
(bow_em_num_em_runs): Likewise.
(bow_method_em): Added element for bow_barrel_free.
* active.c (active_final_em): new global for new option.
(active_print_committee_matrices): Likewise.
(active_qbc_low_kl): Likewise.
(active_options): added three new options.
(active_parse_opt): Likewise.
(active_select_qbc): Fixed for roundoff error. Added code for new
option.
(active_learn): Print the names of initial docs.
Code for new options. Change all non-test docs back to model at
end.
(active_test): Change to use new heap routine that gets all docs.
Wed Feb 11 10:05:34 1998 Andrew McCallum <mccallum@jprc.com>
* bow/libbow.h (bow_free_barrel): Remove check for NULL function,
so that the absence of this function will be noticed. Previously
we had not noticed that em's barrel free function was NULL, and
were getting major memory leaks.
Tue Feb 10 18:26:53 1998 Andrew McCallum <mccallum@jprc.com>
* rainbow.c (main): Fix typos.
Make --prune-vocab-by-occur-count and --prune-vocab-by-doc-count
work in conjunction with --print-barrel.
* rainbow.c (main): Obey options -O and -D before doing things
that don't update the vocab on their own.
* wi2dvf.c (bow_wi2dvf_hide_words_by_occur_count): New function.
* int4word.c (bow_words_occurrences_for_wi): Just return 0 when WI
is too large, instead of raising an assertion.
Mon Feb 9 18:00:40 1998 Andrew McCallum <mccallum@jprc.com>
* vpc.c (bow_barrel_new_vpc): Assert that the cdoc's from the
DOC_BARREL are at least as large as necessary. This can fail when
reading from a disk archive that was created before CLASS_PROBS
was added to BOW_CDOC.
* rainbow.c: Initialize DOC_CDOC, when using bow_test_next_wv()!
* normalize.c (bow_wv_normalize_weights_by_vector_length): Handle
an empty WV.
(bow_wv_normalize_weights_by_summing): Likewise.
* naivebayes.c (naivebayes_normalize_log): New static variable.
(naivebayes_rescale_scores): New static variable.
(naivebayes_final_rescale_scores): New static variable.
(naivebayes_options): New command-line option
"naivebayes-normalize-log".
(naivebayes_parse_opt): Handle it.
(bow_naivebayes_pr_wi_ci): Fix Witten-Bell. It should now be working
properly.
(NO_SCALING): Macro removed, replaced by NAIVEBAYES_RESCALE_SCORES.
(bow_naivebayes_score): Calculate the entropy of words in the
document, H_W_D, although it is currently unused. Do scaling or
not depending on new static variable. Handle
NAIVEBAYES_NORMALIZE_LOG: when scoring, instead of using exp(),
return -1/(log(P(C|d)^3)) rescaled so highest is -2, normalized to
sum to one instead of P(C|d). This results in values that are not
so close to zero and one.
* em.c (em_options): New command-line option
"em-score-normalize-log".
(em_parse_opt): Handle it.
(bow_em_score): Obey it. When scoring, return -1/(log(P(C|d)^3))
(where the scores have been rescaled so that the highest value is -2),
normalized to sum to one instead of P(C|d). This results in
values that are not so close to zero and one.
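One possible reading of the "normalize-log" scoring described above, as a sketch. It assumes the inputs are per-class log-probabilities, that "rescaled so highest is -2" means an additive shift in log space, and that the cube applies to the shifted log value. None of these details are confirmed by the entries, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* LOG_SCORE[ci] holds a log-probability per class (assumed input).
   On return OUT[ci] holds -1/(s^3), where s is the log-score shifted
   so that the best class has s == -2, normalized to sum to one. */
void
normalize_log_scores (const double *log_score, double *out, size_t n)
{
  double max, total = 0.0;
  size_t ci;

  assert (log_score != NULL && out != NULL && n > 0);
  max = log_score[0];
  for (ci = 1; ci < n; ci++)
    if (log_score[ci] > max)
      max = log_score[ci];

  for (ci = 0; ci < n; ci++)
    {
      /* s <= -2 for every class, so s^3 is negative and the
         resulting value lies in (0, 1/8]. */
      double s = log_score[ci] - max - 2.0;
      out[ci] = -1.0 / (s * s * s);
      total += out[ci];
    }
  for (ci = 0; ci < n; ci++)
    out[ci] /= total;
}
```

Compared with exponentiating the log-scores, this mapping decays polynomially rather than exponentially, so runner-up classes keep visibly nonzero values.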
Mon Feb 9 17:52:18 1998 Kamal Nigam <knigam@server7.jprc.com>
* em.c (em_score_normalize_log): new option.
(bow_em_perturb_weights): fixed bug for calculating perturbation.
(bow_em_score): code for new normalization option
Sat Feb 7 12:41:39 1998 Andrew McCallum <mccallum@jprc.com>
* active.c (active_select_qbc): Set WEIGHT to average
KL-divergence-to-the-mean, not average entropy.
1998-02-07 Kamal Nigam <knigam@tsunami.jprc.com>
* active.c (active_options): added active-committee-size
(active_select_qbc): checked for p(w_c) = 0
* em.c (bow_em_perturb_weights): fix looping and calculation bugs
(bow_em_new_vpc_with_weights): skip E-step on last iteration
* active.c (active_selection_type): changes for committee learning
(active_learn): made changes to support committee learning
Fri Feb 6 10:36:19 1998 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c (hier_barrel_correct_cdoc_prior): Check prior
consistency more carefully.
(HBARREL_LAMBDA_PARENT): New macro.
(hier_barrel_calc_lambda_one_iteration): Increment with COUNT, not ++!
When printing P(w|c) for Agriculture/Org, print the count. Use
new macro; particularly when calculating LOG_PROB_OF_DATA where
the old version did not use the correct value for lambda parent.
Use COUNT in calculation of LOG_PROB_OF_DATA.
(hier_barrel_calc_lambda): Print ">" when LOG_PROB_OF_DATA goes down
in order to help debugging.
(hier_barrel_calc_lambdas): Get NUM_WORDS in the class, then print it,
so we can see it with the EM iteration numbers.
* vpc.c (bow_barrel_set_vpc_priors_by_counting): Print warning if
class has zero prior.
* em.c (em_options): New command-line option
"em-perturb-starting-point".
(em_parse_opt): Handle it.
(random_double): New static function.
(bow_em_gaussian): New function.
(bow_em_perturb_weights): New function.
(bow_em_new_vpc_with_weights): Use it.
* active.c (active_select_qbc): Added assertion.
1998-02-06 Kamal Nigam <knigam@tsunami.jprc.com>
* active.c: changes for compilation bugs
* bow/libbow.h (bow_test_set_files): added prototype
* split.c (bow_random_set_seed): new function.
(bow_test_split): Uses bow_random_set_seed.
(bow_test_split2): Likewise.
(bow_test_split_for_em_binary): Likewise.
(bow_test_split_for_em_multihump): Likewise.
(bow_ignore_split): Likewise.
Fri Feb 6 10:36:19 1998 Andrew McCallum <mccallum@jprc.com>
* active.c (active_commitee_size): New static variable.
(active_options): New command-line option "committee-size".
(active_parse_opt): Handle it.
(active_select_qbc): New function.
(active_select_uncertain): Add new COMMITTEE_SIZE argument.
(active_select_relevant): Likewise.
(active_select_random): Likewise.
1998-02-06 Kamal Nigam <knigam@tsunami.jprc.com>
* em.c (em_parse_opt): added check for --em-num-iterations
* bow/libbow.h (bow_test_split2): new prototype for removed option
em_seed
(bow_split_seed): new option.
* split.c (already_seeded): new variable to ensure only one
seeding
(bow_test_split): code for already_seeded and new option --split-seed
(bow_ignore_split): Likewise.
(bow_test_split2): Likewise.
(bow_test_split_for_em_binary): Likewise.
(bow_test_split_for_em_multihump): Likewise.
* rainbow.c (rainbow_test): moved print statement for run header
for active learning
* opts.c (bow_split_seed): new variable for new option
(bow_options): code for new option --split-seed
(parse_bow_opt): likewise
* em.c (em_parse_opt): added new option --em-no-splitting and
removed option --em-split-seed
(bow_em_new_vpc_with_weights): added code for active learning to be
compatible. removed em-split-seed option (superseded by a bow
option) and added --em-no-splitting option
* Makefile.in (METHOD_C_FILES): added active.c
* active.c: new file.
Wed Feb 4 21:18:54 1998 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c (rainbowh_options): New command-line option
"print-lambdas-to-file".
(rainbowh_parse_opt): Handle it.
(struct rainbowh_arg_state): New member LAMBDA_FP.
(hier_barrel_calc_lambda_one_iteration): Change all floats to doubles.
When running through word data in E-step, explicitly skip words
that are not in the vocabulary, even though it shouldn't be
necessary, since bow_model_next_wv() should not include them
anyway. Add some assertions and debugging verbosity. Handle new
command-line option.
(hier_barrel_calc_lambda): Base stopping condition on the sum of
diff's across all lambda's, not just the child lambda.
(hier_barrel_calc_lambdas): Handle LAMBDA_FP.
(main): Likewise.
* naivebayes.c (naivebayes_parse_opt): Make command-line option
"naivebayes-m-est-m" work.
(bow_naivebayes_pr_wi_ci): Likewise. Try to fix Witten-Bell, but
still not working. Use doubles instead of floats.
(bow_naivebayes_set_weights): Use bow_naivebayes_pr_wi_ci().
Mon Feb 2 18:49:02 1998 Andrew McCallum <mccallum@jprc.com>
* bow/libbow.h: Declare new functions and global variables.
(bow_event_models): New type.
(bow_dv_heap): New members HEAP_WV, HEAP_WV_DI and LAST_DI, for use
with new function bow_heap_next_wv().
* heap.c (bow_make_dv_heap_from_wi2dvf): Initialize new members of
bow_dv_heap structure.
* naivebayes.c (NO_SCALING): New macro to turn off rescaling of
scores. Currently set so that Scaling is used.
(bow_naivebayes_score): Handle "document event model" BOW_EVENT_MODEL.
* opts.c (bow_event_model): New global variable.
(bow_options): New command-line option "event-model".
(parse_bow_opt): Handle it.
* rainbow-h.c (rainbowh_options): New option "print-doc-names".
(hier_barrel_correct_cdoc_prior_recurse): New function.
(hier_barrel_total_of_cdoc_priors): New function.
(hier_barrel_correct_cdoc_prior): New function.
(hier_barrel_prune_words_not_in_map_and_set_vpc): Call new function
hier_barrel_correct_cdoc_prior() at end.
(hier_barrel_set_local_class_model): Raise an error if
BOW_PRUNE_VOCAB_BY_INFOGAIN_N.
(hier_barrel_set_vpc_and_populate_lower_branches): Use new function
bow_heap_next_wv().
(hier_barrel_set_vpc_and_populate_lower_branches): Call new function
hier_barrel_correct_cdoc_prior() at end.
(IMPOSSIBLE_SCORE_FOR_ZERO_CLASS_PRIOR): New macro.
(_hier_barrel_local_score): Use it. Empty documents now allowed!
Move the initialization of NUM_WORDS_IN_QUERY_WV down after
setting of priors, because sometimes the document is empty anyway.
(_hier_barrel_set_node_scores): Declare PARENT_SCORE double instead of
float.
(hier_barrel_test): Use hacked-up for-loop to make sure we don't skip
empty documents. Print scores with precision based on
BOW_SCORE_PRINT_PRECISION.
(main): Handle new command line option.
* rainbow.c (rainbow_parse_opt) [USE_VOCAB_IN_FILE_KEY,
HIDE_VOCAB_IN_FILE_KEY]: Use bow_int4str_new_from_text_file(), not
bow_int4str_new_from_text_string(). This makes the rainbow-h
infogain files usable in this context.
(rainbow_test): When determining when to alloca() HITS, use a
condition on HITS, not TN, because TN is set past 0 when
--testing-files is used! Use new function bow_heap_next_wv() for
looping through all test documents. Only free EMPTY_WV if it is
non-NULL.
* split.c (BASENAME_ONLY): Changed from 1 to 0. Now when using
--testing-files, we compare the entire filename path, not just the
basename. Warning: The AAAI'98 Reuters experiments relied on the
old setting.
(bow_heap_next_wv_guts): New function.
(bow_heap_next_wv): New function.
(bow_cdoc_is_model): New function.
(bow_cdoc_is_test): New function.
* vpc.c (bow_barrel_new_vpc): Count the number of documents
properly by running through list of documents, not by looking at
DV's! Handle the event=document bow_event_model. Verbosify the
number of documents in each class.
(bow_barrel_set_vpc_priors_by_counting): Change PRIOR_SUM from float
to double.
* TODO: Append email from Doug.
* evi.c (_register_method_evi): Add new argument to
bow_register_with_name().
Fri Jan 30 10:49:28 1998 Andrew McCallum <mccallum@jprc.com>
* methods.c (bow_method_register_with_name): New argument CHILD
allows the automatic setting of the GROUP index in the options, so
that the help message for options will get grouped together at the
bottom of the help message.
* knn.c (_register_method_knn): Add new argument to
bow_method_register_with_name().
* kl.c (_register_method_kl): Likewise.
* em.c (_register_method_em): Likewise.
* tfidf.c (_register_method_tfidf_): Likewise.
* prind.c (_register_method_prind): Likewise.
* naivebayes.c (_register_method_naivebayes): Likewise.
* bow/libbow.h: Add new argument to
bow_method_register_with_name().
* configure.in: Check for alloca.h, needed on DEC alpha machines.
Suggested by Jason Rennie.
Wed Jan 28 17:45:34 1998 Andrew McCallum <mccallum@jprc.com>
* wi2dvf.c (bow_wi2dvf_hide_words_by_doc_count): Return
immediately if COUNT is 0.
* rainbow.c (rainbow_options): New command-line option
"hide-vocab-indices-in-file".
(rainbow_parse_opt): Handle it.
(struct rainbow_arg_state): New member HIDE_VOCAB_INDICES_FILENAME.
(rainbow_test): Implement new option. Print scores even for empty
documents. They were skipped before.
(main): Implement new option.
* opts.c (bow_prune_words_by_doc_count_n): New global variable.
(bow_options): New command-line option "prune-vocab-by-doc-count".
(parse_bow_opt): Handle it.
* bow/libbow.h: Declare new command-line option global variable.
* int4word.c (bow_words_keep_top_by_infogain): Use qsort() instead
of previous N^2 implementation.
* int4str.c (bow_int4str_new_from_string_file): If the first word
is all numbers, interpret the strings as word indices instead
of strings.
* barrel.c (bow_barrel_printf): Implement new format indicator,
'I'. Print in format used by Mehran Sahami's feat-sel program.
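The int4word.c change in this entry replaces an N^2 selection with qsort(). A minimal sketch of that idea follows, assuming a hypothetical `struct wi_infogain` pair and helper names that are not taken from libbow; the real bow_words_keep_top_by_infogain() may differ in its data layout and tie handling.

```c
#include <stdlib.h>

/* Hypothetical pair of word index and information gain; this is an
   illustration, not the structure libbow uses. */
struct wi_infogain
{
  int wi;
  double infogain;
};

/* Order by decreasing infogain; break ties by increasing word index
   so the result is deterministic. */
static int
compare_infogain_desc (const void *a, const void *b)
{
  const struct wi_infogain *x = a;
  const struct wi_infogain *y = b;

  if (x->infogain > y->infogain)
    return -1;
  if (x->infogain < y->infogain)
    return 1;
  return (x->wi > y->wi) - (x->wi < y->wi);
}

/* Sort ENTRIES in place and return how many are kept, namely the
   first min (N, NUM_ENTRIES) entries after sorting. */
static size_t
keep_top_by_infogain (struct wi_infogain *entries, size_t num_entries,
                      size_t n)
{
  qsort (entries, num_entries, sizeof (entries[0]), compare_infogain_desc);
  return n < num_entries ? n : num_entries;
}
```

Sorting once and taking a prefix costs O(N log N) comparisons on typical qsort() implementations, instead of rescanning the remaining words for every word kept.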
Wed Jan 21 17:39:12 1998 Andrew McCallum <mccallum@jprc.com>
This version used for AAAI-98 submission. CVS tagged with
`aaai98-submission'.
* rainbow.c (main): Move rainbow_word_count_printing below "Do
things that require the vocabulary or class/word weights to have
been updated" so that the class barrel will be re-made and
up-to-date. Temporarily, call
bow_naivebayes_print_odds_ratio_for_class() for implementing the
--weight-vector option.
* naivebayes.c (bow_naivebayes_print_odds_ratio_for_class): New
function.
(bow_naivebayes_set_weights): Set IDF to P(w).
Wed Jan 21 17:39:53 1998 Kamal Nigam <knigam@hurricane.jprc.com>
* rainbow.c (rainbow_test): changes to support --em-multi-hump-neg
* em.c (bow_em_new_vpc_with_weights): added support for new options
--em-multi-hump-neg and --em-even-unlabeled-priors
(bow_em_score): Likewise.
* split.c (bow_test_split_for_em_multihump): New function.
Thu Jan 22 16:08:13 1998 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c (hier_barrel_prune_words_not_in_map_and_set_vpc): No
longer unhiding the vocabulary at the end of the function. Yipes.
Is this OK?
(hier_barrel_set_local_class_model): Likewise.
(_hier_barrel_local_score): No longer zero NUM_WORDS_IN_QUERY_WV at
declaration time; do it just before the loop that counts them. In
the loop, before incrementing for a particular word, check against
the RAINBOWH_ARG_STATE.VOCAB_MAP, not the
HBARREL->CLASS_BARREL->WI2DVF!
(hier_barrel_test): Expand comment about the loop over test documents
skipping documents that don't have any words in the vocabulary.
* wicoo.c (bow_barrel_shrink_wv): New function. Currently empty.
Sat Jan 17 12:27:41 1998 Andrew McCallum <mccallum@jprc.com>
* em.c (bow_em_new_vpc_with_weights): Only free BEST_WI2DVF under
correct conditions. This fixes bug whereby rainbow produces no
output.
Fri Jan 16 09:51:46 1998 Andrew McCallum <mccallum@jprc.com>
* rainbow.c (rainbow_options): New command-line option
"hide-vocab-in-file".
(rainbow_parse_opt): Handle it.
(struct rainbow_arg_state): New element hide_vocab_map.
(rainbow_query): Handle --hide-vocab-in-file, and, importantly, put
bow_keep_top_words_by_infogain above the bow_barrel_prune_vocab_
functions so that they will actually have an effect! Remember,
bow_keep_top_words_by_infogain begins by unhiding all words!
(rainbow_lisp_setup): Initialize rainbow_arg_state.hide_vocab_map.
(rainbow_test): Put the bow_barrel_prune_vocab_ function calls inside
the trial for()-loop, after the bow_keep_top_words_by_infogain,
for the same reason as above!
(main): Handle --hide-vocab-in-file, and, importantly, put
bow_keep_top_words_by_infogain above the bow_barrel_prune_vocab_
functions so that they will actually have an effect!
* naivebayes.c (bow_naivebayes_pr_wi_ci): In wittenbell, use
LOO-adjusted class word count instead of CDOC->NUM_WORDS.
* barrel.c (bow_barrel_prune_words_in_map): New function.
* bow/libbow.h: Declare new function.
* split.c (bow_basename): Fix so it will strip the leading `/'.
* opts.c (bow_smoothing_method): New global variable.
(bow_options): New command line option "smoothing-method";
currently takes "laplace", "mestimate" and "wittenbell".
"mestimate" is not yet implemented.
(parse_bow_opt): Handle it.
* bow/libbow.h (bow_smoothing): New enum typedef.
(bow_smoothing_method): Declare new global variable set by opts.c.
* naivebayes.c (bow_naivebayes_pr_wi_ci): Use different smoothing
methods depending on BOW_SMOOTHING_METHOD; laplace and wittenbell
implemented.
* kl.c (bow_kl_score): Previously used wittenbell smoothing all
the time. Now use wittenbell only if indicated on command-line
with --smoothing-method, otherwise use laplace.
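For readers unfamiliar with the "laplace" setting named in this entry, the textbook add-one estimate can be sketched as below. The function name and arguments are hypothetical, and the sketch is not claimed to match what bow_naivebayes_pr_wi_ci() computes internally (leave-one-out adjustments, for example, are omitted).

```c
/* Textbook Laplace (add-one) estimate of P(w|c).
   WORD_COUNT is the number of occurrences of word w in class c,
   CLASS_WORD_COUNT is the total number of word occurrences in class c,
   VOCAB_SIZE is the number of distinct words and must be positive.
   All names here are illustrative assumptions. */
static double
laplace_pr_w_c (double word_count, double class_word_count,
                double vocab_size)
{
  return (word_count + 1.0) / (class_word_count + vocab_size);
}
```

With this estimate a word never seen in a class still gets a nonzero probability, and a class with no words at all falls back to the uniform distribution 1/VOCAB_SIZE.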
Thu Jan 15 11:47:00 1998 Andrew McCallum <mccallum@jprc.com>
* vpc.c (bow_barrel_new_vpc): Remove assertion that all classes
have some words in their training data after vocabulary pruning.
(bow_barrel_set_vpc_priors_by_counting): When calculating class
priors, don't count documents that are not in the MODEL set!
* rainbow.c (rainbow_test): Fix typo.
* naivebayes.c (bow_naivebayes_score): If a class has a zero class
prior, set the score to a special flagged value, and then when
returning scores in the array, make those scores be -DBL_MAX.
Now, for the first time, naivebayes can handle classes that have no
training documents in them.
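The zero-prior handling described for bow_naivebayes_score() can be pictured with the following sketch. It assumes scores are kept as log values in a plain array, and it collapses the two steps in the entry (flagging, then rewriting on return) into one pass; the function name and arguments are hypothetical.

```c
#include <float.h>

/* For every class whose prior is exactly zero, overwrite its log
   score with -DBL_MAX so that it ranks below every class that has
   training documents.  Illustrative only; not the libbow code. */
static void
flag_zero_prior_scores (const double *priors, double *log_scores,
                        int num_classes)
{
  int ci;

  for (ci = 0; ci < num_classes; ci++)
    if (priors[ci] == 0.0)
      log_scores[ci] = -DBL_MAX;
}
```

Using a finite sentinel rather than taking the logarithm of zero keeps later arithmetic and sorting on the score array well defined.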
Thu Jan 15 14:15:22 1998 Kamal Nigam <knigam@hurricane.jprc.com>
* em.c (bow_em_new_vpc_with_weights): Fixed two memory leaks.
Thu Jan 15 11:47:00 1998 Andrew McCallum <mccallum@jprc.com>
* bow/libbow.h (bow_score_print_precision): New variable extern
declaration.
* kl.c (bow_kl_score): Assert QUERY_WORD_COUNT to make sure we're
not trying to classify an empty document.
* em.c (bow_em_compare_to_nb): Use BOW_SCORE_PRINT_PRECISION when
printing scores.
* bow/libbow.h (bow_stoplist_replace_with_file): New function
declaration.
Thu Jan 15 11:39:44 1998 Kamal Nigam <knigam@hurricane.jprc.com>
* opts.c (bow_score_print_precision): New option, defaults to 10.
(parse_bow_opt): Support for new option.
* rainbow.c (rainbow_query): Print precision obeys new option.
(rainbow_test): Likewise.
(rainbow_test_files): Likewise.
* naivebayes.c (naivebayes_binary_scoring): New option.
(naivebayes_parse_opt): Support for new option.
(bow_naivebayes_score): Support for new option. If binary scoring,
don't move scores out of log space or normalize them. This
alleviates the 0-1 problem, at the cost of losing probabilities.
* bow/libbow.h (bow_score): Changed weight from float to double.
(bow_test_split2): Added prototype.
* split.c (bow_ignore_split): New function.
(bow_test_split2): Allows there to be too few documents to grab, and
just grabs all of them if that's its only choice. Takes an
optional random seed for seeding the random number generator.
(bow_test_split_for_em_binary): New function.
* em.c (use_priors_for_initial_class_probs): flipped value from 0
to 1
(unlabeled_normalizer): moved from predefined value to an option
(bow_em_num_em_runs): Likewise
(bow_em_print_word_vector): Likewise
(bow_em_making_barrel): if doing binary scoring, still use
probabilities for rounds of EM
(EM_LABELED_NUM_BY_CLASS): changed name from EM_UNLABELED_NUM_BY_CLASS
(em_binary_pos_classname): added option for binary case
(em_binary_neg_classname): added option for binary case
(bow_em_unlabeled_num): added option for specifying number of
unlabeled docs
(bow_em_print_probs): added option to print word probabilities for
kl-div across runs
(bow_em_binary_case): added option for binary classification
(bow_em_split_seed): added random seed option to facilitate paired
trials
(em_parse_opt): added support for new options
(bow_em_new_vpc_with_weights): Added support for binary
classification, although it's commented out now. Makes only
initial labeled docs be all positive ones and sets priors of
unlabeled docs to be all negative. Added support for seeding the
random number generator for labeled docs so we can have paired
trials. Added support for specifying the number of unlabeled
docs. This requires a new doc type, so conditions on doc types
have changed slightly. If pruning vocab by info gain, we need to
do this after the labeled docs are selected instead of before the
barrel is built, so this logic is added in here. Allow number of
labeled docs to be zero. In this case, we need to be careful
about calculating priors with no docs, and also need to initialize
the class vectors randomly. Added condition around printing class
probabilities with new option. Only score docs for score to prob
mapping if it's needed by the method. Make naive bayes scoring
respect the unlabeled_normalizer.
(bow_em_set_priors_using_class_probs): made condition changes for new
doc type.
(bow_em_score): Change the scoring if doing binary classification.
Don't convert back from log space and don't normalize. We lose
probabilities, but it avoids the 0-1 problem, and we can still
rank the predictions.
Tue Jan 13 14:41:55 1998 Andrew McCallum <mccallum@jprc.com>
* rainbow.c (rainbow_query): Print scores with precision %.20g
instead of default 6.
(rainbow_test): Likewise.
(main): When printing top words by infogain with -I, if
--testing-files was on the command line, then do this test/train
split before calculating the infogains.
Thu Jan 8 12:56:22 1998 Andrew McCallum <mccallum@jprc.com>
* opts.c (bow_options): New command-line option
"replace-stoplist-file".
(parse_bow_opt): Handle it.
* stoplist.c (bow_stoplist_replace_with_file): New function.
Wed Jan 7 17:14:59 1998 Andrew McCallum <mccallum@jprc.com>
* bow/libbow.h (bow_doc_type): Added types UNUSED_MODEL and
UNUSED_IGNORE.
Tue Jan 6 16:39:30 1998 Andrew McCallum <mccallum@jprc.com>
* em.c (bow_em_new_vpc_with_weights): When printing P(C|w)
distributions, initialize TOTAL_WORD_COUNT inside the loop over
words!
* int4word.c (bow_int2word): If INDEX is larger than the number of
words in the underlying word map, return NULL.
* em.c (bow_em_new_vpc_with_weights): After the M-step, before the
E-step, print out the P(C|w) distribution for all words to a file
named pcw01, where "01" is the EM-iteration number.
* int4word.c (bow_word2int_no_add): New function.
* bow/libbow.h: Declare new function.
Sun Dec 21 20:36:08 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow.c (rainbow_options): Document --print-barrel's FORMAT
better.
* Makefile.in (DIST_FILES): Added $(METHOD_C_FILES).
* rainbow.c (rainbow_options): New command-line option
"index-printed-barrel".
(rainbow_parse_opt): Handle it. Add a term to the conditions under
which we complain about "Need data from more than one class"
because when doing --index-printed-barrel the filename is in the
place of a classname, and there is supposed to be only one.
(rainbow_index_printed_barrel): New function.
(main): Handle two different ways of what_doing == rainbow_indexing.
* barrel.c (getline): New function.
(bow_barrel_new_from_printed_barrel_file): New function.
* bow/libbow.h: Declare new function.
Wed Dec 17 14:47:40 1997 Andrew McCallum <mccallum@jprc.com>
* lex-simple.c (bow_not_isspace): New function.
(_bow_white_lexer): Instead of insisting on starting with an
alphabetic character, downcasing, stoplisting, etc., just grab
tokens delimited by whitespace, making no changes to the contents.
* bow/libbow.h: Change comment on BOW_WHITE_LEXER.
(bow_default_lexer_white): New declaration.
* opts.c (bow_options): New command-line option "lex-white".
(parse_bow_opt): Implement it.
* deflexer.c: Add a lexer structure for BOW_DEFAULT_LEXER_WHITE.
* int4str.c (_str2id): Re-written to make sure it never returns 0,
which would cause an infinite loop in _str_hash_add().
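The guarantee described for _str2id() can be sketched as a guard on the final hash value. The hash function below is a generic multiplicative string hash chosen for illustration, not the one in int4str.c, and the reason a zero id stalls _str_hash_add() depends on its probing scheme, which this log does not show.

```c
/* Hypothetical string-to-id hash that never returns 0.
   The multiply-by-31 scheme is an assumption for illustration. */
static unsigned
str2id_nonzero (const char *s)
{
  unsigned h = 0;

  while (*s)
    h = h * 31u + (unsigned char) *s++;
  /* Map the one forbidden value onto a valid one. */
  return h ? h : 1u;
}
```

The empty string, which would otherwise hash to 0 under this scheme, maps to 1 instead.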
Tue Dec 16 12:24:49 1997 Andrew McCallum <mccallum@jprc.com>
* kl-div.c: New file.
* Makefile.in (kl-div): New target.
(OTHER_C_FILES): New variable, includes kl-div.
(DIST_FILES): Add it.
(libbow.a): Include the $(METHOD_O_FILES) in there too, so that simple
executables like kl-div can properly get the default_lexer.
(all): Added kl-div.
(default): Target now just depends on `all'.
Mon Dec 15 14:35:16 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow.c (rainbow_unarchive): Use filename_to_classname() when
initializing CLASSNAMES.
* wicoo.c: New file. Builds a structure for calculating
word co-occurrence statistics.
* Makefile.in (LIBBOW_C_FILES): Added wicoo.c.
* arrow.c: Include headers for socket server implementation.
(arrow_options): New command-line options "query-server" and "print-coo".
(struct arrow_arg_state): New element SERVER_PORT_NUM.
(arrow_parse_opt): Implement new options.
(arrow_query): Take new arguments for in and out FILE*'s. Print
filenames instead of file contents.
(arrow_socket_init): New function.
(arrow_serve): New function.
(arrow_coo): New function.
(main): Call new functions.
* rainbow-h.c (rainbowh_options): New command-line option
"print-word-probabilities".
(rainbowh_parse_opt): For 't', be smarter about warning when this badly
interacts with --use-vocab-in-file. Likewise for
`--use-vocab-in-file'. Implement new option.
(hier_barrel_class_index_of_classname): New function.
(hier_barrel_prob_wi_uniform): New function.
(hier_barrel_prob_wi_in_ci_local): Use it.
(hier_barrel_prob_wi_in_ci_shrink): Likewise.
(hier_barrel_prob_wi_local): Likewise.
(hier_barrel_prob_wi_shrink): Likewise.
(hier_barrel_print_word_probabilities_for_class): New function.
(_hier_barrel_local_score): Avoid counting out-of-vocabulary words
when calculating NUM_WORDS_IN_QUERY_WV.
(hier_barrel_set_lambdas): New function.
(hier_barrel_test): Call it.
(main): No longer avoid handling BOW_PRUNE_VOCAB_BY_OCCUR_COUNT_N.
Handle new options.
* kl.c (bow_kl_score): When setting QUERY_WORD_COUNT be sure to
only count the words that are part of the barrel's vocabulary.
Use QUERY_WORD_COUNT properly for leave-one-out evaluation instead
of only the number of times the word in question occurs.
* naivebayes.c (bow_naivebayes_pr_wi_ci): New function.
(bow_naivebayes_print_word_probabilities_for_class): New function.
(bow_naivebayes_score): Get the total number of words in the query for
doing proper leave-one-out evaluation. Use
bow_naivebayes_pr_wi_ci() instead of calculating it locally.
* dv.c (bow_dv_add_di_count_weight): Verbosify about count
overflow at level BOW_PROGRESS instead of BOW_VERBOSE.
* split.c (BASENAME_ONLY): New macro.
(bow_basename): New function.
(bow_test_set_files): Use it for comparing basename of file only.
Useful for comparisons of hier vs non-hier file lists.
* wi2dvf.c (bow_wi2dvf_remove_wi): Call bow_error() because this
function is currently broken.
(bow_wi2dvf_hide_words_by_doc_count): New function.
(bow_wi2dvf_dv): Use ABS() when examining SEEK_START, to interact
properly with word hiding when words haven't yet been read in.
* vpc.c (bow_barrel_new_vpc): Assert DOC_BARREL->CLASSNAMES.
* rainbow.c (*_KEY): Replace #define's with enum.
(rainbow_options): New option "print-word-probabilities".
"print-barrel" now takes an optional argument.
(struct rainbow_arg_state): New element BARREL_PRINTING_FORMAT.
(rainbow_parse_opt): Handle new options.
(rainbow_unarchive): If RAINBOW_DOC_BARREL->CLASSNAMES is NULL, create
it and fill it. This is necessary when reading old barrels that
didn't archive CLASSNAMES.
(rainbow_socket_init): Reformat.
(rainbow_serve): Likewise.
(rainbow_test_files): In nested function test_file(), make sure the
file passes the bow_fp_is_text() test. Otherwise the counts
are off when doing leave-one-out-evaluation.
(main): Move the barrel printing after the vocabulary reduction, and
call with new argument.
* bow/libbow.h: Declare new function.
(assert): New commented out macro that is useful when the builtin
assert() doesn't allow the debugger to examine the error-full
frame.
* lex-html.c (bow_lexer_html_get_raw_word): Decrement
LEX->DOCUMENT_POSITION before returning when HTML `<' was
unterminated. Otherwise we get errors when we catch the infinite
loop.
* barrel.c (bow_barrel_prune_words_not_in_map): Hide words instead
of removing them.
(bow_barrel_keep_top_words_by_infogain): Set BOW_WORD2INT_DO_NOT_ADD
so that lexing doesn't add more words to the vocabulary. This
fixes a bug in which new words in queries were causing the total
count of the number of words in a query document to be too high
during leave-one-out evaluation.
(bow_barrel_printf_old1): Renamed from bow_barrel_printf.
(bow_barrel_printf): New function that can print in different
formats.
Tue Nov 25 11:10:12 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c (hier_barrel): New element LAMBDA_G1PARENT.
(hier_barrel_new): Initialize it to zero.
(hier_barrel_add_stats): Initialize CDOC->CLASS_PROBS to NULL.
(hier_barrel_add_document): Likewise.
(hier_barrel_set_lambda_from_delta): Initialize LAMBDA_G1PARENT to 0.
(HIER_BARREL_PARENT): New macro.
(HIER_BARREL_G1PARENT): New macro.
(hier_barrel_prob_wi_in_ci_shrink): Use LAMBDA_G1PARENT, get only
local probs from each of them, and mix them.
(hier_barrel_prob_wi_shrink): Likewise.
(hier_barrel_calc_lambda_one_iteration): Use LAMBDA_G1PARENT. Get
only local probs from each of them, and mix them.
(rainbowh_options): New option "print-barrel".
(rainbowh_parse_opt): Handle it.
(hier_barrel_print_barrel): New function.
(main): Call it.
Mon Nov 24 16:36:55 1997 Andrew McCallum <mccallum@jprc.com>
* lex-simple.c (bow_lexer_simple_get_raw_word): Decrement
LEX->DOCUMENT_POSITION before assert(). Otherwise files that end
without a newline seem to cause crashes.
* rainbow.c (BASENAME): Removed macro.
(rainbow_index): Use filename_to_classname() instead of BASENAME.
Mon Nov 24 12:57:52 1997 Kamal Nigam <knigam@cs.cmu.edu>
* em.c, bow/em.h: New files!
* Makefile.in (LIBBOW_H_FILES): Added em.h.
(METHOD_C_FILES): Added em.c.
* vpc.c (bow_barrel_new_vpc): Initialize CLASS_PROBS to NULL.
* split.c (bow_test_split2): New function.
(bow_ignore_next_wv): New function.
(bow_ignored_model_next_wv): New function.
* rainbow.c (rainbow_class_barrel): Initialize to NULL.
* bow/libbow.h: #include <bow/em.h>. Declare new functions.
(bow_cdoc): Add element CLASS_PROBS.
* barrel.c (_bow_barrel_cdoc_free): Free CLASS_PROBS if non-NULL.
(bow_barrel_add_from_text_dir): Initialize CLASS_PROBS to NULL.
(_bow_barrel_cdoc_read): Likewise.
Mon Nov 24 08:45:29 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c (*_KEY): Changed from #define's to enum.
(struct rainbowh_arg_state): New member shinkage_method.
(rainbowh_options): New option "shrinkage-method".
(rainbowh_parse_opt): Implement it.
(hier_barrel): New elements LAMBDA and LAMBDA_UNIFORM.
(CI_IN_PARENT_ROOT): Macro deleted.
(hier_barrel_new): Initialize LAMBDA and LAMBDA_UNIFORM.
(hier_barrel_prune_words_not_in_map_and_set_vpc): Initialize
DOC_BARREL->CLASSNAMES.
(hier_barrel_set_local_class_model): Likewise.
(hier_dir_is_leaf): New code, but commented out with AUTO_LEAF_DETECT.
(_hier_barrel_new_from_text_dir_recurse): Rather than exiting with
error if the directory was found to be empty of subdirectories,
instead call _hier_barrel_new_from_text_dir_leaf(). This means
that you don't have to create "rainbow-hier-leaf" files in the
directories at the leaf of the data directory.
(hier_barrel_prob_wi_in_ci_wittenbell): Change LOO_CLASS from int to
char*.
(hier_barrel_prob_wi_in_ci): Likewise.
(hier_barrel_score): Likewise.
(hier_barrel_set_lambda_from_delta): New function.
(hier_barrel_set_children_lambda_from_delta): New function.
(hier_barrel_calc_delta): Once DELTA is determined, set LAMBDA's to
match.
(hier_barrel_calc_delta_recurse): Set LAMBDA of HBARREL root here.
(last_pr_w_c): Global variable removed.
(old_hier_barrel_prob_wi_in_ci_shrink): Renamed from
hier_barrel_prob_wi_in_ci_shrink().
(LOO_CLASS_IN_PARENT): Macro deleted.
(LOO_CLASS_MATCH): New macro.
(hier_barrel_prob_wi_in_ci_local): New function.
(hier_barrel_prob_wi_in_ci_shrink): New function.
(hier_barrel_prob_wi_local): New function.
(hier_barrel_prob_wi_shrink): New function.
(hier_barrel_calc_lambda_one_iteration): New function.
(hier_barrel_calc_lambda): New function.
(hier_barrel_calc_lambdas): New function.
(check_prob_wi_in_ci): Don't immediately return; actually do the check.
(_hier_barrel_local_score): Some minor clean-ups, and make it do KL
instead of log-naivebayes, by dividing by document length.
(_hier_barrel_set_node_scores): Change condition for doing local
scoring so that if HIER_LEAF, we don't bother evaluating interior
nodes of the tree. Remove some OVERRIDE_LOG_PROBS-commented-out
code. Only call self recursively on HBARREL's with children.
(hier_barrel_print_loo_accuracy): Don't evaluate from root of tree.
Perhaps this should be changed back?
(hier_barrel_print_loo_accuracy_vary_delta): Start at delta/1024
instead of delta/1000. Save the best delta, and set lambda from
it.
(hier_barrel_test): Deal with NULL TEST_FILES_FILENAME. Attend to
SHRINKAGE_METHOD and do whichever was indicated on command-line.
(rainbowh_archive): Write format version to file.
(rainbowh_unarchive): Read it.
(main): Initialize shrinkage method to em. Don't
check_prob_wi_in_ci() here because the LAMBDA's aren't ready.
Sun Nov 23 18:47:50 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow.c (BASENAME): New macro.
(rainbow_index): Use it to initialize BARREL->CLASSNAMES with correct
classnames!
* rainbow.c (rainbow_test): Don't free the heap. It was already
done automatically.
(main): Likewise.
* heap.c (bow_dv_heap_free): Add comment.
* bow/libbow.h (ABS): New macro.
Mon Nov 17 10:09:20 1997 Andrew McCallum <mccallum@jprc.com>
* bow/libbow.h: Include <bow/knn.h>.
(bow_barrel_add_classname): New macro.
* Makefile.in (LIBBOW_C_FILES): Remove deflexer.c.
(METHOD_C_FILES): Add deflexer.c, because it also has a constructor
function that may not get linked in otherwise.
* Makefile.local (rainbow-h): Add METHOD_O_FILES.
* rainbow-h.c (hier_barrel_prune_words_not_in_map_and_set_vpc):
Free and reinitialize HBARREL->DOC_BARREL->CLASSNAMES, instead of
building a local array to be passed to the VPC function.
(hier_barrel_set_local_class_model): Likewise.
(hier_barrel_select_test): Function deleted.
(hier_barrel_free): Don't free the hash; it's been deleted.
(hier_barrel_set_vpc_with_weights): Comment out unused function.
* rainbow-h.c: Remove old functions by Andrew Ng that are no
longer used.
(fix_up_node): Function deleted.
(classify_from_rankings_helper): Likewise.
(classify_from_rankings): Likewise.
(hier_recursive_reping_document): Likewise.
(new_bow_di_to_wv): Likewise.
(hier_recursive_set_rankings): Likewise.
(classify_single_doc): Likewise.
(set_true_class): Likewise.
(hier_recursive_reping_all_documents): Likewise.
(new_hash_ele_t): Likewise.
(free_hash_ele_t): Likewise.
(hashit): Likewise.
(create_hash): Likewise.
(find_di_from_fn): Likewise.
(find_child_di): Likewise.
(find_di): Likewise.
(free_hash): Likewise.
(hier_barrel_single_pass_set_vpc_with_weights): Likewise.
(classify_document): Likewise.
(is_descendant_of): Likewise.
(set_niece_pointer): Likewise.
(fix_up_is_niece_field): Likewise.
(fix_up_niece_pointers): Likewise.
(init_ignore_structure): Likewise.
(set_all_model): Likewise.
* rainbow.c (rainbow_classnames): Deleted global variable.
(rainbow_unarchive): Remove code for initializing RAINBOW_CLASSNAMES.
(main): New local variable RAINBOW_CLASSNAMES.
* vpc.c (bow_barrel_new_vpc): Drop CLASSNAMES and NUM_CLASSES
arguments. Get that information from DOC_BARREL->CLASSNAMES
instead.
(bow_barrel_new_vpc_merge_then_weight): Drop last two args.
(bow_barrel_new_vpc_weight_then_merge): Likewise.
* rainbow.c (rainbow_index): Drop last two arguments to vpc
function.
(rainbow_query): Likewise.
(rainbow_lisp_setup): Likewise.
(rainbow_test): Likewise.
(rainbow_test_files): Likewise.
(main): Likewise.
* rainbow-h.c (hier_barrel_set_vpc_with_weights): Drop last two
arguments to vpc function.
(hier_barrel_set_local_class_model): Likewise.
(hier_barrel_set_vpc_with_weights): Likewise.
* knn.c (bow_knn_classification_barrel): Drop last two arguments.
* bow/libbow.h (bow_method): VPC_WITH_WEIGHTS no longer takes
CLASSNAMES and NUM_CLASSES arguments.
(bow_barrel_new_vpc_with_weights): Likewise.
(bow_barrel_new_vpc): Likewise.
(bow_barrel_new_vpc_merge_then_weight): Likewise.
(bow_barrel_new_vpc_weight_then_merge): Likewise.
* Makefile.in (clean): Remove PERL_RUNNABLE_FILES.
The following changes mostly added with Sean Slattery.
Add KNN method. Add CLASSNAMES element to barrel structure, and
use it to get classnames and the number of classes, rather than
using the BARREL->CDOCS->LENGTH from the class barrel.
* knn.c, bow/knn.h: New files that implement k-NN classification
based on TFIDF weights and cosine-similarity.
* bow/libbow.h (bow_barrel): Add new element CLASSNAMES. It's a
str4int map between classnames and their indices.
(bow_method): Add new element FREE_BARREL. A function to call to free
the barrel.
(bow_free_barrel): New macro.
(bow_barrel_num_classes): New macro.
(bow_barrel_classname_at_index): New macro.
(bow_barrel_add_from_text_dir): Declare type-change in last argument.
(BOW_DEFAULT_FILE_FORMAT_VERSION): Bumped from 5 to 6. Barrels now
write out their new CLASSNAMES map.
* kl.c (bow_method_kl): Add new element bow_barrel_free().
* Makefile.in (LIBBOW_C_FILES): Remove the .c files that implement
methods.
(METHOD_C_FILES): Put them in this new variable.
(METHOD_O_FILES): New variable.
($(DEMO_EXECUTABLES)): Depend on METHOD_O_FILES and explicitly link
them in. Previously we just depended on linking them from the
library, but the executables didn't require any symbols from the
method files, so they would not get linked in unless we added a
kludgy function call that required them.
* barrel.c (bow_barrel_new): Initialize CLASSNAMES to NULL.
(bow_barrel_add_document): Assert that CLASSNAMES is NULL, as a check
to make sure that bow_barrel_add_from_text_dir wasn't used on the
barrels in rainbow-h.
(bow_barrel_add_from_text_dir): Change type of class argument: instead
of specifying which class by integer index, specify by char*
classname. Initialize the CLASSNAMES map if it hasn't been
already.
(bow_barrel_new_from_data_fp): Read CLASSNAMES.
(bow_barrel_write): Write CLASSNAMES.
(bow_barrel_free): Free CLASSNAMES.
* vpc.c (bow_barrel_new_vpc): Create and initialize CLASSNAMES new
element in VPC_BARREL. Use BARREL_NUM_CLASSES to determine size
of VPC_BARREL, not the number of documents in the DOC_BARREL!
* wi2dvf.c (bow_wi2dvf_new_from_data_fp): Add comment about this
not reading all the data.
* prind.c (bow_method_prind): Add new element bow_barrel_free().
* tfidf.c (bow_method_tfidf): Likewise.
* naivebayes.c (bow_method_naivebayes): Likewise.
* rainbow.c: Use bow_free_barrel() instead of bow_barrel_free().
Use bow_barrel_num_classes() instead of looking at the length of
cdocs.
(rainbow_options): Say what the default is for "test-percentage".
(rainbow_index): Use new last arg to bow_barrel_add_from_text_dir().
(rainbow_test): Use bow_barrel_classname_at_index() instead of
filename_to_classname().
(main): Don't manually register methods kl and evi. The proper .o's
now get explicitly linked in by the Makefile.
* robin.c (robin_index): Use new last argument for
bow_barrel_add_from_text_dir().
* arrow.c (arrow_index): Use new last arg to
bow_barrel_add_from_text_dir().
* rainbow-h.c: Use the macro bow_free_barrel() instead of
bow_barrel_free().
(hier_barrel_print_loo_accuracy): New function.
(hier_barrel_print_loo_accuracy_vary_delta): New function.
(hier_barrel_print_loo_accuracy_vary_delta_recurse): New function.
(hier_barrel_test): As a temporary test, instead of printing test
results, just print varying delta LOO results and return.
(rainbowh_hbarrel): Global variable moved up.
(hier_barrel_new_from_text_dir): Use new last argument to
bow_barrel_add_from_text_dir().
(hier_barrel_prob_wi_in_ci_wittenbell): Add new arguments for LOO.
(hier_barrel_prob_wi_in_ci_shrink): Likewise. Use new arguments to
implement LOO classification.
(hier_barrel_prob_wi_in_ci): Add new arguments for LOO.
(check_prob_wi_in_ci): Call above function with new arguments for LOO.
(_hier_barrel_local_score): Likewise.
(PRINT_TREE_SCORES): Change macro value from 1 to 0.
(_hier_barrel_set_node_scores): Add new argument for LOO. Call
bow_barrel_score() with LOO argument.
(hier_barrel_print_scores): Call above function with new args for LOO.
(hier_barrel_score): Likewise, and add new args for LOO.
(hier_barrel_test): Call hier_barrel_score() with LOO argument.
Wed Nov 5 13:47:44 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c (rainbowh_options): Rename "test-files" to
"testing-files" to avoid naming conflict.
* rainbow.c (rainbow_options): Likewise.
Thu Oct 30 11:04:54 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c (hier_barrel_test): Write the infogain-chosen
vocabulary in the local directory, named something like
"infogain2-hV00200" according to the random seed and the
vocabulary size.
* rainbow-h.c (hier_barrel_prob_wi_in_ci_shrink): Change the #elif
to do the "Witten-Bell"-like weighting between parent and child
instead of the "Method of Moments"-style weighting.
(hier_barrel_test): Put two newlines at the beginning of the file so
that even if we are lexing with --skip-headers, we'll still get
the words in the file.
Tue Oct 28 13:50:35 1997 Andrew McCallum <mccallum@jprc.com>
* arrow.c (arrow_query): Complain with message if the query vector
is empty.
* barrel.c (bow_barrel_add_from_text_dir): Fill in the WORD_COUNT
entry in the created CDOC's.
* wi2dvf.c (bow_wi2dvf_add_di_text_fp): Make it return the number
of words in the document, instead of void.
* bow/libbow.h (bow_wi2dvf_add_di_text_fp): Now returns int
instead of void.
Thu Oct 16 09:58:14 1997 Andrew McCallum <mccallum@jprc.com>
Following changes actually made in September:
* rainbow-h.c (USE_VOCAB_SIZE_KEY): New macro.
(TEST_FILES_KEY): Likewise.
(rainbowh_options): New options "use-vocab-size" and "test-files".
(struct rainbowh_arg_state): New members VOCAB_SIZE and
TEST_FILES_FILENAME.
(rainbowh_parse_opt): Handle new options.
(hier_barrel): New member DELTA.
(hier_barrel_prune_words_not_in_map_and_set_vpc): New function.
(hier_barrel_set_local_class_model): Uncomment code to free the class
barrel before re-creating it.
(rainbowh_total_unique_words): New global variable.
(hier_barrel_prob_wi_in_ci_wittenbell): New function.
(hier_barrel_calc_delta): New function.
(hier_barrel_calc_delta_recurse): New function.
(hier_barrel_prob_wi_in_ci_shrink): New function taken from
hier_barrel_prob_wi_in_ci(). Implement Wasserman's "Method of
Moments".
(hier_barrel_prob_wi_in_ci): New function that calls one of above two.
(_hier_barrel_local_score): Find the root and total number of words.
Initialize SCORES[] to log of prior class prob. Handle case where
there is not training data for a class. If DV is NULL, deal with
skipping the word sometimes, but not always, depending on the
right case. When incrementing score for each word, deal with
classes that have no training data. Remove code for
!OVERRIDE_LOG_PROBS. When putting scores into return array, deal
with classes that have no training data.
(_hier_barrel_set_node_scores): Set the local score or not depending
on HIER_LEAF mode.
(hier_barrel_print_infogain): Function moved up in file. New first
argument FILE*.
(hier_barrel_test): Deal with TEST_FILES_FILENAME and VOCAB_MAP, etc.
(main): Initialize VOCAB_SIZE to zero. Register the KL method. Call
hier_barrel_print_infogain() with new argument. Remove
"compile-command" local emacs variable.
* vpc.c (bow_barrel_new_vpc): Deal with situations where there is
no training data for a class. Instead of asserting MAX_CI > 0,
print a warning instead. Loop over NUM_CLASSES instead of relying
on MAX_CI. Allow the class prior prob to be equal to 0.0 or 1.0.
* split.c (bow_test_set_files): New function.
* opts.c (bow_options): Rearrange some options.
* kl.c (bow_kl_score): When setting class prior prob, handle case
in which there was no training data for this class. Also handle
this case when adding contribution of a WI for each class, and
when doing WITTEN_BELL, and when normalizing score.
* int4str.c (bow_int4str_new_from_string_file): New function.
* bow/libbow.h: Declare new functions.
* Makefile.local (rainbow-h): Use $(CC) instead of gcc.
* rainbow.c (TEST_FILES_KEY): New macro.
(PRINT_DOC_NAMES_KEY): Likewise.
(rainbow_options): New options "test-files" and "print-doc-names".
(rainbow_parse_opt): Handle them.
(filename_to_classname): Comment out early return of filename. That
is, revert to old behavior.
(rainbow_test): If TEST_FILES_FILENAME is non-NULL use it instead of
randomly setting which documents are for training and which are
for testing.
(main): Initialize TEST_FILES_FILENAME to NULL. Handle
RAINBOW_DOC_NAME_PRINTING.
Thu Sep 25 17:35:01 1997 Andrew McCallum <mccallum@jprc.com>
This version given to Rob Schapire and others at AT&T Research.
* Makefile.in (LIBBOW_H_FILES): Added bow/kl.h.
Wed Sep 17 14:51:33 1997 Karl Kleinpaste <karl@jprc.com>
* rainbow-h.c (top): Add socket #includes.
(rainbowh_options): Add --query-server and -n options. Also,
remove #define of PRINT_TREE_SCORES, in favor of runtime -n.
(rainbowh_arg_state): Add rainbowh_query_serving to what_doing,
plus server_port_num, for --query-server, and print_tree_scores,
for -n.
(rainbowh_parse_opt): Add --query-server and -n detection.
(_hier_barrel_set_node_scores): Properly conditionalize tree score
printing.
(hier_barrel_print_scores_recurse): Add FILE *out.
(hier_barrel_print_scores): Add FILE *out.
(rainbowh_query): New routine, stripped from the mainline `if'.
(rainbowh_socket_init): Add for --query-server capability.
(rainbowh_serve): Add for --query-server capability.
(main): Init print_tree_scores; init lexer end pattern; insert
conditional call to service --query-server; slice out mainline for
rainbowh_query().
Wed Sep 17 14:22:18 1997 Andrew McCallum <mccallum@jprc.com>
* split.c (bow_test_split): Remove the assertion that we use at
least 90% of the documents as training data.
Tue Sep 9 10:55:31 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c: Change default method from prind to naivebayes.
(_hier_barrel_cdoc_write): Handle bow_file_format_version 5.
(_hier_barrel_cdoc_read): Likewise.
(hier_barrel_set_local_class_model): Call vpc function with new
argument.
(hier_barrel_set_vpc_with_weights): Likewise.
(hier_barrel_add_document): Set HBARREL->DOC_BARREL->METHOD to
HIER_DEFAULT_METHOD.
(hier_barrel_set_vpc_and_populate_lower_branches): Only recursively
set vpc in children branches if there are documents there.
(hier_barrel_prob_wi_in_ci): Assert that the CDOC->NORMALIZER has been
set. Set M_EST_M according to CDOC->NORMALIZER, which is number
of unique words.
(_hier_barrel_local_score): Clean up a little.
(main): Call hier_set_method() if BOW_ARGP_METHOD.
Sat Aug 30 19:03:17 1997 Andrew McCallum <mccallum@jprc.com>
* kl.c (bow_kl_score): Initialize scores to class prior divided by
query document length, not just class prior. This way our
classifications match Naive Bayes, as they should.
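Why dividing the prior by the query length makes the KL ranking agree with
Naive Bayes can be sketched as follows. This is only an illustration: it
assumes the KL score is the negative KL divergence between the query's word
distribution and the class's, computed in log space. The symbols n_w (count
of word w in query d), |d| (the sum of the n_w) and S are notation introduced
here, not names from kl.c.

```latex
\begin{align*}
S_{\mathrm{NB}}(c) &= \log P(c) + \sum_{w} n_w \log P(w \mid c), \\
S_{\mathrm{KL}}(c) &= \frac{\log P(c)}{|d|}
    + \sum_{w} P(w \mid d) \log \frac{P(w \mid c)}{P(w \mid d)},
    \qquad P(w \mid d) = \frac{n_w}{|d|}, \\
  &= \frac{S_{\mathrm{NB}}(c)}{|d|} - \sum_{w} P(w \mid d) \log P(w \mid d).
\end{align*}
```

The last sum does not depend on the class c, and |d| is positive, so both
scores rank the classes identically. With an undivided prior, the prior
would be over-weighted by a factor of |d| relative to Naive Bayes.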
Fri Aug 29 09:12:05 1997 Andrew McCallum <mccallum@jprc.com>
* barrel.c (bow_barrel_add_from_text_dir): Add newline before
warning about a file being skipped because it is not text.
Before the following change we were overflowing DV->ENTRY[i].DI in
document barrels when there were more than 32767 documents.
Karl's Yahoo experiments were trying to build models with about
60000 documents. We would get an error in vpc.c at the assertion
that "ci < num_classes".
* dv.c (bow_dv_add_di_count_weight): Warn if we overflow int,
not short.
(bow_dv_write_size): Adjust for change of COUNT and DI from
short to int.
(bow_dv_write): Likewise.
(bow_dv_new_from_data_fp): Likewise.
* bow/libbow.h (bow_cdoc): Change member CLASS from short to int.
(bow_de): Change members DI and COUNT from short to int.
* barrel.c (_bow_barrel_cdoc_write): If BOW_FILE_FORMAT_VERSION is
5 or greater, change CDOC->CLASS from short to int.
(_bow_barrel_cdoc_read): Likewise.
* bow/libbow.h (BOW_DEFAULT_FILE_FORMAT_VERSION): Changed from 4 to 5.
* io.c: Add comment about bow_file_format_version history.
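The 32767-document limit mentioned above is simply the largest value a
16-bit signed field can hold. A minimal sketch of the check follows;
di_fits_in_16_bits is a hypothetical helper written for illustration, not a
libbow function, and it uses int16_t to stand in for the old `short' member.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of libbow: report whether document
   index DI can still be stored in a 16-bit signed field, as the DI and
   COUNT members were before they were widened to int. */
int
di_fits_in_16_bits (int di)
{
  assert (di >= 0);
  return di <= INT16_MAX;       /* INT16_MAX is 32767 */
}
```

With about 60000 documents, every index from 32768 upward fails this check,
which is consistent with the bad class indices caught by the assertion in
vpc.c.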
Thu Aug 28 22:36:55 1997 Andrew McCallum <mccallum@jprc.com>
* kl.c (bow_kl_score): Add class prior probabilities.
* lex-simple.c (bow_lexer_simple_get_raw_word): When we find the
NULL at the end of the document, and before we find the beginning
of a word, back up DOCUMENT_POSITION (even though we will return 0
this time already). Add some assertions about DOCUMENT_POSITION.
* lex-html.c (bow_lexer_html_get_raw_word): When we find the NULL
at the end of the document, back up DOCUMENT_POSITION so we will
return 0 next time we are called. Add some assertions about
DOCUMENT_LENGTH.
Wed Aug 27 11:23:34 1997 Andrew McCallum <mccallum@jprc.com>
* bow/libbow.h: Include <unistd.h>.
* rainbow.c (rainbow_parse_opt): Fix typo.
* rainbow.c (rainbow_parse_opt) [SERVER_KEY]: Set
DOCUMENT_END_PATTERN to a single dot on a line.
(main): Don't set DOCUMENT_END_PATTERN here for server mode.
* lex-simple.c (bow_lexer_simple_open_text_fp): Explicitly seek
the PRE_PIPE_FP to the end of the file! Otherwise, we can
sometimes read the same file over and over again in the many
`while(open_text_fp())' loops throughout the library.
* rainbow.c (rainbow_print_weight_vector): Change the test for
deciding when we need to multiply by CDOC->NORMALIZER before
printing the weight. Instead of looking specifically for
"naivebayes", look for a METHOD->NORMALIZE_WEIGHTS function
pointer that is NULL. Now this works properly for the "kl" method
too.
* kl.c (bow_kl_set_weights): Calculate the total number of
occurrences of each word; store this in DV->IDF. Set the DV
weights to the weighted log odds ratio P(w|C)*log(P(w|C)/P(w|~C)).
* rainbow.c (rainbow_lisp_setup): Update for new default
arguments.
(rainbow_lisp_query): Add LOO_CV argument to bow_barrel_score().
* kl.c (bow_kl_score): Move declaration of SCORES_SUM.
Tue Aug 26 11:12:00 1997 Andrew McCallum <mccallum@jprc.com>
* vpc.c (bow_barrel_set_vpc_priors_by_counting): Add assertion
about the PRIOR.
Mon Aug 25 14:03:58 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow-ac.pl: As a diagnostic, print the number of predictions
found in the file.
* naivebayes.c (bow_naivebayes_set_weights): Set CDOC->NORMALIZER
to the number of unique terms in each class. (This is now used by
rainbow-h.)
* kl.c (bow_kl_set_weights): Add assertion about CDOC->NORMALIZER.
* foilgain.c (bow_foilgain_ci_per_wi_new): New function.
* bow/libbow.h (bow_default_method_name): New macro.
* barrel.c (bow_barrel_new): Use new macro
`bow_default_method_name' instead of "naivebayes".
Tue Aug 19 09:50:16 1997 Andrew McCallum <mccallum@jprc.com>
* int4word.c (bow_words_set_map): Be sure to initialize the
map/counts if they haven't been initialized yet. Otherwise,
WORD_MAP_COUNTS will point nowhere and we can tromp on memory. I
was getting malloc() errors before this was fixed.
(bow_words_keep_top_by_infogain): Change so that word indices are
ordered by information gain, even when NUM_WORDS_TO_KEEP is less
than the number of words returned by bow_infogain_per_wi_new().
* wi2dvf.c (bow_wi2dvf_entry_at_wi_di): New function.
* dv.c (bow_dv_entry_at_di): New function.
* bow/libbow.h: Declare new functions.
* barrel.c (bow_barrel_add_from_text_dir): Add verbosity when a
file is skipped because istext() fails.
(bow_new_slow_barrel_printf): New function.
* vpc.c (bow_barrel_new_vpc): New argument, NUM_CLASSES. Use it
to initialize an array that is filled with counts of the number of
documents per class. Initialize CDOC->NUM_WORDS to be the number
of documents per class. This can then be used in "event=document"
models.
(bow_barrel_new_vpc_merge_then_weight): New argument, NUM_CLASSES.
(bow_barrel_new_vpc_weight_then_merge): Likewise.
* rainbow.c (rainbow_index): Use macro
bow_barrel_new_vpc_with_weights(), with new `num_classes'
argument.
(rainbow_query): Likewise.
(rainbow_test): Likewise.
(main): Likewise.
(rainbow_test_files): Likewise. If QUERY_WV is NULL, verbosify a
warning.
* bow/libbow.h (bow_method): Add NUM_CLASSES argument to
VPC_WITH_WEIGHTS.
(bow_barrel_new_vpc_with_weights): Add NUM_CLASSES argument.
(bow_barrel_new_vpc): Likewise.
(bow_barrel_new_vpc_merge_then_weight): Likewise.
(bow_barrel_new_vpc_weight_then_merge): Likewise.
* bow/naivebayes.h (bow_params_naivebayes): Remove
SCORE_WITH_LOG_PROBABILITIES.
* kl.c (bow_kl_score): Reformat error message.
* naivebayes.c (bow_naivebayes_set_weights): Only set
CDOC->WORD_COUNT if not doing BOW_BINARY_WORD_COUNTS, otherwise
leave them as the "document counts" as they were initialized in
vpc.c.
Thu Aug 14 11:46:46 1997 Andrew McCallum <mccallum@jprc.com>
* naivebayes.c: Remove all references and code for
SCORE_WITH_LOG_PROBABILITIES. Use KL method instead.
(bow_method_crossentropy): Removed, and all related structures and
functions.
* opts.c (bow_options): Remove "naivebayes-score-with-log-probs"
option.
(parse_bow_opt): Don't handle it anymore.
* naivebayes.c: Add a naivebayes-specific command-line option by
using "argp child".
(naivebayes_argp_m_est_m): New static variable.
(naivebayes_options): New argp structure. New command-line option
"naivebayes-m-est-m".
(naivebayes_parse_opt): New function.
(naivebayes_argp): New structure.
(naivebayes_argp_child): New structure.
(_register_method_naivebayes): Add the argp child.
(bow_naivebayes_score): Comment out assertion that (loo_class == -1)
because it trips up rainbow-h.
These changes were made a while ago.
* rainbow-h.c (hier_recursive_set_rankings): Pass new LOO argument
to bow_barrel_score.
(classify_single_doc): Likewise.
(hier_barrel_set_vpc_and_populate_lower_branches): Likewise.
(hier_barrel_prob_wi_in_ci): Add two new pass-by-ref arguments that
return certain counts. Pass new arguments.
(check_prob_wi_in_ci): Pass new arguments.
(_hier_barrel_local_score): Call above function with new arguments,
and print them out.
(main): Switch back to using POPULATE_BY_SCORING and HIER_NIECE
options by default.
Wed Aug 13 16:44:07 1997 Andrew McCallum <mccallum@jprc.com>
* lex-simple.c (bow_lexer_simple_open_text_fp): Print error
message if popen() call failed.
* opts.c (bow_argp_add_child): Change assertion. Add call to
memset(), which should be unnecessary.
Before this code was added, some inlinks WebKB files were being
declared as "nontext" and skipped because many lines had the same
length.
* istext.c (bow_fp_is_text): Pay attention to
BOW_ISTEXT_AVOID_UUENCODE.
* opts.c (bow_istext_avoid_uuencode): Declare new global variable.
(bow_options): New option "istext-avoid-uuencode".
(parse_bow_opt): Handle it.
* bow/libbow.h (bow_istext_avoid_uuencode): New global variable
set by command-line option.
(bow_lex_pipe_command): Make it extern!
* kl.c (bow_kl_score): Give more detailed error message for LOO
negative probabilities.
Before this code was added, some WebKB files were being skipped
because the non-MIME-header part was already buffered in STDIO.
* lex-simple.c (bow_lexer_simple_open_text_fp): When using
BOW_LEX_PIPE_COMMAND, make sure that the file descriptor file
position matches the stdio FP position, otherwise we can get a
premature EOF because the stdio has already read much of the file
for buffering.
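The file-position fix described above for BOW_LEX_PIPE_COMMAND can be
sketched as below. This is a minimal illustration under stated assumptions,
not the code in lex-simple.c: sync_fd_to_stream_position is a hypothetical
name, and the sketch assumes a seekable file and a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: move the underlying file descriptor's offset to
   the stream's logical position.  stdio may have read far ahead to fill
   its buffer, so anything that later reads the raw descriptor would
   otherwise start too late in the file, or see EOF immediately.
   Returns 0 on success, -1 on failure. */
int
sync_fd_to_stream_position (FILE *fp)
{
  long pos;

  assert (fp != NULL);
  pos = ftell (fp);
  if (pos < 0)
    return -1;
  if (lseek (fileno (fp), (off_t) pos, SEEK_SET) == (off_t) -1)
    return -1;
  return 0;
}
```

A caller would invoke this after reading the headers through stdio and
before handing the descriptor to another reader.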
Mon Aug 11 11:51:11 1997 Andrew McCallum <mccallum@jprc.com>
* info_gain.c (bow_infogain_per_wi_print): If NUM_TO_PRINT is 0,
then print infogain of all words, not zero words.
* bow/libbow.h (bow_model_next_wv): Declare new split function.
Mon Jul 14 11:09:04 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow-stats.pl (overall_accuracy): Shorten the label before
the numbers.
* istext.c (bow_fp_is_text): Initialize
MAX_LINE_LENGTH_HISTOGRAM_LENGTH to avoid warning.
* istext.c (bow_fp_is_text): Re-enable the uuencode-block
detection. Now, in order to reject the file, insist that the
length of the lines with the most common length be greater than
or equal to 50. Hopefully this will not falsely reject HTML files
as it did before.
Tue Jul 1 08:39:25 1997 Andrew McCallum <mccallum@jprc.com>
* kl.c (bow_kl_score): Remove assertion that SCORE_INCREMENT be
non-zero. It can be zero when PR_W_C == PR_W_D, then
LOG(PR_W_C/PR_W_D) will be zero, and SCORE_INCREMENT will be zero.
Mon Jun 30 17:41:06 1997 Karl Kleinpaste <karl@jprc.com>
* rainbow.c (rainbow_serve): Added.
(rainbow_socket_init): Added.
(rainbow_parse_opt): Added SERVER_KEY case.
(rainbow_query): Modified FILE * handling for use of other than
stdin/stdout.
(main): Added query-server handling.
Sat Jun 28 12:22:30 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow.c (rainbow_test_files): Temporarily comment out code
that removes some of the training documents from training until we
add a scheme that really makes the default test percentage 0.
(main): Put the call of rainbow_test_files after doing things
necessary to update the class/word weights for the command-line
options. Temporarily, ALWAYS rebuild the VPC model, even if none
of the parameters change because the weights read from disk were
bad; find out why eventually!
* prind.c (bow_prind_score): When BOW_PRINT_WORD_SCORES, also
print PR_W_C.
* prind.c (bow_prind_score): When all pre-normalized scores are
zero, set normalized scores to -1.0/#classes, don't leave them as
zero. [Perhaps we should set the scores to the class priors?
Although this does not fall out of the PrInd derivation.]
* kl.c (bow_kl_score): When all pre-normalized scores are zero,
set normalized scores to -1.0/#classes, not -9999.
* arrow.c (arrow_query): Pass LOO_CV argument to score.
Thu Jun 26 14:48:28 1997 Andrew McCallum <mccallum@jprc.com>
* lex-simple.c (bow_lexer_simple_open_text_fp): Attend to
BOW_LEX_PIPE_COMMAND and implement it.
* opts.c (bow_lex_pipe_command): New global variable.
(bow_options): New command-line option "lex-pipe-command".
(parse_bow_opt): Handle it.
* bow/libbow.h: Declare new global variable.
* istext.c (bow_fp_is_text): Move local variables to avoid GCC
warning.
--test-files-loo should now work.
* prind.c: Convert scoring function to take LOO_CLASS argument.
* kl.c: Likewise.
* naivebayes.c: Likewise.
* tfidf.c: Likewise.
* evi.c: Likewise.
* rainbow.c: Call scoring function with LOO_CLASS argument.
* bow/libbow.h (bow_barrel_score): Add extra LOO_CLASS argument.
(bow_method): Likewise to (*score) member.
* rainbow-ac.pl: Make sure last confidence number gets printed
properly. Before it was always just zero.
* rainbow.c (rainbow_options): Added "test-files-loo" for
Leave-One-Out testing. Not implemented yet, however.
(struct rainbow_arg_state): New member LOO_CV.
(rainbow_query): Do proper checks before using lisp score truncation.
(rainbow_test): Likewise. Also, add (commented out) code to print
more stats.
(main): Call _register_method_evi() to make sure it gets linked in.
* Makefile.in (LIBBOW_C_FILES): Added evi.c.
* evi.c: New file.
* naivebayes.c (bow_naivebayes_set_weights): Add checks that make
sure that Sum_w Pr(w|c) is 1 for all classes.
* kl.c (bow_kl_score_loo): Implement normalized KL scores, with
Witten-Bell discounting. (NOTE: NaiveBayes does not yet have
Witten-Bell implemented. Thus the accuracy of Witten-Bell can be
easily compared with Laplace by comparing "kl" with "naivebayes".)
* rainbow-stats.pl (confusion): Initialize $MAX_CLASSNAME_LENGTH
to the length of "classname", so that we still get proper
formatting with very short classnames.
* istext.c (bow_fp_is_text): Temporarily comment out the code that
tries to avoid files with uuencoded blocks, because the current
scheme also seems to avoid many HTML files. (Reported by Sean
Slattery.) Warning, trying to index the 20_newsgroups data in
this state will give bad results.
Mon Jun 23 11:59:50 1997 Andrew McCallum <mccallum@jprc.com>
* prind.c (bow_prind_score): Comment fixes. Describe the
smoothing situation accurately.
* int4word.c (bow_words_keep_top_by_infogain): Don't try to "keep"
more words than are available in the BARREL! (Bug reported by
Daniel A Dipasquo <greenface+@CMU.EDU>.) If NUM_WORDS_TO_KEEP is
greater than or equal to the number of words in the BARREL, put
all these words in the new vocabulary.
Wed Jun 11 16:40:14 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow.c (rainbow_test): Don't do CommonLisp score truncation
if the score is negative. (This change should be made to other
score-printing functions too.)
(main): Gratuitously call _register_method_kl(), so that kl.c gets
linked in with the rainbow executable.
* kl.c (_register_method_kl): Make sure we can't register the
method twice, even if this function is called twice.
* naivebayes.c (bow_naivebayes_score_loo): When using uniform
class priors, set SCORES[CI] based on log of uniform distribution
of classes, not to 1. When setting log_pr_tf, instead of using
pow() before taking the log(), just multiply after using log().
* Makefile.in (LIBBOW_C_FILES): Added kl.c.
Fri Jun 6 09:48:06 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow.c (NO_LISP_SCORE_TRUNCATION_KEY): New macro.
(rainbow_options): New option "no-lisp-score-truncation".
(rainbow_parse_opt): Handle it.
(struct rainbow_arg_state): New member USE_LISP_SCORE_TRUNCATION.
(rainbow_query): Obey it.
(rainbow_test): Likewise.
(main): Make its default value 1.
Tue Jun 3 10:31:10 1997 Andrew McCallum <mccallum@jprc.com>
* readme.texi: Use BOWVERSION, not BOW_VERSION to match
version.texi.
Thu May 29 15:25:06 1997 Andrew McCallum <mccallum@jprc.com>
* Version (BOW_MINOR_VERSION): Version 0.8.
* bow/libbow.h (BOW_MINOR_VERSION): Version 0.8.
* docnames.c (bow_map_filenames_from_dir): Remove local variables
no longer used.
Mon May 26 12:59:50 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow.c (main): New commented-out code for computing the
number of word co-occurrences.
Fri May 23 11:34:05 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow.c (USE_VOCAB_IN_FILE_KEY): New macro.
(rainbow_options): New option "use-vocab-in-file".
(rainbow_parse_opt): Handle it.
(struct rainbow_arg_state): New member VOCAB_MAP.
(rainbow_query): Use it to remove words from the vocabulary.
(rainbow_test): Likewise.
(main): Likewise.
* rainbow-stats.pl (prune_from_classname): New global variable. A
regular expression to be removed from the end of classnames before
gathering stats on them. This allows us to gather stats on
performance in the middle of class hierarchies.
(read_trial): Use it.
* int4str.c (bow_int4str_new_from_text_file): Return MAP instead
of NULL!
* barrel.c (bow_barrel_prune_words_not_in_map): Define MAX_WI and
use it, so we don't ask for word indices larger than
bow_num_words().
(bow_barrel_print_word_count): Also print word probability according
to counts.
* rainbow-h.c (main) [printing_word_counts]: Print word that is
being counted.
Wed May 21 15:01:51 1997 Andrew McCallum <mccallum@jprc.com>
* barrel.c (bow_barrel_prune_words_not_in_map): Remove the words
instead of hiding them, so that future
bow_keep_top_words_by_infogain() calls won't unhide them.
This version got 46% on hier/yahoo-science (dataset with a 10
document-per-class threshold).
* rainbow-h.c (rainbowh_options): Added --use-vocab-in-file
command-line option.
(rainbowh_arg_state): Added PARENT and CI_IN_PARENT. Added HIER_LEAF.
Removed printing of leaf- and intermediate-results.
(hier_barrel_prob_wi_in_ci): New function.
(check_prob_wi_in_ci): New function.
(_hier_barrel_local_score): New function.
(_hier_barrel_set_node_scores): Use it.
(hier_barrel_print_infogain): Print FULL_NAME with interspersed
spaces, so it won't get lexed by bow_int4str_new_from_text_file().
(main): Change defaults. Before populate_by_scoring=1 and
hier_structure=hier_niece. Populate branches first thing, and
check prob_wi_ci consistency.
* naivebayes.c (bow_naivebayes_score_loo): Comment change.
* int4str.c (bow_int4str_new_from_text_file): New function.
* bow/libbow.h: Declare new functions.
Tue May 20 16:02:24 1997 Andrew McCallum <mccallum@jprc.com>
* barrel.c (bow_barrel_prune_words_not_in_map): New function.
Mon May 19 09:52:09 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow-stats.pl (confusion): Calculate longest classname and
use it to fix indentation.
* wi2dvf.c (bow_wi2dvf_add_di_wv): Set SEEK_START to special flag
2.
(bow_wi2dvf_add_wi_di_count_weight): Likewise.
(bow_wi2dvf_hide_wi): Decrement WI2DVF->NUM_WORDS in the right place.
(bow_wi2dvf_unhide_all_wi): Increment WI2DVF->NUM_WORDS.
(bow_wi2dvf_write): Unhide all words first.
(bow_wi2dvf_dv): Change assertion to deal with special flag 2.
* rainbow.c (main): Pass new argument to
bow_infogain_per_wi_print().
* rainbow-h.c: Misc changes. Print infogain during run.
(hier_barrel_set_local_class_model): Add IS_ROOT argument. Unhide
vocabulary after pruning by infogain, so lower levels get all
words.
* naivebayes.c (M_EST_M): New macro.
(M_EST_P): New macro.
(bow_naivebayes_score_loo): Use them to implement M-estimates, instead
of old Laplace smoothing.
* info_gain.c (bow_infogain_per_wi_print): Add FP argument.
* bow/libbow.h: Add argument to infogain function.
* barrel.c: Fix the math for assigning CDOC->PRIOR, and add
assertion checks.
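The M-estimate mentioned above for bow_naivebayes_score_loo has a standard
textbook form, sketched below. The function name and argument names are
hypothetical; the actual values given to M_EST_M and M_EST_P are not stated
in this entry, so none are assumed here.

```c
#include <assert.h>

/* Textbook M-estimate of Pr(w|c):
     (count of w in c + m * p) / (total words in c + m)
   where P is a prior estimate of the probability and M is the weight
   (the "equivalent sample size") given to that prior. */
double
m_estimate (double count_w_c, double total_c, double m, double p)
{
  assert (count_w_c >= 0.0 && total_c >= count_w_c);
  assert (m > 0.0 && p > 0.0 && p <= 1.0);
  return (count_w_c + m * p) / (total_c + m);
}
```

Laplace smoothing is the special case where m is the vocabulary size and p
is 1/m, so that m * p adds exactly one count per word.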
Fri May 16 10:19:19 1997 Andrew McCallum <mccallum@jprc.com>
This was state of code on Thursday night.
* rainbow-h.c: Add options for changing population scheme and tree
structure. Add ability to output intermediate and leaf results.
* naivebayes.c (WORD_PRIOR_COUNT): New macro. Current value 1.0.
(bow_naivebayes_score_loo): Use it.
Thu May 15 16:22:27 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow.c (rainbow_test): Assert that the ACTUAL_NUM_HITS
returned by bow_barrel_score() is the same as the
NUM_HITS_TO_RETRIEVE requested.
* split.c (bow_test_split): Use rand() properly so that the number
of test documents in each class are not so biased. Add special
code that *ensures* that the test documents are evenly distributed
across classes.
* rainbow.c (rainbow_print_weight_vector): Don't use
CDOC->NORMALIZER if the method is "naivebayes", because NaiveBayes
doesn't use it. Previously the printed values were bogus.
Wed May 14 11:02:44 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c: -q RAINBOWH_QUERYING now seems to work.
* naivebayes.c (bow_naivebayes_score_loo): Add assertion that
CDOC->PRIOR is greater than zero. This restriction should be
relaxed!
* array.c (bow_array_free): Decrement length after testing for
non-zero-ness, not before. Without this change, empty arrays
would call free() on un-malloc'ed() memory.
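The bow_array_free fix above comes down to the order of the test and the
decrement. A minimal sketch of the corrected loop follows, using a
hypothetical toy_array type; the real bow_array layout is not shown in this
log and is not assumed here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a libbow array of heap-allocated entries. */
struct toy_array
{
  int length;
  void **entries;
};

/* Free every entry, then the entry table.  Returns the number of
   entries freed. */
int
toy_array_free_entries (struct toy_array *a)
{
  int freed = 0;

  assert (a != NULL && a->length >= 0);
  /* Test LENGTH first, then decrement, so an empty array skips the loop.
     Writing `while (--a->length)' instead would turn 0 into -1, find it
     non-zero, and call free() on entries[-1]. */
  while (a->length-- > 0)
    {
      free (a->entries[a->length]);
      freed++;
    }
  a->length = 0;
  free (a->entries);
  a->entries = NULL;
  return freed;
}
```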
Tue May 13 18:16:31 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c: Add code for doing selective population of lower
branches. This population seems to be working. Querying/scoring
does not yet work.
* wi2dvf.c (bow_wi2dvf_hide_wi): Change assertion to "if" so that
we won't crash if we try to hide words that are already hidden.
* split.c (bow_tmp_word_struct2): New type.
(bow_model_next_wv): New function.
(bow_nontest_next_wv): New function.
* rainbow.c (rainbow_options): Fix documentation for test-files.
(rainbow_test): Choose vocabulary by info gain *after* the test/train
split. Add temporary code to test bow_naivebayes_score_loo().
Remove this later!
* naivebayes.c (bow_naivebayes_score_loo): New function, copy of
bow_naivebayes_score, with extra code to do leave-one-out
testing if argument LOO is non-negative.
(bow_naivebayes_score): Call above function with -1 for LOO.
(bow_method_naivebayes): Change NORMALIZE_WEIGHTS from
bow_barrel_normalize_weights_by_summing() to NULL. The
normalizing function was not taking account of the Laplace
smoothing numbers, and was giving incorrect weights.
(bow_method_crossentropy): Likewise.
* istext.c (bow_fp_is_text): Increase NUM_LINE_LENGTHS to
NUM_TEST_CHARS to avoid potential crash.
* docnames.c (bow_map_filenames_from_dir): For directory names and
filenames, make it use names of soft links, not the directories
that the links point to.
* barrel.c (bow_barrel_add_document): New function.
* bow/libbow.h: Declare new function.
* docnames.c (bow_map_filenames_from_dir): Change commented-out
code so that, if uncommented, this function will work if you pass
it a filename instead of a directory name.
Tue May 6 15:30:30 1997 Andrew McCallum <mccallum@jprc.com>
* Makefile.local (rainbow-h): Make it depend on libbow.a.
* rainbow-h.c: May 5 changes from Andrew Ng.
(rainbowh_unarchive): Switch order of unarchiving for vocabulary
and hier_barrel.
(hier_barrel_new_from_file): Use bow_barrel_new_from_data_file()
instead of bow_barrel_new_from_fp(), so we close FILE*'s instead
of keeping them open. Otherwise we run out of UNIX's available
open file descriptors.
* wi2dvf.c (FREE_WHEN_HIDING_WI): New macro.
(bow_wi2dvf_hide_wi): Heed it.
(bow_wi2dvf_dv): Don't check to make sure that WI is less than
bow_num_words(). Check SEEK_START before returning a non-NULL DV,
because if SEEK_START is less than -1, the DV should be considered
`hidden'.
* opts.c (bow_exclude_filename): New global variable.
(bow_options): New option "exclude-filename".
(parse_bow_opt): Handle it.
* docnames.c (bow_map_filenames_from_dir): Make sure
BOW_EXCLUDE_FILENAME is non-NULL before passing it to strcmp().
* bow/libbow.h (bow_exclude_filename): Declare new global
variable.
* barrel.c (bow_barrel_set_cdoc_priors_to_class_uniform): Use
bow_malloc() instead of alloca(), so that bow_realloc() will work.
free() it at the end.
(bow_barrel_new_from_data_file): New function.
Mon May 5 21:08:34 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c: Changes by Andrew Ng, before Andrew McCallum's
changes to close barrel FP's.
Fri May 2 09:53:12 1997 Andrew McCallum <mccallum@jprc.com>
* rainbow-h.c: Additions by Andrew Ng to implement cousin scheme.
Wed Apr 30 10:48:30 1997 Andrew McCallum <mccallum@jprc.com>
* Makefile.in: Include Makefile.local, avoiding error if it isn't
present.
* barrel.c (bow_barrel_keep_top_words_by_infogain): Unhide and
hide the DVF's instead of removing them, so that we can call this
function multiple times with increasing NUM_WORDS_TO_KEEP.
* wi2dvf.c (bow_wi2dvf_hide_wi): New function.
(bow_wi2dvf_unhide_all_wi): New function.
(bow_wi2dvf_dv): Handle new negative values of SEEK_START set by
BOW_WI2DVF_HIDE_WI().
* bow/libbow.h: Declare new functions.
(bow_doc_type): Add ignored_model, for rainbow-h.c.
Thu Apr 24 09:03:10 1997 Andrew McCallum <mccallum@jprc.com>
* vpc.c (bow_barrel_set_vpc_priors_by_counting): Fix crash that
occurs if limited vocabulary causes all files in a class to be
empty.
* stoplist.c (bow_stoplist_add_word): New function.
* rainbow-stats.pl (confusion): Print percentage correct for each
category.
* istext.c (bow_fp_is_text): Also return 0 for files that have
more than 30% of their lines of the same length. This way we
avoid files containing uuencoded blocks.
* bow/libbow.h: Declare new function.
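The 30% same-length rule added to bow_fp_is_text above can be sketched as a
histogram over line lengths. This is an illustration only: the name
looks_uuencoded, the array-of-lengths interface and MAX_TRACKED_LENGTH are
assumptions made here, not the actual istext.c code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRACKED_LENGTH 256  /* hypothetical cap on tracked lengths */

/* Return 1 if more than 30% of the lines share a single length, as the
   lines of a uuencoded block do; return 0 otherwise. */
int
looks_uuencoded (const int *line_lengths, size_t num_lines)
{
  size_t histogram[MAX_TRACKED_LENGTH] = { 0 };
  size_t i, most_common = 0;

  assert (line_lengths != NULL || num_lines == 0);
  for (i = 0; i < num_lines; i++)
    {
      int len = line_lengths[i];
      if (len >= 0 && len < MAX_TRACKED_LENGTH)
        {
          histogram[len]++;
          if (histogram[len] > most_common)
            most_common = histogram[len];
        }
    }
  /* most_common / num_lines > 0.30, kept in integer arithmetic. */
  return num_lines > 0 && most_common * 10 > num_lines * 3;
}
```

A rule this simple also catches tables and some HTML whose lines happen to
share a length, so it trades false rejections for simplicity.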
Tue Apr 22 11:19:03 1997 Andrew McCallum <mccallum@jprc.com>
* deflexer.c (bow_default_lexer): Add cast to initialization to
avoid warning.
Add a uniform, global way of keeping track of binary file format
versions.
* io.c (bow_file_format_version): New global variable.
(bow_write_format_version_to_file): New function.
(bow_read_format_version_from_file): New function.
* bow/libbow.h (bow_file_format_version): Declare new global
variable.
(BOW_DEFAULT_FILE_FORMAT_VERSION): New macro.
(bow_write_format_version_to_file): New function declaration.
(bow_read_format_version_from_file): New function declaration.
* rainbow.c (FORMAT_VERSION_FILENAME): New macro.
(rainbow_archive): Write format version to disk.
(rainbow_unarchive): Read it from disk if the file exists, otherwise
set it to 3, which is the format version number of data before
BOW_FILE_FORMAT_VERSION was added to the library.
* rainbow.c (rainbow_options): New option "print-word-counts",
alias for "print-counts-for-words". Hide the latter option from
the --help text.
* rainbow-stats.pl (confusion): Print confusion matrix in a more
readable format.
Add new command-line option to rainbow for using only 0 or 1 word
counts.
* opts.c (bow_binary_word_counts): New global variable.
(bow_options): New option "binary-word-counts".
(parse_bow_opt): Handle it.
* bow/libbow.h: Declare new global variable.
* dv.c (bow_dv_add_di_count_weight): When BOW_BINARY_WORD_COUNTS
is true, insist on keeping DV's entry count below 2, i.e. 0 or 1.
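The fallback described for rainbow_unarchive() above (read the format version if the file exists, otherwise assume 3) might look roughly like this. It is a sketch under assumptions: read_format_version() and PRE_VERSIONING_FORMAT are hypothetical names, and the version is assumed to be stored as a decimal integer in a text file, which the entry does not state.

```c
#include <stdio.h>

/* Version assumed for data written before versioning existed. */
#define PRE_VERSIONING_FORMAT 3

/* Return the format version stored in FILENAME, or
   PRE_VERSIONING_FORMAT if the file is missing or unreadable. */
int
read_format_version (const char *filename)
{
  FILE *fp = fopen (filename, "r");
  int version = PRE_VERSIONING_FORMAT;

  if (fp == NULL)
    return PRE_VERSIONING_FORMAT;     /* old archive: no version file */
  if (fscanf (fp, "%d", &version) != 1)
    version = PRE_VERSIONING_FORMAT;  /* malformed file: fall back */
  fclose (fp);
  return version;
}
```

Treating a missing file as "old format" rather than as an error is what lets archives written before the version file existed keep loading.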
Fri Apr 18 16:09:06 1997 Andrew McCallum <mccallum@@jprc.com>
* configure.in: Add -Wno-implicit to default CFLAGS.
* rainbow.c (rainbow_lisp_query): Return if QUERY_WV is empty.
(Previously would have crashed.)
* tfidf.c (TFIDF_METHOD): Fix typo that defined
_register_method_tfidf_.. functions without the last underscore.
(Reported by Kamal Nigam.)
* split.c (bow_test_split): When selecting documents for the test
set, if we randomly pick a document that is already in the test
set, don't just scan sequentially for the next non-test document;
pick a new random number instead. This avoids long contiguous
stretches of test documents.
* naivebayes.c (bow_naivebayes_score): Move the handling of
SCORE_WITH_LOG_PROBABILITIES.
* barrel.c (bow_barrel_set_cdoc_priors_to_class_uniform): Assert
that CDOC->PRIOR must be greater or equal, not just greater.
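The re-draw strategy described for bow_test_split() above can be illustrated as follows. This is a minimal sketch, not the libbow implementation: pick_test_set() and its flag-array interface are hypothetical, and rand() stands in for whatever generator the library uses.

```c
#include <stdlib.h>

/* Mark NUM_TEST of the N entries in IS_TEST as 1, chosen at random.
   On a collision we draw a fresh random index instead of scanning
   forward, so test documents do not clump into contiguous runs.
   Returns the number of entries marked, or -1 on bad arguments. */
int
pick_test_set (char *is_test, int n, int num_test)
{
  int chosen = 0;
  int i;

  if (n <= 0 || num_test < 0 || num_test > n)
    return -1;
  for (i = 0; i < n; i++)
    is_test[i] = 0;
  while (chosen < num_test)
    {
      int di = rand () % n;

      if (is_test[di])
        continue;               /* already a test document: re-draw */
      is_test[di] = 1;
      chosen++;
    }
  return chosen;
}
```

Re-drawing costs more iterations when most documents are already selected, but it keeps every remaining document equally likely to be picked.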
Thu Apr 10 14:54:08 1997 Andrew McCallum <mccallum@@jprc.com>
* rainbow-h.c: Fix the `compile-command'.
(PRINT_TREE_SCORES): New macro.
(hier_set_method): New function.
(main): Call it if BOW_ARGP_METHOD is non-NULL.
* deflexer.c (bow_default_lexer): Initialize it to -1, so that
deflexer.o will get linked in under SunOS. Ug. See comment.
* bow/libbow.h (bow_methods): Declare extern!
Wed Apr 9 11:14:13 1997 Andrew McCallum <mccallum@@jprc.com>
* lex-html.c (bow_lexer_html_get_raw_word): Return last word in
document, even if it is not followed by a non-word character!
* lex-simple.c (bow_lexer_simple_get_raw_word): Likewise.
* rainbow.c (rainbow_lisp_setup): Call all
__attribute__((constructor)) functions here since this will be
dynamically loaded and the constructor functions won't be called
then.
* opts.c (parse_bow_opt): Remove call to
_bow_default_lexer_init(); moved to rainbow.c.
Fix a bug whereby --skip-html was a no-op.
* deflexer.c (bow_default_lexer_simple,
bow_default_lexer_indirect, bow_default_lexer_gram,
bow_default_lexer_html, bow_default_lexer_email): Change global
variable from struct's to pointers to structs.
(_bow_default_lexer_simple, _bow_default_lexer_gram,
_bow_default_lexer_html, _bow_default_lexer_email): New static
variables.
(_bow_default_lexer_init): Set BOW_DEFAULT_LEXER_INDIRECT to point
inside of BOW_DEFAULT_LEXER_GRAM, which is the BOW_DEFAULT_LEXER.
* opts.c: Now use all default lexers as pointers to struct's
instead of struct's.
* bow/libbow.h (bow_default_lexer_simple,
bow_default_lexer_indirect, bow_default_lexer_gram,
bow_default_lexer_html, bow_default_lexer_email): Change global
variable from struct's to pointers to structs.
* vpc.c (bow_barrel_new_vpc_merge_then_weight): Assert the method
name.
* Makefile.in (dist-cmu, bow-$(BOW_VERSION).tar.gz): New targets.
Tue Apr 8 08:00:00 1997 Andrew McCallum <mccallum@@jprc.com>
* Version (BOW_MINOR_VERSION): Version 0.7.
* bow/libbow.h (BOW_MINOR_VERSION): Likewise.
* rainbow.c (RAINBOW_MINOR_VERSION): Version 0.2.
* arrow.c (ARROW_MINOR_VERSION): Version 0.2.
* NEWS: Update for new version of library and rainbow.
* readme.texi: Likewise.
* Makefile.in (DIST_FILES): Add NEWS.
* Makefile.in (dist): Fix invocation of `tr' for cvs rtag.
* split.c (bow_test_next_wv): Initialize CURRENT_DI to avoid
warning.
* split.c (bow_test_split): Initialize DOC to avoid warning.
* int4word.c (bow_words_keep_top_by_infogain): Initialize
MAX_IG_WI to avoid warning.
* dv.c (bow_dv_add_di_count_weight): Only give "overflowed short"
message at BOW_VERBOSE level, not BOW_PROGRESS level.
* crossbow.c (main): Initialize NORMALIZER to zero.
* Makefile.in (dist): Create ./bow directory. Fix invocation of
argp.
(snapshot): Likewise.
* configure.in: Add -O to the default CFLAGS.
* rainbow.c (rainbow_options): Improve some option help text.
(rainbow_parse_opt) [INFOGAIN_PAIR_VECTOR_KEY]: Handle it.
* opts.c (bow_options): Improve some option help text.
* Makefile.in (version.texi): Define BOWVERSION instead of
BOW_VERSION, so makeinfo can get the value.
(%.dvi, %.info): Fix typo.
* libbow.texi: Fix typos and begin preliminary documentation.
* rainbow.c (rainbow_options): New option "repeat"/'r'.
(rainbow_parse_opt): Handle it.
(rainbow_arg_state): New member REPEAT_QUERY.
(rainbow_query): Attend to REPEAT_QUERY.
* naivebayes.c (bow_naivebayes_set_weights): Fix assertion so it
works for both naivebayes and crossentropy.
Mon Apr 7 11:00:06 1997 Andrew McCallum <mccallum@@jprc.com>
* sarray.c (bow_sarray_entry_at_keystr): If there is no index for
that KEYSTR, print an error message. This way if user mistypes a
method name to rainbow's -m option, they get a message that makes
some sense.
* opts.c (_help_filter): New function to add the names of the
available methods to the help text.
(bow_argp): Put it in.
Use strings to identify methods instead of integers. Separate
method declarations into separate .h files.
* bow/tfidf.h, bow/naivebayes.h, bow/prind.h: New files.
* Makefile.in (LIBBOW_H_FILES): Add files bow/naivebayes.h,
bow/tfidf.h, bow/prind.h.
* naivebayes.c (bow_method_naivebayes, bow_method_crossentropy):
Use string method identifier instead of integer.
* prind.c (bow_method_prind): Likewise.
* tfidf.c (TFIDF_METHOD): Likewise.
* rainbow.c (rainbow_parse_opt) [G]: Step through methods
according to new BOW_METHODS bow_sarray, instead of old static
array.
* methods.c (bow_methods): Static array removed.
(bow_methods): Renamed from _bow_str4method, and made non-static.
* barrel.c (bow_method_id, _old_bow_methods): Put copies of what
used to be in libbow.h here, so we can unarchive old-format
barrel's.
(BOW_DEFAULT_BARREL_VERSION): Changed from 2 to 3.
(bow_barrel_new_from_data_fp): If VERSION_TAG is less than 3, read the
method id integer and use _OLD_BOW_METHOD, otherwise, read a
string and use new BOW_METHOD_AT_NAME().
(bow_barrel_write): Write the method as a string instead of as an
integer.
* Makefile.in (ALL_CPPFLAGS): -I$(srcdir) instead of
-I$(srcdir)/bow.
* All files: Include <bow/libbow.h> instead of "libbow.h".
* bow/libbow.h: Include <bow/tfidf.h>, <bow/naivebayes.h>,
<bow/prind.h>.
(bow_method_register_with_name, bow_method_at_name): Declare functions.
(bow_method_id): Typedef removed.
(bow_str_to_method_id): Macro removed.
(bow_methods): Global variable removed.
(bow_method_tfidf_words, bow_method_tfidf_log_words,
bow_method_tfidf_log_occur, bow_params_tfidf): Removed.
(bow_method_prind, bow_params_prind): Removed.
(bow_method_naivebayes, bow_params_naivebayes): Removed.
* methods.c (bow_method_at_name): Comment function.
(bow_method_register_with_name): Likewise.
* opts.c (parse_bow_opt) [m]: Use bow_method_at_name().
* naivebayes.c: Use bow_method_register_with_name(). Add new
method "crossentropy".
(bow_naivebayes_score): Pay attention to SCORE_WITH_LOG_PROBABILITIES
when setting class priors. When it is true, use inverse of
cross-entropy instead of negative!
* prind.c: Use bow_method_register_with_name().
* tfidf.c: Use bow_method_register_with_name().
* rainbow.c (main): Strip any trailing `/'s from classnames, so
FILENAME_TO_CLASSNAME() will find the classnames. (Reported by
Jason Rennie <jr6b@@syrinx.res.cmu.edu>.)
* rainbow-h.c (PRINT_COUNTS_FOR_WORD_KEY): New macro.
(rainbowh_options): New option "print-counts-for-words".
(rainbowh_parse_opt): Handle it.
(struct rainbowh_arg_state): New member PRINTING_WORD.
(hier_barrel_print_word_counts): New function.
(main): Handle new option. Do the right thing for `-O' if
BOW_PRUNE_VOCAB_BY_OCCUR_COUNT_N.
* info_gain.c (LEAVE_OUT_LAST_CLASS): Macro defined once at top.
Changed from 0 to 1.
* install.texi: Explain the results of --prefix. Remove old
references to Objective C installation.
Thu Apr 3 12:50:23 1997 Andrew McCallum <mccallum@@jprc.com>
* rainbow.c (rainbow_test_files): Use macros for setting QUERY_WV
weights, so we can handle case in which the wv normalizer is NULL!
(main): Replace code for implementing word-count-printing with
call to new function.
* barrel.c (bow_barrel_set_cdoc_priors_to_class_uniform):
Initialize ci2dc entries to zero!
(bow_barrel_print_word_count): New function.
* opts.c (bow_options): Add new option
"naivebayes-score-with-log-probs".
(parse_bow_opt): Handle it.
* naivebayes.c (bow_naivebayes_score): Begin adding code to
support SCORE_WITH_LOG_PROBABILITIES parameter; not yet finished.
(bow_naivebayes_params): Add initializer for
SCORE_WITH_LOG_PROBABILITIES, initialize it to BOW_NO.
* bow/libbow.h: Declare new function.
(bow_params_naivebayes): New entry SCORE_WITH_LOG_PROBABILITIES.
Wed Apr 2 10:07:30 1997 Andrew McCallum <mccallum@@jprc.com>
* configure.in: Add a check to see if __attribute__((constructor))
works. If it does not, define CONSTRUCTOR_FAILS.
* rainbow.c (rainbow_lisp_setup): Fix typo.
* Makefile.in ($(PERL_RUNNABLE_FILES)): Use % in pattern and $< in
rule so that we get the .pl file from the $(srcdir).
* rainbow-h.c (rainbowh_options): New option
"print-infogain-vector", 'I'.
(struct rainbowh_arg_state): Add state for it.
(rainbowh_parse_opt): Handle it.
(hier_barrel_write_to_file): Close the FP after writing a barrel.
(hier_barrel_set_vpc_with_weights): Construct and pass a CLASSNAMES
array.
(hier_barrel_set_cdoc_priors_to_class_uniform): New function.
(_hier_barrel_set_node_scores): Print a little header/separator if
BOW_PRINT_WORD_SCORES.
(hier_barrel_test): Initialize the QUERY_WV to NULL, so
BOW_TEST_NEXT_WV doesn't try to free unallocated memory.
(hier_barrel_print_infogain): New function.
(rainbowh_archive): New function.
(rainbowh_unarchive): New function.
(main): Use above two functions. Deal with printing infogain.
* rainbow.c: Re-written for using libargp. This should make it
work with the WebKB lisp crawler again.
* prind.c (bow_prind_score): Make sure CDOC->FILENAME is non-NULL
before trying to print it when BOW_PRINT_WORD_SCORES is true.
* opts.c (parse_bow_opt) [ARGP_KEY_INIT]: Call
_bow_default_lexer_init().
* deflexer.c (_bow_default_lexer_init): Don't make it static. Use
static local variable to make sure we don't run through it twice.
This is because we will call it explicitly in
opts.c:parse_bow_opt(), because __attribute__ ((constructor))
doesn't seem to work on SunOS.
* Makefile.in (PERL_FILES): Added rainbow-ac.pl and rainbow-pr.pl.
* (rainbow-ac.pl, rainbow-pr.pl): New files from
Dayne Freitag <dayne@@cs.cmu.edu>.
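The run-once guard described for _bow_default_lexer_init() above can be sketched like this. The names default_lexer_init() and default_lexer_init_runs() are hypothetical stand-ins, and the counter exists only to make the behavior observable.

```c
/* Counts how many times the real initialization work ran. */
static int init_runs = 0;

/* May be invoked by the constructor attribute (where it works) and
   again by an explicit call; the static guard makes the second call
   a no-op. */
#if defined(__GNUC__)
__attribute__ ((constructor))
#endif
void
default_lexer_init (void)
{
  static int done = 0;          /* nonzero once initialization has run */

  if (done)
    return;
  done = 1;
  init_runs++;                  /* real setup work would go here */
}

/* Report how many times initialization actually ran. */
int
default_lexer_init_runs (void)
{
  return init_runs;
}
```

With the guard in place, an explicit call from option parsing is harmless on platforms where the constructor already ran, and sufficient on platforms where it did not.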
Tue Apr 1 10:11:03 1997 Andrew McCallum <mccallum@@jprc.com>
* rainbow-h.c (rainbowh_parse_opt): Implement option 'M' for
use_maximum_likelihood_path.
(hier_default_method): Renamed from METHOD; all uses changed.
(hier_barrel): New member NUM_NON_REST_CDOCS, to keep track of
DOC_BARREL->CDOCS->LENGTH *before* the `rest' documents start
getting added, so that we can implement
HIER_PARENT_DI_TO_CHILD_INDEX_AND_DI properly.
(hier_barrel_new): Initialize it to -1.
(hier_barrel_add_child): Set it.
(hier_barrel_new_from_text_dir_leaf): Set it.
(hier_barrel_write_to_file): Write it.
(hier_barrel_new_from_file): Read it.
(hier_parent_di_to_child_index_and_di): Use it.
(hier_barrel_print): Print it instead of DOC_BARREL->CDOCS->LENGTH.
(hier_barrel_add_stats): New function split out from
HIER_BARREL_ADD_CHILD.
(hier_barrel_add_child): Use it.
(hier_barrel_add_rest): New function.
(hier_barrel_new_from_text_dir): Call it to add `rest' documents.
(hier_barrel_test): Allocate space for 3 times as many SCORES, to make room
for the `rest' classes.
(main): Set HIER_DEFAULT_METHOD from BOW_ARGP_METHOD, if non-NULL.
* scale.c (bow_barrel_scale_weights_by_given_infogain): Only
verbosify every 100 words.
(bow_barrel_scale_weights_by_given_foilgain): Likewise.
* vpc.c (bow_barrel_set_vpc_priors_by_counting): Fix indentation.
* rainbow-h.c: Converted to do command-line argument processing
with libargp.
* opts.c (bow_options): Remove "version" 'V' option. libargp can
handle that automatically.
(_print_version): New function to print both program version and
library version.
(argp_program_version_hook): Set it to _PRINT_VERSION().
* rainbow.c (rainbow_print_usage): Function removed. Libargp does
that now.
Mon Mar 31 11:07:30 1997 Andrew McCallum <mccallum@@jprc.com>
* barrel.c (bow_barrel_set_cdoc_priors_to_class_uniform): Use
ALLOCA() instead of BOW_MALLOC() to avoid memory leak.
* Makefile.in (configure, config.status): Sprinkle with $(srcdir).
* configure.in: Move the setting of CFLAGS above AC_PROG_CC, so
that it will have an effect.
* install.texi: Mention how to set CPPFLAGS in the ./configure
line.
* vpc.c (bow_barrel_set_vpc_priors_by_counting): Properly set the
CDOC->PRIOR's.
* rainbow.c (INFOGAIN_PAIR_VECTOR_KEY): New macro.
(rainbow_options): New option "infogain-pair-vector".
(rainbow_parse_opt): Handle it.
(main): Likewise. When RAINBOW_WORD_COUNT_PRINTING, also print the
total number of words in each class.
* prind.c (bow_prind_set_weights): Get MAX_WI from MIN of
WI2DVF->SIZE and BOW_NUM_WORDS(), not just BOW_NUM_WORDS().
* opts.c (bow_uniform_class_priors): New global variable.
(bow_options): New option "uniform-class-priors".
(parse_bow_opt): Handle it.
* naivebayes.c (bow_naivebayes_set_weights): Get MAX_WI from MIN
of WI2DVF->SIZE and BOW_NUM_WORDS(), not just BOW_NUM_WORDS().
(bow_naivebayes_score): Pay attention to BOW_UNIFORM_CLASS_PRIORS.
Don't sum in score of words that don't have a DV entry!
Previously we were allowing words that `aren't in the vocabulary'
of the BARREL to contribute! This was wrong. They were
contributing according to the Laplace Estimators, and classes with
larger numbers of words were getting penalized.
* info_gain.c (bow_infogain_per_wi_new): Sum floating point
CDOC->PRIOR's instead of increment integer count of documents, so
that infogain can be calculated from documents with different
`weights'.
(bow_infogain_per_wi_new_using_pairs): New function. For now it
prints its results instead of returning them.
* barrel.c (bow_barrel_set_cdoc_priors_to_class_uniform): New
function.
* bow/libbow.h: Declare new functions.
Mon Mar 31 11:56:48 1997 Andrew McCallum <mccallum@@cs.cmu.edu>
* Makefile.in (CFLAGS, CPPFLAGS): Get values from configure.
* configure.in: Do AC_SUBST() for CPPFLAGS and CFLAGS.
Fri Mar 28 10:28:26 1997 Andrew McCallum <mccallum@@jprc.com>
* rainbow-h.c: Fix spelling: "heir" -> "hier". How embarrassing!
* dv.c (bow_dv_new_from_data_fp): Fix typo in feof() assertion.
(Reported by Doreen Cheng <dcheng@@PRPA.Philips.COM>.)
* rainbow.c (PRINT_COUNTS_FOR_WORD_KEY): New macro.
(rainbow_options): New option "print-counts-for-word".
(rainbow_parse_opt): Handle it.
(main): Implement it.
* bow/libbow.h: (bow_wi2dvf): Add new element to structure:
`num_words'.
(bow_barrel): Put `is_vpc' at end of structure instead of the
beginning.
* wi2dvf.c (bow_wi2dvf_new): Initialize NUM_WORDS.
(bow_wi2dvf_add_di_wv): Increment it.
(bow_wi2dvf_add_wi_di_count_weight): Likewise.
(bow_wi2dvf_new_from_data_fp): Likewise.
(bow_wi2dvf_remove_wi): Decrement it.
(bow_wi2dvf_print_stats): Print it.
* prind.c (bow_prind_set_weights): Use BARREL->WI2DVF->SIZE and
BARREL->WI2DVF->NUM_WORDS instead of BOW_NUM_WORDS(). In
particular, this will allow us to set the Laplace estimators using
the correct number of words in the barrel, not the arbitrary
libbow-wide vocabulary size. Properly use CDOC->WORD_COUNT
instead of overloading CDOC->NORMALIZER.
(bow_prind_score): Likewise use BARREL->WI2DVF->SIZE and
BARREL->WI2DVF->NUM_WORDS instead of BOW_NUM_WORDS().
(bow_print_word_scores): Removed to opts.c.
* opts.c (bow_print_word_scores): Global variable moved here from
prind.c.
(bow_options): New option "print-word-scores".
(parse_bow_opt): Handle it.
* naivebayes.c (bow_naivebayes_set_weights): Use
BARREL->WI2DVF->SIZE and BARREL->WI2DVF->NUM_WORDS instead of
BOW_NUM_WORDS(). In particular, this will allow us to set the
Laplace estimators using the correct number of words in the
barrel, not the arbitrary libbow-wide vocabulary size.
(bow_naivebayes_score): Likewise, and add code to print the score
contribution of each word when BOW_PRINT_WORD_SCORES is non-NULL.
(SCORE_WITH_LOG_PROBABILITIES): New macro.
* barrel.c (bow_barrel_printf): Comment out the code that would
skip over documents that are not of type `model'.
Thu Mar 27 11:29:34 1997 Andrew McCallum <mccallum@@jprc.com>
* rainbow-stats.pl: Make output labels more descriptive. Say
`average percentage accuracy'.
* split.c (bow_test_split): Use the micro-seconds field from
gettimeofday() instead of time() to set the random number
generator seed. Otherwise, if we re-call this function too
quickly we'll get exactly the same seed! ...because time()
returns a number of seconds.
* demos/script: New shell script file that will demo rainbow,
with running commentary.
* demos/data: New directory containing 20 articles from 2 newsgroups.
This is for use with demos/script.
* install.texi: Remove mention of `checks' and `examples'
directory; they don't exist. (Reported by Doreen Cheng
<dcheng@@PRPA.Philips.COM>.)
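The seeding change described for bow_test_split() above can be sketched as follows, assuming a POSIX system that provides gettimeofday(). The helper name seed_from_microseconds() is hypothetical.

```c
#include <stddef.h>
#include <sys/time.h>

/* Return a seed taken from the microseconds field of the current
   time.  Two calls within the same second still usually differ,
   unlike seeds taken from time(), which counts whole seconds. */
unsigned int
seed_from_microseconds (void)
{
  struct timeval tv;

  gettimeofday (&tv, NULL);
  return (unsigned int) tv.tv_usec;
}
```

A caller would pass the result to the generator, for example srand (seed_from_microseconds ()). The seed only spans one million values, so this suits experiment splitting rather than anything that needs unpredictability.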
Mon Mar 24 12:07:53 1997 Andrew McCallum <mccallum@@jprc.com>
* Makefile.in (rainbow-lisp.o): Use $(ALL_CPPFLAGS) and
$(ALL_CFLAGS) instead of non-ALL versions.
* rainbow.c (rainbow_lisp_setup): Rewrite for use with libargp.
* methods.c (bow_method_at_name): Fix typo.
(bow_method_at_index): Likewise.
* opts.c (parse_bow_opt): Use 'g' instead of 'N' for setting gram
size.
* rainbow.c (rainbow_lisp_query): Free the QUERY_WV before
returning!
* methods.c (bow_method_register_with_name): New function.
(bow_method_at_name): New function.
* arrow.c (PRINT_IDF_KEY): New macro.
(arrow_options): Add new option "print-idf".
(struct arrow_arg_state): New enum ARROW_PRINTING_IDF.
(arrow_index): Prune the vocabulary if
BOW_PRUNE_VOCAB_BY_OCCUR_COUNT_N is non-zero.
(main): Add code to print idf values.
* lex-simple.c (bow_alpha_lexer, bow_alpha_only_lexer,
bow_white_lexer): Initialize STEM_FUNC to 0 instead of
BOW_STEM_PORTER.
* tfidf.c (bow_tfidf_set_weights): Comment out code that sets
total_word_count. Do the DF_TRANSFORM on DF, not on IDF!
Otherwise we get negative IDF's.
* rainbow-h.c (use_maximum_likelihood_path): New global variable.
(_heir_barrel_set_node_scores): Use it.
(main): Set it when -M passed on command line.
(num_top_words): Moved from main-local variable to global.
(heir_barrel_test): Reduce vocab by infogain.
Fri Mar 21 14:02:39 1997 Andrew McCallum <mccallum@@jprc.com>
* bow/libbow.h (bow_lexer_simple): Add entry
TOSS_WORDS_LONGER_THAN.
(bow_wv_set_weights_to_count_times_idf): Declare new function.
* wv.c (bow_wv_set_weights_to_count_times_idf): New function.
* tfidf.c (bow_tfidf_set_weights): Comment out code saying that
TFIDF is broken. Rewrite the way IDF is calculated.
(bow_tfidf_score): Set and normalize the QUERY_WV weights here (even
though it is redundant) so that we can properly use the IDF from
the BARREL when normalizing weights. Normalize the QUERY_WV
weight when incrementing CURRENT_SCORE.
* prind.c (bow_prind_set_weights): Skip a document if it is not
of type model, both when setting NORMALIZER and TOTAL_TERM_COUNT,
and when setting weights.
(bow_prind_score): Skip a document if it is not of type model.
* lex-simple.c (bow_lexer_simple_postprocess_word): Add code to
toss words longer than SELF->TOSS_WORDS_LONGER_THAN. Set WORDLEN
at beginning. It appeared that it was getting used uninitialized
before!
(bow_alpha_lexer, bow_alpha_only_lexer, bow_white_lexer): Add value
for new field TOSS_WORDS_LONGER_THAN.
* opts.c (APPEND_STOPLIST_FILE_KEY): New macro.
(bow_options): Added "append-stoplist-file".
(parse_bow_opt): Handle new option.
* int4str.c (_str2id): Return the absolute value of the old return
value. Sometimes with really long strings, the return value was
going negative.
(_str_hash_lookup): Assert that ID is non-negative.
Thu Mar 20 11:47:49 1997 Andrew McCallum <mccallum@@jprc.com>
These changes by Karl Kleinpaste <karl@@jprc.com>
* int4word.c (bow_words_reread_from_file): Use fopen() instead of
bow_fopen(), so we are sure not to call abort().
* wv.c (bow_wv_sprintf): Fix function to account for length
troubles properly.
(bow_wv_sprintf_words): New function, prints the words themselves,
rather than the word indices.
* bow/libbow.h: Declare new function.
* naivebayes.c (bow_naivebayes_set_weights): Add commented-out
code that forces all counts to either 0 or 1. This was used on
some experiments with Shumeet.
* lex-html.c (bow_lexer_html_get_raw_word): Add a ! to the
FALSE_TO_END condition test, so we don't end the tokenization too
early.
Tue Mar 18 14:47:35 1997 Andrew McCallum <mccallum@@jprc.com>
* rainbow.c (rainbow_parse_opt) [ARGP_KEY_END]: Print a useful
error when only one classname is given.
(main): Check for rainbow_infogain_printing properly.
* opts.c (parse_bow_opt) [ARGP_KEY_END]: Check for the existence
of BOW_DATA_DIRNAME in a way that works even when the directory is
owned by someone else.
* bow/libbow.h (bow_fread_string): Assert that the string length
is non-negative.
* barrel.c (_bow_barrel_version): New variable.
(BOW_DEFAULT_BARREL_VERSION): New macro.
(bow_barrel_new_from_data_fp): Read the version number instead of a
null_tag.
(bow_barrel_write): Likewise, for writing.
* arrow.c (main): Remove redundant code that is now in opts.c.
Mon Mar 17 12:09:32 1997 Andrew McCallum <mccallum@@jprc.com>
* Makefile.in (%.o:%.c): Fix the order on this pattern rule.
($(DEMO_EXECUTABLES):%:%.o): Put $(DEMO_EXECUTABLES) at the beginning
of this pattern, so it matches only those files.
* arrow.c: Don't include getopt.h; we're using argp.h instead.
(arrow_index): Fix typo.
* configure.in: Don't look for getopt.h anymore. We don't need it
now that we are using libargp.
* configure.in: AC_INIT looking for int4str.c instead of libbow.h.
* Makefile.in (%): Use this pattern to make DEMO_EXECUTABLES
instead of listing them all. This avoids making all the .o's for
one of the DEMO_EXECUTABLES.
* rainbow.c: Converted to use argp command-line argument
processing.
* opts.c (bow_argp_method): Renamed from bow_default_method.
(parse_bow_opt) [ARGP_KEY_INIT]: Add words to stoplist.
* deflexer.c (_bow_default_lexer_init): Initialize
bow_default_lexer to BOW_DEFAULT_LEXER_GRAM, not BOW_LEXER_GRAM!
* bow/libbow.h (bow_argp_method): Renamed from bow_default_method.
* arrow.c (arrow_parse_opt) [q]: Set query.filename.
(arrow_index): BOW_DEFAULT_METHOD renamed to BOW_ARGP_METHOD.
* arrow.c (arrow_index): Set the method according to
BOW_DEFAULT_METHOD.
* opts.c: Fleshed out into first working version.
* error.c: Comment fix. Include libbow.h and stdio.h.
* deflexer.c (_bow_default_lexer_init): New constructor function.
(bow_default_lexer_simple, bow_default_lexer_indirect,
bow_default_lexer_gram, bow_default_lexer_html,
bow_default_lexer_email): New variables, default instantiations of
lexers.
* bow/libbow.h: Add argp declarations.
(bow_argp_children): New variable.
(bow_prune_vocab_by_infogain_n): New variable.
(bow_prune_vocab_by_occur_count_n): New variable.
(bow_default_method): New variable.
(bow_data_dirname): New variable.
* arrow.c: Convert to using argp for command-line processing.
* Makefile.in: Change all instances of `libbow.h' to `bow/libbow'.
(includedir): Add `/bow' to end.
(LIBBOW_C_FILES): Add opts.c.
(ALL_CPPFLAGS): add -I$(srcdir)/bow and -I$(srcdir)/argp.
(rainbow-lisp.o): Use $< instead of rainbow.c, so VPATH will find it
when compiling in a different directory than the source.
* bow/libbow.h (STRINGIFY): New macro.
(bow_default_lexer_simple, bow_default_lexer_indirect,
bow_default_lexer_gram, bow_default_lexer_html,
bow_default_lexer_email): Declare default instantiations of
lexers.
Fri Mar 14 11:01:14 1997 Andrew McCallum <mccallum@@jprc.com>
* Makefile.in (LIBBOW_C_FILES): Renamed defparser.c to deflexer.c.
* deflexer.c: Renamed from defparser.c.
Add the `argp' subdirectory, and incorporate it into the Makefile.
* HACKING: Add argp autoconf instruction.
* configure.in: Call AC_CONFIG_SUBDIRS to configure argp also.
* Makefile.in (ALL_LIBS): Move it closer to $(DEMO_EXECUTABLES)
target. $(DEMO_EXECUTABLES): Make this target depend on
argp/libargp.a.
(install): Call make install in argp directory also.
(dist, snapshot): Call make in argp directory to include its files too.
Wed Mar 12 20:00:27 1997 Andrew McCallum <mccallum@@jprc.com>
* Makefile.in (CPPFLAGS): Don't include $(DEFS) here, it's now in
ALL_CPPFLAGS.
* Makefile.in (ALL_CPPFLAGS): New variable.
(ALL_CFLAGS): New variable.
(.c.o): New pattern rule that uses above new variables. Now Kamal can
safely type `make CPPFLAGS=-DNDEBUG'.
* rainbow-h.c (_heir_barrel_set_node_scores): Don't threshold the
scores to 0/1.
(strdup): New function. Implement this local version to help with
debugging. Consider removing it later.
* libbow.h (bow_params_prind): Remove variable SCALE_BY_FOILGAIN.
It isn't needed since we have a function pointer for it in BOW_METHOD.
* prind.c (bow_prind_params): Remove BOW_NO for scaling.
* rainbow.c (rainbow_lisp_setup): Remove setting of
BOW_PRIND_SCALE_BY_INFOGAIN; it now defaults to on.
(rainbow_print_usage): Change the sense of -G. It now turns off
foilgain scaling, instead of turning it on. (Actually, it was the
default before this anyway.)
(main): Given -G, zero-out the SCALE_WEIGHTS entries in all the
methods.
Tue Mar 11 11:58:03 1997 Andrew McCallum <mccallum@@jprc.com>
* Version (BOW_MINOR_VERSION): Version 0.6.
* libbow.h: Likewise.
* Makefile.in (DIST_FILES): Add TODO. Remove p.inc.
(p-alpha.o, p-alonly.o, p-white.o): Targets removed.
* rainbow.c (rainbow_query): Use bow_barrel_ macros instead of
indexing into the methods structure manually.
* crossbow.c: Add copyright info.
* readme.texi: Fill out.
* libbow-desc.texi: Add description.
* install.texi: Add pointer to the README. Say that it requires GCC.
* HACKING: Update CVS repository machine name.
* tfidf.c (bow_tfidf_set_weights): Insert disclaimer explaining
that TFIDF is broken.
Tue Mar 11 11:31:39 1997 Rahul Sukthankar <rahuls@@syzygy.jprc.com>
* Makefile.in (DEMO_C_FILE): Added crossbow.c.
* crossbow.c: New file.
Mon Mar 10 18:52:03 1997 Andrew McCallum <mccallum@@jprc.com>
* int4str.c (_str2id): Keep return value smaller using modulus.
This fixes bug Rosie Jones encountered with negative hash values.
(_str_hash_lookup): Assert that H is non-negative.
* Makefile.in (LIBBOW_C_FILES): Added lex-email.c.
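The negative-hash problem fixed in _str2id() above is typical of string hashes accumulated in a signed int. One way to keep the value non-negative with a modulus is sketched below. The function str_to_id() and its multiplier are hypothetical and are not the hash libbow uses.

```c
#include <limits.h>

/* Hash S to a non-negative int.  The accumulator is unsigned, so it
   wraps instead of overflowing, and the modulus keeps every
   intermediate value below INT_MAX. */
int
str_to_id (const char *s)
{
  unsigned long h = 0;

  while (*s)
    h = (h * 31UL + (unsigned char) *s++) % (unsigned long) INT_MAX;
  return (int) h;
}
```

Reducing at every step, rather than taking an absolute value at the end, also avoids the corner case where negating the most negative int is undefined.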
Fri Mar 7 10:54:09 1997 Andrew McCallum <mccallum@@jprc.com>
* int4word.c (bow_words_reread_from_file): Make sure LAST_FILE is
non-NULL.
Tue, 18 Feb 1997 20:15:42 -0500 Jason Rennie <jr6b@@andrew.cmu.edu>
* lex-email.c: New file. Created lexer for e-mail/newsgroup messages.
* lex-html.c: Changed code to allow words separated by HTML tags
to be tokenized as single words. <FONT SIZE=+2>B</FONT>ig is now
tokenized as "Big". Nested brackets are now ignored. This should
more closely model the way HTML is interpreted.
* rainbow.c: Added rainbow_email_lexer as a bow_lexer_indirect.
Added '-M' option to allow user to make use of
rainbow_email_lexer. rainbow_email_lexer will remove
"Newsgroups:" and "Path:" headers from message.
* libbow.h (bow_email_headers_to_remove): Declare new global variable.
(bow_email_lexer): Likewise.
Tue Mar 4 11:51:53 1997 Andrew McCallum <mccallum@@jprc.com>
* libbow.h (bow_barrel_scale_weights): Don't call underlying
function if it's NULL.
(bow_barrel_normalize_weights): Likewise.
(bow_wv_set_weights): Likewise.
(bow_wv_normalize_weights): Likewise.
* vpc.c (bow_barrel_new_vpc_merge_then_weight): Use macros for
weight setting.
(bow_barrel_new_vpc_weight_then_merge): Likewise.
* rainbow-h.c (_heir_barrel_cdoc_write): Write WORD_COUNT.
(_heir_barrel_cdoc_read): Read it.
(heir_dir_is_leaf): Check the return status from CHDIR(), and print
appropriate error message.
(heir_barrel_keep_top_words_by_infogain): Return immediately if
num_words_to_keep is 0 or the children count is 0.
(heir_barrel_set_vpc_with_weights): Return immediately if the children
count is 0.
(_heir_barrel_set_node_scores): Add temporary #if'ed code to make
score either 1 or 0, so winner takes all.
(heir_barrel_score_recurse): New argument DEPTH. All callers changed.
(main): Change default NUM_TOP_WORDS from 3000 to 0. Add new command
line argument -m and -N.
Changes made with Sean Slattery.
* naivebayes.c (bow_naivebayes_set_weights): Store class-wide word
count in CDOC->WORD_COUNT instead of overloading CDOC->NORMALIZER.
(bow_naivebayes_score): Use CDOC->WORD_COUNT instead of
CDOC_NORMALIZER. Use it to fix PR_W_C in case where that word
doesn't appear in the class. Instead of (1.0 / MAX_WI) use (1.0 /
(MAX_WI + CDOC->WORD_COUNT)). Don't normalize the weight by
CDOC->NORMALIZER because it is already normalized
correctly, including the words that don't appear in the class.
(bow_method_naivebayes): Change the weight normalizing function from
BOW_NORMALIZE_WEIGHTS_BY_SUMMING to NULL, because we don't use
CDOC->NORMALIZER anymore.
* libbow.h (bow_cdoc): Add WORD_COUNT.
* barrel.c (_bow_barrel_cdoc_write): Write WORD_COUNT.
(_bow_barrel_cdoc_read): Read it.
* vpc.c (bow_barrel_new_vpc): Assert MAX_CI is positive, otherwise
this means we didn't find any classes.
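The corrected estimate for bow_naivebayes_score() above, 1.0 / (MAX_WI + CDOC->WORD_COUNT) for a word that never appears in the class, matches the zero-count case of an add-one (Laplace) estimate. The sketch below assumes that seen words use (count + 1) in the numerator, which the entry does not state; laplace_pr_w_c() and its parameters are hypothetical names.

```c
/* Add-one estimate of Pr(w|C): COUNT occurrences of the word in the
   class, VOCAB_SIZE distinct words (MAX_WI), CLASS_WORD_COUNT total
   word occurrences in the class.  With COUNT == 0 this reduces to
   1.0 / (VOCAB_SIZE + CLASS_WORD_COUNT). */
double
laplace_pr_w_c (int count, int vocab_size, int class_word_count)
{
  return (count + 1.0) / ((double) vocab_size + class_word_count);
}
```

For example, with a 100-word vocabulary and 900 word occurrences in the class, an unseen word gets probability 1/1000 instead of the 1/100 that the old (1.0 / MAX_WI) formula would give.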
Wed Feb 26 11:08:50 1997 Andrew McCallum <mccallum@@jprc.com>
* HACKING: Fix sandbox's name.
Wed Feb 19 11:27:55 1997 Andrew McCallum <mccallum@@jprc.com>
* barrel.c (bow_barrel_keep_top_words_by_infogain): Return
immediately if NUM_WORDS_TO_KEEP is 0.
Tue Feb 18 13:39:34 1997 Andrew McCallum <mccallum@@jprc.com>
* libbow.h (bow_str_to_method_id): Use a temporary variable, so we
can use statements like ARGI++ as an argument.
* info_gain.c (bow_infogain_per_wi_new): Change assertion to
handle round-off error.
* rainbow-h.c: Include <math.h>, <time.h>.
(heir_barrel): Add components INDEX_IN_PARENT, NUM_LEAVES, FULL_NAME.
(heir_barrel_new): Set them.
(heir_dir_is_leaf): Use chdir() so that symlinks are dealt with
properly. Free() the results of scandir().
(heir_barrel_new_from_text_dir_leaf): Set FULL_NAME and add assertions.
(_heir_barrel_new_from_text_dir_recurse): New parameter PARENT_NAME.
Move the chdir() to handle symlinks properly. Don't make a
SUBDIRNAME.
(heir_barrel_new_from_text_dir): New function.
(heir_barrel_write_to_file): Write new heir_barrel components.
(heir_barrel_new_from_file): Read them.
(heir_barrel_free): Free FULL_NAME.
(heir_barrel_keep_top_words_by_infogain): New function.
(heir_parent_di_to_child_index_and_di): New function.
(heir_di_to_classname): New function.
(heir_barrel_test_split): New function.
(_heir_barrel_set_node_scores): Use bow_barrel_score() instead of
bow_get_best_matches().
(heir_barrel_print_scores_recurse): Return void not int. Print all on
same line.
(heir_barrel_score_recurse): New function.
(heir_barrel_score): New function.
(heir_barrel_test): New function.
(heir_barrel_print_weight_vectors): Change formatting.
(set_vocabulary_from_file): New function (unused).
(main): Allow user to set DATADIR (-d) and NUM_TOP_WORDS (-T),
test (-t).
Compile with -Wall.
Mon Feb 17 10:36:32 1997 Andrew McCallum <mccallum@@jprc.com>
* configure.in: Remove check for <float.h>, all ANSI compilers
should have it.
* split.c: Remove SunOS declarations of rand() and srand().
(RAND_MAX): Define macro, if not already defined. These two
changes needed to compile on SunOS.
* naivebayes.c (bow_naivebayes_set_weights): Uncomment assertion
about METHOD->ID.
* rainbow.c (rainbow_lisp_setup): Add `-N' to effective arguments.
(rainbow_lisp_query): Fix typo in BOW_FOPEN() call.
* rainbow.c (rainbow_query): Check for QUERY_WV being NULL, and
output more useful messages in that case.
* naivebayes.c (bow_naivebayes_score): Rearrange the code for
stepping through a DV so we always get the CDOC. This change
should have no effect on the outcome.
* lex-simple.c (bow_lexer_simple_open_text_fp): Fix test for
matching END_PATTERN_PTR. Don't push the DOCUMENT_END_PATTERN
back on the input stream after we find it; this is a stylistic
choice.
* docnames.c (bow_map_filenames_from_dir): Pass relative instead
of absolute directory names to recursive calls. Before I was
having trouble with symbolic links. This seems to fix it.
* int4word.c (bow_words_keep_top_by_infogain): Fix assertions; it's
OK to have infogain equal to 0.
* prind.c: Comment fixes.
* foilgain.c (bow_foilgain_per_wi_ci_new): Use malloc() for
POS_PER_WI_CI and NEG_PER_WI_CI, instead of using stack. We were
overflowing the stack before.
Tue Feb 11 12:15:30 1997 Andrew McCallum <mccallum@@jprc.com>
* naivebayes.c (bow_naivebayes_score): When word doesn't appear in
the class vector, make Pr(w|C) include CDOC->NORMALIZER.
(Suggested by Sean Slattery).
* naivebayes.c (bow_naivebayes_score): Fix constant in assertion.
* configure.in: When perl5 isn't found, PERL will be "", not ":".
Deal with it properly.
* libbow.h: Don't bother with HAVE_FLOAT, just always include
<float.h>.
(bow_get_best_matches): Remove declaration. The function no longer
exists.
* rainbow.c (rainbow_test): Use macros for accessing method
functions.
* split.c: Fix author comment.
* info_gain.c (bow_infogain_per_wi_new): Use double instead of
float, because before we were losing resolution and getting
negative IG's.
(bow_entropy): Likewise.
Mon Feb 10 16:25:04 1997 Andrew McCallum <mccallum@@jprc.com>
* docnames.c (bow_map_filenames_from_dir): Use perror() when can't
open directory.
Fri Feb 7 11:00:50 1997 Andrew McCallum <mccallum@@jprc.com>
* int4word.c (bow_words_read_from_file): Fix typo.
* rainbow.c (rainbow_lisp_query): Use bow_barrel_score instead of
bow_get_best_matches.
These changes by Tony Brusseau <brusseau@@jprc.com>, with
modifications by <mccallum@@jprc.com>.
* wv.c (bow_wv_new_from_text_string): New function.
(bow_wv_sprintf): New function.
* int4word.c (bow_words_set_map): Add new argument indicating if
old map should be freed. All callers changed.
(bow_words_reread_from_file): New function.
* docnames.c: Include <stdio.h>.
(bow_map_filenames_from_dir): Add WindowsNT backslashes to first
assertion.
* libbow.h: Declare new functions.
Thu Feb 6 18:33:21 1997 Andrew McCallum <mccallum@@jprc.com>
* rainbow.c: Updated for below library changes.
* arrow.c: Likewise.
* libbow.h: Declare many new functions, variables and types,
including:
(bow_boolean): New type.
(bow_wv_set_weights_to_count): New function declaration.
(bow_wv_normalize_weights_by_vector_length): Likewise.
(bow_wv_normalize_weights_by_summing): Likewise.
(bow_str_to_method_id): Macro renamed from bow_str2method.
(bow_method_id): New enum, replacing bow_method.
(bow_method): Now a struct.
(bow_barrel_set_weights, bow_barrel_scale_weights,
bow_barrel_normalize_weights, bow_new_vpc_with_weights,
bow_barrel_score, bow_wv_set_weights, bow_wv_normalize_weights):
New macros.
(bow_methods): New global variable declaration.
(bow_params_*): New types.
(bow_score): Renamed from bow_doc_score.
* wv.c (bow_wv_set_weights_to_count): New function.
* weight.c: File removed.
* vpc.c (bow_barrel_new_vpc_merge_then_weight): New function.
(bow_barrel_new_vpc_weight_then_merge): New function.
(bow_barrel_set_vpc_priors_by_counting): Renamed from
_bow_barrels_set_naivebayes_vpc_priors.
* tfidf.c: File contents totally replaced to implement TFIDF.
Functions removed from weight.c and other places.
(bow_tfidf_set_weights): Function renamed.
(bow_tfidf_score): Function renamed from bow_get_best_matches().
(bow_tfidf_params_{words,log_words,log_occur}): New variables.
(bow_method_tfidf_{words,log_words,log_occur}): New global variables.
* prind.c (bow_prind_uniform_priors): Global variable removed.
(bow_prind_scale_by_infogain): Likewise.
(bow_prind_normalize_scores): Likewise.
(bow_prind_set_weights): Renamed from _bow_barrel_set_prind_weights.
(bow_prind_score): Renamed from _bow_score_prind_from_wv, and updated
for library changes.
(bow_prind_params): New variable.
(bow_method_prind): New global variable.
* score.c: File removed.
* naivebayes.c (_bow_barrels_set_naivebayes_vpc_priors): Function
removed. Replacement in vpc.c.
(bow_naivebayes_set_weights): Minor updates for library changes.
(bow_naivebayes_params): New variable.
(bow_method_naivebayes): New global variable.
* info_gain.c (bow_barrel_scale_weights_by_info_gain): Function
removed. Replacement is now in scale.c.
(bow_barrel_scale_weights_by_foilgain): Likewise.
(bow_foilgain_per_wi_ci_new): Likewise. Replacement now in foilgain.c.
(bow_foilgain_free): Likewise.
* barrel.c (bow_barrel_new): Make the default method naivebayes,
instead of tfidf_log_occur.
(bow_barrel_new_from_data_fp): Get the METHOD pointer from
BOW_METHODS.
(bow_barrel_write): Write the ID.
* Makefile.in (CPPFLAGS): Add $(DEFS).
(ALL_INCLUDE_FLAGS, ALL_CPPFLAGS, ALL_CFLAGS, ALL_LDFLAGS): Variables
removed.
(LIBBOW_C_FILES): Added foilgain.c, methods.c, normalize.c, scale.c,
tfidf.c. Removed score.c, weight.c.
(DEMO_EXECUTABLES): Don't use ALL_LDFLAGS.
Tue Feb 4 14:21:08 1997 Andrew McCallum <mccallum@@jprc.com>
* split.c (bow_test_split): Properly deal with the fact that rand()
returns an int between 0 and RAND_MAX, and the previously-used
drand48() returned a double between 0 and 1.
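The range mismatch described above can be illustrated with a minimal sketch. This is not the split.c code: the helper names `unit_interval_rand` and `choose_for_test` are hypothetical, and the sketch only assumes that the caller wants a value in [0, 1) like the one drand48() used to supply.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not part of libbow: map the integer returned by
   rand(), which lies in [0, RAND_MAX], onto the half-open interval
   [0, 1) that drand48() used to return.  */
static double
unit_interval_rand (void)
{
  double r = rand () / ((double) RAND_MAX + 1.0);
  assert (r >= 0.0 && r < 1.0);
  return r;
}

/* Hypothetical helper: return nonzero when a document should be held
   out for testing, given the percentage of documents to hold out.  */
static int
choose_for_test (int test_percentage)
{
  return unit_interval_rand () * 100.0 < test_percentage;
}
```

Comparing the raw result of rand() against a fraction, as one would with drand48(), would almost always fail, which is the kind of error the entry above guards against.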
Mon Feb 3 12:44:18 1997 Andrew McCallum <mccallum@@jprc.com>
Following changes made by Tony Brusseau for WindowsNT compatibility.
* configure.in: Check for <float.h>.
* libbow.h: Include float.h if we have it; otherwise include
values.h and redefine its macros.
(htonl, htons, ntohl, ntohs): Temporarily define as identity for
WindowsNT.
(bow_fwrite_string): Cast sizeof() to int.
* split.c: Use rand() and srand() instead of drand48() and
srand48().
* rainbow.c (rainbow_test_files): Make DIRLEN unsigned, to avoid
warning under WindowsNT.
* primes.c: Use unsigned int's instead of int's in several places,
to avoid warnings under WindowsNT.
* dv.c: Don't include <netinet/in.h>. Use <limits.h>-style
MAX'es.
* naivebayes.c: Use <limits.h>-style MAX'es.
* int4word.c: Likewise.
* configure.in: Look for the wsock32 library.
* prind.c (_bow_score_prind_from_wv): Don't die if SCORES_SUM is
zero; just leave zero scores on all classes.
* rainbow.c (main): Add `s' to getopt call.
(rainbow_index): Add newline to end of "No text files" message.
Fri Jan 31 11:28:54 1997 Andrew McCallum <mccallum@@jprc.com>
* Makefile.in (rainbow-lisp.o): New target.
* rainbow.c (rainbow_lisp_setup): New function. (From Kamal.)
Surround this and rainbow_lisp_query by #if RAINBOW_LISP.
(main): Surround by #if !RAINBOW_LISP.
* HACKING: Change the cvsroot in directions for networked
`pserver' use.
* libbow.h: If under WinNT, include <winnt.h>, otherwise include
<values.h>.
* bitvec.c (BITSPERBYTE): Surround it with an #ifndef.
* prind.c: Don't include <values.h>.
* naivebayes.c: Likewise.
* int4word.c: Likewise.
* dv.c: Likewise.
* bitvec.c: Likewise.
(BITSPERBYTE): New macro.
Wed Jan 29 10:10:11 1997 Andrew McCallum <mccallum@@jprc.com>
* rainbow.c (rainbow_lisp_query): New function.
* Makefile.in (maintainer-clean): Add config.* files.
* configure.in: Add quotes around $PERL so test will still work if
$PERL is empty.
* rainbow-h.c (_heir_barrel_set_node_scores): Add some temporary
progress printing.
(heir_barrel_print_scores_recurse): Renamed.
(heir_barrel_print_scores): New function.
(heir_barrel_print_foilgain): New function.
(heir_barrel_print_weight_vectors): New function.
(main): Add new options for calling new functions.
* prind.c (_bow_score_prind_from_wv): Add three checks for NaN.
* int4word.c (bow_words_write_to_file): New function.
(bow_words_read_from_file): New function.
* libbow.h: Declare new word functions.
* rainbow-h.c (method): New global variable.
(heir_barrel): Added component SCORE.
(heir_barrel_new_from_text_dir_leaf): Set doc_barrel method.
(heir_barrel_new_from_text_dir): Likewise.
(main): Take -i and -q arguments.
* libbow.h (bow_fwrite_string): Change type of LEN from int to
short.
* vpc.c (bow_barrel_new_vpc): Don't abort if CLASSNAMES is NULL.
Tue Jan 28 15:49:46 1997 Andrew McCallum <mccallum@@jprc.com>
* rainbow-h.c: New file.
* barrel.c (bow_barrel_write): Handle case in which BARREL is NULL.
(bow_barrel_new_from_data_fp): Likewise. (xxx Although there is some
strangeness with FGETC returning -1, which I am currently
ignoring. I should look at this again...)
* defparser.c (bow_default_lexer): Temporary fix to confusion
about constant initializers and pointers.
* Makefile.in: Fix copyright.
* barrel.c: Fix header comment.
(bow_barrel_printf): Print the word as well as the word index.
* docnames.c (bow_map_filenames_from_dir): Remove commented-out
code for checking whether the file contains text.
Thu Jan 23 13:47:01 1997 Andrew McCallum <mccallum@@cs.cmu.edu>
* barrel.c (bow_barrel_printf): Don't print with paren's, so it
will be easier to process with AWK.
* rainbow.c (main): Add new option -B for printing barrel word
vectors in ASCII.
(rainbow_print_usage): Document it.
* barrel.c (bow_barrel_printf): New function.
* libbow.h: Declare new barrel function.
Sat Jan 18 08:14:59 1997 Andrew McCallum <mccallum@@cs.cmu.edu>
* rainbow.c (main): Add -N option for turning off normalization of
PrInd scores by setting BOW_PRIND_NORMALIZE_SCORES.
(rainbow_print_usage): Document it.
* vpc.c (FOILGAIN): New macro, defined to be 1. Switching back to
doing foilgain by default.
* prind.c (bow_prind_normalize_scores): Change to 1. Now
normalizing scores by default.
* libbow.h: Declare prind normalization global variable.
* vpc.c (bow_barrel_new_vpc_with_weights): Condition choice of
weight-scaling on FOILGAIN. Now default is to do scaling by
information gain (again).
* prind.c (bow_prind_normalize_scores): New global variable.
Default: 0, don't normalize. Note, this is different than what we
were doing before, and the default should be changed back to 1.
(_bow_score_prind_from_wv): Use it.
* rainbow.c (rainbow_index): Don't exit with error when a class
directory is empty, just print a message.
(rainbow_usage): Fix description of -G to Foil-gain, not info-gain.
* prind.c: Improve formatting of score printing.
* rainbow.c (main): New command-line argument `-P' sets
BOW_PRINT_WORD_SCORES.
(rainbow_print_usage): Document it.
* prind.c (bow_print_word_scores): Define new global variable.
(_bow_score_prind_from_wv): Use it to decide when to print
per-word/class score information.
* libbow.h (bow_print_word_scores): Declare new global variable.
Fri Jan 17 09:39:30 1997 Andrew McCallum <mccallum@@cs.cmu.edu>
* rainbow.c (printing_class): Global variable renamed from
weight_vector_printing_class.
(rainbow_print_foilgain): New function.
(main): Call it, and add -F option for doing so.
(rainbow_print_usage): Document it.
* vpc.c (bow_barrel_new_vpc_with_weights): Use new foilgain
function for PrInd, instead of infogain function.
* info_gain.c (bow_barrel_scale_weights_by_info_gain): Set max_wi!
Previously it was uninitialized.
(bow_foilgain_per_wi_ci_new): New function.
(bow_foilgain_free): New function.
(bow_barrel_scale_weights_by_foilgain): New function.
* libbow.h: Declare new foilgain functions.
* docnames.c (bow_map_filenames_from_dir): Add assertion that
checks for conditions under which directory-vs-file detection is
unreliable. I should figure out why this isn't working as
expected.
* prind.c (bow_prind_scale_by_infogain): New global variable.
* libbow.h (bow_prind_scale_by_infogain): New declared global
variable.
* rainbow.c (main): Add -G for setting
bow_prind_scale_by_infogain.
* vpc.c (bow_barrel_new_vpc_with_weights): Use
BOW_PRIND_SCALE_BY_INFOGAIN.
* rainbow.c (rainbow_query): Only re-build the
RAINBOW_CLASS_BARREL if -m or -T arguments require its change.
Thu Jan 16 13:35:31 1997 Andrew McCallum <mccallum@@cs.cmu.edu>
* vpc.c (bow_barrel_new_vpc_with_weights): For PrInd, scale
weights by information gain. (Oooh, this may not be kosher;
perhaps remove it later.)
* rainbow.c (rainbow_print_weight_vector): Multiply the weight by
its normalizer.
* info_gain.c (bow_barrel_scale_weights_by_info_gain): Leave more
space for verbosifying progress, because the numbers are big.
* prind.c (_bow_barrel_set_prind_weights): Remove the information
gain scaling, i.e. undo previous change.
* info_gain.c (bow_barrel_scale_weights_by_info_gain): Change the
arguments so that the information gain array is passed in, not
calculated inside the function.
* libbow.h: Change arguments to weight info gain scaling function.
* naivebayes.c (_bow_score_naivebayes_from_wv): Scale DV WEIGHT by
CDOC NORMALIZER!
* prind.c (_bow_score_prind_from_wv): Scale DV WEIGHT by CDOC
NORMALIZER!
* rainbow.c (rainbow_query): If score is less than 1e-35, then
just print zero. Do this for the sake of CommonLisp, which can't
read numbers smaller than 1e-35.
(rainbow_test): Likewise.
(rainbow_test_files): Likewise.
Wed Jan 15 10:52:13 1997 Andrew McCallum <mccallum@@cs.cmu.edu>
* scan.c (bow_scan_fp_for_string): Change the for() to a while()
to clean up the handling of STRING_PTR incrementation.
* rainbow.c (method): Fix its handling from last change.
(rainbow_print_weight_vector): New function.
(main): Call it.
(rainbow_print_usage): Add -W.
* prind.c (_bow_barrel_set_prind_weights): Added comment.
* lex-simple.c (bow_lexer_simple_get_raw_word): Back up
DOCUMENT_POSITION to point at terminating character. Add more
comments.
* int4str.c (_str_hash_lookup): Add assertion.
(_str_hash_add): Add assertions.
(bow_str2int): Increment MAP->STR_ARRAY_LENGTH first, then return
the value minus 1, instead of returning the value++. This makes the
intermediate calls more clear and safe.
* barrel.c (bow_barrel_keep_top_words_by_infogain): Improve
verbosity and reduce number of times printed.
* lex-simple.c (bow_lexer_simple_open_text_fp): When checking to
see if we should realloc to increase the DOCUMENT buffer size,
make sure we leave room for the terminating '\0' that we'll add
later in the function! (Wow! This was a wild bug that's been
around for a while, but only recently caused occasional crashes.
The crashes were in totally unrelated functions in int4str.c. The
GDB `watch' command came to the rescue!)
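A minimal sketch of the buffer-growth rule described above, under stated assumptions: `read_document` is a hypothetical function, not the lex-simple.c code, and it reads a whole stream rather than stopping at an end pattern. The point is only the realloc test, which reserves one byte for the terminator.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch: read all of FP into a malloc'ed buffer and
   NUL-terminate it.  Stores the number of bytes read in *LENGTH_OUT
   when that pointer is non-NULL.  Returns NULL if allocation fails.  */
static char *
read_document (FILE *fp, size_t *length_out)
{
  size_t capacity = 16, length = 0;
  char *buf = malloc (capacity);
  int c;

  if (!buf)
    return NULL;
  while ((c = fgetc (fp)) != EOF)
    {
      /* Grow when there is no room for this byte AND the final '\0'.
         Testing only `length >= capacity' would let the terminator
         be written one byte past the end of the buffer.  */
      if (length + 1 >= capacity)
        {
          char *bigger = realloc (buf, capacity * 2);
          if (!bigger)
            {
              free (buf);
              return NULL;
            }
          buf = bigger;
          capacity *= 2;
        }
      buf[length++] = (char) c;
    }
  assert (length < capacity);
  buf[length] = '\0';
  if (length_out)
    *length_out = length;
  return buf;
}
```

A one-byte overrun of this kind corrupts whatever the allocator placed next to the buffer, which is consistent with the crashes showing up in unrelated code.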
* scan.c (bow_scan_fp_for_string): Make it ignore Carriage-Return
'\r' characters, so we can reliably scan for MIME header
separators.
* barrel.c (bow_barrel_keep_top_words_by_infogain): Make it more
efficient by using qsort().
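As an illustration of selecting top words with qsort(), here is a minimal sketch. The `struct wi_gain` type and both function names are hypothetical and do not come from barrel.c; the sketch assumes the information gain of each word index has already been computed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pairing of a word index with its information gain.  */
struct wi_gain
{
  int wi;
  double gain;
};

/* qsort() comparison function: larger gains sort first.  */
static int
compare_gain_descending (const void *a, const void *b)
{
  double ga = ((const struct wi_gain *) a)->gain;
  double gb = ((const struct wi_gain *) b)->gain;
  return (ga < gb) - (ga > gb);
}

/* Sort WORDS by decreasing gain and return how many leading entries
   to keep, which is the smaller of NUM_WORDS and NUM_TO_KEEP.  */
static int
keep_top_by_gain (struct wi_gain *words, int num_words, int num_to_keep)
{
  assert (num_words >= 0 && num_to_keep >= 0);
  qsort (words, (size_t) num_words, sizeof *words, compare_gain_descending);
  return num_to_keep < num_words ? num_to_keep : num_words;
}
```

After the call, the first entries of the array hold the word indices to keep; the rest can be removed in one pass.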
* rainbow.c (method): Initialize to -1.
(DEFAULT_METHOD): New macro, equal to bow_method_naivebayes.
(rainbow_index): Use new macro.
(rainbow_query): Cause `-m' to have an effect here.
(rainbow_test): Likewise.
(rainbow_test_files): Likewise.
(rainbow_print_usage): Rearrange flags to reflect new contexts in
which -T, -m, and -U are valid.
(main): Don't use the length of argv[0] to determine value of
BOW_VERBOSITY_USE_BACKSPACE.
Tue Jan 14 09:35:20 1997 Andrew McCallum <mccallum@@cs.cmu.edu>
* prind.m: Comment fixes.
* rainbow.c (test_percentage): Initialize it to 0, not 30.
(DEFAULT_TEST_PERCENTAGE): New macro, equal to 30.
(rainbow_test): Set TEST_PERCENTAGE to DEFAULT_TEST_PERCENTAGE if it's
zero. Set RAINBOW_CLASS_BARREL->METHOD to METHOD so we can give
the -m option when using -t.
(rainbow_test_files): Use TEST_PERCENTAGE and NUM_TEST_DOCS to
determine how many training examples to ignore. Set
RAINBOW_CLASS_BARREL->METHOD to METHOD so we can give the -m
option when using -t.
* dv.c (bow_dv_add_di_count_weight): Add parens around MAXSHORT,
which seems to be needed on SunOS.
* dv.c (_bow_dv_index_for_di): Reverse direction of for()-loop
that scoots document entries up to make room! (Reported by Kamal
Nigam.)
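The loop-direction bug described above can be shown with a minimal sketch. The `struct entry` type and `insert_sorted` are hypothetical stand-ins, not the dv.c code; the sketch assumes the array is sorted by document index, has room for one more entry, and does not already contain DI.

```c
#include <assert.h>

/* Hypothetical stand-in for one document entry in a document vector.  */
struct entry
{
  int di;
  int count;
};

/* Insert DI into ENTRIES, which holds *LENGTH entries sorted by DI and
   has room for at least one more.  Returns the index of the new entry.  */
static int
insert_sorted (struct entry *entries, int *length, int di)
{
  int pos = 0, i;

  assert (entries && length && *length >= 0);
  while (pos < *length && entries[pos].di < di)
    pos++;
  /* Shift from the END toward POS.  Walking the other way copies
     entries[pos] over every later entry and destroys them.  */
  for (i = *length; i > pos; i--)
    entries[i] = entries[i - 1];
  entries[pos].di = di;
  entries[pos].count = 0;
  (*length)++;
  return pos;
}
```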
* wi2dvf.c (bow_wi2dvf_dv): Gracefully handle arithmetic overflow
in COUNT. Print a warning the first time it happens.
* rainbow.c (rainbow_test_files): Close the FP in the nested
function TEST_FILE! Append "/" to the DIR. Use
FILENAME_TO_CLASSNAME when setting CURRENT_CLASS.
(main): Add 'x' to the getopt call.
* configure.in: Look for perl5 before looking for perl.
Mon Jan 13 09:51:32 1997 Andrew McCallum <mccallum@@cs.cmu.edu>
* rainbow.c (rainbow_test_files): New function.
(main): Added local variable WHAT_DOING; use it. Add command-line
option -x; call rainbow_test_files. Add command-line option -b.
* lex-html.c: Use bow_verbose, instead of bow_quiet to print
message about unterminated `<'.
* docnames.c (bow_map_filenames_from_dir): Make it work even when
DIRNAME is actually a filename.
Sun Jan 12 12:37:32 1997 Andrew McCallum <mccallum@@cs.cmu.edu>
* rainbow.c (filename_to_classname): Make it work even when there
isn't a `/' in the FILENAME.
(rainbow_index): Use filename_to_classname().
* install.texi: Mention the need for GNU make.
Add missing @@end enumerate.
* Makefile.in (default): Make it depend on all the
DEMO_EXECUTABLES and the PERL_RUNNABLE_FILES, instead of just
rainbow.
* libbow.h: Update copyright.
* primes.c (_bow_nextprime): Replace bzero by memset, for SunOS.
* prind.c (_bow_barrel_set_prind_weights): Remove warning about
old do-nothing loop. Remove the loop.
* rainbow.c (rainbow_query): Set NUM_HITS_TO_SHOW equal to the
number of classes, instead of just 2. Simplify output so it is
more machine readable.
(main): Require a length of 39, not 10, for argv[0] in order to turn
off BOW_VERBOSITY_USE_BACKSPACE. (This was a hack so we don't get
a lot of \b's inside gdb inside emacs.) Fix the getopt string to
include a `:' after `v'.
* Makefile.in (snapshot): cvs tag the repository.
* rainbow.c: Deal with systems that don't have getopt.h.
* weight.c (_bow_add_to_normalizer_total): Add case for
BOW_METHOD_PRIND.
(_bow_total_to_normalizer): Likewise.
Sat Jan 11 17:57:16 1997 Andrew McCallum <mccallum@@cs.cmu.edu>
* arrow.c: Fix typo in last change.
Fri Jan 10 11:03:51 1997 Andrew McCallum <mccallum@@cs.cmu.edu>
* configure.in: Look for getopt.h.
* arrow.c: Deal with systems that don't have getopt.h.
* rainbow-stats.pl (overall_accuracy): Print stderr for both
verbosity levels!
Thu Jan 9 11:46:41 1997 Andrew McCallum <mccallum@@cs.cmu.edu>
* rainbow-stats.pl (overall_accuracy): Print standard error, not
standard deviation.
Wed Jan 8 11:19:13 1997 Andrew McCallum <mccallum@@cs.cmu.edu>
* info_gain.c (bow_infogain_per_wi_new): Assert info gain is >= 0,
not > 0.
* barrel.c (bow_barrel_keep_top_words_by_infogain): Likewise.
* rainbow.c (rainbow_index): Rearrange
REUSE_ARCHIVED_BARREL_COUNTS logic so it works now.
* vpc.c (bow_barrel_new_vpc): Don't assert DV, just continue if
it's NULL.
Tue Jan 7 09:45:54 1997 Andrew McCallum <mccallum@@cs.cmu.edu>
* libbow.h: Declare new barrel info gain function.
Mon Jan 6 10:28:17 1997 Andrew McCallum <mccallum@@cs.cmu.edu>
* barrel.c (bow_barrel_keep_top_words_by_infogain): New function.
* lex-gram.c (bow_lexer_gram_open_text_fp): Return NULL if LEX is
NULL.
* wi2dvf.c (bow_wi2dvf_remove_wi): New function.
(bow_wi2dvf_write): Use new SEEK_START convention, in which it is
-1 when DV is NULL; previously, when DV was NULL, it was equal to
the previous SEEK_START.
(bow_wi2dvf_new_from_data_fp): Likewise.
* libbow.h: Declare new wi2dvf function.
* rainbow.c, weight.c, vpc.c, score.c, libbow.h: Separate
PrTFIDF from PrInd (Fuhr's Probabilistic Indexing).
* prind.c: New file, for Fuhr's Probabilistic Indexing method.
* rainbow.c (rainbow_index): Prune words by info gain in barrel,
not in the word vocabulary, so that `-L' can work properly.
Fri Jan 3 12:56:53 1997 Andrew McCallum <mccallum@@pad>
* rainbow.c (main): Add the -L option, for turning off lexing the
text files, and instead using the word counts in the archived
barrel.
(rainbow_print_usage): Likewise.
(reuse_archived_barrel_counts): New global variable, controlling this.
* lex-html.c (bow_lexer_html_get_word): Change type of argument
SELF to match BOW_LEXER.
* rainbow.c (main): Add the -s option, for turning off use of the
stoplist.
(rainbow_print_usage): Likewise.
* rainbow.c (main): Add the -U option, for turning off uniform
priors in PrTFIDF.
(rainbow_print_usage): Likewise.
* rainbow-stats.pl: Add test of $#ARGV to the `-s' test, so it
actually works the way it's supposed to.
Wed Jan 1 16:54:28 1997 Andrew McCallum <mccallum@@pad>
* lex-html.c (bow_lexer_html_get_raw_word): Print warning when we
find an unterminated open bracket `<'. Verbosify about close
bracket warning with priority of BOW_VERBOSE, not BOW_PROGRESS.
`rainbow -i -H -S' now seems to be working.
* rainbow.c (rainbow_underlying_lexer): New global variable.
(rainbow_html_lexer): New global variable.
(rainbow_print_usage): Overhauled to accurately describe the valid
arguments.
(main): Rearrange and clean up argument handling.
* libbow.h: Change type of argument SELF in BOW_LEXER_SIMPLE
word-getting subfunctions.
* lex-simple.c (bow_lexer_simple_open_text_fp): Deal with EOF in
FP. Deal with zero-length documents. After we find END_PATTERN,
move the DOCUMENT_POINTER back to the beginning of the
END_PATTERN.
(bow_lexer_simple_postprocess_word): Change type of SELF from
BOW_LEXER to BOW_LEXER_SIMPLE.
(old_bow_lexer_simple_get_word): Old, unused function removed.
* lex-html.c (bow_lexer_html_get_raw_word): Keep a count of the
HTML bracket nestings, instead of keeping track as a boolean.
(bow_lexer_html_get_word): Postprocess word using the underlying
lexer from SELF, not SELF itself.
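A minimal sketch of counting bracket nestings instead of keeping a boolean. `strip_tags` is a hypothetical function working on an in-memory string, not the lex-html.c code, and its handling of a stray `>` (silently dropped) is an assumption of this sketch.

```c
#include <assert.h>

/* Hypothetical sketch: copy TEXT to OUT, dropping everything inside
   `<...>', with nested brackets tracked by a depth counter.  OUT must
   be at least as large as TEXT.  Returns the depth left at the end,
   so a nonzero result means an unterminated `<'.  */
static int
strip_tags (const char *text, char *out)
{
  int depth = 0;

  assert (text && out);
  for (; *text; text++)
    {
      if (*text == '<')
        depth++;
      else if (*text == '>')
        {
          /* A stray `>' outside any tag is dropped in this sketch.  */
          if (depth > 0)
            depth--;
        }
      else if (depth == 0)
        *out++ = *text;
    }
  *out = '\0';
  return depth;
}
```

With a boolean flag, the first `>` in `a<b<c>d>e` would end the tag and let `d` leak into the output; the counter keeps skipping until the outer bracket closes.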
Tue Dec 31 12:36:21 1996 Andrew McCallum <mccallum@@pad>
* int4word.c (bow_num_words): If WORD_MAP has not yet been
created, return 0, instead of raising an error.
* lex-html.c (bow_lexer_html_get_raw_word): Look for end by
comparing to 0, not EOF. Fix termination condition of
true-to-start loop. Change type of SELF to BOW_LEXER_SIMPLE from
BOW_LEXER.
(bow_lexer_html_get_word): Change type of SELF to BOW_LEXER_INDIRECT
from BOW_LEXER.
* lex-simple.c (bow_lexer_simple_get_raw_word): Look for end by
comparing to 0, not EOF!
* libbow.h (bow_str2method): Add "tfidf" as a synonym for
tfidf_log_occur.
* rainbow-stats.pl: Now the `-s' argument causes it to print only
accuracy average and standard deviation.
(verbosity): New variable.
Mon Dec 30 15:34:25 1996 Andrew McCallum <mccallum@@pad>
* wi2dvf.c (bow_wi2dvf_add_di_text_fp): Loop over all documents
(LEX's) in the file.
* int4word.c (bow_words_add_occurrences_from_text_dir): Likewise.
* wv.c (bow_wv_new_from_lex): New function.
(bow_wv_new_from_text_fp): Use it. Handle NULL lex.
* libbow.h: Declare new WV function.
* lex-simple.c: Remove the N-gram lexer.
* lex-gram.c, lex-indirect.c, lex-html.c: New files.
* Makefile.in (LIBBOW_C_FILES): Added lex-gram.c, lex-html.c,
lex-indirect.c.
* libbow.h: Declare new lexer functions, types and variables.
Sun Dec 29 13:04:17 1996 Andrew McCallum <mccallum@@pad>
* lex-simple.c: Make all the instances of BOW_LEXER use a NULL
DOCUMENT_END_PATTERN.
(bow_lexer_simple_open_text_fp): Instead of scanning the FP twice,
have it fill and grow the document buffer as it reads the FP for
the first time. (This now seems to work on STDIN, although I
haven't tried non-NULL DOCUMENT_END_PATTERN's with it; I'm not
sure if FSEEK works on STDIN.)
* libbow.h (bow_lexer): Comment the start and end patterns.
* scan.c (bow_scan_fp_for_string): If STRING is the empty string,
return immediately instead of scanning to EOF. The NULL string
still scans to EOF.
Fri Dec 27 20:00:30 1996 Andrew McCallum <mccallum@@pad>
The changes for the new lexer. It now seems to be working.
* libbow.h (bow_lex): New type, replacing BOW_PARSE.
(bow_lexer): New type, replacing BOW_PARSER.
(bow_lexer_simple): New type. New lexers based on this.
(bow_lex_gram): New type.
(bow_lexer_gram): New type.
(bow_default_lexer): Renamed from BOW_DEFAULT_PARSER.
(bow_stem_porter): Renamed from BOW_STEM.
(bow_isalpha): New function declaration.
(bow_isgraph): Likewise.
* Makefile.in (LIBBOW_C_FILES): Remove p-alpha.c, p-alonly.c,
p-gram.c, p-white.c. Add lex-simple.c.
(DEMO_C_FILES): Remove robin.c.
* defparser.c (bow_default_lexer): Renamed from
BOW_DEFAULT_PARSER.
* int4word.c (bow_words_add_occurrences_from_text_dir): Use new
lexer instead of old parser.
* wi2dvf.c (bow_wi2dvf_add_di_text_fp): Likewise.
* wv.c (bow_wv_new_from_text_fp): Likewise.
* rainbow.c (rainbow_lexer): New global variable.
(main): Use new lexer instead of old parser. Using BOW_ALPHA_LEXER
as the underlying lexer instead of the old BOW_ALPHA_ONLY_PARSER.
* stem.c (bow_stem_porter): Renamed from BOW_STEM.
* lex-simple.c: New file.
* scan.c (bow_scan_fp_for_string): If STRING is NULL or
zero-length, then instead of immediately returning zero, scan
through the FP until EOF.
Thu Dec 26 12:20:44 1996 Andrew McCallum <mccallum@@pad>
The last version before the many `lexer' changes.
* int4word.c (bow_words_write): Write the WORD_MAP_COUNTS also.
(bow_words_read_from_fp): Read and create them.
* dv.c (bow_dv_add_di_count_weight): Assert that the new count is
greater than zero.
* prtfidf.c (_bow_barrel_set_prtfidf_weights): Assert that DV->IDF
is greater than zero.
Tue Dec 24 17:36:09 1996 Andrew McCallum <mccallum@@pad>
* rainbow-stats.pl (calculate_accuracy): Use printf and %g instead
of print.
(overall_accuracy): Calculate and print standard deviation also.
* rainbow.c (main): Use BOW_GRAM_PARSER_PARSER.
* p-gram.c (bow_gram_parser_parser): New global variable.
(bow_gram_parser_open_text_fp): Set it to BOW_DEFAULT_PARSER if it's
NULL. Use it.
(bow_gram_parser_close): Use it.
(bow_gram_parser_get_word): Likewise.
* libbow.h: Declare BOW_GRAM_PARSER_PARSER.
* prtfidf.c (bow_prtfidf_uniform_priors): New global variable.
Default is to use *uniform* class prior probabilities.
(_bow_barrel_set_prtfidf_weights): Don't set the DV->IDF here, we'll
use its current value later.
(_bow_score_prtfidf_from_wv): Move the test for !DV. Pay attention to
BOW_PRTFIDF_UNIFORM_PRIORS, and do the right thing.
Sun Dec 22 13:35:03 1996 Andrew McCallum <mccallum@@pad>
* .cvsignore: Added executables arrow, robin, rainbow-stats.
* Makefile.in (INSTALL_FILES): New variable.
(install): Use it. Fix removing of old executables. Install Perl
files.
* libbow.h (bow_str2method): New macro.
(bow_words_keep_top_by_infogain): Declare function.
* int4word.c: Add comment.
* rainbow.c (rainbow_index): Remove words with occurrences less
than X even if NUM_TOP_WORDS_TO_KEEP is non-zero.
(main): New command-line argument `-R'. Use new bow_str2method().
* stoplist.c: Turn back on the builtin stoplist.
Fri Dec 20 15:53:50 1996 Andrew McCallum <mccallum@@pad>
* prtfidf.c: New file.
Tue Dec 17 18:19:55 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* rainbow.c (main): Added prtfidf for `-m'.
* score.c (bow_get_best_matches): Do the right thing for prtfidf,
call _bow_score_prtfidf_from_wv.
* stoplist.c (init_stopwords): Temporarily turn off the builtin
stoplist, for use with the demo data. Yipes, this needs to
be turned back on!
* vpc.c (bow_barrel_new_vpc): Do the right thing for prtfidf;
treat it like naivebayes.
(bow_barrel_new_vpc_with_weights): Likewise.
* weight.c (bow_barrel_set_weights): Call
_bow_barrel_set_prtfidf_weights when appropriate.
* Makefile.in (LIBBOW_C_FILES): Added prtfidf.c.
Mon Dec 16 13:11:59 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* arrow.c (arrow_unarchive): Add verbosification.
* info_gain.c (bow_infogain_per_wi_new): Fix verbosification.
* info_gain.c (bow_infogain_per_wi_new): Add verbosifying.
* rainbow.c (num_top_words_to_keep): Set to zero as a default.
(rainbow_index): Make it possible to call both occurrence pruning and
infogain pruning.
(main): New command-line argument `-m'.
* vpc.c (bow_barrel_new_vpc_with_weights): Create the VPC barrel,
and then normalize the weights, otherwise we get -1 normalizers!
* score.c (bow_get_best_matches): Delete more leftover naivebayes
code. Assert that the normalizer is greater than 1.
* rainbow.c (num_top_words_to_keep): New global variable set from
command line.
(rainbow_index): New nested function DO_INDEXING. Use it. Add term
pruning according to information gain.
(main): New command line argument `-T' to set num top words.
* robin.c (robin_index): Do the right thing when WI is -1.
* wi2dvf.c (bow_wi2dvf_add_di_text_fp): Likewise.
* wv.c (bow_wv_new_from_text_fp): Likewise.
* int4word.c (bow_words_keep_top_by_infogain): Implemented.
(bow_words_add_occurrences_from_text_dir): Do the right thing when WI
is -1.
* info_gain.c (bow_infogain_per_wi_new): Set info gain to 0 when
the DV for that word is NULL.
* barrel.c (bow_barrel_new): Add new argument. Separate
capacities for the cdocs array and the wi2dvf.
* libbow.h: Declare new argument in bow_barrel_new.
* arrow.c (arrow_index): Use new extra argument to barrel_new.
* vpc.c (bow_barrel_new_vpc): Use new extra argument to
barrel_new.
* naivebayes.c (_bow_score_naivebayes_from_wv): Removed unused
local variable.
Wed Dec 11 15:19:44 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* Makefile.in (diff): Ignore the non-zero exit status from `diff'.
* Makefile.in (dist): Call cvs rtag.
(diff): New target.
(clean): Delete *.info and *.dvi.
(maintainer-clean): Delete $(PERL_RUNNABLE_FILES), configure, README,
and INSTALL.
* int4word.c (bow_words_keep_top_by_infogain): New function; not
yet implemented.
* naivebayes.c (_bow_score_naivebayes_from_wv): Incorporate P(w|C)
for all words in query document, not just those in the DV.
* rainbow.c: Added more comments.
(rainbow_wi2dvf_sum_classes): Function removed.
* Version (BOW_MINOR_VERSION): Version 0.5.
* libbow.h (BOW_MINOR_VERSION): Likewise.
This version given to Kamal.
Tue Dec 10 20:22:29 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* rainbow-stats.pl: New file. Changed from Sean's version to
include scientific notation in number regular expression.
Naive-Bayes code runs without crashing, but it provides horrible
results on the CIA type data. Average accuracy of 7%. It almost
always chooses Defense_Forces.
* naivebayes.c (_bow_barrel_set_naivebayes_weights): Rewrite from
scratch, avoiding the use of heaps.
* libbow.h: Include <limits.h>, so we get PATH_MAX.
* naivebayes.c: New file.
* Makefile.in (LIBBOW_C_FILES): Added naivebayes.c.
* weight.c: Remove the NaiveBayes code to naivebayes.c.
* score.c (bow_get_best_matches): Likewise.
* vpc.c (bow_barrel_new_vpc): Remove the NaiveBayes prior-setting
to naivebayes.c.
* rainbow.c: Remove the commented-out pre-vpc code. Change the
default method to naivebayes.
Mon Dec 9 10:11:05 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
CIA type data shows performance improvement from 1-grams to
1/2-grams: 79% to 68% accuracy.
* wi2dvf.c (bow_wi2dvf_dv): Fix assertion for when doing the last WI.
* vpc.c (bow_barrel_new_vpc): Verbosify and fix off-by-one error
in class index handling.
* rainbow.c (rainbow_classnames): New global variable.
(rainbow_unarchive): Set it.
(rainbow_index): Verbosify while we read files for word pruning.
* libbow.h (PATH_MAX): Avoid warning in surrounding #if.
(bow_fopen): Use perror() as well as bow_error.
* int4word.c (bow_words_add_occurrences_from_text_dir): Keep track
of the text file count, and verbosify.
* barrel.c (bow_barrel_new): Create the new wi2dvf with
bow_num_words(), not CAPACITY.
(_bow_barrel_cdoc_free): Only free FILENAME if it's non-NULL.
* array.c (bow_array_entry_at_index): Fix off-by-one error in
assertion.
* rainbow.c: Make it work with new vpc function, but old code is
still there commented-out.
* libbow.h: Declare new vpc and infogain functions.
* info_gain.c: Comment new functions.
* score.c (bow_get_best_matches): Add code to do NaiveBayes;
thanks to Dunja, who helped.
* vpc.c (bow_barrel_new_vpc): Totally rewritten. Now simpler and
faster. Don't create a dv_heap, just go through the wi2dvf by
words.
(bow_barrel_new_vpc_with_weights): New function.
* weight.c (_bow_barrel_set_weights_naivebayes): Renamed from
_bow_barrel_set_weights_sans_idf. Verify that the class priors
are set.
(bow_barrel_set_weights): Use new function name.
* split.c (drand48, srand48) [__sun__]: Add prototypes.
Fri Dec 6 17:57:46 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* wi2dvf.c (bow_wi2dvf_print_stats): Don't use //-style comments.
* weight.c (bow_barrel_set_weight_normalizers): Likewise.
* libbow.h: Add inclusions and declarations needed for SunOS;
thanks to Sean.
* rainbow.c (prune_words_with_occurrences_less_than): New global
variable.
(rainbow_index): Use it.
(rainbow_query): Set and normalize the QUERY_WV weights!
(rainbow_test): Likewise.
* dv.c (bow_dv_default_capacity): Value changed from 4 to 2, in an
effort to reduce memory use.
* email.c (bow_email_get_replyid): Don't insist that the opening
`<' is on the same line as "In-Reply-To:".
* int4word.c (word_map_counts, word_map_counts_size): New static
variables.
(bow_word2int_do_not_add): New static variable.
(_bow_int4word_initialize): New function.
(bow_word2int): Use it. Pay attention to bow_word2int_do_not_add.
(bow_words_set_map): New function.
(bow_word2int_add_occurrence): New function.
(bow_words_occurrences_for_wi): New function.
(bow_words_remove_occurrences_less_than): New function.
(bow_words_add_occurrences_from_text_dir): New function.
* libbow.h: Declare new bow_words_ functions.
* robin.c (robin_index): Use new function
bow_word2int_add_occurrence().
* weight.c (bow_barrel_set_weight_normalizers): Free the heap
before returning!
* wi2dvf.c (bow_wi2dvf_add_di_text_fp): Use new function
bow_word2int_add_occurrence().
* wv.c (bow_wv_new): Initialize normalizer to 1.
(bow_wv_new_from_text_fp): Likewise.
(bow_wv_new_from_text_fp): Use new bow_word2int_add_occurrence().
Thu Dec 5 09:49:46 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* score.c (bow_get_best_matches): Make sure QUERY_WV->NORMALIZER
is non-zero.
* weight.c (bow_wv_set_weight_normalizer): Make sure TOTAL is
non-zero.
* rainbow.c: Include <errno.h> for DEC Alpha's.
* arrow.c: Likewise.
Wed Dec 4 10:44:52 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* Makefile.in (PERL): New variable.
(LIBBOW_C_FILES): Added p-gram.c.
(PERL_FILES): New variable.
(PERL_RUNNABLE_FILES): New variable.
(DIST_FILES): Added PERL_FILES.
(all): Add dependency on PERL_RUNNABLE_FILES.
(PERL_RUNNABLE_FILES): New rule.
* configure.in: Look for perl in path.
* rainbow.c (infogain_words_to_print): New global variable, set by
command-line arguments.
(main): Set BOW_DEFAULT_PARSER to BOW_GRAM_PARSER; set
BOW_GRAM_PARSER_GRAM_SIZE to 1. New command line options, -g, -I,
-h. Call BOW_INFOGAIN_PER_WI_PRINT.
* info_gain.c (bow_infogain_per_wi_new): New function.
(bow_infogain_per_wi_print): New function.
(bow_barrel_scale_by_info_gain): Use new function above.
* p-gram.c: New file.
* libbow.h (bow_parser_skip_net_header): Declare new global
variable.
(bow_gram_parser): Declare new parser struct.
(bow_gram_parser_gram_size): Declare new global variable.
* defparser.c (bow_parser_skip_net_header): Define and initialize
to 0.
* p.inc (BOW_P_OPEN_NAME): If BOW_PARSER_SKIP_NET_HEADER is
non-zero, scan into the FP past the first "\n\n", in order to skip
over the email/news header.
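A minimal sketch of the header-skipping step described in the p.inc
entry above, assuming the header ends at the first blank line. The
function name sketch_skip_net_header is hypothetical and is not the
code in p.inc.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of libbow): advance FP just past the
   first "\n\n", i.e. past the blank line that ends an email/news
   header.  Returns 1 if the blank line was found, 0 if EOF came first. */
int
sketch_skip_net_header (FILE *fp)
{
  int c;
  int prev = 0;

  assert (fp != NULL);
  while ((c = getc (fp)) != EOF)
    {
      if (c == '\n' && prev == '\n')
        return 1;               /* FP now points at the first body byte */
      prev = c;
    }
  return 0;                     /* no blank line: the whole file was header */
}
```

After a successful call the next read from FP returns the first
character of the message body.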
Tue Dec 3 10:20:11 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* wv.c (bow_wv_count_for_wi): Use bow_wv_entry_for_wi() instead of
duplicating code.
* wi2dvf.c (bow_wi2dvf_new): Initialize the FP to NULL!
(bow_wi2dvf_dv): Assert that WI isn't larger than the WI2DVF->SIZE.
Assert that IDF isn't NaN; twice.
* split.c (bow_test_new_heap): Drastically simplify.
(bow_test_next_wv): Free the old *WV if it is non-NULL. Use
bow_wv_new() instead of creating it with malloc by hand. When
we've reached the end of the heap, free the *WV.
* score.c (bow_get_best_matches): Make CURRENT_SCORE a double
instead of a float. Assert that IDF isn't NaN. Don't normalize
the query WV. Most important: avoid a memory leak by freeing the
HEAP when we are done with it!
* rainbow.c (rainbow_wi2dvf_sum_classes): Set the class IDF from
the doc IDF. Still add in the count and weight, even if the
weight is zero. This means the wi2dvf will expand to the proper
size so we can meaningfully get DV's from it.
(rainbow_set_weights): Don't scale by info gain.
(rainbow_test): Initialize the QUERY_WV to NULL so bow_test_next_wv()
will know not to free an uninitialized value.
* heap.c (bow_dv_heap_free): New function.
(bow_make_dv_heap_from_wv): Add assertion checking for IDF NaN.
* dv.c (bow_dv_new_from_data_fp): Add comment about FP assertion.
* weight.c: Assert that IDF is not NaN. Don't print progress
verbosity every time through the loop---it's slowing us
down---only print it every 10 times through the loop.
* dv.c (bow_dv_new): Initialize the IDF to zero!
(bow_dv_write_size): Include the IDF size in the return value.
(bow_dv_write): Write the IDF.
(bow_dv_new_from_data_fp): Read the IDF.
* rainbow.c: Keep two barrels: one for classes, one for documents.
(num_trials, test_percentage, method): New global variables set by
command-line switches.
(rainbow_archive): Deal with both barrels.
(rainbow_unarchive): Likewise.
(rainbow_set_weights): New function...
(rainbow_wi2dvf_sum_classes): ...using code pulled from here.
(filename_to_classname): New function.
(rainbow_test): New function.
(main): Add new command line switches -t, -p. Call rainbow_test().
* arrow.c: Use new weight normalization functions.
* split.c: Renamed functions to all begin with `bow_test_'. Use
argument `barrel' instead of `cdoc' and `wi2dvf'.
* weight.c: Use bow_method instead of bow_idf_type and
bow_normalize_type. All functions changed.
* score.c (bow_get_best_matches): Rename some variables. Add
mechanics of NaiveBayes. Normalize query vector. Normalize
non-NaiveBayes outside the loop.
* barrel.c (bow_barrel_new): Fix initialization of METHOD.
(bow_barrel_add_from_text_dir): Initialize the PRIOR.
(_bow_barrel_cdoc_write): Write the PRIOR.
(_bow_barrel_cdoc_read): Read the PRIOR.
(bow_barrel_new_from_data_fp): Read the METHOD properly.
* libbow.h: Remove types as arguments to some weight functions.
Rename the test/train split functions.
(bow_cdoc): Added member PRIOR.
(bow_barrel): Added member METHOD.
(bow_method): New enum.
(bow_idf_type): Removed.
(bow_normalize_type): Removed.
* Makefile.in (LIBBOW_C_FILES): Added split.c.
* barrel.c (bow_barrel_new): Set RET->METHOD to default of
BOW_METHOD_TFIDF.
(bow_barrel_new_from_data_fp): Read METHOD.
(bow_barrel_write): Write METHOD.
* weight.c (_bow_add_to_normalizer_total): New function.
(_bow_total_to_normalizer): New function.
(bow_barrel_set_weight_normalizers): Use them.
(bow_wv_set_weights): Function moved here from wv.c.
(bow_wv_set_weight_normalizer): Likewise.
* wv.c: Weight and normalizer functions moved to weight.c.
* libbow.h: Move the WV weight-setting functions next to the
barrel weight-setting functions.
(bow_cdoc): Rename member LENGTH to NORMALIZER, for clarity.
(bow_barrel_set_weight_normalizers): Renamed from
bow_barrel_normalize_weights, since it doesn't actually change the
weight values.
* wv.c: Include <math.h>
(sqrtf): New macro.
(bow_wv_set_normalizer): Renamed from bow_wv_set_norm(). New argument
TYPE. Obey new argument.
(bow_wv_set_weights): New argument TYPE. Obey it. Don't call the
normalizer function.
(bow_wv_write): Use new name WV->NORMALIZER.
(bow_wv_new_from_data_fp): Likewise.
* score.c (bow_get_best_matches): Use renamed NORMALIZER member.
* barrel.c (_bow_barrel_cdoc_write): Likewise.
(_bow_barrel_cdoc_read): Likewise.
* libbow.h: Rename and add new arguments to WV functions.
Mon Dec 2 13:14:10 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* arrow.c (arrow_unarchive): Don't close the barrel FP, because we
still have yet to read the DV's from it!
* barrel.c (bow_barrel_add_from_text_dir): Print warning if we end
up finding more binary files than text files.
* score.c: Some formatting and comment changes.
* weight.c: Some comment and variable name changes.
(_bow_add_to_idf): Renamed from bow_add_to_total.
(_bow_barrel_set_weights_nb): New function for doing Naive Bayes.
(bow_barrel_set_weights): Call it if necessary.
* libbow.h: Declare new vpc function.
(bow_idf_nb): New idf type.
* Makefile.in (LIBBOW_C_FILES): Added vpc.c.
* vpc.c (bow_barrel_new_vpc): Renamed from bow_barrel2vpc_barrel.
Replace use of printf() with bow_verbosify(). Minor formatting
changes.
Mon Dec 2 13:09:10 1996 Sean Slattery <jslttery@@anther.learning.cs.cmu.edu>
* vpc.c: New file - implements vector per class models. Basically,
take a barrel and produce a vector per class barrel from it.
Tue Nov 26 15:56:07 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* weight.c (_bow_add_to_total): Renamed to include a prefixing
`_'. Declared `static inline'.
(bow_barrel_set_weights): Overhauled and simplified. I'm not sure
I haven't broken it, though. Previously some of the `if()else'
clauses seemed contradictory to me.
* score.c (bow_get_best_matches): Add comment about my perceived
pending need for normalization of the query vector.
* dv.c (_bow_dv_index_for_di): Add 1 to the DV length when it was
zero!
(bow_dv_add_di_count_weight): New function, replacing
bow_dv_add_di_count.
(bow_dv_add_di_weight): Function removed.
* libbow.h: Declare new wi2dvf function, and remove old ones.
* rainbow.c (rainbow_wi2dvf_sum_classes): Use new dv function.
* wi2dvf.c (bow_wi2dvf_add_wi_di_count_weight): New function,
replacing bow_wi2dvf_add_wi_di_count. Use new dv function.
(bow_wi2dvf_add_di_wv): Use new dv function.
(bow_wi2dvf_add_di_text_fp): Likewise.
* Makefile.in (LIBBOW_C_FILES): Added scan.c; although this will
be taken away once I change parsing to use strings and librx.
* scan.c: New file.
* heap.c (bow_make_dv_heap_from_wi2dvf): Add silly assert()ion.
* email.c (bow_email_get_date): Don't cause error when Date isn't
found, just return 0.
* dv.c (_bow_dv_index_for_di): New function that captures the guts
of preparing a spot to add a count or weight.
(bow_dv_add_di_count): Use it.
(bow_dv_add_di_weight): Use it.
* info_gain.c (bow_entropy): Ensure COUNTS[i] isn't zero before
calculating entropy.
Mon Nov 25 11:51:35 1996 Sean Slattery <jslttery@@anther.learning.cs.cmu.edu>
* weight.c: Added support for bow_prtfidf weighting, which gives an
idf = sqrt(total occurrences/occurrences).
(bow_barrel_set_weights): Added code to calculate the total number
of occurrences, and replaced those pesky 1.0's in the idf
calculations with the real totals intended. Doing things this way
ensures the weights don't go below 0.
* score.c (bow_get_best_matches_euclidian): New function. Gets
best matches based on the Euclidean distance between vectors
instead of the cosine of the angle between them.
* dv.c (bow_dv_add_di_weight): Made this function capable of
updating weights that occur before the last element entered. It
assumes the documents are in the list in ascending order of their
indices. Should make this change to the bow_dv_add_di_count
function as well, but this was the minimum I needed to get vector
per class stuff done.
Mon Nov 18 14:17:55 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* wv.c (bow_wv_set_norm): Initialize TOTAL to zero! It was
uninitialized.
(bow_wv_write): Write the NORM!
(bow_wv_new_from_data_fp): Read it.
(bow_wv_write_size): Adjusted for writing NORM.
* stoplist.c (bow_stoplist_add_from_file): Screaming verbosify
each word that's added.
* heap.c (bow_make_dv_heap_from_wv): Get the DV using
bow_wi2dvf_dv(), not by accessing the structure directly.
Otherwise, we will not properly read in the DVF from disk.
* weight.c (bow_barrel_set_weights): Likewise.
* p.inc (BOW_P_GET_WORD_NAME): Also check if word is on the
stoplist *after* stemming.
Tue Nov 5 12:15:37 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* weight.c (bow_barrel_set_weights): Use the total number of
documents instead of 1.0.
Fri Nov 1 11:27:23 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* p.inc: Fix the handling of BOW_P_STOPLIST_CHECKER.
* int4str.c (bow_int4str_new_from_fp): Make it work even for
strings that contain spaces, (but not newlines).
(bow_int4str_write): Make sure the strings don't contain newlines.
Add generalizable parsing facilities.
* libbow.h (bow_parse, bow_parser): New types. Add new parsing funcs.
(bow_get_word): Function removed. Use new parsing facilities instead.
* Makefile.in (DIST_FILES): Added p.inc.
* Makefile.in (LIBBOW_C_FILES): Added p-alpha.c, p-alonly.c, p-white.c.
* p.inc, p-alpha.c, p-alonly.c, p-white.c: New files.
* Makefile.in (LIBBOW_C_FILES): Added defparser.c. Removed
getword.c.
* wi2dvf.c (bow_wi2dvf_add_di_text_fp): Use new parser.
* wv.c (bow_wv_new_from_text_fp): Use new parser.
* arrow.c (arrow_index): Use new bow_barrel_add_from_text_dir
function.
Thu Oct 31 15:18:49 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* Version (BOW_MINOR_VERSION): Version 0.4.
* libbow.h (BOW_MINOR_VERSION): Version 0.4.
* rainbow.c: Add output filename feature. Use bow_idf_words,
which unlike bow_idf_log_words, seems to work.
(rainbow_index): Scale by information gain.
* barrel.c (bow_barrel_add_from_text_dir): Add new EXCEPT_NAME
argument. Deal with NULL EXCEPT_NAME.
* libbow.h: Add new argument to barrel function.
* weight.c (bow_barrel_set_weights): Add prefix and postfix
verbosity strings.
(bow_barrel_normalize_weights): Add verbosifying.
* info_gain.c (bow_barrel_scale_by_info_gain): Add verbosifying.
* barrel.c (bow_barrel_add_from_text_dir): Don't print the number
of "binary files".
* rainbow.c: Added some verbosifying.
* rainbow.c: Don't close the rainbow_barrel fp. Set the weights
in the right place. Put the indexing code in main(). Now running
to completion.
* dv.c (bow_dv_new_from_data_fp): Add new assertion that should
help us catch closed FP's.
* docnames.c (bow_map_verbosity_level): New global variable.
(bow_map_filenames_from_dir): Use it.
* barrel.c (bow_barrel_add_from_text_dir): Renamed from
bow_barrel_new_from_text_dir. Don't create a new barrel, just add
to a pre-existing one.
* libbow.h: Declare renamed function.
* libbow.h (bow_fwrite_string): Handle the NULL string for
argument S.
(bow_fread_string): Match bow_fwrite_string handling of NULL.
Mon Oct 28 12:03:11 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* info_gain.c (bow_barrel_scale_by_info_gain): Renamed from
bow_wi2dvf_scale_by_info_gain.
* libbow.h: Rename info gain function to use `barrel'.
* rainbow.c: Totally rewritten to be a document classifier.
* wi2dvf.c (bow_wi2dvf_add_di_wv): Increase wi2dvf size with a
MAX(), so we are guaranteed to be big enough.
(bow_wi2dvf_add_wi_di_count): Likewise.
(bow_wi2dvf_add_wi_di_weight): Likewise.
(bow_wi2dvf_write): Incorporate initial seek position into
calculations, in case we are writing to a file that already has
other stuff at the beginning.
* barrel.c (bow_barrel_free): New function.
(bow_barrel_new_from_text_dir): Print shorter verbosity.
* libbow.h: Declare bow_barrel_free().
* arrow.c (arrow_index): Set the weights.
(main): Raise error if no text documents found.
* libbow.h: Declare bow_barrel_new(), and fix typo.
* wi2dvf.c (bow_wi2dvf_add_wi_di_weight): New function.
* libbow.h: Declare new wi2dvf function.
* dv.c (bow_dv_add_di_weight): New function.
* libbow.h: Declare new dv function.
* weight.c (bow_barrel_set_weights): Renamed from
bow_wi2dvf_set_weights.
(bow_barrel_normalize_weights): Renamed from
bow_wi2dvf_normalize_weights.
* libbow.h: Rename weight functions to use `barrel'.
* barrel.c (bow_barrel_new_from_text_dir): Take new CLASS
argument. Set the `class' of the new cdoc's accordingly.
* libbow.h: Add new argument to barrel function.
Fri Oct 25 13:05:16 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* arrow.c (main): Create the data directory if it doesn't exist
already.
* Version (BOW_MINOR_VERSION): Version 0.3.
* sarray.c (bow_sarray_new_from_data_fp): Renamed from
bow_sarray_new_from_fp.
* libbow.h: Rename sarray function.
* Makefile.in (version.texi): Use renamed BOW_ variables.
(libbow.h): New target with rules that keep it up to date with
./Version.
* libbow.h (BOW_MAJOR_VERSION): New macro.
(BOW_MINOR_VERSION): New macro.
(BOW_VERSION): New macro.
* Version (BOW_MAJOR_VERSION): New variable.
(BOW_MINOR_VERSION): New variable.
(BOW_VERSION): Use them; renamed from LIBBOW_VERSION.
* arrow.c: New file.
* Makefile.in (DEMO_C_FILES): Added arrow.c.
* barrel.c: New file.
* libbow.h: Declare barrel archiving functions.
* stoplist.c (bow_stoplist_add_from_file): Add a verbosify
message.
Wed Oct 23 16:45:46 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* libbow.h (bow_barrel): New type. Use it in all places where a
WI2DVF and CDOCS were used together; several function arguments
changed.
* wi2dvf.c (bow_wi2dvf_new_from_text_dir): Function removed.
Similar function is now in barrel.c.
* weight.c (bow_wi2dvf_set_weights): Use bow_barrel.
(bow_wi2dvf_normalize_weights): Likewise.
* score.c (bow_get_best_matches): Use bow_barrel.
* info_gain.c (bow_wi2dvf_scale_by_info_gain): Use new bow_barrel
type.
* Makefile.in (LIBBOW_C_FILES): Add barrel.c.
* array.c (bow_array_new_from_data_fp): Renamed from
bow_array_new_from_fp.
* sarray.c (bow_sarray_new_from_fp): Use renamed function
bow_array_new_from_data_fp.
Tue Oct 22 14:03:38 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* email.c (_scan_fp_for_string): Make `\n' at the beginning of the
search string match the beginning of the file.
* error.c (_bow_error) [__linux__]: Call abort() instead of exit()
because it lets us find ourselves in GDB. Still don't do it for
non-Linux systems, because apparently on other systems there was a
problem with flushing stderr when calling abort().
* docnames.c (bow_map_filenames_from_dir): Don't verbosify the
directory names if we're not BOW_VERBOSITY_USE_BACKSPACE.
* email.c: To several functions add new argument that negates
test, or that insists on a search all on one line.
(_bow_email_get_email_address): New function.
(bow_email_get_sender): New function.
(bow_email_get_recipient): New function.
(bow_email_get_date): New function.
* libbow.h: Declare new email functions.
Mon Oct 21 12:08:45 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* libbow.h (bow_parse_news_headers): Add missing semi-colon to
declaration.
* info_gain.c (bow_entropy): Get the "document vector" with
bow_wi2dvf_dv(), not by following the pointer directly.
Otherwise, we won't properly read the DV in from the file, and may
get inappropriate NULLs.
* Makefile.in (LIBBOW_C_FILES): Add info_gain.c.
* HACKING: Correct directions for checking out bow from CVS.
Sat Oct 19 00:49:08 1996 Sean Slattery <jslttery@@anther.learning.cs.cmu.edu>
* news.c: Function for parsing news article headers. Useful for looking
for crosspostings for multiple classifications.
(bow_parse_news_headers): Added a getc to dump the first whitespace
character after the `:' that follows the header name.
(bow_headers2newsgroups): New function to grok the bow_sarray returned
by bow_parse_news_headers and return a bow_array of strings
corresponding to every newsgroup mentioned in the newsgroup line.
* libbow.h: Added def'n for new function.
* libbow.h: Added bow_parse_news_headers.
Fri Oct 18 10:50:23 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* info_gain.c (log2f): #define it if ./configure determined that
we don't have it.
(bow_entropy): Use log2f instead of log2.
(MIN): Macro removed. It's now in libbow.h.
Fri Oct 18 21:32:42 1996 Sean Slattery <jslttery@@anther.learning.cs.cmu.edu>
* array.c (bow_array_append): Changed test from array->length >
array->size to array->length >= array->size. When array->length ==
array->size, we've run out of space.
(bow_array_init): Assigned array->free_func to free_func. Otherwise
free_func is not initialised and the bow_array_free function will
sometimes crash.
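The two array.c fixes above can be sketched as follows. This is an
illustration under assumptions, not the libbow source: the struct
layout and the sketch_ names are hypothetical, and only the field
names length, size and free_func are taken from the entry.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable array, loosely modeled on the entry above. */
typedef struct
{
  int length;                   /* number of entries in use */
  int size;                     /* number of entries allocated */
  size_t entry_size;            /* bytes per entry */
  void (*free_func) (void *);   /* per-entry destructor, may be NULL */
  void *entries;
} sketch_array;

void
sketch_array_init (sketch_array *a, int capacity, size_t entry_size,
                   void (*free_func) (void *))
{
  assert (capacity > 0 && entry_size > 0);
  a->length = 0;
  a->size = capacity;
  a->entry_size = entry_size;
  a->free_func = free_func;     /* must be stored, or freeing reads garbage */
  a->entries = malloc (capacity * entry_size);
  assert (a->entries != NULL);
}

/* Append a copy of ENTRY and return its index. */
int
sketch_array_append (sketch_array *a, const void *entry)
{
  if (a->length >= a->size)     /* >=, not >: length == size means full */
    {
      a->size *= 2;
      /* The byte count must include the entry size. */
      a->entries = realloc (a->entries, a->size * a->entry_size);
      assert (a->entries != NULL);
    }
  memcpy ((char *) a->entries + a->length * a->entry_size,
          entry, a->entry_size);
  return a->length++;
}
```

With the old `>' test, the append that finds length == size would
write one entry past the end of the allocation before any growth
happened.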
Fri Oct 18 10:50:23 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* libbow.h: Declare new functions.
(bow_wv): LENGTH entry renamed to NORM.
* wv.c (bow_wv_set_norm): New function.
(bow_wv_set_weights): New function.
* error.c (_bow_error): Call exit(-1) instead of abort(). It
makes a prettier error message on the console.
* heap.c (bow_make_dv_heap_from_wi2dvf): Separate the index into
words and index into the heap so that we can handle wi2dvf's that
have some NULL "document vectors".
* libbow.h (MIN): New macro.
(MAX): New macro.
(bow_verbosity_use_backspace): New global variable declaration.
* error.c (bow_verbosity_use_backspace): New global variable.
(bow_verbosify): Use it.
* weight.c (MIN): Remove definition. It's now in libbow.h.
* libbow.h (bow_wi2dvf_normalize_weights): Change from `normalise'
to American spelling. The *.c file had already been changed.
Fri Oct 18 01:15:57 1996 Sean Slattery <jslttery@@anther.learning.cs.cmu.edu>
* info_gain.c (bow_wi2dvf_scale_by_info_gain): New file,
information gain routine.
(bow_entropy): Cast some of the arithmetic to floats - dividing one
integer by another tends to go to 0 here.
* libbow.h: Added definition for above.
* split.c (bow_next_test_wv): Free heap when we've exhausted the
test set (for tidiness).
* heap.c (bow_make_dv_heap_from_wi2dvf): Changed malloc to
bow_malloc
(bow_make_dv_heap_from_wv): Changed malloc to bow_malloc
Thu Oct 17 11:08:42 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* score.c (bow_get_best_matches): Add an assert()'ion that WI
match the word index of our current location in the word vector.
* libbow.h: Rename local variables from num_written to num_read
where appropriate.
* heap.c (bow_make_dv_heap_from_wv): Fix typo: continue when DV is
NULL, not the other way around.
* wi2dvf.c (bow_wi2dvf_write): Don't close the FP at the end! We
didn't open it.
(bow_wi2dvf_new_from_data_file): Don't close the FP, it will still be
needed to read the DV's.
* libbow.h (bow_wi2dvf_write_data_file,
bow_wi2dvf_new_from_data_file): Re-add declarations for these
functions.
* heap.c (bow_make_dv_heap_from_wv): WV->LENGTH is not the number
of entries in the word vector, it is the Euclidean length! Change
all uses of WV->LENGTH to WV->NUM_ENTRIES.
* libbow.h (bow_wv): Renamed element `length' to `total' in an
attempt to choose a less confusing name. Other naming suggestions
welcome.
* array.c (bow_array_new_from_fp): Set the LENGTH of the new
array; before it was uninitialized!
* libbow.h (bow_fwrite_string): Properly calculate the number of
characters written.
(bow_fread_string): Likewise, and parenthesize indexing of S for proper
termination.
(bow_idf_type): Added `bow_idf' as prefix to enum members, and removed
`total' from end.
* weight.c: Use new bow_idf enum names.
(bow_wi2dvf_set_weights): Handle the case in which a document vector
in the WI2DVF is NULL.
* libbow.h: Include <assert.h>
* heap.c (bow_make_dv_heap_from_wv): Make it work even when not
all the words in WV have document vectors in WI2DVF. Keep
separate indices into the word vector and into the heap.
* int4str.c (HEADER_STRING): New macro.
(bow_int4str_write): Write it to the FP.
(bow_int4str_new_from_fp): Expect it from the FP.
* array.c (HEADER_STRING): New macro.
(bow_array_write): Write it to the FP.
(bow_array_new_from_fp): Expect it from the FP.
* bmalloc.c: Remove previous contents. Now get the functions
directly from libbow.h.
* io.c: Likewise.
* Makefile.in (io.o bmalloc.o): Indicate that they now depend
(completely) on libbow.h.
* libbow.h (_BOW_MALLOC_INLINE_EXTERN): New macro for compiling
these extern inline functions in library .o files.
(_BOW_IO_INLINE_EXTERN): Likewise.
(bow_fwrite*, bow_fread): Assert the return values.
* int4docn.c (bow_docnames_write): Take FILE* argument instead of
const char *.
(bow_docnames_read_from_fp): Renamed from bow_docnames_read(); likewise
as above.
* libbow.h: Change argument types and function name for
bow_docnames archiving.
* docnames.c (bow_map_filenames_from_dir): Use renamed
bow_verbosity_level enum.
* libbow.h (bow_error): Don't print anything if
bow_verbosity_level indicates bow_silent.
Thu Oct 17 14:36:44 1996 Sean Slattery <jslttery@@anther.learning.cs.cmu.edu>
* split.c (bow_next_test_wv): Function now takes a pointer to a pointer
to a bow_wv. It sets this to point to a pointer to the wv it creates
and returns the integer document index to the test document
described by this word vector.
(bow_test_split): Fixed bug in counting that meant we sometimes ended
up with fewer test docs than asked for.
(bow_test_split): Random number generator is now seeded with time.
* libbow.h (bow_next_test_wv): Argument change as above.
Wed Oct 16 08:35:45 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* libbow.h (bow_screaming): Renamed from bow_shutup_already.
Commented all bow_verbosity_levels.
* wi2dvf.c (bow_wi2dvf_write_data_file): Close the FP at the end!
(bow_wi2dvf_new_from_data_file): Likewise.
(bow_wi2dvf_new_from_data_fp): Don't assert feof(), because there may
be multiple things written to one file.
* io.c (bow_fread_string): Add parentheses in order to dereference
string pointer properly.
* libbow.h: Comment changes to #include lines.
(bow_fopen): New macro.
* wi2dvf.c (bow_wi2dvf_new_from_data_fp): Renamed from
bow_wi2dvf_new_from_fp. All callers changed.
* libbow.h: Renamed function.
* Makefile.in (LIBBOW_C_FILES): Added heap.c.
* io.c (bow_fwrite_string): New function from libbow.h.
(bow_fread_string): Likewise.
Wed Oct 16 14:03:07 1996 Sean Slattery <jslttery@@anther.learning.cs.cmu.edu>
* weight.c: Checked for case total == 0 which can occur if no
documents in the model had this word. Without this check, we get
a floating point error when trying to divide by total
* score.c (bow_get_best_matches): Added support for a bow_array of
cdocs.
* weight.c: Messed up loop test on outer loop - Reset it to the max_wi
which Andrew changed it to before.
* libbow.h: Added defs for functions in split.c
* split.c: New file with functions for dealing with test sets.
* weight.c: Added bow_array *cdoc arguments to
bow_wi2dvf_set_weights (so we can only do docs in the model), and to
bow_wi2dvf_normalize_weights where we only calculate the length of
docs in the model and we store the length in the corresponding
cdoc structure.
* libbow.h: Added include of string.h to stop compiler complaint
on alpha.
Wed Oct 16 08:35:45 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* int4word.c (bow_words_write): Now takes FILE* argument instead
of filename.
(bow_words_read_from_fp): Renamed from bow_words_read_from_file, and
likewise as above.
* libbow.h (bow_words_read_from_fp): Renamed from bow_words_read.
* libbow.h: Change argument types of bow_words_write function.
(bow_error): Enclose expansion in parentheses, so that it parses
properly when put inside an `else' statement without brackets.
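The macro hazard behind the bow_error change above can be sketched
like this. The actual expansion of bow_error is not shown in this log,
so SKETCH_ERROR, sketch_error_count and sketch_check are hypothetical
names used only to illustrate the idea of a single parenthesized
expression.

```c
#include <stdio.h>

static int sketch_error_count = 0;

/* Fragile form (left commented out): it expands to two statements, so
   under an unbraced `else' only the first one is conditional.

   #define SKETCH_ERROR_BAD(msg) fputs (msg, stderr); sketch_error_count++
*/

/* Robust form: one parenthesized comma expression, which behaves as a
   single statement wherever a function call would. */
#define SKETCH_ERROR(msg) \
  (fputs ((msg), stderr), fputc ('\n', stderr), sketch_error_count++)

/* Return VALUE if it is non-negative; otherwise report and return -1. */
int
sketch_check (int value)
{
  if (value >= 0)
    return value;
  else
    SKETCH_ERROR ("negative value");
  return -1;
}
```

Because the expansion is one expression, the trailing semicolon at
the call site ends the `else' branch exactly as it would for an
ordinary function call.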
* wi2dvf.c (bow_wi2dvf_write): New function.
(bow_wi2dvf_write_data_file): Use it. This function is now deprecated.
(bow_wi2dvf_new_from_fp): New function.
(bow_wi2dvf_new_from_data_file): Use it. This function deprecated.
(bow_wi2dvf_free): New function.
* libbow.h: Declare new functions. Remove deprecated functions.
* sarray.c (bow_sarray_write): New function.
(bow_sarray_new_from_fp): New function.
* libbow.h: Declare new functions.
* wv.c (bow_wv_new): New function.
(bow_wv_write_size): New function.
(bow_wv_write): New function.
(bow_wv_new_from_data_fp): New function.
* libbow.h: Declare new functions.
* weight.c (bow_wi2dvf_normalize_weights): Use renamed variable
wv_length.
* array.c (bow_array_write): New function.
(bow_array_new_from_fp): New function.
* libbow.h: Declare new functions.
* int4str.c (bow_int4str_write): Make second argument a FILE*
instead of a filename.
(bow_int4str_new_from_fp): New function.
* libbow.h: Declare new function. Update argument type.
Tue Oct 15 10:07:42 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* wv.c (bow_wv_entry_for_wi): New function.
* libbow.h: Declare new function.
* libbow.h: Update for function name changes.
(bow_class): New structure.
* weight.c (bow_wi2dvf_normalize_weights): Renamed from
bow_normalize_word_vectors. Minor format, comments and variable
name changes.
* configure.in: Check for existence of log2f() and sqrtf()
functions.
* weight.c (bow_wi2dvf_set_weights): Renamed from
bow_assign_tfidf_weights because it is specific to wi2dvf
structures, and we could imagine having a di2wvf structure in the
future, and because we could imagine non-TFIDF weight-setting
schemes. Don't loop over all word indices up to bow_num_words(),
only loop up to the min of that and size of WI2DVF. Raise an
error if there is an unrecognized TYPE. Fix bow_verbosify() call.
Mon Oct 14 16:17:26 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* libbow.h: Indentation and comment fixes.
* rainbow.c (main): Don't exit() prematurely. Actually write the
data file and read it back in again.
* wi2dvf.c (bow_wi2dvf_write_data_file): Use sizeof(int) instead
of sizeof(long) since it better matches reality.
* dv.c (bow_dv_write_size): Sum short's, not int's, or else we'll
lie about the results of bow_dv_write.
* getword.c (bow_get_word) [NON_ALPHA_IN_WORD]: New macro
selecting new code that will reject a word if it contains any
non-alphabetic characters. Current default is to include this
code.
* bitvec.c (bow_bitvec_new): Properly initialize all values to 0,
not to 1.
Fri Oct 11 17:37:28 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* bitvec.c: Finish and debug implementation.
* libbow.h: Add bow_bitvec declarations.
* Makefile.in (LIBBOW_C_FILES): Added bitvec.c.
* bitvec.c: New file.
Fri Oct 11 17:14:12 1996 Sean Slattery <jslttery@@anther.learning.cs.cmu.edu>
* libbow.h: Resolved a conflict in bow_cdoc / bow_doc definition.
Thu Oct 10 09:44:55 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* stoplist.c (bow_stoplist_add_from_file): Don't raise an error if
we can't open the file. This way, we can simply call the function
with several "guessed" filenames.
* libbow.h: Update comment for stoplist function.
* getword.c (bow_get_word): Delineate words by space characters
and non-printable characters, not by non-alphabetic characters,
(but still reject words with "too many" digits). This is an
effort to return entire email addresses and URL's as single words.
* stoplist.c: Totally re-written using a bow_int4str.
(bow_stoplist_present): Renamed from bow_on_stoplist.
(bow_stoplist_add_from_file): New function.
* libbow.h: Declare new stoplist functions.
* Makefile.in (LIBBOW_C_FILES): Added stopwords.c.
* getword.c (bow_get_word): Use renamed stoplist function.
* email.c (bow_email_get_receivedid): New function.
* libbow.h: Declare new email functions.
* rainbow.c (main): Use new function name
bow_wi2dvf_write_data_file().
* Makefile.in ($(DEMO_EXECUTABLES):): Depend on all the
$(DEMO_O_FILES).
* int4word.c (bow_num_words): Print error if WORD_MAP has not yet
been initialized.
* docnames.c (bow_map_filenames_from_dir): Don't forget to copy
the CWD and the D_NAME into the FILENAME!
* wv.c (bow_wv_count_for_wi): Return 0 if WV is NULL.
* Makefile.in (LIBBOW_C_FILES): Added email.c.
(DEMO_EXECUTABLES:): Changed rule to make $*.o separately.
* email.c: New file.
* libbow.h: Declared new email functions.
* wi2dvf.c (bow_wi2dvf_write_data_file): Renamed from
bow_wi2dvf_write().
* libbow.h: Rename function declaration.
* wv.c (bow_wv_count_for_wi): New function.
* libbow.h: Declare new function.
* sarray.c (bow_sarray_index_at_keystr): New function.
* libbow.h (bow_sarray_index_at_keystr): Declare new function.
* sarray.c: New file.
* docs.c: Old file, no longer used.
* Makefile.in (LIBBOW_C_FILES): Add sarray.c. Remove docs.c.
Temporarily remove heap.c because it hasn't been checked into the
CVS, and I don't have access to it.
* libbow.h (bow_sarray): New typedef, and new function
declarations.
(bow_cdoc): Renamed from bow_doc. SEEK_START and SEEK_LENGTH elements
removed. Many users will need to define their own "document
entries" with different elements; this is just one example
typically used for classification.
(bow_docs): Typedef removed.
(bow_cdocs): New macro, a bow_array of cdocs. Also add macros for
functions.
(bow_wi2dvf_add_di_text_fp): Declare new function.
* int4str.c (bow_int4str_init): New function.
(bow_int4str_new): Use it.
* array.c (bow_array_default_capacity): Renamed from
bow_array_default_size.
(bow_array_init): Use new name.
(bow_array_append): Renamed from bow_array_add_at_index, since the
user really doesn't have a choice of index anyway. No INDEX
argument now.
* wi2dvf.c (bow_wi2dvf_add_di_text_fp): New function.
(bow_wi2dvf_new_from_text_dir): Use it.
Wed Oct 9 15:53:52 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* array.c (bow_array_add_at_index): Include ENTRY_SIZE in
calculation of realloc() size.
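A sketch of the kind of fix the entry above describes: the byte count passed to realloc() must be the capacity multiplied by the size of one entry. The struct demo_array and both function names are hypothetical and do not reproduce the real bow_array layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical generic array; not the actual bow_array definition. */
struct demo_array {
    size_t entry_size;   /* bytes per entry */
    size_t capacity;     /* entries allocated */
    size_t length;       /* entries in use */
    void *entries;
};

/* Bytes needed for CAPACITY entries.  Forgetting ENTRY_SIZE here would
   allocate CAPACITY bytes instead of CAPACITY entries. */
static size_t demo_array_bytes(size_t capacity, size_t entry_size)
{
    return capacity * entry_size;
}

/* Resize the storage to NEW_CAPACITY entries.  Returns 0 on success and
   -1 if realloc() fails, in which case the old storage is kept. */
static int demo_array_grow(struct demo_array *a, size_t new_capacity)
{
    void *p;

    assert(a != NULL && a->entry_size > 0 && new_capacity > 0);
    p = realloc(a->entries, demo_array_bytes(new_capacity, a->entry_size));
    if (p == NULL)
        return -1;
    a->entries = p;
    a->capacity = new_capacity;
    return 0;
}
```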
Tue Oct 8 14:38:59 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* libbow.h (bow_array): New structure and suite of functions.
(bow_docs): Use it.
* Makefile.in (LIBBOW_C_FILES): Added array.c. Renamed doc.c to
docs.c.
* array.c: New file.
* docs.c: New file.
* Makefile.in (LIBBOW_C_FILES): Added doc.c.
Tue Oct 8 15:27:45 1996 Sean Slattery <jslttery@@anther.learning.cs.cmu.edu>
* libbow.h: Added definitions for heap functions, weight functions
and scoring functions. Also added length field to bow_doc
structure.
* Makefile.in (LIBBOW_C_FILES): Added score.c, weight.c and
heap.c.
* heap.c: New file.
* weight.c: New file.
* score.c: New file.
Mon Oct 7 12:14:50 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* docnames.c (bow_map_filenames_from_dir): New function.
(bow_doc_list_append): Use it to do most of the work.
* libbow.h: Declare new function.
* Makefile.in (snapshot): New target.
* getword.c (bow_get_word): Avoid returning a post-stemmed word of
length 1.
* libbow.h (bow_wv): Renamed member "length" to "num_entries".
Added member "length", meaning Euclidean length of the vector.
(bow_doc): Added member "class". Removed member "wv".
* wv.c: Use new member name "num_entries".
* wi2dvf.c: Likewise.
* Makefile.in (DIST_FILES): Added HACKING.
Sat Oct 5 18:26:47 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* Version (LIBBOW_VERSION): Version 0.2.
* libbow.texi: Cleaned up and added some sections.
* dv.c (bow_dv_add_di_count): Fix bugs in calculation of DV_INDEX.
In an effort to reduce wasted memory, don't reallocate double the
previous SIZE, but 3/2 the previous size; this almost cuts in half
the amount of wasted "document vector" memory; (perhaps
multiplying 4/3 would help even more?).
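A small sketch of the growth-factor trade-off mentioned in the entry above. The helper names are hypothetical and this is not the code in dv.c; it only compares how many slots are left unused when a capacity grows by 3/2 versus by 2 until a required number of entries fits.

```c
#include <assert.h>
#include <stddef.h>

/* Next capacity when growing by 3/2 (integer arithmetic). */
static size_t grow_three_halves(size_t capacity)
{
    size_t next;

    assert(capacity > 0);
    next = capacity + capacity / 2;
    /* A capacity of 1 would not grow at all, so add at least one slot. */
    return next > capacity ? next : capacity + 1;
}

/* Unused slots left after growing from START until NEEDED entries fit.
   DOUBLING selects the old policy (x2) instead of the new one (x3/2). */
static size_t slack_after_growth(size_t start, size_t needed, int doubling)
{
    size_t capacity = start;

    assert(start > 0);
    while (capacity < needed)
        capacity = doubling ? capacity * 2 : grow_three_halves(capacity);
    return capacity - needed;
}
```

For example, starting at a capacity of 4 and needing 5 entries leaves 3 unused slots with doubling and 1 with 3/2 growth. The smaller factor wastes less per vector at the cost of more frequent reallocations.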
* wi2dvf.c (bow_wi2dvf_dv): Use new function name
bow_dv_new_from_data_fp().
(bow_wi2dvf_print_stats): Fix typo. Also print average number of
unused document vector entries.
(bow_wi2dvf_new_from_text_dir): Don't use "word vectors". Instead
grab each word individually from a text file, and add it to the
map using bow_wi2dvf_add_wi_di_count().
(bow_wi2dvf_add_wi_di_count): Newly implemented.
* libbow.h (bow_dv_new_from_data_fp): Renamed from
bow_dv_new_from_fp.
* dv.c (bow_dv_add_di_count): Don't use a new "document entry" if
the "document vector" already has an entry for the given DI.
* wi2dvf.c (bow_wi2dvf_print_stats): Print stats about number of
used and unused "document entries" to get a better idea of memory
usage.
* rainbow.c (main): Use getopt() to enable setting of
bow_verbosity_level.
Wed Oct 2 11:20:58 1996 Andrew McCallum <mccallum@@cs.cmu.edu>
* libbow.h (bow_wi2dvf_add_wi_di_count): New function declaration;
not yet implemented.
* rainbow.c (main): Don't set bow_verbosity_level to bow_quiet.
* docnames.c: Change many FL variable names to DL.
(bow_doc_list_append): Don't set *DL to NULL at the beginning, because
it won't work recursively.
* wi2dvf.c (bow_wi2dvf_new_from_text_dir): Add assertion that
verifies length of the document list.
* Version 0.0. CVS rtag with `release-0-0'.
* wi2dvf.c (bow_wi2dvf_new_from_text_dir): Clean up and count text
files and binary files differently.
* rainbow.c (main): Comment out setting to bow_quiet.
* Makefile.in: Include Version.
(version.texi): Fix dependency.
(dist): Fix it.
* docnames.c (bow_doc_list_append): Don't print extra newline.
(bow_doc_list_length): New function.
* libbow.h (bow_de): Define di and count as short ints, not
ints.
(bow_fwrite_short): New function.
(bow_fread_short): New function.
(bow_doc_list_length): Declare new function.
* dv.c (bow_dv_write): Write di and count as short ints.
(bow_dv_new_from_fp): Read them as short ints.
* io.c (bow_fwrite_short): New function.
(bow_fread_short): New function.
* Version: New file.
* libbow-desc.texi: New file.
* Makefile.in (clean): Fix name of libbow.a; also remove the
$(DEMO_EXECUTABLES).
* rainbow.c (main): Print messages during stages of wi2dvf map
testing. Clean up the other test code.
* wi2dvf.c (bow_wi2dvf_print_stats): New function.
* dv.c (bow_dv_default_capacity): Decreased from 512 to 4 in an
attempt to avoid exhausted memory.
(bow_dv_count): New global variable.
(bow_dv_new): Increment it.
(bow_dv_free): Decrement it.
* libbow.h (bow_malloc): New function.
(bow_realloc): New function.
(bow_free): New function.
* wv.c: Use bow_malloc() instead of malloc().
* stoplist.c: Likewise.
* primes.c: Likewise.
* int4str.c: Likewise.
* docnames.c: Likewise.
* dv.c: Likewise.
* wi2dvf.c: Likewise.
* Makefile.in (LIBBOW_C_FILES): Added bmalloc.c.
* Placed under CVS with release-tag `first'.