/*
Copyright (C) 1999-2002 Ricardo Ueda Karpischek
This is free software; you can redistribute it and/or modify
it under the terms of the version 2 of the GNU General Public
License as published by the Free Software Foundation.
This software is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this software; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.
*/
/*
book.c: Documentation only
*/
/*
This module does not contain code, but only documentation blocks
that aren't (currently) attached to specific pieces of code in
the other modules.
*/
/* (tutorial)
NAME
----
clara - a cooperative OCR
SYNOPSIS
--------
clara [options]
DESCRIPTION
-----------
Welcome. Clara OCR is a free OCR, written for systems supporting
the C library and the X Window System. Clara OCR is intended for the
cooperative OCR of books. There are some screenshots available at
CLARA_HOME.
This documentation is extracted automatically from the comments
of the Clara OCR source code. It is known as "The Clara OCR
Tutorial". There is also an advanced manual known as "The Clara
OCR Advanced User's Manual" (man page clara-adv(1), also
available in HTML format). Developers must read "The Clara OCR
Developer's Guide" (man page clara-dev(1), also available in HTML
format).
CONTENTS
--------
Making OCR
Starting Clara
Some few command-line switches
Training symbols
Saving the session
OCR steps
Classification
Note about how Clara OCR classification works
Building the output
Handling broken symbols
Handling accents
Browsing the book font
Useful hints
Fun codes
AVAILABILITY
CREDITS
*/
/* (book)
NAME
----
clara - a cooperative OCR
SYNOPSIS
--------
clara [options]
DESCRIPTION
-----------
Welcome. Clara OCR is a free OCR, written for systems supporting
the C library and the X Window System. Clara OCR is intended for
the cooperative OCR of books. There are some screenshots
available at CLARA_HOME.
This documentation is extracted automatically from the comments
of the Clara OCR source code. It is known as "The Clara OCR
Advanced User's Manual". It's currently unfinished. First-time
users are invited to read "The Clara OCR Tutorial". Developers
must read "The Clara OCR Developer's Guide".
CONTENTS
--------
Welcome to Clara OCR
Early historical notes
Design notes
Supported Alphabets
Clara vs the others
The requirements
How to download and compile Clara
Compilation and startup pitfalls
A first OCR project
Scanning and thresholding
Manual and histogram-based (global)
Classification-based (local)
Classification-based (global)
Avoiding or correcting skew
The work directory
Building the book font
Skeleton tuning
Classification tentatives
Alignment tuning
Complex procedures
Using two directories
Adding a page
Multiple books
Adding a book
Removing a page
Dealing with classification errors
Rebuilding session files
Importing revision data
How to use the web interface
Revision acts maintenance
Analysing the statistics
Upgrading Clara OCR
Reference of the Clara GUI
The application window
Tabs and windows
The Application Buttons
The Alphabet Map
Reference of the menus
File menu
Edit menu
View menu
Alphabets menu
Options menu
PAGE options menu
PAGE_FATBITS options menu
OCR steps menu
Reference of command-line switches
AVAILABILITY
CREDITS
*/
/* (devel)
NAME
----
clara - a cooperative OCR
SYNOPSIS
--------
clara [options]
DESCRIPTION
-----------
Welcome. Clara OCR is a free OCR, written for systems supporting
the C library and the X Window System. Clara OCR is intended for the
cooperative OCR of books. There are some screenshots available at
CLARA_HOME.
This documentation is extracted automatically from the comments
of the Clara OCR source code. It is known as "The Clara OCR
Developer's Guide". It's currently unfinished. First-time users
are invited to read "The Clara OCR Tutorial". There is also an
advanced manual known as "The Clara OCR Advanced User's Manual".
CONTENTS
--------
Introducing the source code
Language and environment
Modularization
The memory allocator
Security notes
Runtime index checking
Background operation
Global variables
Path variables
Bitmaps
Execution model
Return codes
Internal representation of pages
Closures
Symbols
The sdesc structure and the mc array
The preferred symbols
Font size
Symbol alignment
Words and lines
Acts and transliterations
Symbol transliterations
Transliteration preference
Transliteration class computing
The zones
Heuristics
Skeleton pixels
Symbol pairing
The build step
Resetting
Synchronization
The function list_cl
The GUI
Main characteristics
Geometry of the application window
Geometry of windows
Scrollbars
Displaying bitmaps
HTML windows overview
Graphic elements
XML support
Auto-submission of forms
The Clara API
Redraw flags
OCR statuses
The function setview
The function redraw
The function show_hint
The function start_ocr
How to change the source code (examples)
How to add a bitmap comparison method
How to write a bitmap comparison function
How to add an application button
Bugs and TODO list
AVAILABILITY
CREDITS
*/
/* (book)
Early historical notes
----------------------
For some years now we have tested and used OCR software, mainly
on old books. Popular OCR programs (those bundled with
scanners) are useful tools. However, OCR is not a simple
task. The results obtained using those programs vary widely
depending on the printed document, and, for most texts we're
interested in, the results are really poor or even unusable. In
fact, it's no surprise that many digitization projects
prefer not to use OCR and rely on typists only.
For a programmer, it is somewhat intuitive that OCR could achieve
good results even from low-quality texts when an ad hoc
approach is used, focusing on one specific book (for
instance). Within this approach, OCR becomes a matter of finding
software adequate for the texts you're trying to OCR, or
perhaps developing a new program. So a free and easy-to-customize
OCR (at the source code level) would be a valuable resource for
text digitization projects.
Dealing with graphics is not among our main occupations, but
after analysing many scanned materials, we began to write some
simple and specialized recognition tools. More recently (in the
third quarter of 1999) a simple X interface linked to a naive
bitmap comparison heuristic was written. From that prototype,
Clara OCR evolved. Since then, many new ideas from various
people have helped to make it better.
Design notes
------------
It's not a bad idea to enumerate some principles that have driven
Clara OCR development. They'll make it easier to understand the
features and limitations of the software (these principles may
change over time).
1. Clara is an OCR for printed texts, not for handwritten
texts.
2. Clara was not designed to OCR one or two single
pages, but to OCR a large number of documents with the same
graphic characteristics (font, size, etc). So it can take
advantage of fine (and perhaps expensive) training. This will
typically be the case when OCRing an entire book.
3. We chose not to support multiple graphic formats directly, but
only Jef Poskanzer's raw PBM and PGM. Non-PBM/PGM files will be
read through filters.
4. Clara OCR aims to be a tool that makes it viable to accumulate
and reuse human revision effort. Because of this, in the OCR model
implemented by Clara, training and revision are one and the same
thing. The revision is a sum of isolated and independent acts and
alternates with reprocessing steps along a refinement process.
5. The Clara GUI is implemented as, and behaves like, a minimalistic
HTML viewer. This is just an easy and standard way to implement a
forms interface.
6. We have tried to make the source code portable across
platforms that support the C library and Xlib. Clara has no
special provision to be ported to environments that do not
support Xlib. We avoided using a higher-level graphic
environment like Motif, GTK or Qt, but we do not discourage
initiatives to add code that adapts Clara OCR, or adapts it
better, to these or other graphic environments.
7. We generally try to make the code efficient in terms of RAM
usage. CPU and disk usage (for session files) have lower priority.
Clara vs the others
-------------------
Clara differs from other OCR software in various aspects:
1. Most known OCRs are non-free and Clara is free. Clara focuses on
the X Window System. Clara offers batch processing, a web
interface and supports cooperative revision effort.
2. Most OCR software focuses on omnifont technology, disregarding
training. Clara does not implement omnifont techniques and
concentrates on building specialized fonts (some day in the
future, however, maybe we'll try classification techniques that
do not require training).
3. Most OCR software makes the revision of the recognized text a
process totally separate from the recognition. Clara
pragmatically joins the two processes, and makes training and
revision one and the same thing. In fact, the OCR model implemented
by Clara is an interactive effort where the usage of the heuristics
alternates with revision and visual fine-tuning of the OCR,
guided by the user's experience and judgment.
4. Clara lets you enter the transliteration of each pattern
using an interface that displays a graphic cursor directly over
the image of the scanned page, and builds and maintains a mapping
between graphic symbols and their transliterations on the OCR
output. This is a potentially useful mechanism for documentation
systems, and a valuable tool for typists and reviewers. In fact,
Clara OCR may be seen as a productivity tool for typists, instead
of a typical OCR.
5. Most OCR software is integrated with scanning tools, offering
the user a unified interface to execute all steps from
scanning to recognition. Clara does not offer such an integrated
interface, so you need separate software (e.g. SANE) to
perform scanning.
6. Most OCR software expects the input to be a graphic file
encoded in TIFF or other formats. Clara supports only raw
PBM/PGM.
*/
/* (book)
Scanning and thresholding
-------------------------
Clara OCR cannot scan paper documents by itself. Scanning must be
performed by another program. The Clara OCR development effort is
using SANE (http://www.mostang.com/sane) to produce 600 or 300
dpi images. The Clara OCR heuristics are tuned to 600 dpi.
Scanners offer three scanning modes: black-and-white (also known
as "bitmap" or "lineart", however the meaning of these words may
vary depending on the context), "grayscale" and "color". Clara
OCR requires black-and-white or grayscale input. Both
black-and-white and grayscale images may be saved in a variety of
formats by scanning programs. However, only PBM (for
black-and-white) and PGM (for grayscale) formats are
recognized. Generally grayscale at 600 or 300 dpi will be the best
choice, but black-and-white at 600 dpi may be good for new, high
quality printed materials. If your scanning program does not
support the PBM or PGM formats, try to save the images in TIFF
format and convert to PBM or PGM using the command tifftopnm. If
for some reason the TIFF format cannot be used, choose any other
format that preserves all data (don't use lossy formats
like JPEG) and for which a conversion tool is available, and
convert it to PBM or PGM.
Remark: Programs that scan or handle (e.g. rotate) images may
sometimes perform unexpected tasks, such as applying dithering or
reduction algorithms on their own. An image transformed to look
nice or to become small may be useless for OCR purposes.
Remark: The PBM and PGM formats do not carry the original resolution
(dots-per-inch) at which the image was scanned. As some
heuristics require that information, Clara OCR expects to be
informed about it through the command-line switch -y (so take
note of the resolution used).
Grayscale means that each pixel assumes one gray "level",
typically from 0 (black) to 255 (white). This is a good choice
for scanning old or low-quality printed materials, because it's
possible to use specialized programs to analyse the image and
choose a "threshold", in such a way that all pixels above that
threshold will be considered "white", and all others will be
considered black (when scanning in black-and-white mode, the
threshold is chosen by the scanning program or by the user). The
threshold may be global (fixed for the entire page) or local
(vary along the page).
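As an illustration of the global case, here is a minimal sketch in
C. It is not taken from the Clara OCR source code: the function name
threshold_global and the output convention (1 for black, 0 for white,
as in PBM) are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper (not part of Clara OCR): binarize an 8-bit
// grayscale buffer using one global threshold. Pixels whose level
// is above the threshold become white (0); all others become
// black (1). Returns the number of black pixels produced.
size_t threshold_global(const unsigned char *gray, unsigned char *bw,
                        size_t n, unsigned char thresh)
{
    size_t i, black = 0;

    assert(gray != NULL && bw != NULL);
    for (i = 0; i < n; ++i) {
        bw[i] = (gray[i] > thresh) ? 0 : 1;
        black += bw[i];
    }
    return black;
}
```

A local thresholder would follow the same rule, but with a threshold
that changes from one region (or symbol) of the page to another.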
In most cases grayscale will achieve better results. However, as
grayscale images are much larger than black-and-white images, 300
dpi (instead of 600 dpi) may be mandatory when using grayscale
due to disk consumption requirements.
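To make the size trade-off concrete, the arithmetic can be sketched
as below. The helper names are hypothetical, the scan area used in
the note that follows (150 x 210 mm) is only an example, and the
sizes ignore the short file header. We assume 8 bits per pixel for
raw PGM and 8 pixels per byte, with each row padded to a whole byte,
for raw PBM.

```c
#include <assert.h>

// Hypothetical helper: number of pixels along one dimension when
// scanning mm millimetres at dpi dots per inch (rounded to nearest).
long scan_pixels(double mm, int dpi)
{
    assert(mm >= 0.0 && dpi > 0);
    return (long) (mm / 25.4 * dpi + 0.5);
}

// Payload size in bytes of a raw PGM with one byte per pixel.
long pgm_bytes(long w, long h)
{
    return w * h;
}

// Payload size in bytes of a raw PBM: 8 pixels per byte, each row
// padded to a whole number of bytes.
long pbm_bytes(long w, long h)
{
    return ((w + 7) / 8) * h;
}
```

For a 150 x 210 mm area, this gives about 4.4 MB for 300 dpi
grayscale, about 17.6 MB for 600 dpi grayscale and about 2.2 MB for
600 dpi black-and-white.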
Remark: Try to limit yourself to the optical resolution offered by
the scanner. Most old scanners are 300 dpi, but the scanning
software obtains higher resolutions through interpolation. Newer
scanners may be optical 600 dpi or 1200 dpi or more.
Remark: Page 143 of the Manuel Bernardes Branco dictionary, which
we use throughout these tests, was scanned using the SANE
scanimage command:
scanimage -d microtek2:/dev/sga --mode gray -x 150 -y 210
--resolution 300 > 143.pgm
Thresholding is not the only method for converting grayscale
images to black-and-white (such conversion is also called
"binarization"), but it's the current method used by Clara OCR.
In practice, a threshold that is too low will break many symbols at
their thin parts, and one that is too high will link symbols together
(in the figure, an "a-i" link and a broken "u").
XX
XX
XXXXX XXX XXX XXX
X XX XX XX XX
XX XX XX XX
XXXXXXX XX XX XX
X XX XX XX XX
X XX XX XX XX
XXXXX XXXXXXX XX XXXX
It's a hard task to detect broken and linked symbols. The Clara OCR
heuristics that handle these cases are incipient, so thresholding
must be carefully performed, in order not to compromise the OCR
results. If the printing intensity, the noise level or the paper
quality vary from page to page, thresholding must be performed on a
per-page basis.
Remark: You can now try to avoid links during the segmentation step.
Just set the "Try avoid links" parameter in the TUNE tab (normal
values are <= 1).
The four thresholding methods currently available are: manual
(global), histogram-based (global), classification-based (local),
classification-based (global).
Manual and histogram-based (global)
-----------------------------------
Histogram-based thresholding is the default method. It automatically
computes a thresholding value based on the distribution of
gray shades. To use it, just enter the TUNE tab and select (it's
selected by default) the "use histogram-based global
thresholder" entry. To try it, load a PGM image and press OCR or
request the Segmentation OCR step.
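To give an idea of how a threshold can be derived from the
distribution of gray levels, the sketch below uses Otsu's method, a
well-known histogram-based technique. This is only an illustration:
we do not claim that the Clara OCR histogram-based thresholder works
this way, and the function name otsu_threshold is an assumption of
this example.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper (not Clara OCR's code): choose a global
// threshold from a 256-bin histogram of gray levels using Otsu's
// method, that is, the level that maximizes the between-class
// variance. Pixels with level <= the returned value would be
// considered black, all others white.
int otsu_threshold(const unsigned long hist[256])
{
    double total = 0.0, sum_all = 0.0;
    double w0 = 0.0, sum0 = 0.0, best_var = -1.0;
    int t, best = 0;

    assert(hist != NULL);
    for (t = 0; t < 256; ++t) {
        total += (double) hist[t];
        sum_all += (double) t * (double) hist[t];
    }
    for (t = 0; t < 256; ++t) {
        double w1, m0, m1, var;

        w0 += (double) hist[t];        // weight of the dark class
        if (w0 == 0.0)
            continue;
        w1 = total - w0;               // weight of the light class
        if (w1 == 0.0)
            break;
        sum0 += (double) t * (double) hist[t];
        m0 = sum0 / w0;                // mean level of the dark class
        m1 = (sum_all - sum0) / w1;    // mean level of the light class
        var = w0 * w1 * (m0 - m1) * (m0 - m1);
        if (var > best_var) {
            best_var = var;
            best = t;
        }
    }
    return best;
}
```

For a page with dark ink on light paper, the histogram usually has
two peaks, and the value returned falls between them.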
Remark: You can correct the automatically detected threshold with
the "Threshold factor" parameter in the TUNE tab.
A global thresholding value can also be specified manually. This
corresponds to the "use manual global thresholder" entry. The
thresholding value is chosen through a visual
interface called "instant thresholding". To use it, load one PGM
image and select the "Instant thresholding" entry (Edit
menu). Then use '<', '>', '+' and '-' to change the thresholding
value. When satisfied, press ESC. Note that the selected value will
be applied only when the segmentation step runs.
Classification-based (local)
----------------------------
Global thresholding does not address those cases where the
printing intensity (or paper properties) varies within a single
page. Local thresholding methods are required in such
cases. Clara OCR implements a classification-based local
(per-symbol) thresholder. Saying that it's classification-based
means that the OCR engine is used to choose the threshold. In
other words, the threshold chosen is the one for which the
classifier successfully recognizes the symbol (in fact, this is a
brute-force approach).
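The brute-force idea can be sketched as follows. This is not the
Clara OCR implementation: the callback type classify_fn, the function
find_local_threshold and the toy classifier demo_classify are all
hypothetical names introduced for this example, and a real classifier
would binarize the symbol at the given threshold and compare it with
the trained patterns.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical callback type: returns nonzero when the symbol,
// binarized at the given threshold, is recognized by the classifier.
typedef int (*classify_fn)(int thresh, void *ctx);

// Brute-force search (a sketch only): try every threshold in
// [lo, hi], advancing by step, and return the first one accepted
// by the classifier, or -1 if none is accepted.
int find_local_threshold(int lo, int hi, int step,
                         classify_fn classify, void *ctx)
{
    int t;

    assert(classify != NULL && step > 0);
    for (t = lo; t <= hi; t += step) {
        if (classify(t, ctx))
            return t;
    }
    return -1;
}

// Toy classifier for demonstration only: it "recognizes" the symbol
// when the threshold falls inside the window passed through ctx.
int demo_classify(int thresh, void *ctx)
{
    const int *window = ctx;

    return thresh >= window[0] && thresh <= window[1];
}
```

As the text above points out, such a search is only as good as the
classifier behind it: with an untrained OCR no threshold is accepted
and the search returns -1.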
The local binarizer can be manually applied to any symbol. To do
so, load one PGM page and click any symbol directly on the PAGE
tab. Two thresholding values will be chosen. The pixels found to
be "black" for each one are painted "black" (smaller value) and
"gray" (larger value). At this point, it's possible to add the
thresholded symbol as a pattern (just press the key corresponding
to its transliteration). Remember that this thresholder relies on
the classifier, so if the OCR is not trained, you'll get no
benefit.
Two versions of the local binarizer were developed, a "weak" one
and a "strong" one. The "weak" one just tries to change the
threshold on those symbols not successfully classified using the
default threshold. The "strong" one (unfinished) also tries to
criticize locally the segmentation results. By default, the weak
version is used. To try the strong one, check the corresponding
checkbox at the TUNE tab.
Remark: As an alternative, use the "Balance" feature + global thresholding.
Classification-based (global)
-----------------------------
Clara OCR includes a simple threshold selection script that
computes the best global thresholds based on classification
results. Let's try it on our 2-page book. Just create a
directory, cd to it and run the selthresh script, giving the
resolution and the names of the images:
$ cd /home/clara/books/BC
$ mkdir pbm
$ cd pbm
$ selthresh -y 300 -l 0.45 0.55 ../pgm/*pgm
selthresh: scaling 2 times
Best thresholds:
143-l.pgm 0.49
143-r.pgm 0.51
In this case, selthresh will require around 4 minutes to
complete on a 500MHz CPU. For larger collections of pages,
selthresh may take much longer to complete (hours or days). If
needed, the execution can be safely interrupted using Control-C
(it's even ok to shut down the machine while selthresh is
running). The execution can be safely resumed from the point
where it was interrupted by typing the same command again:
$ cd /home/clara/books/MBB/pbm
$ selthresh -y 300 -l 0.40 0.55 ../pgm/*pgm
The option -l specifies the interval of thresholds to try. For
now, selthresh is unable to choose a "good" interval by itself.
The user must manually check the results for some thresholds in
order to make a choice. For instance, to examine the results for
threshold 0.4 on page 143-l.pgm, try:
$ pgmtopbm -threshold -value 0.4 ../pgm/143-l.pgm >143-l.pbm
$ display 143-l.pbm
Change the threshold and repeat. Once you find a threshold value
that produces a "nice" visual result, pass to -l the interval
centered at that threshold, with total width 0.1 or 0.2. The same
interval may be used for all pages because selthresh will warn
about a bad interval choice. Example:
$ selthresh -y 300 -l 0.30 0.35 ../pgm/143-l.pgm
selthresh: scaling 2 times
Best thresholds:
143-l.pgm 0.32 (bad interval, try -l 0.30 0.4)
If a "bad interval" warning appears on the final output for some
pages, it's ok to restart selthresh specifying a new, wider
interval, as suggested by selthresh. Only the suspicious pages
will be re-examined. In fact, selecting a narrow initial interval
(and making it larger as required) may be a good strategy to
reduce the total running time.
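The arithmetic for the interval centered at a visually chosen
threshold can be sketched as below. The helper name
interval_option is hypothetical; only the -l option and the
suggested widths (0.1 or 0.2) come from the text above.

```python
def interval_option(center, width=0.1):
    """Return the -l option for an interval centered at `center`.

    center: threshold in [0, 1] that gave a good visual result.
    width: total width of the interval to try.
    The limits are clamped to [0, 1].
    """
    low = max(0.0, center - width / 2)
    high = min(1.0, center + width / 2)
    return "-l %.2f %.2f" % (low, high)
```

For instance, interval_option(0.4) gives "-l 0.35 0.45".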
Once the best thresholds are known, use pgmtopbm to produce the
black-and-white images. It's also a good idea to bring the
resolution closer to 600 dpi using pnmenlarge. Although
pnmenlarge does not add information to the image, the
classification heuristics will behave better. In our case, the
commands would be
$ cd /home/clara/books/BC/pbm
$ pnmenlarge 2 ../pgm/143-l.pgm | \
pgmtopbm -threshold -value 0.49 >143-l.pbm
$ pnmenlarge 2 ../pgm/143-r.pgm | \
pgmtopbm -threshold -value 0.51 >143-r.pbm
Remark: it's a good idea to inspect the PBM files, or at least
some of them. Although selthresh produced good results for us,
your mileage may vary.
The output of selthresh (needed to extract the per-page best
thresholds) can be re-generated as many times as needed. Just
repeat the same selthresh command: once all computations are
finished, the script simply reads the results from selthresh.out
and prints them.
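Once the output is captured, the per-page thresholds can be
extracted mechanically. The sketch below assumes the output has
exactly the layout of the transcripts shown in this section (one
"name.pgm threshold" line per page, optionally followed by a "bad
interval" note); the real output may differ, and the function
names are hypothetical.

```python
import re

# One result line, e.g. "143-l.pgm 0.49" or
# "143-l.pgm 0.32 (bad interval, try -l 0.30 0.4)".
RESULT = re.compile(r"^(\S+)\.pgm\s+([0-9.]+)\s*(\(bad interval.*\))?\s*$")

def parse_selthresh(text):
    """Map page base name -> (threshold string, bad-interval flag)."""
    best = {}
    for line in text.splitlines():
        m = RESULT.match(line.strip())
        if m:
            best[m.group(1)] = (m.group(2), m.group(3) is not None)
    return best

def conversion_commands(best, scale=2):
    """Build one pnmenlarge | pgmtopbm pipeline per trusted page."""
    cmds = []
    for page in sorted(best):
        threshold, bad = best[page]
        if bad:
            continue    # rerun selthresh with a wider interval first
        cmds.append(
            "pnmenlarge %d ../pgm/%s.pgm | pgmtopbm -threshold -value %s >%s.pbm"
            % (scale, page, threshold, page)
        )
    return cmds
```

Pages flagged with a bad interval are skipped on purpose, so that
no PBM file is produced from a threshold selthresh itself
considers suspicious.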
A final warning: selthresh may be fooled by images that are too
dark. If the right limit of the interval is much larger than it
should be, selthresh may produce bad results, so choose it
carefully. As practical advice, keep in mind that the best
threshold for most images is less than 0.6. In the near future
we'll use statistical measurements to choose the interval to
analyse, in order to prevent such problems and to make a manual
choice unnecessary.
remark: the tarball also includes an alternative selthresh, named
slethresh_fidian.pl. It contains instructions on how to use it.
Avoiding or correcting skew
---------------------------
Sometimes the printing is skewed relative to the paper margins.
Skew is a problem for the OCR heuristics. As the Clara OCR engine
just detects components by pixel contiguity and builds classes of
symbols, in practice the effect of skew will be a larger number
of patterns, and therefore a larger revision cost.
In some cases, careful manual scanning can solve the problem.
When acceptable, a set-square helps: just align one text line
with one edge of the set-square and the edge of the scanner glass
with the other edge (we're supposing that the binding was
disassembled).
The bundled preprocessor now includes a method to compute and
correct skew, but it's not on by default. To activate it, enter
the TUNE tab and select the "Use deskewer" checkbox. Now
deskewing will be applied when the OCR button is pressed (or when
the "Preprocessing" OCR step is requested). Note that
preprocessing is called only once per page, so if the page was
already preprocessed, it won't be deskewed.
Skeleton tuning
---------------
Currently, symbol classification can be performed by three
different classifiers: skeleton fitting, border mapping or pixel
distance. The choice is made on the TUNE tab. Border mapping is
currently experimental. Pixel distance has been used as an
auxiliary classifier. Skeleton fitting is more mature code and is
highly customizable. For now, it's the default classification
method.
When using skeleton fitting, two symbols are considered similar
when each one contains the skeleton of the other. So the
classification result depends strongly on how skeletons are
computed. As an example, the figure presents one symbol
("e"). The symbol black pixels are the dots ('.'). The skeleton
black pixels are stars ('*').
.......
..******..
.*. ..*..
..*. ...*.
.*.. ...*..
..*.........*..
..***********..
..*. ....
..*.
..*..
..*... ...
..*..........
..********..
.........
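The similarity rule stated above (each symbol contains the
skeleton of the other) can be sketched with sets of pixel
coordinates. This is a simplified illustration: it assumes both
symbols are already aligned on a common grid and ignores any
tolerance the real classifier may apply. The function names are
hypothetical.

```python
def contains_skeleton(symbol, skeleton):
    """True if every skeleton pixel is a black pixel of the symbol.

    Both arguments are sets of (x, y) coordinates.
    """
    return skeleton <= symbol

def similar(symbol_a, skeleton_a, symbol_b, skeleton_b):
    """Mutual containment: A holds B's skeleton and B holds A's."""
    return (contains_skeleton(symbol_a, skeleton_b)
            and contains_skeleton(symbol_b, skeleton_a))
```

A thinner skeleton is contained by more symbols, so it makes the
test more permissive; a thicker one makes it stricter. This is why
the result depends strongly on how skeletons are computed.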
Clara OCR offers seven different methods for computing
skeletons. Each method has tunable parameters. The method and the
parameters can be chosen through a visual interface on the TUNE
(SKEL) tab. To try it, first save the session (menu "File"), then
enter that tab. At least one pattern must exist. Vary the
parameters and observe the results. Press the left and right
arrows to navigate through the patterns, and use the "zoom"
button to choose a comfortable image size. The last selection
will be used for all skeleton computations. To discard it, exit
Clara OCR without saving the session.
Instead of using the TUNE (SKEL) tab, it's possible to specify
skeleton computation parameters through the -k command-line
switch. Note however that if a selection was made through the
TUNE (SKEL) tab, that selection will override the parameters
given to -k, so be careful.
Clara OCR has an auto-tune feature to choose the "best" skeleton
computation parameters. To use it, check the "Auto-tune skeleton
parameters" entry on the TUNE tab. This feature is currently left
off by default because manual tuning can achieve better
results. Examples of manually chosen parameters:
1. Quality printing without thin details
use -k 2,1.4,1.57,10,3.8,10,4,4
or -k 0,1.4,1.57,10,3.8,10,4,4
2. Quality printing with thin details
use -k 2,1.4,1.57,10,3.8,10,1,1
or -k 4,,,,,,3,
3. Poor printing without thin details
use -k 2,1.4,1.57,10,3.8,10,1,1
4. Poor printing with thin details
use -k 2,1.4,1.57,10,3.8,10,1,1
Although the skeleton computation parameters may be changed
along the way, it's wise to choose adequate parameters before
OCRing, and keep them fixed throughout the project. Every time
Clara OCR is started, pass the same parameters. In our case, we
can use the default parameters. To do so, just start Clara OCR as
before:
$ cd /home/clara/books/BC/pbm
$ clara &
Classification tentatives
-------------------------
To classify the book symbols (i.e. to discover the
transliteration of unknown symbols using the patterns), enter
Clara OCR, select "Work on all pages" ("Options" menu) and press
the OCR button using the mouse button 1, or press the mouse
button 3 and select "Classification". The classification may be
performed many times. Each time, different parameters may be
tried to refine the results already achieved.
When the classification finishes, observe the pages 5.pbm and
6.pbm. Most probably, some symbols will be greyed. In other
words, the classifier was unable to classify all symbols. The
statistics presented on the PAGE (LIST) tab may be useful now. To
reduce the number of unknown symbols there are three choices: add
more patterns, change the skeleton computation parameters, or try
another classifier.
To add more patterns, just train some greyed symbols and
reclassify all pages again. The reclassification will be faster
than the first classification because most symbols, already
classified, won't be touched.
To change the skeleton computation parameters, exit Clara OCR,
restart it passing the new parameters through -k, select
"Re-scan all patterns" ("Edit" menu), select "Work on all pages"
("Options" menu) and reclassify. It may be easier to choose and
set the new parameters using the TUNE (SKEL) tab, as explained
earlier. However, remember that the parameters chosen through the
TUNE (SKEL) tab override the parameters given through -k.
To try another classifier, first select the "Re-scan all
patterns" entry on the "Edit" menu. Then enter the TUNE tab and
select the classifier to use from the available choices
(skeleton fitting, border mapping and pixel distance). Pixel
distance may be a good choice. Then reclassify all pages.
The "Re-scan all patterns" step is required because for each
symbol Clara OCR remembers the patterns already tried to classify
it, and does not try those patterns again. However, when the
skeleton computation parameters change, or when the classifier
changes, those same patterns must be tried again. Maybe in the
future Clara OCR will decide by itself about re-scanning all
patterns.
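The bookkeeping behind this behaviour can be pictured with the
sketch below. It is a hypothetical model, not Clara's actual data
structures: each symbol is a dictionary that keeps the set of
pattern ids already tried, and the match callback stands for
whatever classifier is selected.

```python
def classify(symbol, patterns, match):
    """Try only the patterns not yet tried on this symbol.

    symbol: dict holding the symbol data plus a "tried" set.
    patterns: dict mapping pattern id -> pattern.
    match: hypothetical callback (symbol, pattern) -> bool.
    Returns the number of comparisons actually performed.
    """
    tried = symbol.setdefault("tried", set())
    comparisons = 0
    for pid in sorted(patterns):
        if pid in tried:
            continue            # remembered from an earlier run
        tried.add(pid)
        comparisons += 1
        if match(symbol, patterns[pid]):
            symbol["class"] = pid
            break
    return comparisons

def rescan_all_patterns(symbols):
    """Forget which patterns were tried, so all are tried again."""
    for symbol in symbols:
        symbol["tried"] = set()
```

In this model a second classification run costs nothing for
patterns already tried, which matches the faster reclassification
described earlier, while a change of parameters or classifier
requires calling rescan_all_patterns first.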
Symbol properties
-----------------
The bottom five buttons (alphabet, pattern type, "bold", "italic"
and "bad") carry the properties of the current symbol. If the
"PAGE" window is on the plate, the current symbol is the one
identified by the graphic cursor. If the window "PATTERN" is on
the plate, the current symbol is the pattern being exhibited. In
all other cases, the current symbol is undefined.
Let's describe in detail the symbol properties carried by those
five buttons:
a. The possible values for the alphabet are: latin, greek,
cyrillic, hebrew, arabic, kana, number, ideogram or "other". In
order to limit the available alphabets, the button circulates
only the values selected on the "Alphabet" menu.
b. The "pattern types" are the fonts and font sizes used by the
book. Example: 12pt roman and 12pt arial for the text, and 8pt
roman for the footnotes. In this case we have three "types"
identified as "1", "2" and "3".
c. Each one of the bold, italic and "bad" flags may be on or
off. The "bad" flag identifies a symbol not to be used as
pattern.
The user can tell Clara OCR any of these properties for the
current symbol by selecting the desired value on the
corresponding button (click it one or more times). The pattern
type, however, is read-only by default. To allow changing its
value, use the "pattern types are read-only" entry on the
"Options" menu.
In most cases, Clara OCR will compute automatically the
properties of each symbol, so it's not required to set them
manually. But just like the transliterations, Clara OCR will need
some initial information, so the user must identify some symbols
as being bold or italicized.
Merge tuning
------------
merge internal fragments
merge pieces on one same box
merge close fragments
recognition merging
learned merging
Complex procedures
------------------
OCRing an entire book is a long process, and problems may be
detected along the way: a bad choice of skeleton computation
parameters, a bad page contaminating the bookfont, some files
lost due to a crash, etc. How can they be solved?
Clara OCR does not currently offer a complete set of tools to
solve all these problems. In some cases, a simple solution is
available. In others, a solution is expected to become available
in future versions. This section will describe some practical
cases, and explain what can be done and what cannot be done for
each one.
Fixing transliterations
-----------------------
Fixing pattern transliterations
Fixing symbol transliterations
Removing patterns and synchronizing pages
-----------------------------------------
Removing references to that pattern
on the loaded page
on other pages
on the pattern types
Removing a page
---------------
From the stats presented by the PAGE (LIST) tab it's possible to
detect problems on specific pages. A low factorization may be a
symptom of a bad choice of brightness for that page. In such a
case, it's probably a good idea to remove that page completely.
Removing a page is a delicate operation. Clara OCR currently
does not offer a "remove page" feature. Basically, such a feature
should remove all patterns from that page, remove the revision
data acquired from that page, and remove the page image and its
session file.
Dealing with classification errors
----------------------------------
What to do when the OCR classifies incorrectly a large quantity of
symbols? (to be written)
Importing revision data
-----------------------
When OCRing a large book, a good approach is to divide its pages
into a number of smaller sections and OCR each one. So for a book
with, say, 1000 pages, we could OCR pages 1-200, then 201-400,
etc.
After finishing the first section, of course we want to reuse on
the second section the training and revision effort already
spent. This is not the same as adding the pages 201-400 to the
first section, because we do not want to handle the pages 1-200
anymore.
Basically we need to import the patterns of the first section
when starting to process the second. Unfortunately, Clara OCR is
currently unable to perform this operation.
How to use the web interface
----------------------------
The Clara OCR web interface allows remote training of symbols. To use
it, a web server able to run perl CGIs (e.g. Apache) is
required. Let's present the steps to activate the web interface for a
simple case, with only one book (named "book1"). Basically, one needs
to create a subtree anywhere on the server disk (say,
"/home/clara/www/"), owned by the user that will manage the project
(say, "clara"), with subdirectories, "bin", "book1" and
"book1/doubts":
$ id
uid=511(clara) gid=511(clara) groups=511(clara)
$ cd /home/clara/
$ mkdir www
$ cd www
$ mkdir bin book1
$ mkdir book1/doubts
Then copy to the directory "bin" the files clara.pl and sclara.c from
the Clara OCR distribution (say, /usr/local/src/clara), edit sclara.c
to change the hardcoded definition of the root directory to
"/home/clara/www", compile it and make it setuid:
$ cd bin
$ cp /usr/local/src/clara/clara.pl .
$ cp /usr/local/src/clara/sclara.c .
$ emacs sclara.c
$ grep '^char \*root' sclara.c
char *root = "/home/clara/www";
$ cc -o sclara -static sclara.c
$ rm sclara.c
$ chmod a+s sclara
Edit the script clara.pl. Example for the clara.pl configuration
section (the script clara.pl contains default definitions for some of
these variables, please comment out those definitions):
$CROOT = "/home/clara/www";
$U = "/cgi-bin/clara";
$book[0] = 'Author, <I>Test 1</I>, City, year';
$subdir[0] = "book1";
$LANG = 'en';
$opt = '-W -R 10 -b -k 2,1.4,1.57,10,3.8,10,4,1';
Now copy the PBM files to the directory "book1", create low-quality
jpeg previews, gzip the PBM files, and select some patterns:
$ cd /home/clara/www/book1
$ cp /usr/local/src/clara/imre.pbm .
$ pbmreduce 8 imre.pbm | convert -quality 25 - imre.jpg
$ gzip -9 imre.pbm
$ clara -k 2,1.4,1.57,10,3.8,10,4,1
(load one PBM file, train some symbols, save the session and quit the
program).
Now we need to process the PBM files in order to create some
"doubts". The script clara.pl also requires a symlink to the clara
binary (change the path /usr/local/bin/clara as required):
$ cd /home/clara/www/bin
$ ln -s /usr/local/bin/clara clara
$ ./clara.pl -s book1
$ rm ../book1/*html
$ ./clara.pl -p
Now your server must be instructed to exec /home/clara/www/bin/clara.pl
when a visitor requests "/cgi-bin/clara" (if you prefer another URL,
change the clara.pl customization too). An easy way to accomplish that
is creating a symlink on the default directory for CGIs. The default
directory of CGIs is platform-dependent (e.g. /home/httpd/cgi-bin,
/usr/local/httpd/cgi-bin, /var/lib/apache/cgi-bin, etc). Example:
# cd /home/httpd/cgi-bin
# ln -sf /home/clara/www/bin/clara.pl clara
Try to access the URL "/cgi-bin/clara" on your web server. The correct
behaviour is successfully loading a page entitled "Prototype of the
Cooperative Revision". If it doesn't work, check for these common
problems:
1. Apache expects to be explicitly allowed to follow symlinks. The
file access.conf should contain, in our case, a section similar to the
following:
<Directory /home/httpd/cgi-bin>
AllowOverride None
Options ExecCGI FollowSymLinks
</Directory>
2. The directory /home/clara must be world readable:
# ls -ld /home/clara
drwxr-xr-x 4 clara clara 1024 Sep 17 09:56 /home/clara
If you succeeded, congratulations! Note that from time to time it'll
be necessary to reprocess the pages, adding to the session files the
data collected from the web, just like done before:
$ cd /home/clara/www/bin
$ ./clara.pl -p
$ ./clara.pl -s book1
Revision acts maintenance
-------------------------
Types of revision acts (to be written).
Discarding deduced data (to be written).
*/
/* (devel)
Bugs and TODO list
------------------
(Some) Major tasks
1. Vertical segmentation (partially done).
2. Heuristics to merge fragments.
3. Spelling-generated transliterations
4. Geometric detection of lines and words
5. Finish the documentation
6. Simplify the revision acts subsystem
Minor tasks
1. Change sprintf to snprintf.
2. Fix asymmetric behaviour of the function "joined".
3. Optimize bitmap copies to copy words, not bits, where possible
(partially done).
4. Support Multiple OCR zones (partially done).
5. Make sure that the access to the data structures is blocked
during OCR (all functions that change the data structures must
check the value of the flag "ocring").
6. Use 64-bit integers for bitmap comparisons and support
big-endian CPUs (partially done).
7. Clear memory buffers before freeing.
8. Allow the transliterations to refer multiple acts (partially
done).
9. Rewrite composition of patterns for classification of linked
symbols.
10. The flea stops but does not disappear when the window loses
and regains focus.
11. Replace various magic numbers with per-density and
per-minimum-fontsize values.
12. Synchronization destroys the result of partial matching
because partial matching assigns to the symbol only one
pattern as its best match.
*/
/* (book)
Welcome to Clara OCR
--------------------
Clara is an optical character recognition (OCR) program: it
tries to identify the graphic images of the characters in a
scanned document, converting their digital images to ASCII, ISO
or other codes.
The name Clara stands for "Cooperative Lightweight chAracter
Recognizer".
Clara offers two revision interfaces: a standalone GUI and a
web interface, which can be used by several reviewers
simultaneously. Because of this feature Clara is a "cooperative"
OCR (it's also "cooperative" in the sense of its free/open status
and development model).
*/
/* (book)
The requirements
----------------
Clara OCR will run on a PC (386, 486 or Pentium) with GNU/Linux
and the X Window System. Clara OCR will hopefully compile and run
on a PC with any Unix-like operating system and X. Currently
Clara OCR won't run on big-endian CPUs (e.g. Sparc) nor on
systems lacking X support (e.g. MS-Windows). Higher-level
libraries like Motif, GTK or Qt are not required.
A relatively fast CPU is recommended (300MHz or more). Memory
usage depends on the documents, and may range from a few
megabytes to several tens of megabytes. Normal operation will
create session files on your hard disk, so some megabytes of
free disk space are required (a large project may require plenty
of gigabytes). Clara OCR can read and write gzipped files (see
the -z command-line switch).
If you need to build the executable and/or the documentation,
then an ANSI C compiler (with some GNU extensions) and a (version
5) perl interpreter are required.
How to download and compile Clara
---------------------------------
For those who need to download and compile the source code
(hopefully this will be unnecessary for most users as soon as
Clara binary distributions become available), it may be
downloaded from CLARA_HOME. It's a
compressed tar archive with a name like clara-x.y.tar.gz (x.y is
the version number).
The compilation will generally require no more than issuing the
following commands at the shell prompt:
$ gunzip clara-x.y.tar.gz
$ tar xvf clara-x.y.tar
$ cd clara-x.y
$ make
$ make doc
Now you can copy the executable (the file "clara") to some
directory of binaries (like /usr/local/bin), and the man page
(file "clara.1") to some directory of man pages (like
/usr/local/man/man1). For now there is no "make install" to
perform these copies automatically.
If some of these steps fail, please try to obtain assistance from
your local experts. They will solve most simple problems
concerning wrong paths or compiler options. You can also read the
subsection "Compilation and startup pitfalls".
Compilation and startup pitfalls
--------------------------------
This subsection is intended to help people that are experiencing
fatal errors when building the executable or when starting
it. After each error message we'll point out some hints.
Bear in mind that most hints given below are very elementary
concerning Unix-like systems. If you have problems, try to read
all hints because details explained once are not repeated. If you
cannot understand them, please try to ask your local experts, or
try to read an introductory book on Unix things. Please don't
email questions like these to the Clara developers, except when
the hint suggests it.
1. Path-related pitfalls
$ make
bash: make: command not found
The shell could not find the "make" utility. Maybe there is no
such utility installed on your system, or maybe the path to it is
unknown to the shell. You can try to find the "make" utility with
a command like
$ find /usr -name make -print
The following command will display the current path:
$ echo $PATH
Remember that on Unix-like systems the environment is
per-process. So if you change the PATH variable on the shell
prompt within an xterm, this won't affect the other running
shells (on the other xterms). Remember that the Unix shells
expect to be explicitly informed about which variables must be
exported to subprocesses (use "export" in Bourne-like shells and
"setenv" on C-like shells).
$ make
gcc -I/usr/X11R6/include -g -c gui.c -o gui.o
make: gcc: Command not found
make: *** [gui.o] Error 127
The make utility could not find the gcc compiler. Check if gcc is
installed. If not, check if some other C compiler is installed
(for instance, "cc"), and edit the makefile to change the value
of the CC variable.
If you don't know what this means, take a look at the directory
where the Clara sources are, and you'll see there a file named
"makefile". This file contains the names of the tools to be used
and the rules to build the Clara executable. It also contains
important paths, like those where the system headers (.h files)
and libraries can be found. If the names or the paths don't
reflect those on your system, you need to edit the makefile
accordingly.
$ make
gcc -I/usr/X11R6/include -g -c gui.c -o gui.o
In file included from gui.c:16:
gui.h:12: X11/Xlib.h: No such file or directory
make: *** [gui.o] Error 1
The compiler could not find the header Xlib.h. Maybe your system
does not include this header, or maybe it is in another directory
not listed in the makefile through the INCLUDE variable.
$ make
gcc -o clara clara.o skel.o gui.o mc.o ...
/usr/bin/ld: cannot open -lX11: No such file or directory
make: *** [clara] Error 1
The linker could not find the X11 library. Maybe your system does
not include this library, or maybe it is in another directory not
listed in the makefile through the LIBPATH variable.
2. Compilation pitfalls
$ make
gcc -I/usr/X11R6/include -g -c clara.c -o clara.o
clara.c:70: parse error before `int'
make: *** [clara.o] Error 1
A syntax error on line 70 of the file clara.c. Double check that
the sources were not changed. Try to obtain the sources
again. If you're a programmer, try to fix the problem. In any
case, report it to claraocr@claraocr.org.
$ make
clara.c: In function `process_cl':
clara.c:2293: `ZPS' undeclared (first use in this function)
clara.c:2293: (Each undeclared identifier is reported only once
clara.c:2293: for each function it appears in.)
make: *** [clara.o] Error 1
A reference to an undeclared variable. Double check if the
sources were not changed. Try to obtain the sources again. If
you're a programmer, try to fix the problem. In any case, report
it to claraocr@claraocr.org.
3. Runtime pitfalls
$ clara &
[1] 1924
bash: clara: command not found
The Clara executable does not exist or is not on the path. Most
Unix systems don't include the current directory ("./") on the
path, so if you're trying to start Clara from the directory where
it was compiled, specify the current directory ("./clara").
$ ./clara &
[1] 1922
_X11TransSocketUNIXConnect: Can't connect: errno = 111
cannot connect to X server
Clara could not connect to the X server. The X Window System is
a client-server system. The applications (xterm, xclock, etc)
connect to a display server (the X server). If the server is not
running, clients cannot connect to it. In some cases, the client
must be told explicitly which server to connect to, using the
environment variable DISPLAY.
$ ./clara
Segmentation fault (core dumped)
If you can reproduce the problem, report it
to claraocr@claraocr.org. If you're a programmer and Clara was
compiled with the -g option, try a debugger to locate the point
of the source code where the segmentation fault happened. Using
gdb, it's quite easy:
$ gdb clara
(gdb) run
Now try to reproduce the steps that led to the segmentation
fault.
*/
/* (tutorial)
Making OCR
----------
This section is a tutorial on the basic OCR features offered by
Clara OCR. Clara OCR is not simple to use. A basic knowledge
about how it works is required for using it. Most complex
features are not covered by this tutorial. If you need to compile
Clara from the source code, read the INSTALL file and check (if
necessary) the compilation hints on the Clara OCR Advanced User's
Manual.
Starting Clara
--------------
So let's try it. Of course we need a scanned page to do so. Clara
OCR requires images in PBM or PGM format (TIFF and others must be
converted; the netpbm package contains various conversion
tools). The Clara distribution package contains one small PBM
file that you can use for a first test. The name of this file is
imre.pbm. If you cannot locate it, download it or other files
from CLARA_HOME. Alternatively, you can produce your own 600-dpi
PBM or PGM files scanning any printed document (hints for
scanning pages and converting them to PBM are given on the
section "Scanning books" of the Clara OCR Advanced User's
Manual).
Once you have a PBM or PGM file to try, cd to the directory where
the file resides and fire up Clara. Example:
$ cd /tmp/clara
$ clara &
In order to make OCR tests, Clara will need to write files in
that directory, so write permission is required, as well as some
free space.
Remark: As of version CLARA_VERSION, Clara OCR heuristics are tuned
to handle 600 dpi bitmaps. When using a different resolution,
specify it using the -y switch:
$ clara -y 300 &
Then a window with menus and buttons will appear on your X
display:
+-----------------------------------------------+
| File Edit OCR ... |
+-----------------------------------------------+
| +--------+ +----+ +--------+ +-------+ |
| | zoom | |page| |patterns| | tune | |
| +--------+ +-+ +-+ +-+ +-+ |
| +--------+ | +-------------------------+ | |
| | zone | | | | | |
| +--------+ | | | | |
| +--------+ | | | | |
| | OCR | | | WELCOME TO | | |
| +--------+ | | | | |
| +--------+ | | C L A R A O C R | | |
| | stop | | | | | |
| +--------+ | | | | |
| . | | | | |
| . | | | | |
| | | | | |
| | | | | |
| | +-------------------------+ | |
| +-----------------------------+ |
| |
| (status line) |
+-----------------------------------------------+
Welcome aboard! The rectangle with the welcome message is called
"the plate". As you already guessed, the small rectangles with
the labels "zoom", "OCR", "stop", etc, are "the buttons". The
"tabs" are those flaps labelled "page", "patterns"
and "tune". On the menu bar you'll find the File menu, the Edit
menu, and so on. Popup the "Options" menu, and change the current
font size for better visualization, if required.
Press "L" to read the GPL, or select the "page" tab, and
subsequently, select on the plate the imre.pbm page (or any other
PBM or PGM file, if any). The OCR will load that file showing the
progress of this operation on the status line on the bottom of
the window.
note: the "page" tab is the flap labelled "page". This is
unrelated to the "tab" key.
When the load operation completes, Clara will display the
page. Press the OCR button and wait a bit. The letters will
become grayed and the plate will split into three windows. Move
the pointer along the plate and you'll see the tab label follow
the current window: "page", "page (output)" or "page
(symbol)". Move the pointer along the entire application window,
and, for most components, you'll see a short context help message
on the status line when the pointer reaches it (the buttons, for
instance). Dialogs (user confirmations) also use the status line
(like Emacs), instead of dialog boxes.
You can resize either the Clara application window or each of the
three windows currently on the plate ("page", "page (output)" and
"page (symbol)"). To resize the windows, select any point between
two of them and drag the mouse. The scrollbars can be hidden
(use the "hide scrollbars" entry on the View menu).
When the tab label is "page", press the "zoom" button using the
mouse button 1 and the scanned image will zoom out. If you use
the mouse button 3, the image will zoom in (the behaviour of the
"zoom" button depends on the current window).
Now try selecting the "page" tab many times, and you will cycle
through the various display modes shared by this tab. These
modes are, and will be referred to as, "PAGE", "PAGE (fatbits)"
and "PAGE (list)". Each display mode may have one or more
windows. We've chosen this uncommon approach because an excess of
tabs turns them into a useless decoration. The other tabs also
offer various modes; some will be presented later in this
tutorial.
Some few command-line switches
------------------------------
Besides the -y option used in the last subsection, Clara accepts
many others, documented in the Clara OCR Advanced User's
Manual. For now, among the various ways to start Clara, we'll
limit ourselves to a few examples:
clara
clara -h
In the first case, Clara is just started. In the second, it will
display a short help and exit.
clara -f path
clara -f path -w workdir
The option -f gives the relative or absolute path of a scanned
page or a directory with scanned pages (PBM or PGM files). The
option -w gives the relative or absolute path of a work
directory (where Clara will create the output and data files).
clara -i -f path -w workdir
clara -b -f path -w workdir
The option -i activates dead keys emulation for composition of
accents and characters. The -b switch is for batch
processing. Clara will automatically perform one OCR run on the
file given through -f (or on all files found, if it is the
path of a directory) and exit without displaying its window.
clara -Z 1 -F 7x13
Clara will start with the smallest possible window size.
A full reference of command-line switches is given on the section
"Reference of command-line switches" of the Clara OCR Advanced
User's Manual.
Training symbols
----------------
Yes, Clara OCR must be trained. Training is a tedious procedure,
but it's a must for those who need a customizable OCR, apt to
adapt to a perhaps uncommon printing style.
Before training, a process called segmentation must be
performed. Press the right button of the mouse over the OCR
button, select "Segmentation" on the menu that pops up and
wait for the operation to complete.
Now, on the "page" tab, observe the image of the document
presented on the top window. You'll see the symbols greyed,
because the OCR currently does not know their
transliterations. Try to select one symbol using the mouse (click
the mouse button 1 over it). A black elliptic cursor will appear
around that symbol. This cursor is called the "graphic
cursor". You can move the graphic cursor around the document
using the arrow keys.
Now observe the bottom window on the "page" tab. That window
presents some detailed information on the current symbol (that
one identified by the graphic cursor). When the "show web clip"
option on the "View" menu is selected, a clip of the document
around the current symbol is displayed too. In some cases, this
clip is useful for better visualization. It is called "web clip"
because this same image is exported to the Clara OCR web
interface when cooperative training and revision through the
Internet is being performed.
To inform the OCR about the transliteration of one symbol, just
type the corresponding key. For instance, if the current symbol
is a letter "a", just type the "a" key. Observe that the trained
symbol becomes black. Each symbol trained will be learned by the
OCR, its bitmap will be called a "pattern", and it will be used
as such when trying to deduce the transliteration of unknown
symbols.
Remark: in our test, the user chose the symbol to be trained. However,
Clara OCR can choose by itself the symbols to be trained. This feature
is called "build the bookfont automatically" (found on the "tune"
tab). To use it, select the corresponding checkbox and classify the
symbols as explained later.
Finally, when the transliteration cannot be entered through a
single keystroke or composition (for instance when you wish to
enter a TeX macro as the transliteration of the current
symbol), type the transliteration into the text input
field on the bottom window (select it using the mouse first).
Symbol properties
-----------------
Obs: most features described in this paragraph are still
experimental.
The bottommost three buttons (in this order: alphabet, pattern
type, and "bad") show properties of the current symbol.
If a symbol is defective, it's generally useful not to use it as a
pattern. In such a case, when informing the symbol
transliteration, press the ESC key once before training that
symbol (or press the BAD button). The OCR will mark that symbol
as "bad".
The behaviour of the "alphabet" button is as follows: by default
it is in the state "other". If the current symbol is trained as a
latin letter ('a', 'b', 'c', etc), this property is automatically
set to "latin". If the current symbol is trained as a decimal
digit ('0', '1', etc), this property is automatically set to
"number". If the button state is manually set to "greek" and a
letter is input from a latin keyboard, it will be automatically
mapped to the corresponding greek letter ("a" to "alpha", "b" to
"beta", etc). Note that the alphabet button cycles only through those
alphabets selected on the "Alphabets" menu. For now, Clara OCR
does not include mappings for other alphabets.
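The key mapping described above can be pictured as a small lookup
table. The sketch below is only an illustration: the function name
latin_to_greek and the table are hypothetical, and only the pairs
"a" to "alpha" and "b" to "beta" come from this text (the other
entries are assumptions).

```c
#include <stddef.h>

// Illustrative table only: "a" -> alpha and "b" -> beta come from the
// text; the remaining pairs are assumptions about a plausible mapping.
static const struct { char key; const char *name; } greek_map[] = {
    { 'a', "alpha" }, { 'b', "beta" }, { 'g', "gamma" }, { 'd', "delta" },
};

// Return the Greek letter name for a Latin key, or NULL if unmapped.
const char *latin_to_greek(char key)
{
    size_t i;

    for (i = 0; i < sizeof greek_map / sizeof greek_map[0]; i++)
        if (greek_map[i].key == key)
            return greek_map[i].name;
    return NULL;
}
```

A key with no entry in the table yields NULL, which a caller could
treat as "train the symbol with the key as typed".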
The "pattern types" button presents the classification of the
symbol regarding the font types (Clarendon, Times, etc) and sizes
(9pt, 10pt, etc) found in the book. It's not mandatory to
classify the patterns, and there is some preliminary code to
perform this classification automatically. However, it's
currently expected to be performed manually, if desired. For
instance: first train some symbols, all of the same type and
size. All newly created patterns are put on type 0. Then use the
"set pattern type" item on the Edit menu to change their types from 0 to
some other of your choice.
Saving the session
------------------
Before going further, it's important to know how to save your
work. The file menu contains one item labelled "save
session". When selected, it will create or overwrite three files
on the working directory: "patterns", "acts" and "page.session",
where "page" is the name of the file currently loaded, without
the "pbm" or "pgm" tag (in our example, "imre"). So, to remove
all data produced by OCR sessions, manually remove the files
"*.session", "patterns" and "acts".
Note that the files "patterns" and "acts" are shared by all PBM
or PGM pages, so a symbol trained from one page is reused on the
other pages. The ".session" files however are per-page. Pages
with the same graphic characteristics, and only those, must be put
in the same directory, in order to share the same patterns.
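As an illustration of the naming rule for the per-page session
file, the following C sketch derives "page.session" from the name
of the loaded page. The function session_name is hypothetical; it
is not taken from the Clara sources and only mirrors the rule
stated above.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical helper: derive "page.session" from "page.pbm" or "page.pgm".
// Returns 0 on success, -1 if the name has no pbm/pgm tag or does not fit.
int session_name(const char *page_file, char *out, size_t outsz)
{
    const char *dot = strrchr(page_file, '.');
    int n;

    if (dot == NULL || (strcmp(dot, ".pbm") != 0 && strcmp(dot, ".pgm") != 0))
        return -1;
    n = snprintf(out, outsz, "%.*s.session", (int)(dot - page_file), page_file);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

For the page "imre.pbm" this produces "imre.session"; the shared
files "patterns" and "acts" need no such derivation.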
When the "quit" option of the "File" menu is selected, the OCR
prompts the user for saving the session (answer pressing the key
"y" or "n"), unless there are no unsaved changes.
OCR steps
---------
The OCR process is divided into various steps, for instance
"classification", "build", etc. These steps are accessible by clicking
the mouse button 3 over the OCR button. Each one can be started
independently and/or repeated at any moment. In fact, the more
you know about these steps, the better you'll use them.
When the "OCR" button is clicked with the mouse button 1, all steps
are started in sequence. The "OCR" button remains in the
"selected" state while some step is running.
Although we won't cover this in the tutorial, a basic knowledge
of what each step performs is required for fine-tuning Clara OCR.
Tuning is an interactive effort where the usage of the
heuristics alternates with training and revision, guided by the
user's experience and feeling.
Classification
--------------
After training some symbols, we're ready to apply the just
acquired knowledge to deduce the transliteration of non-trained
symbols. For that, Clara OCR will compare the non-trained symbols
with those trained ("patterns"). Clara OCR offers nice visual
modes to present the comparison of each symbol with each
pattern. To activate the visual modes, enter the View menu and
select (for instance) the "show comparisons" option.
Now start the "classification" step (click the mouse button 3
over the OCR button and select the "classification" item) and
observe what happens. Depending on your hardware and on the size
of the document, this operation may take long to complete
(e.g. 5 minutes). Hopefully it'll be much faster (say, 30
seconds).
When the classification finishes, observe that some nontrained
symbols became black. Each such symbol was found similar to some
pattern. Select one black symbol, and Clara will draw a gray
ellipse around each class member (except the selected symbol,
identified by the black graphic cursor). You can switch off this
feature unselecting the "Show current class" item on the "View"
menu.
In some cases, Clara will classify incorrectly some symbols. For
instance, a defective "e" may be classified as "c". If that
happens, you can inform Clara about the correct transliteration
of that symbol training it as explained before (in this example,
select the symbol and press "e"). This action will remove that
symbol from its current class, and will define a new class,
currently unitary and containing just that symbol.
Note about how Clara OCR classification works
---------------------------------------------
The usual meaning of "classification" for OCRs is to deduce for
each symbol if it is a letter "a" or the letter "b", or a digit
"1", etc. As the total number of different symbols is small (some
tens), there will be a small number of classes.
However, instead of classifying each symbol as being the letter
"a", or the digit "1", or whatever, Clara OCR builds classes of
symbols with similar shapes, not necessarily assigning a
transliteration for each symbol. As the bitmap
comparison heuristics sometimes consider two true letters "a" dissimilar
(due to printing differences or defects), the Clara OCR
classifier will break the set of all letters "a" into various
untransliterated subclasses.
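The idea of building classes by shape similarity, rather than by
transliteration, can be sketched as follows. This is a toy model
and not the Clara classifier: the 8x8 bitmaps, the pixel-count
distance, the threshold and all names are assumptions made for the
example.

```c
#define BM_ROWS 8           // toy bitmap: 8 rows of 8 pixels, one byte per row
#define MAX_CLASSES 64      // toy limit on the number of classes

// Count the pixels that differ between two toy bitmaps.
static int bitmap_distance(const unsigned char *a, const unsigned char *b)
{
    int r, d = 0;
    unsigned char x;

    for (r = 0; r < BM_ROWS; r++)
        for (x = a[r] ^ b[r]; x != 0; x &= (unsigned char)(x - 1))
            d++;
    return d;
}

// sym holds n bitmaps of BM_ROWS bytes each, stored one after another.
// Each symbol joins the first class whose representative differs by at
// most max_diff pixels; otherwise it opens a new class. cls[i] receives
// the class of symbol i. Returns the number of classes, or -1 on overflow.
int build_classes(const unsigned char *sym, int n, int max_diff, int *cls)
{
    int rep[MAX_CLASSES];   // index of the representative of each class
    int nclasses = 0, i, c;

    for (i = 0; i < n; i++) {
        for (c = 0; c < nclasses; c++)
            if (bitmap_distance(&sym[i * BM_ROWS], &sym[rep[c] * BM_ROWS]) <= max_diff)
                break;
        if (c == nclasses) {
            if (nclasses == MAX_CLASSES)
                return -1;
            rep[nclasses++] = i;
        }
        cls[i] = c;
    }
    return nclasses;
}
```

With a strict threshold, two bitmaps of the same letter that differ
by a single pixel already fall into different classes, which is the
splitting effect described above.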
Therefore, the classification result may be a much larger number
of classes (thousands or more), not only because of those small
differences or defects, but also because the classification
heuristics are currently unable to scale symbols or to "boldfy"
or "italicize" a symbol.
Note that each untransliterated subclass of letters "a" depends
on a specific human revision effort to become transliterated
(trained). This is not an absurd strategy, because the revision
of each subset corresponds to part of the unavoidable human
revision effort required by any real-life digitization
project. This is one of the principles that make it possible to see
Clara OCR not as a traditional OCR, but as a productivity tool
able to reduce costs. Anyway, we expect future
improvements to the Clara OCR classifier, in order to lessen the
number of subclasses created.
Building the output
-------------------
Now we're ready to build the OCR output. Just start the
"build" step. The action performed will be basically
to detect text words and lines, and output the transliterations,
trained or deduced, of all symbols. The output will be presented
on the "PAGE (output)" window.
Each character on the "PAGE (output)" window behaves like an
HTML hyperlink. Click it to select the current symbol both
on the "PAGE" window and on the "PAGE (symbol)" window. Note
that the transliteration of unknown symbols is replaced by
their internal IDs (for instance "[133]").
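The substitution of unknown symbols by their IDs can be pictured
with the sketch below. It is only an illustration: the struct
toy_symbol and the function toy_build_output are hypothetical, and
the real build step also detects words and lines, which is not
modelled here.

```c
#include <stdio.h>
#include <string.h>

// Toy symbol record: tr is the transliteration, or NULL when still unknown.
struct toy_symbol { int id; const char *tr; };

// Append the transliteration of each symbol to out; unknown symbols are
// written as their internal ID between brackets, e.g. "[133]".
// Returns 0 on success, -1 if out is too small.
int toy_build_output(const struct toy_symbol *s, int n, char *out, size_t outsz)
{
    size_t used = 0;
    int i, w;

    if (outsz == 0)
        return -1;
    out[0] = '\0';
    for (i = 0; i < n; i++) {
        if (s[i].tr != NULL)
            w = snprintf(out + used, outsz - used, "%s", s[i].tr);
        else
            w = snprintf(out + used, outsz - used, "[%d]", s[i].id);
        if (w < 0 || (size_t)w >= outsz - used)
            return -1;
        used += (size_t)w;
    }
    return 0;
}
```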
The result of the word detection heuristic can be visualized
checking the "show words" item on the "View" menu.
Handling broken symbols
-----------------------
Remark: As of version CLARA_VERSION the merging heuristics are only
partially implemented, and in most cases they won't produce any effect.
The build heuristics also try to merge the pieces of broken
symbols, like the "u", the "h" and the "E" on the figure
(observe the absent pixels). Some letters have thin parts, and
depending on the paper and printing quality, these parts will
break more or less frequently.
XXX XXXXXXXXXXX
XX XXX X
XX XXX
XX XXX
XXX XXX XX XXX XXX X
XX XX XXX X XXX XXXX
XX XX XX XX XXX X
XX XX XX XX XXX
XX XX XX XX XXX
XX XX XX XX XXX X
XX XXXX XXXX XXX XXXXXXXXXXX
Clara OCR offers three symbol merging heuristics:
geometric-based, recognition-based and learned. Each one may be
activated or deactivated using the "tune" tab.
Geometric merging applies to fragments on the interior of the
symbol bounding box, like the "E" on the figure, and to some other
cases too.
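A minimal sketch of the geometric test follows, assuming that each
symbol is described by a bounding box in page coordinates. The
struct toy_box and both functions are hypothetical; the actual
heuristic handles other cases too, as noted above.

```c
// Toy bounding box, in page pixel coordinates (inclusive limits).
struct toy_box { int l, r, t, b; };

// Nonzero if the fragment box lies entirely inside the symbol box.
int box_contains(struct toy_box sym, struct toy_box frag)
{
    return frag.l >= sym.l && frag.r <= sym.r &&
           frag.t >= sym.t && frag.b <= sym.b;
}

// Smallest box enclosing both a and b (the box of the merged symbol).
struct toy_box box_union(struct toy_box a, struct toy_box b)
{
    struct toy_box u;

    u.l = a.l < b.l ? a.l : b.l;
    u.r = a.r > b.r ? a.r : b.r;
    u.t = a.t < b.t ? a.t : b.t;
    u.b = a.b > b.b ? a.b : b.b;
    return u;
}
```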
Recognition merging searches for unrecognized
symbols and, for each one, tries to merge it with some
neighbour(s), and checks if the result becomes similar to some
pattern.
Finally, learned merging will try to reproduce the
cases trained by the user. To train merging, just select the
symbol using the mouse button 1
(say, the left part of the "u" on the figure), click the mouse
button 3 on the fragment (the right part of the "u"), and select
the "merge with current symbol" entry. On the other hand, the
"disassemble" entry may be used to break a symbol into its
components.
Remark: do not merge the "i" dot with the "i" stem. See the
subsection "handling accents" for details.
Handling accents
----------------
Now let's talk about accents.
As a general rule, Clara OCR does not consider accents as parts
of letters, so merging does not apply to accents. Accents are
considered individual symbols, and must be trained
separately. The "i" dot is handled as an accent. Clara OCR will
compose accents with the corresponding letters when generating
the output. The exception is when the accent is graphically
joined to the letter:
XXX
XX XXX
XX XX
XX
XXXX XXXX
XX XX XX XX
XX XX XX XX
XXXXXXXXXX XXXXXXXXXX
XX XX
XX XX
XX XX XX XX
XXXX XXXX
In the figure we have two samples of the letter "e" with an acute
accent. In the first one, the accent is graphically separated
from the letter. So the accent transliteration will be trained or
deduced as being "'", and the letter transliteration
will be trained or deduced as being "e". When generating the output,
Clara OCR will compose them as the macro "\'e" (or as the ISO
character 233, as soon as we provide this alternative behaviour).
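The composition of a separately trained accent and letter into a
TeX macro can be sketched like this. The function compose_tex is
hypothetical and covers only the simple "backslash, accent, letter"
form shown above.

```c
#include <stdio.h>

// Compose an accent and a letter as a TeX macro: an acute accent and the
// letter e give the three characters backslash, apostrophe, e.
// Returns 0 on success, -1 if out is too small.
int compose_tex(char accent, char letter, char *out, size_t outsz)
{
    int n = snprintf(out, outsz, "\\%c%c", accent, letter);

    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```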
In the second case the accent isn't graphically separable from
the letter, so we'll need to train the accented character as the
corresponding ISO character (code 233) or as the macro "\'e". As
the generation of accented characters depends on the local X
settings, the "Emulate deadkeys" item on the "Options" menu may
be useful in this case. It enables the composition of accents
and letters performed directly by Clara OCR (like the Emacs
iso-accents-mode feature).
Browsing the book font
----------------------
As explained earlier, trained symbols become patterns (unless you
mark them "bad"). The collection of all patterns is called "book
font" (the term "book" is to distinguish it from the GUI
font). Clara OCR stores all patterns in the "patterns" file in the
work directory, when the "save session" entry on the "File" menu
is selected.
Clara OCR itself can choose the patterns and populate the book
font. To do so, just select the "Build the font automatically"
item on the "tune" tab, and classify the symbols.
To browse the patterns, click the "pattern" tab one or more times
to enter the "Pattern (list)" window. The "PATTERN (list)" mode
displays the bitmap and the properties
of each pattern in a (perhaps very long) form.
Click the "zoom" button to
adjust the size of the pattern bitmaps. Use the scrollbar or
the Next (Page Down) or Previous (Page Up) keys to navigate. Use
the sort options on the "Edit" menu to change the presentation order.
Now press the "pattern" tab again to reach the "Pattern" window. It
presents the "current" pattern with detailed properties. Try
activating the "show web clip" option on the "View" menu to
visualize the pattern context. The left and
right arrows will move to the previous and to the next patterns. To
train the current pattern (being exhibited on the "Pattern" window),
just press the key corresponding to its transliteration (Clara will
automatically move to the next pattern) or fill the
input field. There is no need to press ENTER to submit the input
field contents.
Useful hints
------------
If the GUI becomes trashed or blank, press C-l to redraw it.
For now, the GUI does not support cut-and-paste. To save to a file
the contents of the "PAGE (list)" window, use the "Write report"
item on the "File" menu.
The "OCR" button will enter the "pressed" state in some unexpected
situations, like during dialogs. This behaviour will be fixed
soon.
The "STOP" button does not immediately stop the OCR operation in
course (e.g. classification). Clara OCR only stops the operation
in course at "secure" points, where all data structures are
consistent.
The OCR output is automatically saved to the file page.html (or
page.txt if the option -o was used), where "page" is the name of
the currently loaded page, without the "pbm" or "pgm" tag. This
file is created by the "generate output" step on the menu that
appears when the mouse button 3 is pressed over the OCR button.
Some OCR steps are currently unfinished and perform no
action at all.
Fun codes
---------
Clara OCR "fun codes" are similar to videogame "codes" (for those
who have never heard about that, videogame "codes" are special
sequences of mouse or key clicks that make your player
invulnerable, or obtain maximum energy, or perform an unexpected
action, etc).
The difference is that Clara OCR "fun codes" are not secret
(videogame "codes" are normally secret and very hard to discover
by chance). Clara OCR contains no secret feature. Fun codes are
intended to be used along public presentations. By now there is
only one fun code: just click one or more times the banner on the
welcome window to make it scroll.
*/
/* (book)
Supported Alphabets
-------------------
Clara OCR focuses on the Latin Alphabet ("a", "b", "c", ...),
used by most European languages, and the decimal digits
("0", "1", "2", ...), but we're trying to support as many
alphabets as possible.
To say that Clara OCR supports a given alphabet means that
Clara OCR
(a) is able to be trained from the keyboard for the symbols of
that alphabet, possibly applying some transliteration from that
alphabet to latin. For instance, when OCRing a greek text, if the
user presses the latin "a" key (assuming that the keyboard has
latin labels), Clara is expected to train the current symbol as
"alpha";
(b) knows the vertical alignment of each letter of that alphabet,
for instance, knows that the bottom of an "e" is aligned at the
baseline;
(c) knows which letters accept or require which signs (accents
and others, like the dot found on "i" and "j");
(d) contains code to help avoiding common mistakes, like
recognizing "e" as "c", "l" as "1", etc.
To say that Clara OCR supports a given alphabet does not
necessarily mean that Clara OCR
(a) knows some particular encoding (ISO-8859-X, Unicode, etc)
for that alphabet;
(b) contains or is able to use fonts for that alphabet to
display the OCR output on the PAGE (OUTPUT) window.
Even without knowing the standard encodings for a given
alphabet (e.g. ISO-8859-7 for Greek), Clara eventually
will be able to produce output using TeX macros, like
{\Alpha}.
*/
/* (devel)
Introducing the source code
---------------------------
This Guide is a collection of entry points to the Clara OCR
source code. Some notes explain specific details about how this
or that feature was implemented. Others are higher-level
descriptions about how one entire subsystem works.
Language and environment
------------------------
Clara OCR is written in ANSI C (with some GNU extensions) and
requires the services of the C library and the Xlib. Development
is done on 32-bit Intel GNU/Linux (various
distributions), with GCC, GNU Make, Bash, XFree86 and Perl 5 (required
for producing the documentation).
Modularization
--------------
Clara source code started, of course, as a single file
named clara.c. At some point we divided it into smaller
pieces. Currently there are 18 files:
book.c .. Documentation only
build.c .. The function build
clara.c .. Startup and OCR run control
cml.c .. ClaraML dumper and recover
common.h .. Common declarations
consist.c .. Consistency tests
event.c .. GUI initialization and event handler
gui.h .. Declarations that depend on X11
html.c .. HTML generation and parse
pattern.c .. Book font stuff
pbm2cl.c .. Import PBM
pgmblock.c .. grayscale loading and blockfinding
preproc.c .. internal preprocessor
redraw.c .. The function redraw
revision.c .. Revision procedures
skel.c .. Skeleton computation
symbol.c .. Symbol stuff
welcome.c .. Welcome stuff
Throughout this document we'll refer not to these files, but to the
identifiers (names of functions and variables).
Note that there are only two headers: common.h and gui.h. It's
complex to maintain one header for each module. Most functions
are not prototyped, but we intend to prototype all of them in the
near future.
Security notes
--------------
Concerning security, the following criteria are being used:
1. string operations are generally performed using services that
accept a size parameter, like snprintf or strncpy, except when the code
itself is simple and guarantees that an overflow won't occur.
2. The CGI clara.pl invokes write privileges through sclara, a program
specially written to perform only a small set of simple operations
required for the operation of the Clara OCR web interface.
The following should be done:
1. Memory blocks should be cleared before calling free().
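The two points about bounded string operations and clearing memory
can be illustrated with the sketch below. The helpers bounded_copy
and clear_and_free are hypothetical and do not appear in the Clara
sources. Note that strncpy does not write a terminating NUL when
the source fills the buffer, which is why the sketch uses snprintf.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Copy src into dst (capacity dstsz), always NUL-terminating.
// Returns 0 if the whole string fit, -1 if it had to be truncated.
int bounded_copy(char *dst, size_t dstsz, const char *src)
{
    int n;

    if (dstsz == 0)
        return -1;
    n = snprintf(dst, dstsz, "%s", src);
    return (n < 0 || (size_t)n >= dstsz) ? -1 : 0;
}

// Overwrite a block with zeros before releasing it.
// A compiler may drop a plain memset that precedes free(), so the
// writes go through a volatile pointer here.
void clear_and_free(void *p, size_t size)
{
    volatile unsigned char *q = p;

    if (p == NULL)
        return;
    while (size-- > 0)
        *q++ = 0;
    free(p);
}
```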
Runtime index checking
----------------------
A naive support for runtime index checking is provided through the
macro checkidx. This checking is performed only if the code is
compiled with the macro MEMCHECK defined and the command-line switch
'-X 1' is used.
In fact, only those points in the source code where the macro checkidx
is explicitly used will perform index checking. We've added calls to
checkidx in some critical functions due to their complexity, or because
segfaults were already detected there.
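The combination of a compile-time macro and a runtime switch can be
sketched as follows. This is not the real checkidx, whose arguments
are not described here: the function toy_check_index and the flag
toy_index_check (standing in for the effect of '-X 1') are
assumptions, and MEMCHECK is defined inside the example only to
keep it self-contained (it would normally be given to the compiler).

```c
#include <stdio.h>

#define MEMCHECK 1          // normally passed on the compiler command line

// Assumption: a runtime flag standing in for the effect of "-X 1".
static int toy_index_check = 1;

// Hypothetical check: returns 1 if idx is a valid index for an array of
// size elements, 0 otherwise (after reporting it). When built without
// MEMCHECK, or with the flag off, it always returns 1.
int toy_check_index(int idx, int size, const char *where)
{
#ifdef MEMCHECK
    if (toy_index_check && (idx < 0 || idx >= size)) {
        fprintf(stderr, "index %d out of range [0,%d) at %s\n", idx, size, where);
        return 0;
    }
#else
    (void)idx; (void)size; (void)where;
#endif
    return 1;
}
```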
Background operation
--------------------
Clara OCR decides at runtime if the GUI will be used or not. So
even when using Clara OCR in batch mode (-b command-line switch),
linking with the X libraries is required.
When the -b command-line switch is used, Clara OCR just won't
make calls to X services. The source code tests the flag
"batch_mode" before calling X services. So it won't create the
application window on the X display, and automatically starts a
full OCR operation on all pages found (as if the "OCR" button was
pressed with the "work on all pages" option selected). Upon
completion, Clara OCR will exit.
Synchronization
---------------
Execution model
---------------
Clara does not use multiple threads. To allow the GUI to refresh
the application window while an OCR run is in progress, the main
function alternates calls to xevents() to
receive input and to continue_ocr() to perform OCR. As the OCR
operations may take a long time to complete, a very simple model was
implemented to allow the OCR services to execute only partially.
Such services (for instance load_page()) accept a "reset" parameter
to allow resetting all static data, and they're expected to
return 0 when finished, or nonzero otherwise. Therefore, a call to
such services must loop until completion. The continue_ocr() function calls
the OCR steps using this model, and some OCR steps call other
services (like load_page()) that implement this model too.
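The convention just described (a "reset" parameter, static state,
0 on completion and nonzero otherwise) can be sketched with a toy
service. The names toy_service and toy_run_to_completion are
hypothetical, and the amount of work done per call is an arbitrary
choice made for the example.

```c
// Toy resumable service following the convention described above:
// reset != 0 restarts it, and it returns 0 when finished, nonzero otherwise.
// Each call processes at most one "row" of a fictitious 5-row job.
int toy_service(int reset)
{
    static int row = 0;     // progress kept between calls

    if (reset)
        row = 0;
    if (row < 5)
        row++;              // one small slice of work per call
    return (row < 5) ? 1 : 0;
}

// A caller loops until completion; between calls it is free to do other
// things, for instance handling GUI events. Returns the number of calls made.
int toy_run_to_completion(void)
{
    int calls = 1;
    int pending = toy_service(1);   // first call resets the static state

    while (pending) {
        // (a real main loop would poll for input events here)
        pending = toy_service(0);
        calls++;
    }
    return calls;
}
```

Because the progress lives in static data, the service cannot be
run for two jobs at once; the reset parameter is what makes a
second run start from scratch.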
Resetting
---------
XML support
-----------
We decided to use XML because of the facilities of using
non-binary encodings to store, analyse, change and transmit
information, and also because XML is a standard. Currently we do
not have DTDs, and until now we didn't try to load, using the
Clara parser, XML code not produced by Clara itself.
The GUI
-------
Main characteristics
--------------------
1. Clara OCR GUI uses only 5 colors: white, gray, darkgray,
verydarkgray and black. The RGB value for each one is
customizable at startup (-c command-line option). On truecolor
displays, graymaps are displayed using more graylevels than the 5
listed above.
2. The X I/O is not buffered. Buffered X I/O is implemented but
it's not being used.
3. Only one X font is used for all needs (button labels, menu
entries, HTML rendering, and messages).
4. Asynchronous refresh. The OCR operations just set the redraw
flags (redraw_button, redraw_wnd, redraw_int, etc) and let the
redraw() function do its work.
5. No toolkit is used. The graphic code is very specific to
Clara, and it was not written to be reusable. So it's very
small. The disadvantage of this approach is that Clara's look and
behaviour will be slightly different from the typical ones found
on popular environments like GNOME or KDE.
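The asynchronous refresh of item 4 can be pictured with the sketch
below. The toy_ names are hypothetical (they mimic, but are not,
the real flags and the real redraw() function), and the drawing
itself is replaced by a counter.

```c
// Toy redraw flags; the names mimic, but are not, the real ones.
static int toy_redraw_button = 0, toy_redraw_wnd = 0;
static int toy_draw_calls = 0;      // counts the simulated drawing operations

// An OCR operation only records that something must be repainted.
void toy_ocr_step(void)
{
    toy_redraw_button = 1;
    toy_redraw_wnd = 1;
}

// Called from the main loop: repaint whatever was flagged, then clear the
// flags. Several requests made since the last call cost a single repaint.
void toy_redraw(void)
{
    if (toy_redraw_button) {
        toy_draw_calls++;           // a real GUI would draw the buttons here
        toy_redraw_button = 0;
    }
    if (toy_redraw_wnd) {
        toy_draw_calls++;           // ...and the document window here
        toy_redraw_wnd = 0;
    }
}
```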
The Clara API
-------------
*/
/* (book)
Building the book font
----------------------
Patterns are selected symbols from the book. They're obtained
from manual training, or from automatic selection. The patterns
are used to deduce the transliteration of the unknown symbols by
the bitmap comparison heuristics. In other words, the OCR
discovers that one symbol is the letter "a" or the digit "1"
comparing it with the patterns.
The book font is the collection of all patterns. The term "book
font" was chosen to make sure that we're not talking about the X
font used by the GUI. The book font is stored on a separate file
("patterns", on the work directory). Clara OCR classifies the
patterns into "types", one type for each printing font. For now,
most of this work must be done manually. Someday in the future,
the auto-tuning features and the pre-build customizations will
hopefully make this process less painful.
So, before OCRing one book, it's convenient to observe the
different fonts used. In our case, we have three fonts (the
quotations refer to the page 5.pbm):
Unknown Latin 9pt ("Todos sao iguais...")
Unknown Latin 9pt bold ("Art. 5")
Unknown Latin 8pt italic (footings)
It's not mandatory to exactly identify each font by its "correct"
name or style or size (Roman, Arial, Courier, etc). In our case,
we've chosen the labels above ("Unknown Latin 9pt" and the
others). These labels can be manually entered using the PATTERN
(TYPES) tab, one "type" for each "font". So we'll have 3 "types",
and, for each one, various parameters can be manually
informed. At least the alphabet must be informed. In fact, the
PATTERN (TYPES) tab allows structuring very carefully all fonts
used along the book. Even some intricate details, like the
classification techniques that can be used for each symbol, can
be set.
Now we can select some patterns from the pages 143-l.pbm and
143-r.pbm. Try:
$ cd /home/clara/books/MBB/pbm
$ clara &
Load the page 143-l.pbm. Observe the symbols, select a nice one
using the mouse button 1 or the arrows (say, a letter "a", small)
and train it pressing the corresponding key (the "a" key). Repeat
this process for various symbols, all from one same type (so do
not mix bold with non-bold, etc). The entered patterns belong by
default to "type 0". The "Set pattern type" entry of the Edit
menu can be used to move all "type 0" patterns to some other type
(1, 2 or 3 in our case). To display the letters and digits for
which few or no samples are trained, click the mouse right button
over the PAGE tab and select "Show pattern type". This way, one
can complete all fonts used along the book.
At this point, the "Auto-classify" feature (Edit menu) may be
quite useful. When on, Clara OCR will apply the just trained
pattern to solve all unknown symbols, so after training an "a",
only those "a" letters dissimilar to that trained will remain
unknown (grayed).
Now save the session (menu "File"), exit Clara OCR (menu "File"),
and enter Clara OCR again using the same commands above. Try to
load one file and/or to observe the patterns on the tabs PATTERN,
PATTERN (list), TUNE (SKEL), etc. This is a good way to
experience that Clara OCR is started and exited many times along
the duration of one OCR project.
The last remark in this subsection: instead of the just described
manual pattern selection, Clara OCR is able to select by itself
the patterns to use from the pages. In order to use this feature,
after selecting the checkbox "Build the bookfont automatically"
(TUNE tab), classify the symbols (just press the OCR button using
the mouse button 1, or press the mouse button 3 over it and
select the "classify" item). However, the current recommendation
is to prefer the manual selection of patterns, at least as a
first step.
*/
/* (book)
Reference of the Clara GUI
--------------------------
In this section, the Clara application window will be described
in detail, both to document all its features and to define the
terminology.
The application window
----------------------
The application window is divided into three major areas: the
buttons ("zoom", "OCR", "stop", etc), the "plate" (right),
including the tabs ("page", "patterns" and "tune"), and one or more
"document windows" inside the plate.
We say "document window" because each window is exhibiting one
"document". This "document" may be the scanned page (PAGE
window), the current OCR output for this page (PAGE OUTPUT
window), the symbol form (PAGE SYMBOL window), the GPL (GPL
window) and so on. However, we'll refer to the document windows
merely as "windows".
Around each window there are two scrollbars. At the bottom of the
application window there is a status line. On the top there is
a menu bar (fully documented on the section "Reference of the
menus").
+-----------------------------------------------+
| File Edit OCR ...                             |
+-----------------------------------------------+
| +--------+   +----+ +--------+ +-------+      |
| |  zoom  |   |page| |patterns| | tune  |      |
| +--------+ +-+    +-+        +-+       +-+    |
| +--------+ | +-------------------------+ |    |
| |  zone  | | |                         | |    |
| +--------+ | |                         | |    |
| +--------+ | |                         | |    |
| |  OCR   | | |       WELCOME TO        | |    |
| +--------+ | |                         | |    |
| +--------+ | |    C L A R A  O C R     | |    |
| |  stop  | | |                         | |    |
| +--------+ | |                         | |    |
|     .      | |                         | |    |
|     .      | |                         | |    |
|            | |                         | |    |
|            | |                         | |    |
|            | +-------------------------+ |    |
|            +-----------------------------+    |
|                                               |
| (status line)                                 |
+-----------------------------------------------+
Tabs and windows
----------------
Three tabs are offered, and each one may operate in one or more
"modes". For instance, pressing the PATTERN tab many times will
cycle through two modes: one presenting the windows "pattern" and
"pattern (props)" and another with the window "pattern
(list)".
On each tab, Clara OCR displays on the plate one or more
windows. Each such window is called a "document window" to
distinguish them from the application window. Each such window
is supposed to be displaying a portion of a larger document, for
instance
The scanned page (graphic)
The OCR output (text)
The list of pages (text)
The list of patterns (text)
The symbol description (text)
Unless the user hides them, two scrollbars are displayed for each
document window, one horizontal and one vertical. On each one, a
cursor is drawn to show the relative portion of the full document
currently visible on the display.
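The position and size of such a cursor follow from simple
proportions. The sketch below is only an illustration with
hypothetical names; it is not the code Clara uses to draw its
scrollbars.

```c
// Given a document of total units (pixels or characters) of which the
// range [first, first + visible) is on screen, compute where the cursor
// starts and how long it is on a scrollbar of track display pixels.
// All names are illustrative. Assumes total > 0 and visible <= total.
void scrollbar_cursor(int total, int first, int visible, int track,
                      int *start, int *length)
{
    *start = (int)((long)first * track / total);
    *length = (int)((long)visible * track / total);
    if (*length < 1)
        *length = 1;        // keep the cursor visible on huge documents
}
```

For a document 1000 units long showing units 250 to 349 on a track
of 200 display pixels, the cursor starts at pixel 50 and is 20
pixels long.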
All available tabs and the modes for each one are listed
below. The numbers (1, 2, etc) are only to make it easier to
distinguish one mode from the others. There is no effective
association between the modes and the numbers.
tab       mode   windows
-------------------------------
           1     WELCOME
           2     GPL
           3     PATTERN_ACTION
page       4     PAGE_LIST
           5     PAGE
                 PAGE_OUTPUT
                 PAGE_SYMBOL
           6     PAGE_FATBITS
                 PAGE_MATCHES
pattern    7     PATTERN
           8     PATTERN_LIST
           9     PATTERN_TYPES
tune      10     TUNE
          11     TUNE_PATTERN
                 TUNE_SKEL
          12     TUNE_ACTS
Note that the windows WELCOME and GPL have no corresponding
tab. When these windows are displayed, there is no active
tab. Except in these cases, the name of the current window is
always presented as the label of the active tab.
The Alphabet Map
----------------
When the "Show alphabet map" option of the "View" menu is selected,
the GUI will include an alphabet map between the buttons and the
plate. This map presents all symbols from the current alphabet. The
current alphabet is selected using the alphabet button. The alphabet
button cycles through all alphabets selected on the "Alphabets" menu.
Clara OCR offers an initial support for multiple alphabets. To become
useful, it needs more work. The alphabet map currently does not offer
any functionality. For some alphabets (Cyrillic and Arabic) the
alphabet map is disabled on the source code due to the large alphabet
size. Currently Clara OCR does not contain bitmaps for displaying
Katakana.
Reference of the menus
----------------------
Most menus are accessible from their labels on the menu bar (at the top of the
application window). The labels are "File", "Edit", etc. Other menus
are presented when the user clicks the mouse button 3 on some special
places (for instance the button "OCR"). Let's describe all menus and
their entries.
*/
/* (devel)
Geometry of windows
-------------------
The current window is identified by the CDW global variable
(set by the setview function). The variable CDW is an index into
the dw array of dwdesc structs. Some macros are used to refer to the
fields of the structure dw[CDW]. The list of all of them can be
found in the headers under the title "Parameters of the current
window".
The portion of the document being displayed is defined by the
macros X0, Y0, HR and VR, where (X0,Y0) is the top left and HR
and VR are the width and height, measured in pixels (graphic
documents) or characters (text documents):
         X0    X0+HR-1
         |     |
    +----+-----+--+
    |             |
    |             |
    |    +-----+  +- Y0
    |    |     |  |
    |    |     |  |
    |    |     |  |
    |    +-----+  +- Y0+VR-1
    |             |
    |             |
    |             |
    |             |
    |             |
    |             |
    +-------------+

      The document
Relative to the application window, the document window is a
portion of the plate, defined by DM, DT, DW and DH, where (DM,DT)
is the top left and DW and DH are the width and height measured
in display pixels.
          DM                DM+DW-1
          |                 |
    +-----+-----------------+----+
    |                            |
    |                            |
    |                            |
    |     +-----------------+    +- DT
    |     |                 | |  |
    |     |                 | X  |
    |     |                 | X  |
    |     |    Document     | X  |
    |     |     window      | |  |
    |     |                 | |  |
    |     |                 | |  |
    |     |                 | |  |
    |     |                 | |  |
    |     +-----------------+    +- DT+DH-1
    |      -----XXXXXXXXXXX-     |
    |                            |
    |                            |
    +----------------------------+

          Application window
The rectangle (X0,Y0,HR,VR) of the document is displayed in
the display rectangle (DM,DT,DW,DH). When displaying the scanned
page, the reduction factor RF applies: each RFxRF square of
document pixels is mapped to one display pixel.
When displaying the scanned page in fat bit mode, each document
pixel is mapped to a ZPSxZPS square of display pixels, and a
grid is displayed too.
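The reduced-mode mapping can be sketched as follows. This is only
an illustration, assuming that HR and VR are measured in document
pixels and that plain integers replace the dw[CDW] macros. The names
"struct view" and "doc_to_display" are hypothetical and do not come
from the Clara OCR source.

```c
#include <assert.h>

// Hypothetical stand-in for the fields of dw[CDW] used here.
struct view {
    int x0, y0;   // top left of the visible document rectangle (document pixels)
    int hr, vr;   // width and height of the visible rectangle (document pixels)
    int dm, dt;   // top left of the document window (display pixels)
    int rf;       // reduction factor: rf x rf document pixels per display pixel
};

// Map document pixel (x,y) to display coordinates (*dx,*dy).
// Returns 1 if the pixel lies inside the visible rectangle, 0 otherwise.
int doc_to_display(const struct view *v, int x, int y, int *dx, int *dy)
{
    assert(v->rf >= 1);
    if (x < v->x0 || x >= v->x0 + v->hr || y < v->y0 || y >= v->y0 + v->vr)
        return 0;
    *dx = v->dm + (x - v->x0) / v->rf;
    *dy = v->dt + (y - v->y0) / v->rf;
    return 1;
}
```

With rf equal to 1 the mapping is a pure translation. Larger values
make several neighbouring document pixels share one display pixel.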
Scrollbars
----------
The scrollbars indicate the relative portion of the document being
displayed. The viewable region of the document (in the sense just
defined) is given by X0, Y0, HR and VR:
                 X0      X0+HR-1
            +----+-------+-------+ - 0
            |                    |
         Y0 +    +-------+       |
            |    |       |       |
            |    |       |       |
            |    |       |       |
            |    |       |       |
    Y0+VR-1 +    +-------+       |
            |                    |
            |                    |
            |                    |
            |                    |
            +--------------------+ - GRY-1
            |                    |
            0                  GRX-1
The variables GRX and GRY contain the total width and height of
the full document, measured in pixels. The interpretation of
the variables X0, Y0, HR and VR is not simple. In some
cases they contain values measured in pixels, in others,
in characters. The variables HR and VR define the size of the
window. However, in some cases this is the size
from the viewpoint of the document and, in others, of the display
(the difference is a reduction factor).
    +------------+ -
    |            | |
    |            | |
    |            | X
    |            | X
    |            | X
    |            | |
    |            | |
    +------------+ -
    |---XXXX-----|
Note that the parameters X0, Y0, HR, VR, GRX and GRY are macros
that refer to the corresponding fields of the structure dw[CDW],
which stores the parameters of the current DW.
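As a rough sketch of how a scrollbar cursor could be derived from
these parameters, the function below scales the visible range to the
length of the bar. The function name and the clamping rules are
assumptions made for this example. The actual drawing code may differ.

```c
#include <assert.h>

// Compute the cursor of a scrollbar that is `bar` display pixels long.
// `start` and `len` describe the visible range (X0 and HR, or Y0 and VR) and
// `total` is the document extent (GRX or GRY), all in the same unit.
// The cursor offset and length, in display pixels, are stored in *pos and *size.
void scrollbar_cursor(int start, int len, int total, int bar, int *pos, int *size)
{
    assert(total > 0 && bar > 0 && start >= 0 && len >= 0);
    if (start > total) start = total;
    if (len > total - start) len = total - start;
    *pos = (int)((long)start * bar / total);
    *size = (int)((long)len * bar / total);
    if (*size < 1) *size = 1;                 // keep the cursor visible
    if (*pos + *size > bar) *pos = bar - *size;
}
```

The same function serves both scrollbars: pass X0, HR and GRX for the
horizontal one, and Y0, VR and GRY for the vertical one.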
Displaying bitmaps
------------------
Bitmaps on HTML windows and on the PAGE window are displayed
in "reduced" fashion (an RFxRF square of bitmap pixels is
mapped to one display pixel). If RF=1, then each bitmap pixel
maps to one display pixel.
The windows PATTERN, PAGE_FATBITS, and PAGE_MATCHES display
bitmaps in "zoomed" mode (one bitmap pixel maps to a ZPSxZPS
square of display pixels). In this case a grid is displayed to
make it easier to distinguish each pixel. The variables GW and GS
contain the grid width and the "grid separation" (GS=ZPS+GW).
       ZPS     GS           GW
     |<---->|<----->|   --->||<---
    ++------++------++------++----
    ++------++------++------++----
    ||      ||      ||      ||
    ||      ||      ||      ||
    ||      ||      ||      ||
    ++------++------++------++----
    ++------++------++------++----
    ||      ||      ||      ||
    ||      ||      ||      ||
    ||      ||      ||      ||
Note that the parameters RF, GS and GW are macros that refer to the
corresponding fields of the structure dw[CDW], which stores the
parameters of the current DW.
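The zoomed-mode geometry can be illustrated with two small helpers.
This sketch assumes that each cell is preceded by one grid line of
width GW, as the diagram suggests. The function names are hypothetical
and take ZPS and GW as plain arguments instead of using the macros.

```c
#include <assert.h>

// Display offset of the left (or top) edge of bitmap pixel i in zoomed mode,
// assuming each cell is preceded by a grid line gw pixels wide.
int cell_origin(int i, int zps, int gw)
{
    int gs = zps + gw;            // grid separation, GS = ZPS + GW
    return i * gs + gw;
}

// Inverse mapping: bitmap pixel index under display offset d,
// or -1 when d falls on a grid line.
int cell_at(int d, int zps, int gw)
{
    int gs = zps + gw;
    assert(d >= 0 && gs > 0);
    return (d % gs < gw) ? -1 : d / gs;
}
```

The inverse mapping is the kind of computation needed to find which
bitmap pixel lies under the mouse pointer in a zoomed window.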
Auto-submission of forms
------------------------
The Clara OCR GUI tries to apply all actions taken by the user
immediately. So the HTML forms (e.g. the PATTERN window) do not
contain SUBMIT buttons, because they're not required (some forms
contain a SUBMIT button disguised as a CONSIST facility, but this
is just for the user's convenience).
The editable input fields make auto-submission a bit
harder, because we cannot apply consistency tests and process the
form before the user finishes filling in the field, so
auto-submission must be triggered on selected events. The
triggers must be a bit smart, because some events must be
handled before submission (for instance toggling a CHECKBOX),
while others must be handled after submission (for instance
changing the current tab). So auto-submission must be carefully
designed. The current strategy follows:
a. When the window PAGE (symbol) or the window PATTERN is
visible, auto-submit just after handling the buttons that change
the current symbol/pattern data (buttons BOLD, ITALIC, ALPHABET
or PTYPE).
b. When the window PAGE (symbol) or the window PATTERN is
visible, auto-submit just before handling the left or right
arrows.
c. When the user presses ENTER and an active input field exists,
auto-submit.
d. Auto-submit as the first action taken by the setview service,
in order to flush the current form before changing the current
tab or tab mode.
e. Auto-submit just after opening any menu, in order to flush
data before some critical action like quitting the program or
starting some OCR step.
f. Auto-submit just after handling CHECKBOX or RADIO buttons.
Auto-submission happens when the service auto_submit_form is
called, so it's easy to locate all triggering points (just search
for the string auto_submit_form). This service takes no action when
the current form is unchanged.
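The ordering rules above can be illustrated with a toy model. The
code below is not the Clara OCR implementation: the struct, the
"changed" flag and all function names are hypothetical, chosen only
to show why a submit that does nothing on an unchanged form can be
called safely from many trigger points.

```c
#include <assert.h>

// Hypothetical form state; the real Clara OCR data structures differ.
struct form {
    int changed;     // nonzero when some field was edited since the last submit
    int submitted;   // counts how many times the form was processed
};

// Process the form only when it has pending changes, so that calling this
// from many trigger points is harmless. Returns 1 if a submit happened.
int submit_if_changed(struct form *f)
{
    assert(f);           // a form must be given
    if (!f->changed)
        return 0;
    f->submitted++;      // stands for consistency tests plus processing
    f->changed = 0;
    return 1;
}

// Rule (f): handle the checkbox first, then submit.
void on_checkbox(struct form *f, int *box)
{
    *box = !*box;
    f->changed = 1;
    submit_if_changed(f);
}

// Rule (d): flush the form first, then change the tab.
void on_tab_change(struct form *f, int *tab, int new_tab)
{
    submit_if_changed(f);
    *tab = new_tab;
}
```

In this model a second tab change with no edits in between leaves the
submit counter untouched, which mirrors the "no action when unchanged"
behaviour described above.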
The Clara API
-------------
This section describes the variables and functions exported by
Clara OCR for extensibility purposes. Note that Clara OCR
currently does not have an interface for extensions. The first
such interface planned to be added will use the Guile
interpreter, available from the GNU Project.
*/
/* (all)
AVAILABILITY
------------
Clara OCR is free software. Its source code is distributed under
the terms of the GNU GPL (General Public License), and is
available at CLARA_HOME. If you don't know what the GPL is,
please read it and check the GPL FAQ at
http://www.gnu.org/copyleft/gpl-faq.html. You should have
received a copy of the GNU General Public License along with this
software; if not, write to the Free Software Foundation, Inc., 59
Temple Place - Suite 330, Boston, MA 02111-1307, USA. The Free
Software Foundation can be found at http://www.fsf.org.
CREDITS
-------
Clara OCR was written by Ricardo Ueda Karpischek. Giulio Lunati
wrote the internal preprocessor. Clara OCR includes bugfixes
produced by other developers. The Changelog
(http://www.claraocr.org/CHANGELOG) acknowledges all of them (see
below). Imre Simon contributed high-volume tests, discussions
with experts, selection of bibliographic resources, publicity
and many ideas on how to make the software more useful.
Ricardo authored various free materials, some included (at least)
in Conectiva, Debian, FreeBSD and SuSE (the verb conjugator
"conjugue", the ispell dictionary br.ispell and the proxy
axw3). He recently ported the EiC interpreter to the Psion 5
handheld and patched the Xt-based vncviewer to scale framebuffers
and compute image diffs. Ricardo works as an independent
developer and instructor. He received no financial aid to develop
Clara OCR. He's not an employee of any company or organization.
Imre Simon promotes the usage and development of free
technologies and information from his research, teaching and
administrative labour at the University.
Roberto Hirata Junior and Marcelo Marcilio Silva contributed
ideas on character isolation and recognition. Richard Stallman
suggested improvements on how to generate HTML output. Marius
Vollmer is helping to add Guile support. Jacques Le Marois helped
with the announcement process. We acknowledge Mike O'Donnell and Junior
Barrera for their good criticism. We acknowledge Peter Lyman for
his remarks about the Berkeley Digital Library, and Wanderley
Antonio Cavassin, Janos Simon and Roberto Marcondes Cesar Junior
for some web and bibliographic pointers. Bruno Barbieri Gnecco
provided hints and explanations about GOCR (main author: Jorg
Schulenburg). Luis Jose Cearra Zabala (author of OCRE) is kindly
supporting our attempts to use portions of his code. Adriano
Nagelschmidt Rodrigues and Carlos Juiti Watanabe carefully tried
the tutorial before the first announcement. Eduardo Marcel Macan
packaged Clara OCR for Debian and suggested some
improvements. Mandrakesoft is hosting claraocr.org. We
acknowledge Conectiva and SuSE for providing copies of their
outstanding distributions. Finally, we acknowledge the late Jose
Hugo de Oliveira Bussab for his interest in our work.
Adriano Nagelschmidt Rodrigues donated a 15" monitor.
The fonts used by the "view alphabet map" feature came from
Roman Czyborra's "The ISO 8859 Alphabet Soup" page at
http://czyborra.com/charsets/iso8859.html.
The names cited by the CHANGELOG (and not cited before) follow
(small patches, bug reports, specfiles, suggestions,
explanations, etc).
Brian G. (win32),
Bruce Momjian,
Charles Davant (server admin),
Daniel Merigoux,
De Clarke,
Emile Snider (preprocessor, to be released),
Erich Mueller,
Franz Bakan (OS/2),
groggy,
Harold van Oostrom,
Ho Chak Hung,
Jeroen Ruigrok,
Laurent-jan,
Nathalie Vielmas,
Romeu Mantovani Jr (packager),
Ron Young,
R P Herrold,
Sergei Andrievskii,
Stuart Yeates,
Terran Melconian,
Thomas Klausner (NetBSD),
Tim McNerney,
Tyler Akins.
*/
/* (faq)
WELCOME
-------
These are the Clara OCR Frequently Asked Questions. They're
useful for a first contact with Clara OCR. If you're looking for
information on how to use Clara OCR, please try the Clara OCR
Tutorial instead. Clara OCR can be found at CLARA_HOME.
CONTENTS
--------
What is Clara OCR?
Why is Clara a "cooperative OCR"?
Is Clara OCR Free? Open Source?
Is Clara OCR a GNU program?
On which platforms does Clara OCR run?
Does Clara OCR have a command-line interface?
Does Clara OCR run on KDE? GNOME?
Which languages are supported by Clara OCR?
Does Clara OCR support Unicode?
Is Clara OCR omnifont?
How does Clara differ from other OCRs?
What is PBM/PGM/PPM/PNM?
How can I scan paper documents using Clara OCR?
I've tried Clara OCR, but the results disappointed me
How can I get support on Clara OCR?
Does Clara OCR induce to Copyright Law infringements?
How can I help the Clara OCR development effort?
What is Clara OCR?
------------------
Clara is an OCR program. OCR stands for "Optical Character
Recognition". An OCR program tries to recognize the characters
from the digital image of a paper document. The name Clara stands
for "Cooperative Lightweight chAracter Recognizer".
Why is Clara a "cooperative OCR"?
---------------------------------
Clara is a cooperative OCR because it offers a web interface for
training and revision, so these tasks can benefit from the
revision effort of many people across the Internet. However,
Clara OCR also offers a powerful X-based GUI for standalone
usage.
Is Clara OCR Free? Open Source?
-------------------------------
Clara OCR is distributed under the terms of the GNU General Public
License (GPL) version 2. Yes, Clara OCR is Free. Yes, Clara OCR
is Open Source. Clara OCR is neither "Shareware" nor "Public
Domain".
Is Clara OCR a GNU program?
---------------------------
Clara OCR is unrelated to the GNU Project but its development is
strongly based on GNU programs (GCC, Emacs and others), as well
as on other free software, like the Linux kernel and XFree86.
Clara OCR is free software because we agree on the free software
ideal as stated by the GPL. To make this agreement explicit we
also adopted some suggestions from the Free Software
Foundation. These suggestions apply to the Clara OCR
documentation:
(a) GPL programs are referred to as "free software", not "open
source".
(b) The term "GNU/Linux (operating system)" is used rather
than "Linux (operating system)".
(c) We do not recommend non-free software and do not refer
the user to non-free documentation for free software.
Furthermore, Clara OCR will support Guile as an extension
language in the near future.
Remark: We write "free software" instead of "open source"
just for coherence. We dislike antagonisms between the various
initiatives created along the years to freely produce, use,
change and distribute software.
On which platforms does Clara OCR run?
--------------------------------------
Clara OCR is being developed on 32-bit Intel running GNU/Linux.
Currently Clara OCR won't run on big-endian CPUs (e.g. Sparc) nor
on systems lacking X Windows support (e.g. MS-Windows). A
relatively fast CPU (300MHz or more) is recommended. A port
to MS-Windows is being worked on. See also the next
question.
Does Clara OCR have a command-line interface?
---------------------------------------------
Yes, but the X Windows headers and libraries are required anyway
to compile the source code, and the X Windows libraries are
required to run even the Clara OCR command-line interface. Unless
someone reworks the code, it's not possible to detach the GUI in
order to compile Clara OCR on systems that do not support X
Windows.
Does Clara OCR run on KDE? GNOME?
---------------------------------
Clara OCR should run on any graphic environment based on
X Windows, including KDE, GNOME, CDE, WindowMaker and
others. Clara OCR depends only on the X library, and does not
require GTK, Qt or Motif to run. Clara OCR does not use the X
Toolkit (aka "Xt"). Clara OCR has been successfully tested on
X11R5 and X11R6 environments with twm, fvwm, mwm and others.
Which languages are supported by Clara OCR?
-------------------------------------------
As a generic recogniser, Clara OCR may be tried with any language
and any alphabet. However, there are some restrictions. Currently
Clara OCR expects the words to be written horizontally, and some
of its heuristics assume geometric relationships
typical of the Latin alphabet and the accents used by most
European languages. Support for language-specific spell checking
is expected to be added soon.
Does Clara OCR support Unicode?
-------------------------------
No, Clara OCR does not support Unicode, and the support for the
ISO-8859 charsets is partial.
Is Clara OCR omnifont?
----------------------
No, Clara OCR is not omnifont. Clara OCR implements an OCR model
based on training. This model makes training and revision one and
the same thing, making it possible to reuse training and revision
information (see also the next question).
How does Clara differ from other OCRs?
--------------------------------------
This is a quote from the Clara Advanced User's Manual:
Clara differs from other OCR software in various aspects:
1. Most known OCRs are non-free and Clara is free. Clara focuses
on the X Window System. Clara offers batch processing and a web
interface, and supports cooperative revision effort.
2. Most OCR software focuses on omnifont technology, disregarding
training. Clara does not implement omnifont techniques and
concentrates on building specialized fonts (some day in the
future, however, maybe we'll try classification techniques that
do not require training).
3. Most OCR software makes the revision of the recognized text a
process totally separated from the recognition. Clara
pragmatically joins the two processes, and makes training and
revision parts of one and the same thing. In fact, the OCR model
implemented by Clara is an interactive effort where the usage of
the heuristics alternates with revision and fine-tuning of the
OCR, guided by the user's experience and feeling.
4. Clara lets the user enter the transliteration of each pattern
using an interface that displays a graphic cursor directly over
the image of the scanned page, and builds and maintains a mapping
between graphic symbols and their transliterations in the OCR
output. This is a potentially useful mechanism for documentation
systems, and a valuable tool for typists and reviewers. In fact,
Clara OCR may be seen as a productivity tool for typists.
5. Most OCR software is integrated with scanning tools, offering
the user a unified interface to execute all steps from
scanning to recognition. Clara does not offer such an integrated
interface, so you need separate software (e.g. SANE) to
perform scanning.
6. Most OCR software expects the input to be a graphic file
encoded in TIFF or other formats. Clara supports only raw PBM
and PGM.
What is PBM/PGM/PPM/PNM?
------------------------
PBM, PGM and PPM are graphic file formats defined by Jef
Poskanzer. PNM is not a graphic file format, but a generic
reference to those three formats. In other words, to say that a
program supports PNM means that it handles PBM, PGM and PPM.
PBM = Portable BitMap
PGM = Portable GrayMap
PPM = Portable PixMap
PNM = Portable aNyMap
PBM files are black-and-white images, 1 bit per pixel. PGM files
are grayscale images, 8 bits per pixel. PPM files are color
images, 24 bits per pixel. Currently Clara OCR accepts raw PBM and
raw PGM files only. A scanned page stored in some format other
than PBM or PGM can be converted to PBM or PGM using the netpbm
tools, ImageMagick or others.
PNM files may be "raw" or "plain". The plain versions are rarely
used. Clara OCR does not support plain PBM nor plain PGM. To
check the file format, try the "file" utility, for instance:
    file test.pbm
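The raw and plain variants can also be told apart by the first two
bytes of the file: P1, P2 and P3 mark plain PBM, PGM and PPM, while
P4, P5 and P6 mark the raw versions. The following check is a small
sketch with hypothetical function names. It is not taken from the
Clara OCR source.

```c
#include <assert.h>
#include <stdio.h>

// Classify a PNM file by its two-byte magic number.
// Returns the digit of the magic ('1'..'6') or 0 if the file is not PNM.
// P1/P2/P3 are plain PBM/PGM/PPM; P4/P5/P6 are raw PBM/PGM/PPM.
int pnm_magic(const char *path)
{
    FILE *f;
    int a, b;
    assert(path != NULL);
    f = fopen(path, "rb");
    if (f == NULL)
        return 0;
    a = fgetc(f);
    b = fgetc(f);
    fclose(f);
    if (a != 'P' || b < '1' || b > '6')
        return 0;
    return b;
}

// Nonzero when the file is raw PBM (P4) or raw PGM (P5),
// the only two variants the text says Clara OCR reads.
int is_raw_pbm_or_pgm(const char *path)
{
    int m = pnm_magic(path);
    return m == '4' || m == '5';
}
```

A file rejected by this check can be converted first with the netpbm
tools or ImageMagick, as explained above.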
Remember that image conversion sometimes implies data loss. For
instance, to convert a color image to black-and-white, each pixel
must be mapped to either black or white, so the original color
(say, red, lightblue, seagreen, tomato, mistyrose, etc) is
dropped. Also, the conversion process should decide for each
pixel if it will be mapped to black or to white. Generally, the
program that performs the conversion offers a variety of
different mapping criteria. The OCR results depend strongly on
the criterion chosen.
How can I scan paper documents using Clara OCR?
-----------------------------------------------
You cannot. Clara OCR includes no support for scanners. To scan
paper documents, use other software, like the one bundled with
your scanner, or SANE (http://www.mostang.com/sane/). SANE is
used for the development tests.
I've tried Clara OCR, but the results disappointed me
-----------------------------------------------------
All OCR programs will disappoint you depending on the texts
you're trying to recognize. If you're a developer, join the Clara
OCR development effort and try to make it behave better on your
texts. If you are not a developer, wait for a new version and try
again.
How can I get support on Clara OCR?
-----------------------------------
If the documentation did not solve your problems, try the
discussion list.
Does Clara OCR induce to Copyright Law infringements?
-----------------------------------------------------
No. Clara OCR is just a tool for character recognition like many
others that can be purchased or are bundled with scanners. The
Clara OCR Project urges all users to be aware of copyright
law and not to infringe it. The Clara OCR Project
condemns any attempt to infringe the legitimate laws of any
country.
Nonetheless, the Clara OCR Project supports the free and public
availability of materials produced to be free, or of materials
out of copyright due to their age. The Clara OCR Project recognizes
the right of anyone to produce free or non-free materials.
How can I help the Clara OCR development effort?
------------------------------------------------
The best way is to use Clara OCR to recognize the texts you're
interested in, and try to make it adapt better to them. The
Developer's Guide should help in this case (C programming skills
are required). The Clara OCR Project acknowledges all efforts to
make Clara OCR more widely known and used.
*/
/* (glossary)
WELCOME
-------
This is the Clara OCR glossary. It's somewhat specific to Clara
OCR. The entries that do not name an author were written by
Ricardo Ueda Karpischek. Send new entries or suggestions to
claraocr@claraocr.org. This glossary is part of the Clara OCR
documentation. Clara OCR is distributed under the terms of the
GNU GPL.
CONTENTS
--------
algorithm
binarization
bitmap
bitmap comparison
border
border mapping
clara
classification
density
depth
digital image
dpi
function
graphic format
graymap
heuristic
image size
mapping
OCR
page
pattern
pixel
pixel distance
pixmap
PBM
PGM
PNM
PPM
resolution
skeleton
skeleton fitting
symbol
thresholding
Xlib
*/
/* (glossary)
image size
----------
As a digital image is usually a rectangular matrix of pixels, its
size in pixels can be conveniently described by giving the rectangle
width and height, usually in the form WxH. For instance, a 200x100
image is a rectangle of pixels having width 200 and height 100.
depth
-----
the number of bits available to store the color of each pixel.
Black-and-white images have depth 1. Graymaps usually have depth
8 (256 gray levels). The larger the depth, the larger the
amount of disk or RAM space required to store a digital image.
For instance, an image of size 100x100 and depth 8 requires
100*100*8 = 80000 bits = 10000 bytes to be stored.
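The storage computation can be written as a one-line helper. The
function name is hypothetical, and the sketch counts only the pixel
data, ignoring file headers, row padding and compression.

```c
#include <assert.h>

// Bytes needed to store the pixel data of a w x h image with the given depth
// (bits per pixel), ignoring headers, row padding and compression.
long image_bytes(long w, long h, int depth)
{
    long bits;
    assert(w >= 0 && h >= 0 && depth > 0);
    bits = w * h * depth;
    return (bits + 7) / 8;      // round up to whole bytes
}
```

Dividing by 8 with rounding up matters only when the total number of
bits is not a multiple of 8, which can happen for depth 1.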
graphic format
--------------
A standardised way to store the color of each pixel from a digital
image in a disk file. The graphic format may include other
information, like density and image annotations. Some graphic
formats include a provision to compress the data. In some cases,
this compression, if used, may change the color of some pixels
or regions to colors close to the original ones, but different.
So the usage of some graphic formats may imply data loss.
Examples of graphic formats are TIFF, JPEG, GIF, BMP, PNM, etc.
clara
-----
Cooperative Lightweight chAracter Recognizer. "Clara" is also a
personal name: "Clara" (Latin, Portuguese, Spanish), "Chiara"
(Italian), "Claire" (English).
OCR
---
Optical Character Recognition. Some people find it hard to
understand what OCR is due to the lack of knowledge
on how computers store and process text and image data. Most
users think of OCR as a required step before editing and
spell-checking documents obtained from the scanner (this is not
wrong, though).
algorithm
---------
a well defined procedure. The term "algorithm" is usually
reserved for procedures whose properties can be assured,
generally through a rigorous mathematical proof. For instance,
the procedure learned by children to multiply two numbers from
their multi-digit decimal representations is an algorithm (see
heuristic).
binarization
------------
the conversion from color or grayscale (PGM) to
black-and-white. The Clara OCR classification heuristics
currently available require black-and-white input, so when the
input is grayscale (PGM), Clara OCR needs to convert it to
black-and-white before OCR. Note that to binarize an image, some
choice must be made on how to map colors or gray levels to either
black or white. More importantly, the OCR results depend
strongly on that choice.
bitmap
------
The Clara OCR documentation tries to use the term "bitmap" to
mean only rectangular, black-and-white digital images. Grayscale
rectangular digital images are called "graymaps" (see also
pixel).
bitmap comparison
-----------------
any method intended to decide if two given bitmaps are
similar. Clara OCR implements three such methods: skeleton
fitting, border mapping and pixel distance.
border
------
the line formed by the bitmap black pixels that have white
neighbours. Note that the definition of "neighbour" may
vary. Clara OCR generally considers that the neighbours of one
pixel are all 8 pixels contiguous to it (top left, top, top
right, left, right, bottom left, bottom, bottom right).
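A direct reading of this definition is sketched below. The bitmap
layout (one byte per pixel, row-major, nonzero meaning black) and the
choice to treat pixels outside the bitmap as white are assumptions of
this example. Clara OCR stores bitmaps differently and the function
name is hypothetical.

```c
#include <assert.h>

// Return 1 if black pixel (x,y) of the w x h bitmap `bm` (one byte per pixel,
// nonzero = black, row-major) has at least one white 8-neighbour.
// Pixels outside the bitmap are treated as white (an assumption of this sketch).
int is_border(const unsigned char *bm, int w, int h, int x, int y)
{
    int dx, dy;
    assert(x >= 0 && x < w && y >= 0 && y < h);
    if (!bm[y * w + x])
        return 0;                         // white pixels are never border
    for (dy = -1; dy <= 1; dy++) {
        for (dx = -1; dx <= 1; dx++) {
            int nx = x + dx, ny = y + dy;
            if (dx == 0 && dy == 0)
                continue;
            if (nx < 0 || nx >= w || ny < 0 || ny >= h || !bm[ny * w + nx])
                return 1;
        }
    }
    return 0;
}
```

Restricting the two loops to the four non-diagonal offsets would give
the 4-neighbour variant of the definition.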
border mapping
--------------
a bitmap comparison technique that builds a mapping from the
border pixels of one bitmap to the border pixels of another
bitmap. If this mapping is found to satisfy certain mathematical
properties, the bitmaps are considered similar.
classification
--------------
the process that recognizes a given bitmap as being the letter
"a" or the digit "5", etc. Instead of saying that the bitmap was
"recognized" as a letter "a", it's common to say that it was
"classified" as a letter "a". All Clara OCR classification
methods are currently based on bitmap comparison techniques.
density
-------
see dpi.
digital image
-------------
see pixel.
dpi
---
dots-per-inch. A measure of linear image density. Example:
scanning an A4 (210x297mm) page at 300 dpi results in an image of
about 2480x3508 pixels (remember that 1 inch equals 25.4
millimeters). In most cases, all relevant visual details of
printed characters can be conveniently captured at 600 dpi (in
some cases, 300 dpi suffices). Some file formats, like TIFF or
JPEG, include density information. Others, like PBM, PGM or PPM,
don't. So when converting from TIFF to PGM, remember that the
density information is dropped. If, for instance, you ask SANE to
scan a page creating a TIFF file, then convert it to PPM,
and from PPM to TIFF again, the last file will not be equal to
the first one. Density information is usually irrelevant when
displaying images on the computer monitor, because in this case a
1-1 mapping between image pixels and display pixels is
assumed. However, density information is quite important when
printing an image on paper, or when performing OCR. Clara OCR
expects to be informed explicitly about the image density
(default 600 dpi).
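The relation between paper size, density and image size can be
checked with a small helper. The function name is hypothetical, and
rounding to the nearest integer is a choice of this sketch. A scanner
driver may round differently.

```c
#include <assert.h>

// Number of pixels covering `mm` millimeters at `dpi` dots per inch,
// rounded to the nearest integer (1 inch = 25.4 mm).
long mm_to_pixels(double mm, double dpi)
{
    assert(mm >= 0 && dpi > 0);
    return (long)(mm * dpi / 25.4 + 0.5);
}
```

Calling it once with the page width and once with the page height
gives the expected size of the scanned image.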
function
--------
a rule that assigns, for each given element, another element, in
a unique fashion. For instance, the equation y = x+1 defines a
function that assigns to each number x the number x+1. A 2d
digital image may be seen as a function that assigns to each dot,
given by its horizontal and vertical coordinates, a color
("black", "white", "green", etc). Functions are also called
"mappings".
graymap
-------
see bitmap.
heuristic
---------
a procedure whose properties are not assured. Heuristics are
generally the expression of some more or less vague feeling, or a
naive, initial approach to a complex problem. If a heuristic can
be proven to satisfy some interesting property, then it can be
referred to as an algorithm (with regard to that property). Some
experts say that OCR is an engineering field, not a mathematical
field. Perhaps we can express this same idea by saying that, by its
own nature, OCR is a field where nothing other than heuristics can
be stated.
mapping
-------
see function.
page
----
a scanned document. The Clara OCR documentation tries to avoid
using terms like "document", "image" or "file" to signify a
scanned document. "Page" is used instead.
pattern
-------
in the Clara OCR context, it's a letter, digit or accent
instance, used to classify the page symbols through bitmap
comparison. Clara OCR builds a set of patterns based on manual
training or automatic selection, and uses it to classify all page
symbols.
pixel
-----
each one of the individual dots that compose a digital image
(quite frequently, the term "pixel" is used to refer only to the
non-white dots of an image). A digital image is usually a
rectangular matrix of dots. To each one it's possible to assign
one of many available colors, in order to form an image. If the
available colors are only "black" and "white", the image thus
formed is a "black-and-white image". As the representation of one
of two possible values may be done using a bit, and the
assignment of geometrically well positioned dots to colors may be
seen as a function or mapping, a black-and-white image is also
called a "bitmap". Similarly, if the colors available are only
gray levels, usually from 0 (black) to 255 (white), then the
image is a "grayscale image" or a "graymap". A generic
assignment of pixels to colors is called a "pixmap".
pixel distance
--------------
a bitmap comparison technique that builds a mapping from all
pixels of one bitmap to the pixels of another bitmap. If this
mapping is found to satisfy certain mathematical properties, the
bitmaps are considered similar.
pixmap
------
see pixel.
PBM
---
see PNM.
PGM
---
see PNM.
PNM
---
Portable aNyMap. PNM is a generic reference to the graphic file
formats PBM, PGM and PPM defined by Jef Poskanzer. In other
words, to say that a program supports PNM means that it handles
PBM, PGM and PPM. PBM (Portable BitMap) files are black-and-white
images, 1 bit per pixel. PGM (Portable GrayMap) files are
grayscale images, 8 bits per pixel. PPM (Portable PixMap) files
are color images, 24 bits per pixel. Currently Clara OCR accepts
raw PBM and raw PGM files only. A scanned page stored in some format
other than PBM or PGM can be converted to PBM or PGM using the
netpbm tools, ImageMagick or others. PNM files may be "raw" or
"plain". The plain versions are rarely used. Clara OCR does not
support plain PBM nor plain PGM.
PPM
---
see PNM.
resolution
----------
this term is used throughout the Clara OCR documentation to refer
to either the image size (for instance: 640x480 pixels) or the image
density (for instance: 300 pixels per inch).
skeleton
--------
ideally, it's a minimal structural bitmap. From an algorithmic
standpoint, the skeleton of a symbol is the bitmap obtained by
clearing a number of its peripheral pixels whose removal does
not destroy the symbol shape.
skeleton fitting
----------------
a bitmap comparison technique that decides that two given bitmaps
are similar if and only if the skeleton of each one fits into the
other.
symbol
------
an instance of a letter or digit in a page. So if the word
"classical" occurs in a page, all its letters ("c", "l", "a",
"s", "s", "i", "c", "a", "l") are individual symbols. At the
source code level, things that are neither letters nor digits are
sometimes called symbols too (for instance, pieces of broken
symbols, dots, accents, noise, etc).
thresholding
------------
a simple binarization method. It maps each pixel of a
graymap to either black or white by testing whether its gray level
is below a given threshold. So, if the threshold
is, say, 171, then all gray levels from 0 to 170 are mapped to 0
(black) and all gray levels from 171 to 255 are mapped to 255
(white). The thresholding is said to be global if one fixed
(per-page) binarization threshold is used to decide the mapping
of all page pixels. The thresholding is said to be local if the
threshold is allowed to vary along the page, to compensate for
irregular printing intensity.
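Global thresholding as just defined fits in a few lines. The sketch
below works in place on a buffer of 8-bit gray levels. The function
name and the buffer layout are assumptions of this example, not the
Clara OCR binarization code.

```c
#include <assert.h>
#include <stddef.h>

// Global thresholding: gray levels below `thresh` become 0 (black),
// all others become 255 (white). Works in place on n pixels.
void threshold_global(unsigned char *gray, size_t n, int thresh)
{
    size_t i;
    assert(thresh >= 0 && thresh <= 256);
    for (i = 0; i < n; i++)
        gray[i] = (gray[i] < thresh) ? 0 : 255;
}
```

A local variant would compute a separate threshold for each region of
the page before applying the same comparison.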
Xlib
----
the low-level, standard X Windows library. It offers
basic graphic primitives, similar to others found in most graphic
environments, like "draw line", "draw pixel", "get next event",
etc, as well as services more specific to the X Windows way of
doing things, like "connect to an X display", properties
(resources) handling, etc. Xlib does not include facilities
to create menus, buttons, etc. Application programs usually take
these facilities from "toolkits" like Xt, GTK, Qt and
others. Clara OCR creates the few facilities it needs using
the Xlib primitives.
*/
/*
Alignment drafts
s_pair(a,b)
complete_align(a,b)
get_ap(a)
use hardcoded data
get_ap(b)
use hardcoded data
get_dd(a,x,b,d)
estimate from alignment data
geo_align(a)
geo_align(b)
1. geometrical line detection.
2. compute per-symbol geometrical alignment.
3. add per-symbol alignment data to the pattern types.
4. add alignment filtering rule to the classification service.
*/