File: mxTextTools.html

egenix-mx-base 2.0.6-1
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
  <HEAD>
    <TITLE>TextTools - Fast Text Manipulation Tools for Python</TITLE>
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
    <STYLE TYPE="text/css">
      p { text-align: justify; }
      ul.indent { }
      body { }
    </STYLE>
  </HEAD>

  <BODY TEXT="#000000" BGCOLOR="#FFFFFF" LINK="#0000EE" VLINK="#551A8B" ALINK="#FF0000">

    <HR NOSHADE WIDTH="100%">

    <H2>mxTextTools - Fast Text Manipulation Tools for Python</H2>

    <HR SIZE=1 NOSHADE WIDTH="100%">
    <TABLE WIDTH="100%">
      <TR>
	<TD>
	  <SMALL>
	    <A HREF="#Engine">Engine</A> :
	    <A HREF="#Objects">Objects</A> :
	    <A HREF="#Functions">Functions</A> :
	    <A HREF="#Constants">Constants</A> :
	    <A HREF="#Examples">Examples</A> :
	    <A HREF="#Structure">Structure</A> :
	    <A HREF="#Support">Support</A> :
            <A HREF="http://www.egenix.com/files/python/eGenix-mx-Extensions.html#Download-mxBASE"><B>Download</B></A> :
	    <A HREF="#Copyright">Copyright &amp; License</A> :
	    <A HREF="#History">History</A> :
	    <A HREF="" TARGET="_top">Home</A>
	</SMALL>
	</TD>
	<TD ALIGN=RIGHT VALIGN=TOP>
	  <SMALL>
	    <FONT COLOR="#FF0000">Version 2.1.0</FONT>
	  </SMALL>
	</TD>
    </TABLE>
    <HR SIZE=1 NOSHADE WIDTH="100%">

    <H3>Introduction</H3>

    <UL CLASS="indent">

	<P>
	  A while ago, in spring 1997, I started out to write some
	  tools that were supposed to make string handling and parsing
	  text faster than what the standard library has to offer. I
	  had a need for this since I was (and still am) working on a
	  WebService Framework that greatly simplifies building and
	  maintaining interactive web sites. After some initial
	  prototypes of what I call <I>Tagging Engine</I> written
	  totally in Python I started rewriting the main parts in C
	  and soon realized that I needed a little more sophisticated
	  searching tools.

	<P>
	  I could walk through text pretty fast, but in many
	  situations I just needed to replace some text with some
	  other text.

	<P>
	  The next step was to create new types for fast searching
	  in text. I decided to code up an enhanced version of the
	  well known Boyer-Moore search algorithm. This made me think
	  a bit more about searching and how knowledge about the text
	  and the search pattern could be better used to make it work
	  even faster. The result was an algorithm that uses a suffix
	  skip array, which I call Fast Search Algorithm.

	<P>
	  The two search types are built upon a small C lib I wrote
	  for this. The implementations are optimized for gcc/Linux
	  and from the tests I ran I can say that they out-perform
	  every other technique I have tried, even the very fast
	  Boyer-Moore implementation of fgrep(1).

	<P>
	  Then I reintegrated those search utilities into the Tagging
	  Engine and also added a fast variant for doing 'char out of
	  a string'-kind of tests. These are done using 'sets',
	  i.e. strings that contain one bit per character position
	  (and are thus 32 bytes long).
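
	<P>
	  The set representation described above can be sketched in
	  a few lines of pure Python. This is an illustration only,
	  not the package's C code: the helper names
	  <CODE>make_set</CODE> and <CODE>in_set</CODE> and the bit
	  order within each byte are assumptions made for the example.

```python
def make_set(chars):
    """Return a 32-byte bitmap with one bit set per character in chars."""
    bits = bytearray(32)                      # 256 bits, one per 8-bit character
    for ch in chars:
        code = ord(ch)
        if code > 255:
            raise ValueError("only 8-bit characters fit into a 32-byte set")
        bits[code >> 3] |= 1 << (code & 7)    # assumed bit order: low bit first
    return bytes(bits)


def in_set(ch, charset):
    """Test membership with one index operation and one bit mask."""
    code = ord(ch)
    return code < 256 and bool(charset[code >> 3] & (1 << (code & 7)))


vowels = make_set("aeiou")
print(len(vowels))            # 32
print(in_set("e", vowels))    # True
print(in_set("z", vowels))    # False
```

	<P>
	  The point of the layout is that a membership test costs
	  the same no matter how many characters the set contains.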

	<P>
	  All this got wrapped up in a nice Python package:
	<OL>
	  <LI>a fast search mechanism,
	  <LI>a state machine for doing fast tagging,
	  <LI>a set of functions aiding in post-processing the output of the
	    two and
	  <LI>a set of functions handling sets of characters.
	</OL>

	<P>
	  One word about the word '<I>tagging</I>'. This originated
	  from what is done in HTML to mark some text with certain
	  extra information. I extended this notion to assigning
	  Python objects to text substrings. Every substring marked in
	  this way carries a 'tag' (the object) which can be used to
	  do all kinds of nifty things. 

    </UL><!--CLASS="indent"-->
    
    <A NAME="Engine">

    <H3>Tagging Engine</H3>

    <UL CLASS="indent">

	<P>
	  Marking certain parts of a text should not involve storing
	  hundreds of small strings. This is why the Tagging Engine
	  uses a specially formatted list of tuples to return the
	  results:

	<P>
	  <B>Tag List</B>

	<P>
	  A tag list is a list of tuples marking certain slices of
	  a text. The tuples always have the format
<PRE><FONT COLOR="#000066">(object, left_index, right_index, sublist)
</FONT></PRE>
	<P>
	  with the meaning: <CODE>object</CODE> contains
	  information about the slice
	  <CODE>[left_index:right_index]</CODE> in some text. The
	  <CODE>sublist</CODE> is either another taglist created
	  by recursively invoking the Tagging Engine or
	  <CODE>None</CODE>.
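
	<P>
	  As a small illustration of the format, the following
	  sketch walks a hand-written tag list and prints the slice
	  each entry refers to. The helper <CODE>show()</CODE> and
	  the sample data are made up for this example; they are not
	  part of the package.

```python
def show(taglist, text, depth=0):
    """Return one line per tag list entry, indenting nested sublists."""
    lines = []
    for tagobj, left, right, sublist in taglist:
        lines.append("%s%r: %r" % ("  " * depth, tagobj, text[left:right]))
        if sublist is not None:
            # a sublist has the same format and refers to the same text
            lines.extend(show(sublist, text, depth + 1))
    return lines


text = "Hello world"
taglist = [
    ("word", 0, 5, [("capital", 0, 1, None)]),
    ("word", 6, 11, None),
]
for line in show(taglist, text):
    print(line)
# 'word': 'Hello'
#   'capital': 'H'
# 'word': 'world'
```

	<P>
	  Note that only indices are stored; the substrings are
	  created on demand by slicing the original text.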

	<P>
	  <B>Tag Table</B>

	<P>
	  To create such taglists, you have to define a Tag Table
	  and let the Tagging Engine use it to mark the text.  Tag
	  Tables are really just standard Python tuples containing
	  other tuples in a specific format:

	<PRE><FONT COLOR="#000066">tag_table = (('lowercase',AllIn,a2z,+1,+2),
	     ('upper',AllIn,A2Z,+1),
	     (None,AllIn,white+newline,+1),
	     (None,AllNotIn,alpha+white+newline,+1),
	     (None,EOF,Here,-4)) # EOF </FONT></PRE>

	<P>
	  The tuples contained in the table use a very simple format:
	    <PRE><FONT COLOR="#000066">(tagobj, command+flags, command_argument
	      		[,jump_no_match] [,jump_match=+1])
	    </FONT></PRE>

	<B>Semantics</B>

	<P>
	  The Tagging Engine reads the Tag Table starting at the top
	  entry. While performing the command actions (see below for
	  details) it moves a read-head over the characters of the
	  text. The engine stops when a command fails to match and no
	  alternative is given or when it reaches a non-existing
	  entry, e.g. by jumping beyond the end of the table.

	<P>
	  Tag Table entries are processed as follows:

	<P>
	  If the <CODE>command</CODE> matched, say the slice
	  <CODE>text[l:r]</CODE>, the default action is to append
	  <CODE>(tagobj,l,r,sublist)</CODE> to the taglist (this
	  behaviour can be modified by using special
	  <CODE>flags</CODE>; if you use <CODE>None</CODE> as tagobj,
	  no tuple is appended) and to continue matching with the
	  table entry that is reached by adding
	  <CODE>jump_match</CODE> to the current position (think of
	  them as relative jump offsets). The head position of the
	  engine stays where the command left it (over index
	  <CODE>r</CODE>), e.g. for <CODE>(None,AllIn,'A')</CODE>
	  right after the last 'A' matched.

	<P>
	  In case the <CODE>command</CODE> does not match, the
	  engine either continues at the table entry reached after
	  skipping <CODE>jump_no_match</CODE> entries, or if this
	  value is not given, terminates matching the current
	  table and returns <I>not matched</I>. The head position is
	  always restored to the position it was in before the
	  non-matching command was executed, enabling
	  backtracking.

	<P>
	  The format of the <CODE>command_argument</CODE> is dependent
	  on the command. It can be a string, a set, a search object,
	  a tuple or some other wild animal from Python land. See the
	  command section below for details.

	<P>
	  A table matches a string if and only if the Tagging Engine
	  reaches a table index that lies beyond the end of the
	  table. The engine then returns <I>matched ok</I>. Jumping
	  beyond the start of the table (to a negative table index)
	  causes the table to return with result <I>failed to
	  match</I>.
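
	<P>
	  The control flow described above can be sketched as a toy
	  interpreter in pure Python. It is a simplified model, not
	  the C engine: <CODE>toy_tag()</CODE> and
	  <CODE>run_command()</CODE> are hypothetical names, the
	  commands are plain strings instead of the package's
	  integer constants, only three commands are modelled, and
	  flags and sublists are left out.

```python
import string

ALL_IN, ALL_NOT_IN, EOF = "AllIn", "AllNotIn", "EOF"   # stand-ins for the integer commands
HERE = None                                            # placeholder argument for EOF


def run_command(command, arg, text, x, end):
    """Return the new head position, or None if the command does not match."""
    if command == EOF:
        return x if x >= end else None
    y = x
    if command == ALL_IN:
        while y < end and text[y] in arg:
            y += 1
    elif command == ALL_NOT_IN:
        while y < end and text[y] not in arg:
            y += 1
    else:
        raise ValueError("unknown command %r" % (command,))
    return y if y > x else None          # at least one character must match


def toy_tag(text, table, start=0, end=None):
    """Interpret table over text[start:end]; return (success, taglist, head)."""
    if end is None:
        end = len(text)
    taglist, x, i = [], start, 0
    while 0 <= i < len(table):
        entry = table[i]
        tagobj, command, arg = entry[:3]
        jump_no_match = entry[3] if len(entry) > 3 else None
        jump_match = entry[4] if len(entry) > 4 else 1
        y = run_command(command, arg, text, x, end)
        if y is None:
            if jump_no_match is None:
                return 0, taglist, x     # no alternative given: not matched
            i += jump_no_match           # the head stays where it was
        else:
            if tagobj is not None:
                taglist.append((tagobj, x, y, None))
            x = y
            i += jump_match
    # leaving the table at the end means success, at the start failure
    return (1 if i >= len(table) else 0), taglist, x


a2z, A2Z = string.ascii_lowercase, string.ascii_uppercase
white, newline = " \t", "\r\n"
alpha = a2z + A2Z

tag_table = (
    ("lowercase", ALL_IN, a2z, +1, +2),
    ("upper", ALL_IN, A2Z, +1),
    (None, ALL_IN, white + newline, +1),
    (None, ALL_NOT_IN, alpha + white + newline, +1),
    (None, EOF, HERE, -4),
)
print(toy_tag("hello WORLD 42", tag_table))
# (1, [('lowercase', 0, 5, None), ('upper', 6, 11, None)], 14)
```

	<P>
	  The last entry shows the usual loop idiom: as long as the
	  end of the text has not been reached, the failing
	  <CODE>EOF</CODE> test jumps four entries back to the top of
	  the table.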

	<P>
	  <B>Tagging Commands</B>

	<P>
	  The commands and constants used here are integers defined in
	  <TT>Constants/TagTables.py</TT> and imported into the
	  package's root module. For the purpose of explaining the
	  actions taken we assume that the tagging engine was called
	  with <CODE>tag(text,table,start=0,end=len(text))</CODE>. The
	  current head position is indicated by <CODE>x</CODE>.

	<P>
	<TABLE BORDER=0 CELLSPACING=1 CELLPADDING=5 BGCOLOR="#F3F3F3">
	  <TR BGCOLOR="#D6D6D6">
	    <TD><B>Command</B></TD>

	    <TD><B>Matching Argument</B></TD>

	    <TD><B>Action</B></TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>Fail</TD>

	    <TD>Here</TD>

	    <TD>
	      Causes the engine to fail matching at the current head
	      position.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>Jump</TD>

	    <TD>To</TD>

	    <TD>
	      Causes the engine to perform a relative jump by
	      <CODE>jump_no_match</CODE> entries.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>AllIn</TD>

	    <TD>string</TD>

	    <TD>
	      Matches all characters found in <CODE>text[x:end]</CODE>
	      up to the first that is not included in string. At least
	      one character must match.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>AllNotIn</TD>

	    <TD>string</TD>

	    <TD>
	      Matches all characters found in <CODE>text[x:end]</CODE>
	      up to the first that is included in string. At least one
	      character must match.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>AllInSet</TD>

	    <TD>set</TD>

	    <TD>
	      Matches all characters found in <CODE>text[x:end]</CODE>
	      up to the first that is not included in the string
	      set. At least one character must match.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>Is</TD>

	    <TD>character</TD>

	    <TD>
	      Matches iff <CODE>text[x] == character</CODE>.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>IsNot</TD>

	    <TD>character</TD>

	    <TD>
	      Matches iff <CODE>text[x] != character</CODE>.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>IsIn</TD>

	    <TD>string</TD>

	    <TD>
	      Matches iff <CODE>text[x] is in string</CODE>.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>IsNotIn</TD>

	    <TD>string</TD>

	    <TD>
	      Matches iff <CODE>text[x] is not in string</CODE>.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>IsInSet</TD>

	    <TD>set</TD>

	    <TD>
	      Matches iff <CODE>text[x] is in set</CODE>.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>Word</TD>

	    <TD>string</TD>

	    <TD>
	      Matches iff <CODE>text[x:x+len(string)] == string</CODE>.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>WordStart</TD>

	    <TD>string</TD>

	    <TD>
	      Matches all characters up to the first occurrence of
	      string in <CODE>text[x:end]</CODE>.
	      <P>
		If string is not found, the command does not match and
		the head position remains unchanged. Otherwise, the
		head stays on the first character of string in the
		found occurrence.
	      <P>
		At least one character must match.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>WordEnd</TD>

	    <TD>string</TD>

	    <TD>
	      Matches all characters up to the first occurrence of
	      string in <CODE>text[x:end]</CODE>.
	      <P>
		If string is not found, the command does not match and
		the head position remains unchanged.  Otherwise, the
		head stays on the last character of string in the
		found occurrence.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>sWordStart</TD>

	    <TD>search object</TD>

	    <TD>
	      Same as WordStart except that the search object is used
	      to perform the necessary action (which can be much faster)
	      and zero matching characters are allowed.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>sWordEnd</TD>

	    <TD>search object</TD>

	    <TD>
	      Same as WordEnd except that the search object is used
	      to perform the necessary action (which can be much faster).
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>sFindWord</TD>

	    <TD>search object</TD>

	    <TD>
	      Uses the search object to find the given substring.
	      <P>
		If found, the tagobj is assigned only to the slice of
		the substring. The characters leading up to it are
		ignored.
	      <P>
		The head position is adjusted to right after the
		substring -- just like for sWordEnd.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>Call</TD>

	    <TD>function</TD>

	    <TD>
	      Calls the matching
	      <CODE>function(text,x,end)</CODE>.
	      <P>
		The function must return the index <CODE>y</CODE> of
		the character in <CODE>text[x:end]</CODE> right after
		the matching substring.
	      <P>
		The entry is considered to match iff <CODE>x !=
		y</CODE>. The engine's head is positioned on
		<CODE>y</CODE> in that case.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>CallArg</TD>

	    <TD>(function,[arg0,...])</TD>

	    <TD>
	      Same as Call except that
	      <CODE>function(text,x,end[,arg0,...])</CODE> is being
	      called. The command argument must be a tuple.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>Table</TD>

	    <TD>tagtable or ThisTable</TD>

	    <TD>
	      Matches iff tagtable matches <CODE>text[x:end]</CODE>.
	      <P>
		This calls the engine recursively.
	      <P>
		In case of success the head position is adjusted to
		point right after the match and the returned taglist
		is made available in the subtags field of this table's
		taglist entry.
	      <P>
		You may pass the special constant
		<CODE>ThisTable</CODE> instead of a Tag Table if you
		want to call the current table recursively.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>SubTable</TD>

	    <TD>tagtable or ThisTable</TD>

	    <TD>
	      Same as Table except that the subtable reuses this
	      table's tag list for its tag list.  The
	      <CODE>subtags</CODE> entry is set to None.
	      <P>
		You may pass the special constant
		<CODE>ThisTable</CODE> instead of a Tag Table if you
		want to call the current table recursively.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>TableInList</TD>

	    <TD>(list_of_tables,index)</TD>

	    <TD>
	      Same as Table except that the matching table to be used
	      is read from the <CODE>list_of_tables</CODE> at position
	      <CODE>index</CODE> whenever this command is
	      executed.
	      <P>
		This makes self-referencing tables possible which
		would otherwise not be possible (since Tag Tables are
		immutable tuples).
	      <P>
		Note that it can also introduce circular references,
		so be warned !
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>SubTableInList</TD>

	    <TD>(list_of_tables,index)</TD>

	    <TD>
	      Same as TableInList except that the subtable reuses this
	      table's tag list. The <CODE>subtags</CODE> entry is set
	      to <CODE>None</CODE>.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>EOF</TD>

	    <TD>Here</TD>

	    <TD>
	      Matches iff the head position is beyond <CODE>end</CODE>.
	    </TD>
	  </TR>

	  <TR VALIGN=TOP>
	    <TD>Skip</TD>

	    <TD>offset</TD>

	    <TD>
	      Always matches and moves the head position to <CODE>x +
	      offset</CODE>.
	    </TD>
	  </TR>

	  
	  <TR VALIGN=TOP>
	    <TD>Move</TD>

	    <TD>position</TD>

	    <TD>
	      Always matches and moves the head position to
	      <CODE>slice[position]</CODE>. Negative indices move the
	      head to <CODE>slice[len(slice)+position+1]</CODE>,
	      e.g. (None,Move,-1) moves to EOF. <CODE>slice</CODE>
	      refers to the current text slice being worked on by the
	      Tagging Engine.
	    </TD>
	  </TR>

	  
	  <TR VALIGN=TOP>
	    <TD>Loop</TD>

	    <TD>count</TD>

	    <TD>
	      Remains undocumented for this release.
	    </TD>
	  </TR>

	  
	  <TR VALIGN=TOP>
	    <TD>LoopControl</TD>

	    <TD>Break/Reset</TD>

	    <TD>
	      Remains undocumented for this release.
	    </TD>
	  </TR>

	</TABLE>
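
	<P>
	  The head arithmetic of <CODE>Skip</CODE> and
	  <CODE>Move</CODE> from the table above can be written out
	  as follows. The functions <CODE>skip()</CODE> and
	  <CODE>move()</CODE> are hypothetical helpers, and the
	  sketch assumes that <CODE>slice</CODE> means
	  <CODE>text[start:end]</CODE> as passed to the engine, so
	  that positions are counted from <CODE>start</CODE>.

```python
def skip(x, offset):
    """Skip: always matches; the head moves relative to its current position."""
    return x + offset


def move(start, end, position):
    """Move: always matches; position is relative to the slice text[start:end]."""
    if position < 0:
        # slice[len(slice) + position + 1], so -1 lands just past the last character
        position = (end - start) + position + 1
    return start + position


start, end = 2, 8            # the engine works on text[2:8]
print(move(start, end, 0))   # 2, the first character of the slice
print(move(start, end, -1))  # 8, the EOF position of the slice
print(move(start, end, -2))  # 7, the last character of the slice
print(skip(4, 3))            # 7
```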

	<P>
	  The following flags can be added to the command integers above:

	<P>
	<UL CLASS="indent">
	    <DL>

	      <DT>
		CallTag
		
	      <DD>
		Instead of appending <CODE>(tagobj,l,r,subtags)</CODE>
		to the taglist upon successful matching, call
		<CODE>tagobj(taglist,text,l,r,subtags)</CODE>.
		<P>

	      <DT>
		AppendMatch

	      <DD>
		Instead of appending <CODE>(tagobj,l,r,subtags)</CODE>
		to the taglist upon successful matching, append the
		match found as string.  
		<P>
		  Note that this will produce non-standard taglists ! 
		  It is useful in combination with <CODE>join()</CODE>
		  though and can be used to implement smart split()
		  replacement algorithms.
		<P>
		  
	      <DT>
		AppendToTagobj

	      <DD>
		Instead of appending <CODE>(tagobj,l,r,subtags)</CODE>
		to the taglist upon successful matching, call
		<CODE>tagobj.append((None,l,r,subtags))</CODE>.
		<P>
		  
	      <DT>
		AppendTagobj

	      <DD>
		Instead of appending <CODE>(tagobj,l,r,subtags)</CODE>
		to the taglist upon successful matching, append
		<CODE>tagobj</CODE> itself. 
		<P>
		  Note that this can cause the taglist to have a
		  non-standard format, i.e. functions relying on the
		  standard format could fail. 
		<P>
		  This flag is mainly intended to build
		  <I>join-lists</I> usable by the
		  <CODE>join()</CODE>-function (see below).
		<P>

	      <DT>
		LookAhead

	      <DD>
		If this flag is set, the current position of the head
		will be reset to <CODE>l</CODE> (the left position of
		the match) after a successful match.
		<P>
		  This is useful to implement lookahead strategies.
		<P>
		  Using the flag has no effect on the way the tagobj
		  itself is treated, i.e. it will still be processed
		  in the usual way.
		<P>

	    </DL>
	</UL><!--CLASS="indent"-->
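
	<P>
	  The effect of the flags on a successful match can be
	  summarized in one function. This is a sketch of the
	  documented behaviour, not the engine's code: the flag
	  values are invented, <CODE>record_match()</CODE> is a
	  hypothetical name, and the order in which combined flags
	  are checked is an assumption.

```python
# Hypothetical bit values; the package defines its own integers.
CALL_TAG, APPEND_MATCH, APPEND_TO_TAGOBJ, APPEND_TAGOBJ, LOOK_AHEAD = 1, 2, 4, 8, 16


def record_match(flags, taglist, tagobj, text, l, r, subtags=None):
    """Record one successful match text[l:r] and return the new head position."""
    if flags & CALL_TAG:
        tagobj(taglist, text, l, r, subtags)      # tagobj decides what to store
    elif flags & APPEND_MATCH:
        taglist.append(text[l:r])                 # plain string, non-standard entry
    elif flags & APPEND_TO_TAGOBJ:
        tagobj.append((None, l, r, subtags))      # tagobj collects the entries itself
    elif flags & APPEND_TAGOBJ:
        taglist.append(tagobj)                    # the object only, no indices
    elif tagobj is not None:
        taglist.append((tagobj, l, r, subtags))   # default action
    # LookAhead resets the head to the left edge of the match
    return l if flags & LOOK_AHEAD else r


text = "key=value"
taglist, collected = [], []
print(record_match(APPEND_MATCH, taglist, None, text, 0, 3))                        # 3
print(record_match(APPEND_TO_TAGOBJ | LOOK_AHEAD, taglist, collected, text, 4, 9))  # 4
print(taglist)     # ['key']
print(collected)   # [(None, 4, 9, None)]
```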

	<P>
	  Some additional constants that can be used as argument or relative
	  jump position:

	<P>
	<UL CLASS="indent">
	    <DL>

	      <DT>
		To
		
	      <DD>
		Useful as argument for 'Jump'.
		<P>

	      <DT>
		Here
		
	      <DD>
		Useful as argument for 'Fail' and 'EOF'.
		<P>

	      <DT>
		MatchOk
		
	      <DD>
		Jumps to a table index beyond the table's end, causing
		the current table to immediately return with 'matched
		ok'.
		<P>

	      <DT>
		MatchFail
		
	      <DD>
		Jumps to a negative table index, causing the current
		table to immediately return with 'failed to match'.
		<P>

	      <DT>
		ToBOF,ToEOF
		
	      <DD>
		Useful as arguments for 'Move': (None,Move,ToEOF)
		moves the head to the character behind the last
		character in the current slice, while
		(None,Move,ToBOF) moves to the first character.
		<P>

	      <DT>
		ThisTable
		
	      <DD>
		Useful as argument for 'Table' and 'SubTable'. See
		above for more information.
		<P>

	    </DL>
	</UL><!--CLASS="indent"-->

	<P>
	  Internally, the Tag Table is used as program for a state
	  machine which is coded in C and accessible through the
	  package as <CODE>tag()</CODE> function along with the
	  constants used for the commands (e.g. AllIn, AllNotIn,
	  etc.). Note that in computer science one normally
	  differentiates between finite state machines, pushdown
	  automata and turing machines. The Tagging Engine offers all
	  these levels of complexity depending on which techniques you
	  use, yet the basic structure of the engine is best compared
	  to a finite state machine.

	<P>
	  I admit, these tables don't look very elegant. In fact I
	  would much rather write them in some meta language that gets
	  compiled into these tables instead of handcoding them. But
	  I've never had time to do much research into this. Mike
	  C. Fletcher has been doing some work in this direction
	  recently. You may want to check out his <A
	  HREF="http://members.home.com/mcfletch/programming/simpleparse/simpleparse.html">SimpleParse</A>
	  add-on for mxTextTools. Recently, Tony J. Ibbs has also
	  started to work in this direction. His <A
	  HREF="http://homepage.ntlworld.com/tibsnjoan/mxtext/metalang.html">meta-language
	  for mxTextTools</A> aims at simplifying the task of writing
	  Tag Table tuples.

	<P>
	  <U>Tip:</U> if you are getting an error 'call of a
	  non-function' while writing a table definition, you probably
	  have a missing ',' somewhere in the tuple !
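
	<P>
	  The reason is that two adjacent tuples without a comma
	  between them are parsed as a call of the first tuple. The
	  snippet below reproduces this with a current Python
	  interpreter, where the wording of the error differs from
	  the message quoted above. The strings stand in for the
	  package's command constants, and <CODE>eval()</CODE> is
	  used only so that the faulty table can be shown as data.

```python
import warnings

source = """(
    ('lowercase', 'AllIn', 'abc', +1, +2)      # <-- no comma after this entry
    ('upper', 'AllIn', 'ABC', +1),
)"""

with warnings.catch_warnings():
    warnings.simplefilter("ignore")    # newer interpreters also warn at compile time
    try:
        eval(source)
        message = "no error"
    except TypeError as exc:
        message = str(exc)

print(message)
```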

	<P>
	  <B>Debugging</B>

	<P>
	  The package includes a nearly complete Python emulation of
	  the Tagging Engine in the Examples subdirectory
	  (pytag.py). Though it is unsupported it might still provide
	  some use since it has a builtin debugger that will let you
	  step through the Tag Tables as they are executed. See the
	  source for further details.

	<P>
	  As an alternative you can build a version of the Tagging
	  Engine that provides lots of debugging output. See
	  <TT>mxTextTools/Setup</TT> for explanations on how to do
	  this. When enabled the module will create several
	  <TT>.log</TT> files containing the debug information of
	  various parts of the implementation whenever the Python
	  interpreter is run with the debug flag enabled (python
	  -d). These files should give a fairly good insight into the
	  workings of the Tag Engine (though it still isn't as elegant
	  as it could be).

	<P>
	  Note that the debug version of the module is almost as fast
	  as the regular build, so you might as well do regular work
	  with it.

    </UL><!--CLASS="indent"-->

    <A NAME="Objects">

    <H3>Search Objects</H3>

    <UL CLASS="indent">

	<P>
	  These objects are immutable and usable for one search string
	  per object only. They can be applied to as many text strings
	  as you like -- much like compiled regular
	  expressions. Matching is exact (translations are applied
	  on the fly).

	<P>
	  The search objects can be pickled and implement the copy
	  protocol as defined by the copy module. Comparisons and
	  hashing are not implemented (the objects are stored by id in
	  dictionaries -- may change in future releases though).

	<P>
	  <B>Search Object Constructors</B>

	<UL CLASS="indent">
	    <P>
	      There are two types of search objects. The Boyer-Moore
	      type uses less memory, while the Fast Search type gives
	      you enhanced speed with a little more memory overhead.

	    <P>
	      <U>Note:</U> The Fast Search object is *not* included in
	      the public release, since I want to write a paper about
	      it and therefore can't make it available yet.
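
	    <P>
	      For readers unfamiliar with this family of algorithms,
	      here is a sketch of the Horspool simplification of
	      Boyer-Moore in pure Python. It is neither the enhanced
	      variant used by the package nor the Fast Search
	      algorithm; <CODE>horspool_find()</CODE> is a
	      hypothetical name chosen for the example.

```python
def horspool_find(text, match, start=0, stop=None):
    """Return the index of the first occurrence of match in text[start:stop], or -1."""
    if stop is None:
        stop = len(text)
    m = len(match)
    if m == 0:
        return start if start <= stop else -1
    # Shift table: for every character of the pattern except the last one,
    # the distance from its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(match[:-1])}
    pos = start
    while pos + m <= stop:
        if text[pos:pos + m] == match:
            return pos
        # Look at the text character under the last pattern position and
        # slide the pattern so that its rightmost matching character lines up;
        # characters not in the pattern allow a jump by the full length.
        pos += shift.get(text[pos + m - 1], m)
    return -1


print(horspool_find("here is a simple example", "example"))   # 17
```

	    <P>
	      The shift table is what lets the search skip over parts
	      of the text without looking at every character; longer
	      search strings generally allow larger skips.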

	    <P>
	    <DL>
	      <DT><CODE><FONT COLOR="#000099">
		    BMS(match[,translate])</FONT></CODE></DT>

	      <DD>
		Create a Boyer Moore substring search object for the
		string match; translate is an optional
		translate-string like the one used in the module 're',
		i.e. a 256 character string mapping the ordinals of
		the base character set to new characters. </DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    FS(match[,translate])</FONT></CODE></DT>

	      <DD>
		Create a Fast substring Search object for the string
		match; translate is an optional translate-string like
		the one used in the module 're'. </DD><P>

	    </DL>
	</UL><!--CLASS="indent"-->
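
	<P>
	  A translate string of this shape can be built as shown
	  below. The sketch uses Python's own
	  <CODE>bytes.translate()</CODE> to demonstrate the mapping;
	  <CODE>make_translate()</CODE> is a hypothetical helper and
	  no search object from the package is involved.

```python
def make_translate(mapping=None):
    """Return a 256-byte table; byte i holds the replacement for ordinal i."""
    table = bytearray(range(256))          # start with the identity mapping
    for src, dst in (mapping or {}).items():
        table[ord(src)] = ord(dst)
    return bytes(table)


# Fold ASCII upper case onto lower case for case-insensitive matching.
fold_case = make_translate({chr(c): chr(c + 32) for c in range(ord("A"), ord("Z") + 1)})

text = b"Fast Text Manipulation TOOLS"
print(len(fold_case))                              # 256
print(text.translate(fold_case))                   # b'fast text manipulation tools'
print(text.translate(fold_case).find(b"tools"))    # 23
```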

	<P>
	  <B>Search Object Instance Variables</B>

	<UL CLASS="indent">
	    <P>
	      To provide some help for reflection and pickling
	      the search types give (read-only) access to these
	      attributes.

	    <P>
	    <DL>

	      <DT><CODE><FONT COLOR="#000099">
		    match</FONT></CODE></DT>

	      <DD>
		The string that the search object will look for in the
		search text.</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    translate</FONT></CODE></DT>

	      <DD>
		The translate string used by the object or None (if no
		translate string was passed to the
		constructor).</DD><P>

	    </DL>

	</UL><!--CLASS="indent"-->

	<P>
	  <B>Search Object Instance Methods</B>

	<UL CLASS="indent">
	    <P>
	      The two search types have the same methods:

	    <P>
	    <DL>

	      <DT><CODE><FONT COLOR="#000099">
		    search(text[,start=0,len_text=len(text)])</FONT></CODE></DT>

	      <DD>
		Search for the substring match in text, looking only
		at the slice <CODE>[start:len_text]</CODE> and return
		the slice <CODE>(l,r)</CODE> where the substring was
		found, or <CODE>(start,start)</CODE> if it was not
		found.</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    find(text[,start=0,len_text=len(text)])</FONT></CODE></DT>

	      <DD>
		Search for the substring match in text, looking only
		at the slice <CODE>[start:len_text]</CODE> and return
		the index where the substring was found, or
		<CODE>-1</CODE> if it was not found. This interface is
		compatible with <CODE>string.find</CODE>.</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    findall(text,start=0,len_text=len(text))</FONT></CODE></DT>

	      <DD>
		Same as <CODE>search()</CODE>, but return a list of
		all non-overlapping slices <CODE>(l,r)</CODE> where
		the match string can be found in text.</DD><P>

	    </DL>
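	    <P>
	      The return conventions of the three methods can be
	      illustrated with a small pure-Python sketch. It only
	      mirrors the behaviour described above using
	      <CODE>str.find</CODE>; it is not the Boyer-Moore
	      implementation, it ignores the translate string, and the
	      stand-alone function names (which take the match string as
	      an extra first argument) are hypothetical.

```python
def search(match, text, start=0, len_text=None):
    """Return the slice (l, r) of the first hit, or (start, start) if none."""
    if len_text is None:
        len_text = len(text)
    left = text.find(match, start, len_text)
    if left < 0:
        return (start, start)          # "not found" is an empty slice at start
    return (left, left + len(match))


def find(match, text, start=0, len_text=None):
    """Return the index of the first hit, or -1 (like string.find)."""
    if len_text is None:
        len_text = len(text)
    return text.find(match, start, len_text)


def findall(match, text, start=0, len_text=None):
    """Return all non-overlapping slices (l, r) where match occurs."""
    if len_text is None:
        len_text = len(text)
    hits = []
    step = max(len(match), 1)          # skip past each hit so hits never overlap
    pos = text.find(match, start, len_text)
    while pos >= 0:
        hits.append((pos, pos + len(match)))
        pos = text.find(match, pos + step, len_text)
    return hits


print(search('ab', 'xxabyyab'))        # (2, 4)
print(search('zz', 'xxab', 1))         # (1, 1)
print(findall('aa', 'aaaaa'))          # [(0, 2), (2, 4)]
```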

	    <P>
	      Note that translating the text before doing the search
	      often results in better performance. Use
	      <CODE>string.translate()</CODE> to do that efficiently.
	</UL><!--CLASS="indent"-->
    </UL><!--CLASS="indent"-->

    <A NAME="Functions">

    <H3>Functions</H3>

    <UL CLASS="indent">

	<P>
	  These functions are defined in the package:

	<P>
	<UL CLASS="indent">
	    <DL>

	      <DT><CODE><FONT COLOR="#000099">
		    tag(text,tagtable[,startindex=0,len_text=len(text),taglist=[]])</FONT></CODE></DT>

	      <DD>
		This is the interface to the Tagging Engine. 

		<P>
		  It returns a tuple <CODE>(success, taglist,
		  nextindex)</CODE>, where nextindex indicates the
		  next index to be processed after the last character
		  matched by the Tag Table. 

		<P>
		  In case of a non match (success == 0), it points to
		  the error location in text.  If you provide a tag
		  list it will be used for the processing. 

		<P>
		  Passing <CODE>None</CODE> as taglist results in no
		  tag list being created at all. </DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    join(joinlist[,sep='',start=0,stop=len(joinlist)])</FONT></CODE></DT>

	      <DD>
		This function works much like the corresponding
		function in module 'string'. It pastes slices from
		other strings together to form a new string. 

		<P>
		  The format expected as <I>joinlist</I> is similar to
		  a tag list: it is a sequence of tuples
		  <CODE>(string,l,r[,...])</CODE> (the resulting
		  string will then include the slice
		  <CODE>string[l:r]</CODE>) or strings (which are
		  copied as a whole). Extra entries in the tuple are
		  ignored. 

		<P>
		  The optional argument sep is a separator to be used
		  in joining the slices together; it defaults to the
		  empty string (unlike string.join). start and stop
		  define the slice of joinlist the function will work
		  on.
		
		<P>
		  <U>Important Note:</U> The syntax used for negative
		  slices differs from the Python standard: -1
		  corresponds to the first character *after* the string,
		  e.g. ('Example',0,-1) gives 'Example' and not 'Exampl',
		  as it would in Python. To avoid confusion, don't use
		  negative indices. </DD><P>
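	      <DD>
		<P>
		  As an illustration of the joinlist format, here is a
		  minimal pure-Python sketch of the joining step. The
		  name <CODE>join_slices</CODE> is hypothetical, the
		  sketch handles non-negative indices only, and it is
		  not the C implementation.

```python
def join_slices(joinlist, sep='', start=0, stop=None):
    """Paste together the entries joinlist[start:stop].

    Each entry is either a plain string (copied as a whole) or a
    tuple (string, l, r, ...) contributing the slice string[l:r].
    """
    if stop is None:
        stop = len(joinlist)
    parts = []
    for entry in joinlist[start:stop]:
        if isinstance(entry, str):
            parts.append(entry)
        else:
            string, left, right = entry[:3]   # extra tuple entries are ignored
            parts.append(string[left:right])
    return sep.join(parts)


text = 'Hello world'
print(join_slices([(text, 0, 5), ', ', (text, 6, 11, 'ignored')]))  # Hello, world
print(join_slices(['a', ('xyz', 1, 2), 'b'], '-'))                  # a-y-b
```
	      </DD><P>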

	      <DT><CODE><FONT COLOR="#000099">
		    cmp(a,b)</FONT></CODE></DT>

	      <DD>
		Compare two valid taglist tuples with respect to their
		slice position. This is useful for sorting joinlists and
		not much slower than sorting integers, since the function
		is coded in C.  </DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    joinlist(text,list[,start=0,stop=len(text)])</FONT></CODE></DT>

	      <DD>
		Produces a joinlist suitable for passing to
		<CODE>join()</CODE> from a list of tuples
		<CODE>(replacement,l,r,...)</CODE> in such a way that all
		slices <CODE>text[l:r]</CODE> are replaced by the given
		replacement. 

		<P>
		  A few restrictions apply, though:
		<OL>

		  <LI>
		    the list must be sorted in ascending order (e.g. using
		    cmp() as the compare function)

		  <LI>
		    it may not contain overlapping slices

		  <LI>
		    the slices may not contain negative indices

		</OL>

		<P>
		  If one of these conditions is not met, a ValueError
		  is raised.

		<P>
		  If the taglist cannot contain overlapping slices, you
		  can give this function the taglist produced by tag()
		  directly (sorting is not needed, as the list will
		  already be sorted).  </DD><P>
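	      <DD>
		<P>
		  The following pure-Python sketch shows one way such a
		  joinlist could be built and how the restrictions come
		  into play. The name <CODE>make_joinlist</CODE> and the
		  exact shape of the emitted entries are assumptions
		  made for illustration; only the input format and the
		  ValueError behaviour are taken from the description
		  above.

```python
def make_joinlist(text, replacements, start=0, stop=None):
    """Build a joinlist that replaces every text[l:r] by its replacement.

    replacements is a sorted list of tuples (replacement, l, r, ...).
    """
    if stop is None:
        stop = len(text)
    result = []
    pos = start
    for entry in replacements:
        replacement, left, right = entry[:3]
        if left < 0 or right < 0:
            raise ValueError('negative indices are not allowed')
        if left < pos or right < left:
            raise ValueError('slices must be sorted and must not overlap')
        if left > pos:
            result.append((text, pos, left))   # untouched text before the slice
        result.append(replacement)
        pos = right
    if pos < stop:
        result.append((text, pos, stop))       # untouched tail
    return result


text = 'one two three'
jl = make_joinlist(text, [('1', 0, 3), ('3', 8, 13)])
# Join the result: strings are copied, tuples contribute a slice.
print(''.join(e if isinstance(e, str) else e[0][e[1]:e[2]] for e in jl))  # 1 two 3
```
	      </DD><P>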

	      <DT><CODE><FONT COLOR="#000099">
		    set(string[,logic=1])</FONT></CODE></DT>

	      <DD>
		Returns a character set for string: a bit encoded version
		of the characters occurring in string. 

		<P>
		  If logic is 0, then all characters <I>not</I> in
		  string will be in the set. </DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    invset(string)</FONT></CODE></DT>

	      <DD>
		Same as <CODE>set(string,0)</CODE>.  </DD><P>
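	      <DD>
		<P>
		  To give an idea of what a bit encoded character set
		  looks like, here is a pure-Python sketch. It assumes
		  one bit per 8-bit character code, packed into 32
		  bytes; this layout and the names
		  <CODE>make_set</CODE> and <CODE>in_set</CODE> are
		  assumptions for illustration, not a specification of
		  the string returned by <CODE>set()</CODE>.

```python
def make_set(chars, logic=1):
    """Return a 32-byte bitmap with one bit per character code 0..255."""
    bits = bytearray(32)
    for ch in chars:
        code = ord(ch)                     # must be below 256
        bits[code >> 3] |= 1 << (code & 7)
    if not logic:
        # logic == 0: all characters *not* in chars are in the set
        bits = bytearray(b ^ 0xFF for b in bits)
    return bytes(bits)


def in_set(ch, charset):
    """Test membership with a single shift and mask."""
    code = ord(ch)
    return bool(charset[code >> 3] & (1 << (code & 7)))


vowels = make_set('aeiou')
not_vowels = make_set('aeiou', 0)          # what invset('aeiou') stands for
print(in_set('e', vowels), in_set('x', vowels))          # True False
print(in_set('e', not_vowels), in_set('x', not_vowels))  # False True
```
	      </DD><P>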

	      <DT><CODE><FONT COLOR="#000099">
		    setfind(text,set[,start=0,stop=len(text)])</FONT></CODE></DT>

	      <DD>
		Find the first occurrence of any character from set in
		<CODE>text[start:stop]</CODE>. <CODE>set</CODE> must be a
		string obtained from <CODE>set()</CODE>.  </DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    setstrip(text,set[,start=0,stop=len(text),mode=0])</FONT></CODE></DT>

	      <DD>
		Strip all characters in text[start:stop] appearing in
		set.  mode indicates where to strip (&lt;0: left; =0:
		left and right; &gt;0: right). set must be a string
		obtained with <CODE>set()</CODE>.  </DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    setsplit(text,set[,start=0,stop=len(text)])</FONT></CODE></DT>

	      <DD>
		Split text[start:stop] into substrings using set, omitting
		the splitting parts and empty substrings. <CODE>set</CODE>
		must be a string obtained from <CODE>set()</CODE>.
	      </DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    setsplitx(text,set[,start=0,stop=len(text)])</FONT></CODE></DT>

	      <DD>
		Split text[start:stop] into substrings using set, so that
		every second entry consists only of characters in
		set. <CODE>set</CODE> must be a string obtained from
		<CODE>set()</CODE>.  </DD><P>
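	      <DD>
		<P>
		  The difference between <CODE>setsplit()</CODE> and
		  <CODE>setsplitx()</CODE> is easiest to see side by
		  side. The sketch below is pure Python and takes the
		  separator characters as a plain string rather than a
		  string obtained from <CODE>set()</CODE>. It also
		  assumes that a text starting with separator characters
		  yields an empty first entry in the
		  <CODE>setsplitx()</CODE> result, so that separators
		  always sit at every second position; that detail is an
		  assumption, not something stated above.

```python
from itertools import groupby


def setsplit(text, chars, start=0, stop=None):
    """Split on runs of separator characters; drop separators and empties."""
    if stop is None:
        stop = len(text)
    members = set(chars)
    return [''.join(run)
            for is_sep, run in groupby(text[start:stop], key=members.__contains__)
            if not is_sep]


def setsplitx(text, chars, start=0, stop=None):
    """Like setsplit, but keep the separator runs at every second position."""
    if stop is None:
        stop = len(text)
    members = set(chars)
    result = []
    for is_sep, run in groupby(text[start:stop], key=members.__contains__):
        if is_sep and len(result) % 2 == 0:
            result.append('')              # pad so separators land on odd indices
        result.append(''.join(run))
    return result


print(setsplit('a, b,,c', ', '))    # ['a', 'b', 'c']
print(setsplitx('a, b,,c', ', '))   # ['a', ', ', 'b', ',,', 'c']
```
	      </DD><P>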

	      <DT><CODE><FONT COLOR="#000099">
		    upper(string)</FONT></CODE></DT>

	      <DD>
		Returns the string with all characters converted to upper
		case. 

		<P>
		  Note that the translation string used is generated
		  at import time. Locale settings will only have an
		  effect if set prior to importing the package. 

		<P>
		  This function is almost twice as fast as the one in
		  the string module. </DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    lower(string)</FONT></CODE></DT>

	      <DD>
		Returns the string with all characters converted to lower
		case. Same note as for upper(). </DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    is_whitespace(text,start=0,stop=len(text))</FONT></CODE></DT>

	      <DD>
		Returns 1 iff text[start:stop] only contains whitespace
		characters (as defined in Constants/Sets.py), 0
		otherwise.</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    replace(text,what,with,start=0,stop=len(text))</FONT></CODE></DT>

	      <DD>
		Works just like string.replace() -- only faster since a
		search object is used in the process. </DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    multireplace(text,replacements,start=0,stop=len(text))</FONT></CODE></DT>

	      <DD>
		Apply multiple replacements to a text in one processing step.

		<P>
		  replacements must be a list of tuples (replacement,
		  left, right).  The replacement string is then used to
		  replace the slice text[left:right].

		<P>
		  Note that the replacements do not affect one another
		  with respect to indexing: indices always refer to the
		  original text string.

		<P>
		  Replacements may not overlap; otherwise a ValueError
		  is raised. </DD><P>
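	      <DD>
		<P>
		  The point about indexing is worth a small example: a
		  replacement that changes the length of the text does
		  not shift the indices of later replacements. The
		  sketch below is a pure-Python stand-in with the same
		  call signature; sorting the replacements first is a
		  convenience of the sketch and not necessarily
		  something the real function does.

```python
def multireplace(text, replacements, start=0, stop=None):
    """Replace each slice text[left:right] by its replacement string.

    All indices refer to the original text, never to the partial result.
    """
    if stop is None:
        stop = len(text)
    parts = []
    pos = start
    for replacement, left, right in sorted(replacements, key=lambda r: r[1]):
        if left < pos or right < left:
            raise ValueError('replacements may not overlap')
        parts.append(text[pos:left])       # unchanged text before the slice
        parts.append(replacement)
        pos = right
    parts.append(text[pos:stop])           # unchanged tail
    return ''.join(parts)


# 'a' grows to four characters, yet (4, 6) still means 'ef' of the original.
print(multireplace('abcdef', [('XXXX', 0, 1), ('', 4, 6)]))   # XXXXbcd
```
	      </DD><P>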

	      <DT><CODE><FONT COLOR="#000099">
		    find(text,what,start=0,stop=len(text))</FONT></CODE></DT>

	      <DD>
		Works just like string.find() -- only faster since a
		search object is used in the process. </DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    findall(text,what,start=0,stop=len(text))</FONT></CODE></DT>

	      <DD>
		Returns a list of slices representing all
		non-overlapping occurrences of what in
		text[start:stop]. The slices are given as 2-tuples
		<CODE>(left,right)</CODE> meaning that
		<CODE>what</CODE> can be found at text[left:right].
		</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    collapse(text,separator=' ')</FONT></CODE></DT>

	      <DD>
		Takes a string, removes all line breaks, converts all
		whitespace to a single separator and returns the
		result. Tim Peters will like this one with separator
		'-'. </DD><P>
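	      <DD>
		<P>
		  In plain Python the effect is roughly that of
		  splitting on whitespace and re-joining, as in this
		  sketch. Note that <CODE>str.split()</CODE> has its own
		  idea of what counts as whitespace, which may differ
		  slightly from the whitespace constant of this package,
		  so treat it as an approximation.

```python
def collapse(text, separator=' '):
    """Replace every run of whitespace (incl. line breaks) by separator."""
    return separator.join(text.split())


print(collapse('a \n b\t\tc'))            # a b c
print(collapse('fast as\nhell', '-'))     # fast-as-hell
```
	      </DD><P>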

	      <DT><CODE><FONT COLOR="#000099">
		    charsplit(text,char,start=0,stop=len(text))</FONT></CODE></DT>

	      <DD>
		Returns a list that results from splitting
		text[start:stop] at all occurrences of the character
		given in char. 

		<P>
		  This is a special case of string.split() that has
		  been optimized for single-character splitting and
		  runs 40% faster. </DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    splitat(text,char,nth=1,start=0,stop=len(text))</FONT></CODE></DT>

	      <DD>
		Returns a 2-tuple that results from splitting
		text[start:stop] at the nth occurrence of char. 

		<P>
		  If the character is not found, the second string is
		  empty. nth may also be negative: the search is then
		  done from the right and the first string is empty in
		  case the character is not found.  

		<P>
		  The splitting character itself is not included in
		  the two substrings. </DD><P>
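	      <DD>
		<P>
		  The handling of positive and negative nth, including
		  the not-found cases, can be summarised in a short
		  pure-Python sketch. Rejecting <CODE>nth=0</CODE> with
		  a ValueError is an assumption of the sketch; the
		  description above does not say what happens in that
		  case.

```python
def splitat(text, char, nth=1, start=0, stop=None):
    """Split text[start:stop] at the nth occurrence of char."""
    if stop is None:
        stop = len(text)
    segment = text[start:stop]
    if nth > 0:
        pos = -1
        for _ in range(nth):               # walk forward from the left
            pos = segment.find(char, pos + 1)
            if pos < 0:
                return (segment, '')       # not found: second string is empty
    elif nth < 0:
        pos = len(segment)
        for _ in range(-nth):              # walk backward from the right
            pos = segment.rfind(char, 0, pos)
            if pos < 0:
                return ('', segment)       # not found: first string is empty
    else:
        raise ValueError('nth must be non-zero')   # assumption, see lead-in
    return (segment[:pos], segment[pos + 1:])      # char itself is dropped


print(splitat('a:b:c', ':'))        # ('a', 'b:c')
print(splitat('a:b:c', ':', -1))    # ('a:b', 'c')
print(splitat('abc', ':', -1))      # ('', 'abc')
```
	      </DD><P>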

	      <DT><CODE><FONT COLOR="#000099">
		    suffix(text,suffixes,start=0,stop=len(text)[,translate])</FONT></CODE></DT>

	      <DD>
		Looks at text[start:stop] and returns the first
		matching suffix out of the tuple of strings given in
		suffixes.  

		<P>
		  If no suffix is found to be matching, None is
		  returned.  An empty suffix ('') matches the
		  end-of-string. 

		<P>
		  The optional 256-character translate string is used
		  to translate the text prior to comparing it with the
		  given suffixes. It uses the same format as the
		  search object translate strings. If not given, no
		  translation is performed and the match is exact.

	      </DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    prefix(text,prefixes,start=0,stop=len(text)[,translate])</FONT></CODE></DT>

	      <DD>
		Looks at text[start:stop] and returns the first
		matching prefix out of the tuple of strings given in
		prefixes.  

		<P>
		  If no prefix is found to be matching, None is
		  returned. An empty prefix ('') matches the
		  end-of-string. 

		<P>
		  The optional 256-character translate string is used
		  to translate the text prior to comparing it with the
		  given prefixes. It uses the same format as the
		  search object translate strings. If not given, no
		  translation is performed and the match is exact.

	      </DD><P>
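	      <DD>
		<P>
		  A typical use of the translate string is
		  case-insensitive matching. The pure-Python sketch
		  below imitates <CODE>suffix()</CODE> and
		  <CODE>prefix()</CODE> for non-empty candidates only
		  (the special rule for the empty string is not
		  modelled). The table <CODE>ASCII_LOWER</CODE> is a
		  hypothetical example of a 256-character translate
		  string; it is not a constant of this package.

```python
# 256-character table: code i maps to its lowercase form (ASCII letters only).
ASCII_LOWER = ''.join(chr(i).lower() if i < 128 else chr(i) for i in range(256))


def _segment(text, start, stop, translate):
    segment = text[start:stop]
    if translate is not None:
        # Translate the text (not the candidates) before comparing.
        segment = ''.join(translate[ord(ch)] for ch in segment)
    return segment


def suffix(text, suffixes, start=0, stop=None, translate=None):
    """Return the first entry of suffixes that text[start:stop] ends with."""
    segment = _segment(text, start, stop, translate)
    for candidate in suffixes:
        if candidate and segment.endswith(candidate):
            return candidate
    return None


def prefix(text, prefixes, start=0, stop=None, translate=None):
    """Return the first entry of prefixes that text[start:stop] starts with."""
    segment = _segment(text, start, stop, translate)
    for candidate in prefixes:
        if candidate and segment.startswith(candidate):
            return candidate
    return None


print(suffix('index.HTML', ('.txt', '.html')))                          # None
print(suffix('index.HTML', ('.txt', '.html'), translate=ASCII_LOWER))   # .html
print(prefix('HTTP://example', ('ftp:', 'http:'), translate=ASCII_LOWER))  # http:
```
	      </DD><P>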

	      <DT><CODE><FONT COLOR="#000099">
		    splitlines(text)</FONT></CODE></DT>

	      <DD>
		Splits text into a list of single lines.

		<P>
		  The following combinations are considered to be
		  line-ends: '\r', '\r\n', '\n'; they may be used in
		  any combination.  The line-end indicators are
		  removed from the strings prior to adding them to the
		  list.

		<P>
		  This function allows dealing with text files from
		  Macs, PCs and Unix origins in a portable way.
		  </DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    countlines(text)</FONT></CODE></DT>

	      <DD>
		Returns the number of lines in text.

		<P>
		  Line ends are treated just like for splitlines() in
		  a portable way.  </DD><P>
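	      <DD>
		<P>
		  The portable line-end handling of
		  <CODE>splitlines()</CODE> and
		  <CODE>countlines()</CODE> can be sketched in pure
		  Python with a regular expression. Whether a trailing
		  line end produces an extra empty line is not stated
		  above; the sketch assumes it does not.

```python
import re

# Try '\r\n' first so a PC line end is not counted as two line ends.
_LINE_END = re.compile(r'\r\n|\r|\n')


def splitlines(text):
    """Split text at Mac ('\\r'), PC ('\\r\\n') and Unix ('\\n') line ends."""
    lines = _LINE_END.split(text)
    if lines and lines[-1] == '':
        lines.pop()                    # assumption: no empty line after the last line end
    return lines


def countlines(text):
    """Count lines using the same line-end rules as splitlines()."""
    return len(splitlines(text))


print(splitlines('mac\rpc\r\nunix\n'))   # ['mac', 'pc', 'unix']
print(countlines('a\n\nb'))              # 3
```
	      </DD><P>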

	      <DT><CODE><FONT COLOR="#000099">
		    splitwords(text)</FONT></CODE></DT>

	      <DD>
		Splits text into a list of single words delimited by
		whitespace.

		<P>
		  This function is just here for completeness. It
		  works in the same way as string.split(text).  Note
		  that setsplit() gives you much more control over how
		  splitting is performed. Whitespace is defined as
		  given below (see <A
		  HREF="#Constants">Constants</A>).  </DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    str2hex(text)</FONT></CODE></DT>

	      <DD>
		Returns text converted to a string of two-digit hex
		values, one per character, e.g. ',.-' is converted to
		'2c2e2d'. The function uses lowercase hex
		characters.</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    hex2str(hex)</FONT></CODE></DT>

	      <DD>
		Returns the string hex, interpreted as a sequence of
		two-digit hex values, converted to a string,
		e.g. '223344' becomes '"3D'. The function expects
		lowercase hex characters by default but can also work
		with uppercase ones.</DD><P>
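	      <DD>
		<P>
		  For comparison, the same round trip can be written
		  with the standard <CODE>binascii</CODE> module. This
		  is only an equivalent sketch for 8-bit text (encoded
		  here as Latin-1), not the implementation used by this
		  package.

```python
import binascii


def str2hex(text):
    """Encode each 8-bit character as two lowercase hex digits."""
    return binascii.hexlify(text.encode('latin-1')).decode('ascii')


def hex2str(hexstring):
    """Decode pairs of hex digits (lower or upper case) back to characters."""
    return binascii.unhexlify(hexstring).decode('latin-1')


print(str2hex(',.-'))      # 2c2e2d
print(hex2str('223344'))   # "3D
```
	      </DD><P>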

	      <DT><CODE><FONT COLOR="#000099">
		    isascii(text)</FONT></CODE></DT>

	      <DD>
		Returns 1/0 depending on whether text only contains
		ASCII characters or not.</DD><P>

	    </DL>
	</UL><!--CLASS="indent"-->

	<P>
	  The <TT>TextTools.py</TT> module also defines some other
	  functions, but these are left undocumented since they may
	  disappear in future releases.

	<P>

    </UL><!--CLASS="indent"-->

    <A NAME="Constants">

    <H3>Constants</H3>

    <UL CLASS="indent">

	<P>
	  The package exports these constants. They are defined in
	  <TT>Constants/Sets</TT>.

	<P>
	<UL CLASS="indent">
	    <DL>

	      <DT><CODE><FONT COLOR="#000099">
		    a2z</FONT></CODE></DT>

	      <DD>
		'abcdefghijklmnopqrstuvwxyz'</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    A2Z</FONT></CODE></DT>

	      <DD>
		'ABCDEFGHIJKLMNOPQRSTUVWXYZ'</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    umlaute</FONT></CODE></DT>

	      <DD>
		'&auml;&ouml;&uuml;&szlig;'</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    Umlaute</FONT></CODE></DT>

	      <DD>
		'&Auml;&Ouml;&Uuml;'</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    alpha</FONT></CODE></DT>

	      <DD>
		A2Z + a2z</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    german_alpha</FONT></CODE></DT>

	      <DD>
		A2Z + a2z + umlaute + Umlaute</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    number</FONT></CODE></DT>

	      <DD>
		'0123456789'</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    alphanumeric</FONT></CODE></DT>

	      <DD>
		alpha + number</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    white</FONT></CODE></DT>

	      <DD>
		' \t\v'</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    newline</FONT></CODE></DT>

	      <DD>
		'\n\r'</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    formfeed</FONT></CODE></DT>

	      <DD>
		'\f'</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    whitespace</FONT></CODE></DT>

	      <DD>
		white + newline + formfeed</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    any</FONT></CODE></DT>

	      <DD>
		All characters from \000-\377</DD><P>

	      <DT><CODE><FONT COLOR="#000099">
		    *_set</FONT></CODE></DT>

	      <DD>
		All of the above as character sets.</DD><P>

	    </DL>
	</UL><!--CLASS="indent"-->
	  
    </UL><!--CLASS="indent"-->

    <A NAME="Examples">

    <H3>Examples of Use</H3>

    <UL CLASS="indent">

	<P>
	  The <TT>Examples/</TT> subdirectory of the package contains a
	  few examples of how tables can be written and used. Here is a
	  non-trivial example for parsing HTML (well, most of it):

	<PRE><FONT COLOR="#000066">
    from mx.TextTools import *

    error = '***syntax error'			# error tag obj

    tagname_set = set(alpha+'-'+number)
    tagattrname_set = set(alpha+'-'+number)
    tagvalue_set = set('"\'> ',0)
    white_set = set(' \r\n\t')

    tagattr = (
	   # name
	   ('name',AllInSet,tagattrname_set),
	   # with value ?
	   (None,Is,'=',MatchOk),
	   # skip junk
	   (None,AllInSet,white_set,+1),
	   # unquoted value
	   ('value',AllInSet,tagvalue_set,+1,MatchOk),
	   # double quoted value
	   (None,Is,'"',+5),
	     ('value',AllNotIn,'"',+1,+2),
	     ('value',Skip,0),
	     (None,Is,'"'),
	     (None,Jump,To,MatchOk),
	   # single quoted value
	   (None,Is,'\''),
	     ('value',AllNotIn,'\'',+1,+2),
	     ('value',Skip,0),
	     (None,Is,'\'')
	   )

    valuetable = (
	# ignore whitespace + '='
	(None,AllInSet,set(' \r\n\t='),+1),
	# unquoted value
	('value',AllInSet,tagvalue_set,+1,MatchOk),
	# double quoted value
	(None,Is,'"',+5),
	 ('value',AllNotIn,'"',+1,+2),
	 ('value',Skip,0),
	 (None,Is,'"'),
	 (None,Jump,To,MatchOk),
	# single quoted value
	(None,Is,'\''),
	 ('value',AllNotIn,'\'',+1,+2),
	 ('value',Skip,0),
	 (None,Is,'\'')
	)

    allattrs = (# look for attributes
	       (None,AllInSet,white_set,+4),
	        (None,Is,'>',+1,MatchOk),
	        ('tagattr',Table,tagattr),
	        (None,Jump,To,-3),
	       (None,Is,'>',+1,MatchOk),
	       # handle incorrect attributes
	       (error,AllNotIn,'> \r\n\t'),
	       (None,Jump,To,-6)
	       )

    htmltag = ((None,Is,'&lt;'),
	       # is this a closing tag ?
	       ('closetag',Is,'/',+1),
	       # a comment ?
	       ('comment',Is,'!',+8),
		(None,Word,'--',+4),
		('text',sWordStart,BMS('-->'),+1),
		(None,Skip,3),
		(None,Jump,To,MatchOk),
		# a SGML-Tag ?
		('other',AllNotIn,'>',+1),
		(None,Is,'>'),
		    (None,Jump,To,MatchOk),
		   # XMP-Tag ?
		   ('tagname',Word,'XMP',+5),
		    (None,Is,'>'),
		    ('text',WordStart,'&lt;/XMP&gt;'),
		    (None,Skip,len('&lt;/XMP&gt;')),
		    (None,Jump,To,MatchOk),
		   # get the tag name
		   ('tagname',AllInSet,tagname_set),
		   # look for attributes
		   (None,AllInSet,white_set,+4),
		    (None,Is,'>',+1,MatchOk),
		    ('tagattr',Table,tagattr),
		    (None,Jump,To,-3),
		   (None,Is,'>',+1,MatchOk),
		   # handle incorrect attributes
		   (error,AllNotIn,'> \n\r\t'),
		   (None,Jump,To,-6)
		  )

    htmltable = (# HTML-Tag
		 ('htmltag',Table,htmltag,+1,+4),
		 # not HTML, but still using this syntax: error or inside XMP-tag !
		 (error,Is,'&lt;',+3),
		  (error,AllNotIn,'&gt;',+1),
		  (error,Is,'>'),
		 # normal text
		 ('text',AllNotIn,'&lt;',+1),
		 # end of file
		 ('eof',EOF,Here,-5),
		)
      
	</FONT></PRE>

	<P>
	  I hope this doesn't scare you away <TT>:-)</TT> ... it's
	  fast as hell.

    </UL><!--CLASS="indent"-->

    <A NAME="Structure">

    <H3>Package Structure</H3>

    <UL CLASS="indent">

    <PRE>
[TextTools]
       [Constants]
              Sets.py
              TagTables.py
       Doc/
       [Examples]
              HTML.py
              Loop.py
              Python.py
              RTF.py
              RegExp.py
              Tim.py
              Words.py
              altRTF.py
              pytag.py
       [mxTextTools]
              test.py
       TextTools.py
    </PRE>

    <P>
      Entries enclosed in brackets are packages (i.e. they are
      directories that include an <TT>__init__.py</TT> file). Ones with
      slashes are just ordinary subdirectories that are not accessible
      via <CODE>import</CODE>.

    <P>
      The package TextTools imports everything needed from the other
      components. It is sometimes also handy to do a <CODE>from
      mx.TextTools.Constants.TagTables import *</CODE>.

    <P>
      <TT>Examples/</TT> contains a few demos of what the Tag Tables
      can do.

    <P>

    </UL><!--CLASS="indent"-->
    
    <H4>Optional Add-Ons for mxTextTools</H4>

    <P>
      Mike C. Fletcher is working on a Tag Table generator called <A
      HREF="http://members.home.com/mcfletch/programming/simpleparse/simpleparse.html">SimpleParse</A>.
      It works as a parser-generating front end to the Tagging Engine
      and converts an EBNF-style grammar into a Tag Table directly
      usable with the <CODE>tag()</CODE> function.

    <P>
      Tony J. Ibbs has started to work on a <A
      HREF="http://www.tibsnjoan.demon.co.uk/mxtext/Metalang.html">meta-language
      for mxTextTools</A>. It aims at simplifying the task of writing
      Tag Table tuples using a Python-style syntax. It also gets rid
      of the annoying jump offset calculations.

    <P>
      Andrew Dalke has started work on a parser generator called <A
      HREF="http://www.biopython.org/~dalke/Martel/">Martel</A> built
      upon mxTextTools which takes a regular expression grammar for a
      format and turns the resultant parsed tree into a set of
      callback events emulating the XML/SAX API. The results look very
      promising!

    </UL><!--CLASS="indent"-->

    <A NAME="Support">

    <H3>Support</H3>

    <UL CLASS="indent">

	<P>
	  eGenix.com is providing commercial support for this
	  package. If you are interested in receiving information
	  about this service please see the <A
	  HREF="http://www.egenix.com/files/python/eGenix-mx-Extensions.html#Support">eGenix.com
	  Support Conditions</A>.

    </UL><!--CLASS="indent"-->

    <A NAME="Copyright">

    <H3>Copyright &amp; License</H3>

    <UL CLASS="indent">

	<P>
	  &copy; 1997-2000, Copyright by Marc-Andr&eacute; Lemburg;
	  All Rights Reserved.  mailto: <A
	  HREF="mailto:mal@lemburg.com">mal@lemburg.com</A>
	<P>
	  &copy; 2000-2001, Copyright by eGenix.com Software GmbH,
	  Langenfeld, Germany; All Rights Reserved.  mailto: <A
	  HREF="mailto:info@egenix.com">info@egenix.com</A>

	<P>
	  This software is covered by the <A
	  HREF="mxLicense.html#Public"><B>eGenix.com Public
	  License Agreement</B></A>. The text of the license is also
	  included as file "LICENSE" in the package's main directory.

	<P>
	  <B> By downloading, copying, installing or otherwise using
	  the software, you agree to be bound by the terms and
	  conditions of the eGenix.com Public License
	  Agreement. </B>

    </UL><!--CLASS="indent"-->

    <A NAME="History">

    <H3>History &amp; Future</H3>

    <UL CLASS="indent">

	<P>Things that still need to be done:

	<P><UL>

	    <LI>Provide some more examples.

	    <P><LI>Clean up the C implementation and this document
	    some more.

	    <P><LI>Do some benchmarking...

	    <P><LI>Add a cache-based mechanism that compiles the
	    tuples into easily machine-readable and sanity-checked C
	    arrays. The cache should keep a weak reference to the
	    tuples in order to be able to use their object id as hash
	    value. The cache ought to free and remove entries whose
	    refcount has gone down to one. This should improve the
	    performance of the already fast engine even more. [Patrick
	    Maupan has contributed a similar implementation which
	    waits to be integrated into mxTextTools.]

	    <P><LI>Provide a command to raise parametrized exceptions.

	    <P><LI>Add a tag command to match word-in-list. This could
	    also be extended to use multi pattern search objects.

	    <P><LI>Add a command or feature to allow efficient
	    lookahead. A table will have to be able to return
	    differentiated information about what part of it actually
	    did match. E.g. if the table matches A(B|C|D) and the
	    string is found to match AC, there should be a way for the
	    caller to identify and use that information for further
	    execution.

	    <P><LI>Add a per-call stack and command to manipulate
	    it. This would provide for a way to do recursion without
	    relying on the C stack and also provide a means to
	    implement communication between the different recursive
	    levels (might be of use for the above bullet). [Patrick
	    Maupan has contributed a similar implementation which
	    waits to be integrated into mxTextTools.]

	    <P><LI> Convert some more APIs to use the buffer interface
	    instead of insisting on Python string objects.

	    <P><LI> Add the examples to the regression tests.

	    <P><LI> Add a context object to all commands which call
	    external resources. This should make context-sensitive
	    parsing and other cool things much easier to implement.
	    It will change the function call signatures though, so it
	    is likely to break code. [Patrick Maupan has contributed a
	    similar implementation which waits to be integrated into
	    mxTextTools.]

	    <P><LI> Provide a Unicode-aware version of mxTextTools.

	    <P><LI> Use a special list implementation for taglists
	    which resizes in larger chunks (e.g. 1024 entries per
	    realloc()). The current scheme implemented in the standard
	    Python list implementation does way too many realloc()s,
	    slowing down the taglist creation considerably.

	</UL>

	<P>Things that changed from 2.0.2 to 2.0.3:

	<P><UL>

            <LI> Added isascii().

	</UL>

	<P>Things that changed from 2.0.0 to 2.0.2:

	<P><UL>

            <LI> Fixed a bug in the Words.py example. Thanks to Michael Husmann
	    for finding this one.

            <P><LI> Fixed a memory leak in the CallTag processing.

	</UL>

 	<P>Things that changed from <A
	HREF="mxTextTools-1.1.1.zip">1.1.1</A> to 2.0.0:

	<P><UL>

            <LI> Fixed a cast bug in mxTextTools which shows up on
            Alphas.  Thanks to Tony Ibbs for reporting this one.

            <P><LI> <B>Changed</B> the semantics of the 'Move'
            command.  It now works relative to the current slice
            rather than absolute as it did before. As a side effect,
            you can now easily skip back to the first character in the
            currently processed text slice (note that the 'Table'
            commands always work on sub-slices of the text
            slice passed to the tag() function).

	    <P><LI> Added constant Constants.TagTables.ToBOF.

	    <P><LI> Changed some internals producing a slight speedup.
	    Converted some of the functions to use the buffer
	    interface instead of string objects.

	    <P><LI> Fixed a bug that caused the HTML parsers not to
	    detect empty value definitions, e.g. VALUE="". Found by
	    Felix Thibault.

	    <P><LI> Added multireplace().

	    <P><LI> Fixed a bug in the code for SubTableInList: it
	    created a new sub tag list even though it should have used
	    the table's tag list.

	    <P><LI> Fixed a bug in the CALLARG opcode argument
	    handling code. Thanks to Rod Watterworth for spotting this
	    one.

	    <P><LI> Fixed a typo in the collapse() keyword parameter:
	    seperator -> separator.

	    <P><LI> Added LookAhead flag. Thanks to Andrew Dalke for
	    inspiring this flag.

	    <P><LI> Fixed SubTable and SubTableInList to remove any
	    additions to the taglist in case of an unsuccessful match.

	    <P><LI> <B>Moved</B> the package under a new top-level
	    package 'mx'. It is part of the <I>eGenix.com mx BASE
	    distribution</I>.

	</UL>

	<P>Things that changed from <A
	HREF="mxTextTools-1.1.0.zip">1.1.0</A> to 1.1.1:

	<P><UL>

            <LI> Added a compile time switch for the type code used in
            parsing input data for the various APIs dealing with text
            data. It defaults to "s#" meaning that all objects
            implementing the getreadbuffer interface are useable; this
            includes text encoding such as Unicode too, so beware of
            mixing searching pattern object types and text object
            types.

            <P><LI> Fixed a bugglet in the definition of MatchFail. It
            should be the constant -20000, not -1. Also, there was a
            bug in the finishing part of the Tagging Engine: jumps to
            negative table indices did not result in a 'match
            fail'. Thanks to Tony J. Ibbs for pointing this out.

	</UL>

	<P>Things that changed from <A
	HREF="mxTextTools-1.0.2.zip">1.0.2</A> to 1.1.0:

	<P><UL>

            <LI>Added MatchFail jump offset.

            <P><LI>Added suffix() and prefix().

	    <P><LI>Fixed the debugging output so that it will print to
	    several .log-files instead of stdout.

	    <P><LI>Changed the search objects to make them work on any
	    type that supports the buffer protocol, e.g. memory mapped
	    files. The Tagging Engine and the other functions still
	    insist on real Python string objects.

	    <P><LI>Changed join() to accept any sequence as joinlist,
	    not just Python lists.

	    <P><LI>Made the two search objects pickleable, copyable
	    and added instance variables .match and .translate.

	    <P><LI>Added start and stop optional arguments to join().

	    <P><LI>Added AppendMatch flag.

	    <P><LI>Added splitlines(), countlines(), str2hex() and
	    hex2str().

	    <P><LI>Added splitwords().

	    <P><LI>Added SubTableInList command and compactified the
	    Tagging Engine a bit.

	    <P><LI>Added setstrip().

	    <P><LI>Changed the compile time flag MAL_PYTHON to
	    MAL_DEBUG_WITH_PYTHON and hacked up Setup.in a little.

	</UL>

	<P>Things that changed from <A
	HREF="mxTextTools-1.0.1.zip">1.0.1</A> to 1.0.2:

	<P><UL>

            <LI>Fixed some of the undocumented printing functions.

            <P><LI>Added Tim.py example for dynamic programming using
            Tag Tables.

	    <P><LI>Tuned the Tagging Engine a little more. Added optimizations
	    to TextTools.join(). It is faster than string.join() now (but
	    only accepts real Python lists as input).

	    <P><LI>Added collapse(). Tim Peters will like this one...

	    <P><LI>Tuned setsplit, setsplitx and joinlist
	    somewhat. The performance is now comparable to
	    string.split (for tasks producing the same output).

	    <P><LI>Added charsplit() and splitat().

	    <P><LI>Fixed a bug in join() that prevented the function
	    from returning '' for empty lists. It raised a SystemError
	    instead.

	    <P><LI>Added better exception reporting to the tagging
	    engine.  Errors are now reported together with the index
	    of the Tag Table entry that caused the exception.

	    <P><LI>Fixed and reformatted included debugging
	    support. If you want the C engine to be very verbose about
	    what it's doing, compile the engine using '-DMAL_DEBUG
	    -DMAL_PYTHON'. If you run the Python interpreter with '-d'
	    option, the engine will print tons of information to
	    stdout, e.g. "python -d Examples/HTML.py
	    Doc/mxTextTools.html". The engine remains silent without
	    the -d switch.

	    <P><LI>Added special ThisTable constant to simplify
	    writing recursive Tag Tables.

        </UL>

	<P>Things that changed from <A
	HREF="mxTextTools-1.0.0.zip">1.0.0</A> to 1.0.1:

	<P><UL>

            <LI>Added new functions find() and findall().

            <P><LI>Fixed a few quirks that caused compilation problems
            on Windows. Eliminated the dependency on hack.py in
            TextTools.py and some of the examples.

	    <P><LI>Added a compiled Windows PYD-file of the C
	    extension.  Thanks to Gordon McMillan for providing it and
	    pointing out a couple of portability bugs.

	    <P><LI>Added instructions on how to build the C extension
	    under WinXX courtesy of Gordon McMillan.

	    <P><LI>Added some type casts to make CodeWarrior/Mac
	    happy.  Thanks to Just van Rossum for this hint.

        </UL>

	<P>Things that changed from the really old <A
	HREF="tagit.tgz">TagIt module</A> version 0.7 to mxTextTools
	1.0.0:

	<P><UL>

            <LI>Added lots of new commands, fixed some bugs, added
            documentation and wrapped everything into a package.

            <P><LI>Added character set handling routines and search
            objects.

        </UL>

    </UL><!--CLASS="indent"-->

    <P>
    <HR WIDTH="100%">
    <CENTER><FONT SIZE=-1>
        <P>
          &copy; 1997-2000, Copyright by Marc-Andr&eacute; Lemburg;
          All Rights Reserved.  mailto: <A
          HREF="mailto:mal@lemburg.com">mal@lemburg.com</A>
        <P>
          &copy; 2000-2001, Copyright by eGenix.com Software GmbH; 
          All Rights Reserved.  mailto: <A
          HREF="mailto:info@egenix.com">info@egenix.com</A>
    </FONT></CENTER>
    </FONT></CENTER>

  </BODY>
</HTML>