File: re2java.1

re2c 4.4-1
.\" Man page generated from reStructuredText.
.
.
.nr rst2man-indent-level 0
.
.de1 rstReportMargin
\\$1 \\n[an-margin]
level \\n[rst2man-indent-level]
level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
-
\\n[rst2man-indent0]
\\n[rst2man-indent1]
\\n[rst2man-indent2]
..
.de1 INDENT
.\" .rstReportMargin pre:
. RS \\$1
. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
. nr rst2man-indent-level +1
.\" .rstReportMargin post:
..
.de UNINDENT
. RE
.\" indent \\n[an-margin]
.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
.nr rst2man-indent-level -1
.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
..
.TH "RE2JAVA" 1 "" "" ""
.SH NAME
re2java \- generate fast lexical analyzers for Java
.SH SYNOPSIS
.sp
re2java \fB[ OPTIONS ]\fP \fB[ WARNINGS ]\fP \fBINPUT\fP
.sp
Input can be either a file or \fB\-\fP for stdin.
.SH INTRODUCTION
.sp
re2java works as a preprocessor. It reads the input file (which is usually a
program in Java, but can be anything) and looks for blocks of code
enclosed in special\-form start/end markers. The text outside of these blocks is
copied verbatim into the output file. re2java processes the contents of each
block, translates them to Java code and outputs the generated code in place of
the block.
.sp
Here is an example of a small program that checks if a given string contains a
decimal number:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT

class Main {
    static boolean lex(String yyinput) {
        int yycursor = 0;

        /*!re2c
            re2c:YYCTYPE = \(dqchar\(dq;
            re2c:YYPEEK = \(dqyyinput.charAt(yycursor)\(dq;
            re2c:yyfill:enable = 0;

            [1\-9][0\-9]* { return true; }
            *           { return false; }
        */
    }

    public static void main(String []args) {
        assert lex(\(dq1234\e0\(dq);
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.sp
In the output, re2java has replaced the block in the middle with the generated code:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// Generated by re2java
// re2java $INPUT \-o $OUTPUT

class Main {
    static boolean lex(String yyinput) {
        int yycursor = 0;

        
{
    char yych = 0;
    int yystate = 0;
    yyl: while (true) {
        switch (yystate) {
            case 0:
                yych = yyinput.charAt(yycursor);
                yycursor += 1;
                switch (yych) {
                    case 0x31:
                    case 0x32:
                    case 0x33:
                    case 0x34:
                    case 0x35:
                    case 0x36:
                    case 0x37:
                    case 0x38:
                    case 0x39:
                        yystate = 2;
                        continue yyl;
                    default:
                        yystate = 1;
                        continue yyl;
                }
            case 1:
                { return false; }
            case 2:
                yych = yyinput.charAt(yycursor);
                switch (yych) {
                    case 0x30:
                    case 0x31:
                    case 0x32:
                    case 0x33:
                    case 0x34:
                    case 0x35:
                    case 0x36:
                    case 0x37:
                    case 0x38:
                    case 0x39:
                        yycursor += 1;
                        yystate = 2;
                        continue yyl;
                    default:
                        yystate = 3;
                        continue yyl;
                }
            case 3:
                { return true; }
            default:
                throw new IllegalStateException(\(dqinternal lexer error\(dq);
        }
    }
}

    }

    public static void main(String []args) {
        assert lex(\(dq1234\e0\(dq);
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
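.sp
The generated state machine above can also be read as ordinary control flow.
The following is a hand\-written sketch of the same logic, not output produced
by re2java; the class name \fBDecimalLexerSketch\fP is hypothetical. It assumes,
like the example, that the input ends with a terminating character that is not
a digit.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
final class DecimalLexerSketch {
    // Hand-written equivalent of the automaton for [1-9][0-9]*.
    // Like the generated code, it reads input.charAt(cursor) without a bounds
    // check, so the input must end with a character that is not a digit.
    static boolean lex(String input) {
        int cursor = 0;
        char ch = input.charAt(cursor);
        cursor += 1;
        if (ch < '1' || ch > '9') {
            return false;           // state 1: the default rule
        }
        while (true) {              // state 2: consume the following digits
            ch = input.charAt(cursor);
            if (ch < '0' || ch > '9') {
                return true;        // state 3: [1-9][0-9]* has matched
            }
            cursor += 1;
        }
    }

    public static void main(String[] args) {
        char terminator = 0;        // plays the role of the trailing NUL
        System.out.println(lex("1234" + terminator));
        System.out.println(lex("0" + terminator));
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Without the terminator an all\-digit input would run past the end of the
string, where \fBcharAt\fP throws \fBStringIndexOutOfBoundsException\fP\&. This
is why the example appends \fB\e0\fP to the input.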
.SH BASICS
.sp
A re2java program consists of a sequence of \fIblocks\fP intermixed with code in the
target language. A block may contain \fIdefinitions\fP, \fIconfigurations\fP, \fIrules\fP,
\fIactions\fP and \fIdirectives\fP in any order:
.INDENT 0.0
.TP
.B \fBname = regular\-expression ;\fP
A \fIdefinition\fP binds \fBname\fP to \fBregular\-expression\fP\&. Names may contain
alphanumeric characters and underscore. The \fI\%regular expressions\fP section
gives an overview of re2java syntax for regular expressions. Once defined,
the \fBname\fP can be used in other regular expressions and in rules.
Recursion in named definitions is not allowed, and each name should be
defined before it is used. A block inherits named definitions from the
global scope. Redefining a name that exists in the current scope is an error.
.TP
.B \fBconfiguration = value ;\fP
A \fIconfiguration\fP allows one to change re2java behavior and customize the
generated code. For a full list of configurations supported by re2java see
the \fI\%configurations\fP section. Depending on a particular configuration, the
\fBvalue\fP can be a keyword, a nonnegative integer number or a one\-line string
which should be enclosed in double or single quotes unless it consists of
alphanumeric characters. A block inherits configurations from the global
scope and may redefine them or add new ones. Configurations defined inside
of a block affect the whole block, even if they appear at the end of it.
.TP
.B \fBregular\-expression code\fP
A \fIrule\fP binds \fBregular\-expression\fP to its semantic action (a block of
code in curly braces, or a block of code that starts with \fB:=\fP and ends on
a newline followed by any non\-whitespace character).
If the \fBregular\-expression\fP matches, the associated \fBcode\fP is executed.
If multiple rules match, the longest match takes precedence. If multiple
rules match the same string, the earliest one takes precedence. There are
two special rules: the default rule \fB*\fP and the end of input rule \fB$\fP\&.
The default rule should always be defined; it has the lowest priority
regardless of its place in the block, and it matches any code unit (not
necessarily a valid character, see the \fI\%encoding support\fP section). The
end of input rule should be defined if the corresponding method for
\fI\%handling the end of input\fP is used.
With \fI\%start conditions\fP, rules have a more complex syntax.
.TP
.B \fB!action code\fP
An \fIaction\fP binds a user\-defined block of \fBcode\fP to a particular place in
the generated finite state machine (in the same way as semantic actions bind
code to the final states). See the \fI\%actions\fP section for a full list of
predefined actions.
.TP
.B \fB!directive ;\fP
A \fIdirective\fP is one of the special predefined statements. Each directive
has a unique purpose. See the \fI\%directives\fP section for details.
.UNINDENT
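.sp
The precedence between rules described above (the longest match wins, ties go
to the earliest rule, and the default rule is the fallback) can be illustrated
with a small simulation. This sketch does not use re2java; it emulates the
selection with \fBjava.util.regex\fP, and the class \fBRulePrioritySketch\fP with
its rule names and patterns is hypothetical. It assumes patterns simple enough
that a greedy regex match is also the longest match.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RulePrioritySketch {
    // Hypothetical rules, in the order they would appear in a block.
    static final String[] NAMES = {"KEYWORD_IF", "IDENTIFIER", "NUMBER"};
    static final Pattern[] RULES = {
        Pattern.compile("if"),
        Pattern.compile("[a-z]+"),
        Pattern.compile("[0-9]+"),
    };

    // Picks the rule that wins at position pos: the longest match takes
    // precedence, and among equally long matches the earliest rule wins.
    static String winner(String input, int pos) {
        String best = "DEFAULT";    // stands in for the default rule
        int bestLen = 0;
        for (int i = 0; i < RULES.length; i++) {
            Matcher m = RULES[i].matcher(input).region(pos, input.length());
            if (m.lookingAt()) {
                int len = m.end() - pos;
                // The strict comparison keeps the earlier rule on a tie.
                if (len > bestLen) {
                    best = NAMES[i];
                    bestLen = len;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(winner("if", 0));
        System.out.println(winner("iffy", 0));
        System.out.println(winner("42", 0));
        System.out.println(winner("+", 0));
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Here \fBif\fP is matched by both the first and the second pattern with the same
length, so the earlier rule is chosen, while \fBiffy\fP goes to the second
pattern because its match is longer.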
.SS Blocks
.sp
Block start and end markers are either \fB/*!re2c\fP and \fB*/\fP, or \fB%{\fP and
\fB%}\fP (both styles are supported). Starting from version 2.2 blocks may have
optional names that allow them to be referenced in other blocks.
There are different kinds of blocks:
.INDENT 0.0
.TP
.B \fB/*!re2c[:<name>] ... */\fP or \fB%{[:<name>] ... %}\fP
A \fIglobal block\fP contains definitions, configurations, rules and directives.
re2java compiles regular expressions associated with each rule into a
deterministic finite automaton, encodes it in the form of conditional jumps
in the target language and replaces the block with the generated code. Names
and configurations defined in a global block are added to the global scope
and become visible to subsequent blocks. At the start of the program the
global scope is initialized with command\-line \fI\%options\fP\&.
.TP
.B \fB/*!local:re2c[:<name>] ... */\fP or \fB%{local[:<name>] ... %}\fP
A \fIlocal block\fP is like a global block, but the names and configurations in
it have local scope (they do not affect other blocks).
.TP
.B \fB/*!rules:re2c[:<name>] ... */\fP or \fB%{rules[:<name>] ... %}\fP
A \fIrules block\fP is like a local block, but it does not generate any code by
itself, nor does it add any definitions to the global scope \-\- it is meant
to be reused in other blocks. This is a way of sharing code (more details in
the \fI\%reusable blocks\fP section). Prior to re2java version 2.2 rules blocks
required the \fB\-r \-\-reusable\fP option.
.TP
.B \fB/*!use:re2c[:<name>] ... */\fP or \fB%{use[:<name>] ... %}\fP
A \fIuse block\fP references a previously defined rules block. If the name is
specified, re2java looks for a rules block with this name. Otherwise the most
recent rules block is used (either a named or an unnamed one). A use block
can add definitions, configurations and rules of its own, which are added to
those of the referenced rules block. Prior to re2java version 2.2 use blocks
required the \fB\-r \-\-reusable\fP option.
.TP
.B \fB/*!max:re2c[:<name1>[:<name2>...]] ... */\fP or \fB%{max[:<name1>[:<name2>...]] ... %}\fP
A block that generates the \fBYYMAXFILL\fP definition. An optional list of block
names specifies which blocks should be included when computing the \fBYYMAXFILL\fP
value (if the list is empty, all blocks are included).
By default the generated code is a macro\-definition for C
(\fB#define YYMAXFILL <n>\fP), or a global variable for Go
(\fBvar YYMAXFILL int = <n>\fP). It can be customized with an optional
configuration \fBformat\fP that specifies a template string where \fB@@{max}\fP
(or \fB@@\fP for short) is replaced with the numeric value of \fBYYMAXFILL\fP\&.
.TP
.B \fB/*!maxnmatch:re2c[:<name1>[:<name2>...]] ... */\fP or \fB%{maxnmatch[:<name1>[:<name2>...]] ... %}\fP
A block that generates the \fBYYMAXNMATCH\fP definition (it requires the
\fB\-P \-\-posix\-captures\fP option). An optional list of block names specifies
which blocks should be included when computing the \fBYYMAXNMATCH\fP value (if the
list is empty, all blocks are included).
By default the generated code is a macro\-definition for C
(\fB#define YYMAXNMATCH <n>\fP), or a global variable for Go
(\fBvar YYMAXNMATCH int = <n>\fP). It can be customized with an optional
configuration \fBformat\fP that specifies a template string where \fB@@{max}\fP
(or \fB@@\fP for short) is replaced with the numeric value of \fBYYMAXNMATCH\fP\&.
.TP
.B \fB/*!stags:re2c[:<name1>[:<name2>...]] ... */\fP, \fB/*!mtags:re2c[:<name1>[:<name2>...]] ... */\fP or \fB%{stags[:<name1>[:<name2>...]] ... %}\fP, \fB%{mtags[:<name1>[:<name2>...]] ... %}\fP
Blocks that specify a template piece of code that is expanded for each
s\-tag/m\-tag variable generated by re2java\&. An optional list of block names
specifies which blocks should be included when computing the set of tag
variables (if the list is empty, all blocks are included).
There are two optional configurations: \fBformat\fP and \fBseparator\fP\&.
Configuration \fBformat\fP specifies a template string where \fB@@{tag}\fP (or
\fB@@\fP for short) is replaced with the name of each tag variable.
Configuration \fBseparator\fP specifies a piece of code used to join the
generated \fBformat\fP pieces for different tag variables.
.TP
.B \fB/*!svars:re2c[:<name1>[:<name2>...]] ... */\fP, \fB/*!mvars:re2c[:<name1>[:<name2>...]] ... */\fP or \fB%{svars[:<name1>[:<name2>...]] ... %}\fP, \fB%{mvars[:<name1>[:<name2>...]] ... %}\fP
Blocks that specify a template piece of code that is expanded for each
s\-tag/m\-tag that is either explicitly mentioned by the rules (with
\fB\-\-tags\fP option) or implicitly generated by re2java (with \fB\-\-captvars\fP or
\fB\-\-posix\-captvars\fP options). An optional list of block names specifies
which blocks should be included when computing the set of tags (if the list
is empty, all blocks are included).
There are two optional configurations: \fBformat\fP and \fBseparator\fP\&.
Configuration \fBformat\fP specifies a template string where \fB@@{tag}\fP (or
\fB@@\fP for short) is replaced with the name of each tag.
Configuration \fBseparator\fP specifies a piece of code used to join the
generated \fBformat\fP pieces for different tags.
.TP
.B \fB/*!getstate:re2c[:<name1>[:<name2>...]] ... */\fP or \fB%{getstate[:<name1>[:<name2>...]] ... %}\fP
A block that generates conditional dispatch on the lexer state (it requires
\fB\-\-storable\-state\fP option). An optional list of block names specifies
which blocks should be included in the state dispatch. The default
transition goes to the start label of the first block on the list. If the
list is empty, all blocks are included, and the default transition goes to
the first block in the file that has a start label.
This block type is incompatible with the \fB\-\-loop\-switch\fP option, as it
requires cross\-block transitions that are unsupported without \fBgoto\fP or
function calls.
.TP
.B \fB/*!conditions:re2c[:<name1>[:<name2>...]] ... */\fP, \fB/*!types:re2c... */\fP or \fB%{conditions[:<name1>[:<name2>...]] ... %}\fP, \fB%{types... %}\fP
A block that generates condition enumeration (it requires \fB\-\-conditions\fP
option). An optional list of block names specifies which blocks should be
included when computing the set of conditions (if the list is empty, all
blocks are included).
By default the generated code is an enumeration \fBYYCONDTYPE\fP\&. It can be
customized with optional configurations \fBformat\fP and \fBseparator\fP\&.
Configuration \fBformat\fP specifies a template string where \fB@@{cond}\fP (or
\fB@@\fP for short) is replaced with the name of each condition, and
\fB@@{num}\fP is replaced with a numeric index of that condition.
Configuration \fBseparator\fP specifies a piece of code used to join the
generated \fBformat\fP pieces for different conditions.
.TP
.B \fB/*!include:re2c <file> */\fP or \fB%{include <file> %}\fP
This block allows one to include \fB<file>\fP, which must be a double\-quoted
file path. The contents of the file are literally substituted in place of
the block, in the same way as \fB#include\fP works in C/C++. This block can be
used together with the \fB\-\-depfile\fP option to generate build system
dependencies on the included files.
.TP
.B \fB/*!header:re2c:on*/\fP or \fB%{header:on %}\fP
This block marks the start of the header file. Everything after it and up to
the following \fBheader:off\fP block is processed by re2java and written to the
header file specified with the \fB\-t \-\-type\-header\fP option.
.TP
.B \fB/*!header:re2c:off*/\fP or \fB%{header:off %}\fP
This block marks the end of the header file started with the \fBheader:on\fP block.
.TP
.B \fB/*!ignore:re2c ... */\fP or \fB%{ignore ... %}\fP
A block whose contents are ignored and removed from the output file.
.UNINDENT
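.sp
The \fBformat\fP and \fBseparator\fP configurations of the \fBstags\fP,
\fBmtags\fP, \fBsvars\fP and \fBmvars\fP blocks describe a simple template
expansion. The Java sketch below is only a rough model of that substitution,
not the code used by re2java: the class \fBTagTemplateSketch\fP and the method
\fBexpand\fP are hypothetical names chosen for illustration, and the tag names
are assumed to be given as a list.
.sp
.nf
```java
import java.util.ArrayList;
import java.util.List;

class TagTemplateSketch {
    // Expands `format` once per tag name and joins the pieces with `separator`.
    // A rough model of the documented substitution, not re2java's own code.
    static String expand(String format, String separator, List<String> tags) {
        List<String> pieces = new ArrayList<>();
        for (String tag : tags) {
            // Replace the long placeholder first, then the bare shorthand.
            String piece = format.replace("@@{tag}", tag);
            pieces.add(piece.replace("@@", tag));
        }
        return String.join(separator, pieces);
    }
}
```
.fi
.sp
Under these assumptions, expanding the format \fB\(dqint @@{tag};\(dq\fP with a
single space as the separator over the tags \fByyt1\fP and \fByyt2\fP would
yield \fBint yyt1; int yyt2;\fP\&.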
.SS Configurations
.sp
Here is a full list of configurations supported by re2java:
.INDENT 0.0
.TP
.B \fBre2c:api\fP, \fBre2c:input\fP
Same as the \fB\-\-api\fP option.
.TP
.B \fBre2c:api:sigil\fP
Specify the marker (\(dqsigil\(dq) that is used for argument placeholders in the
API primitives. The default is \fB@@\fP\&. A placeholder starts with sigil
followed by the argument name in curly braces. For example, if sigil is set
to \fB$\fP, then placeholders will have the form \fB${name}\fP\&. Single\-argument
APIs may use shorthand notation without the name in braces. This option can
be overridden by options for individual API primitives, e.g.
\fBre2c:YYFILL@len\fP for \fBYYFILL\fP\&.
.TP
.B \fBre2c:api:style\fP
Specify API style. Possible values are \fBfunctions\fP (the default for C) and
\fBfree\-form\fP (the default for Go and Rust).
In \fBfunctions\fP style API primitives are generated with an argument list in
parentheses following the name of the primitive. The arguments are provided
only for autogenerated parameters (such as the number of characters passed
to \fBYYFILL\fP), but not for the general lexer context, so the primitives
behave more like macros in C/C++ or closures in Go and Rust.
In free\-form style API primitives do not have a fixed form: they should be
defined as strings containing free\-form pieces of code with interpolated
variables of the form \fB@@{var}\fP or \fB@@\fP (they correspond to arguments in
function\-like style).
This configuration may be overridden for individual API primitives, see for
example \fBre2c:YYFILL:naked\fP configuration for \fBYYFILL\fP\&.
.TP
.B \fBre2c:bit\-vectors\fP, \fBre2c:flags:bit\-vectors\fP, \fBre2c:flags:b\fP
Same as the \fB\-\-bit\-vectors\fP option, but can be configured on per\-block
basis.
.TP
.B \fBre2c:captures\fP, \fBre2c:leftmost\-captures\fP
Same as the \fB\-\-leftmost\-captures\fP option, but can be configured on
per\-block basis.
.TP
.B \fBre2c:captvars\fP, \fBre2c:leftmost\-captvars\fP
Same as the \fB\-\-leftmost\-captvars\fP option, but can be configured on
per\-block basis.
.TP
.B \fBre2c:case\-insensitive\fP, \fBre2c:flags:case\-insensitive\fP
Same as the \fB\-\-case\-insensitive\fP option, but can be configured on
per\-block basis.
.TP
.B \fBre2c:case\-inverted\fP, \fBre2c:flags:case\-inverted\fP
Same as the \fB\-\-case\-inverted\fP option, but can be configured on per\-block
basis.
.TP
.B \fBre2c:case\-ranges\fP, \fBre2c:flags:case\-ranges\fP
Same as the \fB\-\-case\-ranges\fP option, but can be configured on per\-block
basis.
.TP
.B \fBre2c:computed\-gotos\fP, \fBre2c:flags:computed\-gotos\fP, \fBre2c:flags:g\fP
Same as the \fB\-\-computed\-gotos\fP option, but can be configured on per\-block
basis.
.TP
.B \fBre2c:computed\-gotos:relative\fP, \fBre2c:cgoto:relative\fP
Same as the \fB\-\-computed\-gotos\-relative\fP option, but can be configured on
per\-block basis.
.TP
.B \fBre2c:computed\-gotos:threshold\fP, \fBre2c:cgoto:threshold\fP
If computed \fBgoto\fP is used, this configuration specifies the complexity
threshold that triggers the generation of jump tables instead of nested
\fBif\fP statements and bitmaps. The default value is \fB9\fP\&.
.TP
.B \fBre2c:cond:abort\fP
If set to a positive integer value, the default case in the generated
condition dispatch aborts program execution.
.TP
.B \fBre2c:cond:goto\fP
Specifies a piece of code used for the autogenerated shortcut rules \fB:=>\fP
in conditions. The default is \fBgoto @@;\fP\&.
The \fB@@\fP placeholder is substituted with condition name (see
configurations \fBre2c:api:sigil\fP and \fBre2c:cond:goto@cond\fP).
.TP
.B \fBre2c:cond:goto@cond\fP
Specifies the sigil used for argument substitution in \fBre2c:cond:goto\fP
definition. The default value is \fB@@\fP\&.
Overrides the more generic \fBre2c:api:sigil\fP configuration.
.TP
.B \fBre2c:cond:divider\fP
Defines the divider for condition blocks.
The default value is \fB/* *********************************** */\fP\&.
Placeholders are substituted with condition name (see \fBre2c:api:sigil\fP and
\fBre2c:cond:divider@cond\fP).
.TP
.B \fBre2c:cond:divider@cond\fP
Specifies the sigil used for argument substitution in \fBre2c:cond:divider\fP
definition. The default is \fB@@\fP\&.
Overrides the more generic \fBre2c:api:sigil\fP configuration.
.TP
.B \fBre2c:cond:prefix\fP, \fBre2c:condprefix\fP
Specifies the prefix used for condition labels.
The default is \fByyc_\fP\&.
.TP
.B \fBre2c:cond:enumprefix\fP, \fBre2c:condenumprefix\fP
Specifies the prefix used for condition identifiers.
The default is \fByyc\fP\&.
.TP
.B \fBre2c:debug\-output\fP, \fBre2c:flags:debug\-output\fP, \fBre2c:flags:d\fP
Same as the \fB\-\-debug\-output\fP option, but can be configured on per\-block
basis.
.TP
.B \fBre2c:empty\-class\fP, \fBre2c:flags:empty\-class\fP
Same as the \fB\-\-empty\-class\fP option, but can be configured on per\-block
basis.
.TP
.B \fBre2c:encoding:ebcdic\fP, \fBre2c:flags:ecb\fP, \fBre2c:flags:e\fP
Same as the \fB\-\-ebcdic\fP option, but can be configured on per\-block basis.
.TP
.B \fBre2c:encoding:ucs2\fP, \fBre2c:flags:wide\-chars\fP, \fBre2c:flags:w\fP
Same as the \fB\-\-ucs2\fP option, but can be configured on per\-block basis.
.TP
.B \fBre2c:encoding:utf8\fP, \fBre2c:flags:utf\-8\fP, \fBre2c:flags:8\fP
Same as the \fB\-\-utf8\fP option, but can be configured on per\-block basis.
.TP
.B \fBre2c:encoding:utf16\fP, \fBre2c:flags:utf\-16\fP, \fBre2c:flags:x\fP
Same as the \fB\-\-utf16\fP option, but can be configured on per\-block basis.
.TP
.B \fBre2c:encoding:utf32\fP, \fBre2c:flags:unicode\fP, \fBre2c:flags:u\fP
Same as the \fB\-\-utf32\fP option, but can be configured on per\-block basis.
.TP
.B \fBre2c:encoding\-policy\fP, \fBre2c:flags:encoding\-policy\fP
Same as the \fB\-\-encoding\-policy\fP option, but can be configured on per\-block
basis.
.TP
.B \fBre2c:eof\fP
Specifies the sentinel symbol used with the end\-of\-input rule \fB$\fP\&. The
default value is \fB\-1\fP (\fB$\fP rule is not used). Other possible values
include all valid code units. Only decimal numbers are recognized.
.TP
.B \fBre2c:header\fP, \fBre2c:flags:type\-header\fP, \fBre2c:flags:t\fP
Specifies the name of the generated header file relative to the directory of
the output file. Same as the \fB\-\-header\fP option except that the file path
is relative.
.TP
.B \fBre2c:indent:string\fP
Specifies the string used for indentation. The default is a single tab
character \fB\(dq\et\(dq\fP\&. Indent string should contain whitespace characters only.
To disable indentation entirely, set this configuration to an empty string.
.TP
.B \fBre2c:indent:top\fP
Specifies the minimum amount of indentation to use. The default value is
zero. The value should be a non\-negative integer number.
.TP
.B \fBre2c:invert\-captures\fP
Same as the \fB\-\-invert\-captures\fP option, but can be configured on per\-block
basis.
.TP
.B \fBre2c:label:prefix\fP, \fBre2c:labelprefix\fP
Specifies the prefix used for DFA state labels. The default is \fByy\fP\&.
.TP
.B \fBre2c:label:start\fP, \fBre2c:startlabel\fP
Controls the generation of a block start label. The default value is zero,
which means that the start label is generated only if it is used. An integer
value greater than zero forces the generation of start label even if it is
unused by the lexer. A string value also forces start label generation and
sets the label name to the specified string. This configuration applies only
to the current block (it is reset to default for the next block).
.TP
.B \fBre2c:label:yyFillLabel\fP
Specifies the prefix of \fBYYFILL\fP labels used with \fBre2c:eof\fP and in
storable state mode.
.TP
.B \fBre2c:label:yyloop\fP
Specifies the name of the label marking the start of the lexer loop with
\fB\-\-loop\-switch\fP option. The default is \fByyloop\fP\&.
.TP
.B \fBre2c:label:yyNext\fP
Specifies the name of the optional label that follows \fBYYGETSTATE\fP switch
in storable state mode (enabled with \fBre2c:state:nextlabel\fP). The default
is \fByyNext\fP\&.
.TP
.B \fBre2c:lookahead\fP, \fBre2c:flags:lookahead\fP
Deprecated (see the deprecated \fB\-\-no\-lookahead\fP option).
.TP
.B \fBre2c:monadic\fP
If set to non\-zero, the generated lexer will use monadic notation (this
configuration is specific to Haskell).
.TP
.B \fBre2c:nested\-ifs\fP, \fBre2c:flags:nested\-ifs\fP, \fBre2c:flags:s\fP
Same as the \fB\-\-nested\-ifs\fP option, but can be configured on per\-block
basis.
.TP
.B \fBre2c:posix\-captures\fP, \fBre2c:flags:posix\-captures\fP, \fBre2c:flags:P\fP
Same as the \fB\-\-posix\-captures\fP option, but can be configured on per\-block
basis.
.TP
.B \fBre2c:posix\-captvars\fP
Same as the \fB\-\-posix\-captvars\fP option, but can be configured on per\-block
basis.
.TP
.B \fBre2c:tags\fP, \fBre2c:flags:tags\fP, \fBre2c:flags:T\fP
Same as the \fB\-\-tags\fP option, but can be configured on per\-block basis.
.TP
.B \fBre2c:tags:expression\fP
Specifies the expression used for tag variables.
By default re2java generates expressions of the form \fByyt<N>\fP\&. This might
be inconvenient, for example if tag variables are defined as fields in a
struct. All occurrences of \fB@@{tag}\fP or \fB@@\fP are replaced with the
actual tag name. For example, \fBre2c:tags:expression = \(dqs.@@\(dq;\fP results
in expressions of the form \fBs.yyt<N>\fP in the generated code.
See also \fBre2c:api:sigil\fP configuration.
.TP
.B \fBre2c:tags:negative\fP
Specifies the constant expression that is used for negative tag value
(typically this would be \fB\-1\fP if tags are integer offsets in the input
string, or null pointer if they are pointers).
.TP
.B \fBre2c:tags:prefix\fP
Specifies the prefix for tag variable names. The default is \fByyt\fP\&.
.TP
.B \fBre2c:sentinel\fP
Specifies the sentinel symbol used for the end\-of\-input checks (when bounds
checks are disabled with \fBre2c:yyfill:enable = 0;\fP and \fBre2c:eof\fP is not
set). This configuration does not affect code generation: its purpose is to
verify that the sentinel is not allowed in the middle of a rule, and ensure
that the lexer won\(aqt read past the end of buffer. The default value is
\fB\-1\fP (in that case re2java assumes that the sentinel is zero, which is the
most common case). Only decimal numbers are recognized.
.TP
.B \fBre2c:state:abort\fP
If set to a positive integer value, the default case in the generated
state dispatch aborts program execution, and an explicit \fB\-1\fP case
contains transition to the start of the block.
.TP
.B \fBre2c:state:nextlabel\fP
Controls if the \fBYYGETSTATE\fP switch is followed by an \fByyNext\fP label
(the default value is zero, which corresponds to no label).
Alternatively one can use \fBre2c:label:start\fP to generate a specific start
label, or an explicit \fBgetstate\fP block to generate the \fBYYGETSTATE\fP
switch separately from the lexer block.
.TP
.B \fBre2c:unsafe\fP, \fBre2c:flags:unsafe\fP
Same as the \fB\-\-no\-unsafe\fP option, but can be configured on per\-block
basis.
If set to zero, it suppresses the generation of \fBunsafe\fP wrappers around
\fBYYPEEK\fP\&. The default is non\-zero (wrappers are generated).
This configuration is specific to Rust.
.TP
.B \fBre2c:YYBACKUP\fP, \fBre2c:define:YYBACKUP\fP
Defines generic API primitive \fBYYBACKUP\fP\&.
.TP
.B \fBre2c:YYBACKUPCTX\fP, \fBre2c:define:YYBACKUPCTX\fP
Defines generic API primitive \fBYYBACKUPCTX\fP\&.
.TP
.B \fBre2c:YYCONDTYPE\fP, \fBre2c:define:YYCONDTYPE\fP
Defines API primitive \fBYYCONDTYPE\fP\&.
.TP
.B \fBre2c:YYCTYPE\fP, \fBre2c:define:YYCTYPE\fP
Defines API primitive \fBYYCTYPE\fP\&.
.TP
.B \fBre2c:YYCTXMARKER\fP, \fBre2c:define:YYCTXMARKER\fP
Defines API primitive \fBYYCTXMARKER\fP\&.
.TP
.B \fBre2c:YYCURSOR\fP, \fBre2c:define:YYCURSOR\fP
Defines API primitive \fBYYCURSOR\fP\&.
.TP
.B \fBre2c:YYDEBUG\fP, \fBre2c:define:YYDEBUG\fP
Defines API primitive \fBYYDEBUG\fP\&.
.TP
.B \fBre2c:YYFILL\fP, \fBre2c:define:YYFILL\fP
Defines API primitive \fBYYFILL\fP\&.
.TP
.B \fBre2c:YYFILL@len\fP, \fBre2c:define:YYFILL@len\fP
Specifies the sigil used for argument substitution in \fBYYFILL\fP
definition. Defaults to \fB@@\fP\&.
Overrides the more generic \fBre2c:api:sigil\fP configuration.
.TP
.B \fBre2c:YYFILL:naked\fP, \fBre2c:define:YYFILL:naked\fP
Overrides the more generic \fBre2c:api:style\fP configuration for \fBYYFILL\fP\&.
Zero value corresponds to free\-form API style.
.TP
.B \fBre2c:YYFN\fP
Defines API primitive \fBYYFN\fP\&.
.TP
.B \fBre2c:YYINPUT\fP
Defines API primitive \fBYYINPUT\fP\&.
.TP
.B \fBre2c:YYGETCOND\fP, \fBre2c:define:YYGETCONDITION\fP
Defines API primitive \fBYYGETCOND\fP\&.
.TP
.B \fBre2c:YYGETCOND:naked\fP, \fBre2c:define:YYGETCONDITION:naked\fP
Overrides the more generic \fBre2c:api:style\fP configuration for
\fBYYGETCOND\fP\&. Zero value corresponds to free\-form API style.
.TP
.B \fBre2c:YYGETSTATE\fP, \fBre2c:define:YYGETSTATE\fP
Defines API primitive \fBYYGETSTATE\fP\&.
.TP
.B \fBre2c:YYGETSTATE:naked\fP, \fBre2c:define:YYGETSTATE:naked\fP
Overrides the more generic \fBre2c:api:style\fP configuration for
\fBYYGETSTATE\fP\&. Zero value corresponds to free\-form API style.
.TP
.B \fBre2c:YYGETACCEPT\fP, \fBre2c:define:YYGETACCEPT\fP
Defines API primitive \fBYYGETACCEPT\fP\&.
.TP
.B \fBre2c:YYLESSTHAN\fP, \fBre2c:define:YYLESSTHAN\fP
Defines generic API primitive \fBYYLESSTHAN\fP\&.
.TP
.B \fBre2c:YYLIMIT\fP, \fBre2c:define:YYLIMIT\fP
Defines API primitive \fBYYLIMIT\fP\&.
.TP
.B \fBre2c:YYMARKER\fP, \fBre2c:define:YYMARKER\fP
Defines API primitive \fBYYMARKER\fP\&.
.TP
.B \fBre2c:YYMTAGN\fP, \fBre2c:define:YYMTAGN\fP
Defines generic API primitive \fBYYMTAGN\fP\&.
.TP
.B \fBre2c:YYMTAGP\fP, \fBre2c:define:YYMTAGP\fP
Defines generic API primitive \fBYYMTAGP\fP\&.
.TP
.B \fBre2c:YYPEEK\fP, \fBre2c:define:YYPEEK\fP
Defines generic API primitive \fBYYPEEK\fP\&.
.TP
.B \fBre2c:YYRESTORE\fP, \fBre2c:define:YYRESTORE\fP
Defines generic API primitive \fBYYRESTORE\fP\&.
.TP
.B \fBre2c:YYRESTORECTX\fP, \fBre2c:define:YYRESTORECTX\fP
Defines generic API primitive \fBYYRESTORECTX\fP\&.
.TP
.B \fBre2c:YYRESTORETAG\fP, \fBre2c:define:YYRESTORETAG\fP
Defines generic API primitive \fBYYRESTORETAG\fP\&.
.TP
.B \fBre2c:YYSETCOND\fP, \fBre2c:define:YYSETCONDITION\fP
Defines API primitive \fBYYSETCOND\fP\&.
.TP
.B \fBre2c:YYSETCOND@cond\fP, \fBre2c:define:YYSETCONDITION@cond\fP
Specifies the sigil used for argument substitution in \fBYYSETCOND\fP
definition. The default value is \fB@@\fP\&.
Overrides the more generic \fBre2c:api:sigil\fP configuration.
.TP
.B \fBre2c:YYSETCOND:naked\fP, \fBre2c:define:YYSETCONDITION:naked\fP
Overrides the more generic \fBre2c:api:style\fP configuration for
\fBYYSETCOND\fP\&. Zero value corresponds to free\-form API style.
.TP
.B \fBre2c:YYSETSTATE\fP, \fBre2c:define:YYSETSTATE\fP
Defines API primitive \fBYYSETSTATE\fP\&.
.TP
.B \fBre2c:YYSETSTATE@state\fP, \fBre2c:define:YYSETSTATE@state\fP
Specifies the sigil used for argument substitution in \fBYYSETSTATE\fP
definition. The default value is \fB@@\fP\&.
Overrides the more generic \fBre2c:api:sigil\fP configuration.
.TP
.B \fBre2c:YYSETSTATE:naked\fP, \fBre2c:define:YYSETSTATE:naked\fP
Overrides the more generic \fBre2c:api:style\fP configuration for
\fBYYSETSTATE\fP\&. Zero value corresponds to free\-form API style.
.TP
.B \fBre2c:YYSETACCEPT\fP, \fBre2c:define:YYSETACCEPT\fP
Defines API primitive \fBYYSETACCEPT\fP\&.
.TP
.B \fBre2c:YYSKIP\fP, \fBre2c:define:YYSKIP\fP
Defines generic API primitive \fBYYSKIP\fP\&.
.TP
.B \fBre2c:YYSHIFT\fP, \fBre2c:define:YYSHIFT\fP
Defines generic API primitive \fBYYSHIFT\fP\&.
.TP
.B \fBre2c:YYCOPYMTAG\fP, \fBre2c:define:YYCOPYMTAG\fP
Defines generic API primitive \fBYYCOPYMTAG\fP\&.
.TP
.B \fBre2c:YYCOPYSTAG\fP, \fBre2c:define:YYCOPYSTAG\fP
Defines generic API primitive \fBYYCOPYSTAG\fP\&.
.TP
.B \fBre2c:YYSHIFTMTAG\fP, \fBre2c:define:YYSHIFTMTAG\fP
Defines generic API primitive \fBYYSHIFTMTAG\fP\&.
.TP
.B \fBre2c:YYSHIFTSTAG\fP, \fBre2c:define:YYSHIFTSTAG\fP
Defines generic API primitive \fBYYSHIFTSTAG\fP\&.
.TP
.B \fBre2c:YYSTAGN\fP, \fBre2c:define:YYSTAGN\fP
Defines generic API primitive \fBYYSTAGN\fP\&.
.TP
.B \fBre2c:YYSTAGP\fP, \fBre2c:define:YYSTAGP\fP
Defines generic API primitive \fBYYSTAGP\fP\&.
.TP
.B \fBre2c:yyaccept\fP, \fBre2c:variable:yyaccept\fP
Defines API primitive \fByyaccept\fP\&.
.TP
.B \fBre2c:yybm\fP, \fBre2c:variable:yybm\fP
Defines API primitive \fByybm\fP\&.
.TP
.B \fBre2c:yybm:hex\fP, \fBre2c:variable:yybm:hex\fP
If set to nonzero, bitmaps for the \fB\-\-bit\-vectors\fP option are generated
in hexadecimal format. The default is zero (bitmaps are in decimal format).
.TP
.B \fBre2c:yych\fP, \fBre2c:variable:yych\fP
Defines API primitive \fByych\fP\&.
.TP
.B \fBre2c:yych:emit\fP, \fBre2c:variable:yych:emit\fP
If set to zero, \fByych\fP definition is not generated.
The default is non\-zero.
.TP
.B \fBre2c:yych:conversion\fP, \fBre2c:variable:yych:conversion\fP
If set to non\-zero, re2java automatically generates a conversion to \fBYYCTYPE\fP
every time \fByych\fP is read. The default is zero (no conversion).
.TP
.B \fBre2c:yych:literals\fP, \fBre2c:variable:yych:literals\fP
Specifies the form of literals that \fByych\fP is matched against. Possible
values are: \fBchar\fP (character literals in single quotes, non\-printable
ones use escape sequences that start with backslash), \fBhex\fP (hexadecimal
integers) and \fBchar_or_hex\fP (a mixture of both, character literals for
printable characters and hexadecimal integers for others).
.TP
.B \fBre2c:yyctable\fP, \fBre2c:variable:yyctable\fP
Defines API primitive \fByyctable\fP\&.
.TP
.B \fBre2c:yynmatch\fP, \fBre2c:variable:yynmatch\fP
Defines API primitive \fByynmatch\fP\&.
.TP
.B \fBre2c:yypmatch\fP, \fBre2c:variable:yypmatch\fP
Defines API primitive \fByypmatch\fP\&.
.TP
.B \fBre2c:yytarget\fP, \fBre2c:variable:yytarget\fP
Defines API primitive \fByytarget\fP\&.
.TP
.B \fBre2c:yystable\fP, \fBre2c:variable:yystable\fP
Deprecated.
.TP
.B \fBre2c:yystate\fP, \fBre2c:variable:yystate\fP
Defines API primitive \fByystate\fP\&.
.TP
.B \fBre2c:yyfill\fP, \fBre2c:variable:yyfill\fP
Defines API primitive \fByyfill\fP\&.
.TP
.B \fBre2c:yyfill:check\fP
If set to zero, suppresses the generation of pre\-\fBYYFILL\fP check for the
number of input characters (the \fBYYLESSTHAN\fP definition in generic API and
the \fBYYLIMIT\fP\-based comparison in C pointer API). The default is non\-zero
(generate the check).
.TP
.B \fBre2c:yyfill:enable\fP
If set to zero, suppresses the generation of \fBYYFILL\fP (together
with the check). This should be used when the whole input fits into one piece
of memory (there is no need for buffering) and the end\-of\-input checks do not
rely on the \fBYYFILL\fP checks (e.g. if a sentinel character is used).
Use warnings (\fB\-W\fP option) and \fBre2c:sentinel\fP configuration to verify
that the generated lexer cannot read past the end of input.
The default is non\-zero (\fBYYFILL\fP is enabled).
.TP
.B \fBre2c:yyfill:parameter\fP
If set to zero, suppresses the generation of parameter passed to \fBYYFILL\fP\&.
The parameter is the minimum number of characters that must be supplied.
Defaults to non\-zero (the parameter is generated).
This configuration can be overridden with \fBre2c:YYFILL:naked\fP or
\fBre2c:api:style\fP\&.
.TP
.B \fBre2c:yyfn:sep\fP
Specifies the separator used in \fBYYFN\fP elements (defaults to a semicolon).
.TP
.B \fBre2c:yyfn:throw\fP
Specifies exceptions thrown by \fBYYFN\fP function (defaults to empty, which
means no exceptions).
.UNINDENT
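.sp
Several configurations above (\fBre2c:api:sigil\fP, \fBre2c:YYFILL@len\fP,
\fBre2c:cond:goto@cond\fP and others) rely on placeholder substitution in
free\-form API definitions. The following Java sketch models that substitution
for one named argument. It is an illustration under stated assumptions rather
than the implementation used by re2java: \fBSigilSketch\fP, \fBsubstitute\fP and
the user function \fBfill\fP in the note below are hypothetical names.
.sp
.nf
```java
class SigilSketch {
    // Substitutes one named argument into an API primitive template.
    // `sigil` plays the role of re2c:api:sigil; `name` is the argument name.
    static String substitute(String template, String sigil, String name, String value) {
        // Long form: the sigil followed by the argument name in curly braces.
        String result = template.replace(sigil + "{" + name + "}", value);
        // Shorthand for single-argument primitives: the bare sigil.
        return result.replace(sigil, value);
    }
}
```
.fi
.sp
With the default sigil \fB@@\fP, the template \fBfill(@@{len})\fP and the
shorthand \fBfill(@@)\fP would both expand to \fBfill(3)\fP for \fBlen\fP equal
to 3. With the sigil set to \fB$\fP, the same result would be obtained from
\fBfill(${len})\fP\&.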
.SS Regular expressions
.sp
re2java uses the following syntax for regular expressions:
.INDENT 0.0
.TP
.B \fB\(dqfoo\(dq\fP
Case\-sensitive string literal.
.TP
.B \fB\(aqfoo\(aq\fP
Case\-insensitive string literal.
.TP
.B \fB[a\-xyz]\fP, \fB[^a\-xyz]\fP
Character class (possibly negated).
.TP
.B \fB\&.\fP
Any character except newline.
.TP
.B \fBR \e S\fP
Difference of character classes \fBR\fP and \fBS\fP\&.
.TP
.B \fBR*\fP
Zero or more occurrences of \fBR\fP\&.
.TP
.B \fBR+\fP
One or more occurrences of \fBR\fP\&.
.TP
.B \fBR?\fP
Optional \fBR\fP\&.
.TP
.B \fBR{n}\fP
Repetition of \fBR\fP exactly \fBn\fP times.
.TP
.B \fBR{n,}\fP
Repetition of \fBR\fP at least \fBn\fP times.
.TP
.B \fBR{n,m}\fP
Repetition of \fBR\fP from \fBn\fP to \fBm\fP times.
.TP
.B \fB(R)\fP
Just \fBR\fP; parentheses are used to override precedence. If submatch
extraction is enabled, \fB(R)\fP is a capturing or a non\-capturing group
depending on \fB\-\-invert\-captures\fP option.
.TP
.B \fB(!R)\fP
If submatch extraction is enabled, \fB(!R)\fP is a non\-capturing or a
capturing group depending on \fB\-\-invert\-captures\fP option.
.TP
.B \fBR S\fP
Concatenation: \fBR\fP followed by \fBS\fP\&.
.TP
.B \fBR | S\fP
Alternative: \fBR\fP or \fBS\fP\&.
.TP
.B \fBR / S\fP
Lookahead: \fBR\fP followed by \fBS\fP, but \fBS\fP is not consumed.
.TP
.B \fBname\fP
Regular expression defined as \fBname\fP (or literal string \fB\(dqname\(dq\fP in
Flex compatibility mode).
.TP
.B \fB{name}\fP
Regular expression defined as \fBname\fP in Flex compatibility mode.
.TP
.B \fB@stag\fP
An \fIs\-tag\fP: saves the last input position at which \fB@stag\fP matches in a
variable named \fBstag\fP\&.
.TP
.B \fB#mtag\fP
An \fIm\-tag\fP: saves all input positions at which \fB#mtag\fP matches in a
variable named \fBmtag\fP\&.
.TP
.B \fB$\fP
End of input.
.UNINDENT
.sp
Character classes and string literals may contain the following escape
sequences: \fB\ea\fP, \fB\eb\fP, \fB\ef\fP, \fB\en\fP, \fB\er\fP, \fB\et\fP, \fB\ev\fP, \fB\e\e\fP,
octal escapes \fB\eooo\fP and hexadecimal escapes \fB\exhh\fP, \fB\euhhhh\fP and
\fB\eUhhhhhhhh\fP\&.
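.sp
The difference between an s\-tag and an m\-tag can be modelled by hand. The
Java sketch below does not use re2java and does not show generated code: it
assumes a tag placed at the start of each word in a repeated sub\-expression
over space\-separated words, and records the positions the way the two kinds
of tags are described above. The class \fBTagSketch\fP, its fields and the
use of a list for the m\-tag history are assumptions made for illustration.
.sp
.nf
```java
import java.util.ArrayList;
import java.util.List;

class TagSketch {
    int stag = -1;                           // s-tag: last recorded position, -1 if none
    List<Integer> mtag = new ArrayList<>();  // m-tag: every recorded position

    // Scans space-separated words and tags the start of each word.
    void scan(String input) {
        int cursor = 0;
        while (cursor < input.length()) {
            // Tag position: the start of a word.
            stag = cursor;     // the s-tag overwrites its previous value
            mtag.add(cursor);  // the m-tag appends to its history
            while (cursor < input.length() && input.charAt(cursor) != ' ') {
                ++cursor;
            }
            if (cursor < input.length()) {
                ++cursor;      // skip the separating space
            }
        }
    }
}
```
.fi
.sp
For the input \fB\(dqab cd ef\(dq\fP this model leaves \fBstag\fP equal to 6 and
\fBmtag\fP holding the positions 0, 3 and 6.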
.SS Actions
.sp
Here is a list of predefined actions supported by re2java:
.INDENT 0.0
.TP
.B \fB!entry code\fP
Entry action binds a user\-defined block of \fBcode\fP to the start state of
the current finite state machine. If \fI\%start conditions\fP are used, the entry
action can be set individually for each condition. This action may be used
to perform initialization, e.g. to save start location of a lexeme.
.TP
.B \fB!pre_rule code\fP
Pre\-rule action prepends a user\-defined block of \fBcode\fP to semantic actions
of all rules in the current block (or condition, if \fI\%start conditions\fP are
used). This action may be used to factor out the common part of all semantic
actions (e.g. saving the end location of a lexeme).
.TP
.B \fB!post_rule code\fP
Post\-rule action appends a user\-defined block of \fBcode\fP to semantic actions
of all rules in the current block (or condition, if \fI\%start conditions\fP are
used). This action may be used to emit trap statements that guard against
unintended control flow.
.UNINDENT
.SS Directives
.sp
Here is a full list of directives supported by re2java:
.INDENT 0.0
.TP
.B \fB!use:name ;\fP
An in\-block use directive that merges a previously defined rules block with
the specified \fBname\fP into the current block. Named definitions, configurations
and rules of the referenced block are added to the current ones. Conflicts
between overlapping rules and configurations are resolved in the usual way:
the first rule takes priority, and the latest configuration overrides the
preceding ones. One exception is the special rules \fB*\fP, \fB$\fP and \fB<!>\fP
for which a block\-local definition always takes priority. A use directive
can be placed anywhere inside of a block, and multiple use directives are
allowed.
.TP
.B \fB!include file ;\fP
This directive is the same as \fBinclude\fP block: it inserts \fBfile\fP
contents verbatim in place of the directive.
.UNINDENT
.SS Program interface
.sp
The generated code interfaces with the outer program with the help of
\fIprimitives\fP, collectively referred to as the \fIAPI\fP\&.
Which primitives should be defined for a particular program depends on multiple
factors, including the complexity of regular expressions, input representation,
buffering and the use of various features. All the necessary primitives should
be defined by the user in the form of macros, functions, variables or any other
suitable form that makes the generated code syntactically and semantically
correct. re2java does not (and cannot) check the definitions, so if anything is
missing or defined incorrectly, the generated program may have compile\-time or
run\-time errors.
This manual provides examples of API definitions in the most common cases.
.sp
re2java has three API flavors that define the core set of primitives used by a
program:
.INDENT 0.0
.TP
.B \fBSimple API\fP
This is the default API for the Java backend. It consists of the following
primitives: \fBYYINPUT\fP (which should be defined as a sequence of code
units, e.g. a string) and \fBYYCURSOR\fP, \fBYYMARKER\fP, \fBYYCTXMARKER\fP,
\fBYYLIMIT\fP (which should be defined as indices in \fBYYINPUT\fP).
.TP
.B \fBRecord API\fP
Record API is useful in cases when lexer state must be stored in a class.
It is enabled with \fB\-\-api record\fP option or \fBre2c:api = record\fP
configuration. This API consists of a variable \fByyrecord\fP (the
name can be overridden with \fBre2c:yyrecord\fP) that should be
defined as a class with fields \fByyinput\fP, \fByycursor\fP, \fByymarker\fP,
\fByyctxmarker\fP, \fByylimit\fP (only the fields used by the generated code
need to be defined, and their names can be configured).
.TP
.B \fBGeneric API\fP
This is the most flexible API. It is enabled with \fB\-\-api generic\fP option
or \fBre2c:api = generic\fP configuration.
It contains primitives for generic operations:
\fBYYPEEK\fP,
\fBYYSKIP\fP,
\fBYYBACKUP\fP,
\fBYYBACKUPCTX\fP,
\fBYYSTAGP\fP,
\fBYYSTAGN\fP,
\fBYYMTAGP\fP,
\fBYYMTAGN\fP,
\fBYYRESTORE\fP,
\fBYYRESTORECTX\fP,
\fBYYRESTORETAG\fP,
\fBYYSHIFT\fP,
\fBYYSHIFTSTAG\fP,
\fBYYSHIFTMTAG\fP,
\fBYYLESSTHAN\fP,
\fBYYEND\fP\&.
.UNINDENT
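.sp
To make the roles of the simple API primitives concrete, here is a
hand\-written Java sketch in the spirit of the simple API. It is not output
produced by re2java: it assumes two rules, \fB\(dqab\(dq\fP and
\fB\(dqabcd\(dq\fP, and uses local variables \fByyinput\fP, \fByycursor\fP,
\fByymarker\fP and \fByylimit\fP in place of \fBYYINPUT\fP, \fBYYCURSOR\fP,
\fBYYMARKER\fP and \fBYYLIMIT\fP\&. The class \fBSimpleApiSketch\fP, the helper
\fBat\fP and the return convention are hypothetical.
.sp
.nf
```java
class SimpleApiSketch {
    // True if there is input left at `cursor` and the code unit there equals `c`.
    static boolean at(String yyinput, int cursor, int yylimit, char c) {
        return cursor < yylimit && yyinput.charAt(cursor) == c;
    }

    // Hand-written model of a lexer with two rules: "ab" (rule 1) and "abcd" (rule 2).
    // Returns {rule, position after the match}; rule 0 means that nothing matched.
    static int[] lex(String yyinput, int yycursor, int yylimit) {
        int start = yycursor;
        if (!at(yyinput, yycursor, yylimit, 'a')) return new int[] {0, start};
        ++yycursor;
        if (!at(yyinput, yycursor, yylimit, 'b')) return new int[] {0, start};
        ++yycursor;
        // Rule 1 has matched: remember this position before trying the longer rule.
        int yymarker = yycursor;
        if (at(yyinput, yycursor, yylimit, 'c')) {
            ++yycursor;
            if (at(yyinput, yycursor, yylimit, 'd')) {
                ++yycursor;
                return new int[] {2, yycursor};
            }
        }
        // The longer match failed: roll back to the end of rule 1.
        yycursor = yymarker;
        return new int[] {1, yycursor};
    }
}
```
.fi
.sp
On the input \fB\(dqabcx\(dq\fP the sketch advances the cursor to position 3,
fails to match \fBd\fP, restores the cursor from the marker and reports rule 1
ending at position 2. On \fB\(dqabcd\(dq\fP it reports rule 2 ending at
position 4.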
.sp
Here is a full list of API primitives that may be used by the generated code in
order to interface with the outer program.
.INDENT 0.0
.TP
.B \fBYYCTYPE\fP
The type of the input characters (code units).
For ASCII, EBCDIC and UTF\-8 encodings it should be a 1\-byte unsigned integer.
For UTF\-16 or UCS\-2 it should be a 2\-byte unsigned integer. For UTF\-32 it
should be a 4\-byte unsigned integer.
.TP
.B \fBYYCURSOR\fP
An l\-value that stores the current input position (a pointer or an integer
offset in \fBYYINPUT\fP). Initially \fBYYCURSOR\fP should point to the first
input character, and later it is advanced by the generated code. When a rule
matches, \fBYYCURSOR\fP position is the one after the last matched character.
.TP
.B \fBYYLIMIT\fP
An r\-value that stores the end of input position (a pointer or an integer
offset in \fBYYINPUT\fP). Initially \fBYYLIMIT\fP should point to the position
after the last available input character. It is not changed by the
generated code. The lexer compares \fBYYCURSOR\fP to \fBYYLIMIT\fP
in order to determine if there are enough input characters left.
.TP
.B \fBYYMARKER\fP
An l\-value that stores the position of the latest matched rule (a pointer or
an integer offset in \fBYYINPUT\fP). It is used to restore the \fBYYCURSOR\fP
position if the longer match fails and the lexer needs to roll back.
Initialization is not needed.
.TP
.B \fBYYCTXMARKER\fP
An l\-value that stores the position of the trailing context (a pointer or an
integer offset in \fBYYINPUT\fP). No initialization is needed. \fBYYCTXMARKER\fP
is needed only if the lookahead operator \fB/\fP is used.
.TP
.B \fBYYFILL\fP
A generic API primitive with one variable \fBlen\fP\&.
\fBYYFILL\fP should provide at least \fBlen\fP more input characters or fail.
If \fBre2c:eof\fP is used, then \fBlen\fP is always \fB1\fP and \fBYYFILL\fP should
always return to the calling function; zero return value indicates success.
If \fBre2c:eof\fP is not used, then \fBYYFILL\fP return value is ignored and it
should not return on failure. The maximum value of \fBlen\fP is \fBYYMAXFILL\fP\&.
.TP
.B \fBYYFN\fP
A primitive that defines function prototype in \fB\-\-recursive\-functions\fP
code model. Its value should be an array of one or more strings, where each
string contains two or three components separated by the string specified in
\fBre2c:yyfn:sep\fP configuration (typically a semicolon). The first array
element defines function name and return type (empty for a void function).
Subsequent elements define function arguments: first, the expression for the
argument used in function body (usually just a name); second, argument type;
third, an optional formal parameter (it defaults to the first component \-
usually both the argument and the parameter are the same identifier).
.TP
.B \fBYYINPUT\fP
An r\-value that stores the current input character sequence (string, buffer,
etc.).
.TP
.B \fBYYMAXFILL\fP
An integral constant equal to the maximum value of the argument to
\fBYYFILL\fP\&.  It can be generated with a \fBmax\fP block.
.TP
.B \fBYYLESSTHAN\fP
A generic API primitive with one variable \fBlen\fP\&.
It should be defined as an r\-value of boolean type that equals \fBtrue\fP if
and only if there are fewer than \fBlen\fP input characters left.
.TP
.B \fBYYEND\fP
A generic API primitive with no variables.
It should be defined as an r\-value of boolean type that equals \fBtrue\fP if
and only if the \fIlogical\fP end of input has been reached (excluding any
padding or sentinel symbols). \fBYYEND\fP is used to implement \fB$\fP symbol in
regular expressions. It differs from \fBYYLESSTHAN\fP, which is used to ensure
that the lexer won\(aqt read past the end of buffer.
.TP
.B \fBYYPEEK\fP
A generic API primitive with no variables.
It should be defined as an r\-value of type \fBYYCTYPE\fP that is equal to the
character at the current input position.
.TP
.B \fBYYSKIP\fP
A generic API primitive that should advance the current input position by
one code unit.
.TP
.B \fBYYBACKUP\fP
A generic API primitive that should save the current input position (to be
restored with \fBYYRESTORE\fP later).
.TP
.B \fBYYRESTORE\fP
A generic API primitive that should restore the current input position to
the value saved by \fBYYBACKUP\fP\&.
.TP
.B \fBYYBACKUPCTX\fP
A generic API primitive that should save the current input position as the
position of the trailing context (to be restored with \fBYYRESTORECTX\fP
later).
.TP
.B \fBYYRESTORECTX\fP
A generic API primitive that should restore the trailing context position
saved with \fBYYBACKUPCTX\fP\&.
.TP
.B \fBYYRESTORETAG\fP
A generic API primitive with one variable \fBtag\fP that should restore the
trailing context position to the value of \fBtag\fP\&.
.TP
.B \fBYYSTAGP\fP
A generic API primitive with one variable \fBtag\fP, where \fBtag\fP can be a
pointer or an offset in \fBYYINPUT\fP (see submatch extraction section for
details). \fBYYSTAGP\fP should set \fBtag\fP to the current input position.
.TP
.B \fBYYSTAGN\fP
A generic API primitive with one variable \fBtag\fP, where \fBtag\fP can be a
pointer or an offset in \fBYYINPUT\fP (see submatch extraction section for
details). \fBYYSTAGN\fP should set \fBtag\fP to a value that represents a
non\-existent input position.
.TP
.B \fBYYMTAGP\fP
A generic API primitive with one variable \fBtag\fP\&.
\fBYYMTAGP\fP should append the current position to the submatch history of
\fBtag\fP (see the submatch extraction section for details).
.TP
.B \fBYYMTAGN\fP
A generic API primitive with one variable \fBtag\fP\&.
\fBYYMTAGN\fP should append a value that represents a non\-existent input
position to the submatch history of \fBtag\fP (see the submatch
extraction section for details).
.TP
.B \fBYYSHIFT\fP
A generic API primitive with one variable \fBshift\fP that should shift the
current input position by \fBshift\fP characters (the shift value may be
negative).
.TP
.B \fBYYCOPYSTAG\fP
A generic API primitive with two variables, \fBlhs\fP and \fBrhs\fP that should
copy right\-hand\-side s\-tag variable \fBrhs\fP to the left\-hand\-side s\-tag
variable \fBlhs\fP\&. For most languages this primitive has a default definition
that assigns \fBrhs\fP to \fBlhs\fP\&.
.TP
.B \fBYYCOPYMTAG\fP
A generic API primitive with two variables, \fBlhs\fP and \fBrhs\fP that should
copy right\-hand\-side m\-tag variable \fBrhs\fP to the left\-hand\-side m\-tag
variable \fBlhs\fP\&. For most languages this primitive has a default definition
that assigns \fBrhs\fP to \fBlhs\fP\&.
.TP
.B \fBYYSHIFTSTAG\fP
A generic API primitive with two variables, \fBtag\fP and \fBshift\fP that
should shift \fBtag\fP by \fBshift\fP code units (the shift value may be
negative).
.TP
.B \fBYYSHIFTMTAG\fP
A generic API primitive with two variables, \fBtag\fP and \fBshift\fP that
should shift the latest value in the history of \fBtag\fP by \fBshift\fP code
units (the shift value may be negative).
.TP
.B \fBYYMAXNMATCH\fP
An integral constant equal to the maximal number of POSIX capturing groups
in a rule. It is generated with a \fBmaxnmatch\fP block.
.TP
.B \fBYYCONDTYPE\fP
The type of the condition enum.
It can be generated either with \fBconditions\fP block or \fB\-\-header\fP option.
.TP
.B \fBYYGETACCEPT\fP
A primitive with one variable \fBvar\fP that stores numeric selector of the
accepted rule. For most languages this primitive has a default definition
that reads from \fBvar\fP\&.
.TP
.B \fBYYSETACCEPT\fP
A primitive with two variables: \fBvar\fP (an l\-value that stores numeric
selector of the accepted rule), and \fBval\fP (the value of selector). For
most languages this primitive has a default definition that assigns \fBval\fP
to \fBvar\fP\&.
.TP
.B \fBYYGETCOND\fP
An r\-value of type \fBYYCONDTYPE\fP that is equal to the current condition
identifier.
.TP
.B \fBYYSETCOND\fP
A primitive with one variable \fBcond\fP that should set the current
condition identifier to \fBcond\fP\&.
.TP
.B \fBYYGETSTATE\fP
An r\-value of integer type that is equal to the current lexer state. It
should be initialized to \fB\-1\fP\&.
.TP
.B \fBYYSETSTATE\fP
A primitive with one variable \fBstate\fP that should set the current lexer
state to \fBstate\fP\&.
.TP
.B \fBYYDEBUG\fP
This primitive is generated only with \fB\-d\fP, \fB\-\-debug\-output\fP option.
Its purpose is to add logging to the generated code (typical \fBYYDEBUG\fP
definition is a print statement). \fBYYDEBUG\fP statements are generated in
every state and have two variables: \fBstate\fP (either a DFA state index or
\fB\-1\fP) and \fBsymbol\fP (the current input symbol).
.TP
.B \fByyaccept\fP
An l\-value of unsigned integral type that stores the number of the latest
matched rule. User definition is necessary only with \fB\-\-storable\-state\fP
option.
.TP
.B \fByybm\fP
A table containing compressed bitmaps for up to 8 transitions (used with
the \fB\-\-bitmaps\fP option). The table contains 256 elements and is indexed by
1\-byte code units. Each 8\-bit element combines boolean values for up to 8
transitions. The k\-th bit of the n\-th element is set if and only if the n\-th
code unit is in the range of the k\-th transition. The idea of this bitmap is to
replace many \fIif\fP
branches or \fIswitch\fP cases with one check of a single bit in the table.
.TP
.B \fByych\fP
An l\-value of type \fBYYCTYPE\fP that stores the current input character.
User definition is necessary only with \fB\-f\fP \fB\-\-storable\-state\fP option.
.TP
.B \fByyctable\fP
Jump table generated for the initial condition dispatch (enabled with the
combination of \fB\-\-conditions\fP and \fB\-\-computed\-gotos\fP options).
.TP
.B \fByyfill\fP
An l\-value that stores the result of \fBYYFILL\fP call (this may be necessary
for pure functional languages, where \fBYYFILL\fP is a monadic function with
complex return value).
.TP
.B \fByynmatch\fP
An l\-value of unsigned integral type that stores the number of POSIX
capturing groups in the matched rule.
Used only with \fB\-P\fP \fB\-\-posix\-captures\fP option.
.TP
.B \fByypmatch\fP
An array of l\-values that are used to hold the tag values corresponding
to the capturing parentheses in the matching rule. Array length must be
at least \fByynmatch * 2\fP (usually \fBYYMAXNMATCH * 2\fP is a good choice).
Used only with \fB\-P\fP \fB\-\-posix\-captures\fP option.
.TP
.B \fByystable\fP
Deprecated.
.TP
.B \fByystate\fP
An l\-value used with the \fB\-\-loop\-switch\fP option to store the current DFA
state.
.TP
.B \fByytarget\fP
Jump table that contains jump targets (label addresses) for all transitions
from a state. This table is local to each state. Generation of \fByytarget\fP
tables is enabled with \fB\-\-computed\-gotos\fP option.
.UNINDENT
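.sp
To make the generic API primitives above more concrete, here is a minimal
hand\-written Java sketch of what some of them could mean for an input stored in
a string and addressed by integer offsets. It is an illustration only, not code
generated by re2java: the class \fBInputSketch\fP and its method names are
hypothetical, and a real lexer would normally define the primitives in
configurations rather than as methods.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
// Hypothetical helper class: illustrates the meaning of several generic API
// primitives over a String with integer offsets. Not generated by re2java.
final class InputSketch {
    private final String input; // plays the role of YYINPUT
    private final int limit;    // plays the role of YYLIMIT
    private int cursor = 0;     // plays the role of YYCURSOR
    private int marker = 0;     // plays the role of YYMARKER

    InputSketch(String input) {
        this.input = input;
        this.limit = input.length();
    }

    // YYPEEK: the character at the current input position.
    char peek() { return input.charAt(cursor); }

    // YYSKIP: advance the current input position by one code unit.
    void skip() { cursor += 1; }

    // YYBACKUP: save the current input position.
    void backup() { marker = cursor; }

    // YYRESTORE: restore the position saved by backup().
    void restore() { cursor = marker; }

    // YYLESSTHAN: true if and only if fewer than len characters are left.
    boolean lessThan(int len) { return limit - cursor < len; }

    // YYSHIFT: move the current position by shift (which may be negative).
    void shift(int shift) { cursor += shift; }

    // Not a primitive: exposes the current position for inspection.
    int position() { return cursor; }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT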
.SS Options
.sp
Some of the options have corresponding \fI\%configurations\fP,
others are global and cannot be changed after re2c starts reading the input file.
Debug options generally require building re2c in debug configuration.
Internal options are useful for experimenting with the algorithms used in re2c.
.INDENT 0.0
.TP
.B \fB\-? \-\-help \-h\fP
Show help message.
.TP
.B \fB\-\-api <simple | record | generic>\fP
Specify the API used by the generated code to interface with user\-defined
code. Option \fBsimple\fP should be used in simple cases when there\(aqs no need
for buffer refilling and storing lexer state. Option \fBrecord\fP should be
used when lexer state needs to be stored in a record (struct, class, etc.).
Option \fBgeneric\fP should be used in complex cases when the other two APIs
are not flexible enough.
.TP
.B \fB\-\-bit\-vectors \-b\fP
Optimize conditional jumps using bit masks.
This option implies \fB\-\-nested\-ifs\fP\&.
.TP
.B \fB\-\-captures\fP, \fB\-\-leftmost\-captures\fP
Enable submatch extraction with leftmost greedy capturing groups. The result
is collected into an array \fByypmatch\fP of capacity \fB2 * YYMAXNMATCH\fP, and
\fByynmatch\fP is set to the number of groups for the matching rule.
.TP
.B \fB\-\-captvars\fP, \fB\-\-leftmost\-captvars\fP
Enable submatch extraction with leftmost greedy capturing groups. The result
is collected into variables \fByytl<k>\fP, \fByytr<k>\fP for \fBk\fP\-th capturing
group.
.TP
.B \fB\-\-case\-insensitive\fP
Treat single\-quoted and double\-quoted strings as case\-insensitive.
.TP
.B \fB\-\-case\-inverted\fP
Invert the meaning of single\-quoted and double\-quoted strings:
treat single\-quoted strings as case\-sensitive and double\-quoted strings
as case\-insensitive.
.TP
.B \fB\-\-case\-ranges\fP
Collapse consecutive cases in a switch statement into a range of the form
\fBlow ... high\fP\&. This syntax is a C/C++ language extension that is
supported by compilers like GCC, Clang and Tcc. The main advantage over
using single cases is smaller generated code and faster generation time,
although for some compilers like Tcc it also results in smaller binary size.
.TP
.B \fB\-\-computed\-gotos \-g\fP
Optimize conditional jumps using non\-standard \(dqcomputed goto\(dq extension
(which must be supported by the compiler). re2java generates jump tables
only in complex cases with a lot of conditional branches. Complexity
threshold can be configured with \fBcgoto:threshold\fP configuration.
Relative offsets can be enabled with \fBcgoto:relative\fP configuration. This
option implies \fB\-\-bit\-vectors\fP\&.
.TP
.B \fB\-\-computed\-gotos\-relative\fP
Similar to \fB\-\-computed\-gotos\fP, but generates relative offsets for jump tables
instead (which must be supported by the compiler). This option implies
\fB\-\-computed\-gotos\fP\&.
.TP
.B \fB\-\-conditions \-\-start\-conditions \-c\fP
Enable support of Flex\-like \(dqconditions\(dq: multiple interrelated lexers
within one block. This is an alternative to manually specifying different
re2java blocks connected with \fBgoto\fP or function calls.
.TP
.B \fB\-\-depfile FILE\fP
Write dependency information to \fBFILE\fP in the form of a Makefile rule
\fB<output\-file> : <input\-file> [include\-file ...]\fP\&. This allows one to
track build dependencies in the presence of \fBinclude\fP blocks/directives,
so that updating include files triggers regeneration of the output file.
This option depends on the \fB\-\-output\fP option.
.TP
.B \fB\-\-ebcdic \-\-ecb \-e\fP
Generate a lexer that reads input in EBCDIC encoding. re2java assumes that
the character range is 0 \-\- 0xFF and character size is 1 byte.
.TP
.B \fB\-\-empty\-class <match\-empty | match\-none | error>\fP
Define the way re2java treats empty character classes. With \fBmatch\-empty\fP
(the default) empty class matches empty input (which is illogical, but
backwards\-compatible). With \fBmatch\-none\fP empty class always fails to match.
With \fBerror\fP empty class raises a compilation error.
.TP
.B \fB\-\-encoding\-policy <fail | substitute | ignore>\fP
Define the way re2java treats Unicode surrogates.
With \fBfail\fP re2java aborts with an error when a surrogate is encountered.
With \fBsubstitute\fP re2java silently replaces surrogates with the error code
point 0xFFFD. With \fBignore\fP (the default) re2java treats surrogates as
normal code points. The Unicode standard says that standalone surrogates
are invalid, but real\-world libraries and programs behave in different ways.
.TP
.B \fB\-\-flex\-syntax \-F\fP
Partial support for Flex syntax: in this mode named definitions don\(aqt need
the equal sign and the terminating semicolon, and when used they must be
surrounded with curly braces. Names without curly braces are treated as
double\-quoted strings.
.TP
.B \fB\-\-goto\-label\fP
Use \(dqgoto/label\(dq code model: encode DFA in form of labeled code blocks
connected with \fBgoto\fP transitions across blocks. This is only supported
for languages that have a \fBgoto\fP statement.
.TP
.B \fB\-\-header \-\-type\-header \-t HEADER\fP
Generate a \fBHEADER\fP file. The contents of the file can be specified using
special blocks \fBheader:on\fP and \fBheader:off\fP\&. If conditions are used, the
generated header will have a condition enum automatically appended to it
(unless there is an explicit \fBconditions\fP block).
.TP
.B \fB\-I PATH\fP
Add \fBPATH\fP to the list of locations which are used when searching for
include files. This option is useful in combination with \fBinclude\fP block
or directive. re2java looks for \fBFILE\fP in the directory of the parent file
and in the include locations specified with \fB\-I\fP option.
.TP
.B \fB\-\-input <default | custom>\fP
Deprecated alias for \fB\-\-api\fP\&. Option \fBdefault\fP corresponds to \fBsimple\fP
(it is indeed the default for most backends, but not for all). Option
\fBcustom\fP corresponds to \fBgeneric\fP\&.
.TP
.B \fB\-\-input\-encoding <ascii | utf8>\fP
Specify the way re2java parses regular expressions.
With \fBascii\fP (the default) re2java handles input as ASCII\-encoded: any
sequence of code units is a sequence of standalone 1\-byte characters.
With \fButf8\fP re2java handles input as UTF8\-encoded and recognizes multibyte
characters.
.TP
.B \fB\-\-invert\-captures\fP
Invert the meaning of capturing and non\-capturing groups. By default
\fB(...)\fP is capturing and \fB(! ...)\fP is non\-capturing. With this option
\fB(! ...)\fP is capturing and \fB(...)\fP is non\-capturing.
.TP
.B \fB\-\-lang <none | c | d | go | haskell | java | js | ocaml | python | rust | swift | v | zig>\fP
Specify the target language. Supported languages are C, D, Go, Haskell,
Java, JS, OCaml, Python, Rust, Swift, V, Zig (more languages can be added via
user\-defined syntax files, see the \fB\-\-syntax\fP option). Option \fBnone\fP
disables default syntax configs, so that the target language is undefined.
.TP
.B \fB\-\-location\-format <gnu | msvc>\fP
Specify location format in messages.
With \fBgnu\fP locations are printed as \(aqfilename:line:column: ...\(aq.
With \fBmsvc\fP locations are printed as \(aqfilename(line,column) ...\(aq.
The default is \fBgnu\fP\&.
.TP
.B \fB\-\-loop\-switch\fP
Use \(dqloop/switch\(dq code model: encode DFA in form of a loop over a switch
statement, where individual states are switch cases. State is stored in a
variable \fByystate\fP\&. Transitions between states update \fByystate\fP to the
case label of the destination state and continue execution to the head of
the loop.
.TP
.B \fB\-\-nested\-ifs \-s\fP
Use nested \fBif\fP statements instead of \fBswitch\fP statements in conditional
jumps. This usually results in more efficient code with non\-optimizing
compilers.
.TP
.B \fB\-\-no\-debug\-info \-i\fP
Do not output line directives. This may be useful when the generated code is
stored in a version control system (to avoid huge autogenerated diffs on
small changes).
.TP
.B \fB\-\-no\-generation\-date\fP
Suppress date output in the generated file.
.TP
.B \fB\-\-no\-version\fP
Suppress version output in the generated file.
.TP
.B \fB\-\-no\-unsafe\fP
Do not generate \fBunsafe\fP wrapper over \fBYYPEEK\fP (this option is specific
to Rust). For performance reasons \fBYYPEEK\fP should avoid bounds\-checking,
as the lexer already performs end\-of\-input checks in a more efficient way.
The user may choose to provide a safe \fBYYPEEK\fP definition, or a definition
that is unsafe only in release builds, in which case the \fB\-\-no\-unsafe\fP
option helps to avoid warnings about redundant \fBunsafe\fP blocks.
.TP
.B \fB\-\-output \-o OUTPUT\fP
Specify the \fBOUTPUT\fP file.
.TP
.B \fB\-\-posix\-captures\fP, \fB\-P\fP
Enable submatch extraction with POSIX\-style capturing groups. The result
is collected into an array \fByypmatch\fP of capacity \fB2 * YYMAXNMATCH\fP, and
\fByynmatch\fP is set to the number of groups for the matching rule.
.TP
.B \fB\-\-posix\-captvars\fP
Enable submatch extraction with POSIX\-style capturing groups. The result
is collected into variables \fByytl<k>\fP, \fByytr<k>\fP for \fBk\fP\-th capturing
group.
.TP
.B \fB\-\-recursive\-functions\fP
Use code model based on co\-recursive functions, where each DFA state is a
separate function that may call other state\-functions or itself.
.TP
.B \fB\-\-reusable \-r\fP
Deprecated since version 2.2 (reusable blocks are allowed by default now).
.TP
.B \fB\-\-skeleton \-S\fP
Ignore user\-defined interface code and generate a self\-contained \(dqskeleton\(dq
program. Additionally, generate input files with strings derived from the
regular grammar and compressed match results that are used to verify
\(dqskeleton\(dq behavior on all inputs. This option is useful for finding bugs
in optimizations and code generation. This option is supported only for C.
.TP
.B \fB\-\-storable\-state \-f\fP
Generate a lexer which can store its inner state.
This is useful in push\-model lexers which are stopped by an outer program
when there is not enough input, and then resumed when more input becomes
available. In this mode users should additionally define \fBYYGETSTATE\fP
and \fBYYSETSTATE\fP primitives, and variables \fByych\fP, \fByyaccept\fP and
\fBstate\fP should be part of the stored lexer state.
.TP
.B \fB\-\-syntax FILE\fP
Load configurations from the specified \fBFILE\fP and apply them on top of the
default syntax file. Note that \fBFILE\fP can define only a few configurations
(if it\(aqs used to amend the default syntax file), or it can define a whole
new language backend (in the latter case it is recommended to use
\fB\-\-lang none\fP option).
.TP
.B \fB\-\-tags \-T\fP
Enable submatch extraction with tags.
.TP
.B \fB\-\-ucs2 \-\-wide\-chars \-w\fP
Generate a lexer that reads UCS2\-encoded input. re2java assumes that the
character range is 0 \-\- 0xFFFF and character size is 2 bytes.
This option implies \fB\-\-nested\-ifs\fP\&.
.TP
.B \fB\-\-utf8 \-\-utf\-8 \-8\fP
Generate a lexer that reads input in UTF\-8 encoding. re2java assumes that the
character range is 0 \-\- 0x10FFFF and character size is 1 byte.
.TP
.B \fB\-\-utf16 \-\-utf\-16 \-x\fP
Generate a lexer that reads UTF16\-encoded input. re2java assumes that the
character range is 0 \-\- 0x10FFFF and character size is 2 bytes.
This option implies \fB\-\-nested\-ifs\fP\&.
.TP
.B \fB\-\-utf32 \-\-unicode \-u\fP
Generate a lexer that reads UTF32\-encoded input. re2java assumes that the
character range is 0 \-\- 0x10FFFF and character size is 4 bytes.
This option implies \fB\-\-nested\-ifs\fP\&.
.TP
.B \fB\-\-verbose\fP
Output a short message in case of success.
.TP
.B \fB\-\-vernum \-V\fP
Show version information in \fBMMmmpp\fP format (major, minor, patch).
.TP
.B \fB\-\-version \-v\fP
Show version information.
.TP
.B \fB\-\-single\-pass \-1\fP
Deprecated. Does nothing (single pass is the default now).
.UNINDENT
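.sp
The \(dqloop/switch\(dq code model described under \fB\-\-loop\-switch\fP can be
pictured with a small hand\-written Java sketch. It is not actual re2java
output: the class name \fBLoopSwitchSketch\fP, the state numbering and the use of
a plain string with an integer cursor are assumptions made for illustration.
The method recognizes a prefix that matches \fB[0\-9]+\fP\&.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
// Hand-written illustration of the loop/switch code model. The state numbers
// and names are assumptions; this is not code generated by re2java.
final class LoopSwitchSketch {
    private static boolean isDigit(String s, int pos) {
        return pos < s.length() && s.charAt(pos) >= '0' && s.charAt(pos) <= '9';
    }

    // Returns the length of the longest prefix of s that matches [0-9]+,
    // or -1 if s does not start with a digit.
    static int matchDigits(String s) {
        int cursor = 0;
        int yystate = 0; // the current DFA state
        while (true) {
            switch (yystate) {
                case 0: // initial state: at least one digit is required
                    if (isDigit(s, cursor)) {
                        cursor += 1;
                        yystate = 1; // transition: update the state variable
                        continue;    // and go back to the head of the loop
                    }
                    return -1;
                case 1: // accepting state: consume further digits
                    if (isDigit(s, cursor)) {
                        cursor += 1;
                        continue; // yystate stays 1
                    }
                    return cursor;
                default:
                    throw new IllegalStateException("unknown state " + yystate);
            }
        }
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT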
.INDENT 0.0
.TP
.B \fB\-\-debug\-output \-d\fP
Emit \fBYYDEBUG\fP invocations in the generated code. This is useful to trace
lexer execution.
.TP
.B \fB\-\-dump\-adfa\fP
Debug option: output DFA after tunneling (in .dot format).
.TP
.B \fB\-\-dump\-cfg\fP
Debug option: output control flow graph of tag variables (in .dot format).
.TP
.B \fB\-\-dump\-closure\-stats\fP
Debug option: output statistics on the number of states in closure.
.TP
.B \fB\-\-dump\-dfa\-det\fP
Debug option: output DFA immediately after determinization (in .dot format).
.TP
.B \fB\-\-dump\-dfa\-min\fP
Debug option: output DFA after minimization (in .dot format).
.TP
.B \fB\-\-dump\-dfa\-tagopt\fP
Debug option: output DFA after tag optimizations (in .dot format).
.TP
.B \fB\-\-dump\-dfa\-tree\fP
Debug option: output DFA under construction with states represented as tag
history trees (in .dot format).
.TP
.B \fB\-\-dump\-dfa\-raw\fP
Debug option: output DFA under construction with expanded state\-sets
(in .dot format).
.TP
.B \fB\-\-dump\-interf\fP
Debug option: output interference table produced by liveness analysis of tag
variables.
.TP
.B \fB\-\-dump\-nfa\fP
Debug option: output NFA (in .dot format).
.TP
.B \fB\-\-emit\-dot \-D\fP
Instead of normal output generate lexer graph in .dot format.
The output can be converted to an image with the help of Graphviz
(e.g. something like \fBdot \-Tpng \-odfa.png dfa.dot\fP).
.UNINDENT
.INDENT 0.0
.TP
.B \fB\-\-dfa\-minimization <moore | table>\fP
Internal option: DFA minimization algorithm used by re2java\&. The \fBmoore\fP
option is the Moore algorithm (it is the default). The \fBtable\fP option is
the \(dqtable filling\(dq algorithm. Both algorithms should produce the same DFA
up to state relabeling; table filling is simpler and much slower and serves
as a reference implementation.
.TP
.B \fB\-\-eager\-skip\fP
Internal option: make the generated lexer advance the input position
eagerly \-\- immediately after reading the input symbol. This changes the
default behavior when the input position is advanced lazily \-\- after
transition to the next state.
.TP
.B \fB\-\-no\-lookahead\fP
Internal option, deprecated.
It used to enable TDFA(0) algorithm. Unlike TDFA(1), TDFA(0) algorithm does
not use one\-symbol lookahead. It applies register operations to the incoming
transitions rather than the outgoing ones. Benchmarks showed that TDFA(0)
algorithm is less efficient than TDFA(1).
.TP
.B \fB\-\-no\-optimize\-tags\fP
Internal option: suppress optimization of tag variables (useful for
debugging).
.TP
.B \fB\-\-posix\-closure <gor1 | gtop>\fP
Internal option: specify shortest\-path algorithm used for the construction of
epsilon\-closure with POSIX disambiguation semantics: \fBgor1\fP (the default)
stands for Goldberg\-Radzik algorithm, and \fBgtop\fP stands for \(dqglobal
topological order\(dq algorithm.
.TP
.B \fB\-\-posix\-prectable <complex | naive>\fP
Internal option: specify the algorithm used to compute POSIX precedence
table. The \fBcomplex\fP algorithm computes precedence table in one traversal
of tag history tree and has quadratic complexity in the number of TNFA
states; it is the default. The \fBnaive\fP algorithm has worst\-case cubic
complexity in the number of TNFA states, but it is much simpler than
\fBcomplex\fP and may be slightly faster in non\-pathological cases.
.TP
.B \fB\-\-stadfa\fP
Internal option, deprecated.
It used to enable staDFA algorithm, which differs from TDFA in that register
operations are placed in states rather than on transitions. Benchmarks
showed that staDFA algorithm is less efficient than TDFA.
.TP
.B \fB\-\-fixed\-tags <none | toplevel | all>\fP
Internal option:
specify whether the fixed\-tag optimization should be applied to all tags
(\fBall\fP), none of them (\fBnone\fP), or only those in toplevel concatenation
(\fBtoplevel\fP). The default is \fBall\fP\&.
\(dqFixed\(dq tags are those that are located within a fixed distance to some
other tag (called \(dqbase\(dq). In such cases only the base tag needs to be
tracked, and the value of the fixed tag can be computed as the value of the
base tag plus a static offset. For tags that are under alternative or
repetition it is also necessary to check if the base tag has a no\-match
value (in that case fixed tag should also be set to no\-match, disregarding
the offset). For tags in top\-level concatenation the check is not needed,
because they always match.
.UNINDENT
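.sp
The arithmetic behind the fixed\-tag optimization described under
\fB\-\-fixed\-tags\fP can be sketched in a few lines of Java. This is an
illustration under stated assumptions rather than re2java internals: the class
\fBFixedTagSketch\fP is hypothetical, tags are integer offsets, and the no\-match
value is assumed to be \fB\-1\fP\&.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
// Hypothetical illustration of fixed tags: only the base tag is tracked, and
// the fixed tag is derived from it. NO_MATCH = -1 is an assumption.
final class FixedTagSketch {
    static final int NO_MATCH = -1;

    // Fixed tag in top-level concatenation: the base tag always matches,
    // so no check is needed.
    static int toplevel(int base, int offset) {
        return base + offset;
    }

    // Fixed tag under alternative or repetition: if the base tag has the
    // no-match value, the fixed tag is no-match too, disregarding the offset.
    static int nested(int base, int offset) {
        return base == NO_MATCH ? NO_MATCH : base + offset;
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT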
.SS Warnings
.sp
Warnings can be individually enabled, disabled and turned into errors.
.INDENT 0.0
.TP
.B \fB\-W\fP
Turn on all warnings.
.TP
.B \fB\-Werror\fP
Turn warnings into errors. Note that this option alone
doesn\(aqt turn on any warnings; it only affects those warnings that have
been turned on so far or will be turned on later.
.TP
.B \fB\-W<warning>\fP
Turn on \fBwarning\fP\&.
.TP
.B \fB\-Wno\-<warning>\fP
Turn off \fBwarning\fP\&.
.TP
.B \fB\-Werror\-<warning>\fP
Turn on \fBwarning\fP and treat it as an error (this implies \fB\-W<warning>\fP).
.TP
.B \fB\-Wno\-error\-<warning>\fP
Don\(aqt treat this particular \fBwarning\fP as an error. This doesn\(aqt turn off
the warning itself.
.UNINDENT
.INDENT 0.0
.TP
.B \fB\-Wcondition\-order\fP
Warn if the generated program makes implicit assumptions about condition
numbering. One should use either \fB\-\-header\fP option or \fBconditions\fP
block to generate a mapping of condition names to numbers and then use the
autogenerated condition names.
.TP
.B \fB\-Wempty\-character\-class\fP
Warn if a regular expression contains an empty character class. Trying to
match an empty character class makes no sense: it should always fail.
However, for backwards compatibility reasons re2java permits empty character
classes and treats them as empty strings. Use the \fB\-\-empty\-class\fP option
to change the default behavior.
.TP
.B \fB\-Wmatch\-empty\-string\fP
Warn if a rule is nullable (matches an empty string).
If the lexer runs in a loop and the empty match is unintentional, the lexer
may unexpectedly hang in an infinite loop.
.TP
.B \fB\-Wswapped\-range\fP
Warn if the lower bound of a range is greater than its upper bound. The
default behavior is to silently swap the range bounds.
.TP
.B \fB\-Wundefined\-control\-flow\fP
Warn if some input strings cause undefined control flow in the lexer (the
faulty patterns are reported). This is a dangerous and common mistake. It
can be easily fixed by adding the default rule \fB*\fP which has the lowest
priority, matches any code unit, and always consumes a single code unit.
.TP
.B \fB\-Wunreachable\-rules\fP
Warn about rules that are shadowed by other rules and will never match.
.TP
.B \fB\-Wdeprecated\-eof_rule\fP
Warn about standalone end of input rules \fB$\fP that will be broken by the
future changes and require fixing. At the moment these rules take precedence
when conflicting with other rules, but after the introduction of generalized
end of input symbol \fB$\fP precedence order will change and these rules will
become shadowed by other rules.
.TP
.B \fB\-Wuseless\-escape\fP
Warn if a symbol is escaped when it shouldn\(aqt be.
By default, re2java silently ignores such escapes, but this may as well
indicate a typo or an error in the escape sequence.
.TP
.B \fB\-Wnondeterministic\-tags\fP
Warn if a tag has \fBn\fP\-th degree of nondeterminism, where \fBn\fP is greater
than 1.
.TP
.B \fB\-Wsentinel\-in\-midrule\fP
Warn if the sentinel symbol occurs in the middle of a rule \-\-\- this may
cause reads past the end of buffer, crashes or memory corruption in the
generated lexer. This warning is only applicable if the sentinel method of
checking for the end of input is used.
It is set to an error if \fBre2c:sentinel\fP configuration is used.
.TP
.B \fB\-Wundefined\-syntax\-config\fP
Warn if the syntax file specified with \fB\-\-syntax\fP option is missing
definitions of some configurations. This helps to maintain user\-defined
syntax files: if a new release adds configurations, old syntax file will
raise a warning, and the user will be notified. If some configurations are
unused and do not need a definition, they should be explicitly set to
\fB<undefined>\fP\&.
.UNINDENT
.SS Syntax files
.sp
Support for different languages in re2c is based on the idea of \fIsyntax files\fP\&.
A syntax file is a configuration file that defines syntax of the target language
\-\- not the whole language, but a small part of it that is used by the generated
code. Syntax files make re2c very flexible, but they should not be used as a
replacement for \fBre2c:\fP configurations: their purpose is to define syntax of
the target language, not to customize one particular lexer. All supported
languages have default syntax files that are part of the distribution (see
\fBinclude/syntax\fP subdirectory); they are also embedded in the re2java binary.
Users may provide a custom syntax file that overrides a few configurations for
one of the supported languages, or they may choose to redefine all configurations
(in that case \fB\-\-lang none\fP option should be used).
Syntax files contain configurations of four different kinds: feature lists,
language configurations, inplace configurations and code templates.
.sp
\fBFeature lists\fP
.INDENT 0.0
.INDENT 3.5
A few list configurations define various features supported by a given
backend, so that re2java may give a clear error if the user tries to enable an
unsupported feature:
.INDENT 0.0
.TP
.B \fBsupported_apis\fP
A list of supported APIs with possible elements \fBsimple\fP, \fBrecord\fP,
\fBgeneric\fP\&.
.TP
.B \fBsupported_api_styles\fP
A list of supported API styles with possible elements \fBfunctions\fP,
\fBfree\-form\fP\&.
.TP
.B \fBsupported_code_models\fP
A list of supported code models with possible elements \fBgoto\-label\fP,
\fBloop\-switch\fP, \fBrecursive\-functions\fP\&.
.TP
.B \fBsupported_targets\fP
A list of supported codegen targets with possible elements \fBcode\fP,
\fBdot\fP, \fBskeleton\fP\&.
.TP
.B \fBsupported_features\fP
A list of supported features with possible elements \fBnested\-ifs\fP,
\fBbitmaps\fP, \fBcomputed\-gotos\fP, \fBcase\-ranges\fP, \fBmonadic\fP, \fBunsafe\fP,
\fBtags\fP, \fBcaptures\fP, \fBcaptvars\fP\&.
.UNINDENT
.UNINDENT
.UNINDENT
.sp
\fBLanguage configurations\fP
.INDENT 0.0
.INDENT 3.5
A few boolean configurations describe features of the target language that
affect re2java parser and code generator:
.INDENT 0.0
.TP
.B \fBsemicolons\fP
Non\-zero if the language uses semicolons after statements.
.TP
.B \fBbacktick_quoted_strings\fP
Non\-zero if the language has backtick\-quoted strings.
.TP
.B \fBsingle_quoted_strings\fP
Non\-zero if the language has single\-quoted strings.
.TP
.B \fBindentation_sensitive\fP
Non\-zero if the language is indentation sensitive.
.TP
.B \fBwrap_blocks_in_braces\fP
Non\-zero if compound statements must be wrapped in curly braces.
.UNINDENT
.UNINDENT
.UNINDENT
.sp
\fBInplace configurations\fP
.INDENT 0.0
.INDENT 3.5
Syntax files define initial values of all \fBre2c:\fP configurations, as they
may differ for different languages. See configurations section for a full list
of all inplace configurations and their meaning.
.UNINDENT
.UNINDENT
.sp
\fBCode templates\fP
.INDENT 0.0
.INDENT 3.5
Code templates define syntax of the target language. They are written in a
simple domain\-specific language with the following formal grammar:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
code\-template ::
      name \(aq=\(aq code\-exprs \(aq;\(aq
    | CODE_TEMPLATE \(aq;\(aq
    | \(aq<undefined>\(aq \(aq;\(aq

code\-exprs ::
      <EMPTY>
    | code\-exprs code\-expr

code\-expr ::
      STRING
    | VARIABLE
    | optional
    | list

optional ::
      \(aq(\(aq CONDITIONAL \(aq?\(aq code\-exprs \(aq)\(aq
    | \(aq(\(aq CONDITIONAL \(aq?\(aq code\-exprs \(aq:\(aq code\-exprs \(aq)\(aq

list ::
      \(aq[\(aq VARIABLE \(aq:\(aq code\-exprs \(aq]\(aq
    | \(aq[\(aq VARIABLE \(aq{\(aq NUMBER \(aq}\(aq \(aq:\(aq code\-exprs \(aq]\(aq
    | \(aq[\(aq VARIABLE \(aq{\(aq NUMBER \(aq,\(aq NUMBER \(aq}\(aq \(aq:\(aq code\-exprs \(aq]\(aq
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
A code template is a sequence of string literals, variables, optional elements
and lists, or a reference to another code template, or a special value
\fB<undefined>\fP\&. Variables are placeholders that are substituted during code
generation phase. List variables are special: when expanding a list template,
re2java repeats the expressions on the right\-hand side of the colon as many
times as there are elements to expand, each time replacing occurrences of the
list variable with a value specific to this repetition. Lists have optional
bounds (negative values are counted from the end, e.g. \fB\-1\fP means the last
element). Conditional names start with a dot.
Both conditionals and variables may be either local (specific to the given
code template) or global (allowed in all code templates). When re2java reads a
syntax file, it checks that each code template uses only the variables and
conditionals that are allowed in it.
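.sp
The bounds rule can be sketched in Java as follows. This is only an
illustration, not the code used by re2java: the class \fBListBounds\fP and its
methods are hypothetical names, and the sketch assumes that both bounds are
inclusive and that a negative bound is counted from the end of the list.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: resolves list bounds as described in the text.
final class ListBounds {
    // Maps a possibly negative bound to a zero-based index:
    // a bound of -1 means the last element, -2 the one before it.
    static int resolve(int bound, int size) {
        return bound < 0 ? size + bound : bound;
    }

    // Returns the elements between the two bounds (both inclusive).
    // The result is empty when the range selects nothing.
    static <T> List<T> select(List<T> items, int from, int to) {
        int lo = Math.max(resolve(from, items.size()), 0);
        int hi = Math.min(resolve(to, items.size()), items.size() - 1);
        List<T> out = new ArrayList<>();
        for (int i = lo; i <= hi; i++) {
            out.add(items.get(i));
        }
        return out;
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Under these assumptions, bounds 0 to 0 select the first element of a
three\-element list and bounds 1 to \-1 select the other two. For a list with a
single element the second range is empty.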
.sp
For example, the following code template defines if\-then\-else construct for a
C\-like language:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
code:if_then_else =
    [branch{0}: topindent \(dqif \(dq cond \(dq {\(dq nl
        indent [stmt: stmt] dedent]
    [branch{1:\-1}: topindent \(dq} else\(dq (.cond ? \(dq if \(dq cond) \(dq {\(dq nl
        indent [stmt: stmt] dedent]
    topindent \(dq}\(dq nl;
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Here \fBbranch\fP is a list variable: \fBbranch{0}\fP expands to the first branch
(which is special, as there is no \fBelse\fP part), \fBbranch{1:\-1}\fP expands to
all remaining branches (if any). \fBstmt\fP is also a list variable:
\fB[stmt: stmt]\fP is a nested list that expands to a list of statements in the
body of the current branch. \fBtopindent\fP, \fBindent\fP, \fBdedent\fP and \fBnl\fP
are global variables, and \fB\&.cond\fP is a local conditional (their meaning is
described below). This code template could produce the following code:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
if x {
    // do something
} else if y {
    // do something else
} else {
    // don\(aqt do anything
}
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Here\(aqs a list of all code templates supported by re2java with their local
variables and conditionals. Note that a particular definition may, but does
not have to, use local variables and conditionals.
Any unused code templates should be set to \fB<undefined>\fP\&.
.INDENT 0.0
.TP
.B \fBcode:var_local\fP
Declaration or definition of a local variable. Supported variables:
\fBtype\fP (the type of the variable), \fBname\fP (its name) and \fBinit\fP
(initial value, if any). Conditionals: \fB\&.init\fP (true if there is an
initializer).
.TP
.B \fBcode:var_global\fP
Same as \fBcode:var_local\fP, except that it\(aqs used at the top level.
.TP
.B \fBcode:const_local\fP
Definition of a local constant. Supported variables: \fBtype\fP (the type
of the constant), \fBname\fP (its name) and \fBinit\fP (initial value).
.TP
.B \fBcode:const_global\fP
Same as \fBcode:const_local\fP, except that it\(aqs used at the top level.
.TP
.B \fBcode:array_local\fP
Definition of a local array (table). Supported variables: \fBtype\fP (the
type of array elements), \fBname\fP (array name), \fBsize\fP (its size),
\fBrow\fP (a list variable that does not itself produce any code, but
expands list expression as many times as there are rows in the table)
and \fBelem\fP (a list variable that expands to all table elements in the
current row \-\- it\(aqs meant to be nested in the \fBrow\fP list).
Supported conditional: \fB\&.const\fP (true if the array is immutable).
.TP
.B \fBcode:array_global\fP
Same as \fBcode:array_local\fP, except that it\(aqs used at the top level.
.TP
.B \fBcode:array_elem\fP
Reference to an element of an array (table). Supported variables:
\fBarray\fP (the name of the array) and \fBindex\fP (index of the element).
.TP
.B \fBcode:enum\fP
Definition of an enumeration (it may be defined using a special language
construct for enumerations, or simply as a few standalone constants).
Supported variables are \fBtype\fP (user\-defined enumeration type or type
of the constants), \fBelem\fP (list variable that expands to the name of
each member) and \fBinit\fP (initializer for each member). Conditionals:
\fB\&.init\fP (true if there is an initializer).
.TP
.B \fBcode:enum_elem\fP
Enumeration element (a member of a user\-defined enumeration type or a
name of a constant, depending on how \fBcode:enum\fP is defined).
Supported variables are \fBname\fP (the name of the element) and \fBtype\fP
(its type).
.TP
.B \fBcode:assign\fP
Assignment statement. Supported variables are \fBlhs\fP (left hand side)
and \fBrhs\fP (right hand side).
.TP
.B \fBcode:type_int\fP
Signed integer type.
.TP
.B \fBcode:type_uint\fP
Unsigned integer type.
.TP
.B \fBcode:type_yybm\fP
Type of elements in the \fByybm\fP table.
.TP
.B \fBcode:type_yytarget\fP
Type of elements in the \fByytarget\fP table.
.TP
.B \fBcode:type_yyctable\fP
Type of elements in the \fByyctable\fP table.
.TP
.B \fBcode:cmp_eq\fP
Operator \(dqequals\(dq.
.TP
.B \fBcode:cmp_ne\fP
Operator \(dqnot equals\(dq.
.TP
.B \fBcode:cmp_lt\fP
Operator \(dqless than\(dq.
.TP
.B \fBcode:cmp_gt\fP
Operator \(dqgreater than\(dq.
.TP
.B \fBcode:cmp_le\fP
Operator \(dqless or equal\(dq.
.TP
.B \fBcode:cmp_ge\fP
Operator \(dqgreater or equal\(dq.
.TP
.B \fBcode:if_then_else\fP
If\-then\-else statement with one or more branches. Supported variables:
\fBbranch\fP (a list variable that does not itself produce any code, but
expands list expression as many times as there are branches), \fBcond\fP
(condition of the current branch) and \fBstmt\fP (a list variable that
expands to all statements in the current branch). Conditionals:
\fB\&.cond\fP (true if the current branch has a condition), \fB\&.many\fP (true
if there\(aqs more than one branch).
.TP
.B \fBcode:if_then_else_oneline\fP
A specialization of \fBcode:if_then_else\fP for the case when all branches
have one\-line statements. If this is \fB<undefined>\fP,
\fBcode:if_then_else\fP is used instead.
.TP
.B \fBcode:switch\fP
A switch statement with one or more cases. Supported variables: \fBexpr\fP
(the switched\-on expression) and \fBcase\fP (a list variable that expands
to all case groups with their code blocks).
.TP
.B \fBcode:switch_cases\fP
A group of switch cases that maps to a single code block. Supported
variables are \fBcase\fP (a list variable that expands to all cases in
this group) and \fBstmt\fP (a list variable that expands to all statements
in the code block).
.TP
.B \fBcode:switch_cases_oneline\fP
A specialization of \fBcode:switch_cases\fP for the case when the code
block consists of a single one\-line statement. If this is
\fB<undefined>\fP, \fBcode:switch_cases\fP is used instead.
.TP
.B \fBcode:switch_case_range\fP
A single switch case that covers a range of values (possibly consisting
of a single value). Supported variable: \fBval\fP (a list variable that
expands to all values in the range). Supported conditionals: \fB\&.many\fP
(true if there\(aqs more than one value in the range) and
\fB\&.char_literals\fP (true if this is a switch on character literals \-\-
some languages provide special syntax for this case).
.TP
.B \fBcode:switch_case_default\fP
Default switch case.
.TP
.B \fBcode:loop\fP
A loop that runs forever (unless interrupted from the loop body).
Supported variables: \fBlabel\fP (loop label), \fBstmt\fP (a list variable
that expands to all statements in the loop body).
.TP
.B \fBcode:continue\fP
Continue statement. Supported variables: \fBlabel\fP (label from which to
continue execution).
.TP
.B \fBcode:goto\fP
Goto statement. Supported variables: \fBlabel\fP (label of the jump
target).
.TP
.B \fBcode:cgoto\fP
Computed \fBgoto\fP statement.
Supported variables: \fBarray\fP (the table containing computed \fBgoto\fP
information), \fBindex\fP (index of the element in the table) and \fBbase\fP
(base label, only used if \fB\&.cgoto.relative\fP is true).
.TP
.B \fBcode:cgoto:data\fP
Initializer expression for a single element in computed \fBgoto\fP table.
Supported variables: \fBlabel\fP (the label that is used to initialize the
current element), \fBtype\fP (underlying type of the elements in the table)
and \fBbase\fP (base label \- only used if \fB\&.cgoto.relative\fP is true).
.TP
.B \fBcode:fndecl\fP
Function declaration. Supported variables: \fBname\fP (function name),
\fBtype\fP (return type), \fBthrow\fP (exceptions thrown by this function,
maps to \fBre2c:yyfn:throw\fP configuration), \fBarg\fP (a list variable that
does not itself produce code, but expands list expression as many times as
there are function arguments), \fBargname\fP (name of the current argument),
\fBargtype\fP (type of the current argument). Conditional: \fB\&.type\fP (true
if this is a non\-void function).
.TP
.B \fBcode:fndef\fP
Like \fBcode:fndecl\fP, but used for function definitions, so it has one
additional list variable \fBstmt\fP that expands to all statements in the
function body.
.TP
.B \fBcode:fncall\fP
Function call statement. Supported variables: \fBname\fP (function name),
\fBretval\fP (l\-value where the return value is stored, if any) and
\fBarg\fP (a list variable that expands to all function arguments).
Conditionals: \fB\&.args\fP (true if the function has arguments) and
\fB\&.retval\fP (true if return value needs to be saved).
.TP
.B \fBcode:tailcall\fP
Tail call statement. Supported variables: \fBname\fP (function name),
and \fBarg\fP (a list variable that expands to all function arguments).
Conditionals: \fB\&.args\fP (true if the function has arguments) and
\fB\&.retval\fP (true if this is a non\-void function).
.TP
.B \fBcode:recursive_functions\fP
Program body with \fB\-\-recursive\-functions\fP code model. Supported
variables: \fBfn\fP (a list variable that does not itself produce any
code, but expands list expression as many times as there are functions),
\fBfndecl\fP (declaration of the current function) and \fBfndef\fP
(definition of the current function).
.TP
.B \fBcode:fingerprint\fP
The fingerprint at the top of the generated output file. Supported
variables: \fBver\fP (re2java version that was used to generate this) and
\fBdate\fP (generation date).
.TP
.B \fBcode:line_info\fP
The format of line directives (if this is set to \fB<undefined>\fP, no
directives are generated). Supported variables: \fBline\fP (line number)
and \fBfile\fP (filename).
.TP
.B \fBcode:abort\fP
A statement that aborts program execution.
.TP
.B \fBcode:yydebug\fP
\fBYYDEBUG\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYDEBUG\fP, \fByyrecord\fP, \fByych\fP (map to the
corresponding \fBre2c:\fP configurations), \fBstate\fP (DFA state number).
.TP
.B \fBcode:yypeek\fP
\fBYYPEEK\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYPEEK\fP, \fBYYCTYPE\fP, \fBYYINPUT\fP, \fBYYCURSOR\fP,
\fByyrecord\fP, \fByych\fP (map to the corresponding \fBre2c:\fP
configurations). Conditionals: \fB\&.cast\fP (true if
\fBre2c:yych:conversion\fP is set to non\-zero).
.TP
.B \fBcode:yyskip\fP
\fBYYSKIP\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYSKIP\fP, \fBYYCURSOR\fP, \fByyrecord\fP (map to the
corresponding \fBre2c:\fP configurations).
.TP
.B \fBcode:yybackup\fP
\fBYYBACKUP\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYBACKUP\fP, \fBYYCURSOR\fP, \fBYYMARKER\fP,
\fByyrecord\fP (map to the corresponding \fBre2c:\fP configurations).
.TP
.B \fBcode:yybackupctx\fP
\fBYYBACKUPCTX\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYBACKUPCTX\fP, \fBYYCURSOR\fP, \fBYYCTXMARKER\fP,
\fByyrecord\fP (map to the corresponding \fBre2c:\fP configurations).
.TP
.B \fBcode:yyskip_yypeek\fP
Combined \fBcode:yyskip\fP and \fBcode:yypeek\fP statement (defaults to
\fBcode:yyskip\fP followed by \fBcode:yypeek\fP).
.TP
.B \fBcode:yypeek_yyskip\fP
Combined \fBcode:yypeek\fP and \fBcode:yyskip\fP statement (defaults to
\fBcode:yypeek\fP followed by \fBcode:yyskip\fP).
.TP
.B \fBcode:yyskip_yybackup\fP
Combined \fBcode:yyskip\fP and \fBcode:yybackup\fP statement (defaults to
\fBcode:yyskip\fP followed by \fBcode:yybackup\fP).
.TP
.B \fBcode:yybackup_yyskip\fP
Combined \fBcode:yybackup\fP and \fBcode:yyskip\fP statement (defaults to
\fBcode:yybackup\fP followed by \fBcode:yyskip\fP).
.TP
.B \fBcode:yybackup_yypeek\fP
Combined \fBcode:yybackup\fP and \fBcode:yypeek\fP statement (defaults to
\fBcode:yybackup\fP followed by \fBcode:yypeek\fP).
.TP
.B \fBcode:yyskip_yybackup_yypeek\fP
Combined \fBcode:yyskip\fP, \fBcode:yybackup\fP and \fBcode:yypeek\fP
statement (defaults to \fBcode:yyskip\fP followed by \fBcode:yybackup\fP
followed by \fBcode:yypeek\fP).
.TP
.B \fBcode:yybackup_yypeek_yyskip\fP
Combined \fBcode:yybackup\fP, \fBcode:yypeek\fP and \fBcode:yyskip\fP
statement (defaults to \fBcode:yybackup\fP followed by \fBcode:yypeek\fP
followed by \fBcode:yyskip\fP).
.TP
.B \fBcode:yyrestore\fP
\fBYYRESTORE\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYRESTORE\fP, \fBYYCURSOR\fP, \fBYYMARKER\fP,
\fByyrecord\fP (map to the corresponding \fBre2c:\fP configurations).
.TP
.B \fBcode:yyrestorectx\fP
\fBYYRESTORECTX\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYRESTORECTX\fP, \fBYYCURSOR\fP, \fBYYCTXMARKER\fP,
\fByyrecord\fP (map to the corresponding \fBre2c:\fP configurations).
.TP
.B \fBcode:yyrestoretag\fP
\fBYYRESTORETAG\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYRESTORETAG\fP, \fBYYCURSOR\fP, \fByyrecord\fP (map
to the corresponding \fBre2c:\fP configurations), \fBtag\fP (the name of tag
variable used to restore position).
.TP
.B \fBcode:yyshift\fP
\fBYYSHIFT\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYSHIFT\fP, \fBYYCURSOR\fP, \fByyrecord\fP (map to the
corresponding \fBre2c:\fP configurations), \fBoffset\fP (the number of code
units to shift the current position).
.TP
.B \fBcode:yyshiftstag\fP
\fBYYSHIFTSTAG\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYSHIFTSTAG\fP, \fByyrecord\fP, \fBnegative\fP (map
to the corresponding \fBre2c:\fP configurations), \fBtag\fP (tag variable
which needs to be shifted), \fBoffset\fP (the number of code units to
shift). Conditionals: \fB\&.nested\fP (true if this is a nested tag \-\- in
this case its value may be equal to \fBre2c:tags:negative\fP, which should
not be shifted).
.TP
.B \fBcode:yyshiftmtag\fP
\fBYYSHIFTMTAG\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYSHIFTMTAG\fP (maps to the corresponding
\fBre2c:\fP configuration), \fBtag\fP (tag variable which needs to be
shifted), \fBoffset\fP (the number of code units to shift).
.TP
.B \fBcode:yystagp\fP
\fBYYSTAGP\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYSTAGP\fP, \fBYYCURSOR\fP, \fByyrecord\fP (map to the
corresponding \fBre2c:\fP configurations), \fBtag\fP (tag variable that
should be updated).
.TP
.B \fBcode:yymtagp\fP
\fBYYMTAGP\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYMTAGP\fP (maps to the corresponding \fBre2c:\fP
configuration), \fBtag\fP (tag variable that should be updated).
.TP
.B \fBcode:yystagn\fP
\fBYYSTAGN\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYSTAGN\fP, \fBnegative\fP, \fByyrecord\fP (map to the
corresponding \fBre2c:\fP configurations), \fBtag\fP (tag variable that
should be updated).
.TP
.B \fBcode:yymtagn\fP
\fBYYMTAGN\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYMTAGN\fP (maps to the corresponding \fBre2c:\fP
configuration), \fBtag\fP (tag variable that should be updated).
.TP
.B \fBcode:yycopystag\fP
\fBYYCOPYSTAG\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYCOPYSTAG\fP, \fByyrecord\fP (map to the
corresponding \fBre2c:\fP configurations), \fBlhs\fP, \fBrhs\fP (left and
right hand side tag variables of the copy operation).
.TP
.B \fBcode:yycopymtag\fP
\fBYYCOPYMTAG\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYCOPYMTAG\fP, \fByyrecord\fP (map to the
corresponding \fBre2c:\fP configurations), \fBlhs\fP, \fBrhs\fP (left and
right hand side tag variables of the copy operation).
.TP
.B \fBcode:yygetaccept\fP
\fBYYGETACCEPT\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYGETACCEPT\fP, \fByyrecord\fP (map to the
corresponding \fBre2c:\fP configurations), \fBvar\fP (maps to
\fBre2c:yyaccept\fP configuration).
.TP
.B \fBcode:yysetaccept\fP
\fBYYSETACCEPT\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYSETACCEPT\fP, \fByyrecord\fP (map to the
corresponding \fBre2c:\fP configurations), \fBvar\fP (maps to
\fBre2c:yyaccept\fP configuration) and \fBval\fP (numeric value of the
accepted rule).
.TP
.B \fBcode:yygetcond\fP
\fBYYGETCOND\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYGETCOND\fP, \fByyrecord\fP (map to the
corresponding \fBre2c:\fP configurations), \fBvar\fP (maps to
\fBre2c:yycond\fP configuration).
.TP
.B \fBcode:yysetcond\fP
\fBYYSETCOND\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYSETCOND\fP, \fByyrecord\fP (map to the
corresponding \fBre2c:\fP configurations), \fBvar\fP (maps to
\fBre2c:yycond\fP configuration) and \fBval\fP (numeric condition
identifier).
.TP
.B \fBcode:yygetstate\fP
\fBYYGETSTATE\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYGETSTATE\fP, \fByyrecord\fP (map to the
corresponding \fBre2c:\fP configurations), \fBvar\fP (maps to
\fBre2c:yystate\fP configuration).
.TP
.B \fBcode:yysetstate\fP
\fBYYSETSTATE\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYSETSTATE\fP, \fByyrecord\fP (map to the
corresponding \fBre2c:\fP configurations), \fBvar\fP (maps to
\fBre2c:yystate\fP configuration) and \fBval\fP (state number).
.TP
.B \fBcode:yylessthan\fP
\fBYYLESSTHAN\fP statement, possibly specialized for different APIs.
Supported variables: \fBYYLESSTHAN\fP, \fBYYCURSOR\fP, \fBYYLIMIT\fP,
\fByyrecord\fP (map to the corresponding \fBre2c:\fP configurations),
\fBneed\fP (the number of code units to check against). Conditional:
\fB\&.many\fP (true if the \fBneed\fP is more than one).
.TP
.B \fBcode:yyend\fP
\fBYYEND\fP expression, possibly specialized for different APIs.
Supported variables: \fBYYEND\fP, \fBYYCURSOR\fP, \fBYYLIMIT\fP\&.
.TP
.B \fBcode:yybm_filter\fP
Condition that is used to filter out \fByych\fP values that are not
covered by the \fByybm\fP table (used with \fB\-\-bitmaps\fP option).
Supported variable: \fByych\fP (maps to \fBre2c:yych\fP configuration).
.TP
.B \fBcode:yybm_match\fP
The format of \fByybm\fP table check (generated with \fB\-\-bitmaps\fP
option). Supported variables: \fByybm\fP, \fByych\fP (map to the
corresponding \fBre2c:\fP configurations), \fBoffset\fP (offset in the
\fByybm\fP table that needs to be added to \fByych\fP) and \fBmask\fP (bit
mask that should be applied to the table entry to retrieve the boolean
value that needs to be checked).
.TP
.B \fBcode:yytarget_filter\fP
Condition that is used to filter out \fByych\fP values that are not
covered by the \fByytarget\fP table (used with \fB\-\-computed\-gotos\fP option).
Supported variable: \fByych\fP (maps to \fBre2c:yych\fP configuration).
.UNINDENT
.sp
Here\(aqs a list of all global variables that are allowed in syntax files:
.INDENT 0.0
.TP
.B \fBnl\fP
A newline.
.TP
.B \fBindent\fP
A variable that does not produce any code, but has a side\-effect of
increasing indentation level.
.TP
.B \fBdedent\fP
A variable that does not produce any code, but has a side\-effect of
decreasing indentation level.
.TP
.B \fBtopindent\fP
Indentation string for the current statement. Indentation level is
tracked and automatically updated by the code generator.
.UNINDENT
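.sp
The effect of these four variables can be sketched with a small Java emitter.
This is a hand\-written illustration, not part of re2java: the class
\fBTemplateEmitter\fP and its methods are hypothetical, the indentation unit
is assumed to be four spaces, and each branch is assumed to hold a single
one\-line statement. The method \fBifThenElse\fP expands the
\fBcode:if_then_else\fP example shown earlier by hand.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
// Hypothetical emitter that mimics nl, indent, dedent and topindent.
final class TemplateEmitter {
    private final StringBuilder out = new StringBuilder();
    private final String unit; // text of one indentation level
    private int level = 0;

    TemplateEmitter(String unit) { this.unit = unit; }

    // indent and dedent produce no text, they only change the level.
    TemplateEmitter indent() { level++; return this; }
    TemplateEmitter dedent() { level--; return this; }

    // topindent produces the indentation string for the current level.
    TemplateEmitter topindent() {
        for (int i = 0; i < level; i++) out.append(unit);
        return this;
    }

    // A string literal is copied to the output verbatim.
    TemplateEmitter text(String s) { out.append(s); return this; }

    // nl produces a newline (character code 10).
    TemplateEmitter nl() { out.append((char) 10); return this; }

    String result() { return out.toString(); }

    // Hand expansion of the if_then_else example template;
    // a null condition marks the final unconditional branch.
    static String ifThenElse(String[] conds, String[] stmts) {
        TemplateEmitter e = new TemplateEmitter("    ");
        for (int i = 0; i < conds.length; i++) {
            e.topindent();
            if (i == 0) {
                e.text("if " + conds[0] + " {");
            } else {
                String cond = conds[i] != null ? " if " + conds[i] : "";
                e.text("} else" + cond + " {");
            }
            e.nl().indent().topindent().text(stmts[i]).nl().dedent();
        }
        return e.topindent().text("}").nl().result();
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Calling \fBifThenElse\fP with conditions \fBx\fP, \fBy\fP and a null third
condition yields an \fBif\fP branch, an \fBelse if\fP branch and a final
\fBelse\fP branch, each body indented by one level.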
.sp
Here\(aqs a list of all global conditionals that are allowed in syntax files:
.INDENT 0.0
.TP
.B \fB\&.api.simple\fP
True if simple API is used (\fB\-\-api simple\fP or \fBre2c:api = simple\fP).
.TP
.B \fB\&.api.generic\fP
True if generic API is used (\fB\-\-api generic\fP or
\fBre2c:api = generic\fP).
.TP
.B \fB\&.api.record\fP
True if record API is used (\fB\-\-api record\fP or \fBre2c:api = record\fP).
.TP
.B \fB\&.api_style.functions\fP
True if function\-like API style is used
(\fBre2c:api\-style = functions\fP).
.TP
.B \fB\&.api_style.freeform\fP
True if free\-form API style is used (\fBre2c:api\-style = free\-form\fP).
.TP
.B \fB\&.case_ranges\fP
True if case ranges feature is enabled (\fB\-\-case\-ranges\fP or
\fBre2c:case\-ranges = 1\fP).
.TP
.B \fB\&.cgoto.relative\fP
True if the relative form of computed \fBgoto\fP is used
(\fB\-\-computed\-gotos\-relative\fP or \fBre2c:cgoto:relative = 1\fP).
.TP
.B \fB\&.code_model.goto_label\fP
True if code model based on goto/label is used (\fB\-\-goto\-label\fP).
.TP
.B \fB\&.code_model.loop_switch\fP
True if code model based on loop/switch is used (\fB\-\-loop\-switch\fP).
.TP
.B \fB\&.code_model.recursive_functions\fP
True if code model based on recursive functions is used
(\fB\-\-recursive\-functions\fP).
.TP
.B \fB\&.date\fP
True if the generated fingerprint should contain generation date.
.TP
.B \fB\&.loop_label\fP
True if loops generated by re2java must have a label (\fBre2c:label:yyloop\fP
is set to a nonempty string).
.TP
.B \fB\&.monadic\fP
True if the generated code should be monadic (\fBre2c:monadic = 1\fP).
This is only relevant for pure functional languages.
.TP
.B \fB\&.start_conditions\fP
True if start conditions are enabled (\fB\-\-start\-conditions\fP).
.TP
.B \fB\&.storable_state\fP
True if storable state is enabled (\fB\-\-storable\-state\fP).
.TP
.B \fB\&.unsafe\fP
True if re2java should use \(dqunsafe\(dq blocks in order to generate faster
code (\fB\-\-unsafe\fP, \fBre2c:unsafe = 1\fP). This is only relevant for
languages that have \(dqunsafe\(dq feature.
.TP
.B \fB\&.version\fP
True if the generated fingerprint should contain re2java version.
.TP
.B \fB\&.yyfill.enable\fP
True if \fBYYFILL\fP is enabled (\fBre2c:yyfill:enable = 1\fP).
.TP
.B \fB\&.yyfn.throw\fP
True if \fBre2c:yyfn:throw\fP configuration is defined to a nonempty string.
.UNINDENT
.UNINDENT
.UNINDENT
.SH HANDLING THE END OF INPUT
.sp
One of the main problems for the lexer is to know when to stop.
There are a few terminating conditions:
.INDENT 0.0
.IP \(bu 2
the lexer may match some rule (including default rule \fB*\fP) and come to a
final state
.IP \(bu 2
the lexer may fail to match any rule and come to a default state
.IP \(bu 2
the lexer may reach the end of input
.UNINDENT
.sp
The first two conditions terminate the lexer in a \(dqnatural\(dq way: it comes to a
state with no outgoing transitions, and the matching automatically stops. The
third condition, end of input, is different: it may happen in any state, and the
lexer should be able to handle it. Checking for the end of input interrupts the
normal lexer workflow and adds conditional branches to the generated program,
therefore it is necessary to minimize the number of such checks. re2java supports
a few different methods for handling the end of input. Which one to use depends
on the complexity of regular expressions, the need for buffering, performance
considerations and other factors. Here is a list of methods:
.INDENT 0.0
.IP \(bu 2
\fBSentinel.\fP
This method eliminates the need for the end of input checks altogether. It is
simple and efficient, but limited to the case when there is a natural
\(dqsentinel\(dq character that can never occur in valid input. This character may
still occur in invalid input, but it should not be allowed by the regular
expressions, except perhaps as the last character of a rule. The sentinel is
appended at the end of input and serves as a stop signal: when the lexer reads
this character, it is either a syntax error or the end of input. In both
cases the lexer should stop. This method is used if \fBYYFILL\fP is disabled
with \fBre2c:yyfill:enable = 0;\fP and \fBre2c:eof\fP has the default value
\fB\-1\fP\&.
.nf

.fi
.sp
.IP \(bu 2
\fBSentinel with bounds checks.\fP
This method is generic: it allows one to handle any input without restrictions on
the regular expressions. The idea is to reduce the number of end of input
checks by performing them only on certain characters. Similar to the
\(dqsentinel\(dq method, one of the characters is chosen as a \(dqsentinel\(dq and
appended at the end of input. However, there is no restriction on where the
sentinel may occur (in fact, any character can be chosen for a sentinel).
When the lexer reads this character, it additionally performs a bounds check.
If the current position is within bounds, the lexer resumes matching and
handles the sentinel as a regular character. Otherwise it invokes \fBYYFILL\fP
(unless it is disabled). If more input is supplied, the lexer will rematch the
last character and continue as if the sentinel wasn\(aqt there. Otherwise it must
be the real end of input, and the lexer stops. This method is used when
\fBre2c:eof\fP has non\-negative value (it should be set to the numeric value of
the sentinel). \fBYYFILL\fP is optional.
.nf

.fi
.sp
.IP \(bu 2
\fBBounds checks with padding.\fP
This method is generic, and it may be faster than the \(dqsentinel with bounds
checks\(dq method, but it is also more complex. The idea is to partition DFA
states into strongly connected components (SCCs) and generate a single check
per SCC for enough characters to cover the longest non\-looping path in this
SCC. This reduces the number of checks, but there is a problem with short
lexemes at the end of input, as the check requires enough characters to cover
the longest lexeme. This can be fixed by padding the input with a few fake
characters that do not form a valid lexeme suffix (so that the lexer cannot
match them). The length of padding should be \fBYYMAXFILL\fP, generated with
a \fBmax\fP block. If there is not enough input, the lexer invokes \fBYYFILL\fP
which should supply at least the required number of characters or not return.
This method is used if \fBYYFILL\fP is enabled and \fBre2c:eof\fP is \fB\-1\fP
(this is the default configuration).
.nf

.fi
.sp
.IP \(bu 2
\fBCustom checks.\fP
Generic API allows one to override basic operations like reading a character,
which makes it possible to include the end\-of\-input checks as part of them.
This approach is error\-prone and should be used with caution. To use a custom
method, enable generic API with \fB\-\-api generic\fP or \fBre2c:api = generic;\fP and
disable default bounds checks with \fBre2c:yyfill:enable = 0;\fP or
\fBre2c:yyfill:check = 0;\fP\&.
.UNINDENT
.sp
The following subsections contain an example of each method.
.SS Sentinel
.sp
This example uses a sentinel character to handle the end of input. The program
counts space\-separated words in a null\-terminated string. The sentinel is null:
it is the last character of each input string, and it is not allowed in the
middle of a lexeme by any of the rules (in particular, it is not included in
character ranges where it is easy to overlook). If a null occurs in the middle
of a string, it is a syntax error and the lexer will match default rule \fB*\fP,
but it won\(aqt read past the end of input or crash (use
\fI\%\-Wsentinel\-in\-midrule\fP
warning and \fBre2c:sentinel\fP configuration to verify this). Configuration
\fBre2c:yyfill:enable = 0;\fP suppresses the generation of bounds checks and
\fBYYFILL\fP invocations.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT

class Main {
    // Expects a null\-terminated string.
    static int lex(String yyinput) {
        int yycursor = 0;
        int count = 0;

        loop: while (true) {
            /*!re2c
                re2c:YYCTYPE = \(dqchar\(dq;
                re2c:YYPEEK = \(dqyyinput.charAt(yycursor)\(dq;
                re2c:yyfill:enable = 0;

                *      { return \-1; }
                [\ex00] { return count; }
                [a\-z]+ { count += 1; continue loop; }
                [ ]+   { continue loop; }
            */
        }
    }

    public static void main(String []args) {
        assert lex(\(dq\e0\(dq) == 0;
        assert lex(\(dqone two three\e0\(dq) == 3;
        assert lex(\(dqf0ur\e0\(dq) == \-1;
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
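.sp
For comparison, the same idea can be written by hand. The following Java
sketch is not re2java output, and the class \fBSentinelSketch\fP is a
hypothetical name. It shows why no bounds checks are needed: the sentinel
belongs to neither character class, so every inner loop stops at the sentinel
at the latest. The sketch assumes that the input ends with a null character.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
// Hand-written sketch of the sentinel technique (not re2java output).
final class SentinelSketch {
    static final char SENTINEL = (char) 0;

    static boolean isLower(char c) {
        return c >= 'a' && c <= 'z';
    }

    // Counts words of lowercase letters separated by spaces.
    // The input must end with SENTINEL; returns -1 on a syntax error.
    static int countWords(String input) {
        int cursor = 0;
        int count = 0;
        while (true) {
            char c = input.charAt(cursor);
            if (c == SENTINEL) {
                // A sentinel at the start of a lexeme stops the lexer.
                return count;
            } else if (isLower(c)) {
                // No bounds check: the sentinel is not a letter, so the
                // loop cannot run past the end of the input.
                do {
                    cursor++;
                } while (isLower(input.charAt(cursor)));
                count++;
            } else if (c == ' ') {
                // Same reasoning: the sentinel is not a space.
                do {
                    cursor++;
                } while (input.charAt(cursor) == ' ');
            } else {
                return -1;
            }
        }
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
If the sentinel is missing, \fBcharAt\fP throws an exception instead of
reading out of bounds, but the lexer result is then undefined, so the caller
remains responsible for appending the sentinel.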
.SS Sentinel with bounds checks
.sp
This example uses sentinel with bounds checks to handle the end of input (this
method was added in version 1.2). The program counts space\-separated
single\-quoted strings. The sentinel character is null, which is specified with
\fBre2c:eof = 0;\fP configuration. As in the \fI\%sentinel\fP method, null is the last
character of each input string, but it is allowed in the middle of a rule (for
example, \fB\(aqaaa\e0aa\(aq\e0\fP is valid input, but \fB\(aqaaa\e0\fP is a syntax error).
Bounds checks are generated in each state that matches an input character, but
they are scoped to the branch that handles null. Bounds checks are of the form
\fBYYLIMIT <= YYCURSOR\fP or \fBYYLESSTHAN(1)\fP with generic API. If the check
condition is true, the lexer has reached the end of input and should stop
(\fBYYFILL\fP is disabled with \fBre2c:yyfill:enable = 0;\fP as the input fits into
one buffer, see the \fI\%YYFILL with sentinel\fP section for an example that uses
\fBYYFILL\fP). Reaching the end of input opens three possibilities: if the lexer
is in the initial state it will match the end\-of\-input rule \fB$\fP, otherwise it
may fall back to a previously matched rule (including default rule \fB*\fP) or go
to a default state, causing
\fI\%\-Wundefined\-control\-flow\fP\&.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT

class Main {
    // Expects a null\-terminated string.
    static int lex(String yyinput) {
        int yycursor = 0;
        int yymarker = 0;
        int yylimit = yyinput.length() \- 1; // yylimit points at the terminating null
        int count = 0;

        loop: while (true) {
            /*!re2c
                re2c:YYCTYPE = \(dqchar\(dq;
                re2c:YYPEEK = \(dqyyinput.charAt(yycursor)\(dq;
                re2c:yyfill:enable = 0;
                re2c:eof = 0;

                str = [\(aq] ([^\(aq\e\e] | [\e\e][^])* [\(aq];

                *    { return \-1; }
                $    { return count; }
                str  { count += 1; continue loop; }
                [ ]+ { continue loop; }
            */
        }
    }

    public static void main(String []args) {
        assert lex(\(dq\e0\(dq) == 0;
        assert lex(\(dq\(aqqu\e0tes\(aq \(aqare\(aq \(aqfine: \e\e\(aq\(aq \e0\(dq) == 3;
        assert lex(\(dq\(aqunterminated\e\e\(aq\e0\(dq) == \-1;
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
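.sp
The bounds check scoped to the sentinel branch can also be sketched by hand.
The Java code below is an illustration only, not re2java output: the class
\fBEofSentinelSketch\fP and its methods are hypothetical names, and backslash
escapes inside strings are left out to keep the sketch short. The comparison
of the cursor with the limit is evaluated only when the character just read is
the sentinel; a sentinel inside a quoted string is an ordinary character.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
// Hand-written sketch of a sentinel with bounds checks (not re2java output).
final class EofSentinelSketch {
    static final char SENTINEL = (char) 0;
    static final char QUOTE = (char) 39; // single quote

    // True only if the character just read is the sentinel and the cursor
    // has reached the limit. The comparison is skipped for all other
    // characters because the first operand is then false.
    static boolean atEnd(char c, int cursor, int limit) {
        return c == SENTINEL && limit <= cursor;
    }

    // Counts space-separated single-quoted strings (escapes not handled).
    // The input must end with SENTINEL; returns -1 on a syntax error.
    static int countQuoted(String input) {
        int limit = input.length() - 1; // index of the terminating sentinel
        int cursor = 0;
        int count = 0;
        while (true) {
            char c = input.charAt(cursor);
            if (atEnd(c, cursor, limit)) {
                return count; // end of input between lexemes
            } else if (c == ' ') {
                cursor++;
            } else if (c == QUOTE) {
                cursor++;
                while (true) {
                    c = input.charAt(cursor);
                    if (atEnd(c, cursor, limit)) {
                        return -1; // end of input inside a string
                    }
                    cursor++;
                    if (c == QUOTE) {
                        break;
                    }
                }
                count++;
            } else {
                return -1; // includes a sentinel outside of a string
            }
        }
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT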
.SS Bounds checks with padding
.sp
This example uses bounds checks with padding to handle the end of input (this
method is enabled by default). The program counts space\-separated single\-quoted
strings. There is a padding of \fBYYMAXFILL\fP null characters appended at the end
of input, where \fBYYMAXFILL\fP value is autogenerated with a \fBmax\fP block. It
is not necessary to use null for padding \-\-\- any characters can be used as long
as they do not form a valid lexeme suffix (in this example padding should not
contain single quotes, as they may be mistaken for a suffix of a single\-quoted
string). There is a \(dqstop\(dq rule that matches the first padding character (null)
and terminates the lexer (note that it checks if null is at the beginning of
padding, otherwise it is a syntax error). Bounds checks are generated only in
some states that are determined by the strongly connected components of the
underlying automaton. Checks have the form \fB(YYLIMIT \- YYCURSOR) < n\fP or
\fBYYLESSTHAN(n)\fP with generic API, where \fBn\fP is the minimum number of
characters that are needed for the lexer to proceed (it also means that the next
bounds check will occur in at most \fBn\fP characters). If the check condition is
true, the lexer has reached the end of input and will invoke \fBYYFILL(n)\fP that
should either supply at least \fBn\fP input characters or not return. In this
example \fBYYFILL\fP always fails and terminates the lexer with an error (which is
fine because the input fits into one buffer). See the \fI\%YYFILL with padding\fP
section for an example that refills the input buffer with \fBYYFILL\fP\&.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT

class Main {
    /*!max:re2c*/

    // Expects an unpadded string; YYMAXFILL padding is appended below.
    static int lex(byte[] str) {
        // Pad string with yymaxfill zeroes at the end.
        byte[] yyinput = new byte[str.length + YYMAXFILL];
        System.arraycopy(str, 0, yyinput, 0, str.length); 

        int yycursor = 0;
        int yylimit = yyinput.length;
        int count = 0;

        loop: while (true) {
            /*!re2c
                re2c:YYCTYPE = \(dqint\(dq;
                re2c:YYPEEK = \(dqByte.toUnsignedInt(yyinput[yycursor])\(dq;
                re2c:YYFILL = \(dqreturn \-1;\(dq;

                str = [\(aq] ([^\(aq\e\e] | [\e\e][^])* [\(aq];

                [\ex00] {
                    // Check that it is the sentinel, not some unexpected null.
                    return (yycursor \- 1 == str.length) ? count : \-1;
                }
                str  { count += 1; continue loop; }
                [ ]+ { continue loop; }
                *    { return \-1; }
            */
        }
    }

    public static void main(String []args) {
        assert lex(\(dq\(dq.getBytes()) == 0;
        assert lex(\(dq\(aqqu\e0tes\(aq \(aqare\(aq \(aqfine: \e\e\(aq\(aq \(dq.getBytes()) == 3;
        assert lex(\(dq\(aqunterminated\e\e\(aq\(dq.getBytes()) == \-1;
        assert lex(\(dq\(aqunexpected \e00 null\e\e\(aq\(dq.getBytes()) == \-1;
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
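.sp
The effect of padding can also be reproduced by hand. The sketch below is not
re2java output: the class \fBPaddingSketch\fP, the constant \fBMAXFILL\fP (a
stand\-in for \fBYYMAXFILL\fP) and the helper \fBlessThan\fP are hypothetical names
chosen for illustration, and escape sequences inside strings are left out for
brevity. It is meant to show why an unterminated string runs through the padding
and is stopped only by the bounds check.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
import java.util.Arrays;

class PaddingSketch {
    // Hypothetical stand-in for the autogenerated YYMAXFILL value.
    static final int MAXFILL = 1;
    static final int QUOTE = 0x27; // single quote character

    // Mirrors the check (YYLIMIT - YYCURSOR) < n.
    static boolean lessThan(int cursor, int limit, int n) {
        return (limit - cursor) < n;
    }

    // Counts space-separated single-quoted strings (no escape sequences).
    // Returns -1 on error.
    static int lex(byte[] str) {
        // Copy the input and append MAXFILL zero bytes of padding.
        byte[] input = Arrays.copyOf(str, str.length + MAXFILL);
        int cursor = 0, limit = input.length, count = 0;
        while (true) {
            if (lessThan(cursor, limit, 1)) { return -1; }
            int c = input[cursor++];
            if (c == 0) {
                // Stop rule: accept only the first padding character.
                return (cursor - 1 == str.length) ? count : -1;
            } else if (c == ' ') {
                continue;
            } else if (c != QUOTE) {
                return -1;
            }
            // Inside a quoted string padding zeroes are consumed like any
            // other character, so only the bounds check can stop the loop.
            do {
                if (lessThan(cursor, limit, 1)) { return -1; } // failed fill
                c = input[cursor++];
            } while (c != QUOTE);
            count += 1;
        }
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Unlike generated code, this sketch checks before every read; re2java emits a
check only in selected states and asks for up to \fBn\fP characters at once.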
.SS Custom checks
.sp
This example uses a custom end\-of\-input handling method based on generic API.
The program counts space\-separated words. It is the same as the
\fI\%sentinel\fP example, except that the input is not null\-terminated. To make up
for the absence of a sentinel character at the end of input, \fBYYPEEK\fP is
redefined to perform a bounds check before it reads the next input character.
This is inefficient because the check is performed on every read. If the cursor
is within bounds, \fBYYPEEK\fP returns the real character, otherwise it returns a
fake sentinel character.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT

class Main {
    // Expects a string without terminating null.
    static int lex(String str) {
        byte[] yyinput = str.getBytes();
        int yycursor = 0;
        int count = 0;

        loop: while (true) {
            /*!re2c
                re2c:api = generic;
                re2c:YYCTYPE = \(dqint\(dq;
                re2c:YYPEEK = \(dq(yycursor < yyinput.length)\(dq
                              \(dq ? Byte.toUnsignedInt(yyinput[yycursor]) : 0\(dq;
                re2c:YYSKIP = \(dqyycursor += 1;\(dq;
                re2c:yyfill:enable = 0;

                *      { return \-1; }
                [\ex00] { return count; }
                [a\-z]+ { count += 1; continue loop; }
                [ ]+   { continue loop; }
            */
        }
    }

    public static void main(String []args) {
        assert lex(\(dq\(dq) == 0;
        assert lex(\(dqone two three\(dq) == 3;
        assert lex(\(dqf0ur\(dq) == \-1;
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
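.sp
The guarded read can be isolated from the generated lexer. The following is a
minimal hand\-written sketch, not re2java output; the class \fBGuardedPeek\fP and
the methods \fBpeek\fP and \fBcountWords\fP are hypothetical names used only for
illustration.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
class GuardedPeek {
    // Returns the byte at the cursor as an unsigned value, or a fake
    // sentinel (zero) when the cursor is past the end of input.
    static int peek(byte[] input, int cursor) {
        return (cursor < input.length) ? Byte.toUnsignedInt(input[cursor]) : 0;
    }

    // Counts words of lowercase letters separated by spaces.
    // Returns -1 on any other character.
    static int countWords(String str) {
        byte[] input = str.getBytes();
        int cursor = 0, count = 0;
        while (true) {
            int c = peek(input, cursor);
            if (c == 0) {
                return count; // real or fake sentinel
            } else if (c == ' ') {
                cursor += 1;
            } else if (c >= 'a' && c <= 'z') {
                // Every read goes through the bounds check in peek().
                while (peek(input, cursor) >= 'a' && peek(input, cursor) <= 'z') {
                    cursor += 1;
                }
                count += 1;
            } else {
                return -1;
            }
        }
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT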
.SH BUFFER REFILLING
.sp
The need for buffering arises when the input cannot be mapped in memory all at
once: either it is too large, or it comes in a streaming fashion (like reading
from a socket). The usual technique in such cases is to allocate a fixed\-sized
memory buffer and process input in chunks that fit into the buffer. When the
current chunk is processed, it is moved out and new data is moved in. In
practice it is somewhat more complex, because lexer state consists not of a
single input position, but a set of interrelated positions:
.INDENT 0.0
.IP \(bu 2
cursor: the next input character to be read (\fBYYCURSOR\fP in C pointer API or
\fBYYSKIP\fP/\fBYYPEEK\fP in generic API)
.IP \(bu 2
limit: the position after the last available input character (\fBYYLIMIT\fP in
C pointer API, implicitly handled by \fBYYLESSTHAN\fP in generic API)
.IP \(bu 2
marker: the position of the most recent match, if any (\fBYYMARKER\fP in
C pointer API or \fBYYBACKUP\fP/\fBYYRESTORE\fP in generic API)
.IP \(bu 2
token: the start of the current lexeme (implicit in re2java API, as it is not
needed for the normal lexer operation and can be defined and updated by the
user)
.IP \(bu 2
context marker: the position of the trailing context (\fBYYCTXMARKER\fP in
C pointer API or \fBYYBACKUPCTX\fP/\fBYYRESTORECTX\fP in generic API)
.IP \(bu 2
tag variables: submatch positions (defined with \fBstags\fP and \fBmtags\fP blocks
and generic API primitives \fBYYSTAGP\fP/\fBYYSTAGN\fP/\fBYYMTAGP\fP/\fBYYMTAGN\fP)
.UNINDENT
.sp
Not all these are used in every case, but if used, they must be updated by
\fBYYFILL\fP\&. All active positions are contained in the segment between token and
cursor, therefore everything between buffer start and token can be discarded,
the segment from token and up to limit should be moved to the beginning of
buffer, and the free space at the end of buffer should be filled with new data.
In order to avoid frequent \fBYYFILL\fP calls it is best to fill in as many input
characters as possible (even though fewer characters might suffice to resume the
lexer). The details of \fBYYFILL\fP implementation are slightly different
depending on which EOF handling method is used: the case of EOF rule is somewhat
simpler than the case of bounds\-checking with padding. Also note that if
\fB\-f \-\-storable\-state\fP option is used, \fBYYFILL\fP has slightly different
semantics (described in the section about storable state).
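.sp
The shift\-and\-refill step can be sketched independently of any EOF handling
method. The class \fBRefillSketch\fP and its field names below are hypothetical,
and the sketch omits sentinel and padding handling, which the following
subsections cover. It only illustrates that all positions move by the same
offset when the buffer contents are shifted.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class RefillSketch {
    final byte[] buf;
    final InputStream in;
    int token, marker, cursor, limit;

    RefillSketch(InputStream in, int size) {
        this.in = in;
        this.buf = new byte[size];
        // All positions start at the end of the (empty) buffer.
        token = marker = cursor = limit = size;
    }

    // Discards everything before token, moves the rest to the buffer start
    // and reads as much new data as fits. Returns false if nothing was added.
    boolean fill() {
        if (token == 0) { return false; } // no free space: lexeme too long
        int shift = token;
        System.arraycopy(buf, token, buf, 0, limit - token);
        token -= shift;
        marker -= shift;
        cursor -= shift;
        limit -= shift;
        try {
            int have = in.read(buf, limit, buf.length - limit);
            if (have <= 0) { return false; } // end of stream
            limit += have;
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT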
.SS YYFILL with sentinel
.sp
If EOF rule is used, \fBYYFILL\fP is a function\-like primitive that accepts
no arguments and returns a value which is checked against zero. \fBYYFILL\fP
invocation is triggered by condition \fBYYLIMIT <= YYCURSOR\fP in C pointer API and
\fBYYLESSTHAN()\fP in generic API. A non\-zero return value means that \fBYYFILL\fP
has failed. A successful \fBYYFILL\fP call must supply at least one character and
adjust input positions accordingly. Limit must always be set to one after the
last input position in buffer, and the character at the limit position must be
the sentinel symbol specified by \fBre2c:eof\fP configuration. The pictures below
show the relative locations of input positions in buffer before and after
\fBYYFILL\fP call (sentinel symbol is marked with \fB#\fP, and the second picture
shows the case when there is not enough input to fill the whole buffer).
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
               <\-\- shift \-\->
             >\-A\-\-\-\-\-\-\-\-\-\-\-\-B\-\-\-\-\-\-\-\-\-C\-\-\-\-\-\-\-\-\-\-\-\-\-D#\-\-\-\-\-\-\-\-\-\-\-E\->
             buffer       token    marker         limit,
                                                  cursor
>\-A\-\-\-\-\-\-\-\-\-\-\-\-B\-\-\-\-\-\-\-\-\-C\-\-\-\-\-\-\-\-\-\-\-\-\-D\-\-\-\-\-\-\-\-\-\-\-\-E#\->
             buffer,  marker        cursor        limit
             token

               <\-\- shift \-\->
             >\-A\-\-\-\-\-\-\-\-\-\-\-\-B\-\-\-\-\-\-\-\-\-C\-\-\-\-\-\-\-\-\-\-\-\-\-D#\-\-E (EOF)
             buffer       token    marker         limit,
                                                  cursor
>\-A\-\-\-\-\-\-\-\-\-\-\-\-B\-\-\-\-\-\-\-\-\-C\-\-\-\-\-\-\-\-\-\-\-\-\-D\-\-\-E#........
             buffer,  marker       cursor limit
             token
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Here is an example of a program that reads input file \fBinput\fP in chunks of
4096 bytes and uses EOF rule.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT

import java.io.*;
import java.nio.file.*;

class Lexer {
    public static final int BUFSIZE = 4096;

    private BufferedInputStream stream;
    private byte[] yyinput;
    private int yycursor;
    private int yymarker;
    private int yylimit;
    private int token;
    private boolean eof;

    public Lexer(File file) throws FileNotFoundException {
        stream = new BufferedInputStream(new FileInputStream(file));
        // Sentinel at \(gayylimit\(ga offset is set to zero, which triggers YYFILL.
        yyinput = new byte[BUFSIZE + 1];
        yycursor = yymarker = yylimit = token = BUFSIZE;
        eof = false;
    }

    private int fill() throws IOException {
        if (eof) { return \-1; } // unexpected EOF

        // Error: lexeme too long. In real life can reallocate a larger buffer.
        if (token < 1) { return \-2; }

        // Shift buffer contents (discard everything up to the current token).
        System.arraycopy(yyinput, token, yyinput, 0, yylimit \- token); 
        yycursor \-= token;
        yymarker \-= token;
        yylimit \-= token;
        token = 0;

        // Fill free space at the end of buffer with new data from file.
        int have = stream.read(yyinput, yylimit, BUFSIZE \- yylimit);
        if (have > 0) { yylimit += have; } // read returns \-1 at end of stream
        yyinput[yylimit] = 0; // append sentinel symbol

        // If read less than expected, this is the end of input.
        eof = yylimit < BUFSIZE;

        return 0;
    }

    // Counts quoted strings in the file, refilling the buffer as needed.
    public int lex() throws IOException {
        int count = 0;
        loop: while (true) {
            token = yycursor;
            /*!re2c
                re2c:YYCTYPE = \(dqint\(dq;
                re2c:YYPEEK = \(dqByte.toUnsignedInt(yyinput[yycursor])\(dq;
                re2c:YYFILL = \(dqfill() == 0\(dq;
                re2c:eof = 0;

                str = [\(aq] ([^\(aq\e\e] | [\e\e][^])* [\(aq];

                *    { return \-1; }
                $    { return count; }
                str  { count += 1; continue loop; }
                [ ]+ { continue loop; }
            */
        }
    }

    public static void main(String []args) throws FileNotFoundException, IOException {
        String fname = \(dqinput\(dq;
        String content = \(dq\(aqqu\e0tes\(aq \(aqare\(aq \(aqfine: \e\e\(aq\(aq \(dq.repeat(Lexer.BUFSIZE);

        // Prepare input file: a few times the size of the buffer, containing
        // strings with zeroes and escaped quotes.
        Files.writeString(Paths.get(fname), content);

        int count = 3 * Lexer.BUFSIZE; // number of quoted strings written to file

        // Prepare lexer state: all offsets are at the end of buffer.
        File file = new File(\(dq.\(dq, fname);
        Lexer lexer = new Lexer(file);

        // Run the lexer.
        int n = lexer.lex();
        assert n == count;

        // Cleanup: remove input file.
        file.delete();
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.SS YYFILL with padding
.sp
In the default case (when EOF rule is not used) \fBYYFILL\fP is a function\-like
primitive that accepts a single argument and does not return any value.
\fBYYFILL\fP invocation is triggered by condition \fB(YYLIMIT \- YYCURSOR) < n\fP in
C pointer API and \fBYYLESSTHAN(n)\fP in generic API. The argument passed to
\fBYYFILL\fP is the minimal number of characters that must be supplied. If it
fails to do so, \fBYYFILL\fP must not return to the lexer (for that reason it is
best implemented as a macro that returns from the calling function on failure).
In case of a successful \fBYYFILL\fP invocation the limit position must be set
either to one after the last input position in buffer, or to the end of
\fBYYMAXFILL\fP padding (in case \fBYYFILL\fP has successfully read at least \fBn\fP
characters, but not enough to fill the entire buffer). The pictures below show
the relative locations of input positions in buffer before and after \fBYYFILL\fP
invocation (\fBYYMAXFILL\fP padding on the second picture is marked with \fB#\fP
symbols).
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
               <\-\- shift \-\->                 <\-\- need \-\->
             >\-A\-\-\-\-\-\-\-\-\-\-\-\-B\-\-\-\-\-\-\-\-\-C\-\-\-\-\-D\-\-\-\-\-\-\-E\-\-\-F\-\-\-\-\-\-\-\-G\->
             buffer       token    marker cursor  limit

>\-A\-\-\-\-\-\-\-\-\-\-\-\-B\-\-\-\-\-\-\-\-\-C\-\-\-\-\-D\-\-\-\-\-\-\-E\-\-\-F\-\-\-\-\-\-\-\-G\->
             buffer,  marker cursor               limit
             token

               <\-\- shift \-\->                 <\-\- need \-\->
             >\-A\-\-\-\-\-\-\-\-\-\-\-\-B\-\-\-\-\-\-\-\-\-C\-\-\-\-\-D\-\-\-\-\-\-\-E\-F        (EOF)
             buffer       token    marker cursor  limit

>\-A\-\-\-\-\-\-\-\-\-\-\-\-B\-\-\-\-\-\-\-\-\-C\-\-\-\-\-D\-\-\-\-\-\-\-E\-F###############
             buffer,  marker cursor                   limit
             token                        <\- YYMAXFILL \->
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Here is an example of a program that reads input file \fBinput\fP in chunks of
4096 bytes and uses bounds\-checking with padding.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT

import java.io.*;
import java.nio.file.*;
import java.util.Arrays;

class Lexer {
    /*!max:re2c*/
    public static final int BUFSIZE = 4096;

    private BufferedInputStream stream;
    private byte[] yyinput;
    private int yycursor;
    private int yylimit;
    private int token;
    private boolean eof;

    public Lexer(File file) throws FileNotFoundException {
        stream = new BufferedInputStream(new FileInputStream(file));
        // Prepare lexer state: all offsets are at the end of buffer.
        // This immediately triggers YYFILL, as the YYLESSTHAN condition is true.
        yyinput = new byte[BUFSIZE + YYMAXFILL];
        yycursor = yylimit = token = BUFSIZE;
        eof = false;
    }

    private int fill(int need) throws IOException {
        if (eof) { return \-1; } // unexpected EOF

        // Error: lexeme too long. In real life can reallocate a larger buffer.
        if (token < need) { return \-2; }

        // Shift buffer contents (discard everything up to the current token).
        System.arraycopy(yyinput, token, yyinput, 0, yylimit \- token); 
        yycursor \-= token;
        yylimit \-= token;
        token = 0;

        // Fill free space at the end of buffer with new data from file.
        int have = stream.read(yyinput, yylimit, BUFSIZE \- yylimit);
        if (have > 0) { yylimit += have; } // read returns \-1 at end of stream
        yyinput[yylimit] = 0; // append sentinel symbol

        // If read less than expected, this is the end of input.
        if (yylimit < BUFSIZE) {
            eof = true;
            Arrays.fill(yyinput, yylimit, yylimit + YYMAXFILL, (byte)0);
            yylimit += YYMAXFILL;
        }

        return 0;
    }

    // Counts quoted strings in the file, refilling the buffer as needed.
    public int lex() throws IOException {
        int count = 0;
        loop: while (true) {
            token = yycursor;
            /*!re2c
                re2c:YYCTYPE = \(dqint\(dq;
                re2c:YYPEEK = \(dqByte.toUnsignedInt(yyinput[yycursor])\(dq;
                re2c:YYFILL = \(dqif (fill(@@) != 0) { return \-2; }\(dq;

                str = [\(aq] ([^\(aq\e\e] | [\e\e][^])* [\(aq];

                [\ex00] {
                    // Check that it is the sentinel, not some unexpected null.
                    return (token == yylimit \- YYMAXFILL) ? count : \-1;
                }
                str  { count += 1; continue loop; }
                [ ]+ { continue loop; }
                *    { return \-1; }
            */
        }
    }

    public static void main(String []args) throws FileNotFoundException, IOException {
        String fname = \(dqinput\(dq;
        String content = \(dq\(aqqu\e0tes\(aq \(aqare\(aq \(aqfine: \e\e\(aq\(aq \(dq.repeat(Lexer.BUFSIZE);

        // Prepare input file: a few times the size of the buffer, containing
        // strings with zeroes and escaped quotes.
        Files.writeString(Paths.get(fname), content);

        int count = 3 * Lexer.BUFSIZE; // number of quoted strings written to file

        // Prepare lexer state: all offsets are at the end of buffer.
        File file = new File(\(dq.\(dq, fname);
        Lexer lexer = new Lexer(file);

        // Run the lexer.
        int n = lexer.lex();
        assert n == count;

        // Cleanup: remove input file.
        file.delete();
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.SH FEATURES
.SS Multiple blocks
.sp
Sometimes it is necessary to have multiple interrelated lexers (for example, if
there is a high\-level state machine that transitions between lexer modes). This
can be implemented using multiple connected re2java blocks. Another option is to
use \fI\%start conditions\fP\&.
.sp
The implementation of connections between blocks depends on the target language.
In languages that have \fBgoto\fP statement (such as C/C++ and Go) one can have
all blocks in one function, each of them prefixed with a label. Transition from
one block to another is a simple \fBgoto\fP\&.
In languages that do not have \fBgoto\fP (such as Rust) it is necessary to use a
loop with a switch on a state variable, similar to the \fByystate\fP loop/switch
generated by re2java, or else wrap each block in a function and use function calls.
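.sp
The loop/switch alternative can be sketched by hand as follows. This is an
illustration only, not re2java output: the class \fBModeSwitchSketch\fP, the
\fBMode\fP enumeration and the simplified grammar (binary with a \fB0b\fP prefix,
otherwise decimal, no overflow checks) are assumptions made for the sketch.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
class ModeSwitchSketch {
    // One value per lexer mode; plays the role of the state variable.
    enum Mode { INIT, BIN, DEC }

    // Parses "0b..." as binary and anything else as decimal.
    // Returns -1 on ill-formed input. Overflow is not handled.
    static int parse(String s) {
        Mode mode = Mode.INIT;
        int pos = 0, number = 0;
        while (true) {
            boolean end = (pos == s.length());
            char c = end ? 0 : s.charAt(pos);
            switch (mode) {
                case INIT:
                    // Dispatch on the prefix, as an initial block would.
                    if (s.startsWith("0b")) { pos = 2; mode = Mode.BIN; }
                    else { mode = Mode.DEC; }
                    break;
                case BIN:
                    if (end) { return number; }
                    if (c != '0' && c != '1') { return -1; }
                    number = number * 2 + (c - '0');
                    pos += 1;
                    break;
                case DEC:
                    if (end) { return number; }
                    if (c < '0' || c > '9') { return -1; }
                    number = number * 10 + (c - '0');
                    pos += 1;
                    break;
            }
        }
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
A transition between modes is an assignment to \fBmode\fP followed by the next
loop iteration, which takes the place of a \fBgoto\fP to another block.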
.sp
The example below uses multiple blocks to parse binary, octal, decimal and
hexadecimal numbers. Each base has its own block. The initial block determines
base and dispatches to other blocks. Common configurations are defined in a
separate block at the beginning of the program; they are inherited by the other
blocks.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT

class Parser {
    private String yyinput;
    private int yycursor;
    private int yymarker;
    private int number;

    private void add_digit(int base, int offset) throws ArithmeticException {
        number = Math.addExact(
            Math.multiplyExact(number, base),
            yyinput.charAt(yycursor \- 1) \- offset);
    }

    public int parse(String str) throws ArithmeticException, IllegalArgumentException {
        yyinput = str;
        yycursor = 0;
        number = 0;

        try {
            /*!re2c
                re2c:YYCTYPE = \(dqchar\(dq;
                re2c:YYPEEK = \(dqyyinput.charAt(yycursor)\(dq;
                re2c:yyfill:enable = 0;

                end = \(dq\ex00\(dq;

                \(aq0b\(aq / [01]        { return parse_bin(); }
                \(dq0\(dq                { return parse_oct(); }
                \(dq\(dq   / [1\-9]       { return parse_dec(); }
                \(aq0x\(aq / [0\-9a\-fA\-F] { return parse_hex(); }
                *                  { throw new IllegalArgumentException(\(dqnot a number\(dq); }
            */
        } catch (Exception e) {
            return \-1;
        }
    }

    private int parse_bin() throws ArithmeticException, IllegalArgumentException {
        /*!re2c
            end   { return number; }
            [01]  { add_digit(2, 48); return parse_bin(); }
            *     { throw new IllegalArgumentException(\(dqill\-formed binary number\(dq); }
        */
    }

    private int parse_oct() throws ArithmeticException, IllegalArgumentException {
        /*!re2c
            end   { return number; }
            [0\-7] { add_digit(8, 48); return parse_oct(); }
            *     { throw new IllegalArgumentException(\(dqill\-formed octal number\(dq); }
        */
    }

    private int parse_dec() throws ArithmeticException, IllegalArgumentException {
        /*!re2c
            end   { return number; }
            [0\-9] { add_digit(10, 48); return parse_dec(); }
            *     { throw new IllegalArgumentException(\(dqill\-formed decimal number\(dq); }
        */
    }

    private int parse_hex() throws ArithmeticException, IllegalArgumentException {
        /*!re2c
            end   { return number; }
            [0\-9] { add_digit(16, 48); return parse_hex(); }
            [a\-f] { add_digit(16, 87); return parse_hex(); }
            [A\-F] { add_digit(16, 55); return parse_hex(); }
            *     { throw new IllegalArgumentException(\(dqill\-formed hexadecimal number\(dq); }
        */
    }

    public static void main(String []args) {
        Parser parser = new Parser();
        assert parser.parse(\(dq1234567890\e0\(dq) == 1234567890;
        assert parser.parse(\(dq0b1101\e0\(dq) == 0b1101;
        assert parser.parse(\(dq0x007Fe\e0\(dq) == 0x7fe;
        assert parser.parse(\(dq0644\e0\(dq) == 0644;
        assert parser.parse(\(dq9999999999\e0\(dq) == \-1;
        assert parser.parse(\(dq123??\e0\(dq) == \-1;
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.SS Start conditions
.sp
Start conditions are enabled with \fB\-\-start\-conditions\fP option. They provide a
way to encode multiple interrelated automata within the same re2java block.
.sp
Each condition corresponds to a single automaton and has a unique name specified
by the user and a unique internal number defined by re2java\&. The numbers are used
to switch between conditions: the generated code uses \fBYYGETCOND\fP and
\fBYYSETCOND\fP primitives to get the current condition or set it to the
given number. Use \fBconditions\fP block, \fB\-\-header\fP option or \fBre2c:header\fP
configuration to generate numeric condition identifiers. Configuration
\fBre2c:cond:enumprefix\fP specifies the generated identifier prefix.
.sp
In condition mode every rule must be prefixed with a list of comma\-separated
condition names in angle brackets, or a wildcard \fB<*>\fP to denote all
conditions. The rule syntax is extended as follows:
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.TP
.B \fB< condition\-list > regular\-expression code\fP
A rule that is merged to every condition on the \fBcondition\-list\fP\&.
It matches \fBregular\-expression\fP and executes the associated \fBcode\fP\&.
.TP
.B \fB< condition\-list > regular\-expression => condition code\fP
A rule that is merged to every condition on the \fBcondition\-list\fP\&.
It matches \fBregular\-expression\fP, sets the current condition to
\fBcondition\fP and executes the associated \fBcode\fP\&.
.TP
.B \fB< condition\-list > regular\-expression :=> condition\fP
A rule that is merged to every condition on the \fBcondition\-list\fP\&.
It matches \fBregular\-expression\fP and immediately transitions to
\fBcondition\fP (there is no semantic action).
.TP
.B \fB< condition\-list > !action code\fP
A rule that binds \fBcode\fP to the place defined by \fBaction\fP in every
condition on the \fBcondition\-list\fP (see the \fI\%actions\fP section for
various types of actions).
.TP
.B \fB<! condition\-list > code\fP
A rule that prepends \fBcode\fP to semantic actions of all rules for every
condition on the \fBcondition\-list\fP\&. This syntax is deprecated and the
\fB!pre_rule\fP action should be used instead (it does exactly the same).
.TP
.B \fB< > code\fP
A rule that creates a special entry condition with number zero and name
\fB\(dq0\(dq\fP that executes \fBcode\fP before jumping to other conditions.
This syntax is deprecated, and the \fB!entry\fP action should be used
instead (it provides a more fine\-grained control, as the code can be
specified on a per\-condition basis, and one can jump directly to
condition start without going through condition dispatch).
.TP
.B \fB< > => condition code\fP
Same as the previous rule, except that it sets the next \fBcondition\fP\&.
.TP
.B \fB< > :=> condition\fP
Same as the previous rule, except that it has no associated code and
immediately jumps to \fBcondition\fP\&.
.UNINDENT
.UNINDENT
.UNINDENT
.sp
The code re2java generates for conditions depends on whether re2java uses
goto/label approach or loop/switch approach to encode the automata.
.sp
In languages that have \fBgoto\fP statement (such as C/C++ and Go) conditions are
naturally implemented as blocks of code prefixed with labels of the form
\fByyc_<cond>\fP, where \fBcond\fP is a condition name (label prefix can be changed
with \fBre2c:cond:prefix\fP). Transitions between conditions are implemented using
\fBgoto\fP and condition labels. Before all conditions re2java generates an initial
switch on \fBYYGETCOND\fP that jumps to the start state of the current condition.
The shortcut rules \fB:=>\fP bypass the initial switch and jump directly to the
specified condition (\fBre2c:cond:goto\fP can be used to change the default
behavior). The rules with semantic actions do not automatically jump to the next
condition; this should be done by the user\-defined action code.
.sp
In languages that do not have \fBgoto\fP (such as Rust) re2java reuses the
\fByystate\fP variable to store condition numbers. Each condition gets a numeric
identifier equal to the number of its start state, and a switch between
conditions is no different than a switch between DFA states of a single
condition. There is no need for a separate initial condition switch.
(Since the same approach is used to implement storable states,
\fBYYGETCOND\fP/\fBYYSETCOND\fP are redundant if both storable states and
conditions are used).
.sp
The program below uses start conditions to parse binary, octal, decimal and
hexadecimal numbers. There is a single block where each base has its own
condition, and the initial condition is connected to all of them. User\-defined
variable \fByycond\fP stores the current condition number; it is initialized to the
number of the initial condition generated with \fBconditions\fP block.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT \-c

class Parser {
    /*!conditions:re2c*/
    private String yyinput;
    private int yycursor;
    private int yymarker;
    private int number;

    private void add_digit(int base, int offset) throws ArithmeticException {
        number = Math.addExact(
            Math.multiplyExact(number, base),
            yyinput.charAt(yycursor \- 1) \- offset);
    }

    public int parse(String str) throws ArithmeticException, IllegalArgumentException {
        yyinput = str;
        yycursor = 0;
        int yycond = YYC_init;

        number = 0;
        try {
            loop: while (true) {
            /*!re2c
                re2c:YYCTYPE = \(dqchar\(dq;
                re2c:YYPEEK = \(dqyyinput.charAt(yycursor)\(dq;
                re2c:yyfill:enable = 0;

                <*> * { throw new IllegalArgumentException(\(dqill\-formed number\(dq); }

                <init> \(aq0b\(aq / [01]        :=> bin
                <init> \(dq0\(dq                :=> oct
                <init> \(dq\(dq   / [1\-9]       :=> dec
                <init> \(aq0x\(aq / [0\-9a\-fA\-F] :=> hex

                <bin, oct, dec, hex> \(dq\ex00\(dq { return number; }

                <bin> [01]  { add_digit(2, 48); continue loop; }
                <oct> [0\-7] { add_digit(8, 48); continue loop; }
                <dec> [0\-9] { add_digit(10, 48); continue loop; }
                <hex> [0\-9] { add_digit(16, 48); continue loop; }
                <hex> [a\-f] { add_digit(16, 87); continue loop; }
                <hex> [A\-F] { add_digit(16, 55); continue loop; }
            */
            }
        } catch (Exception e) {
            return \-1;
        }
    }

    public static void main(String []args) {
        Parser parser = new Parser();
        assert parser.parse(\(dq1234567890\e0\(dq) == 1234567890;
        assert parser.parse(\(dq0b1101\e0\(dq) == 0b1101;
        assert parser.parse(\(dq0x007Fe\e0\(dq) == 0x7fe;
        assert parser.parse(\(dq0644\e0\(dq) == 0644;
        assert parser.parse(\(dq9999999999\e0\(dq) == \-1;
        assert parser.parse(\(dq123??\e0\(dq) == \-1;
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.SS Storable state
.sp
With \fB\-\-storable\-state\fP option re2java generates a lexer that can store
its current state, return to the caller, and later resume operations exactly
where it left off. The default mode of operation in re2java is a \(dqpull\(dq model,
in which the lexer \(dqpulls\(dq more input whenever it needs it. This may be
unacceptable in cases when the input becomes available piece by piece (for
example, if the lexer is invoked by the parser, or if the lexer program
communicates via a socket protocol with some other program that must wait for a
reply from the lexer before it transmits the next message). Storable state
feature is intended exactly for such cases: it allows one to generate lexers that
work in a \(dqpush\(dq model. When the lexer needs more input, it stores its state and
returns to the caller. Later, when more input becomes available, the caller
resumes the lexer exactly where it stopped. There are a few changes necessary
compared to the \(dqpull\(dq model:
.INDENT 0.0
.IP \(bu 2
Define \fBYYGETSTATE()\fP and \fBYYSETSTATE(state)\fP primitives.
.IP \(bu 2
Define \fByych\fP, \fByyaccept\fP (if used) and \fBstate\fP variables as a part of
persistent lexer state. The \fBstate\fP variable should be initialized to \fB\-1\fP\&.
.IP \(bu 2
\fBYYFILL\fP should return to the outer program instead of trying to supply more
input. Return code should indicate that lexer needs more input.
.IP \(bu 2
The outer program should recognize situations when lexer needs more input and
respond appropriately.
.IP \(bu 2
Optionally use \fBgetstate\fP block to generate \fBYYGETSTATE\fP switch detached
from the main lexer. This only works for languages that have \fBgoto\fP (not in
\fB\-\-loop\-switch\fP mode).
.IP \(bu 2
Use \fBre2c:eof\fP and the \fI\%sentinel with bounds checks\fP method to handle the
end of input. Padding\-based method may not work because it is unclear when to
append padding: the current end of input may not be the ultimate end of input,
and appending padding too early may cut off a partially read greedy lexeme.
Furthermore, due to high\-level program logic getting more input may depend on
processing the lexeme at the end of buffer (which already is blocked due to
the end\-of\-input condition).
.UNINDENT
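.sp
The essence of the \(dqpush\(dq model can be shown with a hand\-written sketch that
keeps its position and state in fields and returns whenever input runs out. It
is not re2java output: the class \fBPushLexerSketch\fP, the method \fBfeed\fP and
the two\-valued \fBstate\fP encoding are hypothetical, and the sketch stores all
input in a growing \fBStringBuilder\fP instead of a fixed\-size buffer.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
class PushLexerSketch {
    enum Status { WAITING, BAD_PACKET }

    // Persistent lexer state: survives between calls to lex().
    private final StringBuilder input = new StringBuilder();
    private int cursor = 0;
    private int state = -1; // -1: at the start of a packet, 1: inside a packet
    int received = 0;

    // Called by the outer program when more input becomes available.
    void feed(String chunk) {
        input.append(chunk);
    }

    // Recognizes packets of the form [a-z]+; and resumes exactly where the
    // previous call stopped.
    Status lex() {
        while (true) {
            if (cursor == input.length()) {
                // Out of input: cursor and state are fields, nothing is lost.
                return Status.WAITING;
            }
            char c = input.charAt(cursor);
            if (c >= 'a' && c <= 'z') {
                state = 1;
            } else if (c == ';' && state == 1) {
                received += 1;
                state = -1;
            } else {
                return Status.BAD_PACKET;
            }
            cursor += 1;
        }
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The caller alternates between \fBfeed\fP and \fBlex\fP until it has no more input
or \fBlex\fP reports an error; a packet split across two chunks is still counted
once.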
.sp
Here is an example of a \(dqpush\(dq model lexer that simulates reading packets from a
socket. The lexer loops until it encounters the end of input and returns to the
calling function. The calling function provides more input by \(dqsending\(dq the next
packet and resumes lexing. This process stops when all the packets have been
sent, or when there is an error.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT \-f

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;

class Lexer {
    enum Status {
        END,
        READY,
        WAITING,
        BIG_PACKET,
        BAD_PACKET
    };

    // Use a small buffer to cover the case when a lexeme doesn\(aqt fit.
    // In real world use a larger buffer.
    public static final int BUFSIZE = 10;

    public static class State {
        Pipe.SourceChannel source;
        byte[] yyinput;
        int yycursor;
        int yymarker;
        int yylimit;
        int token;
        int yystate;
        int received;

        public State(Pipe pipe) {
            source = pipe.source();
            // Sentinel at \(gayylimit\(ga offset is set to zero, which triggers YYFILL.
            yyinput = new byte[BUFSIZE + 1];
            yycursor = yymarker = yylimit = token = BUFSIZE;
            yystate = \-1;
            received = 0;
        }
    }

    private static void log(String format, Object... args) {
        if (false) { System.out.printf(format + \(dq\en\(dq, args); }
    }

    private static Status fill(State st) throws IOException {
        // Error: lexeme too long. In real life can reallocate a larger buffer.
        if (st.token < 1) { return Status.BIG_PACKET; }

        // Shift buffer contents (discard everything up to the current token).
        System.arraycopy(st.yyinput, st.token, st.yyinput, 0, st.yylimit \- st.token); 
        st.yycursor \-= st.token;
        st.yymarker \-= st.token;
        st.yylimit \-= st.token;
        st.token = 0;

        // Fill free space at the end of buffer with new data from the pipe.
        ByteBuffer buffer = ByteBuffer.wrap(st.yyinput, st.yylimit, BUFSIZE \- st.yylimit);
        int have = st.source.read(buffer);
        if (have != \-1) st.yylimit += have; // \-1 means that pipe is closed
        st.yyinput[st.yylimit] = 0; // append sentinel symbol

        return Status.READY;
    }

    private static Status lex(State yyrecord) {
        int yych;
        loop: while (true) {
            yyrecord.token = yyrecord.yycursor;
            /*!re2c
                re2c:api = record;
                re2c:YYCTYPE = \(dqint\(dq;
                re2c:YYPEEK = \(dqByte.toUnsignedInt(yyrecord.yyinput[yyrecord.yycursor])\(dq;
                re2c:YYFILL = \(dqreturn Status.WAITING;\(dq;
                re2c:eof = 0;

                packet = [a\-z]+[;];

                *      { return Status.BAD_PACKET; }
                $      { return Status.END; }
                packet { yyrecord.received += 1; continue loop; }
            */
        }
    }

    public static void test(String[] packets, Status expect) throws IOException {
        // Create a pipe.
        Pipe pipe = Pipe.open();
        Pipe.SinkChannel sink = pipe.sink();

        // Initialize lexer state
        Lexer.State st = new Lexer.State(pipe);

        // Main loop. The buffer contains incomplete data which appears packet by
        // packet. When the lexer needs more input it saves its internal state and
        // returns to the caller which should provide more input and resume lexing.
        int send = 0;
        Status status;
        while (true) {
            status = lex(st);

            if (status == Status.END) {
                log(\(dqdone: got %d packets\(dq, st.received);
                break;
            } else if (status == Status.WAITING) {
                log(\(dqwaiting...\(dq);

                if (send < packets.length) {
                    log(\(dqsent packet %d: %s\(dq, send, packets[send]);
                    ByteBuffer buffer = ByteBuffer.wrap(packets[send].getBytes());
                    sink.write(buffer);
                    send += 1;
                } else {
                    sink.close();
                }

                status = fill(st);
                if (status == Status.BIG_PACKET) {
                    log(\(dqerror: packet too big\(dq);
                    break;
                }
                assert status == Status.READY;
            } else {
                assert status == Status.BAD_PACKET;
                log(\(dqerror: ill\-formed packet\(dq);
                break;
            }
        }

        // Check results.
        assert status == expect;
        if (status == Status.END) {
            assert send == st.received;
        }
    }

    public static void main(String []args) throws IOException {
        test(new String[]{}, Status.END);
        test(new String[]{\(dqzero;\(dq, \(dqone;\(dq, \(dqtwo;\(dq, \(dqthree;\(dq, \(dqfour;\(dq}, Status.END);
        test(new String[]{\(dqzer0;\(dq}, Status.BAD_PACKET);
        test(new String[]{\(dqgoooooooooogle;\(dq}, Status.BIG_PACKET);
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.SS Reusable blocks
.sp
Reusable blocks of the form \fB/*!rules:re2c[:<name>] ... */\fP or
\fB%{rules[:<name>] ... %}\fP can be reused any number of times and combined with
other re2java blocks. The \fB<name>\fP is optional. A rules block can be used in a
\fBuse\fP block or directive. The code for a rules block is generated at every
point of use.
.sp
Use blocks are defined with \fB/*!use:re2c[:<name>] ... */\fP or
\fB%{use[:<name>] ... %}\fP\&. The \fB<name>\fP is optional: if it\(aqs not specified,
the associated rules block is the most recent one (whether named or unnamed).
A use block can add named definitions, configurations and rules of its own.
An important use case for use blocks is a lexer that supports multiple input
encodings: the same rules block is reused multiple times with encoding\-specific
configurations (see the example below).
.sp
The in\-block use directive \fB!use:<name>;\fP can be used from inside of a re2java
block. It merges the referenced block \fB<name>\fP into the current one. If some
of the merged rules and configurations overlap with the previously defined ones,
conflicts are resolved in the usual way: the earliest rule takes priority, and
the latest configuration overrides preceding ones. The exceptions are the special
rules \fB*\fP, \fB$\fP and (in condition mode) \fB<!>\fP, for which a block\-local
definition overrides any inherited ones. The use directive allows one to combine
different re2java blocks together in one block (see the example below).
.sp
Named blocks and the in\-block use directive were added in re2java version 2.2.
Since that version reusable blocks are allowed by default (no special option
is needed). Before version 2.2 reuse mode was enabled with the \fB\-r \-\-reusable\fP
option. Before version 1.2 reusable blocks could not be mixed with normal
blocks.
.SS Example of a \fB!use\fP directive
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT

// This example shows how to combine reusable re2c blocks: two blocks
// (\(aqcolors\(aq and \(aqfish\(aq) are merged into one. The \(aqsalmon\(aq rule occurs
// in both blocks; the \(aqfish\(aq block takes priority because it is used
// earlier. Default rule * occurs in all three blocks; the local (not
// inherited) definition takes priority.

/*!rules:re2c:colors
    *                            { throw new IllegalArgumentException(\(dqah\(dq); }
    \(dqred\(dq | \(dqsalmon\(dq | \(dqmagenta\(dq { return Ans.COLOR; }
*/

/*!rules:re2c:fish
    *                            { throw new IllegalArgumentException(\(dqoh\(dq); }
    \(dqhaddock\(dq | \(dqsalmon\(dq | \(dqeel\(dq { return Ans.FISH; }
*/

class Main {
    enum Ans {COLOR, FISH, DUNNO};

    static Ans lex(String yyinput) { // no\-throw, as \(aq*\(aq rules are overridden
        int yycursor = 0;
        int yymarker = 0;

        /*!re2c
            re2c:yyfill:enable = 0;
            re2c:YYCTYPE = \(dqchar\(dq;
            re2c:YYPEEK = \(dqyyinput.charAt(yycursor)\(dq;

            !use:fish;
            !use:colors;
            * { return Ans.DUNNO; } // overrides inherited \(aq*\(aq rules
        */
    }

    public static void main(String []args) {
        assert lex(\(dqsalmon\(dq) == Ans.FISH;
        assert lex(\(dqwhat?\(dq) == Ans.DUNNO;
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.SS Example of a \fB/*!use:re2c ... */\fP block
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT \-\-input\-encoding utf8

// This example supports multiple input encodings: UTF\-8 and UTF\-32.
// Both lexers are generated from the same rules block, and the use
// blocks add only encoding\-specific configurations.

/*!rules:re2c
    re2c:yyfill:enable = 0;
    re2c:YYPEEK = \(dqyyinput[yycursor]\(dq;
    re2c:indent:top = 1;

    \(dq∀x ∃y\(dq { return true; }
    *       { return false; }
*/

class Main {
    static boolean lex_utf8(int[] yyinput) {
        int yycursor = 0;
        int yymarker = 0;
        /*!use:re2c
            re2c:YYCTYPE = \(dqint\(dq; // Java lacks unsigned 8\-bit integer type
            re2c:encoding:utf8 = 1;
        */
    }

    static boolean lex_utf32(int[] yyinput) {
        int yycursor = 0;
        int yymarker = 0;
        /*!use:re2c
            re2c:YYCTYPE = \(dqint\(dq;
            re2c:encoding:utf32 = 1;
        */
    }

    public static void main(String []args) {
        // we have to use \(gaint\(ga, because \(gabyte\(ga in Java cannot represent values greater than 127
        int[] s_utf8 = new int[]{0xe2, 0x88, 0x80, 0x78, 0x20, 0xe2, 0x88, 0x83, 0x79};
        assert lex_utf8(s_utf8);

        int[] s_utf32 = new int[]{0x2200, 0x78, 0x20, 0x2203, 0x79};
        assert lex_utf32(s_utf32);
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.SS Submatch extraction
.sp
re2java has two options for submatch extraction.
.INDENT 0.0
.TP
.B \fBTags\fP
The first option is to use standalone \fItags\fP of the form \fB@stag\fP or
\fB#mtag\fP, where \fBstag\fP and \fBmtag\fP are arbitrary user\-defined names.
Tags are enabled with the \fB\-T \-\-tags\fP option or \fBre2c:tags = 1\fP
configuration. Semantically tags are position markers: they can be
inserted anywhere in a regular expression, and they bind to the
corresponding position (or multiple positions) in the input string.
\fIS\-tags\fP bind to the last matching position, and \fIm\-tags\fP bind to a list of
positions (they may be used in repetition subexpressions, where a single
position in a regular expression corresponds to multiple positions in the
input string). All tags should be defined by the user, either manually or
with the help of \fBsvars\fP and \fBmvars\fP blocks. If there is more than one
way tags can be matched against the input, ambiguity is resolved using the
leftmost greedy disambiguation strategy.
.TP
.B \fBCaptures\fP
The second option is to use \fIcapturing groups\fP\&. They are enabled with the
\fB\-\-captures\fP option or \fBre2c:captures = 1\fP configuration. There are two
flavours with different disambiguation policies: \fB\-\-leftmost\-captures\fP
(the default) uses the leftmost greedy policy, and \fB\-\-posix\-captures\fP uses
the POSIX longest\-match policy. In this mode all parenthesized
subexpressions are considered capturing groups, and a bang can be used to
mark non\-capturing groups: \fB(! ... )\fP\&. With the \fB\-\-invert\-captures\fP option or
\fBre2c:invert\-captures = 1\fP configuration the meaning of bang is inverted.
The number of groups for the matching rule is stored in a variable
\fByynmatch\fP (the whole regular expression is group number zero), and
submatch results are stored in the \fByypmatch\fP array. Both \fByynmatch\fP and
\fByypmatch\fP should be defined by the user, and \fByypmatch\fP size must be at
least \fB[yynmatch * 2]\fP\&. Use a \fBmaxnmatch\fP block to define \fBYYMAXNMATCH\fP,
a constant equal to the maximum value of \fByynmatch\fP among all rules.
.TP
.B \fBCaptvars\fP
Another way to use capturing groups is the \fB\-\-captvars\fP option or
\fBre2c:captvars = 1\fP configuration. The only difference from \fB\-\-captures\fP
is in the way the generated code stores submatch results: instead of
\fByynmatch\fP and \fByypmatch\fP re2java generates variables \fByytl<k>\fP and
\fByytr<k>\fP for the \fIk\fP\-th capturing group (the user should declare these using
an \fBsvars\fP block). Captures with variables support two disambiguation
policies: \fB\-\-leftmost\-captvars\fP or \fBre2c:leftmost\-captvars = 1\fP for
the leftmost greedy policy (the default one) and \fB\-\-posix\-captvars\fP or
\fBre2c:posix\-captvars = 1\fP for the POSIX longest\-match policy.
.UNINDENT
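.sp
The \fByypmatch\fP layout can be sketched with a small helper. This is a minimal
sketch, not generated code: the class \fBCaptureGroups\fP and the method \fBgroup\fP
are hypothetical names, and it assumes that group \fIk\fP occupies slots \fB2 * k\fP
and \fB2 * k + 1\fP of \fByypmatch\fP and that an unmatched group is marked with \fB\-1\fP\&.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
import java.util.Optional;

class CaptureGroups {
    // Hypothetical helper (not part of re2java): returns the text of group k,
    // assuming yypmatch[2 * k] and yypmatch[2 * k + 1] hold its start and end
    // offsets and that -1 marks a group that did not take part in the match.
    static Optional<String> group(String input, int[] yypmatch, int yynmatch, int k) {
        if (k < 0 || k >= yynmatch) {
            throw new IndexOutOfBoundsException("no such group: " + k);
        }
        int start = yypmatch[2 * k];
        int end = yypmatch[2 * k + 1];
        if (start == -1 || end == -1) {
            return Optional.empty();
        }
        return Optional.of(input.substring(start, end));
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT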
.sp
Under the hood all these options translate into tags and
\fI\%Tagged Deterministic Finite Automata with Lookahead\fP\&.
The core idea of TDFA is to minimize the overhead on submatch extraction.
In the extreme, if there are no tags or captures in a regular expression, TDFA is
just an ordinary DFA. If the number of tags is moderate, the overhead is barely
noticeable. The generated TDFA uses a number of \fItag variables\fP which do not map
directly to tags: a single variable may be used for different tags, and a tag
may require multiple variables to hold all its possible values. Eventually
ambiguity is resolved, and only one final variable per tag survives. Tag
variables should be defined using \fBstags\fP or \fBmtags\fP blocks. If lexer state
is stored, tag variables should be part of it. They also need to be updated by
\fBYYFILL\fP\&.
.sp
S\-tags support the following operations:
.INDENT 0.0
.IP \(bu 2
save input position to an s\-tag: \fBt = YYCURSOR\fP with C pointer API or a
user\-defined operation \fBYYSTAGP(t)\fP with generic API
.IP \(bu 2
save default value to an s\-tag: \fBt = NULL\fP with C pointer API or a
user\-defined operation \fBYYSTAGN(t)\fP with generic API
.IP \(bu 2
copy one s\-tag to another: \fBt1 = t2\fP
.UNINDENT
.sp
M\-tags support the following operations:
.INDENT 0.0
.IP \(bu 2
append input position to an m\-tag: a user\-defined operation \fBYYMTAGP(t)\fP
with both default and generic API
.IP \(bu 2
append default value to an m\-tag: a user\-defined operation \fBYYMTAGN(t)\fP
with both default and generic API
.IP \(bu 2
copy one m\-tag to another: \fBt1 = t2\fP
.UNINDENT
.sp
S\-tags can be implemented as scalar values (pointers or offsets). M\-tags need a
more complex representation, as they need to store a sequence of tag values. The
most naive and inefficient representation of an m\-tag is a list (array, vector)
of tag values; a more efficient representation is to store all m\-tags in a
prefix\-tree represented as an array of nodes \fB(v, p)\fP, where \fBv\fP is a tag value
and \fBp\fP is a pointer to the parent node.
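.sp
The prefix\-tree representation can be sketched as follows. This is an
illustrative sketch rather than code generated by re2java: the class
\fBMtagTrie\fP and its methods \fBadd\fP and \fBunwind\fP are hypothetical names. An
m\-tag is then just the index of the last node of its history, so copying one
m\-tag to another is a plain integer assignment, and histories with a common
prefix share nodes.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class MtagTrie {
    // Sentinel index for an empty m-tag (no history yet).
    static final int ROOT = -1;

    // Node i is the pair (values.get(i), parents.get(i)).
    private final List<Integer> values = new ArrayList<>();
    private final List<Integer> parents = new ArrayList<>();

    // Appends value v to the history ending at node `tag` and returns the
    // index of the new node. The old history stays intact and is shared.
    int add(int tag, int v) {
        values.add(v);
        parents.add(tag);
        return values.size() - 1;
    }

    // Walks parent links from `tag` to the root and returns the values in
    // the order in which they were appended.
    List<Integer> unwind(int tag) {
        List<Integer> result = new ArrayList<>();
        for (int i = tag; i != ROOT; i = parents.get(i)) {
            result.add(values.get(i));
        }
        Collections.reverse(result);
        return result;
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT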
.sp
Here is a simple example of using s\-tags to parse semantic versions consisting
of three numeric components: major, minor, patch (the latter is optional).
See below for a more complex example that uses \fBYYFILL\fP\&.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT

import java.util.Optional;

class Main {
    static class SemVer {
        int major;
        int minor;
        int patch;

        public SemVer(int m, int n, int k) {
            major = m;
            minor = n;
            patch = k;
        }

        public boolean equals(SemVer v) {
            return major == v.major && minor == v.minor && patch == v.patch;
        }
    };

    static Optional<SemVer> parse(String yyinput) {
        int yycursor = 0;
        int yymarker = 0;

        // Final tag variables available in semantic action.
        /*!svars:re2c format = \(dqint @@;\(dq; */

        // Intermediate tag variables used by the lexer (must be autogenerated).
        /*!stags:re2c format = \(dqint @@ = \-1;\(dq; */

        /*!re2c
            re2c:YYCTYPE = \(dqchar\(dq;
            re2c:YYPEEK = \(dqyyinput.charAt(yycursor)\(dq;
            re2c:yyfill:enable = 0;
            re2c:tags = 1;

            num = [0\-9]+;

            @t1 num @t2 \(dq.\(dq @t3 num @t4 (\(dq.\(dq @t5 num)? [\ex00] {
                int major = Integer.valueOf(yyinput.substring(t1, t2));
                int minor = Integer.valueOf(yyinput.substring(t3, t4));
                int patch = (t5 == \-1) ? 0 : Integer.valueOf(yyinput.substring(t5, yycursor \- 1));
                return Optional.of(new SemVer(major, minor, patch));
            }
            * { return Optional.empty(); }
        */
    }

    public static void main(String []args) {
        assert parse(\(dq23.34\e0\(dq).get().equals(new SemVer(23, 34, 0));
        assert parse(\(dq1.2.99999\e0\(dq).get().equals(new SemVer(1, 2, 99999));
        assert !parse(\(dq1.a\e0\(dq).isPresent();
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Here is a more complex example of using s\-tags with \fBYYFILL\fP to parse a file
with newline\-separated semantic versions. Tag variables are part of the lexer
state, and they are adjusted in \fBYYFILL\fP like other input positions.
Note that this adjustment is necessary for s\-tags because their values are invalidated after
shifting buffer contents. It may not be necessary in a custom implementation
where tag variables store offsets relative to the start of the input string
rather than the buffer, which may be the case with m\-tags.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT

import java.io.*;
import java.nio.file.*;
import java.util.*;

class Lexer {
    static class SemVer {
        int major;
        int minor;
        int patch;

        public SemVer(int m, int n, int k) {
            major = m;
            minor = n;
            patch = k;
        }

        public boolean equals(SemVer v) {
            return major == v.major && minor == v.minor && patch == v.patch;
        }
    };

    public static final int BUFSIZE = 4096;

    private BufferedInputStream stream;
    private byte[] yyinput;
    private int yycursor;
    private int yymarker;
    private int yylimit;
    private int token;
    // Intermediate tag variables used by the lexer (must be autogenerated).
    /*!stags:re2c format = \(dqprivate int @@;\en\(dq; */
    private boolean eof;

    public Lexer(File file) throws FileNotFoundException {
        stream = new BufferedInputStream(new FileInputStream(file));
        // Sentinel at \(gayylimit\(ga offset is set to zero, which triggers YYFILL.
        yyinput = new byte[BUFSIZE + 1];
        yycursor = yymarker = yylimit = token = BUFSIZE;
        /*!stags:re2c format = \(dq@@ = \-1;\en\(dq; */
        eof = false;
    }

    private int fill() throws IOException {
        if (eof) { return \-1; } // unexpected EOF

        // Error: lexeme too long. In real life can reallocate a larger buffer.
        if (token < 1) { return \-2; }

        // Shift buffer contents (discard everything up to the current token).
        System.arraycopy(yyinput, token, yyinput, 0, yylimit \- token); 
        yycursor \-= token;
        yymarker \-= token;
        yylimit \-= token;
        /*!stags:re2c format = \(dqif (@@ != \-1) {@@ \-= token;}\en\(dq; */
        token = 0;

        // Fill free space at the end of buffer with new data from file.
        int have = stream.read(yyinput, yylimit, BUFSIZE \- yylimit);
        if (have != \-1) { yylimit += have; } // \-1 means end of stream
        yyinput[yylimit] = 0; // append sentinel symbol

        // If read less than expected, this is the end of input.
        eof = yylimit < BUFSIZE;

        return 0;
    }

    private int readInt(int tag1, int tag2) {
        int n = 0;
        for (int i = tag1; i < tag2; ++i) { n = n * 10 + (yyinput[i] \- 48); }
        return n;
    }

    public Optional<ArrayList<SemVer>> lex() throws IOException {
        ArrayList<SemVer> vers = new ArrayList<SemVer>();

        // Final tag variables available in semantic action.
        /*!svars:re2c format = \(dqint @@;\(dq; */

        loop: while (true) {
            token = yycursor;
            /*!re2c
                re2c:YYCTYPE = \(dqint\(dq;
                re2c:YYPEEK = \(dqByte.toUnsignedInt(yyinput[yycursor])\(dq;
                re2c:YYFILL = \(dqfill() == 0\(dq;
                re2c:eof = 0;
                re2c:tags = 1;

                num = [0\-9]+;

                @t1 num @t2 \(dq.\(dq @t3 num @t4 (\(dq.\(dq @t5 num)? [\en] {
                    int major = readInt(t1, t2);
                    int minor = readInt(t3, t4);
                    int patch = (t5 == \-1) ? 0 : readInt(t5, yycursor \- 1);
                    vers.add(new SemVer(major, minor, patch));
                    continue loop;
                }
                $ { return Optional.of(vers); }
                * { return Optional.empty(); }
            */
        }
    }

    public static void main(String []args) throws FileNotFoundException, IOException {
        String fname = \(dqinput\(dq;
        String content = \(dq1.22.333\en\(dq.repeat(Lexer.BUFSIZE);

        // Prepare input file: a few times the size of the buffer, containing
        // newline\-separated semantic versions.
        Files.writeString(Paths.get(fname), content);

        // Prepare lexer state: all offsets are at the end of buffer.
        File file = new File(\(dq.\(dq, fname);
        Lexer lexer = new Lexer(file);

        // Run the lexer.
        Optional<ArrayList<SemVer>> vers = lexer.lex();

        // Check results.
        assert vers.isPresent() && vers.get().size() == BUFSIZE;
        SemVer v = new SemVer(1, 22, 333);
        for (int i = 0; i < BUFSIZE; ++i) {
            assert vers.get().get(i).equals(v);
        }

        // Cleanup: remove input file.
        file.delete();
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Here is an example of using capturing groups to parse semantic versions.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT

import java.util.Optional;

class Main {
    static class SemVer {
        int major;
        int minor;
        int patch;

        public SemVer(int m, int n, int k) {
            major = m;
            minor = n;
            patch = k;
        }

        public boolean equals(SemVer v) {
            return major == v.major && minor == v.minor && patch == v.patch;
        }
    };

    static Optional<SemVer> parse(String yyinput) {
        int yycursor = 0;
        int yymarker = 0;

        // Final tag variables available in semantic action.
        /*!svars:re2c format = \(dqint @@;\(dq; */

        // Intermediate tag variables used by the lexer (must be autogenerated).
        /*!stags:re2c format = \(dqint @@ = \-1;\(dq; */

        /*!re2c
            re2c:YYCTYPE = \(dqchar\(dq;
            re2c:YYPEEK = \(dqyyinput.charAt(yycursor)\(dq;
            re2c:yyfill:enable = 0;
            re2c:captvars = 1;

            num = [0\-9]+;

            (num) \(dq.\(dq (num) (\(dq.\(dq num)? [\ex00] {
                int major = Integer.valueOf(yyinput.substring(yytl1, yytr1));
                int minor = Integer.valueOf(yyinput.substring(yytl2, yytr2));
                int patch = (yytl3 == \-1) ? 0
                        : Integer.valueOf(yyinput.substring(yytl3 + 1, yytr3));
                return Optional.of(new SemVer(major, minor, patch));
            }
            * { return Optional.empty(); }
        */
    }

    public static void main(String []args) {
        assert parse(\(dq23.34\e0\(dq).get().equals(new SemVer(23, 34, 0));
        assert parse(\(dq1.2.99999\e0\(dq).get().equals(new SemVer(1, 2, 99999));
        assert !parse(\(dq1.a\e0\(dq).isPresent();
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Here is an example of using m\-tags to parse a version with a variable number of
components. For simplicity, each m\-tag is stored as a list of positions.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT

import java.util.*;

class Main {
    static Optional<int[]> parse(String yyinput) {
        int yycursor = 0;
        int yymarker = 0;

        // Final tag variables available in semantic action.
        /*!svars:re2c format = \(dqint @@;\(dq; */
        /*!mvars:re2c format = \(dqList<Integer> @@;\(dq; */

        // Intermediate tag variables used by the lexer (must be autogenerated).
        /*!stags:re2c format = \(dqint @@ = \-1;\(dq; */
        /*!mtags:re2c format = \(dqList<Integer> @@ = new ArrayList<>();\(dq; */

        /*!re2c
            re2c:YYCTYPE = \(dqchar\(dq;
            re2c:YYPEEK = \(dqyyinput.charAt(yycursor)\(dq;
            re2c:YYMTAGP = \(dq@@.add(yycursor);\(dq;
            re2c:YYMTAGN = \(dq\(dq; // do nothing
            re2c:yyfill:enable = 0;
            re2c:tags = 1;

            num = [0\-9]+;

            @t1 num @t2 (\(dq.\(dq #t3 num #t4)* [\ex00] {
                int[] vers = new int[t3.size() + 1];
                vers[0] = Integer.valueOf(yyinput.substring(t1, t2));
                for (int i = 0; i < t3.size(); ++i) {
                    vers[i + 1] = Integer.valueOf(yyinput.substring(t3.get(i), t4.get(i)));
                }
                return Optional.of(vers);
            }
            * { return Optional.empty(); }
        */
    }

    public static void main(String []args) {
        assert Arrays.equals(parse(\(dq1\e0\(dq).get(), new int[]{1});
        assert Arrays.equals(parse(\(dq1.2.3.4.5.6.7\e0\(dq).get(), new int[]{1, 2, 3, 4, 5, 6, 7});
        assert !parse(\(dq1.2.\e0\(dq).isPresent();
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.SS Encoding support
.sp
It is necessary to understand the difference between \fBcode points\fP and
\fBcode units\fP\&. A code point is a numeric identifier of a symbol. A code unit is
the smallest unit of storage in the encoded text. A single code point may be
represented with one or more code units. In a fixed\-length encoding all code
points are represented with the same number of code units. In a variable\-length
encoding code points may be represented with a different number of code units.
Note that the \(dqany\(dq rule \fB[^]\fP matches any code point, but not necessarily
any code unit (the only way to match any code unit regardless of the encoding
is the default rule \fB*\fP).
The generated lexer works with a stream of code units: \fByych\fP stores a code
unit, and \fBYYCTYPE\fP is the code unit type. Regular expressions, on the other
hand, are specified in terms of code points. When re2java compiles regular
expressions to automata it translates code points to code units. This is
generally not a simple mapping: in variable\-length encodings a single code point
range may get translated to a complex code unit graph.
The following encodings are supported:
.INDENT 0.0
.IP \(bu 2
\fBASCII\fP (enabled by default). It is a fixed\-length encoding with code space
\fB[0\-255]\fP and 1\-byte code points and code units.
.IP \(bu 2
\fBEBCDIC\fP (enabled with \fB\-\-ebcdic\fP or \fBre2c:encoding:ebcdic\fP). It is a
fixed\-length encoding with code space \fB[0\-255]\fP and 1\-byte code points and
code units.
.IP \(bu 2
\fBUCS2\fP (enabled with \fB\-\-ucs2\fP or \fBre2c:encoding:ucs2\fP). It is a
fixed\-length encoding with code space \fB[0\-0xFFFF]\fP and 2\-byte code points
and code units.
.IP \(bu 2
\fBUTF8\fP (enabled with \fB\-\-utf8\fP or \fBre2c:encoding:utf8\fP). It is a
variable\-length Unicode encoding. Code unit size is 1 byte. Code points are
represented with 1 \-\- 4 code units.
.IP \(bu 2
\fBUTF16\fP (enabled with \fB\-\-utf16\fP or \fBre2c:encoding:utf16\fP). It is a
variable\-length Unicode encoding. Code unit size is 2 bytes. Code points are
represented with 1 \-\- 2 code units.
.IP \(bu 2
\fBUTF32\fP (enabled with \fB\-\-utf32\fP or \fBre2c:encoding:utf32\fP). It is a
fixed\-length Unicode encoding with code space \fB[0\-0x10FFFF]\fP and 4\-byte code
points and code units.
.UNINDENT
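.sp
The difference between code points and code units can be observed directly in
Java. The sketch below is only an illustration (the class \fBCodeUnits\fP and its
methods are hypothetical names, not part of re2java); it counts how many code
units a string occupies in UTF8, UTF16 and UTF32 using standard library calls.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
import java.nio.charset.StandardCharsets;

class CodeUnits {
    // Number of 1-byte code units when the text is encoded as UTF-8.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of 2-byte code units: Java strings are sequences of UTF-16 units.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of code points, which equals the number of 4-byte UTF-32 units.
    static int utf32Units(String s) {
        return s.codePointCount(0, s.length());
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
For example, the single code point ∀ (U+2200) takes three UTF8 code units, one
UTF16 code unit and one UTF32 code unit.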
.sp
Include file \fBinclude/unicode_categories.re\fP provides re2java definitions for the
standard Unicode categories.
.sp
Option \fB\-\-input\-encoding\fP specifies source file encoding, which can be used to
enable Unicode literals in regular expressions. For example
\fB\-\-input\-encoding utf8\fP tells re2java that the source file is in UTF8 (it differs
from \fB\-\-utf8\fP which sets input text encoding). Option \fB\-\-encoding\-policy\fP
specifies the way re2java handles Unicode surrogates (code points in range
\fB[0xD800\-0xDFFF]\fP).
.sp
Below is an example of a lexer for UTF8 encoded Unicode identifiers.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT \-\-utf8 \-s

/*!include:re2c \(dqunicode_categories.re\(dq */

class Main {
    static boolean lex(String yyinput) {
        int yycursor = 0;
        int yymarker = 0;

        /*!re2c
            re2c:YYCTYPE = \(dqchar\(dq;
            re2c:YYPEEK = \(dqyyinput.charAt(yycursor)\(dq;
            re2c:yyfill:enable = 0;

            // Simplified \(dqUnicode Identifier and Pattern Syntax\(dq
            // (see https://unicode.org/reports/tr31)
            id_start    = L | Nl | [$_];
            id_continue = id_start | Mn | Mc | Nd | Pc | [\eu200D\eu05F3];
            identifier  = id_start+;
            // It should be \(gaid_start id_continue*\(ga, but that causes \(gaerror: code too large\(ga

            identifier { return true; }
            *          { return false; }
        */
    }

    public static void main(String []args) {
        assert lex(\(dq_Ыдентификатор\e0\(dq);
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.SS Include files
.sp
re2java allows one to include other files using a block of the form
\fB/*!include:re2c FILE */\fP or \fB%{include FILE %}\fP, or an in\-block directive
\fB!include FILE ;\fP, where \fBFILE\fP is a path to the file to be included.
re2java looks for include files in the directory of the including file and in
include locations, which can be specified with the \fB\-I\fP option. Include
blocks/directives in re2java work in the same way as C/C++ \fB#include\fP: \fBFILE\fP
contents are copy\-pasted verbatim in place of the block/directive. Include files
may have further includes of their own. Use \fB\-\-depfile\fP option to track build
dependencies of the output file on include files.
re2java provides some predefined include files that can be found in the
\fBinclude/\fP subdirectory of the project. These files contain definitions that
may be useful to other projects (such as Unicode categories) and form something
like a standard library for re2java\&. Below is an example of using include files.
.SS Include file 1 (definitions.java)
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
/*!re2c
    number = [1\-9][0\-9]*;
*/

.ft P
.fi
.UNINDENT
.UNINDENT
.SS Include file 2 (extra_rules.re.inc)
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// floating\-point numbers
frac  = [0\-9]* \(dq.\(dq [0\-9]+ | [0\-9]+ \(dq.\(dq;
exp   = \(aqe\(aq [+\-]? [0\-9]+;
float = frac exp? | [0\-9]+ exp;

float { return Num.FLOAT; }

.ft P
.fi
.UNINDENT
.UNINDENT
.SS Input file
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT

/*!include:re2c \(dqdefinitions.java\(dq */

class Main {
    enum Num {INT, FLOAT, NAN};

    static Num lex(String yyinput) {
        int yycursor = 0;
        int yymarker = 0;

        /*!re2c
            re2c:YYCTYPE = \(dqchar\(dq;
            re2c:YYPEEK = \(dqyyinput.charAt(yycursor)\(dq;
            re2c:yyfill:enable = 0;

            *      { return Num.NAN; }
            number { return Num.INT; }
            !include \(dqextra_rules.re.inc\(dq;
        */
    }

    public static void main(String []args) {
        assert lex(\(dq123\e0\(dq) == Num.INT;
        assert lex(\(dq123.4567\e0\(dq) == Num.FLOAT;
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.SS Header files
.sp
re2java allows one to generate a header file from the input \fB\&.re\fP file using the
\fB\-\-header\fP option or \fBre2c:header\fP configuration and block pairs of the form
\fB/*!header:re2c:on*/\fP and \fB/*!header:re2c:off*/\fP, or \fB%{header:on%}\fP and
\fB%{header:off%}\fP\&. The first block marks the beginning of the header file, and the
second block marks the end of it. Everything between these blocks is processed by
re2java, and the generated code is written to the file specified with the \fB\-\-header\fP
option or \fBre2c:header\fP configuration (or \fBstdout\fP if neither option nor
configuration is used). An autogenerated header file may be needed in cases when
re2java is used to generate definitions that must be visible from other
translation units.
.sp
Here is an example of generating a header file that contains the definition of the
lexer state with tag variables (the number of variables depends on the regular
grammar and is unknown to the programmer).
.SS Input file
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// re2java $INPUT \-o $OUTPUT \-\-header lexer/state.java

package headers;

import headers.lexer.State;

/*!header:re2c:on*/
package headers.lexer;

public class State {
    public String yyinput;
    public int yycursor;
    /*!stags:re2c format = \(dqpublic int @@;\en\(dq; */

    public State(String str) {
        yyinput = str;
        yycursor = 0;
        /*!stags:re2c format = \(dq@@ = 0;\en\(dq; */
    }
};
/*!header:re2c:off*/

class Main {
    static int lex(String str) {
        State yyrecord = new State(str);
        int t;
        /*!re2c
            re2c:api = record;
            re2c:tags = 1;
            re2c:yyfill:enable = 0;
            re2c:YYCTYPE = \(dqchar\(dq;
            re2c:YYPEEK = \(dqyyrecord.yyinput.charAt(yyrecord.yycursor)\(dq;
            re2c:header = \(dqlexer/state.java\(dq;

            [a]* @t [b]* { return t; }
        */
    }

    public static void main(String []args) {
        assert lex(\(dqab\e0\(dq) == 1;
    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.SS Header file
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
// Generated by re2c

package headers.lexer;

public class State {
    public String yyinput;
    public int yycursor;
    public int yyt1;


    public State(String str) {
        yyinput = str;
        yycursor = 0;
        yyt1 = 0;

    }
};

.ft P
.fi
.UNINDENT
.UNINDENT
.SS Skeleton programs
.sp
With the \fB\-S, \-\-skeleton\fP option, re2java ignores all non\-re2java code and
generates a self\-contained C program that can be further compiled and executed.
The program consists of lexer code and input data. For each constructed DFA
(block or condition) re2java generates a standalone lexer and two files: an
\fB\&.input\fP file with strings derived from the DFA and a \fB\&.keys\fP file with
expected match results. The program runs each lexer on the corresponding
\fB\&.input\fP file and compares results with the expectations.
Skeleton programs are very useful for a number of reasons:
.INDENT 0.0
.IP \(bu 2
They can check correctness of various re2java optimizations (the data is
generated early in the process, before any DFA transformations have taken
place).
.IP \(bu 2
Generating a set of input data with good coverage may be useful for both
testing and benchmarking.
.IP \(bu 2
Generating self\-contained executable programs allows one to get minimized test
cases (the original code may be large or have a lot of dependencies).
.UNINDENT
.sp
The difficulty with generating input data is that for all but the most trivial
cases the number of possible input strings is too large (even if the string
length is limited). re2java solves this difficulty by generating sufficiently
many strings to cover almost all DFA transitions. It uses the following
algorithm. First, it constructs a skeleton of the DFA. For encodings with 1\-byte
code unit size (such as ASCII, UTF\-8 and EBCDIC) the skeleton is just an exact copy
of the original DFA. For encodings with multibyte code units the skeleton is a copy
of the DFA with certain transitions omitted: namely, re2java takes at most 256 code
units for each disjoint continuous range that corresponds to a DFA transition.
The chosen values are evenly distributed and include range bounds. Instead of
trying to cover all possible paths in the skeleton (which is infeasible) re2java
generates sufficiently many paths to cover all skeleton transitions, and thus
trigger the corresponding conditional jumps in the lexer.
The algorithm implementation is limited to ~1Gb of transitions and consumes
a constant amount of memory (re2java writes data to file as soon as it is
generated).
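.sp
The range sampling step described above can be illustrated with a short Java
sketch. It is not taken from the re2java sources: the class name
\fBSkeletonSample\fP, the method \fBsample\fP and the exact spacing formula are
assumptions, chosen only to satisfy the stated properties (at most 256 code
units per range, evenly distributed, range bounds included).
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```java
import java.util.ArrayList;
import java.util.List;

class SkeletonSample {
    // Cap taken from the text: at most 256 code units per range.
    static final int MAX_SAMPLES = 256;

    // Picks at most MAX_SAMPLES code units from the range [lo, hi],
    // always including both bounds. The spacing formula is an assumption.
    static List<Integer> sample(int lo, int hi) {
        List<Integer> picked = new ArrayList<>();
        long width = (long) hi - lo;
        if (width < MAX_SAMPLES) {
            // Small range: keep every code unit.
            for (int c = lo; c <= hi; c++) {
                picked.add(c);
            }
        } else {
            // Large range: take evenly spaced values from lo to hi.
            for (int i = 0; i < MAX_SAMPLES; i++) {
                picked.add((int) (lo + width * i / (MAX_SAMPLES - 1)));
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        List<Integer> s = sample(0x0000, 0xFFFF);
        // prints "256 0 65535"
        System.out.println(s.size() + " " + s.get(0) + " " + s.get(s.size() - 1));
    }
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
With a cap like this, a transition on a full 16\-bit range contributes 256
skeleton transitions instead of 65536, which keeps the generated input data
manageable.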
.SS Visualization and debug
.sp
With the \fB\-D, \-\-emit\-dot\fP option, re2java does not generate code. Instead,
it dumps the generated DFA in DOT format.
One can convert this dump to an image of the DFA using Graphviz or another library.
Note that this option shows the final DFA after it has gone through a number of
optimizations and transformations. Earlier stages can be dumped with various debug
options, such as \fB\-\-dump\-nfa\fP, \fB\-\-dump\-dfa\-raw\fP etc. (see the full list of options).
.SH SEE ALSO
.sp
You can find more information about re2c at the official website: \fI\%http://re2c.org\fP\&.
Similar programs are flex(1), lex(1), quex(\fI\%http://quex.sourceforge.net\fP).
.SH AUTHORS
.sp
re2java was originally written by Peter Bumbulis (\fI\%peter@csg.uwaterloo.ca\fP) in 1993.
Marcus Boerger and Dan Nuffer spent several years turning the original idea into
a production ready code generator. Since then it has been maintained and
developed by multiple volunteers, most notably,
Brian Young (\fI\%bayoung@acm.org\fP),
\fI\%Marcus Boerger\fP,
Dan Nuffer (\fI\%nuffer@users.sourceforge.net\fP),
\fI\%Ulya Trofimovich\fP (\fI\%skvadrik@gmail.com\fP),
\fI\%Serghei Iakovlev\fP,
\fI\%Sergei Trofimovich\fP,
\fI\%Petr Skocik\fP,
\fI\%ligfx\fP,
\fI\%raekye\fP
and \fI\%PolarGoose\fP\&.
.\" Generated by docutils manpage writer.
.