7022 7023 7024 7025 7026 7027 7028 7029 7030 7031 7032 7033 7034 7035 7036 7037 7038 7039 7040 7041 7042 7043 7044 7045 7046 7047 7048 7049 7050 7051 7052 7053 7054 7055 7056 7057 7058 7059 7060 7061 7062 7063 7064 7065 7066 7067 7068 7069 7070 7071 7072 7073 7074 7075 7076 7077 7078 7079 7080 7081 7082 7083 7084 7085 7086 7087 7088 7089 7090 7091 7092 7093 7094 7095 7096 7097 7098 7099 7100 7101 7102 7103 7104 7105 7106 7107 7108 7109 7110 7111 7112 7113 7114 7115 7116
|
-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcredemo program. There are separate text files for the pcregrep and
pcretest commands.
-----------------------------------------------------------------------------
PCRE(3) PCRE(3)
NAME
PCRE - Perl-compatible regular expressions
INTRODUCTION
The PCRE library is a set of functions that implement regular expres-
sion pattern matching using the same syntax and semantics as Perl, with
just a few differences. Some features that appeared in Python and PCRE
before they appeared in Perl are also available using the Python syn-
tax, there is some support for one or two .NET and Oniguruma syntax
items, and there is an option for requesting some minor changes that
give better JavaScript compatibility.
The current implementation of PCRE corresponds approximately with Perl
5.10, including support for UTF-8 encoded strings and Unicode general
category properties. However, UTF-8 and Unicode support has to be
explicitly enabled; it is not the default. The Unicode tables corre-
spond to Unicode release 5.2.0.
In addition to the Perl-compatible matching function, PCRE contains an
alternative function that matches the same compiled patterns in a dif-
ferent way. In certain circumstances, the alternative function has some
advantages. For a discussion of the two matching algorithms, see the
pcrematching page.
PCRE is written in C and released as a C library. A number of people
have written wrappers and interfaces of various kinds. In particular,
Google Inc. have provided a comprehensive C++ wrapper. This is now
included as part of the PCRE distribution. The pcrecpp page has details
of this interface. Other people's contributions can be found in the
Contrib directory at the primary FTP site, which is:
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
Details of exactly which Perl regular expression features are and are
not supported by PCRE are given in separate documents. See the pcrepat-
tern and pcrecompat pages. There is a syntax summary in the pcresyntax
page.
Some features of PCRE can be included, excluded, or changed when the
library is built. The pcre_config() function makes it possible for a
client to discover which features are available. The features them-
selves are described in the pcrebuild page. Documentation about build-
ing PCRE for various operating systems can be found in the README and
NON-UNIX-USE files in the source distribution.
The library contains a number of undocumented internal functions and
data tables that are used by more than one of the exported external
functions, but which are not intended for use by external callers.
Their names all begin with "_pcre_", which hopefully will not provoke
any name clashes. In some environments, it is possible to control which
external symbols are exported when a shared library is built, and in
these cases the undocumented symbols are not exported.
USER DOCUMENTATION
The user documentation for PCRE comprises a number of different sec-
tions. In the "man" format, each of these is a separate "man page". In
the HTML format, each is a separate page, linked from the index page.
In the plain text format, all the sections, except the pcredemo sec-
tion, are concatenated, for ease of searching. The sections are as fol-
lows:
pcre              this document
pcre-config       show PCRE installation configuration information
pcreapi           details of PCRE's native C API
pcrebuild         options for building PCRE
pcrecallout       details of the callout feature
pcrecompat        discussion of Perl compatibility
pcrecpp           details of the C++ wrapper
pcredemo          a demonstration C program that uses PCRE
pcregrep          description of the pcregrep command
pcrematching      discussion of the two matching algorithms
pcrepartial       details of the partial matching facility
pcrepattern       syntax and semantics of supported regular expressions
pcreperform       discussion of performance issues
pcreposix         the POSIX-compatible C API
pcreprecompile    details of saving and re-using precompiled patterns
pcresample        discussion of the pcredemo program
pcrestack         discussion of stack usage
pcresyntax        quick syntax reference
pcretest          description of the pcretest testing command
In addition, in the "man" and HTML formats, there is a short page for
each C library function, listing its arguments and results.
LIMITATIONS
There are some size limitations in PCRE but it is hoped that they will
never in practice be relevant.
The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE
is compiled with the default internal linkage size of 2. If you want to
process regular expressions that are truly enormous, you can compile
PCRE with an internal linkage size of 3 or 4 (see the README file in
the source distribution and the pcrebuild documentation for details).
In these cases the limit is substantially larger. However, the speed
of execution is slower.
All values in repeating quantifiers must be less than 65536.
There is no limit to the number of parenthesized subpatterns, but there
can be no more than 65535 capturing subpatterns.
The maximum length of name for a named subpattern is 32 characters, and
the maximum number of named subpatterns is 10000.
The maximum length of a subject string is the largest positive number
that an integer variable can hold. However, when using the traditional
matching function, PCRE uses recursion to handle subpatterns and indef-
inite repetition. This means that the available stack space may limit
the size of a subject string that can be processed by certain patterns.
For a discussion of stack issues, see the pcrestack documentation.
UTF-8 AND UNICODE PROPERTY SUPPORT
From release 3.3, PCRE has had some support for character strings
encoded in the UTF-8 format. For release 4.0 this was greatly extended
to cover most common requirements, and in release 5.0 additional sup-
port for Unicode general category properties was added.
In order to process UTF-8 strings, you must build PCRE to include UTF-8
support in the code, and, in addition, you must call pcre_compile()
with the PCRE_UTF8 option flag, or the pattern must start with the
sequence (*UTF8). When either of these is the case, both the pattern
and any subject strings that are matched against it are treated as
UTF-8 strings instead of strings of 1-byte characters.
If you compile PCRE with UTF-8 support, but do not use it at run time,
the library will be a bit bigger, but the additional run time overhead
is limited to testing the PCRE_UTF8 flag occasionally, so should not be
very big.
If PCRE is built with Unicode character property support (which implies
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup-
ported. The available properties that can be tested are limited to the
general category properties such as Lu for an upper case letter or Nd
for a decimal number, the Unicode script names such as Arabic or Han,
and the derived properties Any and L&. A full list is given in the
pcrepattern documentation. Only the short names for properties are
supported. For example, \p{L} matches a letter. Its Perl synonym,
\p{Letter}, is not supported. Furthermore, in Perl, many properties may
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE
does not support this.
Validity of UTF-8 strings
When you set the PCRE_UTF8 flag, the strings passed as patterns and
subjects are (by default) checked for validity on entry to the relevant
functions. From release 7.3 of PCRE, the check is according to the rules
of RFC 3629, which are themselves derived from the Unicode specifica-
tion. Earlier releases of PCRE followed the rules of RFC 2279, which
allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current
check allows only values in the range U+0 to U+10FFFF, excluding U+D800
to U+DFFF.
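The range rules just described can be sketched in C as follows. This is
only an illustration of an RFC 3629 style check, not the code that PCRE
itself runs; the function name utf8_is_valid_rfc3629 is hypothetical, and
the rejection of overlong encodings is an assumption taken from the RFC
rather than from the text of this page.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the len bytes at s are valid UTF-8 in the RFC 3629 sense
   (U+0 to U+10FFFF, excluding U+D800 to U+DFFF), and 0 otherwise.
   Hypothetical helper; not part of the PCRE API. */
int utf8_is_valid_rfc3629(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t extra, k;

        if (c < 0x80) { i++; continue; }            /* one-byte character */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;  /* stray continuation byte, or a 5/6-byte lead
                           that only the old RFC 2279 allowed */

        if (len - i - 1 < extra) return 0;          /* truncated sequence */
        for (k = 1; k <= extra; k++) {
            unsigned char t = s[i + k];
            if ((t & 0xC0) != 0x80) return 0;       /* not a continuation */
            cp = (cp << 6) | (unsigned long)(t & 0x3F);
        }
        if (cp < min) return 0;                     /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                /* beyond U+10FFFF */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate range */
        i += extra + 1;
    }
    return 1;
}
```

A caller that has run a check of this kind on its own data is the sort of
caller for whom skipping the library's validity test might be considered.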
The excluded code points are the "Low Surrogate Area" of Unicode, of
which the Unicode Standard says this: "The Low Surrogate Area does not
contain any character assignments, consequently no character code
charts or namelists are provided for this area. Surrogates are reserved
for use with UTF-16 and then must be used in pairs." The code points
that are encoded by UTF-16 pairs are available as independent code
points in the UTF-8 encoding. (In other words, the whole surrogate
thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)
If an invalid UTF-8 string is passed to PCRE, an error return
(PCRE_ERROR_BADUTF8) is given. In some situations, you may already know
that your strings are valid, and therefore want to skip these checks in
order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at
compile time or at run time, PCRE assumes that the pattern or subject
it is given (respectively) contains only valid UTF-8 codes. In this
case, it does not diagnose an invalid UTF-8 string.
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
what happens depends on why the string is invalid. If the string con-
forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
string of characters in the range 0 to 0x7FFFFFFF. In other words,
apart from the initial validity test, PCRE (when in UTF-8 mode) handles
strings according to the more liberal rules of RFC 2279. However, if
the string does not even conform to RFC 2279, the result is undefined.
Your program may crash.
If you want to process strings of values in the full range 0 to
0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can
set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
this situation, you will have to apply your own validity check.
General comments about UTF-8 mode
1. An unbraced hexadecimal escape sequence (such as \xb3) matches a
two-byte UTF-8 character if the value is greater than 127.
2. Octal numbers up to \777 are recognized, and match two-byte UTF-8
characters for values greater than \177.
3. Repeat quantifiers apply to complete UTF-8 characters, not to indi-
vidual bytes, for example: \x{100}{3}.
4. The dot metacharacter matches one UTF-8 character instead of a sin-
gle byte.
5. The escape sequence \C can be used to match a single byte in UTF-8
mode, but its use can lead to some strange effects. This facility is
not available in the alternative matching function, pcre_dfa_exec().
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
test characters of any code value, but the characters that PCRE recog-
nizes as digits, spaces, or word characters remain the same set as
before, all with values less than 256. This remains true even when PCRE
includes Unicode property support, because to do otherwise would slow
down PCRE in many common cases. If you really want to test for a wider
sense of, say, "digit", you must use Unicode property tests such as
\p{Nd}. Note that this also applies to \b, because it is defined in
terms of \w and \W.
7. Similarly, characters that match the POSIX named character classes
are all low-valued characters.
8. However, the Perl 5.10 horizontal and vertical whitespace matching
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
acters.
9. Case-insensitive matching applies only to characters whose values
are less than 128, unless PCRE is built with Unicode property support.
Even when Unicode property support is available, PCRE still uses its
own character tables when checking the case of low-valued characters,
so as not to degrade performance. The Unicode property information is
used only for characters with higher values. Even when Unicode property
support is available, PCRE supports case-insensitive matching only when
there is a one-to-one mapping between a letter's cases. There are a
small number of many-to-one mappings in Unicode; these are not sup-
ported by PCRE.
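To make items 1 and 2 of the list concrete, the following sketch encodes
a code point as UTF-8 and reports how many bytes it needs. The function
name utf8_encode is hypothetical and the code is not taken from PCRE; it
simply follows the standard UTF-8 bit layout.

```c
#include <assert.h>
#include <stddef.h>

/* Encode code point cp (0 to 0x10FFFF) as UTF-8 into out, which must
   have room for 4 bytes. Returns the number of bytes written.
   Hypothetical helper; not part of the PCRE API. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL && cp <= 0x10FFFF);
    if (cp < 0x80) {                       /* values up to 127: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 128 to 2047: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));   /* four bytes */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

With this sketch, the value 0xb3 becomes the two bytes 0xC2 0xB3, and the
largest three-digit octal value, 0777, also needs two bytes, which is why
\xb3 and \777 each stand for a two-byte character in UTF-8 mode.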
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
Putting an actual email address here seems to have been a spam magnet,
so I've taken it away. If you want to email me, use my two initials,
followed by the two digits 10, at the domain cam.ac.uk.
REVISION
Last updated: 01 March 2010
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
PCREBUILD(3) PCREBUILD(3)
NAME
PCRE - Perl-compatible regular expressions
PCRE BUILD-TIME OPTIONS
This document describes the optional features of PCRE that can be
selected when the library is compiled. It assumes use of the configure
script, where the optional features are selected or deselected by pro-
viding options to configure before running the make command. However,
the same options can be selected in both Unix-like and non-Unix-like
environments using the GUI facility of cmake-gui if you are using CMake
instead of configure to build PCRE.
There is a lot more information about building PCRE in non-Unix-like
environments in the file called NON-UNIX-USE, which is part of the PCRE
distribution. You should consult this file as well as the README file
if you are building in a non-Unix-like environment.
The complete list of options for configure (which includes the standard
ones such as the selection of the installation directory) can be
obtained by running
./configure --help
The following sections include descriptions of options whose names
begin with --enable or --disable. These settings specify changes to the
defaults for the configure command. Because of the way that configure
works, --enable and --disable always come in pairs, so the complemen-
tary option always exists as well, but as it specifies the default, it
is not described.
C++ SUPPORT
By default, the configure script will search for a C++ compiler and C++
header files. If it finds them, it automatically builds the C++ wrapper
library for PCRE. You can disable this by adding
--disable-cpp
to the configure command.
UTF-8 SUPPORT
To build PCRE with support for UTF-8 Unicode character strings, add
--enable-utf8
to the configure command. Of itself, this does not make PCRE treat
strings as UTF-8. As well as compiling PCRE with this option, you also
have to set the PCRE_UTF8 option when you call the pcre_compile()
or pcre_compile2() functions.
If you set --enable-utf8 when compiling in an EBCDIC environment, PCRE
expects its input to be either ASCII or UTF-8 (depending on the runtime
option). It is not possible to support both EBCDIC and UTF-8 codes in
the same version of the library. Consequently, --enable-utf8 and
--enable-ebcdic are mutually exclusive.
UNICODE CHARACTER PROPERTY SUPPORT
UTF-8 support allows PCRE to process character values greater than 255
in the strings that it handles. On its own, however, it does not pro-
vide any facilities for accessing the properties of such characters. If
you want to be able to use the pattern escapes \P, \p, and \X, which
refer to Unicode character properties, you must add
--enable-unicode-properties
to the configure command. This implies UTF-8 support, even if you have
not explicitly requested it.
Including Unicode property support adds around 30K of tables to the
PCRE library. The supported properties are the general category
properties such as Lu and Nd, the Unicode script names, and the derived
properties Any and L&. Details are given in the pcrepattern documentation.
CODE VALUE OF NEWLINE
By default, PCRE interprets the linefeed (LF) character as indicating
the end of a line. This is the normal newline character on Unix-like
systems. You can compile PCRE to use carriage return (CR) instead, by
adding
--enable-newline-is-cr
to the configure command. There is also a --enable-newline-is-lf
option, which explicitly specifies linefeed as the newline character.
Alternatively, you can specify that line endings are to be indicated by
the two character sequence CRLF. If you want this, add
--enable-newline-is-crlf
to the configure command. There is a fourth option, specified by
--enable-newline-is-anycrlf
which causes PCRE to recognize any of the three sequences CR, LF, or
CRLF as indicating a line ending. Finally, a fifth option, specified by
--enable-newline-is-any
causes PCRE to recognize any Unicode newline sequence.
Whatever line ending convention is selected when PCRE is built can be
overridden when the library functions are called. At build time it is
conventional to use the standard for your operating system.
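The first four conventions can be compared with a small sketch. The enum
and function names here are hypothetical, and the code is not how PCRE
recognizes newlines internally; the fifth convention (any Unicode newline
sequence) is deliberately left out to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for four of the five build-time conventions. */
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF } newline_mode;

/* Return the length in bytes of the line ending that starts at p, given
   that avail bytes are readable there, or 0 if p does not start a line
   ending under the chosen convention. */
size_t newline_length(const char *p, size_t avail, newline_mode mode)
{
    assert(p != NULL || avail == 0);
    if (avail == 0) return 0;
    switch (mode) {
    case NL_CR:
        return p[0] == '\r' ? 1 : 0;
    case NL_LF:
        return p[0] == '\n' ? 1 : 0;
    case NL_CRLF:
        return (avail >= 2 && p[0] == '\r' && p[1] == '\n') ? 2 : 0;
    case NL_ANYCRLF:
        /* Prefer the two-character sequence when both are present. */
        if (p[0] == '\r') return (avail >= 2 && p[1] == '\n') ? 2 : 1;
        return p[0] == '\n' ? 1 : 0;
    }
    return 0;
}
```

The same two bytes CR LF therefore count as one line ending of length 2
under the CRLF and ANYCRLF conventions, as a line ending of length 1
under CR, and as no line ending at that position under LF.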
WHAT \R MATCHES
By default, the sequence \R in a pattern matches any Unicode newline
sequence, whatever has been selected as the line ending sequence. If
you specify
--enable-bsr-anycrlf
the default is changed so that \R matches only CR, LF, or CRLF. What-
ever is selected when PCRE is built can be overridden when the library
functions are called.
BUILDING SHARED AND STATIC LIBRARIES
The PCRE building process uses libtool to build both shared and static
Unix libraries by default. You can suppress one of these by adding one
of
--disable-shared
--disable-static
to the configure command, as required.
POSIX MALLOC USAGE
When PCRE is called through the POSIX interface (see the pcreposix doc-
umentation), additional working storage is required for holding the
pointers to capturing substrings, because PCRE requires three integers
per substring, whereas the POSIX interface provides only two. If the
number of expected substrings is small, the wrapper function uses space
on the stack, because this is faster than using malloc() for each call.
The default threshold above which the stack is no longer used is 10; it
can be changed by adding a setting such as
--with-posix-malloc-threshold=20
to the configure command.
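The stack-or-heap decision described in this section can be sketched as
below. This is not the POSIX wrapper's actual code; the type and function
names are hypothetical, and the caller is assumed to supply a stack
buffer of STACK_THRESHOLD * 3 ints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define STACK_THRESHOLD    10  /* mirrors the documented default */
#define INTS_PER_SUBSTRING 3   /* PCRE needs three integers per substring */

typedef struct {
    int   *vec;      /* working vector */
    size_t count;    /* number of ints in vec */
    int    on_heap;  /* nonzero if vec was obtained from malloc() */
} work_vector;

/* Choose storage for nmatch substrings: the caller's stack buffer when
   nmatch is at or below the threshold, malloc() otherwise.
   Returns 0 on success, -1 if malloc() fails. */
int work_vector_init(work_vector *w, int *stack_buf, size_t nmatch)
{
    assert(w != NULL && stack_buf != NULL);
    w->count = nmatch * INTS_PER_SUBSTRING;
    if (nmatch <= STACK_THRESHOLD) {
        w->vec = stack_buf;
        w->on_heap = 0;
        return 0;
    }
    w->vec = (int *)malloc(w->count * sizeof *w->vec);
    w->on_heap = 1;
    return w->vec != NULL ? 0 : -1;
}

/* Release heap storage, if any was used. */
void work_vector_free(work_vector *w)
{
    assert(w != NULL);
    if (w->on_heap) free(w->vec);
    w->vec = NULL;
}
```

The point of the threshold is visible in the first branch: a small
request costs nothing beyond a pointer assignment, while a large one pays
for a malloc() and free() pair on every call.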
HANDLING VERY LARGE PATTERNS
Within a compiled pattern, offset values are used to point from one
part to another (for example, from an opening parenthesis to an alter-
nation metacharacter). By default, two-byte values are used for these
offsets, leading to a maximum size for a compiled pattern of around
64K. This is sufficient to handle all but the most gigantic patterns.
Nevertheless, some people do want to process truly enormous patterns,
so it is possible to compile PCRE to use three-byte or four-byte off-
sets by adding a setting such as
--with-link-size=3
to the configure command. The value given must be 2, 3, or 4. Using
longer offsets slows down the operation of PCRE because it has to load
additional bytes when handling them.
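The trade-off between offset width and pattern size can be illustrated
with a sketch that stores an offset in 2, 3, or 4 bytes. The function
names are hypothetical, the high-byte-first layout is an assumption made
for the example, and max_offset() gives only the raw capacity of the
field, not the exact pattern size limit that PCRE documents.

```c
#include <assert.h>
#include <stddef.h>

/* Store value in link_size bytes (2, 3, or 4), high-order byte first.
   The byte order is an assumption of this sketch. */
void put_offset(unsigned char *p, int link_size, unsigned long value)
{
    int i;
    assert(p != NULL && link_size >= 2 && link_size <= 4);
    for (i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
}

/* Read back an offset written by put_offset(). One byte is loaded per
   loop iteration, so wider offsets cost more work per access. */
unsigned long get_offset(const unsigned char *p, int link_size)
{
    unsigned long v = 0;
    int i;
    assert(p != NULL && link_size >= 2 && link_size <= 4);
    for (i = 0; i < link_size; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Largest value that fits in link_size bytes. */
unsigned long max_offset(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return 0xFFFFFFFFUL >> (8 * (4 - link_size));
}
```

With two bytes the largest representable offset is 65535, which is where
the figure of around 64K comes from; each extra byte multiplies the range
by 256 at the cost of one more byte to load.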
AVOIDING EXCESSIVE STACK USAGE
When matching with the pcre_exec() function, PCRE implements backtrack-
ing by making recursive calls to an internal function called match().
In environments where the size of the stack is limited, this can se-
verely limit PCRE's operation. (The Unix environment does not usually
suffer from this problem, but it may sometimes be necessary to increase
the maximum stack size. There is a discussion in the pcrestack docu-
mentation.) An alternative approach to recursion that uses memory from
the heap to remember data, instead of using recursive function calls,
has been implemented to work round the problem of limited stack size.
If you want to build a version of PCRE that works this way, add
--disable-stack-for-recursion
to the configure command. With this configuration, PCRE will use the
pcre_stack_malloc and pcre_stack_free variables to call memory manage-
ment functions. By default these point to malloc() and free(), but you
can replace the pointers so that your own functions are used instead.
Separate functions are provided rather than using pcre_malloc and
pcre_free because the usage is very predictable: the block sizes
requested are always the same, and the blocks are always freed in
reverse order. A calling program might be able to implement optimized
functions that perform better than malloc() and free(). PCRE runs
noticeably more slowly when built in this way. This option affects only
the pcre_exec() function; it is not relevant for pcre_dfa_exec().
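Because the blocks are all the same size and are freed in reverse order,
a replacement allocator can be little more than an array and a counter.
The sketch below is one possible shape, under stated assumptions: the
names lifo_alloc and lifo_free, the pool dimensions, and the fallback to
malloc() are all hypothetical, and the code is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define POOL_BLOCKS     64   /* hypothetical capacity */
#define POOL_BLOCK_SIZE 512  /* hypothetical; must cover the requested size */

/* The union forces a conservative alignment for each block. */
static union {
    unsigned char bytes[POOL_BLOCK_SIZE];
    long double   align_ld;
    void         *align_p;
} pool[POOL_BLOCKS];
static size_t pool_top = 0;  /* number of pool blocks currently in use */

static int in_pool(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    return a >= (uintptr_t)&pool[0] && a <= (uintptr_t)&pool[POOL_BLOCKS - 1];
}

/* Hand out the next pool block; fall back to malloc() when the request
   is too large or the pool is exhausted. */
void *lifo_alloc(size_t size)
{
    if (size > POOL_BLOCK_SIZE || pool_top == POOL_BLOCKS)
        return malloc(size);
    return pool[pool_top++].bytes;
}

/* Release a block. Pool blocks must be released in reverse order of
   allocation; anything else is assumed to come from the fallback. */
void lifo_free(void *block)
{
    if (block == NULL) return;
    if (pool_top > 0 && block == (void *)pool[pool_top - 1].bytes) {
        pool_top--;
    } else {
        assert(!in_pool(block));  /* reverse-order rule was violated */
        free(block);
    }
}
```

In a build made with --disable-stack-for-recursion, functions of this
shape could be installed by assigning them to the pcre_stack_malloc and
pcre_stack_free variables before matching.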
LIMITING PCRE RESOURCE USAGE
Internally, PCRE has a function called match(), which it calls repeat-
edly (sometimes recursively) when matching a pattern with the
pcre_exec() function. By controlling the maximum number of times this
function may be called during a single matching operation, a limit can
be placed on the resources used by a single call to pcre_exec(). The
limit can be changed at run time, as described in the pcreapi documen-
tation. The default is 10 million, but this can be changed by adding a
setting such as
--with-match-limit=500000
to the configure command. This setting has no effect on the
pcre_dfa_exec() matching function.
In some environments it is desirable to limit the depth of recursive
calls of match() more strictly than the total number of calls, in order
to restrict the maximum amount of stack (or heap, if
--disable-stack-for-recursion is specified) that is used. A second limit
controls this;
it defaults to the value that is set for --with-match-limit, which
imposes no additional constraints. However, you can set a lower limit
by adding, for example,
--with-match-limit-recursion=10000
to the configure command. This value can also be overridden at run
time.
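The two limits can be illustrated with a toy backtracking matcher that
counts its own calls and tracks its recursion depth. This is not PCRE's
match() function: the matcher understands only literal characters, "."
and a postfix "*", and every name in it is hypothetical. The two error
codes play the role that PCRE_ERROR_MATCHLIMIT and
PCRE_ERROR_RECURSIONLIMIT play in the real library.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NOMATCH                0
#define TOY_MATCH                  1
#define TOY_ERROR_MATCHLIMIT     (-1)
#define TOY_ERROR_RECURSIONLIMIT (-2)

typedef struct {
    unsigned long calls;        /* calls made so far */
    unsigned long match_limit;  /* maximum total calls */
    unsigned long depth_limit;  /* maximum recursion depth */
} toy_limits;

/* Match pat at the start of text, giving up when a limit is exceeded. */
static int toy_match(const char *pat, const char *text,
                     toy_limits *lim, unsigned long depth)
{
    if (++lim->calls > lim->match_limit) return TOY_ERROR_MATCHLIMIT;
    if (depth > lim->depth_limit) return TOY_ERROR_RECURSIONLIMIT;
    if (pat[0] == '\0') return TOY_MATCH;

    if (pat[1] == '*') {
        /* Greedy repeat: take as much as possible, then back off. */
        const char *t = text;
        while (*t != '\0' && (pat[0] == '.' || *t == pat[0])) t++;
        for (;;) {
            int rc = toy_match(pat + 2, t, lim, depth + 1);
            if (rc != TOY_NOMATCH) return rc;  /* a match, or an error */
            if (t == text) return TOY_NOMATCH;
            t--;
        }
    }
    if (*text != '\0' && (pat[0] == '.' || pat[0] == *text))
        return toy_match(pat + 1, text + 1, lim, depth + 1);
    return TOY_NOMATCH;
}

int toy_match_limited(const char *pat, const char *text,
                      unsigned long match_limit, unsigned long depth_limit)
{
    toy_limits lim;
    assert(pat != NULL && text != NULL);
    lim.calls = 0;
    lim.match_limit = match_limit;
    lim.depth_limit = depth_limit;
    return toy_match(pat, text, &lim, 1);
}
```

A pattern such as a*a*a*a*a*b applied to a long run of "a" characters
with no "b" forces a great deal of backtracking, so a small call limit
turns a slow failure into a quick error return, while the depth limit
bounds how much stack any single attempt can use.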
CREATING CHARACTER TABLES AT BUILD TIME
PCRE uses fixed tables for processing characters whose code values are
less than 256. By default, PCRE is built with a set of tables that are
distributed in the file pcre_chartables.c.dist. These tables are for
ASCII codes only. If you add
--enable-rebuild-chartables
to the configure command, the distributed tables are no longer used.
Instead, a program called dftables is compiled and run. This outputs
the source for a new set of tables, created in the default locale of your
C runtime system. (This method of replacing the tables does not work if
you are cross compiling, because dftables is run on the local host. If
you need to create alternative tables when cross compiling, you will
have to do so "by hand".)
USING EBCDIC CODE
PCRE assumes by default that it will run in an environment where the
character code is ASCII (or Unicode, which is a superset of ASCII).
This is the case for most computer operating systems. PCRE can, how-
ever, be compiled to run in an EBCDIC environment by adding
--enable-ebcdic
to the configure command. This setting implies
--enable-rebuild-chartables. You should use it only if you know that you
are in an EBCDIC
environment (for example, an IBM mainframe operating system). The
--enable-ebcdic option is incompatible with --enable-utf8.
PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
By default, pcregrep reads all files as plain text. You can build it so
that it recognizes files whose names end in .gz or .bz2, and reads them
with libz or libbz2, respectively, by adding one or both of
--enable-pcregrep-libz
--enable-pcregrep-libbz2
to the configure command. These options naturally require that the rel-
evant libraries are installed on your system. Configuration will fail
if they are not.
PCRETEST OPTION FOR LIBREADLINE SUPPORT
If you add
--enable-pcretest-libreadline
to the configure command, pcretest is linked with the libreadline
library, and when its input is from a terminal, it reads it using the
readline() function. This provides line-editing and history facilities.
Note that libreadline is GPL-licensed, so if you distribute a binary of
pcretest linked in this way, there may be licensing issues.
Setting this option causes the -lreadline option to be added to the
pcretest build. In many operating environments with a system-installed
libreadline this is sufficient. However, in some environments (e.g. if
an unmodified distribution version of readline is in use), some extra
configuration may be necessary. The INSTALL file for libreadline says
this:
"Readline uses the termcap functions, but does not link with the
termcap or curses library itself, allowing applications which link
with readline the to choose an appropriate library."
If your environment has not been set up so that an appropriate library
is automatically included, you may need to add something like
LIBS="-lncurses"
immediately before the configure command.
SEE ALSO
pcreapi(3), pcre_config(3).
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 29 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
PCREMATCHING(3) PCREMATCHING(3)
NAME
PCRE - Perl-compatible regular expressions
PCRE MATCHING ALGORITHMS
This document describes the two different algorithms that are available
in PCRE for matching a compiled regular expression against a given sub-
ject string. The "standard" algorithm is the one provided by the
pcre_exec() function. This works in the same way as Perl's matching
function, and provides a Perl-compatible matching operation.
An alternative algorithm is provided by the pcre_dfa_exec() function;
this operates in a different way, and is not Perl-compatible. It has
advantages and disadvantages compared with the standard algorithm, and
these are described below.
When there is only one possible way in which a given subject string can
match a pattern, the two algorithms give the same answer. A difference
arises, however, when there are multiple possibilities. For example, if
the pattern
^<.*>
is matched against the string
<something> <something else> <something further>
there are three possible answers. The standard algorithm finds only one
of them, whereas the alternative algorithm finds all three.
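The three answers can be enumerated by hand with a short sketch written
for this one pattern only. The function name is hypothetical and no
regular expression engine is involved: the code checks for a leading "<"
and records the length of the match ending at each later ">" on the same
line, since a dot does not match a newline by default.

```c
#include <assert.h>
#include <stddef.h>

/* For the specific pattern ^<.*> only: store in lengths (up to max
   entries) the length of every possible match at the start of subject,
   shortest first. Returns the total number of possible matches. */
size_t angle_matches(const char *subject, size_t *lengths, size_t max)
{
    size_t n = 0, i;

    assert(subject != NULL && (lengths != NULL || max == 0));
    if (subject[0] != '<') return 0;
    for (i = 1; subject[i] != '\0' && subject[i] != '\n'; i++) {
        if (subject[i] == '>') {
            if (n < max) lengths[n] = i + 1;  /* match covers 0..i */
            n++;
        }
    }
    return n;
}
```

For the subject string in the example the possible match lengths are 11,
28, and 48 characters. A greedy .* in the standard algorithm yields the
longest of these, and an ungreedy .*? would yield the shortest.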
REGULAR EXPRESSIONS AS TREES
The set of strings that are matched by a regular expression can be rep-
resented as a tree structure. An unlimited repetition in the pattern
makes the tree of infinite size, but it is still a tree. Matching the
pattern to a given subject string (from a given starting point) can be
thought of as a search of the tree. There are two ways to search a
tree: depth-first and breadth-first, and these correspond to the two
matching algorithms provided by PCRE.
THE STANDARD MATCHING ALGORITHM
In the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
sions", the standard algorithm is an "NFA algorithm". It conducts a
depth-first search of the pattern tree. That is, it proceeds along a
single path through the tree, checking that the subject matches what is
required. When there is a mismatch, the algorithm tries any alterna-
tives at the current point, and if they all fail, it backs up to the
previous branch point in the tree, and tries the next alternative
branch at that level. This often involves backing up (moving to the
left) in the subject string as well. The order in which repetition
branches are tried is controlled by the greedy or ungreedy nature of
the quantifier.
If a leaf node is reached, a matching string has been found, and at
that point the algorithm stops. Thus, if there is more than one possi-
ble match, this algorithm returns the first one that it finds. Whether
this is the shortest, the longest, or some intermediate length depends
on the way the greedy and ungreedy repetition quantifiers are specified
in the pattern.
Because it ends up with a single path through the tree, it is rela-
tively straightforward for this algorithm to keep track of the sub-
strings that are matched by portions of the pattern in parentheses.
This provides support for capturing parentheses and back references.
THE ALTERNATIVE MATCHING ALGORITHM
This algorithm conducts a breadth-first search of the tree. Starting
from the first matching point in the subject, it scans the subject
string from left to right, once, character by character, and as it does
this, it remembers all the paths through the tree that represent valid
matches. In Friedl's terminology, this is a kind of "DFA algorithm",
though it is not implemented as a traditional finite state machine (it
keeps multiple states active simultaneously).
Although the general principle of this matching algorithm is that it
scans the subject string only once, without backtracking, there is one
exception: when a lookaround assertion is encountered, the characters
following or preceding the current point have to be independently
inspected.
The scan continues until either the end of the subject is reached, or
there are no more unterminated paths. At this point, terminated paths
represent the different matching possibilities (if there are none, the
match has failed). Thus, if there is more than one possible match,
this algorithm finds all of them, and in particular, it finds the long-
est. There is an option to stop the algorithm after the first match
(which is necessarily the shortest) is found.
Note that all the matches that are found start at the same point in the
subject. If the pattern
cat(er(pillar)?)?
is matched against the string "the caterpillar catchment", the result
will be the three strings "cat", "cater", and "caterpillar" that start
at the fifth character of the subject (offset 4). The algorithm does not
automatically move on to find matches that start at later positions.
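A single left-to-right scan that keeps several paths alive at once can be
sketched for the caterpillar example by listing the strings that the
pattern accepts ("cat", "cater", "caterpillar") and treating each as one
path. This is a simplification made for illustration, with hypothetical
names; it is not how pcre_dfa_exec() is implemented, and it reports
matches in the order in which paths terminate (shortest first).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PATHS 8  /* hypothetical cap on simultaneously active paths */

/* Scan subject once, starting at offset start, keeping every path alive
   until it either terminates (a match) or fails. Stores the length of
   each match in lengths, which must have room for npaths entries, and
   returns the number of matches. */
size_t scan_once(const char *subject, size_t start,
                 const char *const *paths, size_t npaths, size_t *lengths)
{
    int alive[MAX_PATHS];
    size_t active = npaths, found = 0, pos, k;

    assert(subject != NULL && paths != NULL && lengths != NULL);
    assert(npaths <= MAX_PATHS && start <= strlen(subject));
    for (k = 0; k < npaths; k++) alive[k] = 1;

    /* Stop when no unterminated path remains or the subject ends. */
    for (pos = 0; active > 0; pos++) {
        char c = subject[start + pos];
        for (k = 0; k < npaths; k++) {
            if (!alive[k]) continue;
            if (paths[k][pos] == '\0') {          /* path terminated */
                lengths[found++] = pos;
                alive[k] = 0;
                active--;
            } else if (c == '\0' || paths[k][pos] != c) {
                alive[k] = 0;                     /* path failed */
                active--;
            }
        }
        if (c == '\0') break;
    }
    return found;
}
```

Started at offset 4 of "the caterpillar catchment", the scan reports
lengths 3, 5, and 11 without ever moving backwards in the subject, and it
does not go on to look at the later "cat" in "catchment".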
There are a number of features of PCRE regular expressions that are not
supported by the alternative matching algorithm. They are as follows:
1. Because the algorithm finds all possible matches, the greedy or
ungreedy nature of repetition quantifiers is not relevant. Greedy and
ungreedy quantifiers are treated in exactly the same way. However, pos-
sessive quantifiers can make a difference when what follows could also
match what is quantified, for example in a pattern like this:
^a++\w!
This pattern matches "aaab!" but not "aaa!", which would be matched by
a non-possessive quantifier. Similarly, if an atomic group is present,
it is matched as if it were a standalone pattern at the current point,
and the longest match is then "locked in" for the rest of the overall
pattern.
2. When dealing with multiple paths through the tree simultaneously, it
is not straightforward to keep track of captured substrings for the
different matching possibilities, and PCRE's implementation of this
algorithm does not attempt to do this. This means that no captured sub-
strings are available.
3. Because no substrings are captured, back references within the pat-
tern are not supported, and cause errors if encountered.
4. For the same reason, conditional expressions that use a backrefer-
ence as the condition or test for a specific group recursion are not
supported.
5. Because many paths through the tree may be active, the \K escape
sequence, which resets the start of the match when encountered (but may
be on some paths and not on others), is not supported. It causes an
error if encountered.
6. Callouts are supported, but the value of the capture_top field is
always 1, and the value of the capture_last field is always -1.
7. The \C escape sequence, which (in the standard algorithm) matches a
single byte, even in UTF-8 mode, is not supported because the alterna-
tive algorithm moves through the subject string one character at a
time, for all active paths through the tree.
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
are not supported. (*FAIL) is supported, and behaves like a failing
negative assertion.
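The possessive-quantifier example in item 1 above can be made concrete
with two hand-written matchers. This is a sketch in plain C, not PCRE
code: the function names are hypothetical, and \w is approximated as
ASCII letters, digits, and underscore in the default C locale. The
first function takes the whole run of "a" characters and never gives
any back, as ^a++\w! requires; the second may use any non-empty prefix
of the run, as ^a+\w! allows.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Approximation of \w: letter, digit, or underscore. */
static int is_word_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Does s continue at offset i with a word character and then '!'? */
static int tail_matches(const char *s, size_t i)
{
    /* If s[i] is the terminating zero, s[i + 1] is never read. */
    return is_word_char(s[i]) && s[i + 1] == '!';
}

/* ^a++\w! : the run of a's is consumed whole and never given back. */
static int match_possessive(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] == 'a')
        n++;
    return n > 0 && tail_matches(s, n);
}

/* ^a+\w! : any non-empty prefix of the run of a's may be used. */
static int match_non_possessive(const char *s)
{
    size_t n = 0, k;

    assert(s != NULL);
    while (s[n] == 'a')
        n++;
    for (k = n; k >= 1; k--)
        if (tail_matches(s, k))
            return 1;
    return 0;
}
```

With these definitions, match_possessive() accepts "aaab!" and rejects
"aaa!", whereas match_non_possessive() accepts both, because the last
"a" can serve as the word character.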
ADVANTAGES OF THE ALTERNATIVE ALGORITHM
Using the alternative matching algorithm provides the following advan-
tages:
1. All possible matches (at a single point in the subject) are automat-
ically found, and in particular, the longest match is found. To find
more than one match using the standard algorithm, you have to do kludgy
things with callouts.
2. Because the alternative algorithm scans the subject string just
once, and never needs to backtrack, it is possible to pass very long
subject strings to the matching function in several pieces, checking
for partial matching each time. The pcrepartial documentation gives
details of partial matching.
DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
The alternative algorithm suffers from a number of disadvantages:
1. It is substantially slower than the standard algorithm. This is
partly because it has to search for all possible matches, but is also
because it is less susceptible to optimization.
2. Capturing parentheses and back references are not supported.
3. Although atomic groups are supported, their use does not provide the
performance advantage that it does for the standard algorithm.
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 29 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
PCREAPI(3) PCREAPI(3)
NAME
PCRE - Perl-compatible regular expressions
PCRE NATIVE API
#include <pcre.h>
pcre *pcre_compile(const char *pattern, int options,
const char **errptr, int *erroffset,
const unsigned char *tableptr);
pcre *pcre_compile2(const char *pattern, int options,
int *errorcodeptr,
const char **errptr, int *erroffset,
const unsigned char *tableptr);
pcre_extra *pcre_study(const pcre *code, int options,
const char **errptr);
int pcre_exec(const pcre *code, const pcre_extra *extra,
const char *subject, int length, int startoffset,
int options, int *ovector, int ovecsize);
int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
const char *subject, int length, int startoffset,
int options, int *ovector, int ovecsize,
int *workspace, int wscount);
int pcre_copy_named_substring(const pcre *code,
const char *subject, int *ovector,
int stringcount, const char *stringname,
char *buffer, int buffersize);
int pcre_copy_substring(const char *subject, int *ovector,
int stringcount, int stringnumber, char *buffer,
int buffersize);
int pcre_get_named_substring(const pcre *code,
const char *subject, int *ovector,
int stringcount, const char *stringname,
const char **stringptr);
int pcre_get_stringnumber(const pcre *code,
const char *name);
int pcre_get_stringtable_entries(const pcre *code,
const char *name, char **first, char **last);
int pcre_get_substring(const char *subject, int *ovector,
int stringcount, int stringnumber,
const char **stringptr);
int pcre_get_substring_list(const char *subject,
int *ovector, int stringcount, const char ***listptr);
void pcre_free_substring(const char *stringptr);
void pcre_free_substring_list(const char **stringptr);
const unsigned char *pcre_maketables(void);
int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
int what, void *where);
int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
int pcre_refcount(pcre *code, int adjust);
int pcre_config(int what, void *where);
char *pcre_version(void);
void *(*pcre_malloc)(size_t);
void (*pcre_free)(void *);
void *(*pcre_stack_malloc)(size_t);
void (*pcre_stack_free)(void *);
int (*pcre_callout)(pcre_callout_block *);
PCRE API OVERVIEW
PCRE has its own native API, which is described in this document. There
are also some wrapper functions that correspond to the POSIX regular
expression API. These are described in the pcreposix documentation.
Both of these APIs define a set of C function calls. A C++ wrapper is
distributed with PCRE. It is documented in the pcrecpp page.
The native API C function prototypes are defined in the header file
pcre.h, and on Unix systems the library itself is called libpcre. It
can normally be accessed by adding -lpcre to the command for linking an
application that uses PCRE. The header file defines the macros
PCRE_MAJOR and PCRE_MINOR to contain the major and minor release num-
bers for the library. Applications can use these to include support
for different releases of PCRE.
The functions pcre_compile(), pcre_compile2(), pcre_study(), and
pcre_exec() are used for compiling and matching regular expressions in
a Perl-compatible manner. A sample program that demonstrates the sim-
plest way of using them is provided in the file called pcredemo.c in
the PCRE source distribution. A listing of this program is given in the
pcredemo documentation, and the pcresample documentation describes how
to compile and run it.
A second matching function, pcre_dfa_exec(), which is not Perl-compati-
ble, is also provided. This uses a different algorithm for the match-
ing. The alternative algorithm finds all possible matches (at a given
point in the subject), and scans the subject just once (unless there
are lookbehind assertions). However, this algorithm does not return
captured substrings. A description of the two matching algorithms and
their advantages and disadvantages is given in the pcrematching docu-
mentation.
In addition to the main compiling and matching functions, there are
convenience functions for extracting captured substrings from a subject
string that is matched by pcre_exec(). They are:
pcre_copy_substring()
pcre_copy_named_substring()
pcre_get_substring()
pcre_get_named_substring()
pcre_get_substring_list()
pcre_get_stringnumber()
pcre_get_stringtable_entries()
pcre_free_substring() and pcre_free_substring_list() are also provided,
to free the memory used for extracted strings.
The function pcre_maketables() is used to build a set of character
tables in the current locale for passing to pcre_compile(),
pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
provided for specialist use. Most commonly, no special tables are
passed, in which case internal tables that are generated when PCRE is
built are used.
The function pcre_fullinfo() is used to find out information about a
compiled pattern; pcre_info() is an obsolete version that returns only
some of the available information, but is retained for backwards com-
patibility. The function pcre_version() returns a pointer to a string
containing the version of PCRE and its date of release.
The function pcre_refcount() maintains a reference count in a data
block containing a compiled pattern. This is provided for the benefit
of object-oriented applications.
The global variables pcre_malloc and pcre_free initially contain the
entry points of the standard malloc() and free() functions, respec-
tively. PCRE calls the memory management functions via these variables,
so a calling program can replace them if it wishes to intercept the
calls. This should be done before calling any PCRE functions.
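A minimal sketch of such replacement functions follows. The names
counting_malloc() and counting_free() and the two counters are
hypothetical; only the function signatures are taken from the
declarations of pcre_malloc and pcre_free above. The block compiles
without pcre.h, so the assignments to the PCRE variables are shown in
a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static size_t live_blocks = 0;     /* obtained and not yet freed */
static size_t total_requests = 0;  /* allocation requests seen */

/* Same signature as the function pcre_malloc points to. */
static void *counting_malloc(size_t size)
{
    void *block = malloc(size);

    total_requests++;
    if (block != NULL)
        live_blocks++;
    return block;
}

/* Same signature as the function pcre_free points to. */
static void counting_free(void *block)
{
    if (block != NULL) {
        assert(live_blocks > 0);
        live_blocks--;
    }
    free(block);
}

/*
 * With <pcre.h> included, the hooks would be installed before any
 * other PCRE function is called:
 *
 *     pcre_malloc = counting_malloc;
 *     pcre_free   = counting_free;
 */
```

A program could check that live_blocks has returned to zero at exit as
a simple leak test for compiled patterns and study data.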
The global variables pcre_stack_malloc and pcre_stack_free are also
indirections to memory management functions. These special functions
are used only when PCRE is compiled to use the heap for remembering
data, instead of recursive function calls, when running the pcre_exec()
function. See the pcrebuild documentation for details of how to do
this. It is a non-standard way of building PCRE, for use in environ-
ments that have limited stacks. Because of the greater use of memory
management, it runs more slowly. Separate functions are provided so
that special-purpose external code can be used for this case. When
used, these functions are always called in a stack-like manner (last
obtained, first freed), and always for memory blocks of the same size.
There is a discussion about PCRE's stack usage in the pcrestack docu-
mentation.
The global variable pcre_callout initially contains NULL. It can be set
by the caller to a "callout" function, which PCRE will then call at
specified points during a matching operation. Details are given in the
pcrecallout documentation.
NEWLINES
PCRE supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (line-
feed) character, the two-character sequence CRLF, any of the three pre-
ceding, or any Unicode newline sequence. The Unicode newline sequences
are the three just mentioned, plus the single characters VT (vertical
tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line
separator, U+2028), and PS (paragraph separator, U+2029).
Each of the first three conventions is used by at least one operating
system as its standard newline sequence. When PCRE is built, a default
can be specified. The built-in default is LF, which is the Unix stan-
dard. When PCRE is run, the default can be overridden, either when a
pattern is compiled, or when it is matched.
At compile time, the newline convention can be specified by the options
argument of pcre_compile(), or it can be specified by special text at
the start of the pattern itself; this overrides any other settings. See
the pcrepattern page for details of the special character sequences.
In the PCRE documentation the word "newline" is used to mean "the char-
acter or pair of characters that indicate a line break". The choice of
newline convention affects the handling of the dot, circumflex, and
dollar metacharacters, the handling of #-comments in /x mode, and, when
CRLF is a recognized line ending sequence, the match position advance-
ment for a non-anchored pattern. There is more detail about this in the
section on pcre_exec() options below.
The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches,
which is controlled in a similar way, but by separate options.
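The first four conventions can be sketched as a small recognizer in
plain C. This is not PCRE's code: the enum and the function name
newline_length() are hypothetical, and the fifth convention (any
Unicode newline sequence) is left out because it needs UTF-8 decoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the first four conventions described above. */
enum newline_mode { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF };

/*
 * Returns the length in bytes of the newline that starts at s[pos]
 * under the given convention, or 0 if there is none at that position.
 */
static size_t newline_length(const char *s, size_t len, size_t pos,
                             enum newline_mode mode)
{
    int cr, lf, crlf;

    assert(s != NULL);
    if (pos >= len)
        return 0;
    cr = (s[pos] == '\r');
    lf = (s[pos] == '\n');
    crlf = cr && pos + 1 < len && s[pos + 1] == '\n';

    switch (mode) {
    case NL_CR:      return cr ? 1 : 0;
    case NL_LF:      return lf ? 1 : 0;
    case NL_CRLF:    return crlf ? 2 : 0;
    case NL_ANYCRLF: return crlf ? 2 : ((cr || lf) ? 1 : 0);
    }
    return 0;
}
```

Note that under NL_CRLF a lone CR or a lone LF is not a line break at
all, which is why the choice of convention changes where the dot,
circumflex, and dollar metacharacters can match.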
MULTITHREADING
The PCRE functions can be used in multithreaded applications, with
the proviso that the memory management functions pointed to by
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
callout function pointed to by pcre_callout, are shared by all threads.
The compiled form of a regular expression is not altered during match-
ing, so the same compiled pattern can safely be used by several threads
at once.
SAVING PRECOMPILED PATTERNS FOR LATER USE
The compiled form of a regular expression can be saved and re-used at a
later time, possibly by a different program, and even on a host other
than the one on which it was compiled. Details are given in the
pcreprecompile documentation. However, compiling a regular expression
with one version of PCRE for use with a different version is not guar-
anteed to work and may cause crashes.
CHECKING BUILD-TIME OPTIONS
int pcre_config(int what, void *where);
The function pcre_config() makes it possible for a PCRE client to dis-
cover which optional features have been compiled into the PCRE library.
The pcrebuild documentation has more details about these optional fea-
tures.
The first argument for pcre_config() is an integer, specifying which
information is required; the second argument is a pointer to a variable
into which the information is placed. The following information is
available:
PCRE_CONFIG_UTF8
The output is an integer that is set to one if UTF-8 support is avail-
able; otherwise it is set to zero.
PCRE_CONFIG_UNICODE_PROPERTIES
The output is an integer that is set to one if support for Unicode
character properties is available; otherwise it is set to zero.
PCRE_CONFIG_NEWLINE
The output is an integer whose value specifies the default character
sequence that is recognized as meaning "newline". The five values that
are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
and -1 for ANY. Though they are derived from ASCII, the same values
are returned in EBCDIC environments. The default should normally corre-
spond to the standard sequence for your operating system.
PCRE_CONFIG_BSR
The output is an integer whose value indicates what character sequences
the \R escape sequence matches by default. A value of 0 means that \R
matches any Unicode line ending sequence; a value of 1 means that \R
matches only CR, LF, or CRLF. The default can be overridden when a pat-
tern is compiled or matched.
PCRE_CONFIG_LINK_SIZE
The output is an integer that contains the number of bytes used for
internal linkage in compiled regular expressions. The value is 2, 3, or
4. Larger values allow larger regular expressions to be compiled, at
the expense of slower matching. The default value of 2 is sufficient
for all but the most massive patterns, since it allows the compiled
pattern to be up to 64K in size.
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
The output is an integer that contains the threshold above which the
POSIX interface uses malloc() for output vectors. Further details are
given in the pcreposix documentation.
PCRE_CONFIG_MATCH_LIMIT
The output is a long integer that gives the default limit for the num-
ber of internal matching function calls in a pcre_exec() execution.
Further details are given with pcre_exec() below.
PCRE_CONFIG_MATCH_LIMIT_RECURSION
The output is a long integer that gives the default limit for the depth
of recursion when calling the internal matching function in a
pcre_exec() execution. Further details are given with pcre_exec()
below.
PCRE_CONFIG_STACKRECURSE
The output is an integer that is set to one if internal recursion when
running pcre_exec() is implemented by recursive function calls that use
the stack to remember their state. This is the usual way that PCRE is
compiled. The output is zero if PCRE was compiled to use blocks of data
on the heap instead of recursive function calls. In this case,
pcre_stack_malloc and pcre_stack_free are called to manage memory
blocks on the heap, thus avoiding the use of the stack.
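The integer reported for PCRE_CONFIG_NEWLINE can be turned into a
readable name with a small helper. The sketch below is plain C and
does not call PCRE; the function names are hypothetical. It relies only
on the five values listed above, and shows that the CRLF value 3338 is
the CR code and the LF code packed into one integer.

```c
#include <assert.h>
#include <string.h>

/* Maps a PCRE_CONFIG_NEWLINE result to a name; "unknown" otherwise. */
static const char *newline_config_name(int value)
{
    /* CRLF is reported as CR (13) and LF (10) packed together. */
    assert(((13 << 8) | 10) == 3338);

    switch (value) {
    case 10:             return "LF";
    case 13:             return "CR";
    case (13 << 8) | 10: return "CRLF";
    case -2:             return "ANYCRLF";
    case -1:             return "ANY";
    default:             return "unknown";
    }
}

/* Nonzero if value is one of the five documented results. */
static int newline_config_is_known(int value)
{
    return strcmp(newline_config_name(value), "unknown") != 0;
}

/*
 * With <pcre.h> included, the value would be obtained like this:
 *
 *     int nl;
 *     if (pcre_config(PCRE_CONFIG_NEWLINE, &nl) == 0)
 *         puts(newline_config_name(nl));
 */
```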
COMPILING A PATTERN
pcre *pcre_compile(const char *pattern, int options,
const char **errptr, int *erroffset,
const unsigned char *tableptr);
pcre *pcre_compile2(const char *pattern, int options,
int *errorcodeptr,
const char **errptr, int *erroffset,
const unsigned char *tableptr);
Either of the functions pcre_compile() or pcre_compile2() can be called
to compile a pattern into an internal form. The only difference between
the two interfaces is that pcre_compile2() has an additional argument,
errorcodeptr, via which a numerical error code can be returned. To
avoid too much repetition, we refer just to pcre_compile() below, but
the information applies equally to pcre_compile2().
The pattern is a C string terminated by a binary zero, and is passed in
the pattern argument. A pointer to a single block of memory that is
obtained via pcre_malloc is returned. This contains the compiled code
and related data. The pcre type is defined for the returned block; this
is a typedef for a structure whose contents are not externally defined.
It is up to the caller to free the memory (via pcre_free) when it is no
longer required.
Although the compiled code of a PCRE regex is relocatable, that is, it
does not depend on memory location, the complete pcre data block is not
fully relocatable, because it may contain a copy of the tableptr argu-
ment, which is an address (see below).
The options argument contains various bit settings that affect the com-
pilation. It should be zero if no options are required. The available
options are described below. Some of them (in particular, those that
are compatible with Perl, but some others as well) can also be set and
unset from within the pattern (see the detailed description in the
pcrepattern documentation). For those options that can be different in
different parts of the pattern, the contents of the options argument
specifies their settings at the start of compilation and execution. The
PCRE_ANCHORED, PCRE_BSR_xxx, and PCRE_NEWLINE_xxx options can be set at
the time of matching as well as at compile time.
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
if compilation of a pattern fails, pcre_compile() returns NULL, and
sets the variable pointed to by errptr to point to a textual error mes-
sage. This is a static string that is part of the library. You must not
try to free it. The byte offset from the start of the pattern to the
character that was being processed when the error was discovered is
placed in the variable pointed to by erroffset, which must not be NULL.
If it is, an immediate error is given. Some errors are not detected
until checks are carried out when the whole pattern has been scanned;
in this case the offset is set to the end of the pattern.
If pcre_compile2() is used instead of pcre_compile(), and the error-
codeptr argument is not NULL, a non-zero error code number is returned
via this argument in the event of an error. This is in addition to the
textual error message. Error codes and messages are listed below.
If the final argument, tableptr, is NULL, PCRE uses a default set of
character tables that are built when PCRE is compiled, using the
default C locale. Otherwise, tableptr must be an address that is the
result of a call to pcre_maketables(). This value is stored with the
compiled pattern, and used again by pcre_exec(), unless another table
pointer is passed to it. For more discussion, see the section on locale
support below.
This code fragment shows a typical straightforward call to pcre_com-
pile():
pcre *re;
const char *error;
int erroffset;
re = pcre_compile(
  "^A.*Z",          /* the pattern */
  0,                /* default options */
  &error,           /* for error message */
  &erroffset,       /* for error offset */
  NULL);            /* use default character tables */
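When such a call returns NULL, the message and offset can be shown to
the user. The helper below is a sketch in plain C with a hypothetical
name, format_compile_error(); it does not call PCRE. It assumes a
pattern of single-byte characters, so that the byte offset lines up
with a column on the screen.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Writes a three-line report into buf: the message, the pattern, and
 * a caret under the byte at erroffset. Returns what snprintf returns.
 */
static int format_compile_error(char *buf, size_t bufsize,
                                const char *pattern, const char *error,
                                int erroffset)
{
    assert(buf != NULL && pattern != NULL && error != NULL);
    assert(erroffset >= 0 && (size_t)erroffset <= strlen(pattern));

    /* "%*s" with an empty string prints erroffset spaces. */
    return snprintf(buf, bufsize,
                    "compile failed at offset %d: %s\n%s\n%*s^\n",
                    erroffset, error, pattern, erroffset, "");
}
```

The caret lands after the last character when the offset equals the
pattern length, which is the case for errors that are detected only
after the whole pattern has been scanned.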
The following names for option bits are defined in the pcre.h header
file:
PCRE_ANCHORED
If this bit is set, the pattern is forced to be "anchored", that is, it
is constrained to match only at the first matching point in the string
that is being searched (the "subject string"). This effect can also be
achieved by appropriate constructs in the pattern itself, which is the
only way to do it in Perl.
PCRE_AUTO_CALLOUT
If this bit is set, pcre_compile() automatically inserts callout items,
all with number 255, before each pattern item. For discussion of the
callout facility, see the pcrecallout documentation.
PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE
These options (which are mutually exclusive) control what the \R escape
sequence matches. The choice is either to match only CR, LF, or CRLF,
or to match any Unicode newline sequence. The default is specified when
PCRE is built. It can be overridden from within the pattern, or by set-
ting an option when a compiled pattern is matched.
PCRE_CASELESS
If this bit is set, letters in the pattern match both upper and lower
case letters. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
always understands the concept of case for characters whose values are
less than 128, so caseless matching is always possible. For characters
with higher values, the concept of case is supported if PCRE is com-
piled with Unicode property support, but not otherwise. If you want to
use caseless matching for characters 128 and above, you must ensure
that PCRE is compiled with Unicode property support as well as with
UTF-8 support.
PCRE_DOLLAR_ENDONLY
If this bit is set, a dollar metacharacter in the pattern matches only
at the end of the subject string. Without this option, a dollar also
matches immediately before a newline at the end of the string (but not
before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
if PCRE_MULTILINE is set. There is no equivalent to this option in
Perl, and no way to set it within a pattern.
PCRE_DOTALL
If this bit is set, a dot metacharacter in the pattern matches all char-
acters, including those that indicate newline. Without it, a dot does
not match when the current position is at a newline. This option is
equivalent to Perl's /s option, and it can be changed within a pattern
by a (?s) option setting. A negative class such as [^a] always matches
newline characters, independent of the setting of this option.
PCRE_DUPNAMES
If this bit is set, names used to identify capturing subpatterns need
not be unique. This can be helpful for certain types of pattern when it
is known that only one instance of the named subpattern can ever be
matched. There are more details of named subpatterns below; see also
the pcrepattern documentation.
PCRE_EXTENDED
If this bit is set, whitespace data characters in the pattern are
totally ignored except when escaped or inside a character class. White-
space does not include the VT character (code 11). In addition, charac-
ters between an unescaped # outside a character class and the next new-
line, inclusive, are also ignored. This is equivalent to Perl's /x
option, and it can be changed within a pattern by a (?x) option set-
ting.
This option makes it possible to include comments inside complicated
patterns. Note, however, that this applies only to data characters.
Whitespace characters may never appear within special character
sequences in a pattern, for example within the sequence (?( which
introduces a conditional subpattern.
PCRE_EXTRA
This option was invented in order to turn on additional functionality
of PCRE that is incompatible with Perl, but it is currently of very
little use. When set, any backslash in a pattern that is followed by a
letter that has no special meaning causes an error, thus reserving
these combinations for future expansion. By default, as in Perl, a
backslash followed by a letter with no special meaning is treated as a
literal. (Perl can, however, be persuaded to give a warning for this.)
There are at present no other features controlled by this option. It
can also be set by a (?X) option setting within a pattern.
PCRE_FIRSTLINE
If this option is set, an unanchored pattern is required to match
before or at the first newline in the subject string, though the
matched text may continue over the newline.
PCRE_JAVASCRIPT_COMPAT
If this option is set, PCRE's behaviour is changed in some ways so that
it is compatible with JavaScript rather than Perl. The changes are as
follows:
(1) A lone closing square bracket in a pattern causes a compile-time
error, because this is illegal in JavaScript (by default it is treated
as a data character). Thus, the pattern AB]CD becomes illegal when this
option is set.
(2) At run time, a back reference to an unset subpattern group matches
an empty string (by default this causes the current matching alterna-
tive to fail). A pattern such as (\1)(a) succeeds when this option is
set (assuming it can find an "a" in the subject), whereas it fails by
default, for Perl compatibility.
PCRE_MULTILINE
By default, PCRE treats the subject string as consisting of a single
line of characters (even if it actually contains newlines). The "start
of line" metacharacter (^) matches only at the start of the string,
while the "end of line" metacharacter ($) matches only at the end of
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
is set). This is the same as Perl.
When PCRE_MULTILINE is set, the "start of line" and "end of line"
constructs match immediately following or immediately before internal
newlines in the subject string, respectively, as well as at the very
start and end. This is equivalent to Perl's /m option, and it can be
changed within a pattern by a (?m) option setting. If there are no new-
lines in a subject string, or no occurrences of ^ or $ in a pattern,
setting PCRE_MULTILINE has no effect.
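The rules for where a dollar may match, as described here and under
PCRE_DOLLAR_ENDONLY, can be summarised in a few lines of plain C. This
is a sketch, not PCRE's implementation: the function name
dollar_matches() is hypothetical, the two flags are plain integers
rather than option bits, and LF is assumed to be the newline
convention.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns nonzero if a dollar metacharacter would match at byte
 * offset pos of a subject of length len, assuming LF newlines.
 */
static int dollar_matches(const char *subject, size_t len, size_t pos,
                          int multiline, int dollar_endonly)
{
    assert(subject != NULL && pos <= len);

    if (pos == len)
        return 1;               /* the very end always matches */
    if (subject[pos] != '\n')
        return 0;               /* otherwise a newline must follow */
    if (multiline)
        return 1;               /* any newline; ENDONLY is ignored */
    if (dollar_endonly)
        return 0;               /* only the very end is allowed */
    return pos + 1 == len;      /* only a terminating newline */
}
```

For the six-byte subject "ab\ncd\n", offset 5 matches by default but
not with dollar_endonly set, and offset 2 matches only when multiline
is set.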
PCRE_NEWLINE_CR
PCRE_NEWLINE_LF
PCRE_NEWLINE_CRLF
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY
These options override the default newline definition that was chosen
when PCRE was built. Setting the first or the second specifies that a
newline is indicated by a single character (CR or LF, respectively).
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
that any of the three preceding sequences should be recognized. Setting
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
recognized. The Unicode newline sequences are the three just mentioned,
plus the single characters VT (vertical tab, U+000B), FF (formfeed,
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
(paragraph separator, U+2029). The last two are recognized only in
UTF-8 mode.
The newline setting in the options word uses three bits that are
treated as a number, giving eight possibilities. Currently only six are
used (default plus the five values above). This means that if you set
more than one newline option, the combination may or may not be sensi-
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and
cause an error.
The only time that a line break is specially recognized when compiling
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a
character class is encountered. This indicates a comment that lasts
until after the next line break sequence. In other circumstances, line
break sequences are treated as literal data, except that in
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
and are therefore ignored.
The newline option that is set at compile time becomes the default that
is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
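The three-bit encoding of the newline setting can be explored with the
following sketch. The numeric values are an assumption: they are
believed to mirror the PCRE_NEWLINE_xxx definitions in the pcre.h of
this release, but they are restated here under hypothetical NL_OPT_
names so that the block compiles on its own. Check your own header
before relying on them.

```c
#include <assert.h>

/* Assumed bit values; see the PCRE_NEWLINE_xxx macros in pcre.h. */
#define NL_OPT_CR       0x00100000
#define NL_OPT_LF       0x00200000
#define NL_OPT_CRLF     0x00300000
#define NL_OPT_ANY      0x00400000
#define NL_OPT_ANYCRLF  0x00500000
#define NL_OPT_MASK     0x00700000

/*
 * Extracts the newline setting from an options word as a number.
 * 0 means "use the default"; 1 to 5 are the five conventions.
 * Returns -1 for the two numbers that are currently unused.
 */
static int newline_number(int options)
{
    int n = (options & NL_OPT_MASK) >> 20;

    /* Holds for the assumed values: CR and LF together give CRLF. */
    assert((NL_OPT_CR | NL_OPT_LF) == NL_OPT_CRLF);

    return (n <= 5) ? n : -1;
}
```

Under these assumed values, combining the LF and ANY bits yields the
unused number 6, which is the kind of combination that is rejected,
while combining CR and ANY happens to yield the ANYCRLF number.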
PCRE_NO_AUTO_CAPTURE
If this option is set, it disables the use of numbered capturing paren-
theses in the pattern. Any opening parenthesis that is not followed by
? behaves as if it were followed by ?: but named parentheses can still
be used for capturing (and they acquire numbers in the usual way).
There is no equivalent of this option in Perl.
PCRE_UNGREEDY
This option inverts the "greediness" of the quantifiers so that they
are not greedy by default, but become greedy if followed by "?". It is
not compatible with Perl. It can also be set by a (?U) option setting
within the pattern.
PCRE_UTF8
This option causes PCRE to regard both the pattern and the subject as
strings of UTF-8 characters instead of single-byte character strings.
However, it is available only when PCRE is built to include UTF-8 sup-
port. If not, the use of this option provokes an error. Details of how
this option changes the behaviour of PCRE are given in the section on
UTF-8 support in the main pcre page.
PCRE_NO_UTF8_CHECK
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
automatically checked. There is a discussion about the validity of
UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of
bytes is found, pcre_compile() returns an error. If you already know
that your pattern is valid, and you want to skip this check for perfor-
mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is
set, the effect of passing an invalid UTF-8 string as a pattern is
undefined. It may cause your program to crash. Note that this option
can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
UTF-8 validity checking of subject strings.
COMPILATION ERROR CODES
The following table lists the error codes that may be returned by
pcre_compile2(), along with the error messages that may be returned by
both compiling functions. As PCRE has developed, some error codes have
fallen out of use. To avoid confusion, they have not been re-used.
0 no error
1 \ at end of pattern
2 \c at end of pattern
3 unrecognized character follows \
4 numbers out of order in {} quantifier
5 number too big in {} quantifier
6 missing terminating ] for character class
7 invalid escape sequence in character class
8 range out of order in character class
9 nothing to repeat
10 [this code is not in use]
11 internal error: unexpected repeat
12 unrecognized character after (? or (?-
13 POSIX named classes are supported only within a class
14 missing )
15 reference to non-existent subpattern
16 erroffset passed as NULL
17 unknown option bit(s) set
18 missing ) after comment
19 [this code is not in use]
20 regular expression is too large
21 failed to get memory
22 unmatched parentheses
23 internal error: code overflow
24 unrecognized character after (?<
25 lookbehind assertion is not fixed length
26 malformed number or name after (?(
27 conditional group contains more than two branches
28 assertion expected after (?(
29 (?R or (?[+-]digits must be followed by )
30 unknown POSIX class name
31 POSIX collating elements are not supported
32 this version of PCRE is not compiled with PCRE_UTF8 support
33 [this code is not in use]
34 character value in \x{...} sequence is too large
35 invalid condition (?(0)
36 \C not allowed in lookbehind assertion
37 PCRE does not support \L, \l, \N, \U, or \u
38 number after (?C is > 255
39 closing ) for (?C expected
40 recursive call could loop indefinitely
41 unrecognized character after (?P
42 syntax error in subpattern name (missing terminator)
43 two named subpatterns have the same name
44 invalid UTF-8 string
45 support for \P, \p, and \X has not been compiled
46 malformed \P or \p sequence
47 unknown property name after \P or \p
48 subpattern name is too long (maximum 32 characters)
49 too many named subpatterns (maximum 10000)
50 [this code is not in use]
51 octal value is greater than \377 (not in UTF-8 mode)
52 internal error: overran compiling workspace
53 internal error: previously-checked referenced subpattern not
found
54 DEFINE group contains more than one branch
55 repeating a DEFINE group is not allowed
56 inconsistent NEWLINE options
57 \g is not followed by a braced, angle-bracketed, or quoted
name/number or by a plain number
58 a numbered reference must not be zero
59 (*VERB) with an argument is not supported
60 (*VERB) not recognized
61 number is too big
62 subpattern name expected
63 digit expected after (?+
64 ] is an invalid data character in JavaScript compatibility mode
The numbers 32 and 10000 in errors 48 and 49 are defaults; different
values may be used if the limits were changed when PCRE was built.
STUDYING A PATTERN
pcre_extra *pcre_study(const pcre *code, int options,
const char **errptr);
If a compiled pattern is going to be used several times, it is worth
spending more time analyzing it in order to reduce the time taken for
matching. The function pcre_study() takes a pointer to a compiled pat-
tern as its first argument. If studying the pattern produces additional
information that will help speed up matching, pcre_study() returns a
pointer to a pcre_extra block, in which the study_data field points to
the results of the study.
The returned value from pcre_study() can be passed directly to
pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also con-
tains other fields that can be set by the caller before the block is
passed; these are described below in the section on matching a pattern.
If studying the pattern does not produce any useful information,
pcre_study() returns NULL. In that circumstance, if the calling program
wants to pass any of the other fields to pcre_exec() or
pcre_dfa_exec(), it must set up its own pcre_extra block.
The second argument of pcre_study() contains option bits. At present,
no options are defined, and this argument should always be zero.
The third argument for pcre_study() is a pointer for an error message.
If studying succeeds (even if no data is returned), the variable it
points to is set to NULL. Otherwise it is set to point to a textual
error message. This is a static string that is part of the library. You
must not try to free it. You should test the error pointer for NULL
after calling pcre_study(), to be sure that it has run successfully.
This is a typical call to pcre_study():
pcre_extra *pe;
pe = pcre_study(
re, /* result of pcre_compile() */
0, /* no options exist */
&error); /* set to NULL or points to a message */
Studying a pattern does two things: first, a lower bound for the length
of subject string that is needed to match the pattern is computed. This
does not mean that there are any strings of that length that match, but
it does guarantee that no shorter strings match. The value is used by
pcre_exec() and pcre_dfa_exec() to avoid wasting time by trying to
match strings that are shorter than the lower bound. You can find out
the value in a calling program via the pcre_fullinfo() function.
Studying a pattern is also useful for non-anchored patterns that do not
have a single fixed starting character. A bitmap of possible starting
bytes is created. This speeds up finding a position in the subject at
which to start matching.
LOCALE SUPPORT
PCRE handles caseless matching, and determines whether characters are
letters, digits, or whatever, by reference to a set of tables, indexed
by character value. When running in UTF-8 mode, this applies only to
characters with codes less than 128. Higher-valued codes never match
escapes such as \w or \d, but can be tested with \p if PCRE is built
with Unicode character property support. The use of locales with Uni-
code is discouraged. If you are handling characters with codes greater
than 128, you should either use UTF-8 and Unicode, or use locales, but
not try to mix the two.
PCRE contains an internal set of tables that are used when the final
argument of pcre_compile() is NULL. These are sufficient for many
applications. Normally, the internal tables recognize only ASCII char-
acters. However, when PCRE is built, it is possible to cause the inter-
nal tables to be rebuilt in the default "C" locale of the local system,
which may cause them to be different.
The internal tables can always be overridden by tables supplied by the
application that calls PCRE. These may be created in a different locale
from the default. As more and more applications change to using Uni-
code, the need for this locale support is expected to die away.
External tables are built by calling the pcre_maketables() function,
which has no arguments, in the relevant locale. The result can then be
passed to pcre_compile() or pcre_exec() as often as necessary. For
example, to build and use tables that are appropriate for the French
locale (where accented characters with values greater than 128 are
treated as letters), the following code could be used:
setlocale(LC_CTYPE, "fr_FR");
tables = pcre_maketables();
re = pcre_compile(..., tables);
The locale name "fr_FR" is used on Linux and other Unix-like systems;
if you are using Windows, the name for the French locale is "french".
When pcre_maketables() runs, the tables are built in memory that is
obtained via pcre_malloc. It is the caller's responsibility to ensure
that the memory containing the tables remains available for as long as
it is needed.
The pointer that is passed to pcre_compile() is saved with the compiled
pattern, and the same tables are used via this pointer by pcre_study()
and normally also by pcre_exec(). Thus, by default, for any single pat-
tern, compilation, studying and matching all happen in the same locale,
but different patterns can be compiled in different locales.
It is possible to pass a table pointer or NULL (indicating the use of
the internal tables) to pcre_exec(). Although not intended for this
purpose, this facility could be used to match a pattern in a different
locale from the one in which it was compiled. Passing table pointers at
run time is discussed below in the section on matching a pattern.
INFORMATION ABOUT A PATTERN
int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
int what, void *where);
The pcre_fullinfo() function returns information about a compiled pat-
tern. It replaces the obsolete pcre_info() function, which is neverthe-
less retained for backwards compatibility (and is documented below).
The first argument for pcre_fullinfo() is a pointer to the compiled
pattern. The second argument is the result of pcre_study(), or NULL if
the pattern was not studied. The third argument specifies which piece
of information is required, and the fourth argument is a pointer to a
variable to receive the data. The yield of the function is zero for
success, or one of the following negative numbers:
PCRE_ERROR_NULL the argument code was NULL
the argument where was NULL
PCRE_ERROR_BADMAGIC the "magic number" was not found
PCRE_ERROR_BADOPTION the value of what was invalid
The "magic number" is placed at the start of each compiled pattern as
a simple check against passing an arbitrary memory pointer. Here is a
typical call of pcre_fullinfo(), to obtain the length of the compiled
pattern:
int rc;
size_t length;
rc = pcre_fullinfo(
re, /* result of pcre_compile() */
pe, /* result of pcre_study(), or NULL */
PCRE_INFO_SIZE, /* what is required */
&length); /* where to put the data */
The possible values for the third argument are defined in pcre.h, and
are as follows:
PCRE_INFO_BACKREFMAX
Return the number of the highest back reference in the pattern. The
fourth argument should point to an int variable. Zero is returned if
there are no back references.
PCRE_INFO_CAPTURECOUNT
Return the number of capturing subpatterns in the pattern. The fourth
argument should point to an int variable.
PCRE_INFO_DEFAULT_TABLES
Return a pointer to the internal default character tables within PCRE.
The fourth argument should point to an unsigned char * variable. This
information call is provided for internal use by the pcre_study() func-
tion. External callers can cause PCRE to use its internal tables by
passing a NULL table pointer.
PCRE_INFO_FIRSTBYTE
Return information about the first byte of any matched string, for a
non-anchored pattern. The fourth argument should point to an int vari-
able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
is still recognized for backwards compatibility.)
If there is a fixed first byte, for example, from a pattern such as
(cat|cow|coyote), its value is returned. Otherwise, if either
(a) the pattern was compiled with the PCRE_MULTILINE option, and every
branch starts with "^", or
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
set (if it were set, the pattern would be anchored),
-1 is returned, indicating that the pattern matches only at the start
of a subject string or after any newline within the string. Otherwise
-2 is returned. For anchored patterns, -2 is returned.
PCRE_INFO_FIRSTTABLE
If the pattern was studied, and this resulted in the construction of a
256-bit table indicating a fixed set of bytes for the first byte in any
matching string, a pointer to the table is returned. Otherwise NULL is
returned. The fourth argument should point to an unsigned char * vari-
able.
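As a rough illustration of how such a 256-bit table can be consulted, the
sketch below tests whether a given byte value is marked as a possible
first byte. The helper name may_start_with() is hypothetical, and the bit
layout (bit c % 8 of byte c / 8 stands for byte value c) is an assumption
that should be checked against the PCRE source before being relied on.

```c
/* Hypothetical helper, not part of the PCRE API.
 * table points to the 32-byte (256-bit) map returned for
 * PCRE_INFO_FIRSTTABLE. ASSUMPTION: bit (c % 8) of byte (c / 8)
 * represents byte value c. Returns 1 if c is marked, otherwise 0. */
int may_start_with(const unsigned char *table, unsigned char c)
{
    return (table[c / 8] >> (c % 8)) & 1;
}
```

A caller would first check that the pointer returned by pcre_fullinfo()
is not NULL, since no table exists for an unstudied pattern.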
PCRE_INFO_HASCRORLF
Return 1 if the pattern contains any explicit matches for CR or LF
characters, otherwise 0. The fourth argument should point to an int
variable. An explicit match is either a literal CR or LF character, or
\r or \n.
PCRE_INFO_JCHANGED
Return 1 if the (?J) or (?-J) option setting is used in the pattern,
otherwise 0. The fourth argument should point to an int variable. (?J)
and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
PCRE_INFO_LASTLITERAL
Return the value of the rightmost literal byte that must exist in any
matched string, other than at its start, if such a byte has been
recorded. The fourth argument should point to an int variable. If there
is no such byte, -1 is returned. For anchored patterns, a last literal
byte is recorded only if it follows something of variable length. For
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
/^a\dz\d/ the returned value is -1.
PCRE_INFO_MINLENGTH
If the pattern was studied and a minimum length for matching subject
strings was computed, its value is returned. Otherwise the returned
value is -1. The value is a number of characters, not bytes (this may
be relevant in UTF-8 mode). The fourth argument should point to an int
variable. A non-negative value is a lower bound to the length of any
matching string. There may not be any strings of that length that do
actually match, but every string that does match is at least that long.
PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE
PCRE supports the use of named as well as numbered capturing parenthe-
ses. The names are just an additional way of identifying the parenthe-
ses, which still acquire numbers. Several convenience functions such as
pcre_get_named_substring() are provided for extracting captured sub-
strings by name. It is also possible to extract the data directly, by
first converting the name to a number in order to access the correct
pointers in the output vector (described with pcre_exec() below). To do
the conversion, you need to use the name-to-number map, which is
described by these three values.
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
of each entry; both of these return an int value. The entry size
depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
a pointer to the first entry of the table (a pointer to char). The
first two bytes of each entry are the number of the capturing parenthe-
sis, most significant byte first. The rest of the entry is the corre-
sponding name, zero terminated.
The names are in alphabetical order. Duplicate names may appear if (?|
is used to create multiple groups with the same number, as described in
the section on duplicate subpattern numbers in the pcrepattern page.
Duplicate names for subpatterns with different numbers are permitted
only if PCRE_DUPNAMES is set. In all cases of duplicate names, they
appear in the table in the order in which they were found in the pat-
tern. In the absence of (?| this is the order of increasing number;
when (?| is used this is not necessarily the case because later subpat-
terns may have lower numbers.
As a simple example of the name/number table, consider the following
pattern (assume PCRE_EXTENDED is set, so white space - including new-
lines - is ignored):
(?<date> (?<year>(\d\d)?\d\d) -
(?<month>\d\d) - (?<day>\d\d) )
There are four named subpatterns, so the table has four entries, and
each entry in the table is eight bytes long. The table is as follows,
with non-printing bytes shown in hexadecimal, and undefined bytes shown
as ??:
00 01 d a t e 00 ??
00 05 d a y 00 ?? ??
00 04 m o n t h 00
00 02 y e a r 00 ??
When writing code to extract data from named subpatterns using the
name-to-number map, remember that the length of the entries is likely
to be different for each compiled pattern.
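The entry layout described above can be walked with a few lines of C. The
following is a minimal sketch, not library code: find_group_number() is a
hypothetical helper, and its three table arguments are assumed to have
been obtained beforehand with PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT
and PCRE_INFO_NAMEENTRYSIZE. It does a plain linear search and returns
the first entry with a matching name.

```c
#include <string.h>

/* Hypothetical helper, not part of the PCRE API.
 * table      - pointer to the first entry of the name table
 * count      - number of entries
 * entry_size - size of each entry in bytes
 * Returns the capturing parenthesis number, or -1 if name is absent. */
int find_group_number(const char *table, int count, int entry_size,
                      const char *name)
{
    int i;
    for (i = 0; i < count; i++) {
        const unsigned char *entry =
            (const unsigned char *)table + (size_t)i * (size_t)entry_size;
        /* The name starts at byte 2 and is zero terminated. */
        if (strcmp((const char *)entry + 2, name) == 0) {
            /* Bytes 0 and 1 hold the number, most significant first. */
            return (entry[0] << 8) | entry[1];
        }
    }
    return -1;
}
```

Because the entry size is passed in rather than hard-coded, the same
helper works for patterns whose longest name differs in length.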
PCRE_INFO_OKPARTIAL
Return 1 if the pattern can be used for partial matching with
pcre_exec(), otherwise 0. The fourth argument should point to an int
variable. From release 8.00, this always returns 1, because the
restrictions that previously applied to partial matching have been
lifted. The pcrepartial documentation gives details of partial match-
ing.
PCRE_INFO_OPTIONS
Return a copy of the options with which the pattern was compiled. The
fourth argument should point to an unsigned long int variable. These
option bits are those specified in the call to pcre_compile(), modified
by any top-level option settings at the start of the pattern itself. In
other words, they are the options that will be in force when matching
starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with
the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
and PCRE_EXTENDED.
A pattern is automatically anchored by PCRE if all of its top-level
alternatives begin with one of the following:
^ unless PCRE_MULTILINE is set
\A always
\G always
.* if PCRE_DOTALL is set and there are no back
references to the subpattern in which .* appears
For such patterns, the PCRE_ANCHORED bit is set in the options returned
by pcre_fullinfo().
PCRE_INFO_SIZE
Return the size of the compiled pattern, that is, the value that was
passed as the argument to pcre_malloc() when PCRE was getting memory in
which to place the compiled data. The fourth argument should point to a
size_t variable.
PCRE_INFO_STUDYSIZE
Return the size of the data block pointed to by the study_data field in
a pcre_extra block. That is, it is the value that was passed to
pcre_malloc() when PCRE was getting memory into which to place the data
created by pcre_study(). If pcre_extra is NULL, or there is no study
data, zero is returned. The fourth argument should point to a size_t
variable.
OBSOLETE INFO FUNCTION
int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
The pcre_info() function is now obsolete because its interface is too
restrictive to return all the available data about a compiled pattern.
New programs should use pcre_fullinfo() instead. The yield of
pcre_info() is the number of capturing subpatterns, or one of the fol-
lowing negative numbers:
PCRE_ERROR_NULL the argument code was NULL
PCRE_ERROR_BADMAGIC the "magic number" was not found
If the optptr argument is not NULL, a copy of the options with which
the pattern was compiled is placed in the integer it points to (see
PCRE_INFO_OPTIONS above).
If the pattern is not anchored and the firstcharptr argument is not
NULL, it is used to pass back information about the first character of
any matched string (see PCRE_INFO_FIRSTBYTE above).
REFERENCE COUNTS
int pcre_refcount(pcre *code, int adjust);
The pcre_refcount() function is used to maintain a reference count in
the data block that contains a compiled pattern. It is provided for the
benefit of applications that operate in an object-oriented manner,
where different parts of the application may be using the same compiled
pattern, but you want to free the block when they are all done.
When a pattern is compiled, the reference count field is initialized to
zero. It is changed only by calling this function, whose action is to
add the adjust value (which may be positive or negative) to it. The
yield of the function is the new value. However, the value of the count
is constrained to lie between 0 and 65535, inclusive. If the new value
is outside these limits, it is forced to the appropriate limit value.
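The clamping rule can be written out as plain arithmetic. This is only an
illustration of the behaviour described above, under the assumption that
the adjustment is applied first and the result is then forced into range;
adjusted_refcount() is a hypothetical name and is not how the library
itself is implemented.

```c
/* Illustration only, not PCRE source.
 * Returns the count that would result from adding adjust to current,
 * forced into the range 0..65535 inclusive. */
int adjusted_refcount(int current, int adjust)
{
    long long value = (long long)current + adjust;
    if (value < 0) value = 0;
    if (value > 65535) value = 65535;
    return (int)value;
}
```

One consequence worth noting is that a decrement from zero leaves the
count at zero, so unbalanced releases are not detectable from the count
alone.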
Except when it is zero, the reference count is not correctly preserved
if a pattern is compiled on one host and then transferred to a host
whose byte-order is different. (This seems a highly unlikely scenario.)
MATCHING A PATTERN: THE TRADITIONAL FUNCTION
int pcre_exec(const pcre *code, const pcre_extra *extra,
const char *subject, int length, int startoffset,
int options, int *ovector, int ovecsize);
The function pcre_exec() is called to match a subject string against a
compiled pattern, which is passed in the code argument. If the pattern
was studied, the result of the study should be passed in the extra
argument. This function is the main matching facility of the library,
and it operates in a Perl-like manner. For specialist use there is also
an alternative matching function, which is described below in the sec-
tion about the pcre_dfa_exec() function.
In most applications, the pattern will have been compiled (and option-
ally studied) in the same process that calls pcre_exec(). However, it
is possible to save compiled patterns and study data, and then use them
later in different processes, possibly even on different hosts. For a
discussion about this, see the pcreprecompile documentation.
Here is an example of a simple call to pcre_exec():
int rc;
int ovector[30];
rc = pcre_exec(
re, /* result of pcre_compile() */
NULL, /* we didn't study the pattern */
"some string", /* the subject string */
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
ovector, /* vector of integers for substring information */
30); /* number of elements (NOT size in bytes) */
Extra data for pcre_exec()
If the extra argument is not NULL, it must point to a pcre_extra data
block. The pcre_study() function returns such a block (when it doesn't
return NULL), but you can also create one for yourself, and pass addi-
tional information in it. The pcre_extra block contains the following
fields (not necessarily in this order):
unsigned long int flags;
void *study_data;
unsigned long int match_limit;
unsigned long int match_limit_recursion;
void *callout_data;
const unsigned char *tables;
The flags field is a bitmap that specifies which of the other fields
are set. The flag bits are:
PCRE_EXTRA_STUDY_DATA
PCRE_EXTRA_MATCH_LIMIT
PCRE_EXTRA_MATCH_LIMIT_RECURSION
PCRE_EXTRA_CALLOUT_DATA
PCRE_EXTRA_TABLES
Other flag bits should be set to zero. The study_data field is set in
the pcre_extra block that is returned by pcre_study(), together with
the appropriate flag bit. You should not set this yourself, but you may
add to the block by setting the other fields and their corresponding
flag bits.
The match_limit field provides a means of preventing PCRE from using up
a vast amount of resources when running patterns that are not going to
match, but which have a very large number of possibilities in their
search trees. The classic example is a pattern that uses nested unlim-
ited repeats.
Internally, PCRE uses a function called match() which it calls repeat-
edly (sometimes recursively). The limit set by match_limit is imposed
on the number of times this function is called during a match, which
has the effect of limiting the amount of backtracking that can take
place. For patterns that are not anchored, the count restarts from zero
for each position in the subject string.
The default value for the limit can be set when PCRE is built; the
default default is 10 million, which handles all but the most extreme
cases. You can override the default by supplying pcre_exec() with a
pcre_extra block in which match_limit is set, and
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
The match_limit_recursion field is similar to match_limit, but instead
of limiting the total number of times that match() is called, it limits
the depth of recursion. The recursion depth is a smaller number than
the total number of calls, because not all calls to match() are recur-
sive. This limit is of use only if it is set smaller than match_limit.
Limiting the recursion depth limits the amount of stack that can be
used, or, when PCRE has been compiled to use memory on the heap instead
of the stack, the amount of heap memory that can be used.
The default value for match_limit_recursion can be set when PCRE is
built; the default default is the same value as the default for
match_limit. You can override the default by supplying pcre_exec() with
a pcre_extra block in which match_limit_recursion is set, and
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the
limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
The callout_data field is used in conjunction with the "callout" fea-
ture, and is described in the pcrecallout documentation.
The tables field is used to pass a character tables pointer to
pcre_exec(); this overrides the value that is stored with the compiled
pattern. A non-NULL value is stored with the compiled pattern only if
custom tables were supplied to pcre_compile() via its tableptr argu-
ment. If NULL is passed to pcre_exec() using this mechanism, it forces
PCRE's internal tables to be used. This facility is helpful when re-
using patterns that have been saved after compiling with an external
set of tables, because the external tables might be at a different
address when pcre_exec() is called. See the pcreprecompile documenta-
tion for a discussion of saving compiled patterns for later use.
Option bits for pcre_exec()
The unused bits of the options argument for pcre_exec() must be zero.
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and
PCRE_PARTIAL_HARD.
PCRE_ANCHORED
The PCRE_ANCHORED option limits pcre_exec() to matching at the first
matching position. If a pattern was compiled with PCRE_ANCHORED, or
turned out to be anchored by virtue of its contents, it cannot be made
unanchored at matching time.
PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE
These options (which are mutually exclusive) control what the \R escape
sequence matches. The choice is either to match only CR, LF, or CRLF,
or to match any Unicode newline sequence. These options override the
choice that was made or defaulted when the pattern was compiled.
PCRE_NEWLINE_CR
PCRE_NEWLINE_LF
PCRE_NEWLINE_CRLF
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY
These options override the newline definition that was chosen or
defaulted when the pattern was compiled. For details, see the descrip-
tion of pcre_compile() above. During matching, the newline choice
affects the behaviour of the dot, circumflex, and dollar metacharac-
ters. It may also alter the way the match position is advanced after a
match failure for an unanchored pattern.
When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is
set, and a match attempt for an unanchored pattern fails when the cur-
rent position is at a CRLF sequence, and the pattern contains no
explicit matches for CR or LF characters, the match position is
advanced by two characters instead of one, in other words, to after the
CRLF.
The above rule is a compromise that makes the most common cases work as
expected. For example, if the pattern is .+A (and the PCRE_DOTALL
option is not set), it does not match the string "\r\nA" because, after
failing at the start, it skips both the CR and the LF before retrying.
However, the pattern [\r\n]A does match that string, because it con-
tains an explicit CR or LF reference, and so advances only by one char-
acter after the first failure.
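The advance rule can be sketched as a small function. This is an
illustration of the rule as stated, not PCRE's internal code: the name
next_start_offset() and both flag parameters are hypothetical, and the
sketch advances by single bytes, so it ignores multi-byte characters in
UTF-8 mode.

```c
/* Illustration only, not PCRE source.
 * crlf_is_newline   - nonzero if PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF
 *                     or PCRE_NEWLINE_ANY is in force
 * has_explicit_crlf - nonzero if the pattern contains an explicit match
 *                     for CR or LF (what PCRE_INFO_HASCRORLF reports)
 * Returns the offset at which the next match attempt would start after
 * a failure at pos. */
int next_start_offset(const char *subject, int length, int pos,
                      int crlf_is_newline, int has_explicit_crlf)
{
    if (crlf_is_newline && !has_explicit_crlf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* skip over the whole CRLF */
    return pos + 1;       /* ordinary one-character advance */
}
```

For the subject "\r\nA", a failure at offset 0 moves the start to offset
2 for the pattern .+A, but only to offset 1 for [\r\n]A.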
An explicit match for CR or LF is either a literal appearance of one of
those characters, or one of the \r or \n escape sequences. Implicit
matches such as [^X] do not count, nor does \s (which includes CR and
LF in the characters that it matches).
Notwithstanding the above, anomalous effects may still occur when CRLF
is a valid newline sequence and explicit \r or \n escapes appear in the
pattern.
PCRE_NOTBOL
This option specifies that the first character of the subject string is not
the beginning of a line, so the circumflex metacharacter should not
match before it. Setting this without PCRE_MULTILINE (at compile time)
causes circumflex never to match. This option affects only the behav-
iour of the circumflex metacharacter. It does not affect \A.
PCRE_NOTEOL
This option specifies that the end of the subject string is not the end
of a line, so the dollar metacharacter should not match it nor (except
in multiline mode) a newline immediately before it. Setting this with-
out PCRE_MULTILINE (at compile time) causes dollar never to match. This
option affects only the behaviour of the dollar metacharacter. It does
not affect \Z or \z.
PCRE_NOTEMPTY
An empty string is not considered to be a valid match if this option is
set. If there are alternatives in the pattern, they are tried. If all
the alternatives match the empty string, the entire match fails. For
example, if the pattern
a?b?
is applied to a string not beginning with "a" or "b", it matches an
empty string at the start of the subject. With PCRE_NOTEMPTY set, this
match is not valid, so PCRE searches further into the string for occur-
rences of "a" or "b".
PCRE_NOTEMPTY_ATSTART
This is like PCRE_NOTEMPTY, except that an empty string match that is
not at the start of the subject is permitted. If the pattern is
anchored, such a match can occur only if the pattern contains \K.
Perl has no direct equivalent of PCRE_NOTEMPTY or
PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern
match of the empty string within its split() function, and when using
the /g modifier. It is possible to emulate Perl's behaviour after
matching a null string by first trying the match again at the same off-
set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that
fails, by advancing the starting offset (see below) and trying an ordi-
nary match again. There is some code that demonstrates how to do this
in the pcredemo sample program.
PCRE_NO_START_OPTIMIZE
There are a number of optimizations that pcre_exec() uses at the start
of a match, in order to speed up the process. For example, if it is
known that a match must start with a specific character, it searches
the subject for that character, and fails immediately if it cannot find
it, without actually running the main matching function. When callouts
are in use, these optimizations can cause them to be skipped. This
option disables the "start-up" optimizations, causing performance to
suffer, but ensuring that the callouts do occur.
PCRE_NO_UTF8_CHECK
When PCRE_UTF8 is set at compile time, the validity of the subject as a
UTF-8 string is automatically checked when pcre_exec() is subsequently
called. The value of startoffset is also checked to ensure that it
points to the start of a UTF-8 character. There is a discussion about
the validity of UTF-8 strings in the section on UTF-8 support in the
main pcre page. If an invalid UTF-8 sequence of bytes is found,
pcre_exec() returns the error PCRE_ERROR_BADUTF8. If startoffset con-
tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.
If you already know that your subject is valid, and you want to skip
these checks for performance reasons, you can set the
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
do this for the second and subsequent calls to pcre_exec() if you are
making repeated calls to find all the matches in a single subject
string. However, you should be sure that the value of startoffset
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
set, the effect of passing an invalid UTF-8 string as a subject, or a
value of startoffset that does not point to the start of a UTF-8 char-
acter, is undefined. Your program may crash.
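If PCRE_NO_UTF8_CHECK is set, a caller that computes startoffset itself
may want a cheap check of its own. The sketch below, with the
hypothetical name is_utf8_char_start(), only tests that the byte at the
offset is not a UTF-8 continuation byte (one of the form 10xxxxxx). It
does not validate the subject as a whole, so it is no substitute for the
full check on untrusted input.

```c
/* Hypothetical helper, not part of the PCRE API.
 * Returns 1 if offset lies at the start of a character, or exactly at
 * the end of the subject; returns 0 if it is out of range or points
 * at a continuation byte. */
int is_utf8_char_start(const char *subject, int length, int offset)
{
    if (offset < 0 || offset > length) return 0;
    if (offset == length) return 1;
    /* Continuation bytes have the top two bits set to 10. */
    return (((unsigned char)subject[offset]) & 0xC0) != 0x80;
}
```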
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
These options turn on the partial matching feature. For backwards com-
patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
match occurs if the end of the subject string is reached successfully,
but there are not enough subject characters to complete the match. If
this happens when PCRE_PARTIAL_HARD is set, pcre_exec() immediately
returns PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set,
matching continues by testing any other alternatives. Only if they all
fail is PCRE_ERROR_PARTIAL returned (instead of PCRE_ERROR_NOMATCH).
The portion of the string that was inspected when the partial match was
found is set as the first matching string. There is a more detailed
discussion in the pcrepartial documentation.
The string to be matched by pcre_exec()
The subject string is passed to pcre_exec() as a pointer in subject, a
length (in bytes) in length, and a starting byte offset in startoffset.
In UTF-8 mode, the byte offset must point to the start of a UTF-8 char-
acter. Unlike the pattern string, the subject may contain binary zero
bytes. When the starting offset is zero, the search for a match starts
at the beginning of the subject, and this is by far the most common
case.
A non-zero starting offset is useful when searching for another match
in the same subject by calling pcre_exec() again after a previous suc-
cess. Setting startoffset differs from just passing over a shortened
string and setting PCRE_NOTBOL in the case of a pattern that begins
with any kind of lookbehind. For example, consider the pattern
\Biss\B
which finds occurrences of "iss" in the middle of words. (\B matches
only if the current position in the subject is not a word boundary.)
When applied to the string "Mississippi" the first call to pcre_exec()
finds the first occurrence. If pcre_exec() is called again with just
the remainder of the subject, namely "issippi", it does not match,
because \B is always false at the start of the subject, which is deemed
to be a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second occur-
rence of "iss" because it is able to look behind the starting point to
discover that it is preceded by a letter.
If a non-zero starting offset is passed when the pattern is anchored,
one attempt to match at the given offset is made. This can only succeed
if the pattern does not require the match to be at the start of the
subject.
How pcre_exec() returns captured substrings
In general, a pattern matches a certain portion of the subject, and in
addition, further substrings from the subject may be picked out by
parts of the pattern. Following the usage in Jeffrey Friedl's book,
this is called "capturing" in what follows, and the phrase "capturing
subpattern" is used for a fragment of a pattern that picks out a sub-
string. PCRE supports several other kinds of parenthesized subpattern
that do not cause substrings to be captured.
Captured substrings are returned to the caller via a vector of integers
whose address is passed in ovector. The number of elements in the vec-
tor is passed in ovecsize, which must be a non-negative number. Note:
this argument is NOT the size of ovector in bytes.
The first two-thirds of the vector is used to pass back captured sub-
strings, each substring using a pair of integers. The remaining third
of the vector is used as workspace by pcre_exec() while matching cap-
turing subpatterns, and is not available for passing back information.
The number passed in ovecsize should always be a multiple of three. If
it is not, it is rounded down.
When a match is successful, information about captured substrings is
returned in pairs of integers, starting at the beginning of ovector,
and continuing up to two-thirds of its length at the most. The first
element of each pair is set to the byte offset of the first character
in a substring, and the second is set to the byte offset of the first
character after the end of a substring. Note: these values are always
byte offsets, even in UTF-8 mode. They are not character counts.
The first pair of integers, ovector[0] and ovector[1], identify the
portion of the subject string matched by the entire pattern. The next
pair is used for the first capturing subpattern, and so on. The value
returned by pcre_exec() is one more than the highest numbered pair that
has been set. For example, if two substrings have been captured, the
returned value is 3. If there are no capturing subpatterns, the return
value from a successful match is 1, indicating that just the first pair
of offsets has been set.
If a capturing subpattern is matched repeatedly, it is the last portion
of the string that it matched that is returned.
If the vector is too small to hold all the captured substring offsets,
it is used as far as possible (up to two-thirds of its length), and the
function returns a value of zero. If the substring offsets are not of
interest, pcre_exec() may be called with ovector passed as NULL and
ovecsize as zero. However, if the pattern contains back references and
the ovector is not big enough to remember the related substrings, PCRE
has to get additional memory for use during matching. Thus it is usu-
ally advisable to supply an ovector.
The pcre_fullinfo() function can be used to find out how many capturing
subpatterns there are in a compiled pattern. The smallest size for
ovector that will allow for n captured substrings, in addition to the
offsets of the substring matched by the whole pattern, is (n+1)*3.
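The sizing rules above reduce to two small pieces of arithmetic. The sketch below only illustrates them; both function names are hypothetical, and in a real program n would typically come from pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT.

```c
#include <assert.h>

/* Smallest ovecsize that leaves room for the whole-match pair plus
   n capturing subpatterns: (n+1)*3 elements. */
int ovecsize_for_captures(int n)
{
    assert(n >= 0);
    return (n + 1) * 3;
}

/* Number of offset pairs that can be passed back for a given ovecsize.
   Only the first two-thirds of the vector carries results, and a size
   that is not a multiple of three is rounded down. */
int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

A pattern with two capturing subpatterns therefore needs at least nine elements, and a vector of ten elements holds no more pairs than one of nine.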
It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For example,
if the string "abc" is matched against the pattern (a|(z))(bc) the
return from the function is 4, and subpatterns 1 and 3 are matched, but
2 is not. When this happens, both values in the offset pairs corre-
sponding to unused subpatterns are set to -1.
Offset values that correspond to unused subpatterns at the end of the
expression are also set to -1. For example, if the string "abc" is
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
matched. The return from the function is 2, because the highest used
capturing subpattern number is 1. However, you can refer to the offsets
for the second and third capturing subpatterns if you wish (assuming
the vector is large enough, of course).
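As an illustration of how the pairs and the -1 markers can be read, here is a minimal sketch that walks the vector after a successful match. It does not call PCRE; the function names are hypothetical, and the caller is assumed to pass the value returned by pcre_exec() as rc.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if capturing subpattern n has offsets set, 0 if it is
   unset (both offsets are -1). Hypothetical helper. */
int group_is_set(const int *ovector, int n)
{
    assert(ovector != NULL && n >= 0);
    return ovector[2 * n] >= 0;
}

/* Print every pair up to the returned count rc, and return how many
   of them are set. */
int report_groups(const char *subject, const int *ovector, int rc)
{
    int set = 0;
    for (int i = 0; i < rc; i++) {
        if (!group_is_set(ovector, i)) {
            printf("%d: <unset>\n", i);
            continue;
        }
        int start = ovector[2 * i];
        int len = ovector[2 * i + 1] - start;  /* end is one past the last byte */
        printf("%d: %.*s\n", i, len, subject + start);
        set++;
    }
    return set;
}
```

For "abc" matched against (a|(z))(bc), the vector described above begins 0,3, 0,1, -1,-1, 1,3 with a return value of 4, so three of the four pairs are reported as set.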
Some convenience functions are provided for extracting the captured
substrings as separate strings. These are described below.
Error return values from pcre_exec()
If pcre_exec() fails, it returns a negative number. The following are
defined in the header file:
PCRE_ERROR_NOMATCH (-1)
The subject string did not match the pattern.
PCRE_ERROR_NULL (-2)
Either code or subject was passed as NULL, or ovector was NULL and
ovecsize was not zero.
PCRE_ERROR_BADOPTION (-3)
An unrecognized bit was set in the options argument.
PCRE_ERROR_BADMAGIC (-4)
PCRE stores a 4-byte "magic number" at the start of the compiled code,
to catch the case when it is passed a junk pointer and to detect when a
pattern that was compiled in an environment of one endianness is run in
an environment with the other endianness. This is the error that PCRE
gives when the magic number is not present.
PCRE_ERROR_UNKNOWN_OPCODE (-5)
While running the pattern match, an unknown item was encountered in the
compiled pattern. This error could be caused by a bug in PCRE or by
overwriting of the compiled pattern.
PCRE_ERROR_NOMEMORY (-6)
If a pattern contains back references, but the ovector that is passed
to pcre_exec() is not big enough to remember the referenced substrings,
PCRE gets a block of memory at the start of matching to use for this
purpose. If the call via pcre_malloc() fails, this error is given. The
memory is automatically freed at the end of matching.
PCRE_ERROR_NOSUBSTRING (-7)
This error is used by the pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() functions (see below). It is never
returned by pcre_exec().
PCRE_ERROR_MATCHLIMIT (-8)
The backtracking limit, as specified by the match_limit field in a
pcre_extra structure (or defaulted) was reached. See the description
above.
PCRE_ERROR_CALLOUT (-9)
This error is never generated by pcre_exec() itself. It is provided for
use by callout functions that want to yield a distinctive error code.
See the pcrecallout documentation for details.
PCRE_ERROR_BADUTF8 (-10)
A string that contains an invalid UTF-8 byte sequence was passed as a
subject.
PCRE_ERROR_BADUTF8_OFFSET (-11)
The UTF-8 byte sequence that was passed as a subject was valid, but the
value of startoffset did not point to the beginning of a UTF-8 charac-
ter.
PCRE_ERROR_PARTIAL (-12)
The subject string did not match, but it did match partially. See the
pcrepartial documentation for details of partial matching.
PCRE_ERROR_BADPARTIAL (-13)
This code is no longer in use. It was formerly returned when the
PCRE_PARTIAL option was used with a compiled pattern containing items
that were not supported for partial matching. From release 8.00
onwards, there are no restrictions on partial matching.
PCRE_ERROR_INTERNAL (-14)
An unexpected internal error has occurred. This error could be caused
by a bug in PCRE or by overwriting of the compiled pattern.
PCRE_ERROR_BADCOUNT (-15)
This error is given if the value of the ovecsize argument is negative.
PCRE_ERROR_RECURSIONLIMIT (-21)
The internal recursion limit, as specified by the match_limit_recursion
field in a pcre_extra structure (or defaulted) was reached. See the
description above.
PCRE_ERROR_BADNEWLINE (-23)
An invalid combination of PCRE_NEWLINE_xxx options was given.
Error numbers -16 to -20 and -22 are not used by pcre_exec().
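A caller usually needs to separate a few of these outcomes from the rest. The sketch below shows one possible grouping; the enum and function are hypothetical, and the numeric values are copied from the list above so that the fragment stands alone. Real code would include pcre.h and use the symbolic names.

```c
#include <assert.h>

/* Hypothetical grouping of pcre_exec() return values. */
enum exec_outcome {
    EXEC_MATCHED,      /* rc > 0: rc offset pairs were set */
    EXEC_VECTOR_FULL,  /* rc == 0: matched, but ovector was too small */
    EXEC_NO_MATCH,     /* -1  PCRE_ERROR_NOMATCH */
    EXEC_PARTIAL,      /* -12 PCRE_ERROR_PARTIAL */
    EXEC_LIMIT_HIT,    /* -8 MATCHLIMIT or -21 RECURSIONLIMIT */
    EXEC_FAILED        /* any other negative value */
};

enum exec_outcome classify_exec_result(int rc)
{
    if (rc > 0)
        return EXEC_MATCHED;
    if (rc == 0)
        return EXEC_VECTOR_FULL;
    assert(rc < 0);    /* only error codes reach this point */
    switch (rc) {
    case -1:  return EXEC_NO_MATCH;
    case -12: return EXEC_PARTIAL;
    case -8:
    case -21: return EXEC_LIMIT_HIT;
    default:  return EXEC_FAILED;
    }
}
```

Treating the two limit errors separately is a design choice of this sketch: they often indicate a pathological pattern or subject rather than a programming error.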
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
int pcre_copy_substring(const char *subject, int *ovector,
int stringcount, int stringnumber, char *buffer,
int buffersize);
int pcre_get_substring(const char *subject, int *ovector,
int stringcount, int stringnumber,
const char **stringptr);
int pcre_get_substring_list(const char *subject,
int *ovector, int stringcount, const char ***listptr);
Captured substrings can be accessed directly by using the offsets
returned by pcre_exec() in ovector. For convenience, the functions
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub-
string_list() are provided for extracting captured substrings as new,
separate, zero-terminated strings. These functions identify substrings
by number. The next section describes functions for extracting named
substrings.
A substring that contains a binary zero is correctly extracted and has
a further zero added on the end, but the result is not, of course, a C
string. However, you can process such a string by referring to the
length that is returned by pcre_copy_substring() and pcre_get_sub-
string(). Unfortunately, the interface to pcre_get_substring_list() is
not adequate for handling strings containing binary zeros, because the
end of the final string is not independently indicated.
The first three arguments are the same for all three of these func-
tions: subject is the subject string that has just been successfully
matched, ovector is a pointer to the vector of integer offsets that was
passed to pcre_exec(), and stringcount is the number of substrings that
were captured by the match, including the substring that matched the
entire regular expression. This is the value returned by pcre_exec() if
it is greater than zero. If pcre_exec() returned zero, indicating that
it ran out of space in ovector, the value passed as stringcount should
be the number of elements in the vector divided by three.
The functions pcre_copy_substring() and pcre_get_substring() extract a
single substring, whose number is given as stringnumber. A value of
zero extracts the substring that matched the entire pattern, whereas
higher values extract the captured substrings. For pcre_copy_sub-
string(), the string is placed in buffer, whose length is given by
buffersize, while for pcre_get_substring() a new block of memory is
obtained via pcre_malloc, and its address is returned via stringptr.
The yield of the function is the length of the string, not including
the terminating zero, or one of these error codes:
PCRE_ERROR_NOMEMORY (-6)
The buffer was too small for pcre_copy_substring(), or the attempt to
get memory failed for pcre_get_substring().
PCRE_ERROR_NOSUBSTRING (-7)
There is no substring whose number is stringnumber.
The pcre_get_substring_list() function extracts all available sub-
strings and builds a list of pointers to them. All this is done in a
single block of memory that is obtained via pcre_malloc. The address of
the memory block is returned via listptr, which is also the start of
the list of string pointers. The end of the list is marked by a NULL
pointer. The yield of the function is zero if all went well, or the
error code
PCRE_ERROR_NOMEMORY (-6)
if the attempt to get the memory block failed.
When any of these functions encounters a substring that is unset, which
can happen when capturing subpattern number n+1 matches some part of
the subject, but subpattern n has not been used at all, they return an
empty string. This can be distinguished from a genuine zero-length sub-
string by inspecting the appropriate offset in ovector, which is nega-
tive for unset substrings.
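The behaviour described in this section can be modelled directly on the offsets. The following sketch is a hypothetical stand-in that follows the documented results of pcre_copy_substring(); it is not the library's implementation, and the error values -6 and -7 are copied from the list above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in that mirrors the documented behaviour of
   pcre_copy_substring(). Returns the substring length, or a negative
   error value. */
int copy_by_offsets(const char *subject, const int *ovector,
                    int stringcount, int stringnumber,
                    char *buffer, int buffersize)
{
    assert(subject != NULL && ovector != NULL && buffer != NULL);
    if (stringnumber < 0 || stringnumber >= stringcount)
        return -7;                          /* PCRE_ERROR_NOSUBSTRING */
    int start = ovector[2 * stringnumber];
    int end = ovector[2 * stringnumber + 1];
    int length = (start < 0) ? 0 : end - start;  /* unset gives "" */
    if (buffersize < length + 1)
        return -6;                          /* PCRE_ERROR_NOMEMORY */
    if (length > 0)
        memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';                  /* extra zero on the end */
    return length;
}
```

Because the copy uses memcpy() and the length is returned, a substring containing binary zeros survives intact. An unset substring yields length zero, and checking ovector[2*stringnumber] for a negative value separates it from a genuinely empty match.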
The two convenience functions pcre_free_substring() and pcre_free_sub-
string_list() can be used to free the memory returned by a previous
call of pcre_get_substring() or pcre_get_substring_list(), respec-
tively. They do nothing more than call the function pointed to by
pcre_free, which of course could be called directly from a C program.
However, PCRE is used in some situations where it is linked via a spe-
cial interface to another programming language that cannot use
pcre_free directly; it is for these cases that the functions are pro-
vided.
EXTRACTING CAPTURED SUBSTRINGS BY NAME
int pcre_get_stringnumber(const pcre *code,
const char *name);
int pcre_copy_named_substring(const pcre *code,
const char *subject, int *ovector,
int stringcount, const char *stringname,
char *buffer, int buffersize);
int pcre_get_named_substring(const pcre *code,
const char *subject, int *ovector,
int stringcount, const char *stringname,
const char **stringptr);
To extract a substring by name, you first have to find the associated
number. For example, for this pattern
(a+)b(?<xxx>\d+)...
the number of the subpattern called "xxx" is 2. If the name is known to
be unique (PCRE_DUPNAMES was not set), you can find the number from the
name by calling pcre_get_stringnumber(). The first argument is the com-
piled pattern, and the second is the name. The yield of the function is
the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
subpattern of that name.
Given the number, you can extract the substring directly, or use one of
the functions described in the previous section. For convenience, there
are also two functions that do the whole job.
Most of the arguments of pcre_copy_named_substring() and
pcre_get_named_substring() are the same as those for the similarly
named functions that extract by number. As these are described in the
previous section, they are not re-described here. There are just two
differences:
First, instead of a substring number, a substring name is given. Sec-
ond, there is an extra argument, given at the start, which is a pointer
to the compiled pattern. This is needed in order to gain access to the
name-to-number translation table.
These functions call pcre_get_stringnumber(), and if it succeeds, they
then call pcre_copy_substring() or pcre_get_substring(), as appropri-
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
behaviour may not be what you want (see the next section).
Warning: If the pattern uses the (?| feature to set up multiple subpat-
terns with the same number, as described in the section on duplicate
subpattern numbers in the pcrepattern page, you cannot use names to
distinguish the different subpatterns, because names are not included
in the compiled code. The matching process uses only numbers. For this
reason, the use of different names for subpatterns of the same number
causes an error at compile time.
DUPLICATE SUBPATTERN NAMES
int pcre_get_stringtable_entries(const pcre *code,
const char *name, char **first, char **last);
When a pattern is compiled with the PCRE_DUPNAMES option, names for
subpatterns are not required to be unique. (Duplicate names are always
allowed for subpatterns with the same number, created by using the (?|
feature. Indeed, if such subpatterns are named, they are required to
use the same names.)
Normally, patterns with duplicate names are such that in any one match,
only one of the named subpatterns participates. An example is shown in
the pcrepattern documentation.
When duplicates are present, pcre_copy_named_substring() and
pcre_get_named_substring() return the first substring corresponding to
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
(-7) is returned; no data is returned. The pcre_get_stringnumber()
function returns one of the numbers that are associated with the name,
but it is not defined which it is.
If you want to get full details of all captured substrings for a given
name, you must use the pcre_get_stringtable_entries() function. The
first argument is the compiled pattern, and the second is the name. The
third and fourth are pointers to variables which are updated by the
function. After it has run, they point to the first and last entries in
the name-to-number table for the given name. The function itself
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
there are none. The format of the table is described above in the sec-
tion entitled Information about a pattern. Given all the relevant
entries for the name, you can extract each of their numbers, and hence
the captured data, if any.
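Walking the entries might look like the sketch below. It assumes the entry layout given in the section on information about a pattern (a two-byte subpattern number, most significant byte first, followed by the zero-terminated name), and that first, last, and entrysize are the values produced by pcre_get_stringtable_entries(). The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Walk the name-table entries from first to last (inclusive) and store
   each subpattern number in out[]. Returns how many were stored.
   Hypothetical helper; the entry layout is an assumption stated in the
   accompanying text. */
int collect_group_numbers(const char *first, const char *last,
                          int entrysize, int *out, int max)
{
    int count = 0;
    assert(first != NULL && last != NULL && out != NULL);
    assert(entrysize > 2);
    for (const char *entry = first; entry <= last && count < max;
         entry += entrysize) {
        /* Two-byte number, most significant byte first; name follows. */
        out[count++] = ((unsigned char)entry[0] << 8) |
                       (unsigned char)entry[1];
    }
    return count;
}
```

Each collected number n can then be checked against ovector[2*n] to see whether that particular subpattern was set in the match.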
FINDING ALL POSSIBLE MATCHES
The traditional matching function uses a similar algorithm to Perl,
which stops when it finds the first match, starting at a given point in
the subject. If you want to find all possible matches, or the longest
possible match, consider using the alternative matching function (see
below) instead. If you cannot use the alternative function, but still
need to find all possible matches, you can kludge it up by making use
of the callout facility, which is described in the pcrecallout documen-
tation.
What you have to do is to insert a callout right at the end of the pat-
tern. When your callout function is called, extract and save the cur-
rent matched substring. Then return 1, which forces pcre_exec() to
backtrack and try other alternatives. Ultimately, when it runs out of
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
const char *subject, int length, int startoffset,
int options, int *ovector, int ovecsize,
int *workspace, int wscount);
The function pcre_dfa_exec() is called to match a subject string
against a compiled pattern, using a matching algorithm that scans the
subject string just once, and does not backtrack. This has different
characteristics to the normal algorithm, and is not compatible with
Perl. Some of the features of PCRE patterns are not supported. Never-
theless, there are times when this kind of matching can be useful. For
a discussion of the two matching algorithms, and a list of features
that pcre_dfa_exec() does not support, see the pcrematching documenta-
tion.
The arguments for the pcre_dfa_exec() function are the same as for
pcre_exec(), plus two extras. The ovector argument is used in a differ-
ent way, and this is described below. The other common arguments are
used in the same way as for pcre_exec(), so their description is not
repeated here.
The two additional arguments provide workspace for the function. The
workspace vector should contain at least 20 elements. It is used for
keeping track of multiple paths through the pattern tree. More
workspace will be needed for patterns and subjects where there are a
lot of potential matches.
Here is an example of a simple call to pcre_dfa_exec():
int rc;
int ovector[10];
int wspace[20];
rc = pcre_dfa_exec(
  re,             /* result of pcre_compile() */
  NULL,           /* we didn't study the pattern */
  "some string",  /* the subject string */
  11,             /* the length of the subject string */
  0,              /* start at offset 0 in the subject */
  0,              /* default options */
  ovector,        /* vector of integers for substring information */
  10,             /* number of elements (NOT size in bytes) */
  wspace,         /* working space vector */
  20);            /* number of elements (NOT size in bytes) */
Option bits for pcre_dfa_exec()
The unused bits of the options argument for pcre_dfa_exec() must be
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, PCRE_PAR-
TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
four of these are exactly the same as for pcre_exec(), so their
description is not repeated here.
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
These have the same general effect as they do for pcre_exec(), but the
details are slightly different. When PCRE_PARTIAL_HARD is set for
pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-
ject is reached and there is still at least one matching possibility
that requires additional characters. This happens even if some complete
matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
of the subject is reached, there have been no complete matches, but
there is still at least one matching possibility. The portion of the
string that was inspected when the longest partial match was found is
set as the first matching string in both cases.
PCRE_DFA_SHORTEST
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
stop as soon as it has found one match. Because of the way the alterna-
tive algorithm works, this is necessarily the shortest possible match
at the first possible matching point in the subject string.
PCRE_DFA_RESTART
When pcre_dfa_exec() returns a partial match, it is possible to call it
again, with additional subject characters, and have it continue with
the same match. The PCRE_DFA_RESTART option requests this action; when
it is set, the workspace and wscount arguments must reference the same
vector as before because data about the match so far is left in them
after a partial match. There is more discussion of this facility in the
pcrepartial documentation.
Successful returns from pcre_dfa_exec()
When pcre_dfa_exec() succeeds, it may have matched more than one sub-
string in the subject. Note, however, that all the matches from one run
of the function start at the same point in the subject. The shorter
matches are all initial substrings of the longer matches. For example,
if the pattern
<.*>
is matched against the string
This is <something> <something else> <something further> no more
the three matched strings are
<something>
<something> <something else>
<something> <something else> <something further>
On success, the yield of the function is a number greater than zero,
which is the number of matched substrings. The substrings themselves
are returned in ovector. Each string uses two elements; the first is
the offset to the start, and the second is the offset to the end. In
fact, all the strings have the same start offset. (Space could have
been saved by giving this only once, but it was decided to retain some
compatibility with the way pcre_exec() returns data, even though the
meaning of the strings is different.)
The strings are returned in reverse order of length; that is, the long-
est matching string is given first. If there were too many matches to
fit into ovector, the yield of the function is zero, and the vector is
filled with the longest matches.
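Reading the results is a matter of indexing pairs, with index 0 holding the longest match. The sketch below is illustrative only and does not call PCRE; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the i-th match reported by pcre_dfa_exec().
   Index 0 is the longest match; index rc-1 is the shortest.
   Hypothetical helper. */
int dfa_match_length(const int *ovector, int rc, int i)
{
    assert(ovector != NULL && i >= 0 && i < rc);
    return ovector[2 * i + 1] - ovector[2 * i];
}
```

For the subject string in the example above, all three matches start at byte offset 8 and, counting by hand, end at offsets 56, 36, and 19, so the vector would begin 8,56, 8,36, 8,19 and the lengths are 48, 28, and 11.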
Error returns from pcre_dfa_exec()
The pcre_dfa_exec() function returns a negative number when it fails.
Many of the errors are the same as for pcre_exec(), and these are
described above. There are in addition the following errors that are
specific to pcre_dfa_exec():
PCRE_ERROR_DFA_UITEM (-16)
This return is given if pcre_dfa_exec() encounters an item in the pat-
tern that it does not support, for instance, the use of \C or a back
reference.
PCRE_ERROR_DFA_UCOND (-17)
This return is given if pcre_dfa_exec() encounters a condition item
that uses a back reference for the condition, or a test for recursion
in a specific group. These are not supported.
PCRE_ERROR_DFA_UMLIMIT (-18)
This return is given if pcre_dfa_exec() is called with an extra block
that contains a setting of the match_limit field. This is not supported
(it is meaningless).
PCRE_ERROR_DFA_WSSIZE (-19)
This return is given if pcre_dfa_exec() runs out of space in the
workspace vector.
PCRE_ERROR_DFA_RECURSE (-20)
When a recursive subpattern is processed, the matching function calls
itself recursively, using private vectors for ovector and workspace.
This error is given if the output vector is not large enough. This
should be extremely rare, as a vector of size 1000 is used.
SEE ALSO
pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3),
pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3),
pcrestack(3).
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 03 October 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
PCRECALLOUT(3) PCRECALLOUT(3)
NAME
PCRE - Perl-compatible regular expressions
PCRE CALLOUTS
int (*pcre_callout)(pcre_callout_block *);
PCRE provides a feature called "callout", which is a means of temporar-
ily passing control to the caller of PCRE in the middle of pattern
matching. The caller of PCRE provides an external function by putting
its entry point in the global variable pcre_callout. By default, this
variable contains NULL, which disables all calling out.
Within a regular expression, (?C) indicates the points at which the
external function is to be called. Different callout points can be
identified by putting a number less than 256 after the letter C. The
default value is zero. For example, this pattern has two callout
points:
(?C1)abc(?C2)def
If the PCRE_AUTO_CALLOUT option bit is set when pcre_compile() or
pcre_compile2() is called, PCRE automatically inserts callouts, all
with number 255, before each item in the pattern. For example, if
PCRE_AUTO_CALLOUT is used with the pattern
A(\d{2}|--)
it is processed as if it were
(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
Notice that there is a callout before and after each parenthesis and
alternation bar. Automatic callouts can be used for tracking the
progress of pattern matching. The pcretest command has an option that
sets automatic callouts; when it is used, the output indicates how the
pattern is matched. This is useful information when you are trying to
optimize the performance of a particular pattern.
MISSING CALLOUTS
You should be aware that, because of optimizations in the way PCRE
matches patterns by default, callouts sometimes do not happen. For
example, if the pattern is
ab(?C4)cd
PCRE knows that any matching string must contain the letter "d". If the
subject string is "abyz", the lack of "d" means that matching doesn't
ever start, and the callout is never reached. However, with "abyd",
though the result is still no match, the callout is obeyed.
If the pattern is studied, PCRE knows the minimum length of a matching
string, and will immediately give a "no match" return without actually
running a match if the subject is not long enough, or, for unanchored
patterns, if it has been scanned far enough.
You can disable these optimizations by passing the PCRE_NO_START_OPTI-
MIZE option to pcre_exec() or pcre_dfa_exec(). This slows down the
matching process, but does ensure that callouts such as the example
above are obeyed.
THE CALLOUT INTERFACE
During matching, when PCRE reaches a callout point, the external func-
tion defined by pcre_callout is called (if it is set). This applies to
both the pcre_exec() and the pcre_dfa_exec() matching functions. The
only argument to the callout function is a pointer to a pcre_callout
block. This structure contains the following fields:
int version;
int callout_number;
int *offset_vector;
const char *subject;
int subject_length;
int start_match;
int current_position;
int capture_top;
int capture_last;
void *callout_data;
int pattern_position;
int next_item_length;
The version field is an integer containing the version number of the
block format. The initial version was 0; the current version is 1. The
version number will change again in future if additional fields are
added, but the intention is never to remove any of the existing fields.
The callout_number field contains the number of the callout, as com-
piled into the pattern (that is, the number after ?C for manual call-
outs, and 255 for automatically generated callouts).
The offset_vector field is a pointer to the vector of offsets that was
passed by the caller to pcre_exec() or pcre_dfa_exec(). When
pcre_exec() is used, the contents can be inspected in order to extract
substrings that have been matched so far, in the same way as for
extracting substrings after a match has completed. For pcre_dfa_exec()
this field is not useful.
The subject and subject_length fields contain copies of the values that
were passed to pcre_exec().
The start_match field normally contains the offset within the subject
at which the current match attempt started. However, if the escape
sequence \K has been encountered, this value is changed to reflect the
modified starting point. If the pattern is not anchored, the callout
function may be called several times from the same point in the pattern
for different starting points in the subject.
The current_position field contains the offset within the subject of
the current match pointer.
When the pcre_exec() function is used, the capture_top field contains
one more than the number of the highest numbered captured substring so
far. If no substrings have been captured, the value of capture_top is
one. This is always the case when pcre_dfa_exec() is used, because it
does not support captured substrings.
The capture_last field contains the number of the most recently cap-
tured substring. If no substrings have been captured, its value is -1.
This is always the case when pcre_dfa_exec() is used.
The callout_data field contains a value that is passed to pcre_exec()
or pcre_dfa_exec() specifically so that it can be passed back in
callouts. It is passed in the callout_data field of the pcre_extra data
structure. If no such data was passed, the value of callout_data in a
pcre_callout block is NULL. There is a description of the pcre_extra
structure in the pcreapi documentation.
The pattern_position field is present from version 1 of the pcre_call-
out structure. It contains the offset to the next item to be matched in
the pattern string.
The next_item_length field is present from version 1 of the pcre_call-
out structure. It contains the length of the next item to be matched in
the pattern string. When the callout immediately precedes an alterna-
tion bar, a closing parenthesis, or the end of the pattern, the length
is zero. When the callout precedes an opening parenthesis, the length
is that of the entire subpattern.
The pattern_position and next_item_length fields are intended to help
in distinguishing between different automatic callouts, which all have
the same callout number. However, they are set for all callouts.
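The callout block does not carry the pattern string itself, so a callout function that wants to show which item comes next has to keep its own pointer to the pattern. The following sketch assumes the caller has done so; the function name and its -1 "buffer too small" result are hypothetical and not part of PCRE.

```c
#include <assert.h>
#include <string.h>

/* Copy the pattern item that follows a callout into buf, using the
   pattern_position and next_item_length values from the callout block.
   Returns the item length, or -1 (this sketch's own signal) if buf is
   too small. */
int next_item_text(const char *pattern, int pattern_position,
                   int next_item_length, char *buf, int bufsize)
{
    assert(pattern != NULL && buf != NULL);
    assert(pattern_position >= 0 && next_item_length >= 0);
    if (bufsize < next_item_length + 1)
        return -1;
    memcpy(buf, pattern + pattern_position, (size_t)next_item_length);
    buf[next_item_length] = '\0';
    return next_item_length;
}
```

With the pattern A(\d{2}|--), position 0 and length 1 select "A", while position 1 and length 10 select the whole parenthesized subpattern. A length of zero gives an empty string.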
RETURN VALUES
The external callout function returns an integer to PCRE. If the value
is zero, matching proceeds as normal. If the value is greater than
zero, matching fails at the current point, but the testing of other
matching possibilities goes ahead, just as if a lookahead assertion had
failed. If the value is less than zero, the match is abandoned, and
pcre_exec() or pcre_dfa_exec() returns the negative value.
Negative values should normally be chosen from the set of
PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
dard "no match" failure. The error number PCRE_ERROR_CALLOUT is
reserved for use by callout functions; it will never be used by PCRE
itself.
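The three kinds of return value can be seen together in a callout that records the span matched so far and then asks PCRE to try other possibilities. The sketch below is self-contained and therefore uses a local model of the three fields it reads; real code would be given a pcre_callout_block pointer through the pcre_callout variable. Both structures, the capacity constant, and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SPAN_CAPACITY 16

/* Local model of the callout block fields this sketch reads. */
struct callout_model {
    int start_match;
    int current_position;
    void *callout_data;
};

/* Caller-owned storage, reached through callout_data. */
struct span_store {
    int count;
    int starts[SPAN_CAPACITY];
    int ends[SPAN_CAPACITY];
};

/* Record the span matched so far, then request backtracking. */
int record_and_backtrack(const struct callout_model *block)
{
    struct span_store *spans;
    assert(block != NULL);
    spans = block->callout_data;
    if (spans == NULL)
        return 0;     /* nothing to record: matching proceeds as normal */
    if (spans->count >= SPAN_CAPACITY)
        return -9;    /* PCRE_ERROR_CALLOUT: abandon the match */
    spans->starts[spans->count] = block->start_match;
    spans->ends[spans->count] = block->current_position;
    spans->count++;
    return 1;         /* fail at this point, try other possibilities */
}
```

Returning a positive value every time means the overall match never succeeds, so a caller using this approach should expect PCRE_ERROR_NOMATCH once all alternatives have been tried and read its results from the store.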
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 29 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
PCRECOMPAT(3) PCRECOMPAT(3)
NAME
PCRE - Perl-compatible regular expressions
DIFFERENCES BETWEEN PCRE AND PERL
This document describes the differences in the ways that PCRE and Perl
handle regular expressions. The differences described here are with
respect to Perl 5.10.
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
of what it does have are given in the section on UTF-8 support in the
main pcre page.
2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
permits them, but they do not mean what you might think. For example,
(?!a){3} does not assert that the next three characters are not "a". It
just asserts that the next character is not "a" three times.
3. Capturing subpatterns that occur inside negative lookahead asser-
tions are counted, but their entries in the offsets vector are never
set. Perl sets its numerical variables from any such patterns that are
matched before the assertion fails to match something (thereby succeed-
ing), but only if the negative lookahead assertion contains just one
branch.
4. Though binary zero characters are supported in the subject string,
they are not allowed in a pattern string because it is passed as a nor-
mal C string, terminated by zero. The escape sequence \0 can be used in
the pattern to represent a binary zero.
5. The following Perl escape sequences are not supported: \l, \u, \L,
\U, and \N. In fact these are implemented by Perl's general string-han-
dling and are not part of its pattern matching engine. If any of these
are encountered by PCRE, an error is generated.
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE
is built with Unicode character property support. The properties that
can be tested with \p and \P are limited to the general category prop-
erties such as Lu and Nd, script names such as Greek or Han, and the
derived properties Any and L&. PCRE does support the Cs (surrogate)
property, which Perl does not; the Perl documentation says "Because
Perl hides the need for the user to understand the internal representa-
tion of Unicode characters, there is no need to implement the somewhat
messy concept of surrogates."
7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
ters in between are treated as literals. This is slightly different
from Perl in that $ and @ are also handled as literals inside the
quotes. In Perl, they cause variable interpolation (but of course PCRE
does not have variables). Note the following examples:
Pattern            PCRE matches   Perl matches

\Qabc$xyz\E        abc$xyz        abc followed by the
                                    contents of $xyz
\Qabc\$xyz\E       abc\$xyz       abc\$xyz
\Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
The \Q...\E sequence is recognized both inside and outside character
classes.
8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
constructions. However, there is support for recursive patterns. This
is not available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE
"callout" feature allows an external function to be called during pat-
tern matching. See the pcrecallout documentation for details.
9. Subpatterns that are called recursively or as "subroutines" are
always treated as atomic groups in PCRE. This is like Python, but
unlike Perl. There is a discussion of an example that explains this in
more detail in the section on recursion differences from Perl in the
pcrepattern page.
10. There are some differences that are concerned with the settings of
captured strings when part of a pattern is repeated. For example,
matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
unset, but in PCRE it is set to "b".
11. PCRE does support Perl 5.10's backtracking verbs (*ACCEPT),
(*FAIL), (*F), (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in
the forms without an argument. PCRE does not support (*MARK).
12. PCRE's handling of duplicate subpattern numbers and duplicate
subpattern names is not as general as Perl's. This is a consequence of
the fact that PCRE works internally just with numbers, using an external
table to translate between numbers and names. In particular, a pattern
such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have
the same number but different names, is not supported, and causes an
error at compile time. If it were allowed, it would not be possible to
distinguish which parentheses matched, because both names map to cap-
turing subpattern number 1. To avoid this confusing situation, an error
is given at compile time.
13. PCRE provides some extensions to the Perl regular expression facil-
ities. Perl 5.10 includes new features that are not in earlier ver-
sions of Perl, some of which (such as named parentheses) have been in
PCRE for some time. This list is with respect to Perl 5.10:
(a) Although lookbehind assertions in PCRE must match fixed length
strings, each alternative branch of a lookbehind assertion can match a
different length of string. Perl requires them all to have the same
length.
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
meta-character matches only at the very end of the string.
(c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
ignored. (Perl can be made to issue a warning.)
(d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti-
fiers is inverted, that is, by default they are not greedy, but if fol-
lowed by a question mark they are.
(e) PCRE_ANCHORED can be used at matching time to force a pattern to be
tried only at the first matching position in the subject string.
(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and
PCRE_NOTEMPTY_ATSTART options for pcre_exec(), and the
PCRE_NO_AUTO_CAPTURE option for pcre_compile(), have no Perl
equivalents.
(g) The \R escape sequence can be restricted to match only CR, LF, or
CRLF by the PCRE_BSR_ANYCRLF option.
(h) The callout facility is PCRE-specific.
(i) The partial matching facility is PCRE-specific.
(j) Patterns compiled by PCRE can be saved and re-used at a later time,
even on different hosts that have the other endianness.
(k) The alternative matching function (pcre_dfa_exec()) matches in a
different way and is not Perl-compatible.
(l) PCRE recognizes some special sequences such as (*CR) at the start
of a pattern that set overall options that cannot be changed within the
pattern.
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 04 October 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
PCREPATTERN(3) PCREPATTERN(3)
NAME
PCRE - Perl-compatible regular expressions
PCRE REGULAR EXPRESSION DETAILS
The syntax and semantics of the regular expressions that are supported
by PCRE are described in detail below. There is a quick-reference syn-
tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
semantics as closely as it can. PCRE also supports some alternative
regular expression syntax (which does not conflict with the Perl syn-
tax) in order to provide some compatibility with regular expressions in
Python, .NET, and Oniguruma.
Perl's regular expressions are described in its own documentation, and
regular expressions in general are covered in a number of books, some
of which have copious examples. Jeffrey Friedl's "Mastering Regular
Expressions", published by O'Reilly, covers regular expressions in
great detail. This description of PCRE's regular expressions is
intended as reference material.
The original operation of PCRE was on strings of one-byte characters.
However, there is now also support for UTF-8 character strings. To use
this, PCRE must be built to include UTF-8 support, and you must call
pcre_compile() or pcre_compile2() with the PCRE_UTF8 option. There is
also a special sequence that can be given at the start of a pattern:
(*UTF8)
Starting a pattern with this sequence is equivalent to setting the
PCRE_UTF8 option. This feature is not Perl-compatible. How setting
UTF-8 mode affects pattern matching is mentioned in several places
below. There is also a summary of UTF-8 features in the section on
UTF-8 support in the main pcre page.
The remainder of this document discusses the patterns that are sup-
ported by PCRE when its main matching function, pcre_exec(), is used.
From release 6.0, PCRE offers a second matching function,
pcre_dfa_exec(), which matches using a different algorithm that is not
Perl-compatible. Some of the features discussed below are not available
when pcre_dfa_exec() is used. The advantages and disadvantages of the
alternative function, and how it differs from the normal function, are
discussed in the pcrematching page.
NEWLINE CONVENTIONS
PCRE supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (line-
feed) character, the two-character sequence CRLF, any of the three pre-
ceding, or any Unicode newline sequence. The pcreapi page has further
discussion about newlines, and shows how to set the newline convention
in the options arguments for the compiling and matching functions.
It is also possible to specify a newline convention by starting a pat-
tern string with one of the following five sequences:
(*CR) carriage return
(*LF) linefeed
(*CRLF) carriage return, followed by linefeed
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
These override the default and the options given to pcre_compile() or
pcre_compile2(). For example, on a Unix system where LF is the default
newline sequence, the pattern
(*CR)a.b
changes the convention to CR. That pattern matches "a\nb" because LF is
no longer a newline. Note that these special settings, which are not
Perl-compatible, are recognized only at the very start of a pattern,
and that they must be in upper case. If more than one of them is
present, the last one is used.
The newline convention does not affect what the \R escape sequence
matches. By default, this is any Unicode newline sequence, for Perl
compatibility. However, this can be changed; see the description of \R
in the section entitled "Newline sequences" below. A change of \R set-
ting can be combined with a change of newline convention.
CHARACTERS AND METACHARACTERS
A regular expression is a pattern that is matched against a subject
string from left to right. Most characters stand for themselves in a
pattern, and match the corresponding characters in the subject. As a
trivial example, the pattern
The quick brown fox
matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE_CASELESS option), letters are
matched independently of case. In UTF-8 mode, PCRE always understands
the concept of case for characters whose values are less than 128, so
caseless matching is always possible. For characters with higher val-
ues, the concept of case is supported if PCRE is compiled with Unicode
property support, but not otherwise. If you want to use caseless
matching for characters 128 and above, you must ensure that PCRE is
compiled with Unicode property support as well as with UTF-8 support.
The power of regular expressions comes from the ability to include
alternatives and repetitions in the pattern. These are encoded in the
pattern by the use of metacharacters, which do not stand for themselves
but instead are interpreted in some special way.
There are two different sets of metacharacters: those that are recog-
nized anywhere in the pattern except within square brackets, and those
that are recognized within square brackets. Outside square brackets,
the metacharacters are as follows:
  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier
Part of a pattern that is in square brackets is called a "character
class". In a character class the only metacharacters are:
  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
           syntax)
  ]      terminates the character class
The following sections describe the use of each of the metacharacters.
BACKSLASH
The backslash character has several uses. Firstly, if it is followed by
a non-alphanumeric character, it takes away any special meaning that
character may have. This use of backslash as an escape character
applies both inside and outside character classes.
For example, if you want to match a * character, you write \* in the
pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a metacharacter, so it is
always safe to precede a non-alphanumeric with backslash to specify
that it stands for itself. In particular, if you want to match a back-
slash, you write \\.
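The escaping rule above can be sketched in Python. This is only an illustration: the helper name escape_nonalnum is hypothetical, and Python's re module is a different engine from PCRE. It is used here on the assumption that it shares the convention that a backslash before an ASCII non-alphanumeric character yields that literal character.

```python
import re

def escape_nonalnum(text):
    """Backslash-escape every ASCII character that is not a letter or digit."""
    out = []
    for ch in text:
        if ch.isascii() and not ch.isalnum():
            out.append("\\" + ch)   # safe whether or not ch is a metacharacter
        else:
            out.append(ch)
    return "".join(out)

# The subject a*b\c is turned into the pattern a\*b\\c
literal_pattern = escape_nonalnum("a*b\\c")
literal_match = re.fullmatch(literal_pattern, "a*b\\c")
```

Escaping every non-alphanumeric, rather than only known metacharacters, keeps the helper independent of which characters happen to be special.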
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
the pattern (other than in a character class) and characters between a
# outside a character class and the next newline are ignored. An escap-
ing backslash can be used to include a whitespace or # character as
part of the pattern.
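As a rough illustration of extended-mode layout, the sketch below uses Python's re.VERBOSE flag, which is assumed here to behave like PCRE_EXTENDED for the three cases described above (ignored whitespace, # comments, and escaped whitespace or #). It is not PCRE itself.

```python
import re

# re.VERBOSE stands in for PCRE_EXTENDED in this sketch.
pattern = re.compile(r"""
    a \  b    # "a", an escaped (literal) space, then "b"
    [ ]       # whitespace inside a character class is kept
    \#        # an escaped "#" is a literal, not a comment
""", re.VERBOSE)

verbose_hit = pattern.fullmatch("a b #")
verbose_miss = pattern.fullmatch("ab#")
```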
If you want to remove the special meaning from a sequence of charac-
ters, you can do so by putting them between \Q and \E. This is differ-
ent from Perl in that $ and @ are handled as literals in \Q...\E
sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
tion. Note the following examples:
  Pattern            PCRE matches   Perl matches

  \Qabc$xyz\E        abc$xyz        abc followed by the
                                      contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
The \Q...\E sequence is recognized both inside and outside character
classes.
Non-printing characters
A second use of backslash provides a way of encoding non-printing char-
acters in patterns in a visible manner. There is no restriction on the
appearance of non-printing characters, apart from the binary zero that
terminates a pattern, but when a pattern is being prepared by text
editing, it is often easier to use one of the following escape
sequences than the binary character it represents:
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any character
\e escape (hex 1B)
\f formfeed (hex 0C)
\n linefeed (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\ddd character with octal code ddd, or back reference
\xhh character with hex code hh
\x{hhh..} character with hex code hhh..
The precise effect of \cx is as follows: if x is a lower case letter,
it is converted to upper case. Then bit 6 of the character (hex 40) is
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
becomes hex 7B.
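The \cx rule is simple enough to compute directly. The following Python sketch (the function name control_code is hypothetical) applies the two steps described above: fold a lower case letter to upper case, then invert bit 6.

```python
def control_code(x):
    """Return the character code that \\cx stands for, per the rule above."""
    code = ord(x)
    if "a" <= x <= "z":
        code = ord(x.upper())   # lower case letters are converted first
    return code ^ 0x40          # invert bit 6 (hex 40)
```

For instance, control_code("z") gives hex 1A, while "{" and ";" map to hex 3B and hex 7B respectively, as stated in the text.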
After \x, from zero to two hexadecimal digits are read (letters can be
in upper or lower case). Any number of hexadecimal digits may appear
between \x{ and }, but the value of the character code must be less
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger
than the largest Unicode code point, which is 10FFFF.
If characters other than hexadecimal digits appear between \x{ and },
or if there is no terminating }, this form of escape is not recognized.
Instead, the initial \x will be interpreted as a basic hexadecimal
escape, with no following digits, giving a character whose value is
zero.
Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x. There is no difference in the way they are han-
dled. For example, \xdc is exactly the same as \x{dc}.
After \0 up to two further octal digits are read. If there are fewer
than two digits, just those that are present are used. Thus the
sequence \0\x\07 specifies two binary zeros followed by a BEL character
(code value 7). Make sure you supply two digits after the initial zero
if the pattern character that follows is itself an octal digit.
The handling of a backslash followed by a digit other than 0 is compli-
cated. Outside a character class, PCRE reads it and any following dig-
its as a decimal number. If the number is less than 10, or if there
have been at least that many previous capturing left parentheses in the
expression, the entire sequence is taken as a back reference. A
description of how this works is given later, following the discussion
of parenthesized subpatterns.
Inside a character class, or if the decimal number is greater than 9
and there have not been that many capturing subpatterns, PCRE re-reads
up to three octal digits following the backslash, and uses them to gen-
erate a data character. Any subsequent digits stand for themselves. In
non-UTF-8 mode, the value of a character specified in octal must be
less than \400. In UTF-8 mode, values up to \777 are permitted. For
example:
  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40
            previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of
            writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a back reference, otherwise the
            character with octal code 113
  \377   might be a back reference, otherwise
            the byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero
            followed by the two characters "8" and "1"
Note that octal values of 100 or greater must not be introduced by a
leading zero, because no more than three octal digits are ever read.
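The decision procedure for a backslash followed by digits, outside a character class, can be modelled as below. This is a sketch of the rules as stated in this section, not PCRE's actual code: the function name and return shape are hypothetical, and the value limits (\400 in non-UTF-8 mode, \777 in UTF-8 mode) are not checked.

```python
def interpret_backslash_digits(digits, groups_so_far):
    """Classify the digit run that follows a backslash.

    digits        -- the digits after the backslash, e.g. "0113"
    groups_so_far -- number of capturing left parentheses seen so far
    Returns ("backref", n) or ("char", code, literal_rest).
    """
    if digits[0] != "0":
        number = int(digits)            # read as a decimal number first
        if number < 10 or number <= groups_so_far:
            return ("backref", number)
    # Otherwise re-read up to three octal digits as a data character.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    code = int(octal, 8) if octal else 0
    return ("char", code, digits[len(octal):])
```

With no capturing groups, "40" yields a space (code 32), "81" yields a binary zero followed by the literal text "81", and "7" is a back reference regardless of the group count.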
All the sequences that define a single character value can be used both
inside and outside character classes. In addition, inside a character
class, the sequence \b is interpreted as the backspace character (hex
08), and the sequences \R and \X are interpreted as the characters "R"
and "X", respectively. Outside a character class, these sequences have
different meanings (see below).
Absolute and relative back references
The sequence \g followed by an unsigned or a negative number, option-
ally enclosed in braces, is an absolute or relative back reference. A
named back reference can be coded as \g{name}. Back references are dis-
cussed later, following the discussion of parenthesized subpatterns.
Absolute and relative subroutine calls
For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
an alternative syntax for referencing a subpattern as a "subroutine".
Details are discussed later. Note that \g{...} (Perl syntax) and
\g<...> (Oniguruma syntax) are not synonymous. The former is a back
reference; the latter is a subroutine call.
Generic character types
Another use of backslash is for specifying generic character types. The
following are always recognized:
\d any decimal digit
\D any character that is not a decimal digit
\h any horizontal whitespace character
\H any character that is not a horizontal whitespace character
\s any whitespace character
\S any character that is not a whitespace character
\v any vertical whitespace character
\V any character that is not a vertical whitespace character
\w any "word" character
\W any "non-word" character
Each pair of escape sequences partitions the complete set of characters
into two disjoint sets. Any given character matches one, and only one,
of each pair.
These character type sequences can appear both inside and outside char-
acter classes. They each match one character of the appropriate type.
If the current matching point is at the end of the subject string, all
of them fail, since there is no character to match.
For compatibility with Perl, \s does not match the VT character (code
11). This makes it different from the POSIX "space" class. The \s
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
"use locale;" is included in a Perl script, \s may match the VT
character. In PCRE, it never does.
In UTF-8 mode, characters with values greater than 128 never match \d,
\s, or \w, and always match \D, \S, and \W. This is true even when Uni-
code character property support is available. These sequences retain
their original meanings from before UTF-8 support was available, mainly
for efficiency reasons. Note that this also affects \b, because it is
defined in terms of \w and \W.
The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
the other sequences, these do match certain high-valued codepoints in
UTF-8 mode. The horizontal space characters are:
U+0009 Horizontal tab
U+0020 Space
U+00A0 Non-break space
U+1680 Ogham space mark
U+180E Mongolian vowel separator
U+2000 En quad
U+2001 Em quad
U+2002 En space
U+2003 Em space
U+2004 Three-per-em space
U+2005 Four-per-em space
U+2006 Six-per-em space
U+2007 Figure space
U+2008 Punctuation space
U+2009 Thin space
U+200A Hair space
U+202F Narrow no-break space
U+205F Medium mathematical space
U+3000 Ideographic space
The vertical space characters are:
U+000A Linefeed
U+000B Vertical tab
U+000C Formfeed
U+000D Carriage return
U+0085 Next line
U+2028 Line separator
U+2029 Paragraph separator
A "word" character is an underscore or any character less than 256 that
is a letter or digit. The definition of letters and digits is con-
trolled by PCRE's low-valued character tables, and may vary if locale-
specific matching is taking place (see "Locale support" in the pcreapi
page). For example, in a French locale such as "fr_FR" in Unix-like
systems, or "french" in Windows, some character codes greater than 128
are used for accented letters, and these are matched by \w. The use of
locales with Unicode is discouraged.
Newline sequences
Outside a character class, by default, the escape sequence \R matches
any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8
mode \R is equivalent to the following:
(?>\r\n|\n|\x0b|\f|\r|\x85)
This is an example of an "atomic group", details of which are given
below. This particular group matches either the two-character sequence
CR followed by LF, or one of the single characters LF (linefeed,
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
return, U+000D), or NEL (next line, U+0085). The two-character sequence
is treated as a single unit that cannot be split.
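For readers working in a language without \R, the non-UTF-8 alternation above can be approximated as follows. This Python sketch deliberately uses a plain non-capturing group instead of the atomic group, because (?>...) is only accepted by Python 3.11 and later; the names NEWLINE and split_lines are hypothetical. Listing CRLF first means it is tried before a lone CR, but without atomicity the engine could still split the pair when backtracking inside a larger pattern.

```python
import re

# Non-atomic stand-in for (?>\r\n|\n|\x0b|\f|\r|\x85).
NEWLINE = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85)")

def split_lines(text):
    """Split text at any of the newline sequences listed above."""
    return NEWLINE.split(text)
```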
In UTF-8 mode, two additional characters whose codepoints are greater
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
rator, U+2029). Unicode character property support is not needed for
these characters to be recognized.
It is possible to restrict \R to match only CR, LF, or CRLF (instead of
the complete set of Unicode line endings) by setting the option
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
(BSR is an abbreviation for "backslash R".) This can be made the default
when PCRE is built; if this is the case, the other behaviour can be
requested via the PCRE_BSR_UNICODE option. It is also possible to
specify these settings by starting a pattern string with one of the
following sequences:
(*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence
These override the default and the options given to pcre_compile() or
pcre_compile2(), but they can be overridden by options given to
pcre_exec() or pcre_dfa_exec(). Note that these special settings, which
are not Perl-compatible, are recognized only at the very start of a
pattern, and that they must be in upper case. If more than one of them
is present, the last one is used. They can be combined with a change of
newline convention, for example, a pattern can start with:
(*ANY)(*BSR_ANYCRLF)
Inside a character class, \R matches the letter "R".
Unicode character properties
When PCRE is built with Unicode character property support, three addi-
tional escape sequences that match characters with specific properties
are available. When not in UTF-8 mode, these sequences are of course
limited to testing characters whose codepoints are less than 256, but
they do work in this mode. The extra escape sequences are:
\p{xx} a character with the xx property
\P{xx} a character without the xx property
\X an extended Unicode sequence
The property names represented by xx above are limited to the Unicode
script names, the general category properties, and "Any", which matches
any character (including newline). Other properties such as
"InMusicalSymbols" are not currently supported by PCRE. Note that
\P{Any} does not match any characters, so always causes a match failure.
Sets of Unicode characters are defined as belonging to certain scripts.
A character from one of these sets can be matched using a script name.
For example:
\p{Greek}
\P{Han}
Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:
Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari,
Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek,
Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,
Imperial_Aramaic, Inherited, Inscriptional_Pahlavi,
Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian,
Lydian, Malayalam, Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko,
Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic,
Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic,
Samaritan, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri,
Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Tamil, Telugu,
Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, Yi.
Each character has exactly one general category property, specified by
a two-letter abbreviation. For compatibility with Perl, negation can be
specified by including a circumflex between the opening brace and the
property name. For example, \p{^Lu} is the same as \P{Lu}.
If only one letter is specified with \p or \P, it includes all the gen-
eral category properties that start with that letter. In this case, in
the absence of negation, the curly brackets in the escape sequence are
optional; these two examples have the same effect:
\p{L}
\pL
The following general category property codes are supported:
C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate
L Letter
Ll Lower case letter
Lm Modifier letter
Lo Other letter
Lt Title case letter
Lu Upper case letter
M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark
N Number
Nd Decimal number
Nl Letter number
No Other number
P Punctuation
Pc Connector punctuation
Pd Dash punctuation
Pe Close punctuation
Pf Final punctuation
Pi Initial punctuation
Po Other punctuation
Ps Open punctuation
S Symbol
Sc Currency symbol
Sk Modifier symbol
Sm Mathematical symbol
So Other symbol
Z Separator
Zl Line separator
Zp Paragraph separator
Zs Space separator
The special property L& is also supported: it matches a character that
has the Lu, Ll, or Lt property, in other words, a letter that is not
classified as a modifier or "other".
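The way one-letter codes, two-letter codes, and L& relate to each other can be illustrated with Python's unicodedata module. This is a sketch under two assumptions: the helper name has_property is hypothetical, and Python's Unicode database may be a different version from the tables built into PCRE, so individual characters could be classified differently.

```python
import unicodedata

def has_property(ch, code):
    """Test one character against a general category code such as "Lu" or "L"."""
    category = unicodedata.category(ch)   # always a two-letter code
    if code == "L&":
        return category in ("Lu", "Ll", "Lt")
    if len(code) == 1:
        # A single letter covers every category that starts with it.
        return category.startswith(code)
    return category == code
```

Under this model \p{L} accepts modifier letters (Lm) and other letters (Lo), whereas L& rejects them.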
The Cs (Surrogate) property applies only to characters in the range
U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-
ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in
the pcreapi page). Perl does not support the Cs property.
The long synonyms for property names that Perl supports (such as
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix
any of these properties with "Is".
No character that is in the Unicode table has the Cn (unassigned) prop-
erty. Instead, this property is assumed for any code point that is not
in the Unicode table.
Specifying caseless matching does not affect these escape sequences.
For example, \p{Lu} always matches only upper case letters.
The \X escape matches any number of Unicode characters that form an
extended Unicode sequence. \X is equivalent to
(?>\PM\pM*)
That is, it matches a character without the "mark" property, followed
by zero or more characters with the "mark" property, and treats the
sequence as an atomic group (see below). Characters with the "mark"
property are typically accents that affect the preceding character.
None of them have codepoints less than 256, so in non-UTF-8 mode \X
matches any one character.
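The \PM\pM* structure can be modelled character by character. The sketch below uses Python's unicodedata module as a stand-in for PCRE's property tables (an assumption; the two may differ in Unicode version), and the function names are hypothetical.

```python
import unicodedata

def is_mark(ch):
    """True if ch has one of the "mark" categories (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def match_extended_sequence(text, pos=0):
    """Return the end index of an \\X-style match at pos, or None."""
    if pos >= len(text) or is_mark(text[pos]):
        return None                    # \PM needs one non-mark character
    end = pos + 1
    while end < len(text) and is_mark(text[end]):
        end += 1                       # \pM* takes every following mark
    return end
```

Because the marks are consumed greedily and the group is atomic, a match never ends in the middle of a run of marks.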
Matching characters by Unicode property is not fast, because PCRE has
to search a structure that contains data for over fifteen thousand
characters. That is why the traditional escape sequences such as \d and
\w do not use Unicode properties in PCRE.
Resetting the match start
The escape sequence \K, which is a Perl 5.10 feature, causes any previ-
ously matched characters not to be included in the final matched
sequence. For example, the pattern:
foo\Kbar
matches "foobar", but reports that it has matched "bar". This feature
is similar to a lookbehind assertion (described below). However, in
this case, the part of the subject before the real match does not have
to be of fixed length, as lookbehind assertions do. The use of \K does
not interfere with the setting of captured substrings. For example,
when the pattern
(foo)\Kbar
matches "foobar", the first substring is still set to "foo".
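Python's re module does not recognise \K (an assumption worth checking against the Python version in use), so the sketch below only shows the two closest equivalents for the fixed-length case: a lookbehind assertion, and a capturing group whose span is reported instead of the whole match.

```python
import re

subject = "foobar"

# Fixed-length lookbehind: the reported match is just "bar".
lookbehind = re.search(r"(?<=foo)bar", subject)

# Capture-group emulation: "foo" is matched but only group 1 is reported.
grouped = re.search(r"foo(bar)", subject)
```

Both report the span (3, 6), which is the span that foo\Kbar reports in PCRE. Unlike \K, the lookbehind form requires the prefix to be of fixed length.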
Perl documents that the use of \K within assertions is "not well
defined". In PCRE, \K is acted upon when it occurs inside positive
assertions, but is ignored in negative assertions.
Simple assertions
The final use of backslash is for certain simple assertions. An asser-
tion specifies a condition that has to be met at a particular point in
a match, without consuming any characters from the subject string. The
use of subpatterns for more complicated assertions is described below.
The backslashed assertions are:
  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at the start of the subject
  \Z     matches at the end of the subject
          also matches before a newline at the end of the subject
  \z     matches only at the end of the subject
  \G     matches at the first matching position in the subject
These assertions may not appear in character classes (but note that \b
has a different meaning, namely the backspace character, inside a char-
acter class).
A word boundary is a position in the subject string where the current
character and the previous character do not both match \w or \W (i.e.
one matches \w and the other matches \W), or the start or end of the
string if the first or last character matches \w, respectively. Neither
PCRE nor Perl has a separate "start of word" or "end of word"
metasequence. However, whatever follows \b normally determines which it is.
For example, the fragment \ba matches "a" at the start of a word.
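The word boundary definition can be written out directly. This Python sketch assumes the default tables, in which the word characters are the ASCII letters, digits, and underscore; the names WORD_CHARS, is_word_boundary, and word_boundaries are hypothetical.

```python
import string

# Assumed default \w set: ASCII letters, digits, and underscore.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")

def is_word_boundary(subject, pos):
    """True if exactly one side of position pos is a word character."""
    before = pos > 0 and subject[pos - 1] in WORD_CHARS
    after = pos < len(subject) and subject[pos] in WORD_CHARS
    return before != after   # the string edges count as non-word

def word_boundaries(subject):
    """List every position at which \\b would succeed."""
    return [p for p in range(len(subject) + 1) if is_word_boundary(subject, p)]
```

Treating the positions beyond either end of the string as non-word covers the start-of-string and end-of-string cases without separate branches.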
The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described in the next section) in that they only ever match
at the very start and end of the subject string, whatever options are
set. Thus, they are independent of multiline mode. These three asser-
tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
affect only the behaviour of the circumflex and dollar metacharacters.
However, if the startoffset argument of pcre_exec() is non-zero, indi-
cating that matching is to start at a point other than the beginning of
the subject, \A can never match. The difference between \Z and \z is
that \Z matches before a newline at the end of the string as well as at
the very end, whereas \z matches only at the end.
The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the startoffset argument
of pcre_exec(). It differs from \A when the value of startoffset is
non-zero. By calling pcre_exec() multiple times with appropriate argu-
ments, you can mimic Perl's /g option, and it is in this kind of imple-
mentation where \G can be useful.
Note, however, that PCRE's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
end of the previous match. In Perl, these can be different when the
previously matched string was empty. Because PCRE does just one match
at a time, it cannot reproduce this behaviour.
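The pattern of repeated calls with an advancing start offset might look like the following in Python. This is an analogy, not PCRE usage: Pattern.match(subject, pos) is used because it only succeeds when the match begins exactly at pos, which plays the role of a leading \G together with the startoffset argument. The function name is hypothetical.

```python
import re

def match_all_contiguous(pattern, subject):
    """Collect successive matches, each forced to start where the last ended."""
    results = []
    pos = 0
    while pos <= len(subject):
        m = pattern.match(subject, pos)   # anchored at pos, like \G
        if m is None:
            break
        results.append(m.group())
        # Step past an empty match so the loop always advances.
        pos = m.end() if m.end() > pos else pos + 1
    return results
```

The explicit step after an empty match is the caller's responsibility in this style of loop, since each call performs just one match.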
If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.
CIRCUMFLEX AND DOLLAR
Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching
point is at the start of the subject string. If the startoffset argu-
ment of pcre_exec() is non-zero, circumflex can never match if the
PCRE_MULTILINE option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).
Circumflex need not be the first character of the pattern if a number
of alternatives are involved, but it should be the first thing in each
alternative in which it appears if the pattern is ever to match that
branch. If all possible alternatives start with a circumflex, that is,
if the pattern is constrained to match only at the start of the sub-
ject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)
A dollar character is an assertion that is true only if the current
matching point is at the end of the subject string, or immediately
before a newline at the end of the string (by default). Dollar need not
be the last character of the pattern if a number of alternatives are
involved, but it should be the last item in any branch in which it
appears. Dollar has no special meaning in a character class.
The meaning of dollar can be changed so that it matches only at the
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.
The meanings of the circumflex and dollar characters are changed if the
PCRE_MULTILINE option is set. When this is the case, a circumflex
matches immediately after internal newlines as well as at the start of
the subject string. It does not match after a newline that ends the
string. A dollar matches before any newlines in the string, as well as
at the very end, when PCRE_MULTILINE is set. When newline is specified
as the two-character sequence CRLF, isolated CR and LF characters do
not indicate newlines.
For example, the pattern /^abc$/ matches the subject string "def\nabc"
(where \n represents a newline) in multiline mode, but not otherwise.
Consequently, patterns that are anchored in single line mode because
all branches start with ^ are not anchored in multiline mode, and a
match for circumflex is possible when the startoffset argument of
pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
PCRE_MULTILINE is set.
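The /^abc$/ example and the start offset remark can be tried with Python's re module, on the assumption that its re.MULTILINE flag and the pos argument of Pattern.search() correspond to PCRE_MULTILINE and startoffset for this simple LF-only case. Other multiline details (such as circumflex after a final newline) are not claimed to agree.

```python
import re

subject = "def\nabc"

single_line = re.search(r"^abc$", subject)
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# Searching from offset 4, just after the internal newline:
offset_single = re.compile(r"^abc").search(subject, 4)
offset_multi = re.compile(r"^abc", re.MULTILINE).search(subject, 4)
```

Only the multiline variants find "abc", at offsets 4 to 7.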
Note that the sequences \A, \Z, and \z can be used to match the start
and end of the subject in both modes, and if all branches of a pattern
start with \A it is always anchored, whether or not PCRE_MULTILINE is
set.
FULL STOP (PERIOD, DOT)
Outside a character class, a dot in the pattern matches any one charac-
ter in the subject string except (by default) a character that signi-
fies the end of a line. In UTF-8 mode, the matched character may be
more than one byte long.
When a line ending is defined as a single character, dot never matches
that character; when the two-character sequence CRLF is used, dot does
not match CR if it is immediately followed by LF, but otherwise it
matches all characters (including isolated CRs and LFs). When any Uni-
code line endings are being recognized, dot does not match CR or LF or
any of the other line ending characters.
The behaviour of dot with regard to newlines can be changed. If the
PCRE_DOTALL option is set, a dot matches any one character, without
exception. If the two-character sequence CRLF is present in the subject
string, it takes two dots to match it.
The handling of dot is entirely independent of the handling of circum-
flex and dollar, the only relationship being that they both involve
newlines. Dot has no special meaning in a character class.
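As a rough illustration of the dot rules, the following sketch uses Python's re module in place of PCRE (an assumption; re.DOTALL corresponds to PCRE_DOTALL, and Python's default newline is LF only).

```python
import re

# By default the dot refuses to match a newline.
no_dotall = re.fullmatch(r"a.c", "a\nc")

# With DOTALL the dot matches any one character, newline included.
dotall = re.fullmatch(r"a.c", "a\nc", re.DOTALL)

# CRLF is two characters, so two dots are needed to span it.
one_dot = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
two_dots = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)
```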
MATCHING A SINGLE BYTE
Outside a character class, the escape sequence \C matches any one byte,
both in and out of UTF-8 mode. Unlike a dot, it always matches any
line-ending characters. The feature is provided in Perl in order to
match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-
acters into individual bytes, what remains in the string may be a mal-
formed UTF-8 string. For this reason, the \C escape sequence is best
avoided.
PCRE does not allow \C to appear in lookbehind assertions (described
below), because in UTF-8 mode this would make it impossible to calcu-
late the length of the lookbehind.
SQUARE BRACKETS AND CHARACTER CLASSES
An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not spe-
cial by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
a lone closing square bracket causes a compile-time error. If a closing
square bracket is required as a member of the class, it should be the
first data character in the class (after an initial circumflex, if
present) or escaped with a backslash.
A character class matches a single character in the subject. In UTF-8
mode, the character may be more than one byte long. A matched character
must be in the set of characters defined by the class, unless the first
character in the class definition is a circumflex, in which case the
subject character must not be in the set defined by the class. If a
circumflex is actually required as a member of the class, ensure it is
not the first character, or escape it with a backslash.
For example, the character class [aeiou] matches any lower case vowel,
while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
characters that are in the class by enumerating those that are not. A
class that starts with a circumflex is not an assertion; it still con-
sumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string.
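The difference between a negated class and an assertion can be seen in a small sketch. Python's re module is used here as a stand-in for PCRE (an assumption), and the negative lookahead (?!b) is the assertion form described under ASSERTIONS.

```python
import re

# "a" followed by one character that is not "b": needs a second character.
cls = re.compile(r"a[^b]")
# "a" not followed by "b": a true assertion that consumes nothing after "a".
look = re.compile(r"a(?!b)")

at_end_class = cls.search("a")     # no character left for [^b] to consume
at_end_assert = look.search("a")   # succeeds at the end of the subject
mid_class = cls.search("ac")       # [^b] consumes the "c"
```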
In UTF-8 mode, characters with values greater than 255 can be included
in a class as a literal string of bytes, or by using the \x{ escaping
mechanism.
When caseless matching is set, any letters in a class represent both
their upper case and lower case versions, so for example, a caseless
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
understands the concept of case for characters whose values are less
than 128, so caseless matching is always possible. For characters with
higher values, the concept of case is supported if PCRE is compiled
with Unicode property support, but not otherwise. If you want to use
caseless matching in UTF-8 mode for characters 128 and above, you must
ensure that PCRE is compiled with Unicode property support as well as
with UTF-8 support.
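A short sketch of the caseless class behaviour for ASCII letters, assuming Python's re module as a substitute for PCRE (re.IGNORECASE standing in for PCRE_CASELESS):

```python
import re

# Caseless: the class covers both cases of each vowel.
caseless_vowel = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)

# Caseless and negated: "A" counts as a vowel, so it is excluded.
caseless_negated = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)

# Caseful and negated: "A" is not one of a, e, i, o, u, so it matches.
caseful_negated = re.fullmatch(r"[^aeiou]", "A")
```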
Characters that might indicate line breaks are never treated in any
special way when matching character classes, whatever line-ending
sequence is in use, and whatever setting of the PCRE_DOTALL and
PCRE_MULTILINE options is used. A class such as [^a] always matches one
of these characters.
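The contrast with dot can be checked with a brief sketch, again using Python's re module as a stand-in for PCRE (an assumption):

```python
import re

# Dot does not match a newline by default ...
dot_default = re.fullmatch(r".", "\n")

# ... but a negated class does, with or without the multiline option.
negated = re.fullmatch(r"[^a]", "\n")
negated_multiline = re.fullmatch(r"[^a]", "\n", re.MULTILINE)
```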
The minus (hyphen) character can be used to specify a range of charac-
ters in a character class. For example, [d-m] matches any letter
between d and m, inclusive. If a minus character is required in a
class, it must be escaped with a backslash or appear in a position
where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.
It is not possible to have the literal character "]" as the end charac-
ter of a range. A pattern such as [W-]46] is interpreted as a class of
two characters ("W" and "-") followed by a literal string "46]", so it
would match "W46]" or "-46]". However, if the "]" is escaped with a
backslash it is interpreted as the end of range, so [W-\]46] is inter-
preted as a class containing a range followed by two other characters.
The octal or hexadecimal representation of "]" can also be used to end
a range.
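The hyphen and closing-bracket rules above can be exercised with a minimal sketch. It assumes Python's re module as a stand-in for PCRE; for these particular classes the two parse the pattern the same way.

```python
import re

# [d-m] is an inclusive range.
in_range = [c for c in "cdmn" if re.fullmatch(r"[d-m]", c)]

# A hyphen placed first is a literal member, not a range operator.
literal_hyphen = re.fullmatch(r"[-dm]", "-")

# [W-] is the two-character class {"W", "-"}; "46]" is then literal text.
unescaped = [s for s in ("W46]", "-46]", "X46]")
             if re.fullmatch(r"[W-]46]", s)]

# Escaping "]" makes it the end of the range W..]; "4" and "6" join the class.
escaped = [c for c in "WZ]46-" if re.fullmatch(r"[W-\]46]", c)]
```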
Ranges operate in the collating sequence of character values. They can
also be used for characters specified numerically, for example
[\000-\037]. In UTF-8 mode, ranges can include characters whose values
are greater than 255, for example [\x{100}-\x{2ff}].
If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
character tables for a French locale are in use, [\xc8-\xcb] matches
accented E characters in both cases. In UTF-8 mode, PCRE supports the
concept of case for characters with values of 128 or greater only when
it is compiled with Unicode property support.
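For the ASCII part of this rule, a small sketch (Python's re module assumed as a stand-in for PCRE) shows which single characters a caseless [W-c] accepts:

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

# The range covers W X Y Z [ \ ] ^ _ ` a b c; caseless matching adds the
# other case of each letter in it, but "d" stays outside.
matched = [c for c in "WwXcC^d" if caseless_range.fullmatch(c)]
```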
The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
in a character class, and add the characters that they match to the
class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
flex can conveniently be used with the upper case character types to
specify a more restricted set of characters than the matching lower
case type. For example, the class [^\W_] matches any letter or digit,
but not underscore.
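Both example classes can be checked with a short sketch. Python's re module is assumed here in place of PCRE, and re.ASCII is set so that \d and \W are restricted to ASCII as in PCRE's default tables.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

# Lower case "a" is not in the class as written; "G" is not a hex digit.
hex_hits = [c for c in "09AFGa" if hex_digit.fullmatch(c)]

# Underscore is a "word" character, but the class excludes it explicitly.
word_hits = [c for c in "aZ5_-" if letter_or_digit.fullmatch(c)]
```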
The only metacharacters that are recognized in character classes are
backslash, hyphen (only where it can be interpreted as specifying a
range), circumflex (only at the start), opening square bracket (only
when it can be interpreted as introducing a POSIX class name - see the
next section), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.
POSIX CHARACTER CLASSES
Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE also
supports this notation. For example,
[01[:alpha:]%]
matches "0", "1", any alphabetic character, or "%". The supported class
names are
alnum letters and digits
alpha letters
ascii character codes 0 - 127
blank space or tab only
cntrl control characters
digit decimal digits (same as \d)
graph printing characters, excluding space
lower lower case letters
print printing characters, including space
punct printing characters, excluding letters and digits
space white space (not quite the same as \s)
upper upper case letters
word "word" characters (same as \w)
xdigit hexadecimal digits
The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).
The name "word" is a Perl extension, and "blank" is a GNU extension
from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,
[12[:^digit:]]
matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered.
In UTF-8 mode, characters with values of 128 or greater do not match any
of the POSIX character classes.
VERTICAL BAR
Vertical bar characters are used to separate alternative patterns. For
example, the pattern
gilbert|sullivan
matches either "gilbert" or "sullivan". Any number of alternatives may
appear, and an empty alternative is permitted (matching the empty
string). The matching process tries each alternative in turn, from left
to right, and the first one that succeeds is used. If the alternatives
are within a subpattern (defined below), "succeeds" means matching the
rest of the main pattern as well as the alternative in the subpattern.
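The left-to-right rule, and what "succeeds" means inside a subpattern, can be seen in a minimal sketch. It assumes Python's re module as a stand-in for PCRE; the patterns are invented for illustration.

```python
import re

# Alternatives are tried left to right; the first that succeeds is used,
# even though the second would match more text.
first_wins = re.match(r"a|ab", "ab").group(0)

# Inside a subpattern, "succeeds" includes matching what follows the group,
# so the matcher moves on to the second alternative here.
with_rest = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"ab|", "")
```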
INTERNAL OPTION SETTING
The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options (which are Perl-compatible) can be changed from
within the pattern by a sequence of Perl option letters enclosed
between "(?" and ")". The option letters are
i for PCRE_CASELESS
m for PCRE_MULTILINE
s for PCRE_DOTALL
x for PCRE_EXTENDED
For example, (?im) sets caseless, multiline matching. It is also possi-
ble to unset these options by preceding the letter with a hyphen, and a
combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
is also permitted. If a letter appears both before and after the
hyphen, the option is unset.
The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
can be changed in the same way as the Perl-compatible options by using
the characters J, U and X respectively.
When one of these option changes occurs at top level (that is, not
inside subpattern parentheses), the change applies to the remainder of
the pattern that follows. If the change is placed right at the start of
a pattern, PCRE extracts it into the global options (and it will there-
fore show up in data extracted by the pcre_fullinfo() function).
An option change within a subpattern (see below for a description of
subpatterns) affects only that part of the current pattern that follows
it, so
(a(?i)b)c
matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
used). By this means, options can be made to have different settings
in different parts of the pattern. Any changes made in one alternative
do carry on into subsequent branches within the same subpattern. For
example,
(a(?i)b|c)
matches "ab", "aB", "c", and "C", even though when matching "C" the
first branch is abandoned before the option setting. This is because
the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.
Note: There are other PCRE-specific options that can be set by the
application when the compile or match functions are called. In some
cases the pattern can contain special leading sequences such as (*CRLF)
to override what the application has set or what has been defaulted.
Details are given in the section entitled "Newline sequences" above.
There is also the (*UTF8) leading sequence that can be used to set
UTF-8 mode; this is equivalent to setting the PCRE_UTF8 option.
SUBPATTERNS
Subpatterns are delimited by parentheses (round brackets), which can be
nested. Turning part of a pattern into a subpattern does two things:
1. It localizes a set of alternatives. For example, the pattern
cat(aract|erpillar|)
matches one of the words "cat", "cataract", or "caterpillar". Without
the parentheses, it would match "cataract", "erpillar" or an empty
string.
2. It sets up the subpattern as a capturing subpattern. This means
that, when the whole pattern matches, that portion of the subject
string that matched the subpattern is passed back to the caller via the
ovector argument of pcre_exec(). Opening parentheses are counted from
left to right (starting from 1) to obtain numbers for the capturing
subpatterns.
For example, if the string "the red king" is matched against the pat-
tern
the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king", and are num-
bered 1, 2, and 3, respectively.
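Both functions of parentheses can be illustrated with a short sketch, assuming Python's re module as a substitute for PCRE (captured substrings are read from the match object rather than from an ovector):

```python
import re

# Grouping localizes the alternatives, including the empty one.
words = [w for w in ("cat", "cataract", "caterpillar", "erpillar")
         if re.fullmatch(r"cat(aract|erpillar|)", w)]

# Capturing groups are numbered by their opening parentheses, left to right.
m = re.search(r"the ((red|white) (king|queen))", "the red king")
captures = m.groups()

print(words)
print(captures)
```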
The fact that plain parentheses fulfil two functions is not always
helpful. There are often times when a grouping subpattern is required
without a capturing requirement. If an opening parenthesis is followed
by a question mark and a colon, the subpattern does not do any captur-
ing, and is not counted when computing the number of any subsequent
capturing subpatterns. For example, if the string "the white queen" is
matched against the pattern
the ((?:red|white) (king|queen))
the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535.
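A brief sketch of the non-capturing form, with Python's re module assumed as a stand-in for PCRE:

```python
import re

pattern = re.compile(r"the ((?:red|white) (king|queen))")
m = pattern.search("the white queen")

# The (?: ... ) group is not counted, so only two groups exist.
group_count = pattern.groups
captures = m.groups()
```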
As a convenient shorthand, if any option settings are required at the
start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
match exactly the same set of strings. Because alternative branches are
tried from left to right, and options are not reset until the end of
the subpattern is reached, an option setting in one branch does affect
subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".
DUPLICATE SUBPATTERN NUMBERS
Perl 5.10 introduced a feature whereby each alternative in a subpattern
uses the same numbers for its capturing parentheses. Such a subpattern
starts with (?| and is itself a non-capturing subpattern. For example,
consider this pattern:
(?|(Sat)ur|(Sun))day
Because the two alternatives are inside a (?| group, both sets of cap-
turing parentheses are numbered one. Thus, when the pattern matches,
you can look at captured substring number one, whichever alternative
matched. This construct is useful when you want to capture part, but
not all, of one of a number of alternatives. Inside a (?| group, paren-
theses are numbered as usual, but the number is reset at the start of
each branch. The numbers of any capturing buffers that follow the sub-
pattern start after the highest number used in any branch. The follow-
ing example is taken from the Perl documentation. The numbers under-
neath show in which buffer the captured content will be stored.
# before  ---------------branch-reset----------- after
/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1            2         2  3        2     3     4
A back reference to a numbered subpattern uses the most recent value
that is set for that number by any subpattern. The following pattern
matches "abcabc" or "defdef":
/(?|(abc)|(def))\1/
In contrast, a recursive or "subroutine" call to a numbered subpattern
always refers to the first one in the pattern with the given number.
The following pattern matches "abcabc" or "defabc":
/(?|(abc)|(def))(?1)/
If a condition test for a subpattern's having matched refers to a non-
unique number, the test is true if any of the subpatterns of that num-
ber have matched.
An alternative approach to using this "branch reset" feature is to use
duplicate named subpatterns, as described in the next section.
NAMED SUBPATTERNS
Identifying capturing parentheses by number is simple, but it can be
very hard to keep track of the numbers in complicated regular expres-
sions. Furthermore, if an expression is modified, the numbers may
change. To help with this difficulty, PCRE supports the naming of sub-
patterns. This feature was not added to Perl until release 5.10. Python
had the feature earlier, and PCRE introduced it at release 4.0, using
the Python syntax. PCRE now supports both the Perl and the Python syn-
tax. Perl allows identically numbered subpatterns to have different
names, but PCRE does not.
In PCRE, a subpattern can be named in one of three ways: (?<name>...)
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
to capturing parentheses from other parts of the pattern, such as back
references, recursion, and conditions, can be made by name as well as
by number.
Names consist of up to 32 alphanumeric characters and underscores.
Named capturing parentheses are still allocated numbers as well as
names, exactly as if the names were not present. The PCRE API provides
function calls for extracting the name-to-number translation table from
a compiled pattern. There is also a convenience function for extracting
a captured substring by name.
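The relationship between names and numbers can be sketched as follows. This uses Python's re module, not the PCRE API: groupindex is Python's counterpart of the name-to-number table, and the pattern and group names are invented for illustration.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Named groups still receive numbers, exactly as if they were unnamed.
name_table = dict(date.groupindex)

m = date.fullmatch("2009-03")
by_name = m.group("year")
by_number = m.group(1)

print(name_table)
```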
By default, a name must be unique within a pattern, but it is possible
to relax this constraint by setting the PCRE_DUPNAMES option at compile
time. (Duplicate names are also always permitted for subpatterns with
the same number, set up as described in the previous section.) Dupli-
cate names can be useful for patterns where only one instance of the
named parentheses can match. Suppose you want to match the name of a
weekday, either as a 3-letter abbreviation or as the full name, and in
both cases you want to extract the abbreviation. This pattern (ignoring
the line breaks) does the job:
(?<DN>Mon|Fri|Sun)(?:day)?|
(?<DN>Tue)(?:sday)?|
(?<DN>Wed)(?:nesday)?|
(?<DN>Thu)(?:rsday)?|
(?<DN>Sat)(?:urday)?
There are five capturing substrings, but only one is ever set after a
match. (An alternative way of solving this problem is to use a "branch
reset" subpattern, as described in the previous section.)
The convenience function for extracting the data by name returns the
substring for the first (and in this example, the only) subpattern of
that name that matched. This saves searching to find which numbered
subpattern it was.
If you make a back reference to a non-unique named subpattern from
elsewhere in the pattern, the one that corresponds to the first occur-
rence of the name is used. In the absence of duplicate numbers (see the
previous section) this is the one with the lowest number. If you use a
named reference in a condition test (see the section about conditions
below), either to check whether a subpattern has matched, or to check
for recursion, all subpatterns with the same name are tested. If the
condition is true for any one of them, the overall condition is true.
This is the same behaviour as testing by number. For further details of
the interfaces for handling named subpatterns, see the pcreapi documen-
tation.
Warning: You cannot use different names to distinguish between two sub-
patterns with the same number because PCRE uses only the numbers when
matching. For this reason, an error is given at compile time if differ-
ent names are given to subpatterns with the same number. However, you
can give the same name to subpatterns with the same number, even when
PCRE_DUPNAMES is not set.
REPETITION
Repetition is specified by quantifiers, which can follow any of the
following items:
a literal data character
the dot metacharacter
the \C escape sequence
the \X escape sequence (in UTF-8 mode with Unicode properties)
the \R escape sequence
an escape such as \d that matches a single character
a character class
a back reference (see next section)
a parenthesized subpattern (unless it is an assertion)
a recursive or "subroutine" call to a subpattern
The general repetition quantifier specifies a minimum and maximum num-
ber of permitted matches, by giving the two numbers in curly brackets
(braces), separated by a comma. The numbers must be less than 65536,
and the first must be less than or equal to the second. For example:
z{2,4}
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
special character. If the second number is omitted, but the comma is
present, there is no upper limit; if the second number and the comma
are both omitted, the quantifier specifies an exact number of required
matches. Thus
[aeiou]{3,}
matches at least 3 successive vowels, but may match many more, while
\d{8}
matches exactly 8 digits. An opening curly bracket that appears in a
position where a quantifier is not allowed, or one that does not match
the syntax of a quantifier, is taken as a literal character. For exam-
ple, {,6} is not a quantifier, but a literal string of four characters.
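The three quantifier forms can be tried with a minimal sketch that assumes Python's re module as a stand-in for PCRE. The {,6} case is left out because Python treats it as a quantifier with a minimum of zero, unlike the behaviour described above.

```python
import re

# {2,4}: between two and four repetitions.
z_hits = [s for s in ("z", "zz", "zzzz", "zzzzz")
          if re.fullmatch(r"z{2,4}", s)]

# {3,}: at least three, with no upper limit.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group(0)

# {8}: exactly eight.
eight = re.fullmatch(r"\d{8}", "20090311") is not None
seven = re.fullmatch(r"\d{8}", "2009031") is not None
```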
In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
acters, each of which is represented by a two-byte sequence. Similarly,
when Unicode property support is available, \X{3} matches three Unicode
extended sequences, each of which may be several bytes long (and they
may be of different lengths).
The quantifier {0} is permitted, causing the expression to behave as if
the previous item and the quantifier were not present. This may be use-
ful for subpatterns that are referenced as subroutines from elsewhere
in the pattern. Items other than subpatterns that have a {0} quantifier
are omitted from the compiled pattern.
For convenience, the three most common quantifiers have single-charac-
ter abbreviations:
* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
It is possible to construct infinite loops by following a subpattern
that can match no characters with a quantifier that has no upper limit,
for example:
(a?)*
Earlier versions of Perl and PCRE used to give an error at compile time
for such patterns. However, because there are cases where this can be
useful, such patterns are now accepted, but if any repetition of the
subpattern does in fact match no characters, the loop is forcibly bro-
ken.
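A quick check that such a pattern terminates, using Python's re module as a stand-in for PCRE (an assumption; Python breaks the empty-match loop in a similar way):

```python
import re

# The group can match nothing, yet the unlimited repeat still terminates.
some = re.match(r"(a?)*", "aab").group(0)
none = re.match(r"(a?)*", "b").group(0)
```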
By default, the quantifiers are "greedy", that is, they match as much
as possible (up to the maximum number of permitted times), without
causing the rest of the pattern to fail. The classic example of where
this gives problems is in trying to match comments in C programs. These
appear between /* and */ and within the comment, individual * and /
characters may appear. An attempt to match C comments by applying the
pattern
/\*.*\*/
to the string
/* first comment */ not comment /* second comment */
fails, because it matches the entire string owing to the greediness of
the .* item.
However, if a quantifier is followed by a question mark, it ceases to
be greedy, and instead matches the minimum number of times possible, so
the pattern
/\*.*?\*/
does the right thing with the C comments. The meaning of the various
quantifiers is not otherwise changed, just the preferred number of
matches. Do not confuse this use of question mark with its use as a
quantifier in its own right. Because it has two uses, it can sometimes
appear doubled, as in
\d??\d
which matches one digit by preference, but can match two if that is the
only way the rest of the pattern matches.
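The greedy and minimizing forms can be compared with a short sketch, assuming Python's re module in place of PCRE (the ? suffix has the same meaning in both):

```python
import re

text = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last "*/", swallowing the whole string.
greedy = re.findall(r"/\*.*\*/", text)

# Minimizing: .*? stops at the first "*/" it can.
lazy = re.findall(r"/\*.*?\*/", text)

# \d??\d prefers one digit, but takes two when the rest requires it.
one_digit = re.match(r"\d??\d", "12").group(0)
two_digits = re.fullmatch(r"\d??\d", "12").group(0)
```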
If the PCRE_UNGREEDY option is set (an option that is not available in
Perl), the quantifiers are not greedy by default, but individual ones
can be made greedy by following them with a question mark. In other
words, it inverts the default behaviour.
When a parenthesized subpattern is quantified with a minimum repeat
count that is greater than 1 or with a limited maximum, more memory is
required for the compiled pattern, in proportion to the size of the
minimum or maximum.
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
alent to Perl's /s) is set, thus allowing the dot to match newlines,
the pattern is implicitly anchored, because whatever follows will be
tried against every character position in the subject string, so there
is no point in retrying the overall match at any position after the
first. PCRE normally treats such a pattern as though it were preceded
by \A.
In cases where it is known that the subject string contains no new-
lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
mization, or alternatively using ^ to indicate anchoring explicitly.
However, there is one situation where the optimization cannot be used.
When .* is inside capturing parentheses that are the subject of a back
reference elsewhere in the pattern, a match at the start may fail where
a later one succeeds. Consider, for example:
(.*)abc\1
If the subject is "xyz123abc123" the match point is the fourth charac-
ter. For this reason, such a pattern is not implicitly anchored.
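The match point in this example can be confirmed with a small sketch that assumes Python's re module as a stand-in for PCRE:

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# Offsets are zero-based, so the fourth character is offset 3.
start = m.start()
captured = m.group(1)
```

Attempts at offsets 0, 1 and 2 fail because the text after "abc" does not repeat what (.*) captured; only the attempt at offset 3 succeeds.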
When a capturing subpattern is repeated, the value captured is the sub-
string that matched the final iteration. For example, after
(tweedle[dume]{3}\s*)+
has matched "tweedledum tweedledee" the value of the captured substring
is "tweedledee". However, if there are nested capturing subpatterns,
the corresponding captured values may have been set in previous itera-
tions. For example, after
/(a|(b))+/
matches "aba" the value of the second captured substring is "b".
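Both cases can be reproduced in a brief sketch. It assumes Python's re module as a stand-in for PCRE; Python happens to keep the nested capture from an earlier iteration in the same way.

```python
import re

# The outer group reports only its final iteration.
last = re.fullmatch(r"(tweedle[dume]{3}\s*)+",
                    "tweedledum tweedledee").group(1)

# Group 2 was set in the second iteration and is not cleared by the third.
nested = re.fullmatch(r"(a|(b))+", "aba").groups()
```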
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
repetition, failure of what follows normally causes the repeated item
to be re-evaluated to see if a different number of repeats allows the
rest of the pattern to match. Sometimes it is useful to prevent this,
either to change the nature of the match, or to cause it to fail earlier
than it otherwise might, when the author of the pattern knows there is
no point in carrying on.
Consider, for example, the pattern \d+foo when applied to the subject
line
123456bar
After matching all 6 digits and then failing to match "foo", the normal
action of the matcher is to try again with only 5 digits matching the
\d+ item, and then with 4, and so on, before ultimately failing.
"Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
the means for specifying that once a subpattern has matched, it is not
to be re-evaluated in this way.
If we use atomic grouping for the previous example, the matcher gives
up immediately on failing to match "foo" the first time. The notation
is a kind of special parenthesis, starting with (?> as in this example:
(?>\d+)foo
This kind of parenthesis "locks up" the part of the pattern it con-
tains once it has matched, and a failure further into the pattern is
prevented from backtracking into it. Backtracking past it to previous
items, however, works as normal.
An alternative description is that a subpattern of this type matches
the string of characters that an identical standalone pattern would
match, if anchored at the current point in the subject string.
Atomic grouping subpatterns are not capturing subpatterns. Simple cases
such as the above example can be thought of as a maximizing repeat that
must swallow everything it can. So, while both \d+ and \d+? are pre-
pared to adjust the number of digits they match in order to make the
rest of the pattern match, (?>\d+) can only match an entire sequence of
digits.
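The "standalone pattern" description can be modelled directly. The sketch below is an illustration only, not how PCRE implements atomic groups: atomic_then is a hypothetical helper written with Python's re module, and it avoids the (?> syntax itself because older versions of that module do not accept it.

```python
import re

def atomic_then(first, rest, subject, pos=0):
    """Match `first` as a standalone pattern anchored at `pos`, keep
    whatever it matched, then require `rest` immediately afterwards."""
    head = re.compile(first).match(subject, pos)
    if head is None:
        return None
    tail = re.compile(rest).match(subject, head.end())
    if tail is None:
        return None  # no retry with a shorter head: this is the atomic part
    return subject[pos:tail.end()]

# Ordinary \d+ gives back one digit so that the trailing \d can match.
backtracking = re.match(r"\d+\d", "1234")

# The atomic-style version has already swallowed every digit, so it fails.
atomic_fail = atomic_then(r"\d+", r"\d", "1234")
atomic_ok = atomic_then(r"\d+", r"foo", "123456foo")
```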
Atomic groups in general can of course contain arbitrarily complicated
subpatterns, and can be nested. However, when the subpattern for an
atomic group is just a single repeated item, as in the example above, a
simpler notation, called a "possessive quantifier" can be used. This
consists of an additional + character following a quantifier. Using
this notation, the previous example can be rewritten as
\d++foo
Note that a possessive quantifier can be used with an entire group, for
example:
(abc|xyz){2,3}+
Possessive quantifiers are always greedy; the setting of the
PCRE_UNGREEDY option is ignored. They are a convenient notation for the
simpler forms of atomic group. However, there is no difference in the
meaning of a possessive quantifier and the equivalent atomic group,
though there may be a performance difference; possessive quantifiers
should be slightly faster.
The possessive quantifier syntax is an extension to the Perl 5.8 syn-
tax. Jeffrey Friedl originated the idea (and the name) in the first
edition of his book. Mike McCloskey liked it, so implemented it when he
built Sun's Java package, and PCRE copied it from there. It ultimately
found its way into Perl at release 5.10.
PCRE has an optimization that automatically "possessifies" certain sim-
ple pattern constructs. For example, the sequence A+B is treated as
A++B because there is no point in backtracking into a sequence of A's
when B must follow.
When a pattern contains an unlimited repeat inside a subpattern that
can itself be repeated an unlimited number of times, the use of an
atomic group is the only way to avoid some failing matches taking a
very long time indeed. The pattern
(\D+|<\d+>)*[!?]
matches an unlimited number of substrings that either consist of non-
digits, or digits enclosed in <>, followed by either ! or ?. When it
matches, it runs quickly. However, if it is applied to
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
it takes a long time before reporting failure. This is because the
string can be divided between the internal \D+ repeat and the external
* repeat in a large number of ways, and all have to be tried. (The
example uses [!?] rather than a single character at the end, because
both PCRE and Perl have an optimization that allows for fast failure
when a single character is used. They remember the last single charac-
ter that is required for a match, and fail early if it is not present
in the string.) If the pattern is changed so that it uses an atomic
group, like this:
((?>\D+)|<\d+>)*[!?]
sequences of non-digits cannot be broken, and failure happens quickly.
BACK REFERENCES
Outside a character class, a backslash followed by a digit greater than
0 (and possibly further digits) is a back reference to a capturing sub-
pattern earlier (that is, to its left) in the pattern, provided there
have been that many previous capturing left parentheses.
However, if the decimal number following the backslash is less than 10,
it is always taken as a back reference, and causes an error only if
there are not that many capturing left parentheses in the entire pat-
tern. In other words, the parentheses that are referenced need not be
to the left of the reference for numbers less than 10. A "forward back
reference" of this type can make sense when a repetition is involved
and the subpattern to the right has participated in an earlier itera-
tion.
It is not possible to have a numerical "forward back reference" to a
subpattern whose number is 10 or more using this syntax because a
sequence such as \50 is interpreted as a character defined in octal.
See the subsection entitled "Non-printing characters" above for further
details of the handling of digits following a backslash. There is no
such problem when named parentheses are used. A back reference to any
subpattern is possible using named parentheses (see below).
Another way of avoiding the ambiguity inherent in the use of digits
following a backslash is to use the \g escape sequence, which is a fea-
ture introduced in Perl 5.10. This escape must be followed by an
unsigned number or a negative number, optionally enclosed in braces.
These examples are all identical:
(ring), \1
(ring), \g1
(ring), \g{1}
An unsigned number specifies an absolute reference without the ambigu-
ity that is present in the older syntax. It is also useful when literal
digits follow the reference. A negative number is a relative reference.
Consider this example:
(abc(def)ghi)\g{-1}
The sequence \g{-1} is a reference to the most recently started capturing
subpattern before \g, that is, it is equivalent to \2. Similarly,
\g{-2} would be equivalent to \1. The use of relative references can be
helpful in long patterns, and also in patterns that are created by
joining together fragments that contain references within themselves.
A back reference matches whatever actually matched the capturing sub-
pattern in the current subject string, rather than anything matching
the subpattern itself (see "Subpatterns as subroutines" below for a way
of doing that). So the pattern
(sens|respons)e and \1ibility
matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If caseful matching is in force at the
time of the back reference, the case of letters is relevant. For exam-
ple,
((?i)rah)\s+\1
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
original capturing subpattern is matched caselessly.
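The first example can be checked with a minimal sketch, assuming Python's re module as a stand-in for PCRE. The caseless variant is not exercised, because Python restricts where a bare (?i) may appear.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

candidates = (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)

# \1 must repeat the text that group 1 actually matched.
hits = [s for s in candidates if pattern.fullmatch(s)]
```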
There are several different ways of writing back references to named
subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
unified back reference syntax, in which \g can be used for both numeric
and named references, is also supported. We could rewrite the above
example in any of the following ways:
(?<p1>(?i)rah)\s+\k<p1>
(?'p1'(?i)rah)\s+\k{p1}
(?P<p1>(?i)rah)\s+(?P=p1)
(?<p1>(?i)rah)\s+\g{p1}
A subpattern that is referenced by name may appear in the pattern
before or after the reference.
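Of the spellings listed above, the Python one can be tried in Python's own re module, used here as a stand-in for PCRE (an assumption). The sketch is caseful throughout and keeps the name p1 from the examples.

```python
import re

repeat = re.compile(r"(?P<p1>rah)\s+(?P=p1)")

same = repeat.fullmatch("rah rah")
different = repeat.fullmatch("rah RAH")  # the reference is caseful here
```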
There may be more than one back reference to the same subpattern. If a
subpattern has not actually been used in a particular match, any back
references to it always fail by default. For example, the pattern
(a|(bc))\2
always fails if it starts to match "a" rather than "bc". However, if
the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
ence to an unset value matches an empty string.
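The default behaviour (a back reference to an unset group fails) can be
observed with the following sketch. It assumes Python's standard re
module, which treats unset groups the same way; it has no counterpart
of PCRE_JAVASCRIPT_COMPAT. The function name try_match() is hypotheti-
cal.

```python
import re

unset = re.compile(r"(a|(bc))\2")


def try_match(subject):
    """Return the matched text, or None when the pattern fails."""
    m = unset.fullmatch(subject)
    return m.group(0) if m else None


# Group 2 is set only when the "bc" alternative is taken.
print(try_match("bcbc"))
# Here group 1 matches "a", group 2 stays unset, and \2 fails.
print(try_match("aa"))
```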
Because there may be many capturing parentheses in a pattern, all dig-
its following a backslash are taken as part of a potential back refer-
ence number. If the pattern continues with a digit character, some
delimiter must be used to terminate the back reference. If the
PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{
syntax or an empty comment (see "Comments" below) can be used.
Recursive back references
A back reference that occurs inside the parentheses to which it refers
fails when the subpattern is first used, so, for example, (a\1) never
matches. However, such references can be useful inside repeated sub-
patterns. For example, the pattern
(a|b\1)+
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
ation of the subpattern, the back reference matches the character
string corresponding to the previous iteration. In order for this to
work, the pattern must be such that the first iteration does not need
to match the back reference. This can be done using alternation, as in
the example above, or by a quantifier with a minimum of zero.
Back references of this type cause the group that they reference to be
treated as an atomic group. Once the whole group has been matched, a
subsequent matching failure cannot cause backtracking into the middle
of the group.
ASSERTIONS
An assertion is a test on the characters following or preceding the
current matching point that does not actually consume any characters.
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
described above.
More complicated assertions are coded as subpatterns. There are two
kinds: those that look ahead of the current position in the subject
string, and those that look behind it. An assertion subpattern is
matched in the normal way, except that it does not cause the current
matching position to be changed.
Assertion subpatterns are not capturing subpatterns, and may not be
repeated, because it makes no sense to assert the same thing several
times. If any kind of assertion contains capturing subpatterns within
it, these are counted for the purposes of numbering the capturing sub-
patterns in the whole pattern. However, substring capturing is carried
out only for positive assertions, because it does not make sense for
negative assertions.
Lookahead assertions
Lookahead assertions start with (?= for positive assertions and (?! for
negative assertions. For example,
\w+(?=;)
matches a word followed by a semicolon, but does not include the semi-
colon in the match, and
foo(?!bar)
matches any occurrence of "foo" that is not followed by "bar". Note
that the apparently similar pattern
(?!foo)bar
does not find an occurrence of "bar" that is preceded by something
other than "foo"; it finds any occurrence of "bar" whatsoever, because
the assertion (?!foo) is always true when the next three characters are
"bar". A lookbehind assertion is needed to achieve the other effect.
If you want to force a matching failure at some point in a pattern, the
most convenient way to do it is with (?!) because an empty string
always matches, so an assertion that requires there not to be an empty
string must always fail. The Perl 5.10 backtracking control verb
(*FAIL) or (*F) is essentially a synonym for (?!).
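The lookahead examples in this section can be checked with a short
sketch. It assumes Python's standard re module, whose (?= and (?! syn-
tax coincides with PCRE's; the subject strings are made up for the
illustration.

```python
import re

# A word followed by a semicolon; the semicolon is not part of the match.
word = re.search(r"\w+(?=;)", "alpha beta; gamma")

# Start offsets of each "foo" that is not followed by "bar".
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar finds every "bar", including the one preceded by "foo".
bar_starts = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar xbar")]

# An empty negative lookahead can never succeed, so the match is forced
# to fail wherever it is reached.
never = re.search(r"a(?!)", "aaa")

print(word.group(0), foo_starts, bar_starts, never)
```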
Lookbehind assertions
Lookbehind assertions start with (?<= for positive assertions and (?<!
for negative assertions. For example,
(?<!foo)bar
does find an occurrence of "bar" that is not preceded by "foo". The
contents of a lookbehind assertion are restricted such that all the
strings it matches must have a fixed length. However, if there are sev-
eral top-level alternatives, they do not all have to have the same
fixed length. Thus
(?<=bullock|donkey)
is permitted, but
(?<!dogs?|cats?)
causes an error at compile time. Branches that match different length
strings are permitted only at the top level of a lookbehind assertion.
This is an extension compared with Perl (5.8 and 5.10), which requires
all branches to match the same length of string. An assertion such as
(?<=ab(c|de))
is not permitted, because its single top-level branch can match two
different lengths, but it is acceptable to PCRE if rewritten to use two
top-level branches:
(?<=abc|abde)
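A sketch of the lookbehind behaviour follows. It assumes Python's stan-
dard re module, which is stricter than PCRE: like the Perl versions
mentioned above, it requires every branch to match the same length, so
a pattern such as (?<=bullock|donkey) is rejected there even though
PCRE accepts it. The helper names bar_positions() and compiles() are
hypothetical.

```python
import re

not_after_foo = re.compile(r"(?<!foo)bar")


def bar_positions(subject):
    """Start offsets of each "bar" that is not preceded by "foo"."""
    return [m.start() for m in not_after_foo.finditer(subject)]


def compiles(pattern):
    """Return True if the pattern compiles, False on a compile-time error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(bar_positions("foobar xbar bar"))
# Each branch can match two different lengths, so compilation fails.
print(compiles(r"(?<!dogs?|cats?)"))
```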
In some cases, the Perl 5.10 escape sequence \K (see above) can be used
instead of a lookbehind assertion to get round the fixed-length
restriction.
The implementation of lookbehind assertions is, for each alternative,
to temporarily move the current position back by the fixed length and
then try to match. If there are insufficient characters before the cur-
rent position, the assertion fails.
PCRE does not allow the \C escape (which matches a single byte in UTF-8
mode) to appear in lookbehind assertions, because it makes it impossi-
ble to calculate the length of the lookbehind. The \X and \R escapes,
which can match different numbers of bytes, are also not permitted.
"Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
lookbehinds, as long as the subpattern matches a fixed-length string.
Recursion, however, is not supported.
Possessive quantifiers can be used in conjunction with lookbehind
assertions to specify efficient matching of fixed-length strings at the
end of subject strings. Consider a simple pattern such as
abcd$
when applied to a long string that does not match. Because matching
proceeds from left to right, PCRE will look for each "a" in the subject
and then see if what follows matches the rest of the pattern. If the
pattern is specified as
^.*abcd$
the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but the
last character, then all but the last two characters, and so on. Once
again the search for "a" covers the entire string, from right to left,
so we are no better off. However, if the pattern is written as
^.*+(?<=abcd)
there can be no backtracking for the .*+ item; it can match only the
entire string. The subsequent lookbehind assertion does a single test
on the last four characters. If it fails, the match fails immediately.
For long strings, this approach makes a significant difference to the
processing time.
Using multiple assertions
Several assertions (of any sort) may occur in succession. For example,
(?<=\d{3})(?<!999)foo
matches "foo" preceded by three digits that are not "999". Notice that
each of the assertions is applied independently at the same point in
the subject string. First there is a check that the previous three
characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo" pre-
ceded by six characters, the first three of which are digits and the
last three of which are not "999". For example, it doesn't match
"123abcfoo". A pattern to do that is
(?<=\d{3}...)(?<!999)foo
This time the first assertion looks at the preceding six characters,
checking that the first three are digits, and then the second assertion
checks that the preceding three characters are not "999".
Assertions can be nested in any combination. For example,
(?<=(?<!foo)bar)baz
matches an occurrence of "baz" that is preceded by "bar" which in turn
is not preceded by "foo", while
(?<=\d{3}(?!999)...)foo
is another pattern that matches "foo" preceded by three digits and any
three characters that are not "999".
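The difference between these patterns is easiest to see by running
them against a few subjects. The sketch below assumes Python's standard
re module, which accepts all of these fixed-length lookbehinds; the
helper name found() is hypothetical.

```python
import re

# Both assertions look at the same three preceding characters.
three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

# A negative lookbehind nested inside a positive lookbehind.
nested = re.compile(r"(?<=(?<!foo)bar)baz")


def found(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


for subject in ("123foo", "999foo", "123abcfoo"):
    print(subject, found(three_digits_not_999, subject))
for subject in ("123abcfoo", "123999foo"):
    print(subject, found(six_back, subject))
for subject in ("barbaz", "foobarbaz"):
    print(subject, found(nested, subject))
```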
CONDITIONAL SUBPATTERNS
It is possible to cause the matching process to obey a subpattern con-
ditionally or to choose between two alternative subpatterns, depending
on the result of an assertion, or whether a specific capturing subpat-
tern has already been matched. The two possible forms of conditional
subpattern are:
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
If the condition is satisfied, the yes-pattern is used; otherwise the
no-pattern (if present) is used. If there are more than two alterna-
tives in the subpattern, a compile-time error occurs.
There are four kinds of condition: references to subpatterns, refer-
ences to recursion, a pseudo-condition called DEFINE, and assertions.
Checking for a used subpattern by number
If the text between the parentheses consists of a sequence of digits,
the condition is true if a capturing subpattern of that number has pre-
viously matched. If there is more than one capturing subpattern with
the same number (see the earlier section about duplicate subpattern
numbers), the condition is true if any of them have been set. An alter-
native notation is to precede the digits with a plus or minus sign. In
this case, the subpattern number is relative rather than absolute. The
most recently opened parentheses can be referenced by (?(-1), the next
most recent by (?(-2), and so on. In looping constructs it can also
make sense to refer to subsequent groups with constructs such as
(?(+2).
Consider the following pattern, which contains non-significant white
space to make it more readable (assume the PCRE_EXTENDED option) and to
divide it into three parts for ease of discussion:
( \( )? [^()]+ (?(1) \) )
The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The sec-
ond part matches one or more characters that are not parentheses. The
third part is a conditional subpattern that tests whether the first set
of parentheses matched or not. If they did, that is, if the subject started
with an opening parenthesis, the condition is true, and so the yes-pat-
tern is executed and a closing parenthesis is required. Otherwise,
since no-pattern is not present, the subpattern matches nothing. In
other words, this pattern matches a sequence of non-parentheses,
optionally enclosed in parentheses.
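The three-part pattern can be tried as written in any engine that sup-
ports numbered conditions. This sketch assumes Python's standard re
module, with re.VERBOSE standing in for PCRE_EXTENDED; the function
name accepts() is hypothetical.

```python
import re

# re.VERBOSE makes white space in the pattern non-significant.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def accepts(subject):
    """Return True if the whole subject matches the pattern."""
    return optional_parens.fullmatch(subject) is not None


for subject in ("abc", "(abc)", "(abc", "abc)"):
    print(subject, accepts(subject))
```

A closing parenthesis is demanded only when the opening one was cap-
tured, so the unbalanced subjects are rejected.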
If you were embedding this pattern in a larger one, you could use a
relative reference:
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
This makes the fragment independent of the parentheses in the larger
pattern.
Checking for a used subpattern by name
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
used subpattern by name. For compatibility with earlier versions of
PCRE, which had this facility before Perl, the syntax (?(name)...) is
also recognized. However, there is a possible ambiguity with this syn-
tax, because subpattern names may consist entirely of digits. PCRE
looks first for a named subpattern; if it cannot find one and the name
consists entirely of digits, PCRE looks for a subpattern of that num-
ber, which must be greater than zero. Using subpattern names that con-
sist entirely of digits is not recommended.
Rewriting the above example to use a named subpattern gives this:
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
If the name used in a condition of this kind is a duplicate, the test
is applied to all subpatterns of the same name, and is true if any one
of them has matched.
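A named condition can be sketched in the same way. The example assumes
Python's standard re module, which spells the group (?P<OPEN>...) and
accepts only the bare (?(name)...) form of the condition. The function
name opened() is hypothetical.

```python
import re

named_parens = re.compile(
    r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE
)


def opened(subject):
    """True/False: matched with/without parentheses; None: no match."""
    m = named_parens.fullmatch(subject)
    if m is None:
        return None
    return m.group("OPEN") is not None


for subject in ("(abc)", "abc", "(abc"):
    print(subject, opened(subject))
```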
Checking for pattern recursion
If the condition is the string (R), and there is no subpattern with the
name R, the condition is true if a recursive call to the whole pattern
or any subpattern has been made. If digits or a name preceded by amper-
sand follow the letter R, for example:
(?(R3)...) or (?(R&name)...)
the condition is true if the most recent recursion is into a subpattern
whose number or name is given. This condition does not check the entire
recursion stack. If the name used in a condition of this kind is a
duplicate, the test is applied to all subpatterns of the same name, and
is true if any one of them is the most recent recursion.
At "top level", all these recursion test conditions are false. The
syntax for recursive patterns is described below.
Defining subpatterns for use by reference only
If the condition is the string (DEFINE), and there is no subpattern
with the name DEFINE, the condition is always false. In this case,
there may be only one alternative in the subpattern. It is always
skipped if control reaches this point in the pattern; the idea of
DEFINE is that it can be used to define "subroutines" that can be ref-
erenced from elsewhere. (The use of "subroutines" is described below.)
For example, a pattern to match an IPv4 address could be written like
this (ignore whitespace and line breaks):
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b
The first part of the pattern is a DEFINE group inside which another
group named "byte" is defined. This matches an individual component of
an IPv4 address (a number less than 256). When matching takes place,
this part of the pattern is skipped because DEFINE acts like a false
condition. The rest of the pattern uses references to the named group
to match the four dot-separated components of an IPv4 address, insist-
ing on a word boundary at each end.
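Engines without DEFINE can get a similar "write once, use several
times" effect by composing the pattern from a string fragment. The
sketch below does that with Python's standard re module; it is a dif-
ferent technique from DEFINE (the fragment is copied textually, not
called), and the names BYTE, IPV4 and is_ipv4() are hypothetical.

```python
import re

# One IPv4 component (0-255), written once and reused by interpolation.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# {{3}} in the f-string becomes the quantifier {3} in the pattern.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def is_ipv4(text):
    """Return True if the whole text is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None


for text in ("192.168.0.1", "256.1.1.1", "1.2.3"):
    print(text, is_ipv4(text))
```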
Assertion conditions
If the condition is not in any of the above formats, it must be an
assertion. This may be a positive or negative lookahead or lookbehind
assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line:
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
The condition is a positive lookahead assertion that matches an
optional sequence of non-letters followed by a letter. In other words,
it tests for the presence of at least one letter in the subject. If a
letter is found, the subject is matched against the first alternative;
otherwise it is matched against the second. This pattern matches
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
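Where assertion conditions are not available, the same choice can be
written as two alternatives, each guarded by the assertion or its nega-
tion. The sketch below uses that rewrite with Python's standard re mod-
ule, which has no assertion conditions; it is an equivalent formulation
for this pattern, not PCRE's own syntax. The name date_like is hypo-
thetical.

```python
import re

date_like = re.compile(
    r"""(?: (?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead
          | (?![^a-z]*[a-z]) \d{2}-\d{2}-\d{2}      # no letter lies ahead
        )""",
    re.VERBOSE,
)

for subject in ("12-abc-34", "12-34-56", "12-34-5a"):
    print(subject, date_like.fullmatch(subject) is not None)
```

The last subject contains a letter, so only the first alternative is
eligible, and it does not fit the dd-aaa-dd form.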
COMMENTS
The sequence (?# marks the start of a comment that continues up to the
next closing parenthesis. Nested parentheses are not permitted. The
characters that make up a comment play no part in the pattern matching
at all.
If the PCRE_EXTENDED option is set, an unescaped # character outside a
character class introduces a comment that continues to immediately
after the next newline in the pattern.
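Both comment styles can be seen in this small sketch, which assumes
Python's standard re module; its (?#...) syntax is the same, and
re.VERBOSE plays the part of PCRE_EXTENDED. The pattern names are hypo-
thetical.

```python
import re

# An inline comment takes no part in the match.
inline = re.compile(r"ab(?#this text is ignored)c")

# In extended mode, an unescaped # starts a comment that runs to the
# end of the line.
extended = re.compile(
    r"""
    \d{4}   # year
    -
    \d{2}   # month
    """,
    re.VERBOSE,
)

print(inline.fullmatch("abc") is not None)
print(extended.fullmatch("2010-03") is not None)
```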
RECURSIVE PATTERNS
Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best
that can be done is to use a pattern that matches up to some fixed
depth of nesting. It is not possible to handle an arbitrary nesting
depth.
For some time, Perl has provided a facility that allows regular expres-
sions to recurse (amongst other things). It does this by interpolating
Perl code in the expression at run time, and the code can refer to the
expression itself. A Perl pattern using code interpolation to solve the
parentheses problem can be created like this:
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
The (?p{...}) item interpolates Perl code at run time, and in this case
refers recursively to the pattern in which it appears.
Obviously, PCRE cannot support the interpolation of Perl code. Instead,
it supports special syntax for recursion of the entire pattern, and
also for individual subpattern recursion. After its introduction in
PCRE and Python, this kind of recursion was subsequently introduced
into Perl at release 5.10.
A special item that consists of (? followed by a number greater than
zero and a closing parenthesis is a recursive call of the subpattern of
the given number, provided that it occurs inside that subpattern. (If
not, it is a "subroutine" call, which is described in the next sec-
tion.) The special item (?R) or (?0) is a recursive call of the entire
regular expression.
This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):
\( ( [^()]++ | (?R) )* \)
First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a
recursive match of the pattern itself (that is, a correctly parenthe-
sized substring). Finally there is a closing parenthesis. Note the use
of a possessive quantifier to avoid backtracking into sequences of non-
parentheses.
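What the recursive pattern does can be spelled out as an ordinary
recursive function. The sketch below is a hand-written Python analogue,
not a use of PCRE and not a regular expression at all; the function
name match_balanced() is hypothetical. The inner loop corresponds to
[^()]++ and the recursive call to (?R).

```python
def match_balanced(s, pos=0):
    """Return the index just past a balanced group at pos, or None."""
    if pos >= len(s) or s[pos] != "(":
        return None                      # \( must match first
    pos += 1
    while pos < len(s):
        if s[pos] == ")":
            return pos + 1               # the closing \)
        if s[pos] == "(":
            end = match_balanced(s, pos) # the (?R) step
            if end is None:
                return None
            pos = end
        else:
            # [^()]++ : consume a run of non-parentheses, never giving
            # any of it back.
            while pos < len(s) and s[pos] not in "()":
                pos += 1
    return None                          # ran out of subject before \)


print(match_balanced("(ab(cd)ef)"))
print(match_balanced("(aaaa()"))
```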
If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:
( \( ( [^()]++ | (?1) )* \) )
We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern.
In a larger pattern, keeping track of parenthesis numbers can be
tricky. This is made easier by the use of relative references (a Perl
5.10 feature). Instead of (?1) in the pattern above you can write
(?-2) to refer to the second most recently opened parentheses preceding
the recursion. In other words, a negative number counts capturing
parentheses leftwards from the point at which it is encountered.
It is also possible to refer to subsequently opened parentheses, by
writing references such as (?+2). However, these cannot be recursive
because the reference is not inside the parentheses that are refer-
enced. They are always "subroutine" calls, as described in the next
section.
An alternative approach is to use named parentheses instead. The Perl
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
supported. We could rewrite the above example as follows:
(?<pn> \( ( [^()]++ | (?&pn) )* \) )
If there is more than one subpattern with the same name, the earliest
one is used.
This particular example pattern that we have been looking at contains
nested unlimited repeats, and so the use of a possessive quantifier for
matching strings of non-parentheses is important when applying the pat-
tern to strings that do not match. For example, when this pattern is
applied to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
it yields "no match" quickly. However, if a possessive quantifier is
not used, the match runs for a very long time indeed because there are
so many different ways the + and * repeats can carve up the subject,
and all have to be tested before failure can be reported.
At the end of a match, the values of capturing parentheses are those
from the outermost level. If you want to obtain intermediate values, a
callout function can be used (see below and the pcrecallout documenta-
tion). If the pattern above is matched against
(ab(cd)ef)
the value for the inner capturing parentheses (numbered 2) is "ef",
which is the last value taken on at the top level. If a capturing sub-
pattern is not matched at the top level, its final value is unset, even
if it is (temporarily) set at a deeper level.
If there are more than 15 capturing parentheses in a pattern, PCRE has
to obtain extra memory to store data during a recursion, which it does
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
Do not confuse the (?R) item with the condition (R), which tests for
recursion. Consider this pattern, which matches text in angle brack-
ets, allowing for arbitrary nesting. Only digits are allowed in nested
brackets (that is, when recursing), whereas any characters are permit-
ted at the outer level.
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
In this pattern, (?(R) is the start of a conditional subpattern, with
two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call.
Recursion difference from Perl
In PCRE (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure. This can be
illustrated by the following pattern, which purports to match a palin-
dromic string that contains an odd number of characters (for example,
"a", "aba", "abcba", "abcdcba"):
^(.|(.)(?1)\2)$
The idea is that it either matches a single character, or two identical
characters surrounding a sub-palindrome. In Perl, this pattern works;
in PCRE it does not if the subject string is longer than three characters.
Consider the subject string "abcba":
At the top level, the first character is matched, but as it is not at
the end of the string, the first alternative fails; the second alterna-
tive is taken and the recursion kicks in. The recursive call to subpat-
tern 1 successfully matches the next character ("b"). (Note that the
beginning and end of line tests are not part of the recursion).
Back at the top level, the next character ("c") is compared with what
subpattern 2 matched, which was "a". This fails. Because the recursion
is treated as an atomic group, there are now no backtracking points,
and so the entire match fails. (Perl is able, at this point, to re-
enter the recursion and try the second alternative.) However, if the
pattern is written with the alternatives in the other order, things are
different:
^((.)(?1)\2|.)$
This time, the recursing alternative is tried first, and continues to
recurse until it runs out of characters, at which point the recursion
fails. But this time we do have another alternative to try at the
higher level. That is the big difference: in the previous case the
remaining alternative is at a deeper recursion level, which PCRE cannot
use.
To change the pattern so that it matches all palindromic strings, not just
those with an odd number of characters, it is tempting to change the
pattern to this:
^((.)(?1)\2|.?)$
Again, this works in Perl, but not in PCRE, and for the same reason.
When a deeper recursion has matched a single character, it cannot be
entered again in order to match an empty string. The solution is to
separate the two cases, and write out the odd and even cases as alter-
natives at the higher level:
^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
If you want to match typical palindromic phrases, the pattern has to
ignore all non-word characters, which can be done like this:
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
If run with the PCRE_CASELESS option, this pattern matches phrases such
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
Perl. Note the use of the possessive quantifier *+ to avoid backtrack-
ing into sequences of non-word characters. Without this, PCRE takes a
great deal longer (ten times or more) to match typical phrases, and
Perl takes so long that you think it has gone into a loop.
WARNING: The palindrome-matching patterns above work only if the sub-
ject string does not start with a palindrome that is shorter than the
entire string. For example, although "abcba" is correctly matched, if
the subject is "ababa", PCRE finds the palindrome "aba" at the start,
then fails at top level because the end of the string does not follow.
Once again, it cannot jump back into the recursion to try other alter-
natives, so the entire match fails.
SUBPATTERNS AS SUBROUTINES
If the syntax for a recursive subpattern reference (either by number or
by name) is used outside the parentheses to which it refers, it oper-
ates like a subroutine in a programming language. The "called" subpat-
tern may be defined before or after the reference. A numbered reference
can be absolute or relative, as in these examples:
(...(absolute)...)...(?2)...
(...(relative)...)...(?-1)...
(...(?+1)...(relative)...
An earlier example pointed out that the pattern
(sens|respons)e and \1ibility
matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern
(sens|respons)e and (?1)ibility
is used, it does match "sense and responsibility" as well as the other
two strings. Another example is given in the discussion of DEFINE
above.
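The distinction between reusing the matched text (a back reference) and
reusing the pattern (a subroutine call) can be sketched as follows.
Python's standard re module, assumed here, has no (?1) syntax, so the
subroutine is imitated by interpolating the same pattern fragment
twice; this is a textual substitute, not a real call. The names STEM,
reuse_pattern and reuse_text are hypothetical.

```python
import re

STEM = r"(?:sens|respons)"

# The fragment is matched afresh at each place it appears.
reuse_pattern = re.compile(rf"{STEM}e and {STEM}ibility")

# The back reference insists on the text captured the first time.
reuse_text = re.compile(r"(sens|respons)e and \1ibility")

subject = "sense and responsibility"
print(reuse_pattern.fullmatch(subject) is not None)
print(reuse_text.fullmatch(subject) is not None)
```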
Like recursive subpatterns, a subroutine call is always treated as an
atomic group. That is, once it has matched some of the subject string,
it is never re-entered, even if it contains untried alternatives and
there is a subsequent matching failure. Any capturing parentheses that
are set during the subroutine call revert to their previous values
afterwards.
When a subpattern is used as a subroutine, processing options such as
case-independence are fixed when the subpattern is defined. They cannot
be changed for different calls. For example, consider this pattern:
(abc)(?i:(?-1))
It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.
ONIGURUMA SUBROUTINE SYNTAX
For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
an alternative syntax for referencing a subpattern as a subroutine,
possibly recursively. Here are two of the examples used above, rewrit-
ten using this syntax:
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
(sens|respons)e and \g'1'ibility
PCRE supports an extension to Oniguruma: if a number is preceded by a
plus or a minus sign it is taken as a relative reference. For example:
(abc)(?i:\g<-1>)
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
synonymous. The former is a back reference; the latter is a subroutine
call.
CALLOUTS
Perl has a feature whereby using the sequence (?{...}) causes arbitrary
Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different sub-
strings that match the same pair of parentheses when there is a repeti-
tion.
PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
an external function by putting its entry point in the global variable
pcre_callout. By default, this variable contains NULL, which disables
all calling out.
Within a regular expression, (?C) indicates the points at which the
external function is to be called. If you want to identify different
callout points, you can put a number less than 256 after the letter C.
The default value is zero. For example, this pattern has two callout
points:
(?C1)abc(?C2)def
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
automatically installed before each item in the pattern. They are all
numbered 255.
During matching, when PCRE reaches a callout point (and pcre_callout is
set), the external function is called. It is provided with the number
of the callout, the position in the pattern, and, optionally, one item
of data originally supplied by the caller of pcre_exec(). The callout
function may cause matching to proceed, to backtrack, or to fail alto-
gether. A complete description of the interface to the callout function
is given in the pcrecallout documentation.
BACKTRACKING CONTROL
Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
which are described in the Perl documentation as "experimental and sub-
ject to change or removal in a future version of Perl". It goes on to
say: "Their usage in production code should be noted to avoid problems
during upgrades." The same remarks apply to the PCRE features described
in this section.
Since these verbs are specifically related to backtracking, most of
them can be used only when the pattern is to be matched using
pcre_exec(), which uses a backtracking algorithm. With the exception of
(*FAIL), which behaves like a failing negative assertion, they cause an
error if encountered by pcre_dfa_exec().
If any of these verbs are used in an assertion or subroutine subpattern
(including recursive subpatterns), their effect is confined to that
subpattern; it does not extend to the surrounding pattern. Note that
such subpatterns are processed as anchored at the point where they are
tested.
The new verbs make use of what was previously invalid syntax: an open-
ing parenthesis followed by an asterisk. In Perl, they are generally of
the form (*VERB:ARG) but PCRE does not support the use of arguments, so
its general form is just (*VERB). Any number of these verbs may occur
in a pattern. There are two kinds:
Verbs that act immediately
The following verbs act as soon as they are encountered:
(*ACCEPT)
This verb causes the match to end successfully, skipping the remainder
of the pattern. When inside a recursion, only the innermost pattern is
ended immediately. If (*ACCEPT) is inside capturing parentheses, the
data so far is captured. (This feature was added to PCRE at release
8.00.) For example:
A((?:A|B(*ACCEPT)|C)D)
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
tured by the outer parentheses.
(*FAIL) or (*F)
This verb causes the match to fail, forcing backtracking to occur. It
is equivalent to (?!) but easier to read. The Perl documentation notes
that it is probably useful only when combined with (?{}) or (??{}).
Those are, of course, Perl features that are not present in PCRE. The
nearest equivalent is the callout feature, as for example in this pat-
tern:
a+(?C)(*FAIL)
A match with the string "aaaa" always fails, but the callout is taken
before each backtrack happens (in this example, 10 times).
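The figure of 10 can be reproduced by counting every way in which a+
can match at every starting position, since the callout is reached once
for each of them before (*FAIL) forces a backtrack. The sketch below is
plain Python arithmetic, not a call to PCRE, and it assumes that no
start-of-match optimization skips any starting position. The function
name count_callouts() is hypothetical.

```python
def count_callouts(subject):
    """Count the (start, end) pairs at which a+ can match in subject."""
    count = 0
    for start in range(len(subject)):
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
            count += 1  # a+ can match subject[start:end]
    return count


# 4 ways from the first character, then 3, 2 and 1.
print(count_callouts("aaaa"))
```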
Verbs that act after backtracking
The following verbs do nothing when they are encountered. Matching con-
tinues with what follows, but if there is no subsequent match, a fail-
ure is forced. The verbs differ in exactly what kind of failure
occurs.
(*COMMIT)
This verb causes the whole match to fail outright if the rest of the
pattern does not match. Even if the pattern is unanchored, no further
attempts to find a match by advancing the starting point take place.
Once (*COMMIT) has been passed, pcre_exec() is committed to finding a
match at the current starting point, or not at all. For example:
a+(*COMMIT)b
This matches "xxaab" but not "aacaab". It can be thought of as a kind
of dynamic anchor, or "I've started, so I must finish."
(*PRUNE)
This verb causes the match to fail at the current position if the rest
of the pattern does not match. If the pattern is unanchored, the normal
"bumpalong" advance to the next starting character then happens. Back-
tracking can occur as usual to the left of (*PRUNE), or when matching
to the right of (*PRUNE), but if there is no match to the right, back-
tracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE)
is just an alternative to an atomic group or possessive quantifier, but
there are some uses of (*PRUNE) that cannot be expressed in any other
way.
(*SKIP)
This verb is like (*PRUNE), except that if the pattern is unanchored,
the "bumpalong" advance is not to the next character, but to the posi-
tion in the subject where (*SKIP) was encountered. (*SKIP) signifies
that whatever text was matched leading up to it cannot be part of a
successful match. Consider:
a+(*SKIP)b
If the subject is "aaaac...", after the first match attempt fails
(starting at the first character in the string), the starting point
skips on to start the next attempt at "c". Note that a possessive quan-
tifier does not have the same effect as this example; although it would
suppress backtracking during the first match attempt, the second
attempt would start at the second character instead of skipping on to
"c".
(*THEN)
This verb causes a skip to the next alternation if the rest of the pat-
tern does not match. That is, it cancels pending backtracking, but only
within the current alternation. Its name comes from the observation
that it can be used for a pattern-based if-then-else block:
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
If the COND1 pattern matches, FOO is tried (and possibly further items
after the end of the group if FOO succeeds); on failure the matcher
skips to the second alternative and tries COND2, without backtracking
into COND1. If (*THEN) is used outside of any alternation, it acts
exactly like (*PRUNE).
SEE ALSO
pcreapi(3), pcrecallout(3), pcrematching(3), pcresyntax(3), pcre(3).
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 06 March 2010
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
PCRESYNTAX(3) PCRESYNTAX(3)
NAME
PCRE - Perl-compatible regular expressions
PCRE REGULAR EXPRESSION SYNTAX SUMMARY
The full syntax and semantics of the regular expressions that are sup-
ported by PCRE are described in the pcrepattern documentation. This
document contains just a quick-reference summary of the syntax.
QUOTING
\x         where x is non-alphanumeric is a literal x
\Q...\E    treat enclosed characters as literal
CHARACTERS
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any character
\e escape (hex 1B)
\f formfeed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\ddd character with octal code ddd, or backreference
\xhh character with hex code hh
\x{hhh..} character with hex code hhh..
CHARACTER TYPES
. any character except newline;
    in dotall mode, any character whatsoever
\C one byte, even in UTF-8 mode (best avoided)
\d a decimal digit
\D a character that is not a decimal digit
\h a horizontal whitespace character
\H a character that is not a horizontal whitespace character
\p{xx} a character with the xx property
\P{xx} a character without the xx property
\R a newline sequence
\s a whitespace character
\S a character that is not a whitespace character
\v a vertical whitespace character
\V a character that is not a vertical whitespace character
\w a "word" character
\W a "non-word" character
\X an extended Unicode sequence
In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters.
GENERAL CATEGORY PROPERTY CODES FOR \p and \P
C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate
L Letter
Ll Lower case letter
Lm Modifier letter
Lo Other letter
Lt Title case letter
Lu Upper case letter
L& Ll, Lu, or Lt
M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark
N Number
Nd Decimal number
Nl Letter number
No Other number
P Punctuation
Pc Connector punctuation
Pd Dash punctuation
Pe Close punctuation
Pf Final punctuation
Pi Initial punctuation
Po Other punctuation
Ps Open punctuation
S Symbol
Sc Currency symbol
Sk Modifier symbol
Sm Mathematical symbol
So Other symbol
Z Separator
Zl Line separator
Zp Paragraph separator
Zs Space separator
SCRIPT NAMES FOR \p AND \P
Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyp-
tian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek,
Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Impe-
rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,
Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao,
Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam,
Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic,
Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya,
Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian,
Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le,
Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
Ugaritic, Vai, Yi.
CHARACTER CLASSES
[...] positive character class
[^...] negative character class
[x-y] range (can be used for hex characters)
[[:xxx:]] positive POSIX named set
[[:^xxx:]] negative POSIX named set
alnum alphanumeric
alpha alphabetic
ascii 0-127
blank space or tab
cntrl control character
digit decimal digit
graph printing, excluding space
lower lower case letter
print printing, including space
punct printing, excluding alphanumeric
space whitespace
upper upper case letter
word same as \w
xdigit hexadecimal digit
In PCRE, POSIX character set names recognize only ASCII characters. You
can use \Q...\E inside a character class.
QUANTIFIERS
? 0 or 1, greedy
?+ 0 or 1, possessive
?? 0 or 1, lazy
* 0 or more, greedy
*+ 0 or more, possessive
*? 0 or more, lazy
+ 1 or more, greedy
++ 1 or more, possessive
+? 1 or more, lazy
{n} exactly n
{n,m} at least n, no more than m, greedy
{n,m}+ at least n, no more than m, possessive
{n,m}? at least n, no more than m, lazy
{n,} n or more, greedy
{n,}+ n or more, possessive
{n,}? n or more, lazy
ANCHORS AND SIMPLE ASSERTIONS
\b      word boundary (only ASCII letters recognized)
\B      not a word boundary
^       start of subject
          also after internal newline in multiline mode
\A      start of subject
$       end of subject
          also before newline at end of subject
          also before internal newline in multiline mode
\Z      end of subject
          also before newline at end of subject
\z      end of subject
\G      first matching position in subject
MATCH POINT RESET
\K reset start of match
ALTERNATION
expr|expr|expr...
CAPTURING
(...) capturing group
(?<name>...) named capturing group (Perl)
(?'name'...) named capturing group (Perl)
(?P<name>...) named capturing group (Python)
(?:...) non-capturing group
(?|...) non-capturing group; reset group numbers for
    capturing groups in each alternative
ATOMIC GROUPS
(?>...) atomic, non-capturing group
COMMENT
(?#....) comment (not nestable)
OPTION SETTING
(?i) caseless
(?J) allow duplicate names
(?m) multiline
(?s) single line (dotall)
(?U) default ungreedy (lazy)
(?x) extended (ignore white space)
(?-...) unset option(s)
The following is recognized only at the start of a pattern or after one
of the newline-setting options with similar syntax:
(*UTF8) set UTF-8 mode
LOOKAHEAD AND LOOKBEHIND ASSERTIONS
(?=...) positive look ahead
(?!...) negative look ahead
(?<=...) positive look behind
(?<!...) negative look behind
Each top-level branch of a look behind must be of a fixed length.
BACKREFERENCES
\n reference by number (can be ambiguous)
\gn reference by number
\g{n} reference by number
\g{-n} relative reference by number
\k<name> reference by name (Perl)
\k'name' reference by name (Perl)
\g{name} reference by name (Perl)
\k{name} reference by name (.NET)
(?P=name) reference by name (Python)
SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
(?R) recurse whole pattern
(?n) call subpattern by absolute number
(?+n) call subpattern by relative number
(?-n) call subpattern by relative number
(?&name) call subpattern by name (Perl)
(?P>name) call subpattern by name (Python)
\g<name> call subpattern by name (Oniguruma)
\g'name' call subpattern by name (Oniguruma)
\g<n> call subpattern by absolute number (Oniguruma)
\g'n' call subpattern by absolute number (Oniguruma)
\g<+n> call subpattern by relative number (PCRE extension)
\g'+n' call subpattern by relative number (PCRE extension)
\g<-n> call subpattern by relative number (PCRE extension)
\g'-n' call subpattern by relative number (PCRE extension)
CONDITIONAL PATTERNS
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
(?(n)... absolute reference condition
(?(+n)... relative reference condition
(?(-n)... relative reference condition
(?(<name>)... named reference condition (Perl)
(?('name')... named reference condition (Perl)
(?(name)... named reference condition (PCRE)
(?(R)... overall recursion condition
(?(Rn)... specific group recursion condition
(?(R&name)... specific recursion condition
(?(DEFINE)... define subpattern for reference
(?(assert)... assertion condition
BACKTRACKING CONTROL
The following act as soon as they are reached:
(*ACCEPT) force successful match
(*FAIL) force backtrack; synonym (*F)
The following act only when a subsequent match failure causes a back-
track to reach them. They all force a match failure, but they differ in
what happens afterwards. Those that advance the start-of-match point do
so only if the pattern is not anchored.
(*COMMIT) overall failure, no advance of starting point
(*PRUNE) advance to next starting character
(*SKIP) advance start to current matching position
(*THEN) local failure, backtrack to next alternation
NEWLINE CONVENTIONS
These are recognized only at the very start of the pattern or after a
(*BSR_...) or (*UTF8) option.
(*CR) carriage return only
(*LF) linefeed only
(*CRLF) carriage return followed by linefeed
(*ANYCRLF) all three of the above
(*ANY) any Unicode newline sequence
WHAT \R MATCHES
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or UTF-8 mode.
(*BSR_ANYCRLF) CR, LF, or CRLF
(*BSR_UNICODE) any Unicode newline sequence
CALLOUTS
(?C) callout
(?Cn) callout with data n
SEE ALSO
pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 01 March 2010
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
PCREPARTIAL(3) PCREPARTIAL(3)
NAME
PCRE - Perl-compatible regular expressions
PARTIAL MATCHING IN PCRE
In normal use of PCRE, if the subject string that is passed to
pcre_exec() or pcre_dfa_exec() matches as far as it goes, but is too
short to match the entire pattern, PCRE_ERROR_NOMATCH is returned.
There are circumstances where it might be helpful to distinguish this
case from other cases in which there is no match.
Consider, for example, an application where a human is required to type
in data for a field with specific formatting requirements. An example
might be a date in the form ddmmmyy, defined by this pattern:
^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
If the application sees the user's keystrokes one by one, and can check
that what has been typed so far is potentially valid, it is able to
raise an error as soon as a mistake is made, by beeping and not
reflecting the character that has been typed, for example. This immedi-
ate feedback is likely to be a better user interface than a check that
is delayed until the entire string has been entered. Partial matching
can also sometimes be useful when the subject string is very long and
is not all available at once.
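
To make the three possible outcomes concrete, here is a small sketch
that classifies what has been typed so far for the ddmmmyy pattern
above. It is hand-written and does not call PCRE; the names date_status,
try_form() and classify_date() are invented for this illustration, and a
complete match is preferred to a partial one.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum date_status { DATE_NOMATCH = 0, DATE_PARTIAL = 1, DATE_COMPLETE = 2 };

/* Try the ddmmmyy form with exactly ndigits day digits. Running out of
   input after at least one character was consumed counts as partial. */
static enum date_status try_form(const char *s, int ndigits)
{
    static const char *const months[] = {
        "jan", "feb", "mar", "apr", "may", "jun",
        "jul", "aug", "sep", "oct", "nov", "dec"
    };
    enum date_status month_result = DATE_NOMATCH;
    size_t i = 0, m;
    int k, month_ok = 0;

    for (k = 0; k < ndigits; k++, i++) {             /* day digits */
        if (s[i] == '\0') return i > 0 ? DATE_PARTIAL : DATE_NOMATCH;
        if (!isdigit((unsigned char)s[i])) return DATE_NOMATCH;
    }
    for (m = 0; m < 12 && !month_ok; m++) {          /* month name */
        size_t j = 0;
        while (j < 3 && s[i + j] == months[m][j]) j++;
        if (j == 3) month_ok = 1;
        else if (s[i + j] == '\0') month_result = DATE_PARTIAL;
    }
    if (!month_ok) return month_result;
    i += 3;
    for (k = 0; k < 2; k++, i++) {                   /* year digits */
        if (s[i] == '\0') return DATE_PARTIAL;
        if (!isdigit((unsigned char)s[i])) return DATE_NOMATCH;
    }
    return s[i] == '\0' ? DATE_COMPLETE : DATE_NOMATCH;
}

/* The day may have one or two digits, so try both and keep the best. */
static enum date_status classify_date(const char *s)
{
    enum date_status a, b;

    assert(s != NULL);
    a = try_form(s, 2);
    b = try_form(s, 1);
    return a > b ? a : b;
}
```

An application could call classify_date() after each keystroke and
reject the character as soon as the result is DATE_NOMATCH. With PCRE
itself, the same decision is made from the return value of the matching
function, as described in the following sections.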
PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
PCRE_PARTIAL_HARD options, which can be set when calling pcre_exec() or
pcre_dfa_exec(). For backwards compatibility, PCRE_PARTIAL is a synonym
for PCRE_PARTIAL_SOFT. The essential difference between the two options
is whether or not a partial match is preferred to an alternative com-
plete match, though the details differ between the two matching func-
tions. If both options are set, PCRE_PARTIAL_HARD takes precedence.
Setting a partial matching option disables two of PCRE's optimizations.
PCRE remembers the last literal byte in a pattern, and abandons match-
ing immediately if such a byte is not present in the subject string.
This optimization cannot be used for a subject string that might match
only partially. If the pattern was studied, PCRE knows the minimum
length of a matching string, and does not bother to run the matching
function on shorter strings. This optimization is also disabled for
partial matching.
PARTIAL MATCHING USING pcre_exec()
A partial match occurs during a call to pcre_exec() whenever the end of
the subject string is reached successfully, but matching cannot con-
tinue because more characters are needed. However, at least one charac-
ter must have been matched. (In other words, a partial match can never
be an empty string.)
If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but
matching continues as normal, and other alternatives in the pattern are
tried. If no complete match can be found, pcre_exec() returns
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. If there are at least
two slots in the offsets vector, the first of them is set to the offset
of the earliest character that was inspected when the partial match was
found. For convenience, the second offset points to the end of the
string so that a substring can easily be identified.
For the majority of patterns, the first offset identifies the start of
the partially matched string. However, for patterns that contain look-
behind assertions, or \K, or begin with \b or \B, earlier characters
have been inspected while carrying out the match. For example:
/(?<=abc)123/
This pattern matches "123", but only if it is preceded by "abc". If the
subject string is "xyzabc12", the offsets after a partial match are for
the substring "abc12", because all these characters are needed if
another match is tried with extra characters added.
If there is more than one partial match, the first one that was found
provides the data that is returned. Consider this pattern:
/123\w+X|dogY/
If this is matched against the subject string "abc123dog", both alter-
natives fail to match, but the end of the subject is reached during
matching, so PCRE_ERROR_PARTIAL is returned instead of
PCRE_ERROR_NOMATCH. The offsets are set to 3 and 9, identifying
"123dog" as the first partial match that was found. (In this example,
there are two partial matches, because "dog" on its own partially
matches the second alternative.)
If PCRE_PARTIAL_HARD is set for pcre_exec(), it returns PCRE_ERROR_PAR-
TIAL as soon as a partial match is found, without continuing to search
for possible complete matches. The difference between the two options
can be illustrated by a pattern such as:
/dog(sbody)?/
This matches either "dog" or "dogsbody", greedily (that is, it prefers
the longer string if possible). If it is matched against the string
"dog" with PCRE_PARTIAL_SOFT, it yields a complete match for "dog".
However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
On the other hand, if the pattern is made ungreedy the result is dif-
ferent:
/dog(sbody)??/
In this case the result is always a complete match because pcre_exec()
finds that first, and it never continues after finding a match. It
might be easier to follow this explanation by thinking of the two pat-
terns like this:
/dog(sbody)?/ is the same as /dogsbody|dog/
/dog(sbody)??/ is the same as /dog|dogsbody/
The second pattern will never match "dogsbody" when pcre_exec() is
used, because it will always find the shorter match first.
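
The rewriting of the two patterns as ordered alternatives suggests a
simple model of how the soft and hard options differ. The sketch below
is a simplification, not PCRE code: it handles only literal
alternatives tried at the start of the subject, and the names outcome
and try_alternatives() are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum outcome { OUT_NOMATCH, OUT_PARTIAL, OUT_COMPLETE };

/* Try literal alternatives in order at the start of the subject.
   hard != 0 models PCRE_PARTIAL_HARD, hard == 0 PCRE_PARTIAL_SOFT. */
static enum outcome try_alternatives(const char *const *alts, size_t nalts,
                                     const char *subject, int hard)
{
    size_t slen, a;
    int saw_partial = 0;

    assert(alts != NULL && subject != NULL);
    slen = strlen(subject);

    for (a = 0; a < nalts; a++) {
        size_t alen = strlen(alts[a]);

        /* The whole alternative is present: a complete match, and the
           search stops at the first one that is found. */
        if (alen <= slen && strncmp(subject, alts[a], alen) == 0)
            return OUT_COMPLETE;

        /* The subject ran out while the alternative was still matching. */
        if (slen > 0 && slen < alen &&
            strncmp(subject, alts[a], slen) == 0) {
            if (hard)
                return OUT_PARTIAL;     /* hard: stop at first partial */
            saw_partial = 1;            /* soft: remember, keep trying */
        }
    }
    return saw_partial ? OUT_PARTIAL : OUT_NOMATCH;
}
```

With the alternatives in the order "dogsbody", "dog" and the subject
"dog", the model gives a complete match for the soft option and a
partial match for the hard option. With the order "dog", "dogsbody" the
result is a complete match in both cases, as described above.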
PARTIAL MATCHING USING pcre_dfa_exec()
The pcre_dfa_exec() function moves along the subject string character
by character, without backtracking, searching for all possible matches
simultaneously. If the end of the subject is reached before the end of
the pattern, there is the possibility of a partial match, again pro-
vided that at least one character has matched.
When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if
there have been no complete matches. Otherwise, the complete matches
are returned. However, if PCRE_PARTIAL_HARD is set, a partial match
takes precedence over any complete matches. The portion of the string
that was inspected when the longest partial match was found is set as
the first matching string, provided there are at least two slots in the
offsets vector.
Because pcre_dfa_exec() always searches for all possible matches, and
there is no difference between greedy and ungreedy repetition, its
behaviour is different from pcre_exec() when PCRE_PARTIAL_HARD is set.
Consider the string "dog" matched against the ungreedy pattern shown
above:
/dog(sbody)??/
Whereas pcre_exec() stops as soon as it finds the complete match for
"dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
so returns that when PCRE_PARTIAL_HARD is set.
PARTIAL MATCHING AND WORD BOUNDARIES
If a pattern ends with one of the sequences \b or \B, which test for word
boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-
intuitive results. Consider this pattern:
/\bcat\b/
This matches "cat", provided there is a word boundary at either end. If
the subject string is "the cat", the comparison of the final "t" with a
following character cannot take place, so a partial match is found.
However, pcre_exec() carries on with normal matching, which matches \b
at the end of the subject when the last character is a letter, thus
finding a complete match. The result, therefore, is not PCRE_ERROR_PAR-
TIAL. The same thing happens with pcre_dfa_exec(), because it also
finds the complete match.
Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL,
because then the partial match takes precedence.
FORMERLY RESTRICTED PATTERNS
For releases of PCRE prior to 8.00, because of the way certain internal
optimizations were implemented in the pcre_exec() function, the
PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
used with all patterns. From release 8.00 onwards, the restrictions no
longer apply, and partial matching with pcre_exec() can be requested
for any pattern.
Items that were formerly restricted were repeated single characters and
repeated metasequences. If PCRE_PARTIAL was set for a pattern that did
not conform to the restrictions, pcre_exec() returned the error code
PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if a compiled
pattern can be used for partial matching now always returns 1.
EXAMPLE OF PARTIAL MATCHING USING PCRETEST
If the escape sequence \P is present in a pcretest data line, the
PCRE_PARTIAL_SOFT option is used for the match. Here is a run of
pcretest that uses the date example quoted above:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data> 25jun04\P
0: 25jun04
1: jun
data> 25dec3\P
Partial match: 25dec3
data> 3ju\P
Partial match: 3ju
data> 3juj\P
No match
data> j\P
No match
The first data string is matched completely, so pcretest shows the
matched substrings. The remaining four strings do not match the com-
plete pattern, but the first two are partial matches. Similar output is
obtained when pcre_dfa_exec() is used.
If the escape sequence \P is present more than once in a pcretest data
line, the PCRE_PARTIAL_HARD option is set for the match.
MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
When a partial match has been found using pcre_dfa_exec(), it is possi-
ble to continue the match by providing additional subject data and
calling pcre_dfa_exec() again with the same compiled regular expres-
sion, this time setting the PCRE_DFA_RESTART option. You must pass the
same working space as before, because this is where details of the pre-
vious partial match are stored. Here is an example using pcretest,
using the \R escape sequence to set the PCRE_DFA_RESTART option (\D
specifies the use of pcre_dfa_exec()):
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data> 23ja\P\D
Partial match: 23ja
data> n05\R\D
0: n05
The first call has "23ja" as the subject, and requests partial match-
ing; the second call has "n05" as the subject for the continued
(restarted) match. Notice that when the match is complete, only the
last part is shown; PCRE does not retain the previously partially-
matched string. It is up to the calling program to do that if it needs
to.
You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
PCRE_DFA_RESTART to continue partial matching over multiple segments.
This facility can be used to pass very long subject strings to
pcre_dfa_exec().
MULTI-SEGMENT MATCHING WITH pcre_exec()
From release 8.00, pcre_exec() can also be used to do multi-segment
matching. Unlike pcre_dfa_exec(), it is not possible to restart the
previous match with a new segment of data. Instead, new data must be
added to the previous subject string, and the entire match re-run,
starting from the point where the partial match occurred. Earlier data
can be discarded. Consider an unanchored pattern that matches dates:
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
data> The date is 23ja\P
Partial match: 23ja
At this stage, an application could discard the text preceding "23ja",
add on text from the next segment, and call pcre_exec() again. Unlike
pcre_dfa_exec(), the entire matching string must always be available,
and the complete matching process occurs for each call, so more memory
and more processing time are needed.
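
The buffer handling for this approach might look like the following
sketch. It covers only the step of discarding earlier data and adding
the next segment, and does not call PCRE. The function name
carry_over() and its parameters are invented for this illustration;
keep_from stands for the first offset returned with the partial match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Discard everything before keep_from, slide the remaining tail to the
   front of buf, and append the next segment. buf holds len bytes plus a
   terminating zero and has room for cap bytes in total. Returns the new
   length, or (size_t)-1 if the result would not fit. */
static size_t carry_over(char *buf, size_t len, size_t cap,
                         size_t keep_from, const char *seg, size_t seglen)
{
    size_t tail;

    assert(buf != NULL && seg != NULL && keep_from <= len && len < cap);
    tail = len - keep_from;
    if (seglen >= cap - tail)             /* tail + segment + zero byte */
        return (size_t)-1;

    memmove(buf, buf + keep_from, tail);  /* the regions may overlap */
    memcpy(buf + tail, seg, seglen);
    buf[tail + seglen] = '\0';
    return tail + seglen;
}
```

For the example above, the partial match "23ja" starts at offset 12 of
"The date is 23ja". Calling carry_over() with keep_from set to 12 and a
next segment of "n05 and" leaves "23jan05 and" in the buffer, which is
then passed to pcre_exec() as the new subject.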
Note: If the pattern contains lookbehind assertions, or \K, or starts
with \b or \B, the string that is returned for a partial match will
include characters that precede the partially matched string itself,
because these must be retained when adding on more characters for a
subsequent matching attempt.
ISSUES WITH MULTI-SEGMENT MATCHING
Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.
1. If the pattern contains tests for the beginning or end of a line,
you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-
ate, when the subject string for any call does not contain the begin-
ning or end of a line.
2. Lookbehind assertions at the start of a pattern are catered for in
the offsets that are returned for a partial match. However, in theory,
a lookbehind assertion later in the pattern could require even earlier
characters to be inspected, and it might not have been reached when a
partial match occurs. This is probably an extremely unlikely case; you
could guard against it to a certain extent by always including extra
characters at the start.
3. Matching a subject string that is split into multiple segments may
not always produce exactly the same result as matching over one single
long string, especially when PCRE_PARTIAL_SOFT is used. The section
"Partial Matching and Word Boundaries" above describes an issue that
arises if the pattern ends with \b or \B. Another kind of difference
may occur when there are multiple matching possibilities, because a
partial match result is given only when there are no completed matches.
This means that as soon as the shortest match has been found, continua-
tion to a new subject segment is no longer possible. Consider again
this pcretest example:
re> /dog(sbody)?/
data> dogsb\P
0: dog
data> do\P\D
Partial match: do
data> gsb\R\P\D
0: g
data> dogsbody\D
0: dogsbody
1: dog
The first data line passes the string "dogsb" to pcre_exec(), setting
the PCRE_PARTIAL_SOFT option. Although the string is a partial match
for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the
shorter string "dog" is a complete match. Similarly, when the subject
is presented to pcre_dfa_exec() in several parts ("do" and "gsb" being
the first two) the match stops when "dog" has been found, and it is not
possible to continue. On the other hand, if "dogsbody" is presented as
a single string, pcre_dfa_exec() finds both matches.
Because of these problems, it is probably best to use PCRE_PARTIAL_HARD
when matching multi-segment data. The example above then behaves dif-
ferently:
re> /dog(sbody)?/
data> dogsb\P\P
Partial match: dogsb
data> do\P\D
Partial match: do
data> gsb\R\P\P\D
Partial match: gsb
4. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected when
PCRE_DFA_RESTART is used with pcre_dfa_exec(). For example, consider
this pattern:
1234|3789
If the first part of the subject is "ABC123", a partial match of the
first alternative is found at offset 3. There is no partial match for
the second alternative, because such a match does not start at the same
point in the subject string. Attempting to continue with the string
"7890" does not yield a match because only those alternatives that
match at one point in the subject are remembered. The problem arises
because the start of the second alternative matches within the first
alternative. There is no problem with anchored patterns or patterns
such as:
1234|ABCD
where no string can be a partial match for both alternatives. This is
not a problem if pcre_exec() is used, because the entire match has to
be rerun each time:
re> /1234|3789/
data> ABC123\P
Partial match: 123
data> 1237890
0: 3789
Of course, instead of using PCRE_DFA_RESTART, the same technique of
re-running the entire match can also be used with pcre_dfa_exec(). Another
possibility is to work with two buffers. If a partial match at offset n
in the first buffer is followed by "no match" when PCRE_DFA_RESTART is
used on the second buffer, you can then try a new match starting at
offset n+1 in the first buffer.
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 19 October 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
PCREPRECOMPILE(3) PCREPRECOMPILE(3)
NAME
PCRE - Perl-compatible regular expressions
SAVING AND RE-USING PRECOMPILED PCRE PATTERNS
If you are running an application that uses a large number of regular
expression patterns, it may be useful to store them in a precompiled
form instead of having to compile them every time the application is
run. If you are not using any private character tables (see the
pcre_maketables() documentation), this is relatively straightforward.
If you are using private tables, it is a little bit more complicated.
If you save compiled patterns to a file, you can copy them to a differ-
ent host and run them there. This works even if the new host has the
opposite endianness to the one on which the patterns were compiled.
There may be a small performance penalty, but it should be insignifi-
cant. However, compiling regular expressions with one version of PCRE
for use with a different version is not guaranteed to work and may
cause crashes.
SAVING A COMPILED PATTERN
The value returned by pcre_compile() points to a single block of memory
that holds the compiled pattern and associated data. You can find the
length of this block in bytes by calling pcre_fullinfo() with an argu-
ment of PCRE_INFO_SIZE. You can then save the data in any appropriate
manner. Here is sample code that compiles a pattern and writes it to a
file. It assumes that the variable fd refers to a file that is open for
output:
int erroroffset, rc;
size_t size, written;
const char *error;
pcre *re;
re = pcre_compile("my pattern", 0, &error, &erroroffset, NULL);
if (re == NULL) { ... handle errors ... }
/* PCRE_INFO_SIZE stores its result in a size_t variable */
rc = pcre_fullinfo(re, NULL, PCRE_INFO_SIZE, &size);
if (rc < 0) { ... handle errors ... }
/* fd must be a FILE * stream, because fwrite() is used */
written = fwrite(re, 1, size, fd);
if (written != size) { ... handle errors ... }
In this example, the bytes that comprise the compiled pattern are
copied exactly. Note that this is binary data that may contain any of
the 256 possible byte values. On systems that make a distinction
between binary and non-binary data, be sure that the file is opened for
binary output.
If you want to write more than one pattern to a file, you will have to
devise a way of separating them. For binary data, preceding each pat-
tern with its length is probably the most straightforward approach.
Another possibility is to write out the data in hexadecimal instead of
binary, one pattern to a line.
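
A length-prefixed layout could be sketched as follows. The record
format (a 4-byte big-endian length followed by the data) and the
function names write_record() and read_record() are assumptions made
for this illustration; they are not part of PCRE. The data is treated
as an opaque block of bytes, such as the one whose size is given by
PCRE_INFO_SIZE.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write one record: a 4-byte big-endian length, then the bytes.
   Returns 0 on success and -1 on error. */
static int write_record(FILE *fp, const void *data, size_t len)
{
    unsigned char hdr[4];

    assert(fp != NULL && (data != NULL || len == 0));
    if (len > 0xFFFFFFFFu) return -1;     /* does not fit in 4 bytes */
    hdr[0] = (unsigned char)((len >> 24) & 0xFF);
    hdr[1] = (unsigned char)((len >> 16) & 0xFF);
    hdr[2] = (unsigned char)((len >> 8) & 0xFF);
    hdr[3] = (unsigned char)(len & 0xFF);
    if (fwrite(hdr, 1, 4, fp) != 4) return -1;
    if (len > 0 && fwrite(data, 1, len, fp) != len) return -1;
    return 0;
}

/* Read the next record into malloc()ed memory, which the caller must
   free. Returns NULL at end of file or on error. */
static void *read_record(FILE *fp, size_t *len_out)
{
    unsigned char hdr[4];
    size_t len;
    void *data;

    assert(fp != NULL && len_out != NULL);
    if (fread(hdr, 1, 4, fp) != 4) return NULL;
    len = ((size_t)hdr[0] << 24) | ((size_t)hdr[1] << 16) |
          ((size_t)hdr[2] << 8) | (size_t)hdr[3];
    data = malloc(len > 0 ? len : 1);
    if (data == NULL) return NULL;
    if (fread(data, 1, len, fp) != len) { free(data); return NULL; }
    *len_out = len;
    return data;
}
```

Writing the length byte by byte keeps the file layout independent of
the endianness of the host that wrote it. The stream must be opened in
binary mode on systems that make the distinction.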
Saving compiled patterns in a file is only one possible way of storing
them for later use. They could equally well be saved in a database, or
in the memory of some daemon process that passes them via sockets to
the processes that want them.
If the pattern has been studied, it is also possible to save the study
data in a similar way to the compiled pattern itself. When studying
generates additional information, pcre_study() returns a pointer to a
pcre_extra data block. Its format is defined in the section on matching
a pattern in the pcreapi documentation. The study_data field points to
the binary study data, and this is what you must save (not the
pcre_extra block itself). The length of the study data can be obtained
by calling pcre_fullinfo() with an argument of PCRE_INFO_STUDYSIZE.
Remember to check that pcre_study() did return a non-NULL value before
trying to save the study data.
RE-USING A PRECOMPILED PATTERN
Re-using a precompiled pattern is straightforward. Having reloaded it
into main memory, you pass its pointer to pcre_exec() or
pcre_dfa_exec() in the usual way. This should work even on another
host, and even if that host has the opposite endianness to the one
where the pattern was compiled.
However, if you passed a pointer to custom character tables when the
pattern was compiled (the tableptr argument of pcre_compile()), you
must now pass a similar pointer to pcre_exec() or pcre_dfa_exec(),
because the value saved with the compiled pattern will obviously be
nonsense. A field in a pcre_extra block is used to pass this data, as
described in the section on matching a pattern in the pcreapi documen-
tation.
If you did not provide custom character tables when the pattern was
compiled, the pointer in the compiled pattern is NULL, which causes
pcre_exec() to use PCRE's internal tables. Thus, you do not need to
take any special action at run time in this case.
If you saved study data with the compiled pattern, you need to create
your own pcre_extra data block and set the study_data field to point to
the reloaded study data. You must also set the PCRE_EXTRA_STUDY_DATA
bit in the flags field to indicate that study data is present. Then
pass the pcre_extra block to pcre_exec() or pcre_dfa_exec() in the
usual way.
COMPATIBILITY WITH DIFFERENT PCRE RELEASES
In general, it is safest to recompile all saved patterns when you
update to a new PCRE release, though not all updates actually require
this. Recompiling is definitely needed for release 7.2.
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 13 June 2007
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------
PCREPERFORM(3) PCREPERFORM(3)
NAME
PCRE - Perl-compatible regular expressions
PCRE PERFORMANCE
Two aspects of performance are discussed below: memory usage and pro-
cessing time. The way you express your pattern as a regular expression
can affect both of them.
COMPILED PATTERN MEMORY USAGE
Patterns are compiled by PCRE into a reasonably efficient byte code, so
that most simple patterns do not use much memory. However, there is one
case where the memory usage of a compiled pattern can be unexpectedly
large. If a parenthesized subpattern has a quantifier with a minimum
greater than 1 and/or a limited maximum, the whole subpattern is
repeated in the compiled code. For example, the pattern
(abc|def){2,4}
is compiled as if it were
(abc|def)(abc|def)((abc|def)(abc|def)?)?
(Technical aside: It is done this way so that backtrack points within
each of the repetitions can be independently maintained.)
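
The shape of this expansion can be reproduced in textual form by a
short sketch. It only builds the equivalent pattern string for a group
with a {min,max} quantifier; it says nothing about PCRE's actual byte
code. The function names append() and expand_repeat() are invented for
this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append src to out, which has capacity cap and current length *len.
   Returns 0 on success and -1 if there is not enough room. */
static int append(char *out, size_t cap, size_t *len, const char *src)
{
    size_t n = strlen(src);

    if (n >= cap - *len) return -1;       /* keep room for the zero byte */
    memcpy(out + *len, src, n + 1);
    *len += n;
    return 0;
}

/* Write the textual equivalent of group{min,max} into out: min copies
   of the group, followed by max-min nested optional copies. */
static int expand_repeat(const char *group, int min, int max,
                         char *out, size_t cap)
{
    size_t len = 0;
    int i, optional;

    assert(group != NULL && out != NULL && cap > 0);
    assert(min >= 0 && max >= min && max > 0);
    out[0] = '\0';
    optional = max - min;

    for (i = 0; i < min; i++) {                 /* mandatory copies */
        if (append(out, cap, &len, group)) return -1;
    }
    for (i = 1; i < optional; i++) {            /* open nested optionals */
        if (append(out, cap, &len, "(") ||
            append(out, cap, &len, group)) return -1;
    }
    if (optional > 0) {                         /* innermost optional copy */
        if (append(out, cap, &len, group) ||
            append(out, cap, &len, "?")) return -1;
    }
    for (i = 1; i < optional; i++) {            /* close nested optionals */
        if (append(out, cap, &len, ")?")) return -1;
    }
    return 0;
}
```

Calling expand_repeat("(abc|def)", 2, 4, ...) yields the string shown
above. The length of the result grows with max, which is why large
repeat counts on a group lead to large compiled patterns.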
For regular expressions whose quantifiers use only small numbers, this
is not usually a problem. However, if the numbers are large, and par-
ticularly if such repetitions are nested, the memory usage can become
an embarrassment. For example, the very simple pattern
((ab){1,1000}c){1,3}
uses 51K bytes when compiled. When PCRE is compiled with its default
internal pointer size of two bytes, the size limit on a compiled pat-
tern is 64K, and this is reached with the above pattern if the outer
repetition is increased from 3 to 4. PCRE can be compiled to use larger
internal pointers and thus handle larger compiled patterns, but it is
better to try to rewrite your pattern to use less memory if you can.
One way of reducing the memory usage for such patterns is to make use
of PCRE's "subroutine" facility. Re-writing the above pattern as
((ab)(?2){0,999}c)(?1){0,2}
reduces the memory requirements to 18K, and indeed it remains under 20K
even with the outer repetition increased to 100. However, this pattern
is not exactly equivalent, because the "subroutine" calls are treated
as atomic groups into which there can be no backtracking if there is a
subsequent matching failure. Therefore, PCRE cannot do this kind of
rewriting automatically. Furthermore, there is a noticeable loss of
speed when executing the modified pattern. Nevertheless, if the atomic
grouping is not a problem and the loss of speed is acceptable, this
kind of rewriting will allow you to process patterns that PCRE cannot
otherwise handle.
STACK USAGE AT RUN TIME
When pcre_exec() is used for matching, certain kinds of pattern can
cause it to use large amounts of the process stack. In some environ-
ments the default process stack is quite small, and if it runs out the
result is often SIGSEGV. This issue is probably the most frequently
raised problem with PCRE. Rewriting your pattern can often help. The
pcrestack documentation discusses this issue in detail.
PROCESSING TIME
Certain items in regular expression patterns are processed more effi-
ciently than others. It is more efficient to use a character class like
[aeiou] than a set of single-character alternatives such as
(a|e|i|o|u). In general, the simplest construction that provides the
required behaviour is usually the most efficient. Jeffrey Friedl's book
contains a lot of useful general discussion about optimizing regular
expressions for efficient performance. This document contains a few
observations about PCRE.
Using Unicode character properties (the \p, \P, and \X escapes) is
slow, because PCRE has to scan a structure that contains data for over
fifteen thousand characters whenever it needs a character's property.
If you can find an alternative pattern that does not use character
properties, it will probably be faster.
When a pattern begins with .* not in parentheses, or in parentheses
that are not the subject of a backreference, and the PCRE_DOTALL option
is set, the pattern is implicitly anchored by PCRE, since it can match
only at the start of a subject string. However, if PCRE_DOTALL is not
set, PCRE cannot make this optimization, because the . metacharacter
does not then match a newline, and if the subject string contains new-
lines, the pattern may match from the character immediately following
one of them instead of from the very start. For example, the pattern
.*second
matches the subject "first\nand second" (where \n stands for a newline
character), with the match starting at the seventh character. In order
to do this, PCRE has to retry the match starting after every newline in
the subject.
If you are using such a pattern with subject strings that do not con-
tain newlines, the best performance is obtained by setting PCRE_DOTALL,
or starting the pattern with ^.* or ^.*? to indicate explicit anchor-
ing. That saves PCRE from having to scan along the subject looking for
a newline to restart at.
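
The restart positions described above can be sketched in plain C. This is an illustration only, not PCRE's internal code: the function name candidate_starts and its parameters are hypothetical. It lists the offsets at which a leading .* (without PCRE_DOTALL) could begin a match, namely the start of the subject and the character after each newline.

```c
#include <assert.h>
#include <stddef.h>

/* Record the offsets at which an unanchored leading .* (without
   PCRE_DOTALL) can begin a match: offset 0, and the offset just after
   each newline.  Stores at most 'max' offsets in 'starts' and returns
   the number stored.  (Hypothetical helper, for illustration only.) */
size_t candidate_starts(const char *subject, size_t *starts, size_t max)
{
    size_t i, count = 0;

    assert(subject != NULL && starts != NULL);
    if (max > 0)
        starts[count++] = 0;
    for (i = 0; subject[i] != '\0' && count < max; i++)
        if (subject[i] == '\n')
            starts[count++] = i + 1;
    return count;
}
```

For the subject "first\nand second" this yields offsets 0 and 6, the second being the seventh character mentioned above. A subject without newlines yields only offset 0, which is why explicit anchoring loses nothing in that case.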
Beware of patterns that contain nested indefinite repeats. These can
take a long time to run when applied to a string that does not match.
Consider the pattern fragment
^(a+)*
This can match "aaaa" in 16 different ways, and this number increases
very rapidly as the string gets longer. (The * repeat can match 0, 1,
2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
repeats can match different numbers of times.) When the remainder of
the pattern is such that the entire match is going to fail, PCRE has in
principle to try every possible variation, and this can take an
extremely long time, even for relatively short strings.
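
The figure of 16 can be checked with a small counting routine. The sketch below is not part of PCRE; count_ways and print_growth are hypothetical names. It counts every way the outer * and the inner + can divide up a leading run of "a" characters, including the cases where the outer repeat stops early.

```c
#include <assert.h>
#include <stdio.h>

/* Count the ways ^(a+)* can match at the start of a run of 'remaining'
   "a" characters.  Each way is a choice of how many characters every
   iteration of the inner a+ takes, with the outer * free to stop after
   any iteration. */
unsigned long count_ways(unsigned int remaining)
{
    unsigned long ways = 1;                    /* the outer * stops here */
    unsigned int take;

    for (take = 1; take <= remaining; take++)  /* a+ takes 'take' chars */
        ways += count_ways(remaining - take);
    return ways;
}

/* Print the growth for run lengths 0..max_len. */
void print_growth(unsigned int max_len)
{
    unsigned int n;

    assert(max_len < 25);   /* the work doubles with each extra character */
    for (n = 0; n <= max_len; n++)
        printf("%2u a's: %lu ways\n", n, count_ways(n));
}
```

The count doubles with every additional character (16 for four characters, 1024 for ten), which is the rapid increase referred to above.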
An optimization catches some of the more simple cases such as
(a+)*b
where a literal character follows. Before embarking on the standard
matching procedure, PCRE checks that there is a "b" later in the sub-
ject string, and if there is not, it fails the match immediately. How-
ever, when there is no following literal this optimization cannot be
used. You can see the difference by comparing the behaviour of
(a+)*\d
with the pattern above. The pattern with the literal "b" fails almost
instantly when applied to a whole line of "a" characters, whereas the
\d version takes an appreciable time with strings longer than about 20
characters.
In many cases, the solution to this kind of performance issue is to use
an atomic group or a possessive quantifier.
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 07 March 2010
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
PCREPOSIX(3) PCREPOSIX(3)
NAME
PCRE - Perl-compatible regular expressions.
SYNOPSIS OF POSIX API
#include <pcreposix.h>
int regcomp(regex_t *preg, const char *pattern,
int cflags);
int regexec(regex_t *preg, const char *string,
size_t nmatch, regmatch_t pmatch[], int eflags);
size_t regerror(int errcode, const regex_t *preg,
char *errbuf, size_t errbuf_size);
void regfree(regex_t *preg);
DESCRIPTION
This set of functions provides a POSIX-style API to the PCRE regular
expression package. See the pcreapi documentation for a description of
PCRE's native API, which contains much additional functionality.
The functions described here are just wrapper functions that ultimately
call the PCRE native API. Their prototypes are defined in the
pcreposix.h header file, and on Unix systems the library itself is
called pcreposix.a, so can be accessed by adding -lpcreposix to the
command for linking an application that uses them. Because the POSIX
functions call the native ones, it is also necessary to add -lpcre.
I have implemented only those POSIX option bits that can be reasonably
mapped to PCRE native options. In addition, the option REG_EXTENDED is
defined with the value zero. This has no effect, but since programs
that are written to the POSIX interface often use it, this makes it
easier to slot in PCRE as a replacement library. Other POSIX options
are not even defined.
There are also some other options that are not defined by POSIX. These
have been added at the request of users who want to make use of certain
PCRE-specific features via the POSIX calling interface.
When PCRE is called via these functions, it is only the API that is
POSIX-like in style. The syntax and semantics of the regular expres-
sions themselves are still those of Perl, subject to the setting of
various PCRE options, as described below. "POSIX-like in style" means
that the API approximates to the POSIX definition; it is not fully
POSIX-compatible, and in multi-byte encoding domains it is probably
even less compatible.
The header for these functions is supplied as pcreposix.h to avoid any
potential clash with other POSIX libraries. It can, of course, be
renamed or aliased as regex.h, which is the "correct" name. It provides
two structure types, regex_t for compiled internal forms, and reg-
match_t for returning captured substrings. It also defines some con-
stants whose names start with "REG_"; these are used for setting
options and identifying error codes.
COMPILING A PATTERN
The function regcomp() is called to compile a pattern into an internal
form. The pattern is a C string terminated by a binary zero, and is
passed in the argument pattern. The preg argument is a pointer to a
regex_t structure that is used as a base for storing information about
the compiled regular expression.
The argument cflags is either zero, or contains one or more of the bits
defined by the following macros:
REG_DOTALL
The PCRE_DOTALL option is set when the regular expression is passed for
compilation to the native function. Note that REG_DOTALL is not part of
the POSIX standard.
REG_ICASE
The PCRE_CASELESS option is set when the regular expression is passed
for compilation to the native function.
REG_NEWLINE
The PCRE_MULTILINE option is set when the regular expression is passed
for compilation to the native function. Note that this does not mimic
the defined POSIX behaviour for REG_NEWLINE (see the following sec-
tion).
REG_NOSUB
The PCRE_NO_AUTO_CAPTURE option is set when the regular expression is
passed for compilation to the native function. In addition, when a pat-
tern that is compiled with this flag is passed to regexec() for match-
ing, the nmatch and pmatch arguments are ignored, and no captured
strings are returned.
REG_UNGREEDY
The PCRE_UNGREEDY option is set when the regular expression is passed
for compilation to the native function. Note that REG_UNGREEDY is not
part of the POSIX standard.
REG_UTF8
The PCRE_UTF8 option is set when the regular expression is passed for
compilation to the native function. This causes the pattern itself and
all data strings used for matching it to be treated as UTF-8 strings.
Note that REG_UTF8 is not part of the POSIX standard.
In the absence of these flags, no options are passed to the native
function. This means that the regex is compiled with PCRE default
semantics. In particular, the way it handles newline characters in the
subject string is the Perl way, not the POSIX way. Note that setting
PCRE_MULTILINE has only some of the effects specified for REG_NEWLINE.
It does not affect the way newlines are matched by . (they are not) or
by a negative class such as [^a] (they are).
The yield of regcomp() is zero on success, and non-zero otherwise. The
preg structure is filled in on success, and one member of the structure
is public: re_nsub contains the number of capturing subpatterns in the
regular expression. Various error codes are defined in the header file.
NOTE: If the yield of regcomp() is non-zero, you must not attempt to
use the contents of the preg structure. If, for example, you pass it to
regexec(), the result is undefined and your program is likely to crash.
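
A minimal sketch of checking the yield of regcomp() before using preg follows. The helper name compile_checked is hypothetical, and so is the USE_PCREPOSIX build switch: it is not defined by PCRE and is used here only so that the same code builds against the system <regex.h> when PCRE is not installed. REG_EXTENDED is passed because the system header needs it; pcreposix.h defines it as zero.

```c
#include <assert.h>
#include <stddef.h>
#ifdef USE_PCREPOSIX          /* hypothetical build switch, not part of PCRE */
#include <pcreposix.h>
#else
#include <regex.h>            /* system POSIX header, same prototypes */
#endif

/* Compile 'pattern' into *preg.  Returns 0 on success.  On failure the
   error message is written to errbuf and *preg must not be passed to
   regexec(). */
int compile_checked(regex_t *preg, const char *pattern,
                    char *errbuf, size_t errbuf_size)
{
    int rc;

    assert(preg != NULL && pattern != NULL);
    rc = regcomp(preg, pattern, REG_EXTENDED);
    if (rc != 0 && errbuf != NULL && errbuf_size > 0)
        regerror(rc, preg, errbuf, errbuf_size);
    return rc;
}
```

A caller would test the return value, report errbuf on failure, and go on to regexec() and regfree() only when the result is zero.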
MATCHING NEWLINE CHARACTERS
This area is not simple, because POSIX and Perl take different views of
things. It is not possible to get PCRE to obey POSIX semantics, but
then PCRE was never intended to be a POSIX engine. The following table
lists the different possibilities for matching newline characters in
PCRE:
                          Default   Change with
. matches newline         no        PCRE_DOTALL
newline matches [^a]      yes       not changeable
$ matches \n at end       yes       PCRE_DOLLAR_ENDONLY
$ matches \n in middle    no        PCRE_MULTILINE
^ matches \n in middle    no        PCRE_MULTILINE
This is the equivalent table for POSIX:
                          Default   Change with
. matches newline         yes       REG_NEWLINE
newline matches [^a]      yes       REG_NEWLINE
$ matches \n at end       no        REG_NEWLINE
$ matches \n in middle    no        REG_NEWLINE
^ matches \n in middle    no        REG_NEWLINE
PCRE's behaviour is the same as Perl's, except that there is no equiva-
lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is
no way to stop newline from matching [^a].
The default POSIX newline handling can be obtained by setting
PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE
behave exactly as for the REG_NEWLINE action.
MATCHING A PATTERN
The function regexec() is called to match a compiled pattern preg
against a given string, which is by default terminated by a zero byte
(but see REG_STARTEND below), subject to the options in eflags. These
can be:
REG_NOTBOL
The PCRE_NOTBOL option is set when calling the underlying PCRE matching
function.
REG_NOTEMPTY
The PCRE_NOTEMPTY option is set when calling the underlying PCRE match-
ing function. Note that REG_NOTEMPTY is not part of the POSIX standard.
However, setting this option can give more POSIX-like behaviour in some
situations.
REG_NOTEOL
The PCRE_NOTEOL option is set when calling the underlying PCRE matching
function.
REG_STARTEND
The string is considered to start at string + pmatch[0].rm_so and to
have a terminating NUL located at string + pmatch[0].rm_eo (there need
not actually be a NUL at that location), regardless of the value of
nmatch. This is a BSD extension, compatible with but not specified by
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in
software intended to be portable to other systems. Note that a non-zero
rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
of the string, not how it is matched.
If the pattern was compiled with the REG_NOSUB flag, no data about any
matched strings is returned. The nmatch and pmatch arguments of
regexec() are ignored.
If the value of nmatch is zero, or if the value of pmatch is NULL, no
data about any matched strings is returned.
Otherwise, the portion of the string that was matched, and also any
captured substrings, are returned via the pmatch argument, which points to
an array of nmatch structures of type regmatch_t, containing the mem-
bers rm_so and rm_eo. These contain the offset to the first character
of each substring and the offset to the first character after the end
of each substring, respectively. The 0th element of the vector relates
to the entire portion of string that was matched; subsequent elements
relate to the capturing subpatterns of the regular expression. Unused
entries in the array have both structure members set to -1.
A successful match yields a zero return; various error codes are
defined in the header file, of which REG_NOMATCH is the "expected"
failure code.
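
The use of rm_so and rm_eo can be sketched as follows. The helper name print_captures and the MAX_GROUPS bound are hypothetical, as is the USE_PCREPOSIX build switch, which only selects between pcreposix.h and the system <regex.h> so that the sketch builds without PCRE installed. The patterns used with it should stay within syntax that both engines accept.

```c
#include <assert.h>
#include <stdio.h>
#ifdef USE_PCREPOSIX          /* hypothetical build switch, not part of PCRE */
#include <pcreposix.h>
#else
#include <regex.h>            /* system POSIX header, same prototypes */
#endif

#define MAX_GROUPS 10         /* arbitrary bound chosen for this sketch */

/* Match 'pattern' against 'subject' and print each substring that was
   set.  Returns the number of entries printed, 0 for REG_NOMATCH, and
   -1 for any other error. */
int print_captures(const char *pattern, const char *subject)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int rc, i, used = 0;

    assert(pattern != NULL && subject != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;                 /* re must not be used after a failure */
    rc = regexec(&re, subject, MAX_GROUPS, pm, 0);
    if (rc == 0) {
        for (i = 0; i <= (int)re.re_nsub && i < MAX_GROUPS; i++) {
            if (pm[i].rm_so == -1)
                continue;          /* unused entry: both members are -1 */
            printf("%d: offsets %d to %d \"%.*s\"\n", i,
                   (int)pm[i].rm_so, (int)pm[i].rm_eo,
                   (int)(pm[i].rm_eo - pm[i].rm_so),
                   subject + pm[i].rm_so);
            used++;
        }
    }
    regfree(&re);
    if (rc == 0)
        return used;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

Entry 0 is the whole match; rm_eo is the offset of the first character after the substring, so rm_eo minus rm_so is its length.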
ERROR MESSAGES
The regerror() function maps a non-zero errorcode from either regcomp()
or regexec() to a printable message. If preg is not NULL, the error
should have arisen from the use of that structure. A message terminated
by a binary zero is placed in errbuf. The length of the message,
including the zero, is limited to errbuf_size. The yield of the func-
tion is the size of buffer needed to hold the whole message.
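
Because the yield is the size needed for the whole message, one possible idiom is to call regerror() twice: once to obtain the size and once to fill a buffer of that size. The sketch below assumes that a call with a zero errbuf_size writes nothing and still returns the size, which is the behaviour POSIX specifies; the helper name error_message and the USE_PCREPOSIX build switch are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#ifdef USE_PCREPOSIX          /* hypothetical build switch, not part of PCRE */
#include <pcreposix.h>
#else
#include <regex.h>            /* system POSIX header, same prototypes */
#endif

/* Return a newly allocated copy of the message for 'errcode', or NULL
   if memory cannot be obtained.  The caller frees the result. */
char *error_message(int errcode, const regex_t *preg)
{
    size_t needed;
    char *buf;

    assert(errcode != 0);
    needed = regerror(errcode, preg, NULL, 0);   /* size, including zero */
    if (needed == 0)
        return NULL;
    buf = malloc(needed);
    if (buf != NULL)
        regerror(errcode, preg, buf, needed);
    return buf;
}
```

A fixed-size buffer is simpler when truncated messages are acceptable, since the message is always limited to errbuf_size.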
MEMORY USAGE
Compiling a regular expression causes memory to be allocated and asso-
ciated with the preg structure. The function regfree() frees all such
memory, after which preg may no longer be used as a compiled expres-
sion.
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 02 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
PCRECPP(3) PCRECPP(3)
NAME
PCRE - Perl-compatible regular expressions.
SYNOPSIS OF C++ WRAPPER
#include <pcrecpp.h>
DESCRIPTION
The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was con-
structed from the notes in the pcrecpp.h file, which should be con-
sulted for further details.
MATCHING INTERFACE
The "FullMatch" operation checks that supplied text matches a supplied
pattern exactly. If pointer arguments are supplied, it copies the
sub-strings that match sub-patterns into them.
Example: successful match
pcrecpp::RE re("h.*o");
re.FullMatch("hello");
Example: unsuccessful match (requires full match):
pcrecpp::RE re("e");
!re.FullMatch("hello");
Example: creating a temporary RE object:
pcrecpp::RE("h.*o").FullMatch("hello");
You can pass in a "const char*" or a "string" for "text". The examples
below tend to use a const char*. You can, as in the different examples
above, store the RE object explicitly in a variable or use a temporary
RE object. The examples below use one mode or the other arbitrarily.
Either could correctly be used for any of these examples.
You must supply extra pointer arguments to extract matched subpieces.
Example: extracts "ruby" into "s" and 1234 into "i"
int i;
string s;
pcrecpp::RE re("(\\w+):(\\d+)");
re.FullMatch("ruby:1234", &s, &i);
Example: does not try to extract any extra sub-patterns
re.FullMatch("ruby:1234", &s);
Example: does not try to extract into NULL
re.FullMatch("ruby:1234", NULL, &i);
Example: integer overflow causes failure
!re.FullMatch("ruby:1234567891234", NULL, &i);
Example: fails because there aren't enough sub-patterns:
!pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);
Example: fails because string cannot be stored in integer
!pcrecpp::RE("(.*)").FullMatch("ruby", &i);
The provided pointer arguments can be pointers to any scalar numeric
type, or one of:
string       (matched piece is copied to string)
StringPiece  (StringPiece is mutated to point to matched piece)
T            (where "bool T::ParseFrom(const char*, int)" exists)
NULL         (the corresponding matched sub-pattern is not copied)
The function returns true iff all of the following conditions are sat-
isfied:
a. "text" matches "pattern" exactly;
b. The number of matched sub-patterns is >= the number of supplied
   pointers;
c. The "i"th argument has a suitable type for holding the string
   captured as the "i"th sub-pattern. If you pass in void * NULL for
   the "i"th argument, or a non-void * NULL of the correct type, or
   pass fewer arguments than the number of sub-patterns, the "i"th
   captured sub-pattern is ignored.
CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):
int number;
pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
The matching interface supports at most 16 arguments per call. If you
need more, consider using the more general interface
pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.
NOTE: Do not use no_arg, which is used internally to mark the end of a
list of optional arguments, as a placeholder for missing arguments, as
this can lead to segfaults.
QUOTING METACHARACTERS
You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string,
used as a regular expression, will exactly match the original string.
Example:
string quoted = RE::QuoteMeta(unquoted);
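
The quoting rule can be sketched in plain C. This is an illustration of the idea, not the wrapper's source: the function name quote_meta is hypothetical, and leaving bytes with the top bit set unescaped (so that multi-byte characters are not split) is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of 'unquoted' with a backslash before
   every ASCII character other than a letter, digit or underscore.
   The caller frees the result; NULL means out of memory. */
char *quote_meta(const char *unquoted)
{
    size_t len, i, j = 0;
    char *quoted;

    assert(unquoted != NULL);
    len = strlen(unquoted);
    quoted = malloc(2 * len + 1);          /* worst case: all escaped */
    if (quoted == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)unquoted[i];
        if (c < 128 && !isalnum(c) && c != '_')
            quoted[j++] = '\\';
        quoted[j++] = (char)c;
    }
    quoted[j] = '\0';
    return quoted;
}
```

Escaping every such character, whether or not it is special, keeps the rule simple and avoids having to track which characters are metacharacters in which context.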
Note that it's legal to escape a character even if it has no special
meaning in a regular expression -- so this function does that. (This
also makes it identical to the perl function of the same name; see
"perldoc -f quotemeta".) For example, "1.5-2.0?" becomes
"1\.5\-2\.0\?".
PARTIAL MATCHES
You can use the "PartialMatch" operation when you want the pattern to
match any substring of the text.
Example: simple search for a string:
pcrecpp::RE("ell").PartialMatch("hello");
Example: find first number in a string:
int number;
pcrecpp::RE re("(\\d+)");
re.PartialMatch("x*100 + 20", &number);
assert(number == 100);
UTF-8 AND THE MATCHING INTERFACE
By default, pattern and text are plain text, one byte per character.
The UTF8 flag, passed to the constructor, causes both pattern and
string to be treated as UTF-8 text, still a byte stream but potentially
multiple bytes per character. In practice, the text is likelier to be
UTF-8 than the pattern, but the match returned may depend on the UTF8
flag, so always use it when matching UTF8 text. For example, "." will
match one byte normally but with UTF8 set may match up to three bytes
of a multi-byte character.
Example:
pcrecpp::RE_Options options;
options.set_utf8();
pcrecpp::RE re(utf8_pattern, options);
re.FullMatch(utf8_string);
Example: using the convenience function UTF8():
pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
re.FullMatch(utf8_string);
NOTE: The UTF8 flag is ignored if pcre was not configured with the
--enable-utf8 flag.
PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE
PCRE defines some modifiers to change the behavior of the regular
expression engine. The C++ wrapper defines an auxiliary class,
RE_Options, as a vehicle to pass such modifiers to a RE class. Cur-
rently, the following modifiers are supported:
modifier              description                Perl corresponding
PCRE_CASELESS         case insensitive match     /i
PCRE_MULTILINE        multiple lines match       /m
PCRE_DOTALL           dot matches newlines       /s
PCRE_DOLLAR_ENDONLY   $ matches only at end      N/A
PCRE_EXTRA            strict escape parsing      N/A
PCRE_EXTENDED         ignore whitespaces         /x
PCRE_UTF8             handles UTF8 chars         built-in
PCRE_UNGREEDY         reverses * and *?          N/A
PCRE_NO_AUTO_CAPTURE  disables capturing parens  N/A (*)
(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself. For example, (?:ab|cd) does not
capture, while (ab|cd) does.
For a full account of how each modifier works, please check the PCRE
API reference page.
For each modifier, there are two member functions whose name is made
out of the modifier in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by
bool caseless()
which returns true if the modifier is set, and
RE_Options & set_caseless(bool)
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
be accessed through the set_match_limit() and match_limit() member
functions. Setting match_limit to a non-zero value will limit the exe-
cution of pcre to keep it from doing bad things like blowing the stack
or taking an eternity to return a result. A value of 5000 is good
enough to stop stack blowup in a 2MB thread stack. Setting match_limit
to zero disables match limiting. Alternatively, you can call
set_match_limit_recursion(), which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION
to limit how much PCRE recurses. The match limit caps the number of
times PCRE calls its internal match() function; the recursion limit caps
the depth of internal recursion, and therefore the amount of stack that
is used.
Normally, to pass one or more modifiers to a RE class, you declare a
RE_Options object, set the appropriate options, and pass this object to
a RE constructor. Example:
RE_Options opt;
opt.set_caseless(true);
if (RE("HELLO", opt).PartialMatch("hello world")) ...
RE_Options has two constructors. The default constructor takes no
arguments and creates a set of flags that are off by default. The second
takes an option_flags parameter, which is there to facilitate transfer
of legacy code from C programs. This lets you do
RE(pattern,
RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
However, new code is better off doing
RE(pattern,
RE_Options().set_caseless(true).set_multiline(true))
.PartialMatch(str);
If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options class with the appropri-
ate modifier already set: CASELESS(), UTF8(), MULTILINE(), DOTALL(),
and EXTENDED().
If you need to set several options at once, and you don't want to go
through the pains of declaring a RE_Options object and setting several
options, you can do it on the fly by chaining several set_xxxxx()
member functions, since each of them returns a reference to its class
object. For example, to
pass PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one
statement, you may write:
RE(" ^ xyz \\s+ .* blah$",
RE_Options()
.set_caseless(true)
.set_extended(true)
.set_multiline(true)).PartialMatch(sometext);
SCANNING TEXT INCREMENTALLY
The "Consume" operation may be useful if you want to repeatedly match
regular expressions at the front of a string and skip over them as they
match. This requires use of the "StringPiece" type, which represents a
sub-range of a real string. Like RE, StringPiece is defined in the
pcrecpp namespace.
Example: read lines of the form "var = value" from a string.
string contents = ...; // Fill string somehow
pcrecpp::StringPiece input(contents); // Wrap in a StringPiece
string var;
int value;
pcrecpp::RE re("(\\w+) = (\\d+)\n");
while (re.Consume(&input, &var, &value)) {
...;
}
Each successful call to "Consume" will set "var/value", and also
advance "input" so it points past the matched text.
The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling
pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
PARSING HEX/OCTAL/C-RADIX NUMBERS
By default, if you pass a pointer to a numeric value, the corresponding
text is interpreted as a base-10 number. You can instead wrap the
pointer with a call to one of the operators Hex(), Octal(), or CRadix()
to interpret the text in another base. The CRadix operator interprets
C-style "0" (base-8) and "0x" (base-16) prefixes, but defaults to
base-10.
Example:
int a, b, c, d;
pcrecpp::RE re("(.*) (.*) (.*) (.*)");
re.FullMatch("100 40 0100 0x40",
pcrecpp::Octal(&a), pcrecpp::Hex(&b),
pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
will leave 64 in a, b, c, and d.
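
The radix conventions can be checked with the standard C function strtol(), which follows the same rules: a fixed base of 8 or 16, or base 0 to honour a C-style prefix. The sketch below is a stand-in for illustration, not the wrapper's own parsing code, and the helper name parse_radix is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse 'text' as strtol() does: base 8 and 16 fix the radix, base 0
   reads a C-style "0" or "0x" prefix and otherwise defaults to base 10.
   Returns 0 on success, -1 if 'text' is not entirely a number or is
   out of range for a long. */
int parse_radix(const char *text, int base, long *out)
{
    char *end;
    long value;

    assert(text != NULL && out != NULL);
    errno = 0;
    value = strtol(text, &end, base);
    if (end == text || *end != '\0' || errno == ERANGE)
        return -1;
    *out = value;
    return 0;
}
```

Called with ("100", 8), ("40", 16), ("0100", 0) and ("0x40", 0), it stores 64 each time, matching the four results in the example.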
REPLACING PARTS OF STRINGS
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be used to
insert text matching corresponding parenthesized group from the pat-
tern. \0 in "rewrite" refers to the entire matching text. For example:
string s = "yabba dabba doo";
pcrecpp::RE("b+").Replace("d", &s);
will leave "s" containing "yada dabba doo". The result is true if the
pattern matches and a replacement occurs, false otherwise.
GlobalReplace is like Replace except that it replaces all occurrences
of the pattern in the string with the rewrite. Replacements are not
subject to re-matching. For example:
string s = "yabba dabba doo";
pcrecpp::RE("b+").GlobalReplace("d", &s);
will leave "s" containing "yada dada doo". It returns the number of
replacements made.
Extract is like Replace, except that if the pattern matches, "rewrite"
is copied into "out" (an additional argument) with substitutions. The
non-matching portions of "text" are ignored. Returns true iff a match
occurred and the extraction happened successfully; if no match occurs,
the string is left unaffected.
AUTHOR
The C++ wrapper was contributed by Google Inc.
Copyright (c) 2007 Google Inc.
REVISION
Last updated: 17 March 2009
------------------------------------------------------------------------------
PCRESAMPLE(3) PCRESAMPLE(3)
NAME
PCRE - Perl-compatible regular expressions
PCRE SAMPLE PROGRAM
A simple, complete demonstration program, to get you started with using
PCRE, is supplied in the file pcredemo.c in the PCRE distribution. A
listing of this program is given in the pcredemo documentation. If you
do not have a copy of the PCRE distribution, you can save this listing
to re-create pcredemo.c.
The program compiles the regular expression that is its first argument,
and matches it against the subject string in its second argument. No
PCRE options are set, and default character tables are used. If match-
ing succeeds, the program outputs the portion of the subject that
matched, together with the contents of any captured substrings.
If the -g option is given on the command line, the program then goes on
to check for further matches of the same regular expression in the same
subject string. The logic is a little bit tricky because of the possi-
bility of matching an empty string. Comments in the code explain what
is going on.
If PCRE is installed in the standard include and library directories
for your operating system, you should be able to compile the demonstra-
tion program using this command:
gcc -o pcredemo pcredemo.c -lpcre
If PCRE is installed elsewhere, you may need to add additional options
to the command line. For example, on a Unix-like system that has PCRE
installed in /usr/local, you can compile the demonstration program
using a command like this:
gcc -o pcredemo -I/usr/local/include pcredemo.c \
-L/usr/local/lib -lpcre
Once you have compiled the demonstration program, you can run simple
tests like this:
./pcredemo 'cat|dog' 'the cat sat on the mat'
./pcredemo -g 'cat|dog' 'the dog sat on the cat'
Note that there is a much more comprehensive test program, called
pcretest, which supports many more facilities for testing regular
expressions and the PCRE library. The pcredemo program is provided as a
simple coding example.
When you try to run pcredemo when PCRE is not installed in the standard
library directory, you may get an error like this on some operating
systems (e.g. Solaris):
ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or
directory
This is caused by the way shared library support works on those sys-
tems. You need to add
-R/usr/local/lib
(for example) to the compile command to get round this problem.
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 30 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
PCRESTACK(3) PCRESTACK(3)
NAME
PCRE - Perl-compatible regular expressions
PCRE DISCUSSION OF STACK USAGE
When you call pcre_exec(), it makes use of an internal function called
match(). This calls itself recursively at branch points in the pattern,
in order to remember the state of the match so that it can back up and
try a different alternative if the first one fails. As matching pro-
ceeds deeper and deeper into the tree of possibilities, the recursion
depth increases.
Not all calls of match() increase the recursion depth; for an item such
as a* it may be called several times at the same level, after matching
different numbers of a's. Furthermore, in a number of cases where the
result of the recursive call would immediately be passed back as the
result of the current call (a "tail recursion"), the function is just
restarted instead.
The pcre_dfa_exec() function operates in an entirely different way, and
uses recursion only when there is a regular expression recursion or
subroutine call in the pattern. This includes the processing of asser-
tion and "once-only" subpatterns, which are handled like subroutine
calls. Normally, these are never very deep, and the limit on the com-
plexity of pcre_dfa_exec() is controlled by the amount of workspace it
is given. However, it is possible to write patterns with runaway infi-
nite recursions; such patterns will cause pcre_dfa_exec() to run out of
stack. At present, there is no protection against this.
The comments that follow do NOT apply to pcre_dfa_exec(); they are rel-
evant only for pcre_exec().
Reducing pcre_exec()'s stack usage
Each time that match() is actually called recursively, it uses memory
from the process stack. For certain kinds of pattern and data, very
large amounts of stack may be needed, despite the recognition of "tail
recursion". You can often reduce the amount of recursion, and there-
fore the amount of stack used, by modifying the pattern that is being
matched. Consider, for example, this pattern:
([^<]|<(?!inet))+
It matches from wherever it starts until it encounters "<inet" or the
end of the data, and is the kind of pattern that might be used when
processing an XML file. Each iteration of the outer parentheses matches
either one character that is not "<" or a "<" that is not followed by
"inet". However, each time a parenthesis is processed, a recursion
occurs, so this formulation uses a stack frame for each matched charac-
ter. For a long string, a lot of stack is required. Consider now this
rewritten pattern, which matches exactly the same strings:
([^<]++|<(?!inet))+
This uses very much less stack, because runs of characters that do not
contain "<" are "swallowed" in one item inside the parentheses. Recur-
sion happens only when a "<" character that is not followed by "inet"
is encountered (and we assume this is relatively rare). A possessive
quantifier is used to stop any backtracking into the runs of non-"<"
characters, but that is not related to stack usage.
This example shows that one way of avoiding stack problems when match-
ing long subject strings is to write repeated parenthesized subpatterns
to match more than one character whenever possible.
Compiling PCRE to use heap instead of stack for pcre_exec()
In environments where stack memory is constrained, you might want to
compile PCRE to use heap memory instead of stack for remembering back-
up points when pcre_exec() is running. This makes it run a lot more
slowly, however. Details of how to do this are given in the pcrebuild
documentation. When built in this way, instead of using the stack, PCRE
obtains and frees memory by calling the functions that are pointed to
by the pcre_stack_malloc and pcre_stack_free variables. By default,
these point to malloc() and free(), but you can replace the pointers to
cause PCRE to use your own functions. Since the block sizes are always
the same, and are always freed in reverse order, it may be possible to
implement customized memory handlers that are more efficient than the
standard functions.
Limiting pcre_exec()'s stack usage
You can set limits on the number of times that match() is called, both
in total and recursively. If a limit is exceeded, pcre_exec() returns
an error code. Setting suitable limits should prevent it from running
out of stack. The default values of the limits are very large, and
unlikely ever to operate. They can be changed when PCRE is built, and
they can also be set when pcre_exec() is called. For details of these
interfaces, see the pcrebuild documentation and the section on extra
data for pcre_exec() in the pcreapi documentation.
As a very rough rule of thumb, you should reckon on about 500 bytes per
recursion. Thus, if you want to limit your stack usage to 8Mb, you
should set the limit at 16000 recursions. A 64Mb stack, on the other
hand, can support around 128000 recursions.
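
The arithmetic behind these figures is a single division, sketched below. The function name recursion_limit is hypothetical, and the 500-byte figure is only the rough estimate given above; the real cost per recursion depends on the build and the platform.

```c
#include <assert.h>

/* Rule-of-thumb recursion limit for a given stack budget in bytes.
   bytes_per_recursion is the rough per-call cost of match(); the text
   suggests about 500. */
unsigned long recursion_limit(unsigned long stack_bytes,
                              unsigned long bytes_per_recursion)
{
    assert(bytes_per_recursion > 0);
    return stack_bytes / bytes_per_recursion;
}
```

With a budget of 8000000 bytes and 500 bytes per recursion the result is 16000; with 64000000 bytes it is 128000. Taking 8Mb as 8*1024*1024 bytes gives a slightly higher figure (16777), so the rounded values leave a small margin.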
In Unix-like environments, the pcretest test program has a command line
option (-S) that can be used to increase the size of its stack. As long
as the stack is large enough, another option (-M) can be used to find
the smallest limits that allow a particular pattern to match a given
subject string. This is done by calling pcre_exec() repeatedly with
different limits.
Changing stack size in Unix-like systems
In Unix-like environments, there is not often a problem with the stack
unless very long strings are involved, though the default limit on
stack size varies from system to system. Values from 8Mb to 64Mb are
common. You can find your default limit by running the command:
ulimit -s
Unfortunately, the effect of running out of stack is often SIGSEGV,
though sometimes a more explicit error message is given. You can nor-
mally increase the limit on stack size by code such as this:
struct rlimit rlim;
getrlimit(RLIMIT_STACK, &rlim);
rlim.rlim_cur = 100*1024*1024;
setrlimit(RLIMIT_STACK, &rlim);
This reads the current limits (soft and hard) using getrlimit(), then
attempts to increase the soft limit to 100Mb using setrlimit(). You
must do this before calling pcre_exec().
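
A slightly more defensive version of the same idea checks the return values and never lowers a limit that is already large enough. This is a sketch: the function name raise_stack_limit is hypothetical, and treating a request above the hard limit as a failure rather than clamping it is a design choice made here.

```c
#include <assert.h>
#include <sys/resource.h>

/* Try to raise the soft stack limit to at least 'wanted' bytes.
   Returns 0 if the soft limit is (now) at least 'wanted', -1 otherwise.
   An existing limit is never lowered. */
int raise_stack_limit(rlim_t wanted)
{
    struct rlimit rlim;

    assert(wanted > 0);
    if (getrlimit(RLIMIT_STACK, &rlim) != 0)
        return -1;
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted)
        return 0;                          /* already large enough */
    if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
        return -1;                         /* above the hard limit */
    rlim.rlim_cur = wanted;
    return setrlimit(RLIMIT_STACK, &rlim) == 0 ? 0 : -1;
}
```

A program would call it early, for example with 100*1024*1024, and fall back to tighter match limits if it returns -1.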
Changing stack size in Mac OS X
Using setrlimit(), as described above, should also work on Mac OS X. It
is also possible to set a stack size when linking a program. There is a
discussion about stack sizes in Mac OS X at this web site:
http://developer.apple.com/qa/qa2005/qa1419.html.
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 03 January 2010
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------