1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172
|
Some comparisons of PCRE with the original Henry Spencer (1986) regular
expression functions were done on a SPARCstation IPC using gcc version 2.6.3
with -O optimization, to give some idea as to how the two libraries compare.
This is not a major statistical investigation.
Code size
---------
The code size of PCRE is a bit over twice the size of the Henry Spencer
functions (roughly 33K vs 14K bytes on a SPARCstation with gcc -O).
Store size for compiled expressions
-----------------------------------
For expressions that are compatible with both libraries, PCRE uses less store
for the examples tried, except in some cases that involve the use of character
classes. Except in the special case of a negated charcter class containing only
one character (e.g. [^a]), PCRE uses a 32-byte bit map for each character
class, in order to get the maximum matching speed. By contrast the Spencer code
uses a strchr() call.
The Spencer functions have an overhead of 92 bytes per expression, because
there is a table for up to 10 matched substrings held with every compiled
expression. In contrast, PCRE's overhead is just 9 bytes, since it requires the
caller to supply a vector to receive the offsets of the matched substrings. In
the table below, the size without the overhead is shown in brackets.
PCRE Spencer Pattern
---- ------- -------
18 (09) 109 (17) /^$/
25 (16) 120 (28) /^.*nter/
26 (17) 121 (29) /^12.34/
37 (28) 126 (34) /the quick brown fox/
50 (41) 114 (22) /^[]cde]/
50 (41) 114 (22) /^[^]cde]/
51 (42) 125 (33) /^[.^$|()*+?{,}]+/
52 (43) 126 (34) /^[0-9]+$/
56 (47) 153 (61) /^(abc)(abc)?zz/
57 (48) 133 (41) /^xxx[0-9]+$/
57 (48) 145 (53) /([0-9a-fA-F:]+)$/
62 (53) 171 (79) /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$$/
70 (61) 170 (78) /^(b+|a)(b+|a)?c/
74 (65) 173 (81) /^(ba|b*)(ba|b*)?bc/
99 (90) 235 (143) /^(a(b(c)))(d(e(f)))(h(i(j)))$/
119 (110) 157 (65) /^.+[0-9][0-9][0-9]$/
165 (156) 446 (354) /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/
451 (442) 605 (513) /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/
Compilation time
----------------
Timing was done using the clock() function to time 2000 compilations of each
expression and then dividing by twice the number of clocks per second, to get a
value in milliseconds. The variation observed over several runs was never more
than 0.01:
PCRE Spencer Pattern
---- ------- -------
0.04 0.07 /^$/
0.06 0.12 /^.*nter/
0.06 0.13 /^12.34/
0.06 0.09 /^[]cde]/
0.07 0.14 /^[0-9]+$/
0.07 0.10 /^[^]cde]/
0.08 0.17 /^xxx[0-9]+$/
0.08 0.14 /the quick brown fox/
0.09 0.14 /^[.^$|()*+?{,}]+/
0.10 0.33 /([0-9a-fA-F:]+)$/
0.12 0.26 /^.+[0-9][0-9][0-9]$/
0.12 0.42 /^(abc)(abc)?zz/
0.14 0.51 /^(b+|a)(b+|a)?c/
0.15 0.53 /^(ba|b*)(ba|b*)?bc/
0.19 0.51 /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/
0.34 1.59 /^(a(b(c)))(d(e(f)))(h(i(j)))$/
0.47 1.32 /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/
0.66 1.78 /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/
Execution time
--------------
Execution timing was done in a similar manner. Blank entries in the "pattern"
column below indicate the use of the same pattern as before.
PCRE Spencer Subject Pattern
---- ------- ------- -------
0.03 0.02 <null string> /^$/
0.04 0.04 enter /^.*nter/
0.04 0.04 uponter
0.03 0.03 12\r34 /^12.34/
0.03 0.03 0 /^[0-9]+$/
0.04 0.03 100
0.03 0.03 ]thing /^[]cde]/
0.03 0.03 ething
0.03 0.03 athing /^[^]cde]/
0.04 0.04 xxx0 /^xxx[0-9]+$/
0.04 0.04 xxx1234
0.04 0.07 .^\$(*+)|{?,?} /^[.^$|()*+?{,}]+/
0.03 0.03 the quick brown fox /the quick brown fox/
0.06 0.08 What do you know about the quick brown fox?
0.04 0.07 0abc /([0-9a-fA-F:]+)$/
0.04 0.07 abc
0.05 0.13 5f03:12C0::932e
0.06 0.07 x123 /^.+[0-9][0-9][0-9]$/
0.06 0.07 123456
0.06 0.09 abczz /^(abc)(abc)?zz/
0.06 0.12 abcabczz /^(abc)(abc)?zz/
/^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/
0.23 0.28 abc!pqr=apquxz.ixr.zzz.ac.uk
0.09 0.15 bc /^(b+|a)(b+|a)?c/
0.09 0.15 bbc
0.08 0.15 bbbc
0.09 0.15 bac
0.09 0.15 bbac
0.07 0.14 aac
0.09 0.15 abbbbbbbbbbbc
0.09 0.15 bbbbbbbbbbbac
0.09 0.18 babc /^(ba|b*)(ba|b*)?bc/
0.12 0.24 bbabc
0.07 0.15 bababc
0.06 0.10 a. /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/
0.13 0.34 ab-c.pq-r.
0.24 0.58 sxk.zzz.ac.uk.
0.12 0.34 x-.y-.
0.20 0.38 abcdefhij /^(a(b(c)))(d(e(f)))(h(i(j)))$/
/^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/
0.18 0.30 From abcd Mon Sep 01 12:33:02 1997
In general, PCRE runs faster than the Spencer function, but remember, this
is just for one particular compiler on one set of hardware and operating
system. Until comprehensive tests have been run in other environments, the most
one can plausibly say is that it is probably no worse on average for the kinds
of expression tested here.
Speeding up matching
--------------------
A character class is much more efficient than a set of bracketed alternatives.
Matching /^[abc]{12}/ against "abcabcabcabc" took 0.03 ms, whereas
/^(a|b|c){12}/ took 0.33 ms. This is because brackets and alternatives involve
recursion.
Serious test
------------
One of the tests of PCRE is the monster regular expression from "Mastering
Regular Expressions" (O'Reilly's "hip owls" book, 1997, ISBN 1-56592-257-3)
which recognizes email addresses. There are two versions, unoptimized and
optimized. For interest, here are their timings, again on a SPARCstation IPC.
The compile times were 55 ms and 94 ms, and the compiled expressions
occupied 11010 and 15426 bytes of store, respectively. The following strings
were matched in the times shown (unoptimized first):
0.34 0.38 user@dom.ain
0.38 0.42 <user@dom.ain>
0.88 0.60 Alan Other <user@dom.ain>
1.87 0.82 "A. Other" <user.1234@dom.ain> (a comment)
1.77 1.19 A. Other <user.1234@dom.ain> (a comment)
2.21 0.42 "/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/"@x400-re.lay
The optimization of the expression clearly has a dramatic effect in some cases.
Philip Hazel <ph10@cus.cam.ac.uk>
October 1997
|