1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226
|
Introduction
TRE is a lightweight, robust, and efficient POSIX compliant regexp
matching library with some exciting features such as approximate
matching.
At the core of TRE is a new algorithm for regular expression
matching with submatch addressing. The algorithm uses linear
worst-case time in the length of the text being searched, and
quadratic worst-case time in the length of the used regular
expression. In other words, the time complexity of the algorithm is
O(M2N), where M is the length of the regular expression and N is
the length of the text. The used space is also quadratic on the
length of the regex, but does not depend on the searched
string. This quadratic behaviour occurs only on pathological cases
which are probably very rare in practice.
Features
TRE is not just yet another regexp matcher. TRE has some features
which are not there in most free POSIX compatible implementations.
Most of these features are not present in non-free implementations
either, for that matter.
Approximate matching
Approximate pattern matching allows matches to be approximate, that
is, allows the matches to be close to the searched pattern under
some measure of closeness. TRE uses the edit-distance measure (also
known as the Levenshtein distance) where characters can be
inserted, deleted, or substituted in the searched text in order to
get an exact match. Each insertion, deletion, or substitution adds
the distance, or cost, of the match. TRE can report the matches
which have a cost lower than some given threshold value. TRE can
also be used to search for matches with the lowest cost.
TRE includes a version of the agrep command line tool for
approximate regexp matching in the style of grep. Unlike other
agrep implementations (like the one by Sun Wu and Udi Manber from
University of Arizona available) TRE agrep allows full regexps of
any length, any number of errors, and non-uniform costs for
insertion, deletion and substitution.
Strict standard conformance
POSIX defines the behaviour of regexp functions precisely. TRE
attempts to conform to these specifications as strictly as
possible. TRE always returns the correct matches for subpatterns,
for example. Very few other implementations do this correctly. In
fact, the only other implementations besides TRE that I am aware of
(free or not) that get it right are Rx by Tom Lord, Regex++ by John
Maddock, and the AT&T ast regex by Glenn Fowler and Doug McIlroy.
The standard TRE tries to conform to is the IEEE Std 1003.1-2001,
or Open Group Base Specifications Issue 6, commonly referred to as
"POSIX". The relevant parts are the base specifications on regular
expressions (and the rationale) and the description of the
regcomp() API.
For an excellent survey on POSIX regexp matchers, see the testregex
pages by Glenn Fowler of AT&T Labs Research.
Predictable matching speed
Because of the novel matching algorithm used in TRE, the maximum
time consumed by any regexec() call is always directly proportional
to the length of the searched string. There is one exception: if
back references are used, the matching may take time that grows
exponentially with the length of the string. This is because
matching back references is an NP complete problem, and almost
certainly requires exponential time to match in the worst case.
Predictable and modest memory consumption
A regexec() call never allocates memory from the heap. TRE
allocates all the memory it needs during a regcomp() call, and some
temporary working space from the stack frame for the duration of
the regexec() call. The amount of temporary space needed is
constant during matching and does not depend on the searched
string. For regexps of reasonable size TRE needs less than 50K of
dynamically allocated memory during the regcomp() call, less than
20K for the compiled pattern buffer, and less than two kilobytes of
temporary working space from the stack frame during a regexec()
call. There is no time/memory tradeoff. TRE is also small in code
size; statically linking with TRE increases the executable size
less than 30K (gcc-3.2, x86, GNU/Linux).
Wide character and multibyte character set support
TRE supports multibyte character sets. This makes it possible to
use regexps seamlessly with, for example, Japanese locales. TRE
also provides a wide character API.
Binary pattern and data support
TRE provides APIs which allow binary zero characters both in
regexps and searched strings. The standard API cannot be easily
used to, for example, search for printable words from binary data
(although it is possible with some hacking). Searching for
patterns which contain binary zeroes embedded is not possible at
all with the standard API.
Completely thread safe
TRE is completely thread safe. All the exported functions are
re-entrant, and a single compiled regexp object can be used
simultaneously in multiple contexts; e.g. in main() and a signal
handler, or in many threads of a multithreaded application.
Portable
TRE is portable across multiple platforms. Here's a table of
platforms and compilers that have been successfully used to compile
and run TRE:
Platform(s) | Compiler(s)
----------------------------------+------------
Various GNU/Linux systems | GCC 2.95.3, GCC 3.2.1
Solaris 2.8 sparc | GCC 3.2.1, Sun Workshop 6 compilers
Solaris 2.8 x86 | Sun Workshop 6 compilers
AIX 4.3.2 | C for AIX compiler version 5, GCC 3.2.1
Windows 2000, Windows 98 | Microsoft Visual C++ 6.0
Cygwin 1.3.20-1 (under Windows 98)| GCC 3.2.1
Compaq Tru64 UNIX V5.1A | Compaq C V6.4-014
Compaq Tru64 UNIX V5.1B | Compaq C V6.5-011
Digital UNIX V4.0 | DEC C V5.9-005
HP-UX 10.20 | HP C Compiler A.10.32.30
HP-UX 11 | HP C Compiler A.11.00.00
IRIX 6.5 | MIPSpro Compilers 7.3.1.3m, GCC 3.0.4
NetBSD 1.5.2 | egcs-1.1.2
TRE 0.5.3 should compile without changes on all of the above
platforms. Tell me if you are using TRE on a platform that is
not listed above, and I'll add it to the list.
Free
TRE is free; you can redistribute it and/or modify it under the
terms of the GNU General Public License version 2 (June 1991) as
published by the Free Software Foundation. See the file COPYING,
included in the distribution packages, for the complete license.
If you cannot accept the GNU GPL, it may be possible that you can
persuade me to give or sell you a version which is licensed
differently. Please contact me for more information.
Roadmap
There are currently two features, both related to collating
elements, missing from 100% POSIX compliance. These are:
* Support for collating elements (e.g. [[.<X>.]], where <X> is a
collating element). It is not possible to support
multi-character collating elements portably, since POSIX does
not define a way to determine whether a character sequence is a
multi-character collating element or not.
* Support for equivalence classes, for example [[=<X>=]], where
<X> is a collating element. An equivalence class matches any
character which has the same primary collation weight as
<X>. Again, POSIX provides no portable mechanism for
determining the primary collation weight of a collating
element.
Note that other portable regexp implementations don't support
collating elements either. The single exception is Regex++, which
comes with its own database for collating elements for different
locales.
These are other features I'm planning to implement:
* All the missing GNU extensions enabled in GNU regex, such as
[[: and [[:>:]]
* New syntax for approximate matching to control the approximate
matching parameters (deletion/insertion/substitution costs,
maximum cost) in different parts of the regexp.
* A REG_SHORTEST regexec() flag for returning the shortest match
instead of the longest match.
* A REG_NONGREEDY regcomp() flag for forcing all repetition
operators to be non-greedy regardless of whether a ? is
present.
* Perl-compatible syntax
[:^]class:]
Matches anything but the characters in class
\A
Match only at beginning of string
\Z
Match only at end of string, or before newline at the end
\z
Match only at end of string
\l
Lowercase next char (think vi)
\u
Uppercase next char (think vi)
\L
Lowercase till \E (think vi)
\U
Uppercase till \E (think vi)
\E
End case modification (think vi)
\Q
Quote (disable) pattern metacharacters till \E
* The REG_NOSPEC regcomp() flag (also known as REG_LITERAL) which
forces all characters to be considered as ordinary. Useful for
matching literal strings.
Documentation especially for the nonstandard features of TRE, such
as approximate matching, is a work in progress.
Ville Laurikari <vl@iki.fi>
|