<h1>TRE Regexp Syntax</h1>
<p>
This document describes the POSIX 1003.2 extended RE (ERE) syntax and
the basic RE (BRE) syntax as implemented by TRE, and the TRE extensions
to the ERE syntax. A simple Extended Backus-Naur Form (EBNF) style
notation is used to describe the grammar.
</p>
<h2>ERE Syntax</h2>
<h3>Alternation operator</h3>
<a name="alternation"></a>
<a name="extended-regexp"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>extended-regexp</i> ::= <a href="#branch"><i>branch</i></a>
| <i>extended-regexp</i> <b>"|"</b> <a href="#branch"><i>branch</i></a>
</pre>
</td></tr>
</table>
<p>
An extended regexp (ERE) is one or more <i>branches</i>, separated by
<tt>|</tt>. An ERE matches anything that matches one or more of the
branches.
</p>
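<p>
As a minimal sketch of alternation, the helper below compiles an ERE with
the POSIX <code>regcomp()</code>/<code>regexec()</code> interface from the
system <code>regex.h</code>. The helper name <code>ere_search</code> is
hypothetical, and the system matcher stands in for TRE here on the
assumption that both handle this plain ERE the same way.
</p>

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the ERE `pattern` matches somewhere
   in `text`, 0 if it does not, and -1 if the pattern fails to compile. */
int ere_search(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

<p>
With this helper, <code>ere_search("cat|dog", "hotdog")</code> returns 1
because the second branch matches, while
<code>ere_search("cat|dog", "bird")</code> returns 0 because neither branch
does.
</p>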
<h3>Catenation of REs</h3>
<a name="catenation"></a>
<a name="branch"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>branch</i> ::= <i>piece</i>
| <i>branch</i> <i>piece</i>
</pre>
</td></tr>
</table>
<p>
A branch is one or more <i>pieces</i>, concatenated. It matches a string
that consists of a match for the first piece, followed by a match for
the second piece, and so on.
</p>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>piece</i> ::= <i>atom</i>
| <i>atom</i> <a href="#repeat-operator"><i>repeat-operator</i></a>
| <i>atom</i> <a href="#approx-settings"><i>approx-settings</i></a>
</pre>
</td></tr>
</table>
<p>
A piece is an <i>atom</i> possibly followed by a repeat operator or an
expression controlling approximate matching parameters for the <i>atom</i>.
</p>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>atom</i> ::= <b>"("</b> <i>extended-regexp</i> <b>")"</b>
| <a href="#bracket-expression"><i>bracket-expression</i></a>
| <b>"."</b>
| <a href="#assertion"><i>assertion</i></a>
| <a href="#literal"><i>literal</i></a>
| <a href="#backref"><i>back-reference</i></a>
| <b>"(?#"</b> <i>comment-text</i> <b>")"</b>
| <b>"(?"</b> <a href="#options"><i>options</i></a> <b>")"</b> <i>extended-regexp</i>
| <b>"(?"</b> <a href="#options"><i>options</i></a> <b>":"</b> <i>extended-regexp</i> <b>")"</b>
</pre>
</td></tr>
</table>
<p>
An atom is either an ERE enclosed in parentheses, a bracket
expression, a <tt>.</tt> (period), an assertion, a literal, a back
reference, a comment, or an ERE preceded by or enclosed together with a
set of options.
</p>
<p>
The dot (<tt>.</tt>) matches any single character.
If the <code>REG_NEWLINE</code> compilation flag (see <a
href="api.html">API manual</a>) is specified, the newline
character is not matched.
</p>
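<p>
The effect of <code>REG_NEWLINE</code> on the dot can be checked with a
small sketch like the following. It uses the POSIX <code>regex.h</code>
interface of the system library as a stand-in for TRE; the helper name
<code>dot_matches</code> and the fixed pattern <tt>a.b</tt> are
illustrative assumptions.
</p>

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the ERE "a.b" matches somewhere in
   `text` when compiled with the extra flags in `cflags`, 0 if it does
   not, and -1 if compilation fails. */
int dot_matches(const char *text, int cflags)
{
    regex_t re;
    int rc;

    assert(text != NULL);
    if (regcomp(&re, "a.b", REG_EXTENDED | REG_NOSUB | cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

<p>
Here <code>dot_matches("a\nb", 0)</code> returns 1, since the dot matches
the newline, whereas <code>dot_matches("a\nb", REG_NEWLINE)</code>
returns 0.
</p>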
<p>
The <i>comment-text</i> can contain any characters except for a closing parenthesis <tt>)</tt>. The text in the comment is
completely ignored by the regex parser and is used solely for readability purposes.
</p>
<h3>Repeat operators</h3>
<a name="repeat-operator"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>repeat-operator</i> ::= <b>"*"</b>
| <b>"+"</b>
| <b>"?"</b>
| <i>bound</i>
| <b>"*?"</b>
| <b>"+?"</b>
| <b>"??"</b>
| <i>bound</i> <b>"?"</b>
</pre>
</td></tr>
</table>
<p>
An atom followed by <tt>*</tt> matches a sequence of 0 or more matches
of the atom. <tt>+</tt> is similar to <tt>*</tt>, matching a sequence
of 1 or more matches of the atom. An atom followed by <tt>?</tt>
matches a sequence of 0 or 1 matches of the atom.
</p>
<p>
A <i>bound</i> is one of the following, where <i>m</i> and <i>n</i>
are unsigned decimal integers between <tt>0</tt> and
<tt>RE_DUP_MAX</tt>:
</p>
<ol>
<li><tt>{</tt><i>m</i><tt>,</tt><i>n</i><tt>}</tt></li>
<li><tt>{,</tt><i>n</i><tt>}</tt></li>
<li><tt>{</tt><i>m</i><tt>,}</tt></li>
<li><tt>{</tt><i>m</i><tt>}</tt></li>
<li><tt>{,}</tt></li>
</ol>
<p>
An atom followed by [1] matches a sequence of <i>m</i> through <i>n</i>
(inclusive) matches of the atom.
An atom followed by [2] matches a sequence of up to <i>n</i> matches
of the atom.
An atom followed by [3] matches a sequence of <i>m</i> or more matches
of the atom.
An atom followed by [4] matches a sequence of exactly <i>m</i> matches
of the atom.
An atom followed by [5] matches a sequence of zero or more matches of
the atom.
</p>
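<p>
To make the bound forms concrete, the following sketch reports how many
characters the first match covers. It relies on the POSIX
<code>regex.h</code> interface of the system library rather than on TRE
itself, and the helper name <code>ere_match_length</code> is hypothetical.
Only the forms <tt>{</tt><i>m</i><tt>,</tt><i>n</i><tt>}</tt>,
<tt>{</tt><i>m</i><tt>,}</tt> and <tt>{</tt><i>m</i><tt>}</tt> are
exercised, because the forms with an omitted <i>m</i> are not accepted by
every matcher.
</p>

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: length of the first match of the ERE `pattern`
   in `text`, or -1 if there is no match or the pattern does not compile. */
int ere_match_length(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int len = -1;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, &m, 0) == 0)
        len = (int)(m.rm_eo - m.rm_so);   /* end offset minus start offset */
    regfree(&re);
    return len;
}
```

<p>
Against the text <tt>aaaaa</tt>, the pattern <tt>a{2,3}</tt> yields a match
of length 3, <tt>a{2,}</tt> a match of length 5, and <tt>a{2}</tt> a match
of length 2. Against the text <tt>a</tt>, <tt>a{2,3}</tt> does not match
and the helper returns -1.
</p>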
<p>
Adding a <tt>?</tt> to a repeat operator makes the subexpression minimal, or
non-greedy. Normally a repeated expression is greedy, that is, it matches as
many characters as possible. A non-greedy subexpression matches as few
characters as possible. Note that this does not (always) mean the same thing
as matching as many or few repetitions as possible. Also note
that <strong>minimal repetitions are not currently supported for approximate
matching</strong>.
</p>
<h3>Approximate matching settings</h3>
<a name="approx-settings"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>approx-settings</i> ::= <b>"{"</b> <i>count-limits</i>* <b>","</b>? <i>cost-equation</i>? <b>"}"</b>
<i>count-limits</i> ::= <b>"+"</b> <i>number</i>?
| <b>"-"</b> <i>number</i>?
| <b>"#"</b> <i>number</i>?
| <b>"~"</b> <i>number</i>?
<i>cost-equation</i> ::= ( <i>cost-term</i> "+"? " "? )+ <b>"&lt;"</b> <i>number</i>
<i>cost-term</i> ::= <i>number</i> <b>"i"</b>
| <i>number</i> <b>"d"</b>
| <i>number</i> <b>"s"</b>
</pre>
</td></tr>
</table>
<p>
The approximate matching settings for a subpattern can be changed
by appending <i>approx-settings</i> to the subpattern. Limits for
the number of errors can be set and an expression for specifying and
limiting the costs can be given.
</p>
<p>
The <i>count-limits</i> can be used to set limits for the number of
insertions (<tt>+</tt>), deletions (<tt>-</tt>), substitutions
(<tt>#</tt>), and total number of errors (<tt>~</tt>). If the
<i>number</i> part is omitted, the specified error count will be
unlimited.
</p>
<p>
The <i>cost-equation</i> can be thought of as a mathematical equation,
where <tt>i</tt>, <tt>d</tt>, and <tt>s</tt> stand for the number of
insertions, deletions, and substitutions, respectively. The equation
can have a multiplier for each of <tt>i</tt>, <tt>d</tt>, and
<tt>s</tt>. The multiplier is the cost of the error, and the number
after <tt>&lt;</tt> is the maximum allowed cost of a match. Spaces
and pluses can be inserted to make the equation readable. In fact, when
specifying only a cost equation, adding a space after the opening <tt>{</tt>
is <strong>required</strong>.
</p>
<p>
Examples:
</p>
<dl>
<dt><tt>{~}</tt></dt>
<dd>Allows an unlimited number of errors.</dd>
<dt><tt>{~3}</tt></dt>
<dd>Sets the maximum number of errors to three.</dd>
<dt><tt>{+2~5}</tt></dt>
<dd>Sets the maximum number of errors to five, and the maximum number
of insertions to two.</dd>
<dt><tt>{&lt;3}</tt></dt>
<dd>Sets the maximum cost to three.</dd>
<dt><tt>{ 2i + 1d + 2s &lt; 5 }</tt></dt>
<dd>Sets the cost of an insertion to two, a deletion to one, a
substitution to two, and the maximum cost to five.</dd>
</dl>
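<p>
The arithmetic behind a cost equation can be sketched as follows. This is
not TRE's implementation; it only evaluates the inequality for given error
counts. The struct, the function names and the constant
<code>example_eq</code> are hypothetical. The sketch reads <tt>&lt;</tt>
literally as a strict inequality, which is an assumption: the text above
describes the number as the maximum allowed cost, so the behaviour at the
boundary should be checked against TRE before relying on it.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Per-error costs and the cost limit taken from a cost equation. */
struct cost_equation {
    int insert_cost;   /* multiplier of i */
    int delete_cost;   /* multiplier of d */
    int subst_cost;    /* multiplier of s */
    int max_cost;      /* the number after '<' */
};

/* The equation { 2i + 1d + 2s < 5 } from the examples above. */
const struct cost_equation example_eq = { 2, 1, 2, 5 };

/* Total cost of a candidate match with the given error counts. */
int match_cost(const struct cost_equation *eq, int ins, int del, int subst)
{
    assert(eq != NULL);
    return eq->insert_cost * ins
         + eq->delete_cost * del
         + eq->subst_cost * subst;
}

/* Returns 1 if the error counts satisfy the equation, read here as the
   strict inequality cost < max_cost (an assumption, see the lead-in). */
int within_cost(const struct cost_equation *eq, int ins, int del, int subst)
{
    return match_cost(eq, ins, del, subst) < eq->max_cost;
}
```

<p>
Under this reading, one insertion and two deletions cost 2 + 2 = 4 and
satisfy the example equation, while one insertion, one deletion and one
substitution cost 2 + 1 + 2 = 5 and do not.
</p>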
<h3>Bracket expressions</h3>
<a name="bracket-expression"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>bracket-expression</i> ::= <b>"["</b> <i>item</i>+ <b>"]"</b>
| <b>"[^"</b> <i>item</i>+ <b>"]"</b>
</pre>
</td></tr>
</table>
<p>
A bracket expression specifies a set of characters by enclosing a
nonempty list of items in brackets. Normally anything matching any
item in the list is matched. If the list begins with <tt>^</tt> the
meaning is negated; any character matching no item in the list is
matched.
</p>
<p>
An item is any of the following:
</p>
<ul>
<li>A single character, matching that character.</li>
<li>Two characters separated by <tt>-</tt>. This is shorthand for the
full range of characters between those two (inclusive) in the
collating sequence. For example, <tt>[0-9]</tt> in ASCII matches any
decimal digit.</li>
<li>A collating element enclosed in <tt>[.</tt> and <tt>.]</tt>,
matching the collating element. This can be used to include a literal
<tt>-</tt> or a multi-character collating element in the list.</li>
<li>A collating element enclosed in <tt>[=</tt> and <tt>=]</tt> (an
equivalence class), matching all collating elements with the same
primary collation weight as that element, including the element
itself.</li>
<li>The name of a character class enclosed in <tt>[:</tt> and
<tt>:]</tt>, matching any character belonging to the class. The set
of valid names depends on the <code>LC_CTYPE</code> category of the
current locale, but the following names are valid in all locales:
<ul>
<li><tt>alnum</tt> - alphanumeric characters</li>
<li><tt>alpha</tt> - alphabetic characters</li>
<li><tt>blank</tt> - blank characters</li>
<li><tt>cntrl</tt> - control characters</li>
<li><tt>digit</tt> - decimal digits (0 through 9)</li>
<li><tt>graph</tt> - all printable characters except space</li>
<li><tt>lower</tt> - lower-case letters</li>
<li><tt>print</tt> - printable characters including space</li>
<li><tt>punct</tt> - printable characters not space or alphanumeric</li>
<li><tt>space</tt> - white-space characters</li>
<li><tt>upper</tt> - upper-case letters</li>
<li><tt>xdigit</tt> - hexadecimal digits</li>
</ul>
</ul>
<p>
To include a literal <tt>-</tt> in the list, make it either the first
or last item, the second endpoint of a range, or enclose it in
<tt>[.</tt> and <tt>.]</tt> to make it a collating element. To
include a literal <tt>]</tt> in the list, make it either the first
item, the second endpoint of a range, or enclose it in <tt>[.</tt> and
<tt>.]</tt>. To use a literal <tt>-</tt> as the first
endpoint of a range, enclose it in <tt>[.</tt> and <tt>.]</tt>.
</p>
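<p>
A few bracket expressions can be tried out with a sketch like the one
below, which again uses the system's POSIX <code>regex.h</code> interface
as a stand-in for TRE. The helper name <code>bracket_search</code> is
hypothetical. Collating elements and equivalence classes are left out
because support for them varies between matchers and locales.
</p>

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the ERE `pattern` matches somewhere
   in `text`, 0 if it does not, and -1 if the pattern fails to compile. */
int bracket_search(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

<p>
For example, <tt>^[[:digit:]]+$</tt> matches <tt>2024</tt> but not
<tt>20x4</tt>, and <tt>^[^[:space:]]+$</tt> matches <tt>no_space</tt> but
not <tt>has space</tt>. The pattern <tt>^[]a-]+$</tt> places <tt>]</tt>
first and <tt>-</tt> last so that both are literal, and therefore matches
<tt>a-]</tt>.
</p>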
<h3>Assertions</h3>
<a name="assertion"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>assertion</i> ::= <b>"^"</b>
| <b>"$"</b>
| <b>"\"</b> <i>assertion-character</i>
</pre>
</td></tr>
</table>
<p>
The expressions <tt>^</tt> and <tt>$</tt> are called "left anchor" and
"right anchor", respectively. The left anchor matches the empty
string at the beginning of the string. The right anchor matches the
empty string at the end of the string. The behaviour of both anchors
can be varied by specifying certain execution and compilation flags;
see the <a href="api.html">API manual</a>.
</p>
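<p>
The sketch below illustrates how flags change anchor behaviour, using the
POSIX <code>regex.h</code> interface of the system library as a stand-in
for TRE. <code>REG_NEWLINE</code> is a compilation flag, while
<code>REG_NOTBOL</code> and <code>REG_NOTEOL</code> are the execution
flags defined by POSIX. The helper name <code>search_with_flags</code> is
hypothetical.
</p>

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles the ERE `pattern` with the extra
   compilation flags `cflags`, runs it on `text` with the execution flags
   `eflags`, and returns 1 on a match, 0 on no match, -1 on a bad pattern. */
int search_with_flags(const char *pattern, const char *text,
                      int cflags, int eflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, eflags);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

<p>
With no extra flags, <tt>^b$</tt> does not match the three-line text
<code>"a\nb\nc"</code>, because the anchors only match at the ends of the
whole string. With <code>REG_NEWLINE</code> the same pattern matches the
middle line. Passing <code>REG_NOTBOL</code> stops <tt>^a</tt> from
matching at the start of <tt>abc</tt>, and <code>REG_NOTEOL</code> does the
same for <tt>c$</tt> at its end.
</p>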
<p>
An assertion-character can be any of the following:
</p>
<ul>
<li><tt>&lt;</tt> - Beginning of word</li>
<li><tt>&gt;</tt> - End of word</li>
<li><tt>b</tt> - Word boundary</li>
<li><tt>B</tt> - Non-word boundary</li>
<li><tt>d</tt> - Digit character (equivalent to <tt>[[:digit:]]</tt>)</li>
<li><tt>D</tt> - Non-digit character (equivalent to <tt>[^[:digit:]]</tt>)</li>
<li><tt>s</tt> - Space character (equivalent to <tt>[[:space:]]</tt>)</li>
<li><tt>S</tt> - Non-space character (equivalent to <tt>[^[:space:]]</tt>)</li>
<li><tt>w</tt> - Word character (equivalent to <tt>[[:alnum:]_]</tt>)</li>
<li><tt>W</tt> - Non-word character (equivalent to <tt>[^[:alnum:]_]</tt>)</li>
</ul>
<h3>Literals</h3>
<a name="literal"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>literal</i> ::= <i>ordinary-character</i>
| <b>"\x"</b> [<b>"0"</b>-<b>"9"</b> <b>"a"</b>-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]{0,2}
| <b>"\x{"</b> [<b>"0"</b>-<b>"9"</b> <b>"a"</b>-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]* <b>"}"</b>
| <b>"\"</b> <i>character</i>
</pre>
</td></tr>
</table>
<p>
A literal is either an ordinary character (a character that has no
other significance in the context), an 8-bit hexadecimal-encoded
character (e.g. <tt>\x1B</tt>), a wide hexadecimal-encoded character
(e.g. <tt>\x{263a}</tt>), or an escaped character. An escaped
character is a <tt>\</tt> followed by any character, and matches that
character. Escaping can be used to match characters which have a
special meaning in regexp syntax. A <tt>\</tt> cannot be the last
character of an ERE. Escaping also allows you to include a few
non-printable characters in the regular expression. These special
escape sequences include:
</p>
<ul>
<li><tt>\a</tt> - Bell character (ASCII code 7)</li>
<li><tt>\e</tt> - Escape character (ASCII code 27)</li>
<li><tt>\f</tt> - Form-feed character (ASCII code 12)</li>
<li><tt>\n</tt> - New-line/line-feed character (ASCII code 10)</li>
<li><tt>\r</tt> - Carriage return character (ASCII code 13)</li>
<li><tt>\t</tt> - Horizontal tab character (ASCII code 9)</li>
</ul>
<p>
An ordinary character is just a single character with no other
significance, and matches that character. A <tt>{</tt> followed by
anything other than a digit is considered an ordinary character.
</p>
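<p>
When a pattern is written as a C string literal, each backslash of the
regexp has to be doubled. The following sketch shows escaping used to
match characters that are otherwise special. It uses the system's POSIX
<code>regex.h</code> interface as a stand-in for TRE, and the helper name
<code>literal_search</code> is hypothetical. The hexadecimal and
non-printable escapes listed above are not exercised, since they are not
part of POSIX.
</p>

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the ERE `pattern` matches somewhere
   in `text`, 0 if it does not, and -1 if the pattern fails to compile. */
int literal_search(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

<p>
The C literal <code>"a\\.b"</code> is the regexp <tt>a\.b</tt>, which
matches <tt>a.b</tt> but not <tt>axb</tt>, whereas the unescaped
<tt>a.b</tt> matches both. Likewise <code>"1\\+1"</code> matches the text
<tt>1+1</tt>. A pattern that ends in a lone backslash, such as
<code>"a\\"</code>, is rejected at compile time, so the helper returns -1.
</p>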
<h3>Back references</h3>
<a name="backref"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>back-reference</i> ::= <b>"\"</b> [<b>"1"</b>-<b>"9"</b>]
</pre>
</td></tr>
</table>
<p>
A back reference is a backslash followed by a single non-zero decimal
digit <i>d</i>. It matches the same sequence of characters
matched by the <i>d</i>th parenthesized subexpression.
</p>
<p>
Back references are not defined for POSIX EREs (for BREs they are),
but many matchers, including TRE, implement back references for both
EREs and BREs.
</p>
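<p>
A back reference can be demonstrated with the sketch below. It
deliberately uses the BRE syntax, where POSIX defines back references, and
the system's <code>regex.h</code> interface as a stand-in for TRE; in TRE
the same idea also applies to EREs, as described above. The helper name
<code>bre_search</code> is hypothetical.
</p>

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the BRE `pattern` matches somewhere
   in `text`, 0 if it does not, and -1 if the pattern fails to compile. */
int bre_search(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    /* Omitting REG_EXTENDED selects the basic (BRE) syntax. */
    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

<p>
The C literal <code>"^\\(ab*\\)\\1$"</code> is the BRE
<tt>^\(ab*\)\1$</tt>: whatever the first subexpression matched must occur
again immediately afterwards. It matches <tt>abab</tt> and
<tt>abbabb</tt>, but not <tt>ababb</tt>, because there the second half
differs from the first.
</p>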
<h3>Options</h3>
<a name="options"></a>
<table bgcolor="#e0e0f0" cellpadding="10">
<tr><td>
<pre>
<i>options</i> ::= [<b>"i" "n" "r" "U"</b>]* (<b>"-"</b> [<b>"i" "n" "r" "U"</b>]*)?
</pre>
</td></tr>
</table>
<p>
Options allow compile-time options to be turned on or off for particular parts of the
regular expression. The options correspond to several of the compile-time flags accepted by
the <code>regcomp</code> API function. An option specified in the first section is
turned on; an option specified in the second section (after the <tt>-</tt>) is
turned off.
</p>
<ul>
<li><tt>i</tt> - Case insensitive.</li>
<li><tt>n</tt> - Forces special handling of the newline character. See the <code>REG_NEWLINE</code> flag in
the <a href="tre-api.html">API Manual</a>.</li>
<li><tt>r</tt> - Causes the regex to be matched in a right associative manner rather than the normal
left associative manner.</li>
<li><tt>U</tt> - Forces repetition operators to be non-greedy unless a <tt>?</tt> is appended.</li>
</ul>
<h2>BRE Syntax</h2>
<p>
The obsolete basic regexp (BRE) syntax differs from the ERE syntax as
follows:
</p>
<ul>
<li><tt>|</tt> is an ordinary character, and there is no equivalent
for its functionality. <tt>+</tt> and <tt>?</tt> are also ordinary
characters.</li>
<li>The delimiters for bounds are <tt>\{</tt> and <tt>\}</tt>, with
<tt>{</tt> and <tt>}</tt> by themselves ordinary characters.</li>
<li>The parentheses for nested subexpressions are <tt>\(</tt> and
<tt>\)</tt>, with <tt>(</tt> and <tt>)</tt> by themselves ordinary
characters.</li>
<li><tt>^</tt> is an ordinary character except at the beginning of the
RE or the beginning of a parenthesized subexpression. Similarly,
<tt>$</tt> is an ordinary character except at the end of the
RE or the end of a parenthesized subexpression.</li>
</ul>
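<p>
The differences listed above can be observed by compiling the same pattern
once as a BRE and once as an ERE. The sketch below does this with the
system's POSIX <code>regex.h</code> interface as a stand-in for TRE; the
helper name <code>re_search</code> is hypothetical. Passing
<code>REG_EXTENDED</code> in <code>cflags</code> selects the ERE syntax,
and passing 0 selects the BRE syntax.
</p>

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles `pattern` with `cflags` (0 for BRE,
   REG_EXTENDED for ERE) and returns 1 if it matches somewhere in `text`,
   0 if it does not, and -1 if the pattern fails to compile. */
int re_search(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

<p>
As a BRE, <tt>a|b</tt> matches only the literal three-character text
<tt>a|b</tt>, while as an ERE it matches <tt>a</tt>. Similarly, the BRE
<tt>^a{2}$</tt> matches the literal text <tt>a{2}</tt>, and the bound has
to be written <tt>^a\{2\}$</tt> to match <tt>aa</tt>.
</p>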