File: regex-doc.m2

package info (click to toggle)
macaulay2 1.25.05%2Bds-2
  • links: PTS, VCS
  • area: main
  • in suites: experimental
  • size: 172,152 kB
  • sloc: cpp: 107,824; ansic: 16,193; javascript: 4,189; makefile: 3,899; lisp: 702; yacc: 604; sh: 476; xml: 177; perl: 114; lex: 65; python: 33
file content (345 lines) | stat: -rw-r--r-- 16,846 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
--- status: Rewritten July 2020
--- author(s): Dan, Mahrud
--- notes: functions below are all defined in regex.m2

doc ///
  Key
     regex
    (regex, String,         String)
    (regex, String, ZZ,     String)
    (regex, String, ZZ, ZZ, String)
    [regex, POSIX]
    POSIX
  Headline
    evaluate a regular expression search
  Usage
    regex(re, str)
    regex(re, start, str)
    regex(re, start, range, str)
  Inputs
    re:String
      a @TO2 {"regular expressions", "regular expression"}@ describing a pattern
    start:ZZ
      positive, the position in @TT "str"@ at which to begin the search.
      when omitted, the search starts at the beginning of the string.
    range:ZZ
      restricts matches to those beginning at a position between @TT "start"@ and @TT "start + range"@;
      when 0, the pattern is matched only at the starting position;
      when negative, only positions to the left of the starting position are examined for matches;
      when omitted, the search extends to the end of the string.
    str:String
      the subject string to be searched
    POSIX=>Boolean
      if true, interpret the @TT "re"@ using the POSIX Extended flavor, otherwise the Perl flavor
  Outputs
    :List
      a list of pairs of integers; each pair denotes the beginning position and the length of a substring.
      Only the leftmost matching substring of @TT "str"@ and the capturing groups within it are returned.
      If no match is found, the output is @TO "null"@.
  Description
    Text
      The value returned is a list of pairs of integers corresponding to the parenthesized subexpressions
      successfully matched, suitable for use as the first argument of @TO "substring"@. The first member
      of each pair is the offset within @TT "str"@ of the substring matched, and the second is the length.

      See @TO "regular expressions"@ for a brief introduction to the topic.
    Example
      s = "The cat is black.";
      m = regex("(\\w+) (\\w+) (\\w+)",s)
      substring(m#0, s)
      substring(m#1, s)
      substring(m#2, s)
      substring(m#3, s)

      s = "aa     aaaa";
      m = regex("a+", 0, s)
      substring(m#0, s)
      m = regex("a+", 2, s)
      substring(m#0, s)
      m = regex("a+", 2, 3, s)

      s = "line 1\nline 2\r\nline 3";
      m = regex("^.*$", 8, -8, s)
      substring(m#0, s)
      m = regex("^", 10, -10, s)
      substring(0, m#0#0, s)
      substring(m#0#0, s)
      m = regex("^.*$", 4, -10, s)
      substring(m#0, s)
      m = regex("a.*$", 4, -10, s)

    Text
      By default, the regular expressions are interpreted using the Perl flavor, which
      supports features such as lookaheads and lookbehinds for fine-tuning the matches.
      This syntax is used in Perl and JavaScript languages.
    Example
      regex("A(?!C)", "AC AB")
      regex("A(?=B)", "AC AB")

    Text
      Alternatively, one can choose the POSIX Extended flavor of regex using @TT "POSIX => true"@.
      This syntax is similar to the one used by the Unix utilities @TT "egrep"@ and @TT "awk"@ and
      enforces the @BOLD "leftmost, longest"@ rule for finding matches. If there's a tie, the rule
      is applied to the first subexpression.
    Example
      s = "<b>bold</b> and <b>strong</b>";
      m = regex("<b>(.*)</b>", s, POSIX => true);
      substring(m#1, s)

    Text
      In the Perl flavor, one can specify whether repetitions should be possessive or non-greedy.
    Example
      m = regex("<b>(.*?)</b>", s);
      substring(m#1, s)
  SeeAlso
    "regular expressions"
    "strings and nets"
    match
    separate
    (replace, String, String, String)
    (select, String, String, String)
    regexQuote
    substring
///

doc ///
  Key
    "regular expressions"
  Headline
    syntax for regular expressions
  Description
    Text

      A regular expression is a string that specifies a pattern that describes a set of matching subject strings.
      Typically the string is compiled into a deterministic finite automaton whose execution, guided by the
      subject string, determines whether there is a match.

      Characters match themselves, except for the following special characters.

      @PRE {CODE {".  [  {  }  (  )  \\  *  +  ?  |  ^  $"}}@

      Regular expressions are constructed inductively as follows: the empty regular expression matches the
      empty string; a concatenation of regular expressions matches the concatenation of the corresponding
      matching strings.  Regular expressions separated by the character @TT "|"@ match strings matched by any
      of them.  Parentheses can be used for grouping, except that now, with the use of the @ TO "Boost" @
      library, their insertion may alter the matching of subexpressions of ambiguous expressions.
      Additionally, the substrings matched by parenthesized subexpressions are captured for later use in
      replacement strings.

    Tree
      :Syntax for special characters
        :Wildcard
          @TT "."@	-- match any character except the newline character
        :Anchors
          @TT "^"@	-- match the beginning of the string or the beginning of a line
          @TT "$"@	-- match the end of the string or the end of a line
        :Sub-expressions
          @TT "(...)"@	-- marked sub-expression, may be referred to by a back-reference
          @TT "\\i"@	-- match the same string that the i-th parenthesized sub-expression matched
        :Repeats
          @TT "*"@	-- match previous expression 0 or more times
          @TT "+"@	-- match previous expression 1 or more times
          @TT "?"@	-- match previous expression 1 or 0 times
          @TT "{m}"@	-- match previous expression exactly m times
          @TT "{m,n}"@	-- match previous expression at least m and at most n times
          @TT "{,n}"@	-- match previous expression at most n times
          @TT "{m,}"@	-- match previous expression at least m times
        :Alternation
          @TT "|"@	-- match expression to left or expression to right
        :Word and buffer boundaries
          @TT "\\b"@	-- match word boundary
          @TT "\\B"@	-- match within word
          @TT "\\<"@	-- match beginning of word
          @TT "\\>"@	-- match end of word
          @TT "\\`"@	-- match beginning of string
          @TT "\\'"@	-- match end of string
        :Character sets
          @TT "[...]"@	-- match any single character that is a member of the set
           @TT "[abc]"@	-- match either @TT "a"@, @TT "b"@, or @TT "c"@
           @TT "[A-C]"@	-- match any character from @TT "A"@ through @TT "C"@
          @TT "[^...]"@	-- match non-listed characters, ranges, or classes
        :Character classes
          @TT "[:alnum:]"@	-- any alphanumeric character
          @TT "[:alpha:]"@	-- any alphabetic character
          @TT "[:blank:]"@	-- any whitespace or tab character
          @TT "[:cntrl:]"@	-- any control character
          @TT "[:digit:]"@	-- any decimal digit
          @TT "[:graph:]"@	-- any graphical character (same as [:print:] except omits space)
          @TT "[:lower:]"@	-- any lowercase character
          @TT "[:print:]"@	-- any printable character
          @TT "[:punct:]"@	-- any punctuation character
          @TT "[:space:]"@	-- any whitespace, tab, carriage return, newline, vertical tab, and form feed
          @TT "[:unicode:]"@	-- any unicode character with code point above 255 in value
          @TT "[:upper:]"@	-- any uppercase character
          @TT "[:word:]"@	-- any word character (alphanumeric characters plus the underscore
          @TT "[:xdigit:]"@	-- any hexadecimal digit character
        :"Single character" character classes
          @TT "\\d"@	-- same as @TT "[[:digit:]]"@
          @TT "\\l"@	-- same as @TT "[[:lower:]]"@
          @TT "\\s"@	-- same as @TT "[[:space:]]"@
          @TT "\\u"@	-- same as @TT "[[:upper:]]"@
          @TT "\\w"@	-- same as @TT "[[:word:]]"@
          @TT "\\D"@	-- same as @TT "[^[:digit:]]"@
          @TT "\\L"@	-- same as @TT "[^[:lower:]]"@
          @TT "\\S"@	-- same as @TT "[^[:space:]]"@
          @TT "\\U"@	-- same as @TT "[^[:upper:]]"@
          @TT "\\W"@	-- same as @TT "[^[:word:]]"@

    Text
      The special character @TT "\\"@ may be confusing, as inside a string delimited by quotation marks
      (@TT ////"..."////@), you type two of them to specify a special character, whereas inside a string
      delimited by triple slashes (@TT "////...////"@), you only need one. Thus regular expressions delimited
      by triple slashes are more readable. To match @TT "\\"@ against itself, double the number of backslashes.

      In order to match one of the special characters itself, precede it with a backslash or use @TO regexQuote@.

    Text
      @HEADER2 "Flavors of Regular Expressions"@

      The regular expression functions in Macaulay2 are powered by calls to the
      @HREF {"https://www.boost.org/doc/libs/release/libs/regex/", "Boost.Regex"}@
      C++ library, which supports multiple flavors, or standards, of regular expression.

      Since Macaulay2 @TO2{"changes, 1.17", "v1.17"}@, the Perl flavor is the default. In general, the Perl flavor
      supports all patterns designed for the POSIX Extended flavor, but allows for more fine-tuning in the patterns.
      Alternatively, the POSIX Extended flavor can be chosen by passing the option @TT "POSIX => true"@.
      One key difference is what happens when there is more that one way to match a regular expression:

      @UL {
         {BOLD "Perl", " -- the \"first\" match is arrived at by a depth-first search."},
         {BOLD "POSIX", " -- the \"best\" match is obtained using the \"leftmost-longest\" rule;"},
         }@

      If there's a tie in the POSIX flavor, the rule is applied to the first parenthetical subexpression.

      @HEADER2 "Additional Perl Regular Expression Syntax"@

      The Perl flavor adds the following, non-backward compatible constructions:

    Tree
      :Non-marking grouping; i.e., a grouping that does not generate a sub-expression
        @TT "(?#...)"@	-- ignored and treated as a comment
        @TT "(?:...)"@	-- non-marked sub-expression, may not be referred to by a back-reference
        @TT "(?=...)"@	-- positive lookahead; consumes zero characters, only if pattern matches
        @TT "(?!...)"@	-- negative lookahead; consumes zero characters, only if pattern does not match
        @TT "(?<=..)"@	-- positive lookbehind; consumes zero characters, only if pattern could be matched against the characters preceding the current position (pattern must be of fixed length)
        @TT "(?<!..)"@	-- negative lookbehind; consumes zero characters, only if pattern could not be matched against the characters preceding the current position (pattern must be of fixed length)
        @TT "(?>...)"@	-- match independently of the surrounding pattern and the expression will never backtrack into the pattern
      :Non-greedy repeats
        @TT "*?"@	-- match the previous atom 0 or more times, while consuming as little input as possible
        @TT "+?"@	-- match the previous atom 1 or more times, while consuming as little input as possible
        @TT "??"@	-- match the previous atom 1 or 0 times, while consuming as little input as possible
        @TT "{m,}?"@	-- match the previous atom m or more times, while consuming as little input as possible
        @TT "{m,n}?"@	-- match the previous atom at between m and n times, while consuming as little input as possible
      :Possessive repeats
        @TT "*+"@	-- match the previous atom 0 or more times, while giving nothing back
        @TT "++"@	-- match the previous atom 1 or more times, while giving nothing back
        @TT "?+"@	-- match the previous atom 1 or 0 times, while giving nothing back
        @TT "{m,}+"@	-- match the previous atom m or more times, while giving nothing back
        @TT "{m,n}+"@	-- match the previous atom at between m and n times, while giving nothing back
      :Back references
        @TT "\\g1"@	-- match whatever matched sub-expression 1
        @TT "\\g{1}"@	-- match whatever matched sub-expression 1
        @TT "\\g-1"@	-- match whatever matched the last opened sub-expression
        @TT "\\g{-2}"@	-- match whatever matched the last but one opened sub-expression
        @TT "\\g{that}"@	-- match whatever matched the sub-expression named "that"
        @TT "\\k<that>"@	-- match whatever matched the sub-expression named "that"
        @TT "(?<NAME>...)"@	-- named sub-expression, may be referred to by a named back-reference
        @TT "(?'NAME'...)"@	-- named sub-expression, may be referred to by a named back-reference
    Text
      See references below for more in depth syntax for controlling the backtracking algorithm.

    Text
      @HEADER2 "String Formatting Syntax"@

      The replacement string in @TO (replace, String, String, String)@ and @TO (select, String, String, String)@
      supports additional syntax for escape sequences as well as inserting captured sub-expressions:

    Tree
     :Using Perl regular expression syntax (default)
      :Syntax for inserting captured sub-expressions:
        @TT "$&"@	-- outputs what matched the whole expression
        @TT "$`"@	-- outputs the text between the end of the last match (or beginning if no previous match was found) and the start of the current match
        @TT "$'"@	-- outputs the text following the end of the current match
        @TT "$+"@	-- outputs what matched the last marked sub-expression in the regular expression
        @TT "$^N"@	-- outputs what matched the last sub-expression to be actually matched
        @TT "$n"@	-- outputs what matched the n-th sub-expression
        @TT "${n}"@	-- outputs what matched the n-th sub-expression
        @TT "$+{NAME}"@	-- outputs what matched the sub-expression named @TT "NAME"@ (Perl syntax only)
        @TT "$$"@	-- outputs a literal @TT "$"@
      :Syntax for manipulating the captured groups:
        @TT "\\l"@	-- converts the next character to be outputted to lower case
        @TT "\\u"@	-- converts the next character to be outputted to upper case
        @TT "\\L"@	-- converts all subsequent characters to be outputted to lower case, until it reaches @TT "\\E"@
        @TT "\\U"@	-- converts all subsequent characters to be outputted to upper case, until it reaches @TT "\\E"@
        @TT "\\E"@	-- terminates a @TT "\\U"@ or @TT "\\L"@ sequence
        @TT "\\"@	-- specifies an escape sequence (e.g. @TT "\\\\"@)
     :Using POSIX Extended syntax (@TT "POSIX => true"@)
      :Syntax for inserting captured groups:
        @TT "&"@	-- outputs what matched the whole expression
        @TT "\\0"@	-- outputs what matched the whole expression
        @TT "\\n"@	-- if @TT "n"@ is in the range 1-9, outputs what matched the n-th sub-expression
        @TT "\\"@	-- specifies an escape sequence (e.g. @TT "\\&"@)

    Text
      For the complete list, including characters escape sequences, see the Boost.Regex manual on
      @HREF {"https://www.boost.org/doc/libs/release/libs/regex/doc/html/boost_regex/format/perl_format.html", "format string syntax"}@.

    Tree
     :String processing functions that accept regular expressions
       match
       regex
       separate
       (select, String, String, String)
       (replace, String, String, String)
     :Other functions that accept regular expressions
       about
       apropos
       findFiles
       copyDirectory
       symlinkDirectory

    Text
      @HEADER2 "Complete References"@

      For complete documentation on regular expressions supported in Macaulay2, see the Boost.Regex manual on
      @HREF {"https://www.boost.org/doc/libs/release/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html", "Perl"}@ and
      @HREF {"https://www.boost.org/doc/libs/release/libs/regex/doc/html/boost_regex/syntax/basic_extended.html", "Extended"}@
      flavors, or read the entry for @TT "regex"@ in section 7 of the unix man pages.

      In addition to the functions mentioned below, regular expressions appear in @TO "about"@, @TO "apropos"@,
      @TO "findFiles"@, @TO "copyDirectory"@, and @TO "symlinkDirectory"@.
  Subnodes
    :functions that accept regular expressions
    match
    regex
    separate
    (select, String, String, String)
    (replace, String, String, String)
    regexQuote
///

doc ///
  Key
     regexQuote
    (regexQuote, String)
  Headline
    escape special characters in regular expressions
  Usage
    regexQuote s
  Inputs
    s:String
  Outputs
    :String
      obtained from @TT "s"@ by escaping all of the characters that have special meanings in regular expressions
      (@TT "\\, ^, $, ., |, ?, *, +, (, ), [, ], {, }"@) with a @TT "\\"@.
  Description
    Example
      match("2+2", "2+2")
      regexQuote "2+2"
      match(oo, "2+2")
  SeeAlso
    "regular expressions"
    regex
    match
///