File: stdlib_re_run_func.txt

package info (click to toggle)
erlang 1%3A27.3.4.1%2Bdfsg-1
  • links: PTS, VCS
  • area: main
  • in suites: trixie
  • size: 225,000 kB
  • sloc: erlang: 1,658,966; ansic: 405,769; cpp: 177,850; xml: 82,435; makefile: 15,031; sh: 14,401; lisp: 9,812; java: 8,603; asm: 6,541; perl: 5,836; python: 5,484; sed: 72
file content (563 lines) | stat: -rw-r--r-- 25,196 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563

  run(Subject, RE)

  There is no documentation for run(Subject, RE, [])

  run(Subject, RE, Options)

  Executes a regular expression matching, and returns match/{match,
  Captured} or nomatch.

  The regular expression can be specified either as iodata/0 in
  which case it is automatically compiled (as by compile/2) and
  executed, or as a precompiled mp/0 in which case it is executed
  against the subject directly.

  When compilation is involved, exception badarg is thrown if a
  compilation error occurs. Call compile/2 to get information
  about the location of the error in the regular expression.

  If the regular expression is previously compiled, the option list
  can only contain the following options:

   • anchored

   • {capture, ValueSpec}/{capture, ValueSpec, Type}

   • global

   • {match_limit, integer() >= 0}

   • {match_limit_recursion, integer() >= 0}

   • {newline, NLSpec}

   • notbol

   • notempty

   • notempty_atstart

   • noteol

   • {offset, integer() >= 0}

   • report_errors

  Otherwise all options valid for function compile/2 are also
  allowed. Options allowed both for compilation and execution of a
  match, namely anchored and {newline, NLSpec}, affect both the
  compilation and execution if present together with a
  non-precompiled regular expression.

  If the regular expression was previously compiled with option 
  unicode, Subject is to be provided as a valid Unicode 
  charlist(), otherwise any iodata/0 will do. If compilation is
  involved and option unicode is specified, both Subject and the
  regular expression are to be specified as valid Unicode 
  charlists().

  {capture, ValueSpec}/{capture, ValueSpec, Type} defines what to
  return from the function upon successful matching. The capture
  tuple can contain both a value specification, telling which of the
  captured substrings are to be returned, and a type specification,
  telling how captured substrings are to be returned (as index
  tuples, lists, or binaries). The options are described in detail
  below.

  If the capture options describe that no substring capturing is to
  be done ({capture, none}), the function returns the single atom 
  match upon successful matching, otherwise the tuple {match,
  ValueList}. Disabling capturing can be done either by specifying 
  none or an empty list as ValueSpec.

  Option report_errors adds the possibility that an error tuple is
  returned. The tuple either indicates a matching error (
  match_limit or match_limit_recursion), or a compilation error,
  where the error tuple has the format {error, {compile,
  CompileErr}}. Notice that if option report_errors is not
  specified, the function never returns error tuples, but reports
  compilation errors as a badarg exception and failed matches
  because of exceeded match limits simply as nomatch.

  The following options are relevant for execution:

   • anchored - Limits run/3 to matching at the first matching
     position. If a pattern was compiled with anchored, or
     turned out to be anchored by virtue of its contents, it
     cannot be made unanchored at matching time, hence there is
     no unanchored option.

   • global - Implements global (repetitive) search (flag g in
     Perl). Each match is returned as a separate list/0
     containing the specific match and any matching
     subexpressions (or as specified by option capture. The 
     Captured part of the return value is hence a list/0 of 
     list/0s when this option is specified.

     The interaction of option global with a regular expression
     that matches an empty string surprises some users. When
     option global is specified, run/3 handles empty matches
     in the same way as Perl: a zero-length match at any point is
     also retried with options [anchored, notempty_atstart]. If
     that search gives a result of length > 0, the result is
     included. Example:

       re:run("cat","(|at)",[global]).

     The following matchings are performed:

      ○ At offset 0 - The regular expression (|at) first
        match at the initial position of string cat, giving
        the result set [{0,0},{0,0}] (the second {0,0} is
        because of the subexpression marked by the
        parentheses). As the length of the match is 0, we do
        not advance to the next position yet.

      ○ At offset 0 with [anchored, notempty_atstart] -
        The search is retried with options [anchored,
        notempty_atstart] at the same position, which does
        not give any interesting result of longer length, so
        the search position is advanced to the next character (
        a).

      ○ At offset 1 - The search results in [{1,0},{1,0}],
        so this search is also repeated with the extra
        options.

      ○ At offset 1 with [anchored, notempty_atstart] -
        Alternative ab is found and the result is
        [{1,2},{1,2}]. The result is added to the list of
        results and the position in the search string is
        advanced two steps.

      ○ At offset 3 - The search once again matches the
        empty string, giving [{3,0},{3,0}].

      ○ At offset 1 with [anchored, notempty_atstart] -
        This gives no result of length > 0 and we are at the
        last position, so the global search is complete.

     The result of the call is:

       {match,[[{0,0},{0,0}],[{1,0},{1,0}],[{1,2},{1,2}],[{3,0},{3,0}]]}

   • notempty - An empty string is not considered to be a valid
     match if this option is specified. If alternatives in the
     pattern exist, they are tried. If all the alternatives match
     the empty string, the entire match fails.

     Example:

     If the following pattern is applied to a string not
     beginning with "a" or "b", it would normally match the empty
     string at the start of the subject:

       a?b?

     With option notempty, this match is invalid, so run/3
     searches further into the string for occurrences of "a" or
     "b".

   • notempty_atstart - Like notempty, except that an empty
     string match that is not at the start of the subject is
     permitted. If the pattern is anchored, such a match can
     occur only if the pattern contains \K.

     Perl has no direct equivalent of notempty or 
     notempty_atstart, but it does make a special case of a
     pattern match of the empty string within its split()
     function, and when using modifier /g. The Perl behavior
     can be emulated after matching a null string by first trying
     the match again at the same offset with notempty_atstart
     and anchored, and then, if that fails, by advancing the
     starting offset (see below) and trying an ordinary match
     again.

   • notbol - Specifies that the first character of the subject
     string is not the beginning of a line, so the circumflex
     metacharacter is not to match before it. Setting this
     without multiline (at compile time) causes circumflex
     never to match. This option only affects the behavior of the
     circumflex metacharacter. It does not affect \A.

   • noteol - Specifies that the end of the subject string is
     not the end of a line, so the dollar metacharacter is not to
     match it nor (except in multiline mode) a newline
     immediately before it. Setting this without multiline (at
     compile time) causes dollar never to match. This option
     affects only the behavior of the dollar metacharacter. It
     does not affect \Z or \z.

   • report_errors - Gives better control of the error handling
     in run/3. When specified, compilation errors (if the
     regular expression is not already compiled) and runtime
     errors are explicitly returned as an error tuple.

     The following are the possible runtime errors:

      ○ match_limit - The PCRE library sets a limit on how
        many times the internal match function can be called.
        Defaults to 10,000,000 in the library compiled for
        Erlang. If {error, match_limit} is returned, the
        execution of the regular expression has reached this
        limit. This is normally to be regarded as a nomatch,
        which is the default return value when this occurs,
        but by specifying report_errors, you are informed
        when the match fails because of too many internal
        calls.

      ○ match_limit_recursion - This error is very similar to 
        match_limit, but occurs when the internal match
        function of PCRE is "recursively" called more times
        than the match_limit_recursion limit, which defaults
        to 10,000,000 as well. Notice that as long as the 
        match_limit and match_limit_default values are kept
        at the default values, the match_limit_recursion
        error cannot occur, as the match_limit error occurs
        before that (each recursive call is also a call, but
        not conversely). Both limits can however be changed,
        either by setting limits directly in the regular
        expression string (see section PCRE Regular
        Eexpression Details) or by specifying options to 
        run/3.

     It is important to understand that what is referred to as
     "recursion" when limiting matches is not recursion on the C
     stack of the Erlang machine or on the Erlang process stack.
     The PCRE version compiled into the Erlang VM uses machine
     "heap" memory to store values that must be kept over
     recursion in regular expression matches.

   • {match_limit, integer() >= 0} - Limits the execution time
     of a match in an implementation-specific way. It is
     described as follows by the PCRE documentation:

       The match_limit field provides a means of preventing
       PCRE from using up a vast amount of resources when
       running patterns that are not going to match, but which
       have a very large number of possibilities in their
       search trees. The classic example is a pattern that uses
       nested unlimited repeats. Internally, pcre_exec() uses a
       function called match(), which it calls repeatedly
       (sometimes recursively). The limit set by match_limit is
       imposed on the number of times this function is called
       during a match, which has the effect of limiting the
       amount of backtracking that can take place. For patterns
       that are not anchored, the count restarts from zero for
       each position in the subject string.

     This means that runaway regular expression matches can fail
     faster if the limit is lowered using this option. The
     default value 10,000,000 is compiled into the Erlang VM.

  Note

       This option does in no way affect the execution of the
       Erlang VM in terms of "long running BIFs". run/3
       always gives control back to the scheduler of Erlang
       processes at intervals that ensures the real-time
       properties of the Erlang system.

   • {match_limit_recursion, integer() >= 0} - Limits the
     execution time and memory consumption of a match in an
     implementation-specific way, very similar to match_limit.
     It is described as follows by the PCRE documentation:

       The match_limit_recursion field is similar to
       match_limit, but instead of limiting the total number of
       times that match() is called, it limits the depth of
       recursion. The recursion depth is a smaller number than
       the total number of calls, because not all calls to
       match() are recursive. This limit is of use only if it
       is set smaller than match_limit. Limiting the recursion
       depth limits the amount of machine stack that can be
       used, or, when PCRE has been compiled to use memory on
       the heap instead of the stack, the amount of heap memory
       that can be used.

     The Erlang VM uses a PCRE library where heap memory is used
     when regular expression match recursion occurs. This
     therefore limits the use of machine heap, not C stack.

     Specifying a lower value can result in matches with deep
     recursion failing, when they should have matched:

       1> re:run("aaaaaaaaaaaaaz","(a+)*z").
       {match,[{0,14},{0,13}]}
       2> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5}]).
       nomatch
       3> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5},report_errors]).
       {error,match_limit_recursion}

     This option and option match_limit are only to be used in
     rare cases. Understanding of the PCRE library internals is
     recommended before tampering with these limits.

   • {offset, integer() >= 0} - Start matching at the offset
     (position) specified in the subject string. The offset is
     zero-based, so that the default is {offset,0} (all of the
     subject string).

   • {newline, NLSpec} - Overrides the default definition of a
     newline in the subject string, which is LF (ASCII 10) in
     Erlang.

      ○ cr - Newline is indicated by a single character CR
        (ASCII 13).

      ○ lf - Newline is indicated by a single character LF
        (ASCII 10), the default.

      ○ crlf - Newline is indicated by the two-character CRLF
        (ASCII 13 followed by ASCII 10) sequence.

      ○ anycrlf - Any of the three preceding sequences is be
        recognized.

      ○ any - Any of the newline sequences above, and the
        Unicode sequences VT (vertical tab, U+000B), FF
        (formfeed, U+000C), NEL (next line, U+0085), LS (line
        separator, U+2028), and PS (paragraph separator,
        U+2029).

   • bsr_anycrlf - Specifies specifically that \R is to match
     only the CR LF, or CRLF sequences, not the Unicode-specific
     newline characters. (Overrides the compilation option.)

   • bsr_unicode - Specifies specifically that \R is to match
     all the Unicode newline characters (including CRLF, and so
     on, the default). (Overrides the compilation option.)

   • {capture, ValueSpec}/{capture, ValueSpec, Type} -
     Specifies which captured substrings are returned and in what
     format. By default, run/3 captures all of the matching
     part of the substring and all capturing subpatterns (all of
     the pattern is automatically captured). The default return
     type is (zero-based) indexes of the captured parts of the
     string, specified as {Offset,Length} pairs (the index 
     Type of capturing).

     As an example of the default behavior, the following call
     returns, as first and only captured string, the matching
     part of the subject ("abcd" in the middle) as an index pair 
     {3,4}, where character positions are zero-based, just as in
     offsets:

       re:run("ABCabcdABC","abcd",[]).

     The return value of this call is:

       {match,[{3,4}]}

     Another (and quite common) case is where the regular
     expression matches all of the subject:

       re:run("ABCabcdABC",".*abcd.*",[]).

     Here the return value correspondingly points out all of the
     string, beginning at index 0, and it is 10 characters long:

       {match,[{0,10}]}

     If the regular expression contains capturing subpatterns,
     like in:

       re:run("ABCabcdABC",".*(abcd).*",[]).

     all of the matched subject is captured, as well as the
     captured substrings:

       {match,[{0,10},{3,4}]}

     The complete matching pattern always gives the first return
     value in the list and the remaining subpatterns are added in
     the order they occurred in the regular expression.

     The capture tuple is built up as follows:

      ○ ValueSpec - Specifies which captured (sub)patterns
        are to be returned. ValueSpec can either be an atom
        describing a predefined set of return values, or a
        list containing the indexes or the names of specific
        subpatterns to return.

        The following are the predefined sets of subpatterns:

         ◼ all - All captured subpatterns including the
           complete matching string. This is the default.

         ◼ all_names - All named subpatterns in the
           regular expression, as if a list/0 of all the
           names in alphabetical order was specified. The
           list of all names can also be retrieved with 
           inspect/2.

         ◼ first - Only the first captured subpattern,
           which is always the complete matching part of
           the subject. All explicitly captured subpatterns
           are discarded.

         ◼ all_but_first - All but the first matching
           subpattern, that is, all explicitly captured
           subpatterns, but not the complete matching part
           of the subject string. This is useful if the
           regular expression as a whole matches a large
           part of the subject, but the part you are
           interested in is in an explicitly captured
           subpattern. If the return type is list or 
           binary, not returning subpatterns you are not
           interested in is a good way to optimize.

         ◼ none - Returns no matching subpatterns, gives
           the single atom match as the return value of
           the function when matching successfully instead
           of the {match, list()} return. Specifying an
           empty list gives the same behavior.

        The value list is a list of indexes for the
        subpatterns to return, where index 0 is for all of the
        pattern, and 1 is for the first explicit capturing
        subpattern in the regular expression, and so on. When
        using named captured subpatterns (see below) in the
        regular expression, one can use atom/0s or string/0
        s to specify the subpatterns to be returned. For
        example, consider the regular expression:

          ".*(abcd).*"

        matched against string "ABCabcdABC", capturing only
        the "abcd" part (the first explicit subpattern):

          re:run("ABCabcdABC",".*(abcd).*",[{capture,[1]}]).

        The call gives the following result, as the first
        explicitly captured subpattern is "(abcd)", matching
        "abcd" in the subject, at (zero-based) position 3, of
        length 4:

          {match,[{3,4}]}

        Consider the same regular expression, but with the
        subpattern explicitly named 'FOO':

          ".*(?<FOO>abcd).*"

        With this expression, we could still give the index of
        the subpattern with the following call:

          re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,[1]}]).

        giving the same result as before. But, as the
        subpattern is named, we can also specify its name in
        the value list:

          re:run("ABCabcdABC",".*(?<FOO>abcd).*",[{capture,['FOO']}]).

        This would give the same result as the earlier
        examples, namely:

          {match,[{3,4}]}

        The values list can specify indexes or names not
        present in the regular expression, in which case the
        return values vary depending on the type. If the type
        is index, the tuple {-1,0} is returned for values
        with no corresponding subpattern in the regular
        expression, but for the other types (binary and 
        list), the values are the empty binary or list,
        respectively.

      ○ Type - Optionally specifies how captured substrings
        are to be returned. If omitted, the default of index
        is used.

        Type can be one of the following:

         ◼ index - Returns captured substrings as pairs of
           byte indexes into the subject string and length
           of the matching string in the subject (as if the
           subject string was flattened with 
           erlang:iolist_to_binary/1 or 
           unicode:characters_to_binary/2 before
           matching). Notice that option unicode results
           in byte-oriented indexes in a (possibly
           virtual) UTF-8 encoded binary. A byte index
           tuple {0,2} can therefore represent one or two
           characters when unicode is in effect. This can
           seem counter-intuitive, but has been deemed the
           most effective and useful way to do it. To
           return lists instead can result in simpler code
           if that is desired. This return type is the
           default.

         ◼ list - Returns matching substrings as lists of
           characters (Erlang string/0s). It option 
           unicode is used in combination with the \C
           sequence in the regular expression, a captured
           subpattern can contain bytes that are not valid
           UTF-8 (\C matches bytes regardless of character
           encoding). In that case the list capturing can
           result in the same types of tuples that 
           unicode:characters_to_list/2 can return, namely
           three-tuples with tag incomplete or error,
           the successfully converted characters and the
           invalid UTF-8 tail of the conversion as a
           binary. The best strategy is to avoid using the
           \C sequence when capturing lists.

         ◼ binary - Returns matching substrings as
           binaries. If option unicode is used, these
           binaries are in UTF-8. If the \C sequence is
           used together with unicode, the binaries can
           be invalid UTF-8.

     In general, subpatterns that were not assigned a value in
     the match are returned as the tuple {-1,0} when type is 
     index. Unassigned subpatterns are returned as the empty
     binary or list, respectively, for other return types.
     Consider the following regular expression:

       ".*((?<FOO>abdd)|a(..d)).*"

     There are three explicitly capturing subpatterns, where the
     opening parenthesis position determines the order in the
     result, hence ((?<FOO>abdd)|a(..d)) is subpattern index 1, 
     (?<FOO>abdd) is subpattern index 2, and (..d) is
     subpattern index 3. When matched against the following
     string:

       "ABCabcdABC"

     the subpattern at index 2 does not match, as "abdd" is not
     present in the string, but the complete pattern matches
     (because of the alternative a(..d)). The subpattern at
     index 2 is therefore unassigned and the default return value
     is:

       {match,[{0,10},{3,4},{-1,0},{4,3}]}

     Setting the capture Type to binary gives:

       {match,[<<"ABCabcdABC">>,<<"abcd">>,<<>>,<<"bcd">>]}

     Here the empty binary (<<>>) represents the unassigned
     subpattern. In the binary case, some information about the
     matching is therefore lost, as <<>> can also be an empty
     string captured.

     If differentiation between empty matches and non-existing
     subpatterns is necessary, use the type index and do the
     conversion to the final type in Erlang code.

     When option global is speciified, the capture
     specification affects each match separately, so that:

       re:run("cacb","c(a|b)",[global,{capture,[1],list}]).

     gives

       {match,[["a"],["b"]]}

  For a descriptions of options only affecting the compilation step,
  see compile/2.