File: stdlib_re_compile_func.txt

package info (click to toggle)
erlang 1%3A27.3.4.1%2Bdfsg-1
  • links: PTS, VCS
  • area: main
  • in suites: trixie
  • size: 225,000 kB
  • sloc: erlang: 1,658,966; ansic: 405,769; cpp: 177,850; xml: 82,435; makefile: 15,031; sh: 14,401; lisp: 9,812; java: 8,603; asm: 6,541; perl: 5,836; python: 5,484; sed: 72
file content (173 lines) | stat: -rw-r--r-- 8,450 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173

  compile(Regexp)

  The same as compile(Regexp,[])

  compile(Regexp, Options)

  Compiles a regular expression, with the syntax described below,
  into an internal format to be used later as a parameter to run/2
  and run/3.

  Compiling the regular expression before matching is useful if the
  same expression is to be used in matching against multiple
  subjects during the lifetime of the program. Compiling once and
  executing many times is far more efficient than compiling each
  time one wants to match.

  When option unicode is specified, the regular expression is to
  be specified as a valid Unicode charlist(), otherwise as any
  valid iodata/0.

  Options:

   • unicode - The regular expression is specified as a Unicode 
     charlist() and the resulting regular expression code is to
     be run against a valid Unicode charlist() subject. Also
     consider option ucp when using Unicode characters.

   • anchored - The pattern is forced to be "anchored", that is,
     it is constrained to match only at the first matching point
     in the string that is searched (the "subject string"). This
     effect can also be achieved by appropriate constructs in the
     pattern itself.

   • caseless - Letters in the pattern match both uppercase and
     lowercase letters. It is equivalent to Perl option /i and
     can be changed within a pattern by a (?i) option setting.
     Uppercase and lowercase letters are defined as in the ISO
     8859-1 character set.

   • dollar_endonly - A dollar metacharacter in the pattern
     matches only at the end of the subject string. Without this
     option, a dollar also matches immediately before a newline
     at the end of the string (but not before any other
     newlines). This option is ignored if option multiline is
     specified. There is no equivalent option in Perl, and it
     cannot be set within a pattern.

   • dotall - A dot in the pattern matches all characters,
     including those indicating newline. Without it, a dot does
     not match when the current position is at a newline. This
     option is equivalent to Perl option /s and it can be
     changed within a pattern by a (?s) option setting. A
     negative class, such as [^a], always matches newline
     characters, independent of the setting of this option.

   • extended - If this option is set, most white space
     characters in the pattern are totally ignored except when
     escaped or inside a character class. However, white space is
     not allowed within sequences such as (?> that introduce
     various parenthesized subpatterns, nor within a numerical
     quantifier such as {1,3}. However, ignorable white space
     is permitted between an item and a following quantifier and
     between a quantifier and a following + that indicates
     possessiveness.

     White space did not used to include the VT character (code
     11), because Perl did not treat this character as white
     space. However, Perl changed at release 5.18, so PCRE
     followed at release 8.34, and VT is now treated as white
     space.

     This also causes characters between an unescaped # outside a
     character class and the next newline, inclusive, to be
     ignored. This is equivalent to Perl's /x option, and it
     can be changed within a pattern by a (?x) option setting.

     With this option, comments inside complicated patterns can
     be included. However, notice that this applies only to data
     characters. Whitespace characters can never appear within
     special character sequences in a pattern, for example within
     sequence (?( that introduces a conditional subpattern.

   • firstline - An unanchored pattern is required to match
     before or at the first newline in the subject string,
     although the matched text can continue over the newline.

   • multiline - By default, PCRE treats the subject string as
     consisting of a single line of characters (even if it
     contains newlines). The "start of line" metacharacter (^)
     matches only at the start of the string, while the "end of
     line" metacharacter ($) matches only at the end of the
     string, or before a terminating newline (unless option 
     dollar_endonly is specified). This is the same as in Perl.

     When this option is specified, the "start of line" and "end
     of line" constructs match immediately following or
     immediately before internal newlines in the subject string,
     respectively, as well as at the very start and end. This is
     equivalent to Perl option /m and can be changed within a
     pattern by a (?m) option setting. If there are no newlines
     in a subject string, or no occurrences of ^ or $ in a
     pattern, setting multiline has no effect.

   • no_auto_capture - Disables the use of numbered capturing
     parentheses in the pattern. Any opening parenthesis that is
     not followed by ? behaves as if it is followed by ?:.
     Named parentheses can still be used for capturing (and they
     acquire numbers in the usual way). There is no equivalent
     option in Perl.

   • dupnames - Names used to identify capturing subpatterns
     need not be unique. This can be helpful for certain types of
     pattern when it is known that only one instance of the named
     subpattern can ever be matched. More details of named
     subpatterns are provided below.

   • ungreedy - Inverts the "greediness" of the quantifiers so
     that they are not greedy by default, but become greedy if
     followed by "?". It is not compatible with Perl. It can also
     be set by a (?U) option setting within the pattern.

   • {newline, NLSpec} - Overrides the default definition of a
     newline in the subject string, which is LF (ASCII 10) in
     Erlang.

      ○ cr - Newline is indicated by a single character cr
        (ASCII 13).

      ○ lf - Newline is indicated by a single character LF
        (ASCII 10), the default.

      ○ crlf - Newline is indicated by the two-character CRLF
        (ASCII 13 followed by ASCII 10) sequence.

      ○ anycrlf - Any of the three preceding sequences is to
        be recognized.

      ○ any - Any of the newline sequences above, and the
        Unicode sequences VT (vertical tab, U+000B), FF
        (formfeed, U+000C), NEL (next line, U+0085), LS (line
        separator, U+2028), and PS (paragraph separator,
        U+2029).

   • bsr_anycrlf - Specifies specifically that \R is to match
     only the CR, LF, or CRLF sequences, not the Unicode-specific
     newline characters.

   • bsr_unicode - Specifies specifically that \R is to match
     all the Unicode newline characters (including CRLF, and so
     on, the default).

   • no_start_optimize - Disables optimization that can
     malfunction if "Special start-of-pattern items" are present
     in the regular expression. A typical example would be when
     matching "DEFABC" against "(COMMIT)ABC", where the start
     optimization of PCRE would skip the subject up to "A" and
     never realize that the (COMMIT) instruction is to have made
     the matching fail. This option is only relevant if you use
     "start-of-pattern items", as discussed in section PCRE
     Regular Expression Details.

   • ucp - Specifies that Unicode character properties are to be
     used when resolving \B, \b, \D, \d, \S, \s, \W and \w.
     Without this flag, only ISO Latin-1 properties are used.
     Using Unicode properties hurts performance, but is
     semantically correct when working with Unicode characters
     beyond the ISO Latin-1 range.

   • never_utf - Specifies that the (UTF) and/or (UTF8)
     "start-of-pattern items" are forbidden. This flag cannot be
     combined with option unicode. Useful if ISO Latin-1
     patterns from an external source are to be compiled.