<HTML>
<HEAD>
<TITLE>package gnu.regexp - Regular Expressions for Java</TITLE>
</HEAD>
<BODY BGCOLOR=WHITE TEXT=BLACK>
<FONT SIZE="+2"><B><CODE>package gnu.regexp;</CODE></B></FONT>
<HR NOSHADE>
<FONT SIZE="+2">Syntax and Usage Notes</FONT><BR>
<FONT SIZE="-1">This page was last updated on 8 October 1998</FONT>
<P>
<B>Brief Background</B>
<BR>

A regular expression is a character string in which some characters
are given special meaning with regard to pattern matching.
Regular expressions have been in use since the early days of computing,
and provide a powerful and efficient way to parse, interpret, search
and replace text within an application.

<P>
<B>Supported Syntax</B>
<BR>
Within a regular expression, the following characters have special meaning:<BR>
<UL>
<LI><B><I>Positional Operators</I></B><BR>
<blockquote>
<code>^</code> matches at the beginning of a line<SUP><A HREF="#note1">1</A></SUP><BR>
<code>$</code> matches at the end of a line<SUP><A HREF="#note2">2</A></SUP><BR>
<code>\A</code> matches the start of the entire string<BR>
<code>\Z</code> matches the end of the entire string<BR>
</blockquote>

<li>
<B><I>One-Character Operators</I></B><BR>
<blockquote>
<code>.</code> matches any single character<SUP><A HREF="#note3">3</A></SUP><BR>
<code>\d</code> matches any decimal digit<BR>
<code>\D</code> matches any non-digit<BR>
<code>\n</code> matches a newline character<BR>
<code>\r</code> matches a return character<BR>
<code>\s</code> matches any whitespace character<BR>
<code>\S</code> matches any non-whitespace character<BR>
<code>\t</code> matches a horizontal tab character<BR>
<code>\w</code> matches any word (alphanumeric) character<BR>
<code>\W</code> matches any non-word (non-alphanumeric) character<BR>
<code>\<i>x</i></code> matches the character <i>x</i>, if <i>x</i> is not one of the above listed escape sequences.<BR>
</blockquote>

<li>
<B><I>Character Class Operator</I></B><BR>
<blockquote>
<code>[<i>abc</i>]</code> matches any character in the set <i>a</i>, <i>b</i> or <i>c</i><BR>
<code>[^<i>abc</i>]</code> matches any character not in the set <i>a</i>, <i>b</i> or <i>c</i><BR>
<code>[<i>a-z</i>]</code> matches any character in the range <i>a</i> to <i>z</i>, inclusive<BR>
A leading or trailing dash will be interpreted literally.<BR>
</blockquote>

Within a character class expression, the following sequences have special meaning if the syntax bit RE_CHAR_CLASSES is on:<BR>
<blockquote>
<code>[:alnum:]</code> Any alphanumeric character<br>
<code>[:alpha:]</code> Any alphabetical character<br>
<code>[:blank:]</code> A space or horizontal tab<br>
<code>[:cntrl:]</code> A control character<br>
<code>[:digit:]</code> A decimal digit<br>
<code>[:graph:]</code> A non-space, non-control character<br>
<code>[:lower:]</code> A lowercase letter<br>
<code>[:print:]</code> Same as graph, but also space and tab<br>
<code>[:punct:]</code> A punctuation character<br>
<code>[:space:]</code> Any whitespace character, including newline and return<br>
<code>[:upper:]</code> An uppercase letter<br>
<code>[:xdigit:]</code> A valid hexadecimal digit<br>
</blockquote>

<li>
<B><I>Subexpressions and Backreferences</I></B><BR>
<blockquote>
<code>(<i>abc</i>)</code> matches whatever the expression <i>abc</i> would match, and saves it as a subexpression.  Also used for grouping.<BR>
<code>(?:<i>...</i>)</code> pure grouping operator, does not save contents<BR>
<code>(?#<i>...</i>)</code> embedded comment, ignored by engine<BR>
<code>\<i>n</i></code> where 0 &lt; <i>n</i> &lt; 10, matches the same thing the <i>n</i><SUP>th</SUP> subexpression matched.<BR>
</blockquote>

<li>
<B><I>Branching (Alternation) Operator</I></B><BR>
<blockquote>
<code><i>a</i>|<i>b</i></code> matches whatever the expression <i>a</i> would match, or whatever the expression <i>b</i> would match.<BR>
</blockquote>

<li>
<B><I>Repeating Operators</I></B><BR>
These symbols operate on the previous atomic expression.
<blockquote>
<code>?</code> matches the preceding expression or the null string<BR>
<code>*</code> matches the null string or any number of repetitions of the preceding expression<BR>
<code>+</code> matches one or more repetitions of the preceding expression<BR>
<code>{<i>m</i>}</code> matches exactly <i>m</i> repetitions of the preceding expression<BR>
<code>{<i>m</i>,<i>n</i>}</code> matches between <i>m</i> and <i>n</i> repetitions of the preceding expression, inclusive<BR>
<code>{<i>m</i>,}</code> matches <i>m</i> or more repetitions of the preceding expression<BR>
</blockquote>
<li>
<B><I>Stingy (Minimal) Matching</I></B><BR>

If a repeating operator (above) is immediately followed by a
<code>?</code>, the repeating operator will stop at the smallest
number of repetitions that can complete the rest of the match.<BR>

</UL>
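<P>
The operators listed above can be tried out with a short program. The sketch below is only an illustration: it uses <code>java.util.regex</code> from the standard Java library as a stand-in engine (an assumption made so the example runs without extra libraries, not the <code>gnu.regexp</code> API), and it is limited to constructs from the list above that both engines are expected to share. The class and method names are hypothetical, and behavior under <code>gnu.regexp</code> itself has not been verified here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; java.util.regex stands in for gnu.regexp here.
class SyntaxSketch {
    // Returns the text of the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "[a] and [b]";

        // Greedy repetition takes as much as it can: [a] and [b]
        System.out.println(firstMatch("\\[.+\\]", text));

        // Stingy repetition stops at the first possible end: [a]
        System.out.println(firstMatch("\\[.+?\\]", text));

        // Backreference: a word character followed by itself: ll
        System.out.println(firstMatch("(\\w)\\1", "hello"));

        // Bounded repetition: two or three digits in a row: 234
        System.out.println(firstMatch("\\d{2,3}", "a1b2345"));

        // Alternation: dog
        System.out.println(firstMatch("cat|dog", "hotdog"));
    }
}
```

Note that every backslash meant for the regular expression engine is doubled in the Java source, so that the Java compiler passes a single backslash through to the engine.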
<P>
<B>Unsupported Syntax</B>
<BR>

Some flavors of regular expression utilities support additional escape
sequences and operators that <code>gnu.regexp</code> does not.  The list
below is not exhaustive.  In the future, <code>gnu.regexp</code> may
support some or all of the following:<BR>

<blockquote>
<code>(?=<i>...</i>)</code> positive lookahead operator (Perl5)<BR>
<code>(?!<i>...</i>)</code> negative lookahead operator (Perl5)<BR>
<code>(?<i>mods</i>)</code> inlined compilation/execution modifiers (Perl5)<BR>
<code>\G</code> end of previous match (Perl5)<BR>
<code>\b</code> word break positional anchor (Perl5)<BR>
<code>\B</code> non-word break positional anchor (Perl5)<BR>
<code>\&lt;</code> start of word positional anchor (egrep)<BR>
<code>\&gt;</code> end of word positional anchor (egrep)<BR>
<code>[.<i>symbol</i>.]</code> collating symbol in class expression (POSIX)<BR>
<code>[=<i>class</i>=]</code> equivalence class in class expression (POSIX)<BR>
</blockquote>

<P>
<B>Java Integration</B>
<BR>

In a Java environment, a regular expression operates on a string of
Unicode characters, represented either as an instance of
<code>java.lang.String</code> or as an array of the primitive
<code>char</code> type.  This means that the unit of matching is a
Unicode character, not a single byte.  Generally this will not present
problems in a Java program, because Java takes pains to ensure that
all textual data uses the Unicode standard.

<P>

Because Java string processing takes care of certain escape sequences,
they are not implemented in <code>gnu.regexp</code>.  You should be
aware that the following escape sequences are handled by the Java
compiler if found in the Java source:<BR>

<blockquote>
<code>\b</code> backspace<BR>
<code>\f</code> form feed<BR>
<code>\n</code> newline<BR>
<code>\r</code> carriage return<BR>
<code>\t</code> horizontal tab<BR>
<code>\"</code> double quote<BR>
<code>\'</code> single quote<BR>
<code>\\</code> backslash<BR>
<code>\<i>xxx</i></code> character, in octal (000-377)<BR>
<code>\u<i>xxxx</i></code> Unicode character, in hexadecimal (0000-FFFF)<BR>
</blockquote>

In addition, note that the <code>\u</code> escape sequences are
meaningful anywhere in a Java program, not merely within a singly- or
doubly-quoted character string, and are converted prior to any of the
other escape sequences.  For example, the line <BR>

<code>gnu.regexp.RE exp = new gnu.regexp.RE("\u005cn");</code><BR>

would be converted by first replacing <code>\u005c</code> with a
backslash, then converting <code>\n</code> to a newline.  By the time
the RE constructor is called, it will be passed a String object
containing only the Unicode newline character.
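<P>
The conversion order described above can be checked with plain Java, without any regular expression library. The following is a minimal sketch with hypothetical class and method names. It compares the literal from the example with a literal in which the backslash is doubled so that it survives compilation.

```java
// Hypothetical demo class showing what the compiler does to each literal.
class EscapeOrderSketch {
    // The compiler first turns the Unicode escape into a backslash, then
    // reads backslash-n as a newline, so this string holds one character.
    static String compiledAway() {
        return "\u005cn";
    }

    // Doubling the backslash keeps it in the string: two characters,
    // a backslash and the letter n, left for a regex engine to interpret.
    static String passedThrough() {
        return "\\n";
    }

    public static void main(String[] args) {
        System.out.println(compiledAway().length());           // 1
        System.out.println(compiledAway().charAt(0) == '\n');  // true
        System.out.println(passedThrough().length());          // 2
    }
}
```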

<P>

The POSIX character classes (above), and the equivalent shorthand
escapes (<code>\d</code>, <code>\w</code> and the like) are
implemented to use the <code>java.lang.Character</code> static
functions whenever possible.  For example, <code>\w</code> and
<code>[:alnum:]</code> (the latter only from within a class
expression) will invoke the Java function
<code>Character.isLetterOrDigit()</code> when executing.  It is
generally better to use the POSIX expressions than a range such as
<code>[a-zA-Z0-9]</code>, because the latter will not match any letter
characters outside the ASCII range (for example, the letter u with an
umlaut, "<code>&uuml;</code>").
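<P>
The difference can be seen directly with the <code>java.lang.Character</code> functions. The sketch below is an illustration only. The class name and the <code>inAsciiRange</code> helper are hypothetical, the helper being a hand-written stand-in for the range <code>[a-zA-Z0-9]</code>, and no regular expression engine is involved.

```java
// Hypothetical demo class comparing an ASCII range test with Character.
class WordCharSketch {
    // ASCII-only test, equivalent to the range [a-zA-Z0-9].
    static boolean inAsciiRange(char c) {
        return (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        char u = '\u00fc'; // the letter u with an umlaut

        System.out.println(Character.isLetterOrDigit(u)); // true
        System.out.println(inAsciiRange(u));              // false
    }
}
```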

<P>
<B>Reference Material</B>
<BR>
<UL>
<LI><B><I>Print Books and Publications</I></B><BR>
Friedl, Jeffrey E.F., <I>Mastering Regular Expressions</I>. O'Reilly &amp; Associates, Inc., Sebastopol, California, 1997.<BR>
<P>
<LI><B><I>Software Manuals and Guides</I></B><BR>
Berry, Karl and Hargreaves, Kathryn A., <A HREF="http://www.cs.utah.edu/csinfo/texinfo/regex/regex_toc.html">GNU Info Regex Manual Edition 0.12a</A>, 19 September 1992.<BR>
<code>perlre(1)</code> man page (Perl Programmer's Reference Guide)<BR>
<code>regcomp(3)</code> man page (GNU C)<BR>
<code>gawk(1)</code> man page (GNU utilities)<BR>
<code>sed(1)</code> man page (GNU utilities)<BR>
<code>ed(1)</code> man page (GNU utilities)<BR>
<code>grep(1)</code> man page (GNU utilities)<BR>
<code>regexp(n)</code> and <code>regsub(n)</code> man pages (TCL)<BR>
</UL>

<P>
<B>Notes</B>
<BR>
<SUP><A NAME="note1">1</A></SUP> but see the REG_NOTBOL and REG_MULTILINE flags<BR>
<SUP><A NAME="note2">2</A></SUP> but see the REG_NOTEOL and REG_MULTILINE flags<BR>
<SUP><A NAME="note3">3</A></SUP> but see the REG_MULTILINE flag<BR>
<P>
<FONT SIZE="-1">
<A HREF="index.html">[gnu.regexp]</A>
<A HREF="changes.html">[change history]</A>
<A HREF="api/packages.html">[api documentation]</A>
<A HREF="reapplet.html">[test applet]</A>
<A HREF="credits.html">[credits]</A>
</FONT>

</BODY>
</HTML>