File: syntax.html

package info (click to toggle)
lib-gnu.regexp-java 1.0.8-1
links: PTS
area: main
in suites: potato
size: 772 kB
ctags: 675
sloc: java: 1,942; makefile: 227; sh: 17
file content (223 lines) | stat: -rw-r--r-- 9,488 bytes
<HTML>
<HEAD>
<TITLE>package gnu.regexp - Regular Expressions for Java</TITLE>
</HEAD>
<BODY BGCOLOR=WHITE TEXT=BLACK>
<FONT SIZE="+2"><B><CODE>package gnu.regexp;</CODE></B></FONT>
<HR NOSHADE>
<FONT SIZE="+2">Syntax and Usage Notes</FONT><BR>
<FONT SIZE="-1">This page was last updated on 8 October 1998</FONT>
<P>
<B>Brief Background</B>
<BR>

A regular expression consists of a character string where some
characters are given special meaning with regard to pattern matching.
Regular expressions have been in use from the early days of computing,
and provide a powerful and efficient way to parse, interpret and
search and replace text within an application.

<P>
<B>Supported Syntax</B>
<BR>
Within a regular expression, the following characters have special meaning:<BR>
<UL>
<LI><B><I>Positional Operators</I></B><BR>
<blockquote>
<code>^</code> matches at the beginning of a line<SUP><A HREF="#note1">1</A></SUP><BR>
<code>$</code> matches at the end of a line<SUP><A HREF="#note2">2</A></SUP><BR>
<code>\A</code> matches the start of the entire string<BR>
<code>\Z</code> matches the end of the entire string<BR>
</blockquote>

<li>
<B><I>One-Character Operators</I></B><BR>
<blockquote>
<code>.</code> matches any single character<SUP><A HREF="#note3">3</A></SUP><BR>
<code>\d</code> matches any decimal digit<BR>
<code>\D</code> matches any non-digit<BR>
<code>\n</code> matches a newline character<BR>
<code>\r</code> matches a return character<BR>
<code>\s</code> matches any whitespace character<BR>
<code>\S</code> matches any non-whitespace character<BR>
<code>\t</code> matches a horizontal tab character<BR>
<code>\w</code> matches any word (alphanumeric) character<BR>
<code>\W</code> matches any non-word (alphanumeric) character<BR>
<code>\<i>x</i></code> matches the character <i>x</i>, if <i>x</i> is not one of the above listed escape sequences.<BR>
</blockquote>

<li>
<B><I>Character Class Operator</I></B><BR>
<blockquote>
<code>[<i>abc</i>]</code> matches any character in the set <i>a</i>, <i>b</i> or <i>c</i><BR>
<code>[^<i>abc</i>]</code> matches any character not in the set <i>a</i>, <i>b</i> or <i>c</i><BR>
<code>[<i>a-z</i>]</code> matches any character in the range <i>a</i> to <i>z</i>, inclusive<BR>
A leading or trailing dash will be interpreted literally.<BR>
</blockquote>

Within a character class expression, the following sequences have special meaning if the syntax bit RE_CHAR_CLASSES is on:<BR>
<blockquote>
<code>[:alnum:]</code> Any alphanumeric character<br>
<code>[:alpha:]</code> Any alphabetical character<br>
<code>[:blank:]</code> A space or horizontal tab<br>
<code>[:cntrl:]</code> A control character<br>
<code>[:digit:]</code> A decimal digit<br>
<code>[:graph:]</code> A non-space, non-control character<br>
<code>[:lower:]</code> A lowercase letter<br>
<code>[:print:]</code> Same as graph, but also space and tab<br>
<code>[:punct:]</code> A punctuation character<br>
<code>[:space:]</code> Any whitespace character, including newline and return<br>
<code>[:upper:]</code> An uppercase letter<br>
<code>[:xdigit:]</code> A valid hexadecimal digit<br>
</blockquote>

<li>
<B><I>Subexpressions and Backreferences</I></B><BR>
<blockquote>
<code>(<i>abc</i>)</code> matches whatever the expression <i>abc</i> would match, and saves it as a subexpression.  Also used for grouping.<BR>
<code>(?:<i>...</i>)</code> pure grouping operator, does not save contents<BR>
<code>(?#<i>...</i>)</code> embedded comment, ignored by engine<BR>
<code>\<i>n</i></code> where 0 &lt; <i>n</i> &lt; 10, matches the same thing the <i>n</i><super>th</super> subexpression matched.<BR>
</blockquote>

<li>
<B><I>Branching (Alternation) Operator</I></B><BR>
<blockquote>
<code><i>a</i>|<i>b</i></code> matches whatever the expression <i>a</i> would match, or whatever the expression <i>b</i> would match.<BR>
</blockquote>

<li>
<B><I>Repeating Operators</I></B><BR>
These symbols operate on the previous atomic expression.
<blockquote>
<code>?</code> matches the preceding expression or the null string<BR>
<code>*</code> matches the null string or any number of repetitions of the preceding expression<BR>
<code>+</code> matches one or more repetitions of the preceding expression<BR>
<code>{<i>m</i>}</code> matches exactly <i>m</i> repetitions of the one-character expression<BR>
<code>{<i>m</i>,<i>n</i>}</code> matches between <i>m</i> and <i>n</i> repetitions of the preceding expression, inclusive<BR>
<code>{<i>m</i>,}</code> matches <i>m</i> or more repetitions of the preceding expression<BR>
</blockquote>
<li>
<B><I>Stingy (Minimal) Matching</I></B><BR>

If a repeating operator (above) is immediately followed by a
<code>?</code>, the repeating operator will stop at the smallest
number of repetitions that can complete the rest of the match.<BR>

</UL>
<P>
<B>Unsupported Syntax</B>
<BR>

Some flavors of regular expression utilities support additional escape
sequences, and this is not meant to be an exhaustive list.  In the
future, <code>gnu.regexp</code> may support some or all of the
following:<BR>

<blockquote>
<code>(?=<i>...</i>)</code> positive lookahead operator (Perl5)<BR>
<code>(?!<i>...</i>)</code> negative lookahead operator (Perl5)<BR>
<code>(?<i>mods</i>)</code> inlined compilation/execution modifiers (Perl5)<BR>
<code>\G</code> end of previous match (Perl5)<BR>
<code>\b</code> word break positional anchor (Perl5)<BR>
<code>\B</code> non-word break positional anchor (Perl5)<BR>
<code>\&lt;</code> start of word positional anchor (egrep)<BR>
<code>\&gt;</code> end of word positional anchor (egrep)<BR>
<code>[.<i>symbol</i>.]</code> collating symbol in class expression (POSIX)<BR>
<code>[=<i>class</i>=]</code> equivalence class in class expression (POSIX)<BR>
</blockquote>

<P>
<B>Java Integration</B>
<BR>

In a Java environment, a regular expression operates on a string of
Unicode characters, represented either as an instance of
<code>java.lang.String</code> or as an array of the primitive
<code>char</code> type.  This means that the unit of matching is a
Unicode character, not a single byte.  Generally this will not present
problems in a Java program, because Java takes pains to ensure that
all textual data uses the Unicode standard.

<P>

Because Java string processing takes care of certain escape sequences,
they are not implemented in <code>gnu.regexp</code>.  You should be
aware that the following escape sequences are handled by the Java
compiler if found in the Java source:<BR>

<blockquote>
<code>\b</code> backspace<BR>
<code>\f</code> form feed<BR>
<code>\n</code> newline<BR>
<code>\r</code> carriage return<BR>
<code>\t</code> horizontal tab<BR>
<code>\"</code> double quote<BR>
<code>\'</code> single quote<BR>
<code>\\</code> backslash<BR>
<code>\<i>xxx</i></code> character, in octal (000-377)<BR>
<code>\u<i>xxxx</i></code> Unicode character, in hexadecimal (0000-FFFF)<BR>
</blockquote>

In addition, note that the <code>\u</code> escape sequences are
meaningful anywhere in a Java program, not merely within a singly- or
doubly-quoted character string, and are converted prior to any of the
other escape sequences.  For example, the line <BR>

<code>gnu.regexp.RE exp = new gnu.regexp.RE("\u005cn");</code><BR>

would be converted by first replacing <code>\u005c</code> with a
backslash, then converting <code>\n</code> to a newline.  By the time
the RE constructor is called, it will be passed a String object
containing only the Unicode newline character.

<P>

The POSIX character classes (above), and the equivalent shorthand
escapes (<code>\d</code>, <code>\w</code> and the like) are
implemented to use the <code>java.lang.Character</code> static
functions whenever possible.  For example, <code>\w</code> and
<code>[:alnum:]</code> (the latter only from within a class
expression) will invoke the Java function
<code>Character.isLetterOrDigit()</code> when executing.  It is
<i>always</i> better to use the POSIX expressions than a range such as
<code>[a-zA-Z0-9]</code>, because the latter will not match any letter
characters in non-ISO 9660 encodings (for example, the umlaut
character, "<code>&uuml;</code>").

<P>
<B>Reference Material</B>
<BR>
<UL>
<LI><B><I>Print Books and Publications</I></B><BR>
Friedl, Jeffrey E.F., <I>Mastering Regular Expressions</I>. O'Reilly &amp; Associates, Inc., Sebastopol, California, 1997.<BR>
<P>
<LI><B><I>Software Manuals and Guides</I></B><BR>
Berry, Karl and Hargreaves, Kathryn A., <A HREF="http://www.cs.utah.edu/csinfo/texinfo/regex/regex_toc.html">GNU Info Regex Manual Edition 0.12a</A>, 19 September 1992.<BR>
<code>perlre(1)</code> man page (Perl Programmer's Reference Guide)<BR>
<code>regcomp(3)</code> man page (GNU C)<BR>
<code>gawk(1)</code> man page (GNU utilities)<BR>
<code>sed(1)</code> man page (GNU utilities)<BR>
<code>ed(1)</code> man page (GNU utilities)<BR>
<code>grep(1)</code> man page (GNU utilities)<BR>
<code>regexp(n)</code> and <code>regsub(n)</code> man pages (TCL)<BR>
</UL>

<P>
<B>Notes</B>
<BR>
<SUP><A NAME="note1">1</A></SUP> but see the REG_NOTBOL and REG_MULTILINE flags<BR>
<SUP><A NAME="note2">2</A></SUP> but see the REG_NOTEOL and REG_MULTILINE flags<BR>
<SUP><A NAME="note3">3</A></SUP> but see the REG_MULTILINE flag<BR>
<P>
<FONT SIZE="-1">
<A HREF="index.html">[gnu.regexp]</A>
<A HREF="changes.html">[change history]</A>
<A HREF="api/packages.html">[api documentation]</A>
<A HREF="reapplet.html">[test applet]</A>
<A HREF="credits.html">[credits]</A>
</FONT>

</BODY>
</HTML>