File: pattern.matching.xml

package info (click to toggle)
php-doc 20241205~git.dfcbb86%2Bdfsg-1
links: PTS, VCS
area: main
in suites: trixie
size: 70,956 kB
sloc: xml: 968,269; php: 23,883; javascript: 671; sh: 177; makefile: 37
file content (411 lines) | stat: -rw-r--r-- 12,960 bytes
<?xml version="1.0" encoding="utf-8"?>
<chapter xml:id="parle.pattern.matching" xmlns="http://docbook.org/ns/docbook">
 <title>Parle pattern matching</title>
 <titleabbrev>Pattern matching</titleabbrev>
 <para>
  Parle supports regex matching similar to flex.
  Also supported are the following POSIX character sets:
  <simplelist type="inline">
   <member><literal>[:alnum:]</literal></member>
   <member><literal>[:alpha:]</literal></member>
   <member><literal>[:blank:]</literal></member>
   <member><literal>[:cntrl:]</literal></member>
   <member><literal>[:digit:]</literal></member>
   <member><literal>[:graph:]</literal></member>
   <member><literal>[:lower:]</literal></member>
   <member><literal>[:print:]</literal></member>
   <member><literal>[:punct:]</literal></member>
   <member><literal>[:space:]</literal></member>
   <member><literal>[:upper:]</literal></member>
   <member><literal>[:xdigit:]</literal></member>
  </simplelist>
  .
 </para>
 <para>
  The Unicode character classes are currently not enabled by default, pass --enable-parle-utf32 to make them available.
  A particular encoding can be mapped with a correctly constructed regex.
  For example, to match the EURO symbol encoded in UTF-8, the regular expression <literal>[\xe2][\x82][\xac]</literal> can be used.
  The pattern for an UTF-8 encoded string could be <literal>[ -\x7f]{+}[\x80-\xbf]{+}[\xc2-\xdf]{+}[\xe0-\xef]{+}[\xf0-\xff]+</literal>.
 </para>

 <section xml:id="parle.regex.chars" annotations="chunk:false">
  <title>Character representations</title>
  <para>
   <table>
    <title>Character representations</title>
    <tgroup cols="2">
     <thead>
      <row>
       <entry>Sequence</entry><entry>Description</entry>
      </row>
     </thead>
     <tbody>
      <row>
       <entry>\a</entry><entry>Alert (bell).</entry>
      </row>
      <row>
       <entry>\b</entry><entry>Backspace.</entry>
      </row>
      <row>
       <entry>\e</entry><entry>ESC character, \x1b.</entry>
      </row>
      <row>
       <entry>\n</entry><entry>Newline.</entry>
      </row>
      <row>
       <entry>\r</entry><entry>Carriage return.</entry>
      </row>
      <row>
       <entry>\f</entry><entry>Form feed, \x0c.</entry>
      </row>
      <row>
       <entry>\t</entry><entry>Horizontal tab, \x09.</entry>
      </row>
      <row>
       <entry>\v</entry><entry>Vertical tab, \x0b.</entry>
      </row>
      <row>
       <entry>\oct</entry><entry>Character specified by a three-digit octal code.</entry>
      </row>
      <row>
       <entry>\xhex</entry><entry>Character specified by a hex code.</entry>
      </row>
      <row>
       <entry>\cchar</entry><entry>Named control character.</entry>
      </row>
     </tbody>
    </tgroup>
   </table>
  </para>
 </section>
 <section xml:id="parle.regex.charclass" annotations="chunk:false">
  <title>Character classes</title>
  <para>
   <table>
    <title>Character classes</title>
    <tgroup cols="2">
     <thead>
      <row>
       <entry>Sequence</entry><entry>Description</entry>
      </row>
     </thead>
     <tbody>
      <row>
       <entry>[...]</entry><entry>A single character listed or contained within a listed range. Ranges can be combined with the <literal>{+}</literal> and <literal>{-}</literal> operators. For example <literal>[a-z]{+}[0-9]</literal> is the same as <literal>[0-9a-z]</literal> and <literal>[a-z]{-}[aeiou]</literal> is the same as <literal>[b-df-hj-np-tv-z]</literal>.</entry>
      </row>
      <row>
       <entry>[^...]</entry><entry>A single character not listed and not contained within a listed range.</entry>
      </row>
      <row>
       <entry>.</entry><entry>Any character, default <literal>[^\n].</literal></entry>
      </row>
      <row>
       <entry>\d</entry><entry>Digit character, <literal>[0-9]</literal>.</entry>
      </row>
      <row>
       <entry>\D</entry><entry>Non-digit character, <literal>[^0-9]</literal>.</entry>
      </row>
      <row>
       <entry>\s</entry><entry>White space character, <literal>[ \t\n\r\f\v]</literal>.</entry>
      </row>
      <row>
       <entry>\S</entry><entry>Non-white space character, <literal>[^ \t\n\r\f\v]</literal>.</entry>
      </row>
      <row>
       <entry>\w</entry><entry>Word character, <literal>[a-zA-Z0-9_]</literal>.</entry>
      </row>
      <row>
       <entry>\W</entry><entry>Non-word character, <literal>[^a-zA-Z0-9_]</literal>.</entry>
      </row>
     </tbody>
    </tgroup>
   </table>
  </para>
 </section>
 <section xml:id="parle.regex.unicodecharclass" annotations="chunk:false">
  <title>Unicode character classes</title>
  <para>
   <table>
    <title>Unicode character classes</title>
    <tgroup cols="2">
     <thead>
      <row>
       <entry>Sequence</entry><entry>Description</entry>
      </row>
     </thead>
     <tbody>
      <row>
       <entry>\p{C}</entry><entry>Other.</entry>
      </row>
      <row>
       <entry>\p{Cc}</entry><entry>Other, control.</entry>
      </row>
      <row>
       <entry>\p{Cf}</entry><entry>Other, format.</entry>
      </row>
      <row>
       <entry>\p{Co}</entry><entry>Other, private use.</entry>
      </row>
      <row>
       <entry>\p{Cs}</entry><entry>Other, surrogate.</entry>
      </row>
      <row>
       <entry>\p{L}</entry><entry>Letter.</entry>
      </row>
      <row>
       <entry>\p{LC}</entry><entry>Letter, cased.</entry>
      </row>
      <row>
       <entry>\p{Ll}</entry><entry>Letter, lowercase.</entry>
      </row>
      <row>
       <entry>\p{Lm}</entry><entry>Letter, modifier.</entry>
      </row>
      <row>
       <entry>\p{Lo}</entry><entry>Letter, other.</entry>
      </row>
      <row>
       <entry>\p{Lt}</entry><entry>Letter, titlecase.</entry>
      </row>
      <row>
       <entry>\p{Lu}</entry><entry>Letter, uppercase.</entry>
      </row>
      <row>
       <entry>\p{M}</entry><entry>Mark.</entry>
      </row>
      <row>
       <entry>\p{Mc}</entry><entry>Mark, space combining.</entry>
      </row>
      <row>
       <entry>\p{Me}</entry><entry>Mark, enclosing.</entry>
      </row>
      <row>
       <entry>\p{Mn}</entry><entry>Mark, nonspacing.</entry>
      </row>
      <row>
       <entry>\p{N}</entry><entry>Number.</entry>
      </row>
      <row>
       <entry>\p{Nd}</entry><entry>Number, decimal digit.</entry>
      </row>
      <row>
       <entry>\p{Nl}</entry><entry>Number, letter.</entry>
      </row>
      <row>
       <entry>\p{No}</entry><entry>Number, other.</entry>
      </row>
      <row>
       <entry>\p{P}</entry><entry>Punctuation.</entry>
      </row>
      <row>
       <entry>\p{Pc}</entry><entry>Punctiation, connector.</entry>
      </row>
      <row>
       <entry>\p{Pd}</entry><entry>Punctuation, dash.</entry>
      </row>
      <row>
       <entry>\p{Pe}</entry><entry>Punctuation, close.</entry>
      </row>
      <row>
       <entry>\p{Pf}</entry><entry>Punctuation, final quote.</entry>
      </row>
      <row>
       <entry>\p{Pi}</entry><entry>Punctuation, initial quote.</entry>
      </row>
      <row>
       <entry>\p{Po}</entry><entry>Punctuation, other.</entry>
      </row>
      <row>
       <entry>\p{Ps}</entry><entry>Punctuation, open.</entry>
      </row>
      <row>
       <entry>\p{S}</entry><entry>Symbol.</entry>
      </row>
      <row>
       <entry>\p{Sc}</entry><entry>Symbol, currency.</entry>
      </row>
      <row>
       <entry>\p{Sk}</entry><entry>Symbol, modifier.</entry>
      </row>
      <row>
       <entry>\p{Sm}</entry><entry>Symbol, math.</entry>
      </row>
      <row>
       <entry>\p{So}</entry><entry>Symbol, other.</entry>
      </row>
      <row>
       <entry>\p{Z}</entry><entry>Separator.</entry>
      </row>
      <row>
       <entry>\p{Zl}</entry><entry>Separator, line.</entry>
      </row>
      <row>
       <entry>\p{Zp}</entry><entry>Separator, paragraph.</entry>
      </row>
      <row>
       <entry>\p{Zs}</entry><entry>Separator, space.</entry>
      </row>
     </tbody>
    </tgroup>
   </table>
  </para>
  <para>
   These character classes are only available, if the option --enable-parle-utf32 was passed at the compilation time.
  </para>
 </section>
 <section xml:id="parle.regex.alternation" annotations="chunk:false">
  <title>Alternation and repetition</title>
  <para>
   <table>
    <title>Alternation and repetition</title>
    <tgroup cols="3">
     <thead>
      <row>
       <entry>Sequence</entry><entry>Greedy</entry><entry>Description</entry>
      </row>
     </thead>
     <tbody>
      <row>
       <entry>...|...</entry><entry>-</entry><entry>Try sub-patterns in alternation.</entry>
      </row>
      <row>
       <entry>*</entry><entry>yes</entry><entry>Match 0 or more times.</entry>
      </row>
      <row>
       <entry>+</entry><entry>yes</entry><entry>Match 1 or more times.</entry>
      </row>
      <row>
       <entry>?</entry><entry>yes</entry><entry>Match 0 or 1 times.</entry>
      </row>
      <row>
       <entry>{n}</entry><entry>no</entry><entry>Match exactly n times.</entry>
      </row>
      <row>
       <entry>{n,}</entry><entry>yes</entry><entry>Match at least n times.</entry>
      </row>
      <row>
       <entry>{n,m}</entry><entry>yes</entry><entry>Match at least n times but no more than m times.</entry>
      </row>
      <row>
       <entry>*?</entry><entry>no</entry><entry>Match 0 or more times.</entry>
      </row>
      <row>
       <entry>+?</entry><entry>no</entry><entry>Match 1 or more times.</entry>
      </row>
      <row>
       <entry>??</entry><entry>no</entry><entry>Match 0 or 1 times.</entry>
      </row>
      <row>
       <entry>{n,}?</entry><entry>no</entry><entry>Match at least n times.</entry>
      </row>
      <row>
       <entry>{n,m}?</entry><entry>no</entry><entry>Match at least n times but no more than m times.</entry>
      </row>
      <row>
       <entry>{MACRO}</entry><entry>-</entry><entry>Include the regex MACRO in the current regex.</entry>
      </row>
     </tbody>
    </tgroup>
   </table>
  </para>
 </section>
 <section xml:id="parle.regex.anchors" annotations="chunk:false">
  <title>Anchors</title>
  <para>
   <table>
    <title>Anchors</title>
    <tgroup cols="2">
     <thead>
      <row>
       <entry>Sequence</entry><entry>Description</entry>
      </row>
     </thead>
     <tbody>
      <row>
       <entry>^</entry><entry>Start of string or after a newline.</entry>
      </row>
      <row>
       <entry>$</entry><entry>End of string or before a newline.</entry>
      </row>
     </tbody>
    </tgroup>
   </table>
  </para>
 </section>
 <section xml:id="parle.regex.grouping" annotations="chunk:false">
  <title>Grouping</title>
  <para>
   <table>
    <title>Grouping</title>
    <tgroup cols="2">
     <thead>
      <row>
       <entry>Sequence</entry>
       <entry>Description</entry>
      </row>
     </thead>
     <tbody>
      <row>
       <entry>(...)</entry>
       <entry>Group a regular expression to override default operator precedence.</entry>
      </row>
      <row>
       <entry valign="top">(?r-s:pattern)</entry>
       <entry>
        <simpara>
         Apply option r and omit option s while interpreting pattern.
         Options may be zero or more of the characters i, s, or x.
        </simpara>
        <simpara>
         <literal>i</literal> means case-insensitive.
        </simpara>
        <simpara>
         <literal>-i</literal> means case-sensitive.
        </simpara>
        <simpara>
         <literal>s</literal> alters the meaning of <literal>.</literal> to match any character whatsoever.
        </simpara>
        <simpara>
         <literal>-s</literal> alters the meaning of <literal>.</literal> to match any character except <literal>\n</literal>.
        </simpara>
        <simpara>
         <literal>x</literal> ignores comments and whitespace in patterns.
         Whitespace is ignored unless it is backslash-escaped, contained within <literal>""s</literal>,
         or appears inside a character range.
        </simpara>
        <simpara>
         These options can be applied globally at the rules level by passing a combination of the bit flags to the lexer.
        </simpara>
       </entry>
      </row>
      <row>
       <entry>(?# comment )</entry>
       <entry>Omit everything within (). The first ) character encountered ends the pattern. It is not possible for the comment to contain a ) character. The comment may span lines.</entry>
      </row>
     </tbody>
    </tgroup>
   </table>
  </para>
 </section>
</chapter>

<!-- Keep this comment at the end of the file
Local variables:
mode: sgml
sgml-omittag:t
sgml-shorttag:t
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
indent-tabs-mode:nil
sgml-parent-document:nil
sgml-default-dtd-file:"~/.phpdoc/manual.ced"
sgml-exposed-tags:nil
sgml-local-catalogs:nil
sgml-local-ecat-files:nil
End:
vim600: syn=xml fen fdm=syntax fdl=2 si
vim: et tw=78 syn=sgml
vi: ts=1 sw=1
-->