% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/search.R
\name{about_search_charclass}
\alias{about_search_charclass}
\alias{search_charclass}
\alias{stringi-search-charclass}
\title{Character Classes in \pkg{stringi}}
\description{
Here we describe how character classes (sets) can be specified
in the \pkg{stringi} package. These are useful for defining
search patterns (note that the \pkg{ICU} regex engine uses the same
scheme for denoting character classes) or, e.g.,
generating random code points with \code{\link{stri_rand_strings}}.
}
\details{
All \code{stri_*_charclass} functions in \pkg{stringi} perform
operations based on a search for single characters (i.e., Unicode code points).
You may obtain the same results using regular expressions
(see \link{about_search_regex}).
However, the functions described here aim to be faster.

Character classes are defined using \pkg{ICU}'s \code{UnicodeSet}
patterns. Below we briefly summarize their syntax.
For more details refer to the bibliographic References below.
}
\section{\code{UnicodeSet} patterns}{


A \code{UnicodeSet} represents a subset of Unicode code points
(recall that \pkg{stringi} converts strings in your native encoding
to Unicode automatically). Legal code points are U+0000 to U+10FFFF,
inclusive.

Patterns consist either of a series of characters bounded by
square brackets
(such patterns follow a syntax similar to that employed
by regular expression character classes)
or of Perl-like Unicode property set specifiers.

\code{[]} denotes an empty set, \code{[a]} --
a set consisting of character ``a'',
\code{[\\u0105]} -- a set with character U+0105,
and \code{[abc]} -- a set with ``a'', ``b'', and ``c''.

\code{[a-z]} denotes a set consisting of characters
``a'' through ``z'' inclusively, in Unicode code point order.

Some set-theoretic operations are available.
\code{^} denotes the complement, e.g., \code{[^a-z]} contains
all characters but ``a'' through ``z''.
Moreover, \code{[[pat1][pat2]]},
\code{[[pat1]\&[pat2]]}, and \code{[[pat1]-[pat2]]}
denote union, intersection, and asymmetric difference of sets
specified by \code{pat1} and \code{pat2}, respectively.

Note that all white spaces are ignored unless they are quoted or escaped
with a backslash
(white spaces can be freely used for clarity, as \code{[a c d-f m]}
means the same as \code{[acd-fm]}).
\pkg{stringi} does not allow including multi-character strings
(see \code{UnicodeSet} API documentation).
Also, empty string patterns are disallowed.

Any character may be preceded by
a backslash in order to remove its special meaning.

A malformed pattern always results in an error.

Set expressions at a glance
(according to \url{https://unicode-org.github.io/icu/userguide/strings/regexp.html}):


Some examples:

\describe{
\item{\code{[abc]}}{Match any of the characters a, b or c.}
\item{\code{[^abc]}}{Negation -- match any character except a, b or c.}
\item{\code{[A-M]}}{Range -- match any character from A to M. The characters
   to include are determined by Unicode code point ordering.}
\item{\code{[\\u0000-\\U0010ffff]}}{Range -- match all characters.}
\item{\code{[\\p{Letter}]} or \code{[\\p{General_Category=Letter}]} or \code{[\\p{L}]}}{
   Characters with Unicode Category = Letter. All forms shown are equivalent.}
\item{\code{[\\P{Letter}]}}{Negated property
   (note the uppercase \code{\\P}) -- match everything except letters.}
\item{\code{[\\p{numeric_value=9}]}}{Match all numbers with a numeric value of 9.
   Any Unicode Property may be used in set expressions.}
\item{\code{[\\p{Letter}&\\p{script=cyrillic}]}}{Set
    intersection -- match the set of all Cyrillic letters.}
\item{\code{[\\p{Letter}-\\p{script=latin}]}}{Set difference --
   match all non-Latin letters.}
\item{\code{[[a-z][A-Z][0-9]]} or \code{[a-zA-Z0-9]}}{Implicit union of
   sets -- match ASCII letters and digits (the two forms are equivalent).}
\item{\code{[:script=Greek:]}}{Alternative POSIX-like syntax for properties --
   equivalent to \code{\\p{script=Greek}}.}
}
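
The following is a minimal sketch of how such patterns behave in practice,
assuming that \pkg{stringi} is attached. The input strings are made up for
illustration, and the expected results are given in the comments:

\preformatted{
library(stringi)
x <- "Hello, World 2024!"

# merge=TRUE (the default in the extract functions) glues adjacent matches
stri_extract_all_charclass(x, "[a-z]")              # list(c("ello", "orld"))
stri_count_charclass(x, "[0-9]")                    # 4
stri_replace_all_charclass(x, "[^a-zA-Z0-9]", "_")  # "Hello__World_2024_"

# set difference and intersection
stri_extract_all_charclass("abcdef", "[[a-f]-[c-d]]")  # list(c("ab", "ef"))
stri_extract_all_charclass("abcdef", "[[a-f]&[c-z]]")  # list("cdef")
}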
}

\section{Unicode properties}{


Unicode property sets are specified with a POSIX-like syntax,
e.g., \code{[:Letter:]},
or with an (extended) Perl-style syntax, e.g., \code{\\p{L}}.
The complements of the above sets are
\code{[:^Letter:]} and \code{\\P{L}}, respectively.

The names are normalized before matching
(for example, the match is case-insensitive).
Moreover, many names have short aliases.

Among predefined Unicode properties we find, e.g.:
\itemize{
\item Unicode General Categories, e.g., \code{Lu} for uppercase letters,
\item Unicode Binary Properties, e.g., \code{WHITE_SPACE},
}
and many more (including Unicode scripts).

Each property provides access to the large and comprehensive
Unicode Character Database.
Generally, the list of properties available in \pkg{ICU}
is not well-documented. Please refer to the References section
for some links.

Please note that some classes overlap without being identical.
For example, General Category \code{Z} (separators) and Binary Property
\code{WHITE_SPACE} match different character sets.
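
Such differences can be checked directly. The sketch below uses arbitrarily
chosen test characters and relies on the fact that the horizontal tab,
U+0009, is a control code (\code{Cc}) that nevertheless has
the \code{WHITE_SPACE} property. It also compares the POSIX-like and
Perl-style forms together with their complements:

\preformatted{
library(stringi)
stri_detect_charclass("\\t", "\\\\p{Z}")            # FALSE
stri_detect_charclass("\\t", "\\\\p{WHITE_SPACE}")  # TRUE
stri_detect_charclass(" ", "\\\\p{Z}")             # TRUE

# U+0105 is a lowercase letter
stri_detect_charclass("\\u0105",
    c("[:Ll:]", "\\\\p{Ll}", "[:^Ll:]", "\\\\P{Ll}"))
# TRUE TRUE FALSE FALSE
}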
}

\section{Unicode General Categories}{


The Unicode General Category property of a code point provides the most
general classification of that code point.
Each code point falls into one and only one Category.

\describe{
 \item{\code{Cc}}{a C0 or C1 control code.}
 \item{\code{Cf}}{a format control character.}
 \item{\code{Cn}}{a reserved unassigned code point or a non-character.}
 \item{\code{Co}}{a private-use character.}
 \item{\code{Cs}}{a surrogate code point.}
 \item{\code{Lc}}{the union of Lu, Ll, Lt.}
 \item{\code{Ll}}{a lowercase letter.}
 \item{\code{Lm}}{a modifier letter.}
 \item{\code{Lo}}{other letters, including syllables and ideographs.}
 \item{\code{Lt}}{a digraphic character, with the first part uppercase.}
 \item{\code{Lu}}{an uppercase letter.}
 \item{\code{Mc}}{a spacing combining mark (positive advance width).}
 \item{\code{Me}}{an enclosing combining mark.}
 \item{\code{Mn}}{a non-spacing combining mark (zero advance width).}
 \item{\code{Nd}}{a decimal digit.}
 \item{\code{Nl}}{a letter-like numeric character.}
 \item{\code{No}}{a numeric character of other type.}
 \item{\code{Pd}}{a dash or hyphen punctuation mark.}
 \item{\code{Ps}}{an opening punctuation mark (of a pair).}
 \item{\code{Pe}}{a closing punctuation mark (of a pair).}
 \item{\code{Pc}}{a connecting punctuation mark, like a tie.}
 \item{\code{Po}}{a punctuation mark of other type.}
 \item{\code{Pi}}{an initial quotation mark.}
 \item{\code{Pf}}{a final quotation mark.}
 \item{\code{Sm}}{a symbol of mathematical use.}
 \item{\code{Sc}}{a currency sign.}
 \item{\code{Sk}}{a non-letter-like modifier symbol.}
 \item{\code{So}}{a symbol of other type.}
 \item{\code{Zs}}{a space character (of non-zero width).}
 \item{\code{Zl}}{U+2028 LINE SEPARATOR only.}
 \item{\code{Zp}}{U+2029 PARAGRAPH SEPARATOR only.}
 \item{\code{C} }{the union of Cc, Cf, Cs, Co, Cn.}
 \item{\code{L} }{the union of Lu, Ll, Lt, Lm, Lo.}
 \item{\code{M} }{the union of Mn, Mc, Me.}
 \item{\code{N} }{the union of Nd, Nl, No.}
 \item{\code{P} }{the union of Pc, Pd, Ps, Pe, Pi, Pf, Po.}
 \item{\code{S} }{the union of Sm, Sc, Sk, So.}
 \item{\code{Z} }{the union of Zs, Zl, Zp.}
}
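
As an illustration only (the input string is made up), the categories above
can be used to count the code points of each kind. Because every code point
belongs to exactly one category, a one-letter class such as \code{L}
should give the same count as the explicit union of its members:

\preformatted{
library(stringi)
x <- "Abc 123"
stri_count_charclass(x, c("\\\\p{Lu}", "\\\\p{Ll}", "\\\\p{Nd}", "\\\\p{Zs}"))
# 1 2 3 1

stri_count_charclass(x,
    c("\\\\p{L}", "[\\\\p{Lu}\\\\p{Ll}\\\\p{Lt}\\\\p{Lm}\\\\p{Lo}]"))
# 3 3
}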
}

\section{Unicode Binary Properties}{


A character may have several Binary Properties at the same time.

Here is a comprehensive list of supported Binary Properties:

\describe{
  \item{\code{ALPHABETIC}     }{alphabetic character.}
  \item{\code{ASCII_HEX_DIGIT}}{a character matching the \code{[0-9A-Fa-f]} charclass.}
  \item{\code{BIDI_CONTROL}   }{a format control character that has a specific function
                             in the Bidi (bidirectional text) Algorithm.}
  \item{\code{BIDI_MIRRORED}  }{a character that may change display in right-to-left text.}
  \item{\code{DASH}           }{a kind of a dash character.}
  \item{\code{DEFAULT_IGNORABLE_CODE_POINT}}{characters that are ignorable in most
                               text processing activities,
                               e.g., <2060..206F, FFF0..FFFB, E0000..E0FFF>.}
  \item{\code{DEPRECATED}     }{a deprecated character according
          to the current Unicode standard (the usage of deprecated characters
          is strongly discouraged).}
  \item{\code{DIACRITIC}      }{a character that linguistically modifies
             the meaning of another character to which it applies.}
  \item{\code{EXTENDER}       }{a character that extends the value
                             or shape of a preceding alphabetic character,
                             e.g., a length and iteration mark.}
  \item{\code{HEX_DIGIT}      }{a character commonly
                            used for hexadecimal numbers,
                            see also \code{ASCII_HEX_DIGIT}.}
  \item{\code{HYPHEN}}{a dash used to mark connections between
              pieces of words, plus the Katakana middle dot.}
  \item{\code{ID_CONTINUE}}{a character that can continue an identifier,
                     \code{ID_START}+\code{Mn}+\code{Mc}+\code{Nd}+\code{Pc}.}
  \item{\code{ID_START}}{a character that can start an identifier,
                 \code{Lu}+\code{Ll}+\code{Lt}+\code{Lm}+\code{Lo}+\code{Nl}.}
  \item{\code{IDEOGRAPHIC}}{a CJKV (Chinese-Japanese-Korean-Vietnamese)
               ideograph.}
  \item{\code{LOWERCASE}}{...}
  \item{\code{MATH}}{...}
  \item{\code{NONCHARACTER_CODE_POINT}}{...}
  \item{\code{QUOTATION_MARK}}{...}
  \item{\code{SOFT_DOTTED}}{a character with a ``soft dot'', like i or j,
such that an accent placed on this character causes the dot to disappear.}
  \item{\code{TERMINAL_PUNCTUATION}}{a punctuation character that generally
marks the end of textual units.}
  \item{\code{UPPERCASE}}{...}
  \item{\code{WHITE_SPACE}}{a space character or TAB or CR or LF or ZWSP or ZWNBSP.}
  \item{\code{CASE_SENSITIVE}}{...}
  \item{\code{POSIX_ALNUM}}{...}
  \item{\code{POSIX_BLANK}}{...}
  \item{\code{POSIX_GRAPH}}{...}
  \item{\code{POSIX_PRINT}}{...}
  \item{\code{POSIX_XDIGIT}}{...}
  \item{\code{CASED}}{...}
  \item{\code{CASE_IGNORABLE}}{...}
  \item{\code{CHANGES_WHEN_LOWERCASED}}{...}
  \item{\code{CHANGES_WHEN_UPPERCASED}}{...}
  \item{\code{CHANGES_WHEN_TITLECASED}}{...}
  \item{\code{CHANGES_WHEN_CASEFOLDED}}{...}
  \item{\code{CHANGES_WHEN_CASEMAPPED}}{...}
  \item{\code{CHANGES_WHEN_NFKC_CASEFOLDED}}{...}
  \item{\code{EMOJI}}{Since ICU 57}
  \item{\code{EMOJI_PRESENTATION}}{Since ICU 57}
  \item{\code{EMOJI_MODIFIER}}{Since ICU 57}
  \item{\code{EMOJI_MODIFIER_BASE}}{Since ICU 57}
}
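
A brief sketch of querying Binary Properties, with test characters picked
only for illustration. The last call shows that a single character
(here, the letter a) can have several properties at once:

\preformatted{
library(stringi)
stri_detect_charclass(c("F", "G"), "\\\\p{ASCII_HEX_DIGIT}")  # TRUE FALSE
stri_detect_charclass(c("a", "A"), "\\\\p{LOWERCASE}")        # TRUE FALSE

stri_detect_charclass("a",
    c("\\\\p{ALPHABETIC}", "\\\\p{LOWERCASE}", "\\\\p{ASCII_HEX_DIGIT}"))
# TRUE TRUE TRUE
}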
}

\section{POSIX Character Classes}{


Avoid using POSIX character classes,
e.g., \code{[:punct:]}. The ICU User Guide (see below)
states that in general they are not well-defined, so you may end up
with something different than you expect.

In particular, in POSIX-like regex engines, \code{[:punct:]} stands for
the character class corresponding to the \code{ispunct()} classification
function (check out \code{man 3 ispunct} on UNIX-like systems).
According to ISO/IEC 9899:1990 (ISO C90), the \code{ispunct()} function
tests for any printing character except for space or a character
for which \code{isalnum()} is true. However, in a POSIX setting,
the details of which characters belong to which class depend
on the current locale. Hence, the \code{[:punct:]} class does not lead
to portable code (again, in POSIX-like regex engines).

Note that a POSIX flavor of \code{[:punct:]} is more like
\code{[\\p{P}\\p{S}]} in \pkg{ICU}. You have been warned.
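
The gap can be seen with a small sketch. The three test characters are
arbitrary: the comma is a punctuation mark of other type (\code{Po}),
whereas the plus sign and the dollar sign are symbols (\code{Sm} and
\code{Sc}, respectively), so they are matched only when the Symbol
category is included:

\preformatted{
library(stringi)
x <- c(",", "+", "$")
stri_detect_charclass(x, "\\\\p{P}")             # TRUE FALSE FALSE
stri_detect_charclass(x, "[\\\\p{P}\\\\p{S}]")    # TRUE TRUE TRUE
}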
}

\references{
\emph{The Unicode Character Database} -- Unicode Standard Annex #44,
\url{https://www.unicode.org/reports/tr44/}

\emph{UnicodeSet} -- ICU User Guide,
\url{https://unicode-org.github.io/icu/userguide/strings/unicodeset.html}

\emph{Properties} -- ICU User Guide,
\url{https://unicode-org.github.io/icu/userguide/strings/properties.html}

\emph{C/POSIX Migration} -- ICU User Guide,
\url{https://unicode-org.github.io/icu/userguide/icu/posix.html}

\emph{Unicode Script Data}, \url{https://www.unicode.org/Public/UNIDATA/Scripts.txt}

\emph{icu::UnicodeSet Class Reference} -- ICU4C API Documentation,
\url{https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1UnicodeSet.html}
}
\seealso{
The official online manual of \pkg{stringi} at \url{https://stringi.gagolewski.com/}

Gagolewski M., \pkg{stringi}: Fast and portable character string processing in R, \emph{Journal of Statistical Software} 103(2), 2022, 1-59, \doi{10.18637/jss.v103.i02}

Other search_charclass: 
\code{\link{about_search}},
\code{\link{stri_trim_both}()}

Other stringi_general_topics: 
\code{\link{about_arguments}},
\code{\link{about_encoding}},
\code{\link{about_locale}},
\code{\link{about_search_boundaries}},
\code{\link{about_search_coll}},
\code{\link{about_search_fixed}},
\code{\link{about_search_regex}},
\code{\link{about_search}},
\code{\link{about_stringi}}
}
\concept{search_charclass}
\concept{stringi_general_topics}
\author{
\href{https://www.gagolewski.com/}{Marek Gagolewski} and other contributors
}