% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/search.R
\name{about_search_charclass}
\alias{about_search_charclass}
\alias{search_charclass}
\alias{stringi-search-charclass}
\title{Character Classes in \pkg{stringi}}
\description{
Here we describe how character classes (sets) can be specified
in the \pkg{stringi} package. These are useful for defining
search patterns (note that the \pkg{ICU} regex engine uses the same
scheme for denoting character classes) or, e.g.,
generating random code points with \code{\link{stri_rand_strings}}.
}
\details{
All \code{stri_*_charclass} functions in \pkg{stringi} perform
search-based operations on single characters (i.e., Unicode code points).
You may obtain the same results using \link{about_search_regex}.
However, the \code{stri_*_charclass} functions aim to be faster.
Character classes are defined using \pkg{ICU}'s \code{UnicodeSet}
patterns. Below we briefly summarize their syntax.
For more details refer to the bibliographic References below.
}
\section{\code{UnicodeSet} patterns}{
A \code{UnicodeSet} represents a subset of Unicode code points
(recall that \pkg{stringi} converts strings in your native encoding
to Unicode automatically). Legal code points are U+0000 to U+10FFFF,
inclusive.
Patterns either consist of a series of characters bounded by
square brackets
(such patterns follow a syntax similar to that employed
by regular expression character classes)
or of Perl-like Unicode property set specifiers.
\code{[]} denotes an empty set, \code{[a]} --
a set consisting of character ``a'',
\code{[\\u0105]} -- a set with character U+0105,
and \code{[abc]} -- a set with ``a'', ``b'', and ``c''.
\code{[a-z]} denotes a set consisting of characters
``a'' through ``z'' inclusive, in Unicode code point order.
Some set-theoretic operations are available.
\code{^} denotes the complement, e.g., \code{[^a-z]} contains
all characters but ``a'' through ``z''.
Moreover, \code{[[pat1][pat2]]},
\code{[[pat1]&[pat2]]}, and \code{[[pat1]-[pat2]]}
denote union, intersection, and asymmetric difference of sets
specified by \code{pat1} and \code{pat2}, respectively.
Note that all white spaces are ignored unless they are quoted or
preceded by a backslash
(white spaces can be freely used for clarity, as \code{[a c d-f m]}
means the same as \code{[acd-fm]}).
\pkg{stringi} does not allow including multi-character strings
(see \code{UnicodeSet} API documentation).
Also, empty string patterns are disallowed.
Any character may be preceded by
a backslash in order to remove its special meaning.
A malformed pattern always results in an error.
Set expressions at a glance
(according to \url{https://unicode-org.github.io/icu/userguide/strings/regexp.html}):
Some examples:
\describe{
\item{\code{[abc]}}{Match any of the characters a, b or c.}
\item{\code{[^abc]}}{Negation -- match any character except a, b or c.}
\item{\code{[A-M]}}{Range -- match any character from A to M. The characters
to include are determined by Unicode code point ordering.}
\item{\code{[\\u0000-\\U0010ffff]}}{Range -- match all characters.}
\item{\code{[\\p{Letter}]} or \code{[\\p{General_Category=Letter}]} or \code{[\\p{L}]}}{
Characters with Unicode Category = Letter. All forms shown are equivalent.}
\item{\code{[\\P{Letter}]}}{Negated property
(Note the upper case \code{\\P}) -- match everything except Letters.}
\item{\code{[\\p{numeric_value=9}]}}{Match all numbers with a numeric value of 9.
Any Unicode Property may be used in set expressions.}
\item{\code{[\\p{Letter}&\\p{script=cyrillic}]}}{Set
intersection -- match the set of all Cyrillic letters.}
\item{\code{[\\p{Letter}-\\p{script=latin}]}}{Set difference --
match all non-Latin letters.}
\item{\code{[[a-z][A-Z][0-9]]} or \code{[a-zA-Z0-9]}}{Implicit union of
sets -- match ASCII letters and digits (the two forms are equivalent).}
\item{\code{[:script=Greek:]}}{Alternative POSIX-like syntax for properties --
equivalent to \code{\\p{script=Greek}}.}
}
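The operations above can be tried out with a short sketch. The input
strings below are made up for illustration, and only bracket-based
patterns are used so that no backslash escaping is involved:
\preformatted{library(stringi)
x <- "ab-12 CD"  # made-up input
stri_count_charclass(x, "[0-9]")                     # number of ASCII digits
stri_extract_all_charclass(x, "[a-z]")               # runs of lowercase ASCII letters
stri_replace_all_charclass(x, "[^a-zA-Z0-9]", "_")   # complement of an implicit union
stri_count_charclass("stringi", "[[a-z]&[^aeiou]]")  # intersection
stri_count_charclass("stringi", "[[a-z]-[aeiou]]")   # asymmetric difference
}
The last two calls should both count the lowercase ASCII consonants
in ``stringi''.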
}
\section{Unicode properties}{
Unicode property sets are specified with a POSIX-like syntax,
e.g., \code{[:Letter:]},
or with an (extended) Perl-style syntax, e.g., \code{\\p{L}}.
The complements of the above sets are
\code{[:^Letter:]} and \code{\\P{L}}, respectively.
The names are normalized before matching
(for example, the match is case-insensitive).
Moreover, many names have short aliases.
Among predefined Unicode properties we find, e.g.:
\itemize{
\item Unicode General Categories, e.g., \code{Lu} for uppercase letters,
\item Unicode Binary Properties, e.g., \code{WHITE_SPACE},
}
and many more (including Unicode scripts).
Each property provides access to the large and comprehensive
Unicode Character Database.
Generally, the list of properties available in \pkg{ICU}
is not well-documented. Please refer to the References section
for some links.
Please note that some classes may overlap without being identical.
For example, General Category \code{Z} (some space) and Binary Property
\code{WHITE_SPACE} match different character sets.
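To illustrate how these two classes differ, here is a minimal sketch
(the test character is our own choice): the TAB character has the
\code{WHITE_SPACE} property, but its General Category is \code{Cc},
not \code{Z}. The last line uses the POSIX-like complement syntax.
\preformatted{library(stringi)
tab <- intToUtf8(9L)  # U+0009, the TAB character
stri_detect_charclass(tab, "[:WHITE_SPACE:]")  # Binary Property
stri_detect_charclass(tab, "[:Z:]")            # General Category
stri_count_charclass("a1b2", "[:^Letter:]")    # complement: counts the two digits
}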
}
\section{Unicode General Categories}{
The Unicode General Category property of a code point provides the most
general classification of that code point.
Each code point falls into one and only one Category.
\describe{
\item{\code{Cc}}{a C0 or C1 control code.}
\item{\code{Cf}}{a format control character.}
\item{\code{Cn}}{a reserved unassigned code point or a non-character.}
\item{\code{Co}}{a private-use character.}
\item{\code{Cs}}{a surrogate code point.}
\item{\code{Lc}}{the union of Lu, Ll, Lt.}
\item{\code{Ll}}{a lowercase letter.}
\item{\code{Lm}}{a modifier letter.}
\item{\code{Lo}}{other letters, including syllables and ideographs.}
\item{\code{Lt}}{a digraphic character, with the first part uppercase.}
\item{\code{Lu}}{an uppercase letter.}
\item{\code{Mc}}{a spacing combining mark (positive advance width).}
\item{\code{Me}}{an enclosing combining mark.}
\item{\code{Mn}}{a non-spacing combining mark (zero advance width).}
\item{\code{Nd}}{a decimal digit.}
\item{\code{Nl}}{a letter-like numeric character.}
\item{\code{No}}{a numeric character of other type.}
\item{\code{Pd}}{a dash or hyphen punctuation mark.}
\item{\code{Ps}}{an opening punctuation mark (of a pair).}
\item{\code{Pe}}{a closing punctuation mark (of a pair).}
\item{\code{Pc}}{a connecting punctuation mark, like a tie.}
\item{\code{Po}}{a punctuation mark of other type.}
\item{\code{Pi}}{an initial quotation mark.}
\item{\code{Pf}}{a final quotation mark.}
\item{\code{Sm}}{a symbol of mathematical use.}
\item{\code{Sc}}{a currency sign.}
\item{\code{Sk}}{a non-letter-like modifier symbol.}
\item{\code{So}}{a symbol of other type.}
\item{\code{Zs}}{a space character (of non-zero width).}
\item{\code{Zl}}{U+2028 LINE SEPARATOR only.}
\item{\code{Zp}}{U+2029 PARAGRAPH SEPARATOR only.}
\item{\code{C} }{the union of Cc, Cf, Cs, Co, Cn.}
\item{\code{L} }{the union of Lu, Ll, Lt, Lm, Lo.}
\item{\code{M} }{the union of Mn, Mc, Me.}
\item{\code{N} }{the union of Nd, Nl, No.}
\item{\code{P} }{the union of Pc, Pd, Ps, Pe, Pi, Pf, Po.}
\item{\code{S} }{the union of Sm, Sc, Sk, So.}
\item{\code{Z} }{the union of Zs, Zl, Zp.}
}
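As a rough illustration (the input string is made up), General
Categories can be queried with the POSIX-like syntax. Because each code
point belongs to exactly one category, the counts over the seven major
categories should add up to the string length:
\preformatted{library(stringi)
x <- "Ab1, cD2!"  # made-up input
stri_count_charclass(x, "[:Lu:]")  # uppercase letters: A, D
stri_count_charclass(x, "[:Nd:]")  # decimal digits: 1, 2
stri_count_charclass(x, "[:P:]")   # punctuation: comma, exclamation mark
# the pattern argument is vectorized, giving one count per major category
major <- c("[:L:]", "[:M:]", "[:N:]", "[:P:]", "[:S:]", "[:Z:]", "[:C:]")
sum(stri_count_charclass(x, major)) == stri_length(x)
}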
}
\section{Unicode Binary Properties}{
Each character may have many Binary Properties at a time.
Here is a comprehensive list of supported Binary Properties:
\describe{
\item{\code{ALPHABETIC} }{alphabetic character.}
\item{\code{ASCII_HEX_DIGIT}}{a character matching the \code{[0-9A-Fa-f]} charclass.}
\item{\code{BIDI_CONTROL} }{a format control character with a specific function
in the Bidi (bidirectional text) Algorithm.}
\item{\code{BIDI_MIRRORED} }{a character that may change display in right-to-left text.}
\item{\code{DASH} }{a kind of dash character.}
\item{\code{DEFAULT_IGNORABLE_CODE_POINT}}{characters that are ignorable in most
text processing activities,
e.g., <2060..206F, FFF0..FFFB, E0000..E0FFF>.}
\item{\code{DEPRECATED} }{a deprecated character according
to the current Unicode standard (the usage of deprecated characters
is strongly discouraged).}
\item{\code{DIACRITIC} }{a character that linguistically modifies
the meaning of another character to which it applies.}
\item{\code{EXTENDER} }{a character that extends the value
or shape of a preceding alphabetic character,
e.g., length and iteration marks.}
\item{\code{HEX_DIGIT} }{a character commonly
used for hexadecimal numbers,
see also \code{ASCII_HEX_DIGIT}.}
\item{\code{HYPHEN}}{a dash used to mark connections between
pieces of words, plus the Katakana middle dot.}
\item{\code{ID_CONTINUE}}{a character that can continue an identifier,
\code{ID_START}+\code{Mn}+\code{Mc}+\code{Nd}+\code{Pc}.}
\item{\code{ID_START}}{a character that can start an identifier,
\code{Lu}+\code{Ll}+\code{Lt}+\code{Lm}+\code{Lo}+\code{Nl}.}
\item{\code{IDEOGRAPHIC}}{a CJKV (Chinese-Japanese-Korean-Vietnamese)
ideograph.}
\item{\code{LOWERCASE}}{...}
\item{\code{MATH}}{...}
\item{\code{NONCHARACTER_CODE_POINT}}{...}
\item{\code{QUOTATION_MARK}}{...}
\item{\code{SOFT_DOTTED}}{a character with a ``soft dot'', like i or j,
such that an accent placed on this character causes the dot to disappear.}
\item{\code{TERMINAL_PUNCTUATION}}{a punctuation character that generally
marks the end of textual units.}
\item{\code{UPPERCASE}}{...}
\item{\code{WHITE_SPACE}}{a space character or TAB or CR or LF, but not ZWSP or ZWNBSP.}
\item{\code{CASE_SENSITIVE}}{...}
\item{\code{POSIX_ALNUM}}{...}
\item{\code{POSIX_BLANK}}{...}
\item{\code{POSIX_GRAPH}}{...}
\item{\code{POSIX_PRINT}}{...}
\item{\code{POSIX_XDIGIT}}{...}
\item{\code{CASED}}{...}
\item{\code{CASE_IGNORABLE}}{...}
\item{\code{CHANGES_WHEN_LOWERCASED}}{...}
\item{\code{CHANGES_WHEN_UPPERCASED}}{...}
\item{\code{CHANGES_WHEN_TITLECASED}}{...}
\item{\code{CHANGES_WHEN_CASEFOLDED}}{...}
\item{\code{CHANGES_WHEN_CASEMAPPED}}{...}
\item{\code{CHANGES_WHEN_NFKC_CASEFOLDED}}{...}
\item{\code{EMOJI}}{Since ICU 57}
\item{\code{EMOJI_PRESENTATION}}{Since ICU 57}
\item{\code{EMOJI_MODIFIER}}{Since ICU 57}
\item{\code{EMOJI_MODIFIER_BASE}}{Since ICU 57}
}
}
\section{POSIX Character Classes}{
Avoid using POSIX character classes,
e.g., \code{[:punct:]}. The ICU User Guide (see below)
states that in general they are not well-defined, so you may end up
with something different than you expect.
In particular, in POSIX-like regex engines, \code{[:punct:]} stands for
the character class corresponding to the \code{ispunct()} classification
function (check out \code{man 3 ispunct} on UNIX-like systems).
According to ISO/IEC 9899:1990 (ISO C90), the \code{ispunct()} function
tests for any printing character except for space or a character
for which \code{isalnum()} is true. However, in a POSIX setting,
the details of which characters belong to which class depend
on the current locale. So the \code{[:punct:]} class does not lead
to portable code (again, in POSIX-like regex engines).
In \pkg{ICU}, \code{[:punct:]} covers only General Category \code{P},
so a POSIX flavor of \code{[:punct:]} is more like
\code{[\\p{P}\\p{S}]} in \pkg{ICU}. You have been warned.
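The discrepancy can be seen in a small sketch (the input string is made
up, and the second pattern is written in the POSIX-like property syntax
to avoid backslashes):
\preformatted{library(stringi)
x <- "a+b=c; $5!"  # made-up input
stri_count_charclass(x, "[:punct:]")     # in ICU: General Category P only
stri_count_charclass(x, "[[:P:][:S:]]")  # also counts symbols such as +, =, $
}
Here the first call should count only the semicolon and the exclamation
mark, whereas the second one also includes the three symbols.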
}
\references{
\emph{The Unicode Character Database} -- Unicode Standard Annex #44,
\url{https://www.unicode.org/reports/tr44/}
\emph{UnicodeSet} -- ICU User Guide,
\url{https://unicode-org.github.io/icu/userguide/strings/unicodeset.html}
\emph{Properties} -- ICU User Guide,
\url{https://unicode-org.github.io/icu/userguide/strings/properties.html}
\emph{C/POSIX Migration} -- ICU User Guide,
\url{https://unicode-org.github.io/icu/userguide/icu/posix.html}
\emph{Unicode Script Data}, \url{https://www.unicode.org/Public/UNIDATA/Scripts.txt}
\emph{icu::UnicodeSet Class Reference} -- ICU4C API Documentation,
\url{https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1UnicodeSet.html}
}
\seealso{
The official online manual of \pkg{stringi} at \url{https://stringi.gagolewski.com/}
Gagolewski M., \pkg{stringi}: Fast and portable character string processing in R, \emph{Journal of Statistical Software} 103(2), 2022, 1-59, \doi{10.18637/jss.v103.i02}
Other search_charclass:
\code{\link{about_search}},
\code{\link{stri_trim_both}()}
Other stringi_general_topics:
\code{\link{about_arguments}},
\code{\link{about_encoding}},
\code{\link{about_locale}},
\code{\link{about_search_boundaries}},
\code{\link{about_search_coll}},
\code{\link{about_search_fixed}},
\code{\link{about_search_regex}},
\code{\link{about_search}},
\code{\link{about_stringi}}
}
\concept{search_charclass}
\concept{stringi_general_topics}
\author{
\href{https://www.gagolewski.com/}{Marek Gagolewski} and other contributors
}