% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/trans_normalization.R
\name{stri_trans_nfc}
\alias{stri_trans_nfc}
\alias{stri_trans_nfd}
\alias{stri_trans_nfkd}
\alias{stri_trans_nfkc}
\alias{stri_trans_nfkc_casefold}
\alias{stri_trans_isnfc}
\alias{stri_trans_isnfd}
\alias{stri_trans_isnfkd}
\alias{stri_trans_isnfkc}
\alias{stri_trans_isnfkc_casefold}
\title{Perform or Check For Unicode Normalization}
\usage{
stri_trans_nfc(str)

stri_trans_nfd(str)

stri_trans_nfkd(str)

stri_trans_nfkc(str)

stri_trans_nfkc_casefold(str)

stri_trans_isnfc(str)

stri_trans_isnfd(str)

stri_trans_isnfkd(str)

stri_trans_isnfkc(str)

stri_trans_isnfkc_casefold(str)
}
\arguments{
\item{str}{character vector to be normalized or checked}
}
\value{
The \code{stri_trans_nf*} functions return a character vector
of the same length as input (the output is always in UTF-8).

The \code{stri_trans_isnf*} functions return a logical vector.
}
\description{
These functions convert strings to NFC, NFKC, NFD, NFKD, or NFKC_Casefold
Unicode Normalization Form or check whether strings are normalized.
}
\details{
Unicode Normalization Forms are formally defined normalizations of Unicode
strings which, e.g., make it possible to determine whether any two
strings are equivalent.
Essentially, the Unicode Normalization Algorithm puts all combining
marks in a specified order, and uses rules for decomposition
and composition to transform each string into one of the
Unicode Normalization Forms.

The following Normalization Forms (NFs) are supported:
\itemize{
\item NFC (Canonical Decomposition, followed by Canonical Composition),
\item NFD (Canonical Decomposition),
\item NFKC (Compatibility Decomposition, followed by Canonical Composition),
\item NFKD (Compatibility Decomposition),
\item NFKC_Casefold (a combination of NFKC, case folding, and removal of
 ignorable characters; introduced in Unicode 5.2).
}

Note that many W3C Specifications recommend using NFC for all content,
because this form avoids potential interoperability problems arising
from the use of canonically equivalent, yet different,
character sequences in document formats on the Web.
Thus, you will rarely need these functions in typical
string processing activities. Most often, you may assume
that a string is already in NFC; see RFC5198.

As usual in \pkg{stringi},
if the input character vector is in the native encoding,
it will be automatically converted to UTF-8.

For more general text transforms, refer to \code{\link{stri_trans_general}}.
}
\examples{
stri_trans_nfd('\u0105') # a with ogonek -> a, ogonek
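
# A minimal sketch of why normalization matters when comparing strings.
# The variable names below are illustrative only, not part of the package.
composed   <- '\u0105'   # LATIN SMALL LETTER A WITH OGONEK (1 code point)
decomposed <- 'a\u0328'  # 'a' followed by COMBINING OGONEK (2 code points)
composed == decomposed                  # the raw strings differ
stri_trans_nfc(decomposed) == composed  # they agree once both are in NFC
stri_length(stri_trans_nfd(composed))   # code point count after NFD
stri_trans_isnfc(c(composed, decomposed))  # which inputs are already in NFC?
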
stri_trans_nfkc('\ufdfa') # 1 codepoint -> 18 codepoints
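
# A small illustration of how the compatibility forms differ from the
# canonical ones, using the 'fi' ligature (U+FB01) as an assumed test input:
stri_trans_nfc('\ufb01')   # canonical form keeps the ligature as is
stri_trans_nfkc('\ufb01')  # compatibility form yields the letters 'f', 'i'
stri_trans_nfkc_casefold('\ufb01N')  # NFKC combined with case folding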

}
\references{
\emph{Unicode Normalization Forms} -- Unicode Standard Annex #15,
   \url{https://unicode.org/reports/tr15/}

\emph{Unicode Format for Network Interchange}
-- RFC5198, \url{https://www.rfc-editor.org/rfc/rfc5198}

\emph{Character Model for the World Wide Web 1.0: Normalization}
-- W3C Working Draft, \url{https://www.w3.org/TR/charmod-norm/}

\emph{Normalization} -- ICU User Guide,
   \url{https://unicode-org.github.io/icu/userguide/transforms/normalization/}
   (technical details)

\emph{Unicode Equivalence} -- Wikipedia,
\url{https://en.wikipedia.org/wiki/Unicode_equivalence}
}
\seealso{
The official online manual of \pkg{stringi} at \url{https://stringi.gagolewski.com/}

Gagolewski M., \pkg{stringi}: Fast and portable character string processing in R, \emph{Journal of Statistical Software} 103(2), 2022, 1-59, \doi{10.18637/jss.v103.i02}

Other transform: 
\code{\link{stri_trans_char}()},
\code{\link{stri_trans_general}()},
\code{\link{stri_trans_list}()},
\code{\link{stri_trans_tolower}()}
}
\concept{transform}
\author{
\href{https://www.gagolewski.com/}{Marek Gagolewski} and other contributors
}