% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/encoding_detection.R
\name{stri_enc_detect2}
\alias{stri_enc_detect2}
\title{[DEPRECATED] Detect Locale-Sensitive Character Encoding}
\usage{
stri_enc_detect2(str, locale = NULL)
}
\arguments{
\item{str}{character vector, a raw vector, or
a list of \code{raw} vectors}

\item{locale}{\code{NULL} or \code{''} for the default locale,
or a single string with locale identifier.}
}
\value{
Just like \code{\link{stri_enc_detect}},
this function returns a list of length equal to the length of \code{str}.
Each list element is a data frame with the following three named columns:
\itemize{
   \item \code{Encoding} -- string; guessed encodings; \code{NA} on failure
   (if and only if no candidate encoding is found),
   \item \code{Language} -- always \code{NA},
   \item \code{Confidence} -- numeric in [0,1]; the higher the value,
   the more confidence there is in the match; \code{NA} on failure.
}
The guesses are ordered by decreasing confidence.
}
\description{
This function tries to detect the character encoding of a text
when the language of the text is known.
}
\details{
Vectorized over \code{str}.

First, the text is checked for being valid
UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE, UTF-8
(as in \code{\link{stri_enc_detect}},
this is roughly inspired by \pkg{ICU}'s \code{i18n/csrucode.cpp}), or ASCII.


If \code{locale} is not \code{NA} and the above check fails,
the function counts the occurrences
of language-specific code points (data provided by the \pkg{ICU} library)
converted to each 8-bit encoding
that fully covers the indicated language.
The encoding with the greatest total number of
byte hits is selected.

The guess is of course imprecise,
as it is obtained using statistics and heuristics.
Because of this, detection works best if you supply at least a few hundred
bytes of character data that is in a single language.


If you have no initial guess as to the language and encoding, try
\code{\link{stri_enc_detect}} (which uses \pkg{ICU} facilities).
}
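\examples{
# A minimal sketch, not part of the original documentation, of the
# locale-aware detection described in the details section.
# Assumptions (hypothetical, for illustration only): the sample text,
# the 'pl_PL' locale identifier, and ISO-8859-2 as the source encoding.
# The function is deprecated and its guesses are heuristic, so the code
# is not run automatically and no particular output is claimed.
\dontrun{
library(stringi)

# Polish diacritics built from Unicode code points (keeps this file ASCII-only)
pl_chars <- intToUtf8(c(261, 263, 281, 322, 324, 243, 347, 378, 380))

# Repeat the sample so that the detector sees a few hundred bytes of text
txt <- stri_dup(stri_c("tekst ", pl_chars, " "), 30)

# Re-encode to an 8-bit encoding; to_raw = TRUE gives a list of raw vectors
bytes <- stri_encode(txt, from = "UTF-8", to = "ISO-8859-2", to_raw = TRUE)

# Ask for encoding guesses, given that the text is known to be Polish
res <- stri_enc_detect2(bytes, locale = "pl_PL")

# One data frame per input element; guesses are ordered by confidence
res[[1]]
}
}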
\seealso{
The official online manual of \pkg{stringi} at \url{https://stringi.gagolewski.com/}

Gagolewski M., \pkg{stringi}: Fast and portable character string processing in R, \emph{Journal of Statistical Software} 103(2), 2022, 1-59, \doi{10.18637/jss.v103.i02}

Other locale_sensitive: 
\code{\link{\%s<\%}()},
\code{\link{about_locale}},
\code{\link{about_search_boundaries}},
\code{\link{about_search_coll}},
\code{\link{stri_compare}()},
\code{\link{stri_count_boundaries}()},
\code{\link{stri_duplicated}()},
\code{\link{stri_extract_all_boundaries}()},
\code{\link{stri_locate_all_boundaries}()},
\code{\link{stri_opts_collator}()},
\code{\link{stri_order}()},
\code{\link{stri_rank}()},
\code{\link{stri_sort_key}()},
\code{\link{stri_sort}()},
\code{\link{stri_split_boundaries}()},
\code{\link{stri_trans_tolower}()},
\code{\link{stri_unique}()},
\code{\link{stri_wrap}()}

Other encoding_detection: 
\code{\link{about_encoding}},
\code{\link{stri_enc_detect}()},
\code{\link{stri_enc_isascii}()},
\code{\link{stri_enc_isutf16be}()},
\code{\link{stri_enc_isutf8}()}
}
\concept{encoding_detection}
\concept{locale_sensitive}
\author{
\href{https://www.gagolewski.com/}{Marek Gagolewski} and other contributors
}