File: stri_enc_detect.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/encoding_detection.R
\name{stri_enc_detect}
\alias{stri_enc_detect}
\title{Detect Character Set and Language}
\usage{
stri_enc_detect(str, filter_angle_brackets = FALSE)
}
\arguments{
\item{str}{a character vector, a raw vector, or
a list of \code{raw} vectors}

\item{filter_angle_brackets}{logical; if filtering is enabled,
text between angle brackets ('<' and '>') is removed prior to detection,
which strips most HTML or XML markup.}
}
\value{
Returns a list of length equal to the length of \code{str}.
Each list element is a data frame with the following three columns,
representing all the guesses:
\itemize{
   \item \code{Encoding} -- string; guessed encodings; \code{NA} on failure,
   \item \code{Language} -- string; guessed languages; \code{NA} if the language could
   not be determined (e.g., in case of UTF-8),
   \item \code{Confidence} -- numeric in [0,1]; the higher the value,
   the more confidence there is in the match; \code{NA} on failure.
}
The guesses are ordered by decreasing confidence.
}
\description{
This function uses the \pkg{ICU} engine to determine the character set,
or encoding, of character data in an unknown format.
}
\details{
Vectorized over \code{str} and \code{filter_angle_brackets}.
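
For instance, per-element filtering settings can be given
(a minimal sketch; the inputs are made up for illustration):

\preformatted{stri_enc_detect(c('<p>check</p>', 'check'),
    filter_angle_brackets=c(TRUE, FALSE))}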

For a character vector input, merging all text lines
via \code{\link{stri_flatten}(str, collapse='\n')}
might be needed if \code{str} has been obtained via a call to
\code{readLines} and actually represents a single text file.
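
For instance (a sketch; 'test.txt' stands for a hypothetical text file):

\preformatted{lines <- readLines('test.txt')  # one element per input line
stri_enc_detect(stri_flatten(lines, collapse='\n'))}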

This is, at best, an imprecise operation using statistics and heuristics.
Because of this, detection works best if you supply at least a few hundred
bytes of character data that is mostly in a single language.
However, because the detection only looks at a limited amount of the input
data, some of the returned character sets may fail to handle all of the
input data. Note that in some cases,
the language can be determined along with the encoding.

Several different techniques are used for character set detection.
For multi-byte encodings, the sequence of bytes is checked for legible
patterns. The detected characters are also checked against a list of
frequently used characters in that encoding. For single byte encodings,
the data is checked against a list of the most commonly occurring three
letter groups for each language that can be written using that encoding.

The detection process can be configured to ignore HTML- or XML-style
markup (using \pkg{ICU}'s internal facilities), as such markup
can interfere with the detection process by skewing the statistics.
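
For example (a minimal sketch; the snippet and the target encoding
are made up for illustration):

\preformatted{x <- iconv('<body>je\u017cyk, \u017ce, mo\u017ce</body>',
    'UTF-8', 'ISO-8859-2', toRaw=TRUE)  # a list with one raw vector
stri_enc_detect(x, filter_angle_brackets=TRUE)  # tags do not skew the statistics}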

This function should most often be used for byte-marked input strings,
especially after loading them from text files and before the main
conversion with \code{\link{stri_encode}}.
The input encoding is of course not taken into account here, even
if marked.
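
A sketch of such a round trip ('input.txt' being a hypothetical file
in an unknown encoding):

\preformatted{buf <- readBin('input.txt', 'raw', file.size('input.txt'))
best <- stri_enc_detect(buf)[[1]]$Encoding[1]  # top guess, if any
if (!is.na(best))
    txt <- stri_encode(buf, best, 'UTF-8')}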

The following table shows all the encodings that can be detected:

\tabular{ll}{
\strong{Character_Set} \tab \strong{Languages}\cr
UTF-8 \tab -- \cr
UTF-16BE \tab -- \cr
UTF-16LE \tab -- \cr
UTF-32BE \tab -- \cr
UTF-32LE \tab -- \cr
Shift_JIS \tab Japanese \cr
ISO-2022-JP \tab Japanese \cr
ISO-2022-CN \tab Simplified Chinese \cr
ISO-2022-KR \tab Korean \cr
GB18030 \tab Chinese \cr
Big5 \tab Traditional Chinese \cr
EUC-JP \tab Japanese \cr
EUC-KR \tab Korean \cr
ISO-8859-1 \tab Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish \cr
ISO-8859-2 \tab Czech, Hungarian, Polish, Romanian \cr
ISO-8859-5 \tab Russian \cr
ISO-8859-6 \tab Arabic \cr
ISO-8859-7 \tab Greek \cr
ISO-8859-8 \tab Hebrew \cr
ISO-8859-9 \tab Turkish \cr
windows-1250 \tab Czech, Hungarian, Polish, Romanian \cr
windows-1251 \tab Russian \cr
windows-1252 \tab Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish \cr
windows-1253 \tab Greek \cr
windows-1254 \tab Turkish \cr
windows-1255 \tab Hebrew \cr
windows-1256 \tab Arabic \cr
KOI8-R \tab Russian \cr
IBM420 \tab Arabic \cr
IBM424 \tab Hebrew \cr
}
}
\examples{
## Not run: an example file is not included with the package
## f <- readBin('test.txt', 'raw', 100000)
## stri_enc_detect(f)  # raw vectors are accepted directly
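
# A runnable sketch: the sample text and the target encoding are chosen
# merely for illustration (such short inputs may yield uncertain guesses)
x <- iconv('Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144.',
    'UTF-8', 'ISO-8859-2', toRaw=TRUE)
stri_enc_detect(x)  # a list with one data frame of ordered guesses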

}
\references{
\emph{Character Set Detection} -- ICU User Guide,
\url{https://unicode-org.github.io/icu/userguide/conversion/detection.html}
}
\seealso{
The official online manual of \pkg{stringi} at \url{https://stringi.gagolewski.com/}

Gagolewski M., \pkg{stringi}: Fast and portable character string processing in R, \emph{Journal of Statistical Software} 103(2), 2022, 1-59, \doi{10.18637/jss.v103.i02}

Other encoding_detection: 
\code{\link{about_encoding}},
\code{\link{stri_enc_detect2}()},
\code{\link{stri_enc_isascii}()},
\code{\link{stri_enc_isutf16be}()},
\code{\link{stri_enc_isutf8}()}
}
\concept{encoding_detection}
\author{
\href{https://www.gagolewski.com/}{Marek Gagolewski} and other contributors
}