% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/encoding_conversion.R
\name{stri_encode}
\alias{stri_encode}
\alias{stri_conv}
\title{Convert Strings Between Given Encodings}
\usage{
stri_encode(str, from = NULL, to = NULL, to_raw = FALSE)
stri_conv(str, from = NULL, to = NULL, to_raw = FALSE)
}
\arguments{
\item{str}{a character vector, a raw vector, or
a list of \code{raw} vectors to be converted}
\item{from}{input encoding:
\code{NULL} or \code{''} for the default encoding
or for the strings' internal encoding marks (see Details);
otherwise, a single string with an encoding name,
see \code{\link{stri_enc_list}}}
\item{to}{target encoding:
\code{NULL} or \code{''} for the default encoding
(see \code{\link{stri_enc_get}}),
or a single string with an encoding name}
\item{to_raw}{a single logical value; indicates whether a list of raw vectors
rather than a character vector should be returned}
}
\value{
If \code{to_raw} is \code{FALSE},
then a character vector with encoded strings (and appropriate
encoding marks) is returned.
Otherwise, a list of vectors of type raw is produced.
}
\description{
These functions convert strings between encodings.
They aim to serve as a more portable and faster replacement
for \R's own \code{\link{iconv}}.
}
\details{
\code{stri_conv} is an alias for \code{stri_encode}.
Refer to \code{\link{stri_enc_list}} for the list
of supported encodings and \link{stringi-encoding}
for a general discussion.
If \code{from} is either missing, \code{''}, or \code{NULL},
and if \code{str} is a character vector,
then the marked encodings are used
(see \code{\link{stri_enc_mark}}) -- in such a case \code{bytes}-declared
strings are disallowed.
Otherwise, i.e., if \code{str} is a \code{raw}-type vector
or a list of raw vectors,
we assume that the input encoding is the current default encoding
as given by \code{\link{stri_enc_get}}.
However, if \code{from} is given explicitly,
the internal encoding declarations are always ignored.
For \code{to_raw=FALSE}, the output
strings always have the encodings marked according to the target converter
used (as specified by \code{to}) and the current default encoding
(\code{ASCII}, \code{latin1}, \code{UTF-8}, \code{native},
or \code{bytes} in all other cases).
Note that some issues might occur if \code{to} indicates, e.g.,
UTF-16 or UTF-32, as the output strings may have embedded NULs.
In such cases, please use \code{to_raw=TRUE} and consider
specifying a byte order mark (BOM) for portability reasons
(e.g., set \code{to} to \code{UTF-16} or \code{UTF-32}, which automatically
add the BOMs).
Note that \code{stri_encode(as.raw(data), 'encodingname')}
is a clever substitute for \code{\link{rawToChar}}.
In the current version of \pkg{stringi}, if an incorrect code point is found
on input, it is replaced with the default (for that target encoding)
'missing/erroneous' character (with a warning), e.g.,
the SUBSTITUTE character (U+001A) or the REPLACEMENT one (U+FFFD).
Occurrences thereof can be located in the output string to diagnose
the problematic sequences, e.g., by calling:
\code{stri_locate_all_regex(converted_string, '[\\ufffd\\u001a]')}.
Because of the way this function is currently implemented,
the maximal size of a single string to be converted cannot exceed ~0.67 GB.
}
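\examples{
# A minimal sketch of the behaviour described in Details. The byte values
# and variable names below are illustrative assumptions, not part of the API.
library("stringi")

# Raw input with an explicit source encoding: the four bytes of the word
# cafe with an acute accent on the final letter, as stored in ISO-8859-1.
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))
utf8 <- stri_encode(latin1_bytes, from = "ISO-8859-1", to = "UTF-8")
Encoding(utf8)

# Character input with from left as NULL: the marked encoding is used.
# With to_raw = TRUE, a list of raw vectors is returned, one per string.
back <- stri_encode(utf8, to = "ISO-8859-1", to_raw = TRUE)
identical(back[[1]], latin1_bytes)

# UTF-16 output may contain embedded NULs, so request raw vectors.
utf16 <- stri_encode(utf8, to = "UTF-16", to_raw = TRUE)
utf16[[1]]

# Locating replacement characters after converting ill-formed input.
# The byte 0xff never occurs in well-formed UTF-8, so a warning is
# expected here; it is suppressed to keep the example output short.
bad <- as.raw(c(0x61, 0xff, 0x62))
converted <- suppressWarnings(stri_encode(bad, from = "UTF-8", to = "UTF-8"))
stri_locate_all_fixed(converted, intToUtf8(0xFFFD))
}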
\references{
\emph{Conversion} -- ICU User Guide,
\url{https://unicode-org.github.io/icu/userguide/conversion/}
}
\seealso{
The official online manual of \pkg{stringi} at \url{https://stringi.gagolewski.com/}
Gagolewski M., \pkg{stringi}: Fast and portable character string processing in R, \emph{Journal of Statistical Software} 103(2), 2022, 1-59, \doi{10.18637/jss.v103.i02}
Other encoding_conversion:
\code{\link{about_encoding}},
\code{\link{stri_enc_fromutf32}()},
\code{\link{stri_enc_toascii}()},
\code{\link{stri_enc_tonative}()},
\code{\link{stri_enc_toutf32}()},
\code{\link{stri_enc_toutf8}()}
}
\concept{encoding_conversion}
\author{
\href{https://www.gagolewski.com/}{Marek Gagolewski} and other contributors
}