File: stri_split_boundaries.Rd

package info (click to toggle)
r-cran-stringi 1.7.12-1
links: PTS, VCS
area: main
in suites: bookworm
size: 39,772 kB
sloc: cpp: 482,349; ansic: 51,900; perl: 471; makefile: 9; sh: 1
file content (128 lines) | stat: -rw-r--r-- 4,513 bytes
parent folder | download | duplicates (2)
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/search_split_bound.R
\name{stri_split_boundaries}
\alias{stri_split_boundaries}
\title{Split a String at Text Boundaries}
\usage{
stri_split_boundaries(
  str,
  n = -1L,
  tokens_only = FALSE,
  simplify = FALSE,
  ...,
  opts_brkiter = NULL
)
}
\arguments{
\item{str}{character vector or an object coercible to}

\item{n}{integer vector, maximal number of strings to return}

\item{tokens_only}{single logical value; may affect the result if \code{n}
is positive, see Details}

\item{simplify}{single logical value; if \code{TRUE} or \code{NA},
then a character matrix is returned; otherwise (the default), a list of
character vectors is given, see Value}

\item{...}{additional settings for \code{opts_brkiter}}

\item{opts_brkiter}{a named list with \pkg{ICU} BreakIterator's settings,
see \code{\link{stri_opts_brkiter}}; \code{NULL} for the
default break iterator, i.e., \code{line_break}}
}
\value{
If \code{simplify=FALSE} (the default),
then the functions return a list of character vectors.

Otherwise, \code{\link{stri_list2matrix}} with \code{byrow=TRUE}
and \code{n_min=n} arguments is called on the resulting object.
In such a case, a character matrix with \code{length(str)} rows
is returned. Note that \code{\link{stri_list2matrix}}'s \code{fill}
argument is set to an empty string and \code{NA},
for \code{simplify} equal to \code{TRUE} and \code{NA}, respectively.
}
\description{
This function locates text boundaries
(like character, word, line, or sentence boundaries)
and splits strings at the indicated positions.
}
\details{
Vectorized over \code{str} and \code{n}.

If \code{n} is negative (the default), then all text pieces are extracted.

Otherwise, if \code{tokens_only} is \code{FALSE} (which is the default),
then \code{n-1} tokens are extracted (if possible) and the \code{n}-th string
gives the (non-split) remainder (see Examples).
On the other hand, if \code{tokens_only} is \code{TRUE},
then only full tokens (up to \code{n} pieces) are extracted.

For more information on text boundary analysis
performed by \pkg{ICU}'s \code{BreakIterator}, see
\link{stringi-search-boundaries}.
}
\examples{
test <- 'The\u00a0above-mentioned    features are very useful. ' \%s+\%
   'Spam, spam, eggs, bacon, and spam. 123 456 789'
stri_split_boundaries(test, type='line')
stri_split_boundaries(test, type='word')
stri_split_boundaries(test, type='word', skip_word_none=TRUE)
stri_split_boundaries(test, type='word', skip_word_none=TRUE, skip_word_letter=TRUE)
stri_split_boundaries(test, type='word', skip_word_none=TRUE, skip_word_number=TRUE)
stri_split_boundaries(test, type='sentence')
stri_split_boundaries(test, type='sentence', skip_sentence_sep=TRUE)
stri_split_boundaries(test, type='character')

# a filtered break iterator with the new ICU:
stri_split_boundaries('Mr. Jones and Mrs. Brown are very happy.
So am I, Prof. Smith.', type='sentence', locale='en_US@ss=standard') # ICU >= 56 only

}
\seealso{
The official online manual of \pkg{stringi} at \url{https://stringi.gagolewski.com/}

Gagolewski M., \pkg{stringi}: Fast and portable character string processing in R, \emph{Journal of Statistical Software} 103(2), 2022, 1-59, \doi{10.18637/jss.v103.i02}

Other search_split: 
\code{\link{about_search}},
\code{\link{stri_split_lines}()},
\code{\link{stri_split}()}

Other locale_sensitive: 
\code{\link{\%s<\%}()},
\code{\link{about_locale}},
\code{\link{about_search_boundaries}},
\code{\link{about_search_coll}},
\code{\link{stri_compare}()},
\code{\link{stri_count_boundaries}()},
\code{\link{stri_duplicated}()},
\code{\link{stri_enc_detect2}()},
\code{\link{stri_extract_all_boundaries}()},
\code{\link{stri_locate_all_boundaries}()},
\code{\link{stri_opts_collator}()},
\code{\link{stri_order}()},
\code{\link{stri_rank}()},
\code{\link{stri_sort_key}()},
\code{\link{stri_sort}()},
\code{\link{stri_trans_tolower}()},
\code{\link{stri_unique}()},
\code{\link{stri_wrap}()}

Other text_boundaries: 
\code{\link{about_search_boundaries}},
\code{\link{about_search}},
\code{\link{stri_count_boundaries}()},
\code{\link{stri_extract_all_boundaries}()},
\code{\link{stri_locate_all_boundaries}()},
\code{\link{stri_opts_brkiter}()},
\code{\link{stri_split_lines}()},
\code{\link{stri_trans_tolower}()},
\code{\link{stri_wrap}()}
}
\concept{locale_sensitive}
\concept{search_split}
\concept{text_boundaries}
\author{
\href{https://www.gagolewski.com/}{Marek Gagolewski} and other contributors
}