% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/chunk-text.R
\name{chunk_text}
\alias{chunk_text}
\title{Chunk text into smaller segments}
\usage{
chunk_text(x, chunk_size = 100, doc_id = names(x), ...)
}
\arguments{
\item{x}{A character vector or a list of character vectors to be chunked
into segments. If \code{x} is a character vector, it can be of any length,
and each element will be chunked separately. If \code{x} is a list of
character vectors, each element of the list should have a length of 1.}
\item{chunk_size}{The number of words in each chunk.}
\item{doc_id}{The document IDs as a character vector. This will be taken from
the names of the \code{x} vector if available. \code{NULL} is acceptable.}
\item{...}{Arguments passed on to \code{\link{tokenize_words}}.}
}
\description{
Given a text or vector/list of texts, break the texts into smaller segments
each with the same number of words. This allows you to treat a very long
document, such as a novel, as a set of smaller documents.
}
\details{
Chunking the text passes it through \code{\link{tokenize_words}}, which
will strip punctuation and lowercase the text, unless you supply arguments
to pass along to that function via \code{...}.
}
\examples{
\dontrun{
chunked <- chunk_text(mobydick, chunk_size = 100)
length(chunked)
chunked[1:3]
}
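
# A small runnable sketch (illustrative only, not part of the package's
# original example): chunk a short passage into 5-word segments, passing
# lowercase = FALSE through ... to tokenize_words so casing is preserved.
txt <- c(doc = paste(rep("The quick brown fox jumps over the lazy dog.", 5),
                     collapse = " "))
chunks <- chunk_text(txt, chunk_size = 5, lowercase = FALSE)
length(chunks)  # the 45-word passage should yield nine 5-word chunks
chunks[[1]]     # first chunk: punctuation stripped, casing preserved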
}