File: chunk_text.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/chunk-text.R
\name{chunk_text}
\alias{chunk_text}
\title{Chunk text into smaller segments}
\usage{
chunk_text(x, chunk_size = 100, doc_id = names(x), ...)
}
\arguments{
\item{x}{A character vector or a list of character vectors to be chunked
into segments. If \code{x} is a character vector, it can be of any length,
and each element will be chunked separately. If \code{x} is a list of
character vectors, each element of the list should have a length of 1.}

\item{chunk_size}{The number of words in each chunk.}

\item{doc_id}{The document IDs as a character vector. This will be taken from
the names of the \code{x} vector if available. \code{NULL} is acceptable.}

\item{...}{Arguments passed on to \code{\link{tokenize_words}}.}
}
\description{
Given a text or vector/list of texts, break the texts into smaller segments
each with the same number of words. This allows you to treat a very long
document, such as a novel, as a set of smaller documents.
}
\details{
Chunking the text passes it through \code{\link{tokenize_words}},
  which will strip punctuation and lowercase the text unless you provide
  arguments to pass along to that function.
}
\examples{
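# A minimal runnable sketch on a short in-memory text. The object name
# `song` and its contents are illustrative and not part of the package data.
song <- paste("How many roads must a man walk down",
              "before you call him a man?")
chunks <- chunk_text(song, chunk_size = 5)
length(chunks)
chunks
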
\dontrun{
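# `mobydick` is assumed here to be a user-supplied character vector holding
# the full text of the novel; it is not shipped with the package.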
chunked <- chunk_text(mobydick, chunk_size = 100)
length(chunked)
chunked[1:3]
}
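
# Arguments in ... are passed on to tokenize_words(), so, for example,
# lowercase = FALSE keeps the original casing of the words in each chunk.
chunk_text(song, chunk_size = 5, lowercase = FALSE)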
}