File: chunk_text.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/chunk-text.R
\name{chunk_text}
\alias{chunk_text}
\title{Chunk text into smaller segments}
\usage{
chunk_text(x, chunk_size = 100, doc_id = names(x), ...)
}
\arguments{
\item{x}{A character vector or a list of character vectors to be chunked
into segments. If \code{x} is a character vector, it can be of any length,
and each element will be chunked separately. If \code{x} is a list of
character vectors, each element of the list should have a length of 1.}

\item{chunk_size}{The number of words in each chunk.}

\item{doc_id}{The document IDs as a character vector. This will be taken from
the names of the \code{x} vector if available. \code{NULL} is acceptable.}

\item{...}{Arguments passed on to \code{\link{tokenize_words}}.}
}
\description{
Given a text or vector/list of texts, break the texts into smaller segments
each with the same number of words. This allows you to treat a very long
document, such as a novel, as a set of smaller documents.
}
\details{
Chunking the text passes it through \code{\link{tokenize_words}},
  which will strip punctuation and lowercase the text unless you provide
  arguments to pass along to that function.
}
\examples{
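# A minimal runnable sketch on a short in-memory text. The object name
# `song` and its contents are illustrative and not part of the package data.
song <- paste("How many roads must a man walk down",
              "before you call him a man?")
chunks <- chunk_text(song, chunk_size = 5)
length(chunks)
chunks
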
\dontrun{
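# `mobydick` is assumed here to be a user-supplied character vector holding
# the full text of the novel; it is not shipped with the package.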
chunked <- chunk_text(mobydick, chunk_size = 100)
length(chunked)
chunked[1:3]
}
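
# Arguments in ... are passed on to tokenize_words(), so, for example,
# lowercase = FALSE keeps the original casing of the words in each chunk.
chunk_text(song, chunk_size = 5, lowercase = FALSE)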
}