File: Tokenizer.Rd

\name{Tokenizer}
\alias{Span_Tokenizer}
\alias{as.Span_Tokenizer}
\alias{is.Span_Tokenizer}
\alias{Token_Tokenizer}
\alias{as.Token_Tokenizer}
\alias{is.Token_Tokenizer}
\title{Tokenizer objects}
\description{
  Create tokenizer objects.
}
\usage{
Span_Tokenizer(f, meta = list())
as.Span_Tokenizer(x, ...)
is.Span_Tokenizer(x)

Token_Tokenizer(f, meta = list())
as.Token_Tokenizer(x, ...)
is.Token_Tokenizer(x)
}
\arguments{
  \item{f}{a tokenizer function taking the string to tokenize as its
    argument and returning either the tokens (for
    \code{Token_Tokenizer}) or their spans (for
    \code{Span_Tokenizer}).}
  \item{meta}{a named or empty list of tokenizer metadata tag-value
    pairs.}
  \item{x}{an \R object.}
  \item{...}{further arguments passed to or from other methods.}
}
\details{
  Tokenization is the process of breaking a text string up into words,
  phrases, symbols, or other meaningful elements called tokens.  This
  can be accomplished by returning the sequence of tokens, or the
  corresponding spans (character start and end positions).
  We refer to tokenization resources of the respective kinds as
  \dQuote{token tokenizers} and \dQuote{span tokenizers}.

  \code{Span_Tokenizer()} and \code{Token_Tokenizer()} return tokenizer
  objects, i.e., functions with metadata and suitable class information.
  These objects can be coerced from one kind to the other using
  \code{as.Span_Tokenizer()} and \code{as.Token_Tokenizer()}.
  It is also possible to coerce annotator (pipeline) objects to
  tokenizer objects, provided that the annotators provide suitable
  token annotations.  By default, word tokens are used; this can be
  controlled via the \code{type} argument of the coercion methods (e.g.,
  use \code{type = "sentence"} to extract sentence tokens).

  There are also \code{print()} and \code{format()} methods for
  tokenizer objects, which use the \code{description} element of the
  metadata if available.
}
\seealso{
  \code{\link{Regexp_Tokenizer}()} for creating regexp span tokenizers.
}
\examples{
## A simple text.
s <- String("  First sentence.  Second sentence.  ")
##           ****5****0****5****0****5****0****5**

## Use a pre-built regexp (span) tokenizer:
wordpunct_tokenizer
wordpunct_tokenizer(s)
## Turn into a token tokenizer:
tt <- as.Token_Tokenizer(wordpunct_tokenizer)
tt
tt(s)
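## The is.* predicates can be used to check the kind of a tokenizer:
is.Span_Tokenizer(wordpunct_tokenizer)
is.Token_Tokenizer(tt)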
## Of course, in this case we could simply have done
s[wordpunct_tokenizer(s)]
## to obtain the tokens from the spans.
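## A span tokenizer can also be created directly from a function
## returning the spans.  As a simple sketch (not a pre-built tokenizer
## from the package), locate runs of non-whitespace characters with
## gregexpr():
st0 <- Span_Tokenizer(function(s) {
    m <- gregexpr("[^[:space:]]+", as.character(s))[[1L]]
    start <- as.integer(m)
    Span(start, start + attr(m, "match.length") - 1L)
})
st0(s)
s[st0(s)]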
## Conversion also works the other way round: package 'tm' provides
## the following token tokenizer function:
scan_tokenizer <- function(x)
    scan(text = as.character(x), what = "character", quote = "", 
         quiet = TRUE)
## Create a token tokenizer from this:
tt <- Token_Tokenizer(scan_tokenizer)
tt(s)
## Turn into a span tokenizer:
st <- as.Span_Tokenizer(tt)
st(s)
## Checking tokens from spans:
s[st(s)]
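## Tokenizer objects can carry metadata; a 'description' entry, if
## available, is used by the print() and format() methods.  A small
## illustration, reusing the scan() based tokenizer from above:
tt <- Token_Tokenizer(scan_tokenizer,
                      meta = list(description = "Tokenizer using scan()."))
tt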
}