\name{annotators}
\alias{Simple_Para_Token_Annotator}
\alias{Simple_Sent_Token_Annotator}
\alias{Simple_Word_Token_Annotator}
\alias{Simple_POS_Tag_Annotator}
\alias{Simple_Entity_Annotator}
\alias{Simple_Chunk_Annotator}
\alias{Simple_Stem_Annotator}
\alias{Simple annotator generators}
\title{Simple annotator generators}
\description{
  Create annotator objects for composite basic NLP tasks based on
  functions performing simple basic tasks.
}
\usage{
Simple_Para_Token_Annotator(f, meta = list(), classes = NULL)
Simple_Sent_Token_Annotator(f, meta = list(), classes = NULL)
Simple_Word_Token_Annotator(f, meta = list(), classes = NULL)
Simple_POS_Tag_Annotator(f, meta = list(), classes = NULL)
Simple_Entity_Annotator(f, meta = list(), classes = NULL)
Simple_Chunk_Annotator(f, meta = list(), classes = NULL)
Simple_Stem_Annotator(f, meta = list(), classes = NULL)
}
\arguments{
  \item{f}{a function performing a \dQuote{simple} basic NLP task (see
    \bold{Details}).}
  \item{meta}{an empty or named list of annotator (pipeline) metadata
    tag-value pairs.}
  \item{classes}{a character vector or \code{NULL} (default) giving
    classes to be used for the created annotator object in addition to
    the default ones (see \bold{Details}).}
}
\details{
  The purpose of these functions is to facilitate the creation of
  annotators for basic NLP tasks as described below.

  \code{Simple_Para_Token_Annotator()} creates \dQuote{simple} paragraph
  token annotators.  Argument \code{f} should be a paragraph tokenizer,
  which takes a string \code{s} with the whole text to be processed, and
  returns the spans of the paragraphs in \code{s}, or an annotation
  object with these spans and (possibly) additional features.  The
  generated annotator inherits from the default classes
  \code{"Simple_Para_Token_Annotator"} and \code{"Annotator"}.  It uses
  the results of the simple paragraph tokenizer to create and return
  annotations with unique ids and type \sQuote{paragraph}.
  
  \code{Simple_Sent_Token_Annotator()} creates \dQuote{simple} sentence
  token annotators.  Argument \code{f} should be a sentence tokenizer,
  which takes a string \code{s} with the whole text to be processed, and
  returns the spans of the sentences in \code{s}, or an annotation
  object with these spans and (possibly) additional features.  The
  generated annotator inherits from the default classes
  \code{"Simple_Sent_Token_Annotator"} and \code{"Annotator"}.  It uses
  the results of the simple sentence tokenizer to create and return
  annotations with unique ids and type \sQuote{sentence}, possibly
  combined with sentence constituent features for already available
  paragraph annotations.

  \code{Simple_Word_Token_Annotator()} creates \dQuote{simple} word
  token annotators.  Argument \code{f} should be a simple word
  tokenizer, which takes a string \code{s} giving a sentence to be
  processed, and returns the spans of the word tokens in \code{s}, or an 
  annotation object with these spans and (possibly) additional features.
  The generated annotator inherits from the default classes
  \code{"Simple_Word_Token_Annotator"} and \code{"Annotator"}.
  It uses already available sentence token annotations to extract the
  sentences and obtains the results of the word tokenizer for these.  It
  then adds the sentence character offsets and unique word token ids,
  and word token constituent features for the sentences, and returns
  the word token annotations combined with the augmented sentence token
  annotations.

  \code{Simple_POS_Tag_Annotator()} creates \dQuote{simple} POS tag
  annotators.  Argument \code{f} should be a simple POS tagger, which
  takes a character vector giving the word tokens in a sentence, and
  returns either a character vector with the tags, or a list of feature
  maps with the tags as \sQuote{POS} feature and possibly other
  features.  The generated annotator inherits from the default classes
  \code{"Simple_POS_Tag_Annotator"} and \code{"Annotator"}.  It uses
  already available sentence and word token annotations to extract the
  word tokens for each sentence and obtains the results of the simple
  POS tagger for these, and returns annotations for the word tokens with
  the features obtained from the POS tagger.

  \code{Simple_Entity_Annotator()} creates \dQuote{simple} entity
  annotators.  Argument \code{f} should be a simple entity detector
  (\dQuote{named entity recognizer}) which takes a character vector
  giving the word tokens in a sentence, and returns an annotation object
  with the \emph{word} token spans, a \sQuote{kind} feature giving the
  kind of the entity detected, and possibly other features.  The
  generated annotator inherits from the default classes
  \code{"Simple_Entity_Annotator"} and \code{"Annotator"}.  It uses
  already available sentence and word token annotations to extract the
  word tokens for each sentence and obtains the results of the simple
  entity detector for these, transforms word token spans to character
  spans and adds unique ids, and returns the combined entity
  annotations.
  
  \code{Simple_Chunk_Annotator()} creates \dQuote{simple} chunk
  annotators.  Argument \code{f} should be a simple chunker, which takes
  as arguments character vectors giving the word tokens and the
  corresponding POS tags, and returns either a character vector with the
  chunk tags, or a list of feature lists with the tags as
  \sQuote{chunk_tag} feature and possibly other features.  The generated
  annotator inherits from the default classes
  \code{"Simple_Chunk_Annotator"} and \code{"Annotator"}.  It uses
  already available annotations to extract the word tokens and POS tags
  for each sentence and obtains the results of the simple chunker for
  these, and returns word token annotations with the chunk features
  (only).

  \code{Simple_Stem_Annotator()} creates \dQuote{simple} stem
  annotators.  Argument \code{f} should be a simple stemmer, which takes
  a character vector giving the word tokens, and returns a
  character vector with the corresponding word stems.  The generated
  annotator inherits from the default classes
  \code{"Simple_Stem_Annotator"} and \code{"Annotator"}.  It uses
  already available annotations to extract the word tokens, and returns
  word token annotations with the corresponding stem features (only).

  In all cases, if the underlying simple processing function returns
  annotation objects, these should not provide their own ids (or use
  ids in the features), as the generated annotators will necessarily
  provide these (the already available annotations are accessible only
  at the annotator level, not at the simple processing level).
}
\value{
  An annotator object inheriting from the given classes and the default
  ones.
}
\seealso{
  Package \pkg{openNLP} which provides annotator generators for sentence
  and word tokens, POS tags, entities and chunks, using processing
  functions based on the respective Apache OpenNLP MaxEnt processing
  resources.
}
\examples{
## A simple text.
s <- String("  First sentence.  Second sentence.  ")
##           ****5****0****5****0****5****0****5**

## A very trivial sentence tokenizer.
sent_tokenizer <-
function(s) {
    s <- as.String(s)
    m <- gregexpr("[^[:space:]][^.]*\\\\.", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}
## (Could also use Regexp_Tokenizer() with the above regexp pattern.)
sent_tokenizer(s)
## A simple sentence token annotator based on the sentence tokenizer.
sent_token_annotator <- Simple_Sent_Token_Annotator(sent_tokenizer)
sent_token_annotator
a1 <- annotate(s, sent_token_annotator)
a1
## Extract the sentence tokens.
s[a1]
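
## A sketch of a simple paragraph token annotator, assuming that
## paragraphs are separated by blank lines and that
## blankline_tokenizer() is an adequate paragraph tokenizer for such
## text.  The text p below is made up for illustration.
p <- String("First paragraph.

Second paragraph.")
para_token_annotator <- Simple_Para_Token_Annotator(blankline_tokenizer)
para_token_annotator
ap <- annotate(p, para_token_annotator)
ap
## Extract the paragraph tokens.
p[ap]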

## A very trivial word tokenizer.
word_tokenizer <-
function(s) {
    s <- as.String(s)
    ## Remove the last character (should be a period when using
    ## sentences determined with the trivial sentence tokenizer).
    s <- substring(s, 1L, nchar(s) - 1L)
    ## Split on whitespace separators.
    m <- gregexpr("[^[:space:]]+", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}
lapply(s[a1], word_tokenizer)
## A simple word token annotator based on the word tokenizer.
word_token_annotator <- Simple_Word_Token_Annotator(word_tokenizer)
word_token_annotator
a2 <- annotate(s, word_token_annotator, a1)
a2
## Extract the word tokens.
s[subset(a2, type == "word")]

## A simple word token annotator based on wordpunct_tokenizer():
word_token_annotator <-
    Simple_Word_Token_Annotator(wordpunct_tokenizer,
                                list(description =
                                     "Based on wordpunct_tokenizer()."))
word_token_annotator
a2 <- annotate(s, word_token_annotator, a1)
a2
## Extract the word tokens.
s[subset(a2, type == "word")]
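
## A sketch of simple POS tag and stem annotators, as described in the
## Details.  The text txt and the functions toy_pos_tagger() and
## toy_stemmer() are hypothetical stand-ins made up for illustration;
## they are far too crude for real use.
txt <- String("Dogs bark. Cats purr.")
## Sentence and word token annotations are needed first.
sent_ann <-
    Simple_Sent_Token_Annotator(Regexp_Tokenizer("[^[:space:]][^.]*[.]"))
word_ann <- Simple_Word_Token_Annotator(wordpunct_tokenizer)
a <- annotate(txt, sent_ann)
a <- annotate(txt, word_ann, a)
## A toy POS tagger: takes the word tokens of one sentence and returns
## one tag per token ("NN" for alphabetic tokens, "." otherwise).
toy_pos_tagger <-
function(w) ifelse(grepl("^[[:alpha:]]+$", w), "NN", ".")
pos_tag_annotator <- Simple_POS_Tag_Annotator(toy_pos_tagger)
a <- annotate(txt, pos_tag_annotator, a)
## A toy stemmer: lower-cases the word tokens and drops a final "s".
toy_stemmer <-
function(w) sub("s$", "", tolower(w))
stem_annotator <- Simple_Stem_Annotator(toy_stemmer)
a <- annotate(txt, stem_annotator, a)
## The word token annotations now carry POS and stem features.
words <- subset(a, type == "word")
words
sapply(words$features, `[[`, "POS")
sapply(words$features, `[[`, "stem")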
}