\name{CoNLLUTextDocument}
\alias{CoNLLUTextDocument}
\alias{read_CoNNLU}
\title{
CoNLL-U Text Documents
}
\description{
Create text documents from CoNLL-U format files.
}
\usage{
CoNLLUTextDocument(con, meta = list(), text = NULL)
read_CoNNLU(con)
}
\arguments{
\item{con}{a connection object or a character string.
See \code{\link{scan}()} for details.
}
\item{meta}{a named or empty list of document metadata tag-value
pairs.}
\item{text}{a character vector giving the text of the CoNLL-U
annotation. If \code{NULL}, the \code{text} comments of the
annotation are used.}
}
\details{
The CoNLL-U format (see
\url{https://universaldependencies.org/format.html})
is a CoNLL-style format for annotated texts popularized and employed
by the Universal Dependencies project
(see \url{https://universaldependencies.org/}).
For each \dQuote{word} in the text, this provides exactly the 10
fields
\code{ID},
\code{FORM} (word form or punctuation symbol),
\code{LEMMA} (lemma or stem of word form),
\code{UPOSTAG} (universal part-of-speech tag, see
\url{https://universaldependencies.org/u/pos/index.html}),
\code{XPOSTAG} (language-specific part-of-speech tag, may be
unavailable),
\code{FEATS} (list of morphological features),
\code{HEAD},
\code{DEPREL},
\code{DEPS}, and
\code{MISC}.
\code{read_CoNNLU()} reads the lines with these fields and optional
comments from the given connection and splits the lines into fields using
\code{\link{scan}()}. The result is combined with consecutive sentence ids
into a data frame inheriting from class \code{"CoNNLU_Annotation"},
which is used for representing the annotation information.
\code{CoNLLUTextDocument()} combines this annotation information with
the given metadata (and optionally the original pre-tokenized text)
into a CoNLL-U text document inheriting from classes
\code{"CoNLLUTextDocument"} and \code{"\link{TextDocument}"}.
The complete annotation information data frame can be extracted via
\code{content()}. CoNLL-U v2 requires providing the complete text of
each sentence (or a reconstruction thereof) in \samp{# text =} comment
lines. Where consistently provided, these are made available in the
\code{text} attribute of the content data frame.
In addition, there are methods for generics
\code{\link{as.character}()},
\code{\link{words}()},
\code{\link{sents}()},
\code{\link{tagged_words}()}, and
\code{\link{tagged_sents}()}
and class \code{"CoNLLUTextDocument"},
which should be used to access the text in such text document
objects.
The CoNLL-U format allows representing both words and (multiword)
tokens (see section \sQuote{Words, Tokens and Empty Nodes} in the
format documentation), as distinguished by ids being integers or
integer ranges, with the words being annotated further. One can
use \code{as.character()} to extract the \emph{tokens}; all other
viewers listed above use the \emph{words}. Finally, the viewers
incorporating POS tags take a \code{which} argument to specify using
the universal or language-specific tags, by giving a substring of
\code{"UPOSTAG"} (default) or \code{"XPOSTAG"}.
}
\value{
For \code{CoNLLUTextDocument()}, an object inheriting from
\code{"CoNLLUTextDocument"} and \code{"\link{TextDocument}"}.
For \code{read_CoNNLU()}, an object inheriting from
\code{"CoNNLU_Annotation"} and \code{"\link{data.frame}"}.
}
\seealso{
\code{\link{TextDocument}} for basic information on the text document
infrastructure employed by package \pkg{NLP}.
\url{https://universaldependencies.org/} for access to the Universal
Dependencies treebanks, which provide annotated texts in \emph{many}
different languages using CoNLL-U format.
}
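\examples{
## A minimal sketch of the workflow described in the details section.
## Assumptions (not taken from the package sources): the one-sentence
## annotation below is made-up sample data, and the temporary file name
## is arbitrary.  Word lines carry the 10 fields separated by tabs.
library("NLP")
lines <-
    c("# sent_id = 1",
      "# text = They buy books.",
      paste("1", "They", "they", "PRON", "PRP", "Case=Nom|Number=Plur",
            "2", "nsubj", "_", "_", sep = "\t"),
      paste("2", "buy", "buy", "VERB", "VBP", "Number=Plur|Tense=Pres",
            "0", "root", "_", "_", sep = "\t"),
      paste("3", "books", "book", "NOUN", "NNS", "Number=Plur",
            "2", "obj", "_", "SpaceAfter=No", sep = "\t"),
      paste("4", ".", ".", "PUNCT", ".", "_",
            "2", "punct", "_", "_", sep = "\t"),
      "")
f <- tempfile(fileext = ".conllu")
writeLines(lines, f)

## Read only the annotation information as a data frame.
a <- read_CoNNLU(f)
class(a)

## Create a text document and inspect its content data frame.
d <- CoNLLUTextDocument(f)
head(content(d))

## Access the text via the viewers listed in the details section.
words(d)
sents(d)
tagged_words(d)                     # universal POS tags (the default)
tagged_words(d, which = "XPOSTAG")  # language-specific POS tags
as.character(d)                     # tokens rather than words

unlink(f)
}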