Lemmatization and Normalization
===============================
Wn provides two methods for expanding queries: lemmatization_ and
normalization_\ . Wn also has a setting that allows `alternative forms
<alternative-forms_>`_ stored in the database to be included in
queries.

.. seealso::

   The :mod:`wn.morphy` module is a basic English lemmatizer included
   with Wn.

.. _lemmatization:
Lemmatization
-------------
When querying a wordnet with wordforms from natural language text, it
is important to be able to find entries for inflected forms as the
database generally contains only lemmatic forms, or *lemmas* (or
*lemmata*, if you prefer irregular plurals).
>>> import wn
>>> en = wn.Wordnet('oewn:2021')
>>> en.words('plurals')
[]
>>> en.words('plural')
[Word('oewn-plural-a'), Word('oewn-plural-n')]
Lemmas are sometimes called *citation forms* or *dictionary forms* as
they are often used as the head words in dictionary entries. In
Natural Language Processing (NLP), *lemmatization* is a technique
where a possibly inflected word form is transformed to yield a
lemma. In Wn, this concept is generalized somewhat to mean a
transformation that yields a form matching wordforms stored in the
database. For example, the English word *sparrows* is the plural
inflection of *sparrow*, while the word *leaves* is ambiguous between
the plural inflection of the nouns *leaf* and *leave* and the
3rd-person singular inflection of the verb *leave*.
For tasks where high accuracy is needed, wrapping the wordnet queries
with external tools that handle tokenization, lemmatization, and
part-of-speech tagging will likely yield the best results, as this
method can make use of word context. That is, something like this:

.. code-block:: python

   # fancy_shmancy_analysis() stands in for an external NLP pipeline
   for lemma, pos in fancy_shmancy_analysis(corpus):
       synsets = en.synsets(lemma, pos=pos)

For modest needs, however, Wn provides a way to integrate basic
lemmatization directly into the queries.
Lemmatization in Wn works as follows: if a :class:`wn.Wordnet` object
is instantiated with a *lemmatizer* argument, then queries involving
wordforms (e.g., :meth:`wn.Wordnet.words`, :meth:`wn.Wordnet.senses`,
:meth:`wn.Wordnet.synsets`) will first lemmatize the wordform and then
check all resulting wordforms and parts of speech against the
database as successive queries.
Lemmatization Functions
'''''''''''''''''''''''
The *lemmatizer* argument of :class:`wn.Wordnet` is a callable that
takes two arguments: (1) the original wordform as a string, and (2) a
part of speech as a string or :python:`None`. It returns a dictionary
mapping parts of speech to sets of lemmatized wordforms. The signature
is as follows:

.. code-block:: python

   lemmatizer(s: str, pos: str | None) -> Dict[str | None, Set[str]]

The part of speech may be used by the function to determine which
morphological rules to apply. If the given part of speech is
:python:`None`, then it is not specified and any rule may apply. A
lemmatizer that only deinflects should not change any specified
part of speech, but this is not a requirement, and a function could be
provided that undoes derivational morphology (e.g., *democratic* →
*democracy*).
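
As an illustration of this signature, the following is a minimal sketch
of a lemmatizer that strips a few English suffixes. The name
``naive_lemmatizer`` and the ``RULES`` table are hypothetical and are not
part of Wn. The sketch makes no attempt to validate its candidates, so it
overgenerates.

```python
from typing import Dict, Optional, Set

# Hypothetical suffix rules: (part of speech, suffix, replacement).
RULES = [
    ("n", "ies", "y"),
    ("n", "es", ""),
    ("n", "s", ""),
    ("v", "ies", "y"),
    ("v", "es", ""),
    ("v", "s", ""),
    ("v", "ed", ""),
    ("v", "ing", ""),
]


def naive_lemmatizer(
    s: str, pos: Optional[str] = None
) -> Dict[Optional[str], Set[str]]:
    """Map each part of speech to a set of candidate lemmas for *s*."""
    # In this sketch the original form is kept under the requested pos.
    result: Dict[Optional[str], Set[str]] = {pos: {s}}
    for rule_pos, suffix, replacement in RULES:
        if pos is not None and pos != rule_pos:
            continue  # a specified part of speech restricts the rules
        if s.endswith(suffix) and len(s) > len(suffix):
            candidate = s[: len(s) - len(suffix)] + replacement
            result.setdefault(rule_pos, set()).add(candidate)
    return result
```

Because the function only proposes candidates, a non-word such as *leav*
(from *leaves*) is harmless as long as it matches nothing in the
database.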
Querying With Lemmatization
'''''''''''''''''''''''''''
As the needs of lemmatization differ from one language to another, Wn
does not provide a lemmatizer by default, and therefore it is
unavailable to the convenience functions :func:`wn.words`,
:func:`wn.senses`, and :func:`wn.synsets`. A lemmatizer can be added
to a :class:`wn.Wordnet` object. For example, using :mod:`wn.morphy`:
>>> import wn
>>> from wn.morphy import Morphy
>>> en = wn.Wordnet('oewn:2021', lemmatizer=Morphy())
>>> en.words('sparrows')
[Word('oewn-sparrow-n')]
>>> en.words('leaves')
[Word('oewn-leave-v'), Word('oewn-leaf-n'), Word('oewn-leave-n')]
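
The way a lemmatizer's output can drive successive queries may be
sketched as follows. This is not Wn's implementation: ``TOY_INDEX``,
``toy_lemmatizer``, and ``lookup`` are hypothetical names, and a small
dictionary stands in for the database.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical stand-in for the database: (wordform, pos) -> entry ID.
TOY_INDEX: Dict[Tuple[str, str], str] = {
    ("leaf", "n"): "toy-leaf-n",
    ("leave", "n"): "toy-leave-n",
    ("leave", "v"): "toy-leave-v",
}

Lemmatizer = Callable[[str, Optional[str]], Dict[Optional[str], Set[str]]]


def toy_lemmatizer(
    s: str, pos: Optional[str] = None
) -> Dict[Optional[str], Set[str]]:
    """Hard-coded lemmatizer for the example; not a real analyzer."""
    table = {"leaves": {"n": {"leaf", "leave"}, "v": {"leave"}}}
    candidates = table.get(s, {})
    if pos is not None:
        return {pos: candidates.get(pos, set()) | {s}}
    return {None: {s}, **candidates}


def lookup(
    form: str, pos: Optional[str], lemmatizer: Optional[Lemmatizer]
) -> List[str]:
    """Try every (lemma, pos) pair from the lemmatizer, one query each."""
    expanded = lemmatizer(form, pos) if lemmatizer else {pos: {form}}
    found: Set[str] = set()
    for cand_pos, lemmas in expanded.items():
        for lemma in lemmas:
            for (idx_form, idx_pos), entry in TOY_INDEX.items():
                # A pos of None matches entries of any part of speech.
                if idx_form == lemma and cand_pos in (None, idx_pos):
                    found.add(entry)
    return sorted(found)
```

Calling ``lookup('leaves', None, toy_lemmatizer)`` collects the entries
for *leaf* and both entries for *leave*, while passing ``None`` as the
lemmatizer leaves the query unexpanded and finds nothing.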
Querying Without Lemmatization
''''''''''''''''''''''''''''''
When lemmatization is not used, inflected terms may not return any
results:
>>> en = wn.Wordnet('oewn:2021')
>>> en.words('sparrows')
[]
Depending on the lexicon, there may be situations where results are
returned for inflected forms, such as when the inflected form is
lexicalized as its own entry:
>>> en.words('glasses')
[Word('oewn-glasses-n')]
Results may also be returned if the lexicon lists the inflected form
as an alternative form. For example, the English Wordnet lists
irregular inflections as alternative forms:
>>> en.words('lemmata')
[Word('oewn-lemma-n')]
See below for excluding alternative forms from such queries.
.. _alternative-forms:
Alternative Forms in the Database
---------------------------------
A lexicon may include alternative forms in addition to lemmas for each
word, and by default these are included in queries. What exactly is
included as an alternative form depends on the lexicon. The English
Wordnet, for example, adds irregular inflections (or "exceptional
forms"), while the Japanese Wordnet includes the same word in multiple
orthographies (original, hiragana, katakana, and two romanizations).
For the English Wordnet, this means that you might get basic
lemmatization for irregular forms only:
>>> en = wn.Wordnet('oewn:2021')
>>> en.words('learnt', pos='v')
[Word('oewn-learn-v')]
>>> en.words('learned', pos='v')
[]
If this is undesirable, the alternative forms can be excluded from
queries with the *search_all_forms* parameter:
>>> en = wn.Wordnet('oewn:2021', search_all_forms=False)
>>> en.words('learnt', pos='v')
[]
>>> en.words('learned', pos='v')
[]
.. _normalization:
Normalization
-------------
While lemmatization deals with morphological variants of words,
normalization handles minor orthographic variants. Normalized forms,
however, may be invalid as wordforms in the target language, and as
such they are only used behind the scenes for query expansion and not
presented to users. For instance, a user might attempt to look up
*résumé* in the English wordnet, but the wordnet only contains the
form without diacritics: *resume*. With strict string matching, the
entry would not be found using the wordform in the query. By
normalizing the query word, the entry can be found. Similarly in the
Spanish wordnet, *soñar* (to dream) and *sonar* (to ring) are two
different words. A user who types *soñar* likely does not want to get
results for *sonar*, but one who types *sonar* may be a non-Spanish
speaker who is unaware of the missing diacritic or does not have an
input method that allows them to type the diacritic, so this query
would return both entries by matching against the normalized forms in
the database. Wn handles all of these use cases.
When a lexicon is added to the database, potentially two wordforms are
inserted for every one in the lexicon: the original wordform and a
normalized form. When querying against the database, the original
query string is first compared with the original wordforms and, if
normalization is enabled, with the normalized forms in the database as
well. If this first attempt yields no results and if normalization is
enabled, the query string is normalized and tried again.
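
The two-stage lookup described above can be sketched with an in-memory
index. This is only an illustration under stated assumptions:
``ToyLexicon`` and ``toy_normalizer`` are hypothetical names, the
normalizer handles just two accented letters, and Wn's actual storage
and queries differ.

```python
from typing import Callable, Dict, List, Optional, Set

# Assumed toy normalizer table: ñ -> n, é -> e.
_TABLE = str.maketrans({"\u00f1": "n", "\u00e9": "e"})


def toy_normalizer(s: str) -> str:
    """Lowercase *s* and replace the two accented letters in _TABLE."""
    return s.lower().translate(_TABLE)


class ToyLexicon:
    """Hypothetical index holding original and normalized forms."""

    def __init__(self, forms: List[str]) -> None:
        self.original: Dict[str, Set[str]] = {}
        self.normalized: Dict[str, Set[str]] = {}
        for form in forms:
            self.original.setdefault(form, set()).add(form)
            norm = toy_normalizer(form)
            if norm != form:
                # A normalized form is stored only when it differs.
                self.normalized.setdefault(norm, set()).add(form)

    def find(
        self, query: str, normalizer: Optional[Callable[[str], str]]
    ) -> List[str]:
        # First attempt: the query string exactly as typed.
        hits = set(self.original.get(query, set()))
        if normalizer is not None:
            hits |= self.normalized.get(query, set())
            if not hits:
                # Back-off: normalize the query and try again.
                norm = normalizer(query)
                hits |= self.original.get(norm, set())
                hits |= self.normalized.get(norm, set())
        return sorted(hits)
```

With a lexicon containing *soñar* and *sonar*, the query *sonar* matches
both entries when a normalizer is given, and *soñar* matches only
itself, because the back-off step runs only when the first attempt finds
nothing.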
Normalization Functions
'''''''''''''''''''''''
The normalized form is obtained from a *normalizer* function, passed
as an argument to :class:`wn.Wordnet`, that takes a single string
argument and returns a string. That is, a function with the following
signature:

.. code-block:: python

   normalizer(s: str) -> str

While custom *normalizer* functions could be used, in practice the
choice is either the default normalizer or :python:`None`. The default
normalizer works by downcasing the string and applying NFKD_
normalization, which decomposes accented characters so that the
diacritics can be removed. If the normalized form is the same
as the original, only the original is inserted into the database.

.. table:: Examples of normalization
   :align: center

   =============  ===============
   Original Form  Normalized Form
   =============  ===============
   résumé         resume
   soñar          sonar
   San José       san jose
   ハラペーニョ   ハラヘーニョ
   =============  ===============

.. _NFKD: https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms
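
A normalizer matching the description above might look like the
following sketch, which uses only the standard :mod:`unicodedata`
module. The name ``sketch_normalizer`` is hypothetical, and the function
is an approximation of the described behavior, not necessarily the code
Wn uses.

```python
import unicodedata


def sketch_normalizer(s: str) -> str:
    """Lowercase *s*, decompose it with NFKD, and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", s.lower())
    # After decomposition, diacritics are separate combining characters.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


for form in ["résumé", "soñar", "San José", "ハラペーニョ"]:
    print(form, "->", sketch_normalizer(form))
```

The function has the ``normalizer(s: str) -> str`` signature shown
earlier. In the last example the handakuten of ペ is a combining mark
after decomposition, which is why the normalized form contains ヘ.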
Querying With Normalization
'''''''''''''''''''''''''''
By default, normalization is enabled when a :class:`wn.Wordnet` is
created. Enabling normalization does two things: it allows queries to
check the original wordform in the query against the normalized forms
in the database and, if no results are returned in the first step, it
allows the queried wordform to be normalized as a back-off technique.
>>> en = wn.Wordnet('oewn:2021')
>>> en.words('résumé')
[Word('oewn-resume-n'), Word('oewn-resume-v')]
>>> es = wn.Wordnet('omw-es:1.4')
>>> es.words('soñar')
[Word('omw-es-soñar-v')]
>>> es.words('sonar')
[Word('omw-es-sonar-v'), Word('omw-es-soñar-v')]

.. note::

   Users may supply a custom *normalizer* function to the
   :class:`wn.Wordnet` object, but currently this is discouraged as
   the result is unlikely to match normalized forms in the database
   and there is not yet a way to customize the normalization of forms
   added to the database.

Querying Without Normalization
''''''''''''''''''''''''''''''
Normalization can be disabled by passing :python:`None` as the
argument of the *normalizer* parameter of :class:`wn.Wordnet`. The
queried wordform will not be checked against normalized forms in the
database and neither will it be normalized as a back-off technique.
>>> en = wn.Wordnet('oewn:2021', normalizer=None)
>>> en.words('résumé')
[]
>>> es = wn.Wordnet('omw-es:1.4', normalizer=None)
>>> es.words('soñar')
[Word('omw-es-soñar-v')]
>>> es.words('sonar')
[Word('omw-es-sonar-v')]

.. note::

   It is not possible to disable normalization for the convenience
   functions :func:`wn.words`, :func:`wn.senses`, and
   :func:`wn.synsets`.