Interlingual Queries
====================

This guide explains how interlingual queries work within Wn. To get
started, you'll need at least two lexicons that use interlingual
indices (ILIs). For this guide, we'll use the Open English WordNet
(``oewn:2024``), the Open German WordNet (``odenet:1.4``), also
known as OdeNet, and the Japanese wordnet (``omw-ja:1.4``).

>>> import wn
>>> wn.download('oewn:2024')
>>> wn.download('odenet:1.4')
>>> wn.download('omw-ja:1.4')

We will query these wordnets with the following :class:`~wn.Wordnet`
objects:

>>> en = wn.Wordnet('oewn:2024')
>>> de = wn.Wordnet('odenet:1.4')

The object for the Japanese wordnet will be discussed and created
below, in :ref:`cross-lingual-relation-traversal`.

What are Interlingual Indices?
------------------------------

It is common for users of the `Princeton WordNet
<https://wordnet.princeton.edu/>`_ to refer to synsets by their `WNDB
<https://wordnet.princeton.edu/documentation/wndb5wn>`_ offset and type,
but this is problematic because the offset is a byte-offset in the
wordnet data files and it will differ for wordnets in other languages
and even between versions of the same wordnet. Interlingual indices
(ILIs) address this issue by providing stable identifiers for concepts,
whether for a synset across versions of a wordnet or across languages.

The idea of ILIs was proposed by [Vossen99]_ and it came to fruition
with the release of the Collaborative Interlingual Index (CILI;
[Bond16]_). CILI therefore represents an instance of, and a namespace
for, ILIs. There could, in theory, be alternative indexes for
particular domains (e.g., names of people or places), but currently
there is only the one.

As an example, the synset for *apricot* (fruit) in WordNet 3.0 is
``07750872-n``, but it is ``07766848-n`` in WordNet 3.1. In OdeNet
1.4, which is not released in the WNDB format and therefore doesn't
use offsets at all, it is ``13235-n`` for the equivalent word
(*Aprikose*). However, all three use the same ILI: ``i77784``.

Generally, only one synset within a wordnet will be mapped to a
particular ILI, but this may not always be true, nor does every synset
necessarily map to an ILI. Some concepts that are lexicalized in one
language may not be in another language. For example, *rice* in English
may refer to the rice plant, rice grain, or cooked rice, but in
languages like Japanese they are distinct things (稲 *ine*, 米 *kome*,
and 飯 *meshi* / ご飯 *gohan*, respectively).

The ``ili`` property of Synsets serves two purposes in Wn. Mainly it is
for encoding the ILI identifier associated with the synset, but it is
also used to indicate when a lexicon is proposing a new concept that is
not yet part of CILI. In the latter case, a WN-LMF lexicon file will
have the special value of ``in`` for a synset's ILI and it will provide
an ``<ILIDefinition>`` element. In Wn, this translates to
:attr:`wn.Synset.ili` returning :python:`None`, the same as if no ILI
were mapped at all. Both synsets with proposed ILIs and those with no
ILI cannot be used in interlingual queries. Proposed ILIs can be
inspected with the :func:`wn.ili.get_proposed` function, if you have
the synset, or :func:`wn.ili.get_all_proposed` to get all of them.
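How these identifiers line up across wordnets can be sketched with
plain dictionaries. This is a toy model for illustration only: the
table names and structure below are invented and are not Wn's actual
storage, though the IDs for *apricot* are the ones discussed above.

```python
# Toy model: each lexicon maps its own synset IDs to ILIs. The
# lexicon-internal IDs are incomparable; the ILI is the shared key.
SYNSET_TO_ILI = {
    "wn30": {"07750872-n": "i77784"},   # WordNet 3.0 'apricot' (fruit)
    "wn31": {"07766848-n": "i77784"},   # same concept, different offset
    "odenet": {"13235-n": "i77784"},    # 'Aprikose'; no offsets at all
}

def synsets_for_ili(lexicon: str, ili: str) -> list[str]:
    """Find a lexicon's synsets mapped to an ILI (usually zero or one)."""
    return [sid for sid, i in SYNSET_TO_ILI[lexicon].items() if i == ili]

# The three lexicon-specific IDs all resolve through the same ILI:
assert synsets_for_ili("wn31", "i77784") == ["07766848-n"]
assert synsets_for_ili("odenet", "i77784") == ["13235-n"]
# A concept a lexicon does not cover simply yields no synsets:
assert synsets_for_ili("odenet", "i00000") == []
```

The point of the sketch is that offsets and other lexicon-internal IDs
only have meaning within one lexicon and version, while the ILI column
is stable across all of them.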

.. [Vossen99]
   Vossen, Piek, Wim Peters, and Julio Gonzalo.
   "Towards a universal index of meaning."
   In Proceedings of the ACL-99 Workshop, SIGLEX-99, Standardizing
   Lexical Resources, pp. 81-90. University of Maryland, 1999.

.. [Bond16]
   Bond, Francis, Piek Vossen, John Philip McCrae, and Christiane
   Fellbaum. "CILI: the Collaborative Interlingual Index."
   In Proceedings of the 8th Global WordNet Conference (GWC),
   pp. 50-57. 2016.

Using Interlingual Indices
--------------------------

For synsets that have an associated ILI, you can retrieve it via the
:attr:`wn.Synset.ili` property:

>>> apricot = en.synsets('apricot')[1]
>>> apricot.ili
'i77784'

The value is a :class:`str` ILI identifier. These may be used directly
for things like interlingual synset lookups:

>>> de.synsets(ili=apricot.ili)[0].lemmas()
['Marille', 'Aprikose']

There may be more information about the ILI itself which you can get
from the :mod:`wn.ili` module:

>>> from wn import ili
>>> apricot_ili = ili.get(apricot.ili)
>>> apricot_ili
ILI(id='i77784')

From this object you can get various properties of the ILI, such as
the ID string, its status, and its definition, but if you have
not added CILI to Wn's database, it will not be very informative:

>>> apricot_ili.id
'i77784'
>>> apricot_ili.status
'presupposed'
>>> apricot_ili.definition() is None
True

The ``presupposed`` status means that the ILI ID is in use by a
lexicon, but there is no other source of truth for the index. CILI can
be downloaded just like a lexicon:

>>> wn.download('cili:1.0')

Now the status and definition should be more useful:

>>> apricot_ili.status
'active'
>>> apricot_ili.definition()
'downy yellow to rosy-colored fruit resembling a small peach'

Translating Words, Senses, and Synsets
--------------------------------------

Rather than manually inserting the ILI IDs into Wn's lookup functions
as shown above, Wn provides the :meth:`wn.Synset.translate` method to
make it easier:

>>> apricot.translate(lexicon='odenet:1.4')
[Synset('odenet-13235-n')]

The method returns a list for two reasons: first, it's not guaranteed
that the target lexicon has only one synset with the ILI and, second,
you can translate to more than one lexicon at a time.
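Both reasons can be modeled in a small sketch where translation is
just an ILI lookup over the target lexicons. The dictionaries and the
``toy-fr`` lexicon below are invented for illustration; this is not
Wn's implementation.

```python
# Toy sketch: synset translation as an ILI lookup over target lexicons.
LEXICONS = {
    "odenet": {"13235-n": "i77784"},
    # A hypothetical lexicon where two synsets share one ILI:
    "toy-fr": {"0001-n": "i77784", "0002-n": "i77784"},
}

def translate(ili: str, lexicons: list[str]) -> list[str]:
    """Collect all synsets mapped to `ili` across the requested lexicons."""
    return [
        f"{lex}-{sid}"
        for lex in lexicons
        for sid, i in LEXICONS[lex].items()
        if i == ili
    ]

# One target lexicon, one match:
assert translate("i77784", ["odenet"]) == ["odenet-13235-n"]
# Two synsets sharing the ILI, plus a second lexicon -> three results:
assert translate("i77784", ["odenet", "toy-fr"]) == [
    "odenet-13235-n", "toy-fr-0001-n", "toy-fr-0002-n"
]
```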

:class:`~wn.Sense` objects also have a :meth:`~wn.Sense.translate`
method, returning a list of senses instead of synsets:

>>> de_senses = apricot.senses()[0].translate(lexicon='odenet:1.4')
>>> [s.word().lemma() for s in de_senses]
['Marille', 'Aprikose']

:class:`~wn.Word` objects have a :meth:`~wn.Word.translate` method,
too, but it works a bit differently. Since each word may be part of
multiple synsets, the method returns a mapping of each word sense to
the list of translated words:

>>> result = en.words('apricot')[0].translate(lexicon='odenet:1.4')
>>> for sense, de_words in result.items():
... print(sense, [w.lemma() for w in de_words])
...
Sense('oewn-apricot__1.20.00..') []
Sense('oewn-apricot__1.13.00..') ['Marille', 'Aprikose']
Sense('oewn-apricot__1.07.00..') ['lachsrosa', 'lachsfarbig', 'in Lachs', 'lachsfarben', 'lachsrot', 'lachs']

The three senses above are for *apricot* as a tree, a fruit, and a
color. OdeNet either does not have a synset for apricot trees or has
one that is not associated with the appropriate ILI, so no words could
be translated for that sense.
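The shape of that result can be sketched with invented data. The
placeholder sense IDs and the ``ili-…`` strings below are made up for
this illustration (they are not real ILIs); only ``i77784`` and the
German lemmas come from the examples above.

```python
# Toy sketch of word-level translation: each sense of the source word
# is translated independently via its synset's ILI, producing a
# mapping from sense to target-language lemmas.
SENSE_ILI = {                 # sense ID -> ILI of its synset
    "apricot-tree": "ili-tree",     # placeholder, not a real ILI
    "apricot-fruit": "i77784",
    "apricot-color": "ili-color",   # placeholder, not a real ILI
}
TARGET_BY_ILI = {             # ILI -> lemmas in the target lexicon
    "i77784": ["Marille", "Aprikose"],
    "ili-color": ["lachsfarben", "lachs"],
    # no entry for "ili-tree": the tree sense has no mapped synset
}

def translate_word(senses: dict[str, str]) -> dict[str, list[str]]:
    """Map each sense to its translations; untranslatable senses get []."""
    return {sense: TARGET_BY_ILI.get(ili, []) for sense, ili in senses.items()}

result = translate_word(SENSE_ILI)
assert result["apricot-fruit"] == ["Marille", "Aprikose"]
assert result["apricot-tree"] == []   # untranslatable sense -> empty list
```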

.. _cross-lingual-relation-traversal:

Cross-lingual Relation Traversal
--------------------------------

ILIs have a second use in Wn, which is relation traversal for wordnets
that depend on other lexicons, i.e., those created with the *expand*
methodology. These wordnets, such as many of those in the `Open
Multilingual Wordnet <https://github.com/omwn/>`_, do not include
synset relations on their own as they were built using the English
WordNet as their taxonomic scaffolding. Trying to load such a lexicon
when the lexicon it requires is not added to the database presents a
warning to the user:

>>> ja = wn.Wordnet('omw-ja:1.4')
[...] WnWarning: lexicon dependencies not available: omw-en:1.4
>>> ja.expanded_lexicons()
[]

.. warning::

   Do not rely on the presence of a warning to determine if the
   lexicon has its expand lexicon loaded. Python's default warning
   filter may only show the warning the first time it is
   encountered. Instead, inspect :meth:`wn.Wordnet.expanded_lexicons`
   to see if it is non-empty.

When a dependency is unmet, Wn only issues a warning, not an error,
and you can continue to use the lexicon as it is, but it won't be
useful for exploring relations such as hypernyms and hyponyms:

>>> anzu = ja.synsets(ili='i77784')[0]
>>> anzu.lemmas()
['アンズ', 'アプリコット', '杏']
>>> anzu.hypernyms()
[]

One way to resolve this issue is to install the lexicon it requires:

>>> wn.download('omw-en:1.4')
>>> ja = wn.Wordnet('omw-ja:1.4') # no warning
>>> ja.expanded_lexicons()
[<Lexicon omw-en:1.4 [en]>]

Wn will detect the dependency and load ``omw-en:1.4`` as the *expand*
lexicon for ``omw-ja:1.4`` when the former is in the database. You may
also specify an expand lexicon manually, even one that isn't the
specified dependency:

>>> ja = wn.Wordnet('omw-ja:1.4', expand='oewn:2024')  # no warning
>>> ja.expanded_lexicons()
[<Lexicon oewn:2024 [en]>]

In this case, the Open English WordNet is an actively-developed fork
of the lexicon that ``omw-ja:1.4`` depends on, and it should contain
all the relations, so you'll see little difference between using it
and ``omw-en:1.4``. This works because the relations are found using
ILIs and not synset offsets. You may still prefer to use the specified
dependency if you have strict compatibility needs, such as for
experiment reproducibility and/or compatibility with the `NLTK
<https://nltk.org>`_. Using some other lexicon as the expand lexicon
may yield very different results. For instance, ``odenet:1.4`` is much
smaller than the English wordnets and has fewer relations, so it would
not be a good substitute for ``omw-ja:1.4``'s expand lexicon.

When an appropriate expand lexicon is loaded, relations between
synsets, such as hypernyms, are more likely to be present:

>>> anzu = ja.synsets(ili='i77784')[0]  # recreate the synset object
>>> anzu.hypernyms()
[Synset('omw-ja-07705931-n')]
>>> anzu.hypernyms()[0].lemmas()
['果物']
>>> anzu.hypernyms()[0].translate(lexicon='oewn:2024')[0].lemmas()
['edible fruit']
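The ILI-mediated traversal described in this section can be sketched
with plain dictionaries. This is a conceptual model only, not Wn's
implementation, and the synset IDs and the ``i-fruit`` hypernym ILI
are invented placeholders; only ``i77784`` is a real ILI from the
examples above.

```python
# Toy sketch of expand-lexicon relation traversal. The dependent
# lexicon has no relation edges of its own; hypernym edges live in the
# expand lexicon, keyed by ILI. Traversal therefore goes:
# synset -> its ILI -> edge in the expand lexicon -> target ILI
#        -> synsets back in the dependent lexicon.
JA = {"anzu-n": "i77784", "kudamono-n": "i-fruit"}   # synset -> ILI
EXPAND_HYPERNYMS = {"i77784": "i-fruit"}             # ILI -> hypernym ILI

def hypernyms(synset: str, lexicon: dict, expand: dict) -> list[str]:
    """Follow a hypernym edge through the expand lexicon via ILIs."""
    target_ili = expand.get(lexicon[synset])
    return [sid for sid, ili in lexicon.items() if ili == target_ili]

assert hypernyms("anzu-n", JA, EXPAND_HYPERNYMS) == ["kudamono-n"]
# Without an expand lexicon there are no edges to follow:
assert hypernyms("anzu-n", JA, {}) == []
```

Because the edges are keyed by ILI rather than by lexicon-internal
synset IDs, any expand lexicon whose synsets carry the same ILIs can
supply the relations, which is why ``oewn:2024`` can stand in for
``omw-en:1.4`` above.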