1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281
|
Working with Lexicons
=====================
Terminology
-----------
In Wn, the following terminology is used:
:lexicon: An inventory of words, senses, synsets, relations, etc. that
share a namespace (i.e., that can refer to each other).
:wordnet: A group of lexicons (but usually just one).
:resource: A file containing lexicons.
:package: A directory containing a resource and optionally some
metadata files.
:collection: A directory containing packages and optionally some
metadata files.
:project: A general term for a resource, package, or collection,
particularly pertaining to its creation, maintenance, and
distribution.
In general, each resource contains one lexicon. For large projects
like the `Open English WordNet`_, that lexicon is also a wordnet on
its own. For a collection like the `Open Multilingual Wordnet`_, most
lexicons do not include relations as they are instead expected to use
those from the OMW's included English wordnet, which is derived from
the `Princeton WordNet`_. As such, a wordnet for these sub-projects is
best thought of as the grouping of the lexicon with the lexicon
providing the relations.
.. _Open English WordNet: https://en-word.net
.. _Open Multilingual Wordnet: https://github.com/omwn/
.. _Princeton WordNet: https://wordnet.princeton.edu/
.. _lexicon-specifiers:
Lexicon and Project Specifiers
------------------------------
Wn uses *lexicon specifiers* to deal with the possibility of having
multiple lexicons and multiple versions of lexicons loaded in the same
database. The specifiers are the joining of a lexicon's name (ID) and
version, delimited by ``:``. Here are the possible forms:
.. code-block:: none
* -- any/all lexicons
id -- the most recently added lexicon with the given id
id:* -- all lexicons with the given id
id:version -- the lexicon with the given id and version
*:version -- all lexicons with the given version
For example, if ``ewn:2020`` was installed followed by ``ewn:2019``,
then ``ewn`` would specify the ``2019`` version, ``ewn:*`` would
specify both versions, and ``ewn:2020`` would specify the ``2020``
version.
The same format is used for *project specifiers*, which refer to
projects as defined in Wn's index. In most cases the project specifier
is the same as the lexicon specifier (e.g., ``ewn:2020`` refers both
to the project to be downloaded and the lexicon that is installed),
but sometimes it is not. The 1.4 release of the `Open Multilingual
Wordnet`_, for instance, has the project specifier ``omw:1.4`` but it
installs a number of lexicons with their own lexicon specifiers
(``omw-zsm:1.4``, ``omw-cmn:1.4``, etc.). When only an id is given
(e.g., ``ewn``), a project specifier gets the *first* version listed
in the index (in the default index, conventionally, the first version
is the latest release).
.. _lexicon-filters:
Filtering Queries with Lexicons
-------------------------------
Queries against the database will search all installed lexicons unless
they are filtered by ``lang`` or ``lexicon`` arguments:
>>> import wn
>>> len(wn.words())
1538449
>>> len(wn.words(lang="en"))
318289
>>> len(wn.words(lexicon="oewn:2024"))
161705
The ``lexicon`` parameter can also take multiple specifiers so you can
include things like lexicon extensions or to explicitly include
multiple lexicons:
>>> len(wn.words(lexicon="oewn:2024 omw-en:1.4"))
318289
If a lexicon selected by the ``lexicon`` or ``lang`` arguments
specifies a dependency, the dependency is automatically added as an
*expand* lexicon. Explicitly set :python:`expand=''` to disable this
behavior:
>>> wn.lexicons(lexicon="omw-es:1.4")[0].requires() # omw-es requires omw-en
{'omw-en:1.4': <Lexicon omw-en:1.4 [en]>}
>>> es = wn.Wordnet("omw-es:1.4")
>>> es.lexicons()
[<Lexicon omw-es:1.4 [es]>]
>>> es.expanded_lexicons() # omw-en automatically added
[<Lexicon omw-en:1.4 [en]>]
>>> es_no_en = wn.Wordnet("omw-es:1.4", expand='')
>>> es_no_en.lexicons()
[<Lexicon omw-es:1.4 [es]>]
>>> es_no_en.expanded_lexicons() # no expand lexicons
[]
Also see :ref:`cross-lingual-relation-traversal` for
selecting expand lexicons for relations.
The objects returned by queries retain the "lexicon configuration"
used, which includes the lexicons and expand lexicons. This
configuration determines which lexicons are searched during secondary
queries. The lexicon configuration also stores a flag indicating
whether no lexicon filters were used at all, which triggers
:ref:`default mode <default-mode>` secondary queries.
.. _default-mode:
Default Mode Queries
--------------------
A special "default mode" is activated when making a module-function
query (:func:`wn.words`, :func:`wn.synsets`, etc.) or instantiating a
:class:`wn.Wordnet` object with no ``lexicon`` or ``lang`` argument
(so-named because the mode is triggered by using the default values of
``lexicon`` and ``lang``):
>>> w = wn.Wordnet()
>>> wn.words("pineapple") # for example
Default-mode causes the following behavior:
1. Primary queries search any installed lexicon
2. Secondary queries only search the lexicon of the primary entity
(e.g., :meth:`Synset.words` only finds words from the same lexicon
as the synset). If the lexicon has any extensions or is itself an
extension, any extension/base lexicons are also included.
3. If the ``expand`` argument is :python:`None` (always true for
module functions like :func:`wn.synsets`), all installed lexicons
are used as expand lexicons for relations queries.
.. warning::
Default-mode queries are not reproducible as the results can change
as lexicons are added or removed from the database. For anything
more than a casual query, it is highly suggested to instead create
a :class:`wn.Wordnet` object with fully-specified ``lexicon`` and
``expand`` arguments.
Downloading Lexicons
--------------------
Use :py:func:`wn.download` to download lexicons from the web given
either an indexed project specifier or the URL of a resource, package,
or collection.
>>> import wn
>>> wn.download('odenet') # get the latest Open German WordNet
>>> wn.download('odenet:1.3') # get the 1.3 version
>>> # download from a URL
>>> wn.download('https://github.com/omwn/omw-data/releases/download/v1.4/omw-1.4.tar.xz')
The project specifier is only used to retrieve information from Wn's
index. The lexicon IDs of the corresponding resource files are what is
stored in the database.
Adding Local Lexicons
---------------------
Lexicons can be added from local files with :py:func:`wn.add`:
>>> wn.add('~/data/omw-1.4/omw-nb/omw-nb.xml')
Or with the parent directory as a package:
>>> wn.add('~/data/omw-1.4/omw-nb/')
Or with the grandparent directory as a collection (installing all
packages contained by the collection):
>>> wn.add('~/data/omw-1.4/')
Or from a compressed archive of one of the above:
>>> wn.add('~/data/omw-1.4/omw-nb/omw-nb.xml.xz')
>>> wn.add('~/data/omw-1.4/omw-nb.tar.xz')
>>> wn.add('~/data/omw-1.4.tar.xz')
Listing Installed Lexicons
--------------------------
If you wish to see which lexicons have been added to the database,
:py:func:`wn.lexicons()` returns the list of :py:class:`wn.Lexicon`
objects that describe each one.
>>> for lex in wn.lexicons():
... print(f'{lex.id}:{lex.version}\t{lex.label}')
...
omw-en:1.4 OMW English Wordnet based on WordNet 3.0
omw-nb:1.4 Norwegian Wordnet (Bokmål)
odenet:1.3 Offenes Deutsches WordNet
ewn:2020 English WordNet
ewn:2019 English WordNet
Removing Lexicons
-----------------
Lexicons can be removed from the database with :py:func:`wn.remove`:
>>> wn.remove('omw-nb:1.4')
Note that this removes a single lexicon and not a project, so if, for
instance, you've installed a multi-lexicon project like ``omw``, you
will need to remove each lexicon individually or use a star specifier:
>>> wn.remove('omw-*:1.4')
WN-LMF Files, Packages, and Collections
---------------------------------------
Wn can handle projects with 3 levels of structure:
* WN-LMF XML files
* WN-LMF packages
* WN-LMF collections
WN-LMF XML Files
''''''''''''''''
A WN-LMF XML file is a file with a ``.xml`` extension that is valid
according to the `WN-LMF specification
<https://github.com/globalwordnet/schemas/>`_.
WN-LMF Packages
'''''''''''''''
If one needs to distribute metadata or additional files along with
WN-LMF XML file, a WN-LMF package allows them to include the files in
a directory. The directory should contain exactly one ``.xml`` file,
which is the WN-LMF XML file. In addition, it may contain additional
files and Wn will recognize three of them:
:``LICENSE`` (``.txt`` | ``.md`` | ``.rst`` ): the full text of the license
:``README`` (``.txt`` | ``.md`` | ``.rst`` ): the project README
:``citation.bib``: a BibTeX file containing academic citations for the project
.. code-block::
omw-sq/
├── omw-sq.xml
├── LICENSE.txt
└── README.md
WN-LMF Collections
''''''''''''''''''
In some cases a project may manage multiple resources and distribute
them as a collection. A collection is a directory containing
subdirectories which are WN-LMF packages. The collection may contain
its own README, LICENSE, and citation files which describe the project
as a whole.
.. code-block::
omw-1.4/
├── omw-sq
│ ├── oms-sq.xml
│ ├── LICENSE.txt
│ └── README.md
├── omw-lt
│ ├── citation.bib
│ ├── LICENSE
│ └── omw-lt.xml
├── ...
├── citation.bib
├── LICENSE
└── README.md
|