wn.ic
=====

.. automodule:: wn.ic

The mathematical formulae for information content are defined in
`Formal Description`_, and the corresponding Python API functions are
described in `Calculating Information Content`_. These functions
require information content weights obtained either by `computing them
from a corpus <Computing Corpus Weights_>`_, or by `loading
pre-computed weights from a file <Reading Pre-computed Information
Content Files_>`_.

.. note::

   The term *information content* can be ambiguous. It often, and most
   accurately, refers to the result of the :func:`information_content`
   function (:math:`\text{IC}(c)` in the mathematical notation), but
   is also sometimes used to refer to the corpus frequencies/weights
   (:math:`\text{freq}(c)` in the mathematical notation) returned by
   :func:`load` or :func:`compute`, as these weights are the basis of
   the value computed by :func:`information_content`. The Wn
   documentation tries to consistently refer to the former as the
   *information content value*, or just *information content*, and the
   latter as *information content weights*, or *weights*.


Formal Description
------------------

The Information Content (IC) of a concept (synset) is a measure of its
specificity computed from the wordnet's taxonomy structure and corpus
frequencies. It is defined by Resnik 1995 ([RES95]_), following
information theory, as the negative log-probability of a concept:

.. math::

   \text{IC}(c) = -\log{p(c)}

A concept's probability is the empirical probability over a corpus:

.. math::

   p(c) = \frac{\text{freq}(c)}{N}

Here, :math:`N` is the total count of words of the same category as
concept :math:`c` ([RES95]_ only considered nouns) where each word has
some representation in the wordnet, and :math:`\text{freq}` is defined
as the sum of corpus counts of words in :math:`\text{words}(c)`, which
is the set of words subsumed by concept :math:`c`:

.. math::

   \text{freq}(c) = \sum_{w \in \text{words}(c)}{\text{count}(w)}

It is common for :math:`\text{freq}` to not contain actual frequencies
but instead weights distributed evenly among the synsets for a
word. These weights are calculated as the word frequency divided by
the number of synsets for the word:

.. math::

   \text{freq}_{\text{distributed}}(c)
   = \sum_{w \in \text{words}(c)}{\frac{\text{count}(w)}{|\text{synsets}(w)|}}
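
The definitions above can be traced with a small worked example. This
is only a sketch in plain Python: the counts, the concept names, and
the helper names (``freq``, ``freq_distributed``, ``ic_value``) are
hypothetical and are not part of the Wn API, and the natural logarithm
is assumed.

.. code-block:: python

   import math
   from collections import Counter

   # Hypothetical corpus counts: count(w)
   counts = Counter({"almond": 4, "olive": 6, "fruit": 10})

   # Hypothetical synsets(w): the synsets each word belongs to
   synsets_of = {
       "almond": ["almond-nut", "almond-tree"],
       "olive": ["olive-fruit"],
       "fruit": ["fruit"],
   }

   # Hypothetical words(c): the words subsumed by each concept
   words_of = {
       "fruit": ["fruit", "almond", "olive"],
       "almond-nut": ["almond"],
   }

   def freq(concept):
       """Sum of the corpus counts of the words subsumed by the concept."""
       return sum(counts[w] for w in words_of[concept])

   def freq_distributed(concept):
       """Like freq(), but each count is split evenly over the word's synsets."""
       return sum(counts[w] / len(synsets_of[w]) for w in words_of[concept])

   def ic_value(concept_freq, n):
       """Negative log-probability, with p(c) = freq(c) / N."""
       return -math.log(concept_freq / n)

   N = sum(counts.values())  # total word count for this toy corpus

   almond_ic = ic_value(freq("almond-nut"), N)  # p = 4 / 20
   fruit_ic = ic_value(freq("fruit"), N)        # p = 20 / 20

In this toy data the more specific concept gets the higher value:
``almond_ic`` is positive, while the concept that subsumes every word
has a probability of 1 and an information content of 0.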

.. [RES95] Resnik, Philip. "Using information content to evaluate
   semantic similarity." In Proceedings of the 14th International
   Joint Conference on Artificial Intelligence (IJCAI-95), Montreal,
   Canada, pp. 448-453. 1995.


Example
-------

In the Princeton WordNet 3.0 (hereafter *WordNet*, but note that the
equivalent lexicon in Wn is the *OMW English Wordnet based on WordNet
3.0* with specifier ``omw-en:1.4``), the frequency of a concept like
**stone fruit** is not just the number of occurrences of *stone
fruit*, but also includes the counts of the words for its hyponyms
(*almond*, *olive*, etc.) and other taxonomic descendants (*Jordan
almond*, *green olive*, etc.). The word *almond* has two synsets: one
for the fruit or nut, another for the plant. Thus, if the word
*almond* is encountered :math:`n` times in a corpus, then the weight
(either the frequency :math:`n` or distributed weight
:math:`\frac{n}{2}`) is added to the total weights for both synsets
and to those of their ancestors, but not to those of descendant
synsets such as **Jordan almond**. The fruit/nut synset of *almond*
has two hypernym paths that converge on **fruit**:

1. **almond** ⊃ **stone fruit** ⊃ **fruit**
2. **almond** ⊃ **nut** ⊃ **seed** ⊃ **fruit**

The weight is added to each ancestor (**stone fruit**, **nut**,
**seed**, **fruit**, ...) exactly once. That is, the convergent
ancestor **fruit** receives the weight only once, not once per path.
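
The once-per-ancestor rule can be sketched by collecting the ancestors
into a set before adding the weight. The graph below is a hypothetical
stand-in that only mirrors the two paths above; the names
``hypernyms``, ``ancestors``, and ``add_weight`` are illustrative and
this is not how Wn necessarily implements it.

.. code-block:: python

   from collections import defaultdict

   # Hypothetical hypernym graph mirroring the two paths above
   hypernyms = {
       "almond": ["stone fruit", "nut"],
       "stone fruit": ["fruit"],
       "nut": ["seed"],
       "seed": ["fruit"],
       "fruit": [],
   }

   def ancestors(synset):
       """Return every ancestor of the synset, each one only once."""
       seen = set()
       agenda = list(hypernyms[synset])
       while agenda:
           node = agenda.pop()
           if node not in seen:
               seen.add(node)
               agenda.extend(hypernyms[node])
       return seen

   def add_weight(weights, synset, weight):
       """Add the weight to the synset and to each unique ancestor."""
       weights[synset] += weight
       for ancestor in ancestors(synset):
           weights[ancestor] += weight

   weights = defaultdict(float)
   add_weight(weights, "almond", 3.0)

Because ``ancestors`` returns a set, **fruit** ends up with a weight of
``3.0`` rather than ``6.0`` even though two paths reach it.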


Calculating Information Content
-------------------------------

.. autofunction:: information_content
.. autofunction:: synset_probability


Computing Corpus Weights
------------------------

If pre-computed weights are not available for a wordnet or for some
domain, they can be computed given a corpus and a wordnet.

The corpus is an iterable of words. For large corpora it may help to
use a generator for this iterable, but the entire vocabulary (i.e.,
unique words and counts) will be held at once in memory. Multi-word
expressions are also possible if they exist in the wordnet. For
instance, WordNet has *stone fruit*, with a single space delimiting
the words, as an entry.
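
As an illustration, a corpus iterable could be produced lazily by a
generator such as the one below. This is a rough sketch: the
``iter_words`` helper, its whitespace tokenization, its lowercasing,
and its handling of two-word expressions are all assumptions made for
the example and are not provided by Wn.

.. code-block:: python

   from collections import Counter

   def iter_words(lines, multiword=frozenset()):
       """Lazily yield lowercased tokens, joining known two-word expressions."""
       for line in lines:
           tokens = line.lower().split()
           i = 0
           while i < len(tokens):
               pair = " ".join(tokens[i:i + 2])
               if pair in multiword:
                   # Emit the expression as one word with a single space
                   yield pair
                   i += 2
               else:
                   yield tokens[i]
                   i += 1

   lines = ["Almond trees bloom early", "An almond is a stone fruit"]
   corpus = iter_words(lines, multiword={"stone fruit"})

   # Counting the words shows the vocabulary that ends up held in memory
   vocabulary = Counter(corpus)

The generator keeps only one line of text in memory at a time, whereas
the counted vocabulary grows with the number of unique words.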

The :class:`wn.Wordnet` object must be instantiated with a single
lexicon, although it may have expand-lexicons for relation
traversal. For best results, the wordnet should use a lemmatizer to
help it deal with inflected wordforms from running text.

.. autofunction:: compute


Reading Pre-computed Information Content Files
----------------------------------------------

The :func:`load` function reads pre-computed information content
weights files as used by the `WordNet::Similarity
<http://wn-similarity.sourceforge.net/>`_ Perl module or the `NLTK
<http://www.nltk.org/>`_ Python package. These files are computed for
a specific version of a wordnet using the synset offsets from the
`WNDB <https://wordnet.princeton.edu/documentation/wndb5wn>`_ format,
which Wn does not use. These offsets must therefore be converted into
identifiers that match those used by the wordnet. By default,
:func:`load` uses the lexicon identifier from its *wordnet* argument
with synset offsets (padded with 0s to make 8 digits) and
parts-of-speech from the weights file to format an identifier, such as
``omw-en-00001174-n``. For wordnets that use a different identifier
scheme, the *get_synset_id* parameter of :func:`load` can be given a
callable created with :func:`wn.util.synset_id_formatter`. It can also
be given another callable with the same signature as shown below:

.. code-block:: python

   def get_synset_id(*, offset: int, pos: str) -> str: ...
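
For illustration, a hand-written callable with that signature might
look like the sketch below. It reproduces the identifier shape of the
``omw-en-00001174-n`` example by hard-coding the ``omw-en`` prefix,
which is an assumption of this sketch; a wordnet with another scheme
would need a different format string.

.. code-block:: python

   def get_synset_id(*, offset: int, pos: str) -> str:
       """Format a synset identifier from a WNDB offset and part of speech."""
       # Pad the offset with leading zeros to 8 digits
       return f"omw-en-{offset:08d}-{pos}"

   example_id = get_synset_id(offset=1174, pos="n")

Both parameters are keyword-only, so the callable has to be invoked
with ``offset=`` and ``pos=`` spelled out.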


When loading pre-computed information content files, it is recommended
to use the ones with smoothing (i.e., ``*-add1.dat`` or
``*-resnik-add1.dat``) to avoid math domain errors when computing the
information content value.
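
The reason can be seen in a small sketch of the arithmetic. The
``ic_value`` helper and the numbers are hypothetical, and the way the
smoothed files are actually built is simplified here to adding 1 to a
weight of zero.

.. code-block:: python

   import math

   def ic_value(weight, total):
       """Negative log of the probability weight / total."""
       return -math.log(weight / total)

   total = 1000.0

   # Without smoothing, an unseen synset has weight 0 and log(0) is undefined
   try:
       ic_value(0.0, total)
       unsmoothed_ok = True
   except ValueError:
       unsmoothed_ok = False

   # With add-1 smoothing the weight is never 0, so the value is finite
   smoothed = ic_value(0.0 + 1.0, total)

Here the unsmoothed call raises ``ValueError`` while the smoothed one
returns a large but finite value.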

.. warning::

   The weights files are only valid for the version of the wordnet for
   which they were created. Files created for WordNet 3.0 do not work
   for WordNet 3.1 because the offsets used in its identifiers are
   different, although the *get_synset_id* parameter of :func:`load`
   could be given a function that performs a suitable mapping. Some
   `Open Multilingual Wordnet <https://github.com/omwn/omw-data>`_
   wordnets use the WordNet 3.0 offsets in their identifiers and can
   therefore technically use the weights, but this usage is
   discouraged because the distributional properties of text in
   another language and the structure of the other wordnet will not be
   compatible with those of the English WordNet. For these cases, it is
   recommended to compute new weights using :func:`compute`.

.. autofunction:: load