wn.similarity
=============
.. automodule:: wn.similarity
Taxonomy-based Metrics
----------------------
The `Path <Path Similarity_>`_, `Leacock-Chodorow <Leacock-Chodorow
Similarity_>`_, and `Wu-Palmer <Wu-Palmer Similarity_>`_ similarity
metrics work by finding path distances in the hypernym/hyponym
taxonomy. As such, they are most useful when the synsets are, in fact,
arranged in a taxonomy. For the Princeton WordNet and the derivative
wordnets available to Wn, such as the `Open English Wordnet`_ and the
`OMW English Wordnet based on WordNet 3.0`_, synsets for nouns and verbs
are arranged taxonomically: the nouns mostly form a single structure
with a single root while verbs form many smaller structures with many
roots. Synsets for the other parts of speech do not use
hypernym/hyponym relations at all. This situation may be different for
other wordnet projects or future versions of the English wordnets.
.. _Open English Wordnet: https://en-word.net
.. _OMW English Wordnet based on WordNet 3.0: https://github.com/omwn/omw-data
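For example, one can check whether a synset participates in the
taxonomy via its hypernyms (a minimal sketch; it assumes the Open
English Wordnet is installed as ``oewn:2021``, and the example words
are arbitrary):

.. code-block:: python

   import wn

   en = wn.Wordnet('oewn:2021')  # assumes this lexicon has been downloaded

   # nouns and verbs are linked into the hypernym/hyponym taxonomy ...
   dog = en.synsets('dog', pos='n')[0]
   dog.hypernyms()      # a non-empty list of hypernym synsets

   # ... while adjectives and adverbs are not
   happy = en.synsets('happy', pos='a')[0]
   happy.hypernyms()    # []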
The similarity metrics tend to fail when the synsets are not connected
by some path. When the synsets are in different parts of speech, or
even in separate lexicons, this failure is acceptable and
expected. But for cases like the verbs in the Princeton WordNet, it
might be more useful to pretend that there is some unique root for all
verbs so as to create a path connecting any two of them. For this
purpose, the *simulate_root* parameter is available on the
:func:`path`, :func:`lch`, and :func:`wup` functions, where it is
passed on to calls to :meth:`wn.Synset.shortest_path` and
:meth:`wn.Synset.lowest_common_hypernyms`. Setting *simulate_root* to
:python:`True` can, however, give surprising results if the synsets
are from different lexicons. Currently, computing similarity for
synsets from different parts of speech raises an error.
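For instance, two verb synsets with no connecting path receive a score
of zero from :func:`path`, but simulating a shared root guarantees that
some path exists (a sketch; the ``oewn:2021`` lexicon and the verbs
chosen are only illustrative):

.. code-block:: python

   import wn
   import wn.similarity

   en = wn.Wordnet('oewn:2021')
   s1 = en.synsets('walk', pos='v')[0]
   s2 = en.synsets('think', pos='v')[0]

   wn.similarity.path(s1, s2)                      # 0.0 if no path connects them
   wn.similarity.path(s1, s2, simulate_root=True)  # small but non-zero score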
Path Similarity
'''''''''''''''
When :math:`p` is the length of the shortest path between two synsets,
the path similarity is:
.. math::

   \frac{1}{p + 1}
The similarity score ranges between 0.0 and 1.0, where the higher the
score is, the more similar the synsets are. The score is 1.0 when a
synset is compared to itself, and 0.0 when there is no path between
the two synsets (i.e., the path distance is infinite).
.. autofunction:: path
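As a quick illustration of the formula (a sketch; ``synset`` stands for
any synset object that has at least one hypernym):

.. code-block:: python

   import wn.similarity

   # a synset compared with itself has p = 0, so the score is 1 / (0 + 1) = 1.0
   wn.similarity.path(synset, synset)

   # a synset and its direct hypernym have p = 1, so the score is 1 / (1 + 1) = 0.5
   wn.similarity.path(synset, synset.hypernyms()[0])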
.. _leacock-chodorow-similarity:
Leacock-Chodorow Similarity
'''''''''''''''''''''''''''
When :math:`p` is the length of the shortest path between two synsets
and :math:`d` is the maximum taxonomy depth, the Leacock-Chodorow
similarity is:
.. math::

   -\text{log}\left(\frac{p + 1}{2d}\right)
.. autofunction:: lch
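Unlike :func:`path` and :func:`wup`, this function requires the maximum
taxonomy depth :math:`d` as an argument. A sketch of obtaining it (this
assumes :func:`wn.taxonomy.taxonomy_depth` is used for the depth and
that the ``oewn:2021`` lexicon is installed; the nouns are arbitrary):

.. code-block:: python

   import wn
   import wn.similarity
   import wn.taxonomy

   en = wn.Wordnet('oewn:2021')
   # maximum depth of the noun taxonomy; this walk is slow, so cache the result
   d = wn.taxonomy.taxonomy_depth(en, 'n')

   s1 = en.synsets('dog', pos='n')[0]
   s2 = en.synsets('cat', pos='n')[0]
   wn.similarity.lch(s1, s2, d)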
Wu-Palmer Similarity
''''''''''''''''''''
When *LCS* is the lowest common hypernym (also called "least common
subsumer") between two synsets, :math:`i` is the shortest path
distance from the first synset to *LCS*, :math:`j` is the shortest
path distance from the second synset to *LCS*, and :math:`k` is the
number of nodes (distance + 1) from *LCS* to the root node, then the
Wu-Palmer similarity is:
.. math::

   \frac{2k}{i + j + 2k}
.. autofunction:: wup
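Following the formula (a sketch; ``synset``, ``synset1``, and
``synset2`` are placeholders for real synset objects):

.. code-block:: python

   import wn.similarity

   # when a synset is compared with itself, i = j = 0, so the score is
   # 2k / (0 + 0 + 2k) = 1.0 regardless of the depth k of the LCS
   wn.similarity.wup(synset, synset)

   # for synsets in separate taxonomies (e.g., many verbs), simulating a
   # root provides the lowest common hypernym the formula needs
   wn.similarity.wup(synset1, synset2, simulate_root=True)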
Information Content-based Metrics
---------------------------------
The `Resnik <Resnik Similarity_>`_, `Jiang-Conrath <Jiang-Conrath
Similarity_>`_, and `Lin <Lin Similarity_>`_ similarity metrics work
by computing the information content of the synsets and/or that of
their lowest common hypernyms. They therefore require information
content weights (see :mod:`wn.ic`), and the values returned
necessarily depend on the weights used.
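A sketch of preparing such weights (the corpus words, file path, and
lexicon spec below are placeholders; :func:`wn.ic.compute` builds
weights from a list of words, while :func:`wn.ic.load` reads a
precomputed information-content file):

.. code-block:: python

   import wn
   import wn.ic
   import wn.similarity

   en = wn.Wordnet('oewn:2021')

   # compute weights from a (toy) word list ...
   ic = wn.ic.compute(['dog', 'cat', 'chase', 'run'], en)
   # ... or load precomputed weights from a file
   # ic = wn.ic.load('path/to/ic-file.dat', en)

   s1 = en.synsets('dog', pos='n')[0]
   s2 = en.synsets('cat', pos='n')[0]
   wn.similarity.res(s1, s2, ic)
   wn.similarity.jcn(s1, s2, ic)
   wn.similarity.lin(s1, s2, ic)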
Resnik Similarity
'''''''''''''''''
The Resnik similarity (`Resnik 1995
<https://arxiv.org/pdf/cmp-lg/9511007.pdf>`_) is the maximum
information content value of the common subsumers (hypernym ancestors)
of the two synsets. Formally, it is defined as follows, where
:math:`c_1` and :math:`c_2` are the two synsets being compared and
:math:`\text{S}(c_1, c_2)` is the set of their common subsumers.
.. math::

   \text{max}_{c \in \text{S}(c_1, c_2)} \text{IC}(c)
Since a synset's information content is always equal to or greater than
that of its hypernyms, :math:`\text{S}(c_1, c_2)` above is
more efficiently computed using the lowest common hypernyms instead of
all common hypernyms.
.. autofunction:: res
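The relationship to the lowest common hypernyms can be checked directly
(a sketch; ``synset1``, ``synset2``, and the ``ic`` weights are assumed
from the sketch above, and it assumes :func:`wn.ic.information_content`
for reading a synset's weight):

.. code-block:: python

   import wn.ic
   import wn.similarity

   score = wn.similarity.res(synset1, synset2, ic)
   # the score should match the largest information content found among
   # the lowest common hypernyms, when such a hypernym exists
   max(wn.ic.information_content(c, ic)
       for c in synset1.lowest_common_hypernyms(synset2))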
Jiang-Conrath Similarity
''''''''''''''''''''''''
The Jiang-Conrath similarity metric (`Jiang and Conrath, 1997
<https://www.aclweb.org/anthology/O97-1002.pdf>`_) combines the ideas
of the taxonomy-based and information content-based metrics. It is
defined as follows, where :math:`c_1` and :math:`c_2` are the two
synsets being compared and :math:`c_0` is the lowest common hypernym
of the two with the highest information content weight:
.. math::

   \frac{1}{\text{IC}(c_1) + \text{IC}(c_2) - 2(\text{IC}(c_0))}
This equation is the simplified form given in the paper, in which
several parameterized terms cancel out; the full form is not often
used in practice.
There are two special cases:
1. If the information content values of :math:`c_0`, :math:`c_1`, and
   :math:`c_2` are all zero, the metric returns zero. This occurs when
   both :math:`c_1` and :math:`c_2` are the root node, but it can also
   occur if the synsets did not occur in the corpus and the smoothing
   value was set to zero.

2. Otherwise, if :math:`\text{IC}(c_1) + \text{IC}(c_2) =
   2(\text{IC}(c_0))`, the metric returns infinity. This occurs when
   the two synsets are the same, one is a descendant of the other,
   etc., such that they have the same frequency as each other and as
   their lowest common hypernym.
.. autofunction:: jcn
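The second special case follows directly from the formula (a sketch;
``synset`` and the ``ic`` weights are placeholders):

.. code-block:: python

   import wn.similarity

   # comparing a synset with itself satisfies IC(c1) + IC(c2) = 2·IC(c0),
   # so (unless all three values are zero) the metric returns infinity
   wn.similarity.jcn(synset, synset, ic)   # float('inf')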
Lin Similarity
''''''''''''''
Another formulation of information content-based similarity is the Lin
metric (`Lin 1997 <https://www.aclweb.org/anthology/P97-1009.pdf>`_),
which is defined as follows, where :math:`c_1` and :math:`c_2` are the
two synsets being compared and :math:`c_0` is the lowest common
hypernym with the highest information content weight:
.. math::

   \frac{2(\text{IC}(c_0))}{\text{IC}(c_1) + \text{IC}(c_2)}
As a special case, if either synset has an information content value
of zero, the metric returns zero.
.. autofunction:: lin
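As with the other metrics, the identity case follows from the formula
(a sketch; ``synset`` and the ``ic`` weights are placeholders):

.. code-block:: python

   import wn.similarity

   # when the synsets are the same and their information content is
   # non-zero, c0 == c1 == c2 and the score is 2·IC / (IC + IC) = 1.0
   wn.similarity.lin(synset, synset, ic)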