Parse Ranking and Word Sense Statistics
---------------------------------------
This directory contains SQL tables that are used in computing a parse
ranking, as well as a word-sense probability (based on WordNet 3.0), by
looking up frequency statistics from an SQL database. The database used
is the SQLite database; it has been chosen because it is
"administration-free" for the user, and because its license is
compatible with the current link-grammar license.

The data tables needed to operate this stuff can be downloaded from the
web. See the main README file for download details.

Disjuncts Table
---------------
The disjuncts.db database contains two tables. The first records the
probability that a given disjunct will be used for some given word.
This probability was measured by parsing a large quantity of text,
and simply counting disjunct frequencies. This probability can be
used to rank parses, or to discriminate between alternate parses
for a sentence.

CREATE TABLE Disjuncts (
    inflected_word TEXT NOT NULL,
    disjunct TEXT NOT NULL,
    log_cond_probability FLOAT
);
CREATE INDEX ifwdj ON Disjuncts (inflected_word, disjunct);

The log_cond_probability field contains the value of -log_2 p(d|w)
where p(d|w) is the conditional probability of seeing the disjunct d
given that the (inflected) word w was already seen.
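
For example, the score for one particular word+disjunct pair, or the
ranked list of disjuncts seen for a word, can be fetched with queries
such as the ones below. This is only an illustrative sketch; the literal
word and disjunct strings are placeholders, not values guaranteed to be
in the table.

   -- Look up the -log_2 p(d|w) score for one (word, disjunct) pair.
   -- The literal strings are placeholders.
   SELECT log_cond_probability
     FROM Disjuncts
    WHERE inflected_word = 'run.v' AND disjunct = 'S- O+';

   -- List the disjuncts seen for a word, most probable first
   -- (a smaller -log_2 p(d|w) means a more probable disjunct).
   SELECT disjunct, log_cond_probability
     FROM Disjuncts
    WHERE inflected_word = 'run.v'
    ORDER BY log_cond_probability ASC;
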
Word Senses Table
-----------------
The DisjunctSenses table associates word senses with (word, disjunct)
pairs. The core idea behind this table is that certain word senses
are used only in certain ways in sentence constructions, and that
the Link Grammar disjuncts are fine-grained enough to detect such
differences, if they exist. The key phrase is "if they exist" -- in
most cases, grammar is insufficient to discriminate between the word
senses in a sentence -- but in some cases, it is enough. The goal here
is to provide this information, as well as possible.

CREATE TABLE DisjunctSenses (
    word_sense TEXT NOT NULL,
    inflected_word TEXT NOT NULL,
    disjunct TEXT NOT NULL,
    log_cond_probability FLOAT
);
CREATE INDEX siwdj ON DisjunctSenses (inflected_word, disjunct);

The log_cond_probability field records -log_2 p(s|w,d) where s==sense,
w==word, d==disjunct, so that p(s|w,d) is the probability of seeing the
sense s, given the word w and the disjunct d. This probability was
obtained by parsing a large quantity of text, and then applying the
Rada Mihalcea word-sense disambiguation algorithm to it.
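
For example, the candidate senses for a given word+disjunct pair can be
ranked directly from the table. Again, this is only an illustrative
sketch; the literal word and disjunct strings are placeholders.

   -- Rank the candidate WordNet senses for one (word, disjunct) pair;
   -- a smaller -log_2 p(s|w,d) means a more probable sense.
   SELECT word_sense, log_cond_probability
     FROM DisjunctSenses
    WHERE inflected_word = 'run.v' AND disjunct = 'S- O+'
    ORDER BY log_cond_probability ASC;
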
Cluster Table
-------------
When LG is unable to parse a sentence, one possible strategy is to
increase the number of disjuncts that words can participate in, in the
hope that the broadened collection of disjuncts will allow the parse
to proceed. To obtain the broadened coverage, one can look up the
word to see if it participates in a cluster of syntactically similar
words, and then use a union of all of the disjuncts of words in the
cluster. If this allows the parse to proceed, then the system will
have "automatically" discovered a usable parse that is not otherwise
covered by the existing dictionaries.

CREATE TABLE ClusterMembers (
    cluster_name TEXT NOT NULL,
    inflected_word TEXT NOT NULL
);
CREATE INDEX iiw ON ClusterMembers (inflected_word);

CREATE TABLE ClusterDisjuncts (
    cluster_name TEXT NOT NULL,
    disjunct TEXT NOT NULL,
    cost FLOAT
);
CREATE INDEX icn ON ClusterDisjuncts (cluster_name);

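The broadened disjunct collection for a word, as described above, is
just the union of the disjuncts of every cluster the word belongs to.
As an illustrative sketch only (the literal word is a placeholder), it
can be fetched with a query such as:

   -- Fetch the union of the disjuncts of all clusters containing a word.
   SELECT DISTINCT d.disjunct, d.cost
     FROM ClusterMembers m
     JOIN ClusterDisjuncts d ON d.cluster_name = m.cluster_name
    WHERE m.inflected_word = 'run.v'
    ORDER BY d.cost ASC;
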
To create these tables:

1) Perform clustering, a la the Siva Reddy scripts.
2) Create the sqlite3 database "clusters.db" in the data/sql directory.
3) Create the above tables and indexes.
4) Run data/tools/cluster-pop.pl to populate the first table.
5) Run link-grammar/corpus/cluster-pop to populate the second table.

A recent version of this table comes to about 172 MBytes.

To use these tables:

1) Copy them into this directory.
2) Enable the above-described algo in the parser, by saying !cluster at
   the parser prompt.

Report: At this time, the results of using the above algo are
underwhelming. This is somewhat surprising, as the above algo is the
"de facto" algo used manually when debugging the dictionary: one tries
"similar words" until one finds a working parse, and then modifies the
dictionary appropriately. The reason for the poor experience so far may
be that the clusters are too small, and not sufficiently encompassing.
Note also that many failed parses are not due to the mis-categorization
of a word that's already in the dictionary, but rather, because the
dictionary fails to account for some particular linguistic phenomenon.
Thus, even in the best case, the above algo will fail to fix all parse
failures.

Notes:
------
Go to http://opencog.org/ and download the source code. Then review the
contents of opencog/nlp/wsd-post/README. That README explains the data
processing pipeline used to create the word-senses table, and the
disjunct-frequency table. To summarize, in brief:

1) Compute the marginal and conditional probabilities for the Disjuncts
   table by running the opencog/nlp/wsd-post/compute-mi.scm script.
   (An illustrative sketch of this computation is given after this list.)
2) Merge synsets by running opencog/nlp/wsd-post/synset-renorm.pl
3) Recompute the DisjunctSenses conditional probabilities by running
   opencog/nlp/wsd-post/dj-probs.pl
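
The value stored in log_cond_probability is just the negative base-2
log of a ratio of counts. The sketch below is NOT part of the actual
pipeline (that is what compute-mi.scm does); it is only a hypothetical
illustration, assuming an imaginary table of raw observation counts,
DisjunctCounts(inflected_word, disjunct, cnt), and an SQLite build that
provides the optional math functions (log2).

   -- Hypothetical sketch only: compute
   --   -log_2 p(d|w) = -log_2( cnt(w,d) / cnt(w) )
   -- from an imaginary raw-counts table DisjunctCounts.
   SELECT c.inflected_word,
          c.disjunct,
          -log2(c.cnt * 1.0 / t.total) AS log_cond_probability
     FROM DisjunctCounts c
     JOIN (SELECT inflected_word, SUM(cnt) AS total
             FROM DisjunctCounts
            GROUP BY inflected_word) t
       ON t.inflected_word = c.inflected_word;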

To populate the disjunct table: first, dump the disjuncts:

   pg_dump -D -O -t disjuncts lexat

To populate the DisjunctSenses table: remove the count column and the
bogus entries via select, run the cleanup scripts, and then dump:

   pg_dump -D -O -t djsxxxtmp lexat
============