1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
|
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- This is the developer's manual for Aspell.
Copyright © 2002, 2003, 2004, 2006 Kevin Atkinson.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.1 or
any later version published by the Free Software Foundation; with no
Invariant Sections, no Front-Cover Texts and no Back-Cover Texts. A
copy of the license is included in the section entitled "GNU Free
Documentation License". -->
<!-- Created by GNU Texinfo 5.2, http://www.gnu.org/software/texinfo/ -->
<head>
<title>Aspell Developer’s Manual: Part 1 - Compiled Dictionary Format</title>
<meta name="description" content="Aspell spell checker developer’s manual.">
<meta name="keywords" content="Aspell Developer’s Manual: Part 1 - Compiled Dictionary Format">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link href="index.html#Top" rel="start" title="Top">
<link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="How-It-All-Works.html#How-It-All-Works" rel="up" title="How It All Works">
<link href="Part-2-_002d-Quickly-Finding-Similar-Soundslike.html#Part-2-_002d-Quickly-Finding-Similar-Soundslike" rel="next" title="Part 2 - Quickly Finding Similar Soundslike">
<link href="How-It-All-Works.html#How-It-All-Works" rel="prev" title="How It All Works">
<style type="text/css">
<!--
a.summary-letter {text-decoration: none}
blockquote.smallquotation {font-size: smaller}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
div.indentedblock {margin-left: 3.2em}
div.lisp {margin-left: 3.2em}
div.smalldisplay {margin-left: 3.2em}
div.smallexample {margin-left: 3.2em}
div.smallindentedblock {margin-left: 3.2em; font-size: smaller}
div.smalllisp {margin-left: 3.2em}
kbd {font-style:oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
pre.smalldisplay {font-family: inherit; font-size: smaller}
pre.smallexample {font-size: smaller}
pre.smallformat {font-family: inherit; font-size: smaller}
pre.smalllisp {font-size: smaller}
span.nocodebreak {white-space:nowrap}
span.nolinebreak {white-space:nowrap}
span.roman {font-family:serif; font-weight:normal}
span.sansserif {font-family:sans-serif; font-weight:normal}
ul.no-bullet {list-style: none}
table:not([class]), table:not([class]) th, table:not([class]) td {
padding: 2px 0.3em 2px 0.3em;
border: thin solid #D0D0D0;
border-collapse: collapse;
}
-->
</style>
<meta name=viewport content="width=device-width, initial-scale=1">
</head>
<body lang="en" bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#800080" alink="#FF0000">
<a name="Part-1-_002d-Compiled-Dictionary-Format"></a>
<div class="header">
<p>
Next: <a href="Part-2-_002d-Quickly-Finding-Similar-Soundslike.html#Part-2-_002d-Quickly-Finding-Similar-Soundslike" accesskey="n" rel="next">Part 2 - Quickly Finding Similar Soundslike</a>, Previous: <a href="How-It-All-Works.html#How-It-All-Works" accesskey="p" rel="prev">How It All Works</a>, Up: <a href="How-It-All-Works.html#How-It-All-Works" accesskey="u" rel="up">How It All Works</a> [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>]</p>
</div>
<hr>
<a name="Part-1-_002d-The-Compiled-Dictionary-Format"></a>
<h3 class="section">15.1 Part 1 - The Compiled Dictionary Format</h3>
<p>In this part you will see how the data is laid out in the compiled
dictionary for Aspell 0.60. See source file <samp>readonly_ws.cpp</samp>.
</p>
<p>Aspell’s main compiled wordlist dictionary file is made as follows:
</p>
<ul>
<li> header
</li><li> jump table for editdist 1
</li><li> jump table for editdist 2
</li><li> data block
</li><li> hash table
</li></ul>
<p>There is nothing particularly interesting about the header. Just a
bunch of meta information.
</p>
<p>The jump tables are described in the next section …
</p>
<p>Words in the data block are grouped based on the soundslike. Each
group is as follows:
</p>
<div class="example">
<pre class="example"><8 bit: offset to next item><8 bit: soundslike size><soundslike>
<null><words>
</pre></div>
<p>Each word group is as follows:
</p>
<div class="example">
<pre class="example"><8 bit: flags><8 bit: offset to next word><8 bit: word size><word><null>
[<affix info><null>]
</pre></div>
<p>The offset for the final word in each group points to the next word in
the following group. The offset for the final word and soundslike
group in the dictionary is 0.
</p>
<p>There is some provisions for additional info to be stored with the word
but for simplicity, it’s left out here. If soundslike data is not used
then the soundslike block it not used.
</p>
<p>This format makes it easy to iterate over the data without using the
hash table.
</p>
<p>Each soundslike group can be a maximum of about 256 bytes. If this
limit is reached then the soundslike group is split. Using 2 bytes for
the soundslike offset would of solved this problem however 256 bytes
is normally sufficient, thus I would of wasted some space by using an
extra byte. More importantly, Using 2 bytes means I would of had to
worry about alignment issues.
</p>
<p>The soundslike groups are sorted in more or less alphabetic order.
</p>
<p>The hash table is a simple open address hash table. The key is the
dictionary word in all lowercase form with all accents removed (what is
known as the "clean" form of the word). The value stored in the table
is a 32-bit offset to the beginning of the word. A 32-bit integer
offset is used rather than a pointer so that the compiled dictionary
can be mmaped to make loading the dictionary very fast and so that the
memory can be shared between processed, and on 64 bit platforms using
pointers would have doubled the size of the hash table.
</p>
<p>Additional information for each word can be derived from each offset:
</p>
<div class="example">
<pre class="example">word size: offset - 1
offset to next word: offset - 2
flags: offset - 3
</pre></div>
<p>I use helper functions for getting this information. Doing it this way
rather than having a data structure is slightly evil, but admittedly,
I have my reasons.
</p>
<p>In the next section, you will see how Aspell uses the jump tables to
search the list for <em>soundslike</em> with <em>edit-distance</em> of 1 or 2.
</p>
<hr>
<div class="header">
<p>
Next: <a href="Part-2-_002d-Quickly-Finding-Similar-Soundslike.html#Part-2-_002d-Quickly-Finding-Similar-Soundslike" accesskey="n" rel="next">Part 2 - Quickly Finding Similar Soundslike</a>, Previous: <a href="How-It-All-Works.html#How-It-All-Works" accesskey="p" rel="prev">How It All Works</a>, Up: <a href="How-It-All-Works.html#How-It-All-Works" accesskey="u" rel="up">How It All Works</a> [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>]</p>
</div>
</body>
</html>
|