File: Part-1-_002d-Compiled-Dictionary-Format.html

package info (click to toggle)
aspell 0.60.8.2-3
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 15,336 kB
  • sloc: cpp: 24,378; sh: 12,340; perl: 1,924; ansic: 1,661; makefile: 852; sed: 16
file content (162 lines) | stat: -rw-r--r-- 7,314 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- This is the developer's manual for Aspell.

Copyright © 2002, 2003, 2004, 2006 Kevin Atkinson.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.1 or
any later version published by the Free Software Foundation; with no
Invariant Sections, no Front-Cover Texts and no Back-Cover Texts.  A
copy of the license is included in the section entitled "GNU Free
Documentation License". -->
<!-- Created by GNU Texinfo 5.2, http://www.gnu.org/software/texinfo/ -->
<head>
<title>Aspell Developer&rsquo;s Manual: Part 1 - Compiled Dictionary Format</title>

<meta name="description" content="Aspell spell checker developer&rsquo;s manual.">
<meta name="keywords" content="Aspell Developer&rsquo;s Manual: Part 1 - Compiled Dictionary Format">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link href="index.html#Top" rel="start" title="Top">
<link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="How-It-All-Works.html#How-It-All-Works" rel="up" title="How It All Works">
<link href="Part-2-_002d-Quickly-Finding-Similar-Soundslike.html#Part-2-_002d-Quickly-Finding-Similar-Soundslike" rel="next" title="Part 2 - Quickly Finding Similar Soundslike">
<link href="How-It-All-Works.html#How-It-All-Works" rel="prev" title="How It All Works">
<style type="text/css">
<!--
a.summary-letter {text-decoration: none}
blockquote.smallquotation {font-size: smaller}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
div.indentedblock {margin-left: 3.2em}
div.lisp {margin-left: 3.2em}
div.smalldisplay {margin-left: 3.2em}
div.smallexample {margin-left: 3.2em}
div.smallindentedblock {margin-left: 3.2em; font-size: smaller}
div.smalllisp {margin-left: 3.2em}
kbd {font-style:oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
pre.smalldisplay {font-family: inherit; font-size: smaller}
pre.smallexample {font-size: smaller}
pre.smallformat {font-family: inherit; font-size: smaller}
pre.smalllisp {font-size: smaller}
span.nocodebreak {white-space:nowrap}
span.nolinebreak {white-space:nowrap}
span.roman {font-family:serif; font-weight:normal}
span.sansserif {font-family:sans-serif; font-weight:normal}
ul.no-bullet {list-style: none}
table:not([class]), table:not([class]) th, table:not([class]) td {
    padding: 2px 0.3em 2px 0.3em;
    border: thin solid #D0D0D0;
    border-collapse: collapse;
}

-->
</style>

<meta name=viewport content="width=device-width, initial-scale=1">
</head>

<body lang="en" bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#800080" alink="#FF0000">
<a name="Part-1-_002d-Compiled-Dictionary-Format"></a>
<div class="header">
<p>
Next: <a href="Part-2-_002d-Quickly-Finding-Similar-Soundslike.html#Part-2-_002d-Quickly-Finding-Similar-Soundslike" accesskey="n" rel="next">Part 2 - Quickly Finding Similar Soundslike</a>, Previous: <a href="How-It-All-Works.html#How-It-All-Works" accesskey="p" rel="prev">How It All Works</a>, Up: <a href="How-It-All-Works.html#How-It-All-Works" accesskey="u" rel="up">How It All Works</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>]</p>
</div>
<hr>
<a name="Part-1-_002d-The-Compiled-Dictionary-Format"></a>
<h3 class="section">15.1 Part 1 - The Compiled Dictionary Format</h3>

<p>In this part you will see how the data is laid out in the compiled
dictionary for Aspell 0.60.  See source file <samp>readonly_ws.cpp</samp>.
</p>
<p>Aspell&rsquo;s main compiled wordlist dictionary file is made as follows:
</p>
<ul>
<li> header
</li><li> jump table for editdist 1
</li><li> jump table for editdist 2
</li><li> data block
</li><li> hash table
</li></ul>

<p>There is nothing particularly interesting about the header.  Just a
bunch of meta information.
</p>
<p>The jump tables are described in the next section &hellip;
</p>
<p>Words in the data block are grouped based on the soundslike.  Each
group is as follows:
</p>
<div class="example">
<pre class="example">&lt;8 bit: offset to next item&gt;&lt;8 bit: soundslike size&gt;&lt;soundslike&gt;
&lt;null&gt;&lt;words&gt;
</pre></div>

<p>Each word group is as follows: 
</p>
<div class="example">
<pre class="example">&lt;8 bit: flags&gt;&lt;8 bit: offset to next word&gt;&lt;8 bit: word size&gt;&lt;word&gt;&lt;null&gt;
[&lt;affix info&gt;&lt;null&gt;]
</pre></div>

<p>The offset for the final word in each group points to the next word in
the following group.  The offset for the final word and soundslike
group in the dictionary is 0.
</p>
<p>There is some provisions for additional info to be stored with the word
but for simplicity, it&rsquo;s left out here.  If soundslike data is not used
then the soundslike block it not used.
</p>
<p>This format makes it easy to iterate over the data without using the
hash table.
</p>
<p>Each soundslike group can be a maximum of about 256 bytes.  If this
limit is reached then the soundslike group is split. Using 2 bytes for
the soundslike offset would of solved this problem however 256 bytes
is normally sufficient, thus I would of wasted some space by using an
extra byte.  More importantly, Using 2 bytes means I would of had to
worry about alignment issues.
</p>
<p>The soundslike groups are sorted in more or less alphabetic order.
</p>
<p>The hash table is a simple open address hash table.  The key is the
dictionary word in all lowercase form with all accents removed (what is
known as the &quot;clean&quot; form of the word).  The value stored in the table
is a 32-bit offset to the beginning of the word.  A 32-bit integer
offset is used rather than a pointer so that the compiled dictionary
can be mmaped to make loading the dictionary very fast and so that the
memory can be shared between processed, and on 64 bit platforms using
pointers would have doubled the size of the hash table.
</p>
<p>Additional information for each word can be derived from each offset:
</p>
<div class="example">
<pre class="example">word size: offset - 1
offset to next word: offset - 2
flags: offset - 3
</pre></div>

<p>I use helper functions for getting this information.  Doing it this way
rather than having a data structure is slightly evil, but admittedly,
I have my reasons.
</p>
<p>In the next section, you will see how Aspell uses the jump tables to
search the list for <em>soundslike</em> with <em>edit-distance</em> of 1 or 2.
</p>
<hr>
<div class="header">
<p>
Next: <a href="Part-2-_002d-Quickly-Finding-Similar-Soundslike.html#Part-2-_002d-Quickly-Finding-Similar-Soundslike" accesskey="n" rel="next">Part 2 - Quickly Finding Similar Soundslike</a>, Previous: <a href="How-It-All-Works.html#How-It-All-Works" accesskey="p" rel="prev">How It All Works</a>, Up: <a href="How-It-All-Works.html#How-It-All-Works" accesskey="u" rel="up">How It All Works</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>]</p>
</div>



</body>
</html>