1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396
|
DBZ(3Z) DBZ(3Z)
NNAAMMEE
dbminit, fetch, store, dbmclose - somewhat dbm-compatible
database routines
dbzfresh, dbzagain, dbzfetch, dbzstore - database routines
dbzsync, dbzsize, dbzincore, dbzcancel, dbzdebug -
database routines
SSYYNNOOPPSSIISS
##iinncclluuddee <<ddbbzz..hh>>
ddbbmmiinniitt((bbaassee))
cchhaarr **bbaassee;;
ddaattuumm
ffeettcchh((kkeeyy))
ddaattuumm kkeeyy;;
ssttoorree((kkeeyy,, vvaalluuee))
ddaattuumm kkeeyy;;
ddaattuumm vvaalluuee;;
ddbbmmcclloossee(())
ddbbzzffrreesshh((bbaassee,, ssiizzee,, ffiieellddsseepp,, ccmmaapp,, ttaaggmmaasskk))
cchhaarr **bbaassee;;
lloonngg ssiizzee;;
iinntt ffiieellddsseepp;;
iinntt ccmmaapp;;
lloonngg ttaaggmmaasskk;;
ddbbzzaaggaaiinn((bbaassee,, oollddbbaassee))
cchhaarr **bbaassee;;
cchhaarr **oollddbbaassee;;
ddaattuumm
ddbbzzffeettcchh((kkeeyy))
ddaattuumm kkeeyy;;
ddbbzzssttoorree((kkeeyy,, vvaalluuee))
ddaattuumm kkeeyy;;
ddaattuumm vvaalluuee;;
ddbbzzssyynncc(())
lloonngg
ddbbzzssiizzee((nneennttrriieess))
lloonngg nneennttrriieess;;
ddbbzziinnccoorree((nneewwvvaalluuee))
ddbbzzccaanncceell(())
ddbbzzddeebbuugg((nneewwvvaalluuee))
3 Feb 1991 1
DBZ(3Z) DBZ(3Z)
DDEESSCCRRIIPPTTIIOONN
These functions provide an indexing system for rapid ran-
dom access to a text file (the _b_a_s_e _f_i_l_e). Subject to
certain constraints, they are call-compatible with _d_b_m(3),
although they also provide some extensions. (Note that
they are _n_o_t file-compatible with _d_b_m or any variant
thereof.)
In principle, _d_b_z stores key-value pairs, where both key
and value are arbitrary sequences of bytes, specified to
the functions by values of type _d_a_t_u_m, typedefed in the
header file to be a structure with members _d_p_t_r (a value
of type _c_h_a_r _* pointing to the bytes) and _d_s_i_z_e (a value
of type _i_n_t indicating how long the byte sequence is).
In practice, _d_b_z is more restricted than _d_b_m. A _d_b_z
database must be an index into a base file, with the
database _v_a_l_u_es being _f_s_e_e_k(3) offsets into the base file.
Each such _v_a_l_u_e must ``point to'' a place in the base file
where the corresponding _k_e_y sequence is found. A key can
be no longer than DBZMAXKEY (a constant defined in the
header file) bytes. No key can be an initial subsequence
of another, which in most applications requires that keys
be either bracketed or terminated in some way (see the
discussion of the _f_i_e_l_d_s_e_p parameter of _d_b_z_f_r_e_s_h, below,
for a fine point on terminators).
_D_b_m_i_n_i_t opens a database, an index into the base file
_b_a_s_e, consisting of files _b_a_s_e..ddiirr and _b_a_s_e..ppaagg which must
already exist. (If the database is new, they should be
zero-length files.) Subsequent accesses go to that
database until _d_b_m_c_l_o_s_e is called to close the database.
The base file need not exist at the time of the _d_b_m_i_n_i_t,
but it must exist before accesses are attempted.
_F_e_t_c_h searches the database for the specified _k_e_y, return-
ing the corresponding _v_a_l_u_e if any. _S_t_o_r_e stores the _k_e_y-
_v_a_l_u_e pair in the database. _S_t_o_r_e will fail unless the
database files are writeable. See below for a complica-
tion arising from case mapping.
_D_b_z_f_r_e_s_h is a variant of _d_b_m_i_n_i_t for creating a new
database with more control over details. Unlike for
_d_b_m_i_n_i_t, the database files need not exist: they will be
created if necessary, and truncated in any case.
_D_b_z_f_r_e_s_h's _s_i_z_e parameter specifies the size of the first
hash table within the database, in key-value pairs. Per-
formance will be best if _s_i_z_e is a prime number and the
number of key-value pairs stored in the database does not
exceed about 2/3 of _s_i_z_e. (The _d_b_z_s_i_z_e function, given
the expected number of key-value pairs, will suggest a
database size that meets these criteria.) Assuming that
an _f_s_e_e_k offset is 4 bytes, the ..ppaagg file will be 4*_s_i_z_e
3 Feb 1991 2
DBZ(3Z) DBZ(3Z)
bytes (the ..ddiirr file is tiny and roughly constant in size)
until the number of key-value pairs exceeds about 80% of
_s_i_z_e. (Nothing awful will happen if the database grows
beyond 100% of _s_i_z_e, but accesses will slow down somewhat
and the ..ppaagg file will grow somewhat.)
_D_b_z_f_r_e_s_h's _f_i_e_l_d_s_e_p parameter specifies the field separa-
tor in the base file. If this is not NUL (0), and the
last character of a _k_e_y argument is NUL, that NUL compares
equal to either a NUL or a _f_i_e_l_d_s_e_p in the base file.
This permits use of NUL to terminate key strings without
requiring that NULs appear in the base file. The _f_i_e_l_d_s_e_p
of a database created with _d_b_m_i_n_i_t is the horizontal-tab
character.
For use in news systems, various forms of case mapping
(e.g. uppercase to lowercase) in keys are available. The
_c_m_a_p parameter to _d_b_z_f_r_e_s_h is a single character specify-
ing which of several mapping algorithms to use. Available
algorithms are:
00 case-sensitive: no case mapping
BB same as 00
NNUULL same as 00
== case-insensitive: uppercase and lowercase
equivalent
bb same as ==
CC RFC822 message-ID rules, case-sensitive
before `@' (with certain exceptions) and
case-insensitive after
?? whatever the local default is, normally CC
Mapping algorithm 00 (no mapping) is faster than the others
and is overwhelmingly the correct choice for most applica-
tions. Unless compatibility constraints interfere, it is
more efficient to pre-map the keys, storing mapped keys in
the base file, than to have _d_b_z do the mapping on every
search.
For historical reasons, _f_e_t_c_h and _s_t_o_r_e expect their _k_e_y
arguments to be pre-mapped, but expect unmapped keys in
the base file. _D_b_z_f_e_t_c_h and _d_b_z_s_t_o_r_e do the same jobs but
handle all case mapping internally, so the customer need
not worry about it.
_D_b_z stores only the database _v_a_l_u_es in its files, relying
on reference to the base file to confirm a hit on a key.
References to the base file can be minimized, greatly
3 Feb 1991 3
DBZ(3Z) DBZ(3Z)
speeding up searches, if a little bit of information about
the keys can be stored in the _d_b_z files. This is ``free''
if there are some unused bits in an _f_s_e_e_k offset, so that
the offset can be _t_a_g_g_e_d with some information about the
key. The _t_a_g_m_a_s_k parameter of _d_b_z_f_r_e_s_h allows specifying
the location of unused bits. _T_a_g_m_a_s_k should be a mask
with one group of contiguous 11 bits. The bits in the mask
should be unused (0) in _m_o_s_t offsets. The bit immediately
above the mask (the _f_l_a_g bit) should be unused (0) in _a_l_l
offsets; _(_d_b_z_)_s_t_o_r_e will reject attempts to store a key-
value pair in which the _v_a_l_u_e has the flag bit on. Apart
from this restriction, tagging is invisible to the user.
As a special case, a _t_a_g_m_a_s_k of 1 means ``no tagging'',
for use with enormous base files or on systems with
unusual offset representations.
A _s_i_z_e of 0 given to _d_b_z_f_r_e_s_h is synonymous with the local
default; the normal default is suitable for tables of
90-100,000 key-value pairs. A _c_m_a_p of 0 (NUL) is synony-
mous with the character 00, signifying no case mapping
(note that the character ?? specifies the local default
mapping, normally CC). A _t_a_g_m_a_s_k of 0 is synonymous with
the local default tag mask, normally 0x7f000000 (specify-
ing the top bit in a 32-bit offset as the flag bit, and
the next 7 bits as the mask, which is suitable for base
files up to circa 24MB). Calling _d_b_m_i_n_i_t_(_n_a_m_e_) with the
database files empty is equivalent to calling
_d_b_z_f_r_e_s_h_(_n_a_m_e_,_0_,_'_\_t_'_,_'_?_'_,_0_).
When databases are regenerated periodically, as in news,
it is simplest to pick the parameters for a new database
based on the old one. This also permits some memory of
past sizes of the old database, so that a new database
size can be chosen to cover expected fluctuations. _D_b_z_a_-
_g_a_i_n is a variant of _d_b_m_i_n_i_t for creating a new database
as a new generation of an old database. The database
files for _o_l_d_b_a_s_e must exist. _D_b_z_a_g_a_i_n is equivalent to
calling _d_b_z_f_r_e_s_h with the same field separator, case map-
ping, and tag mask as the old database, and a _s_i_z_e equal
to the result of applying _d_b_z_s_i_z_e to the largest number of
entries in the _o_l_d_b_a_s_e database and its previous 10 gener-
ations.
When many accesses are being done by the same program, _d_b_z
is massively faster if its first hash table is in memory.
If an internal flag is 1, an attempt is made to read the
table in when the database is opened, and _d_b_m_c_l_o_s_e writes
it out to disk again (if it was read successfully and has
been modified). _D_b_z_i_n_c_o_r_e sets the flag to _n_e_w_v_a_l_u_e
(which should be 0 or 1) and returns the previous value;
this does not affect the status of a database that has
already been opened. The default is 0. The attempt to
read the table in may fail due to memory shortage; in this
case _d_b_z quietly falls back on its default behavior.
3 Feb 1991 4
DBZ(3Z) DBZ(3Z)
_S_t_o_r_es to an in-memory database are not (in general) writ-
ten out to the file until _d_b_m_c_l_o_s_e or _d_b_z_s_y_n_c, so if
robustness in the presence of crashes or concurrent
accesses is crucial, in-memory databases should probably
be avoided.
_D_b_z_s_y_n_c causes all buffers etc. to be flushed out to the
files. It is typically used as a precaution against
crashes or concurrent accesses when a _d_b_z-using process
will be running for a long time. It is a somewhat expen-
sive operation, especially for an in-memory database.
_D_b_z_c_a_n_c_e_l cancels any pending writes from buffers. This
is typically useful only for in-core databases, since
writes are otherwise done immediately. Its main purpose
is to let a child process, in the wake of a _f_o_r_k, do a
_d_b_m_c_l_o_s_e without writing its parent's data to disk.
If _d_b_z has been compiled with debugging facilities avail-
able (which makes it bigger and a bit slower), _d_b_z_d_e_b_u_g
alters the value (and returns the previous value) of an
internal flag which (when 1; default is 0) causes verbose
and cryptic debugging output on standard output.
Concurrent reading of databases is fairly safe, but there
is no (inter)locking, so concurrent updating is not.
The database files include a record of the byte order of
the processor creating the database, and accesses by pro-
cessors with different byte order will work, although they
will be slightly slower. Byte order is preserved by _d_b_z_a_-
_g_a_i_n. However, agreement on the size and internal struc-
ture of an _f_s_e_e_k offset is necessary, as is consensus on
the character set.
An open database occupies three _s_t_d_i_o streams and their
corresponding file descriptors; a fourth is needed for an
in-memory database. Memory consumption is negligible
(except for _s_t_d_i_o buffers) except for in-memory databases.
SSEEEE AALLSSOO
dbz(1), dbm(3)
DDIIAAGGNNOOSSTTIICCSS
Functions returning _i_n_t values return 0 for success, -1
for failure. Functions returning _d_a_t_u_m values return a
value with _d_p_t_r set to NULL for failure. _D_b_m_i_n_i_t attempts
to have _e_r_r_n_o set plausibly on return, but otherwise this
is not guaranteed. An _e_r_r_n_o of EEDDOOMM from _d_b_m_i_n_i_t indi-
cates that the database did not appear to be in _d_b_z for-
mat.
HHIISSTTOORRYY
The original _d_b_z was written by Jon Zeeff (zeeff@b-
3 Feb 1991 5
DBZ(3Z) DBZ(3Z)
tech.ann-arbor.mi.us). Later contributions by David But-
ler and Mark Moraes. Extensive reworking, including this
documentation, by Henry Spencer (henry@zoo.toronto.edu) as
part of the C News project. Hashing function by Peter
Honeyman.
BBUUGGSS
The _d_p_t_r members of returned _d_a_t_u_m values point to static
storage which is overwritten by later calls.
Unlike _d_b_m, _d_b_z will misbehave if an existing key-value
pair is `overwritten' by a new _(_d_b_z_)_s_t_o_r_e with the same
key. The user is responsible for avoiding this by using
_(_d_b_z_)_f_e_t_c_h first to check for duplicates; an internal
optimization remembers the result of the first search so
there is minimal overhead in this.
Waiting until after _d_b_m_i_n_i_t to bring the base file into
existence will fail if _c_h_d_i_r(2) has been used meanwhile.
The RFC822 case mapper implements only a first approxima-
tion to the hideously-complex RFC822 case rules.
The prime finder in _d_b_z_s_i_z_e is not particularly quick.
Should implement the _d_b_m functions _d_e_l_e_t_e, _f_i_r_s_t_k_e_y, and
_n_e_x_t_k_e_y.
On C implementations which trap integer overflow, _d_b_z will
refuse to _(_d_b_z_)_s_t_o_r_e an _f_s_e_e_k offset equal to the greatest
representable positive number, as this would cause over-
flow in the biased representation used.
_D_b_z_a_g_a_i_n perhaps ought to notice when many offsets in the
old database were too big for tagging, and shrink the tag
mask to match.
Marking _d_b_z's file descriptors close-on-_e_x_e_c would be a
better approach to the problem _d_b_z_c_a_n_c_e_l tries to address,
but that's harder to do portably.
3 Feb 1991 6
|