File: CDB_File.pm

package CDB_File;

use strict;

use XSLoader ();
use Exporter ();

our @ISA       = qw(Exporter);
our $VERSION   = '1.05';
our @EXPORT_OK = qw(create);

=head1 NAME

CDB_File - Perl extension for access to cdb databases

=head1 SYNOPSIS

    use CDB_File;
    $c = tie(%h, 'CDB_File', 'file.cdb') or die "tie failed: $!\n";

    # If accessing a utf8 stored CDB_File
    $c = tie(%h, 'CDB_File', 'file.cdb', utf8 => 1) or die "tie failed: $!\n";

    $fh = $c->handle;
    sysseek $fh, $c->datapos, 0 or die ...;
    sysread $fh, $x, $c->datalen;
    undef $c;
    untie %h;

    $t = CDB_File->new('t.cdb', "t.$$") or die ...;
    $t->insert('key', 'value');
    $t->finish;

    CDB_File::create %t, $file, "$file.$$";

or

    use CDB_File 'create';
    create %t, $file, "$file.$$";

    # If you want to store the data in utf8 mode.
    create %t, $file, "$file.$$", utf8 => 1;

=head1 DESCRIPTION

B<CDB_File> is a module which provides a Perl interface to Dan
Bernstein's B<cdb> package:

    cdb is a fast, reliable, lightweight package for creating and
    reading constant databases.

=head2 Reading from a cdb

After the C<tie> shown above, accesses to C<%h> will refer
to the B<cdb> file C<file.cdb>, as described in L<perlfunc/tie>.

Low level access to the database is provided by the three methods
C<handle>, C<datapos>, and C<datalen>.  To use them, you must remember
the C<CDB_File> object returned by the C<tie> call: C<$c> in the
example above.  The C<datapos> and C<datalen> methods return the
file offset position and length respectively of the most recently
visited key (for example, via C<exists>).

Beware that if you create an extra reference to the C<CDB_File> object
(like C<$c> in the example above) you must destroy it (with C<undef>)
before calling C<untie> on the hash.  This ensures that the object's
C<DESTROY> method is called.  Note that C<perl -w> will check this for
you; see L<perltie> for further details.

=head2 Creating a cdb

A B<cdb> file is created in three steps.  First call C<new CDB_File
($final, $tmp)>, where C<$final> is the name of the database to be
created, and C<$tmp> is the name of a temporary file which can be
atomically renamed to C<$final>.  Secondly, call the C<insert> method
once for each (I<key>, I<value>) pair.  Finally, call the C<finish>
method to complete the creation and renaming of the B<cdb> file.

Alternatively, call the C<insert()> method with multiple key/value
pairs. This can be significantly faster because there is less crossing
over the bridge from perl to C code. One simple way to do this is to pass
in an entire hash, as in: C<< $cdbmaker->insert(%hash); >>.
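
As a minimal sketch of the batched form, the following program builds a
small database with two multi-pair C<insert> calls.  The file name
F<numbers.cdb> and the hash contents are assumptions made up for this
illustration; they are not part of the module.

    use strict;
    use warnings;
    use CDB_File;

    # Hypothetical file name and data, chosen only for illustration.
    my $final = 'numbers.cdb';
    my %hash  = ( one => 1, two => 2 );

    my $cdbmaker = CDB_File->new( $final, "$final.$$" )
        or die "new CDB_File failed: $!\n";

    # One call per batch rather than one call per pair.
    $cdbmaker->insert(%hash);
    $cdbmaker->insert( three => 3, four => 4 );

    $cdbmaker->finish or die "CDB_File finish failed: $!\n";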

A simpler interface to B<cdb> file creation is provided by
C<CDB_File::create %t, $final, $tmp>.  This creates a B<cdb> file named
C<$final> containing the contents of C<%t>.  As before, C<$tmp> must
name a temporary file which can be atomically renamed to C<$final>.
C<CDB_File::create> may be imported.

=head2 UTF8 support

When CDB_File was created in 1997 (prior even to Perl 5.6), Perl SVs
didn't really deal with UTF8. In order to properly store mixed
bytes and utf8 data in the file, we would normally need to store a bit
for each string which clarifies the encoding of the key / values.
This would be useful since Perl hash keys are downgraded to bytes when
possible so as to normalize the hash key access regardless of encoding.

The CDB_File format is used outside of Perl and so must maintain file
format compatibility with those systems. As a result this module provides
a utf8 mode which must be enabled at database generation and then later
at read. Keys will always be stored as UTF8 strings, which is the opposite
of how Perl stores the strings. This approach had to be taken to ensure that
no data corruption occurs when SVs are accidentally downgraded before they
are stored or on retrieval.

You can enable utf8 mode by passing C<utf8 =E<gt> 1> to B<new>, B<tie>,
or B<create>. All returned SVs while in this mode will be encoded in utf8.
This feature is not available below 5.14 due to lack of Perl macro support.

B<NOTE:> read/write of databases not stored in utf8 mode will often be
incompatible with any non-ascii data.
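
A minimal round-trip sketch, assuming Perl 5.14 or later: the same
C<utf8 =E<gt> 1> option is passed when the file is written and again when
it is read.  The file name F<words.cdb> and the sample strings are
hypothetical.  The non-ASCII characters are written as escapes so that the
example does not depend on the encoding of the source file.

    use strict;
    use warnings;
    use CDB_File;

    my $file = 'words.cdb';                        # hypothetical file name
    my %t    = ( "caf\x{e9}" => "cr\x{e8}me" );    # non-ASCII key and value

    # Write in utf8 mode ...
    CDB_File::create( %t, $file, "$file.$$", utf8 => 1 )
        or die "create failed: $!\n";

    # ... and read back with the same option.
    my $c = tie( my %h, 'CDB_File', $file, utf8 => 1 )
        or die "tie failed: $!\n";

    binmode STDOUT, ':encoding(UTF-8)';
    print "$_ => $h{$_}\n" for keys %h;

    # Drop the extra reference before untying, as described above.
    undef $c;
    untie %h;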

=head1 EXAMPLES

These are all complete programs.

1. Convert a Berkeley DB (B-tree) database to B<cdb> format.

    use CDB_File;
    use DB_File;

    tie %h, DB_File, $ARGV[0], O_RDONLY, undef, $DB_BTREE or
            die "$0: can't tie to $ARGV[0]: $!\n";

    CDB_File::create %h, $ARGV[1], "$ARGV[1].$$" or
            die "$0: can't create cdb: $!\n";

2. Convert a flat file to B<cdb> format.  In this example, the flat
file consists of one key per line, separated by a colon from the value.
Blank lines and lines beginning with B<#> are skipped.

    use CDB_File;

    $cdb = new CDB_File("data.cdb", "data.$$") or
            die "$0: new CDB_File failed: $!\n";
    while (<>) {
            next if /^$/ or /^#/;
            chomp;
            ($k, $v) = split /:/, $_, 2;
            if (defined $v) {
                    $cdb->insert($k, $v);
            } else {
                    warn "bogus line: $_\n";
            }
    }
    $cdb->finish or die "$0: CDB_File finish failed: $!\n";

3. Perl version of B<cdbdump>.

    use CDB_File;

    tie %data, 'CDB_File', $ARGV[0] or
            die "$0: can't tie to $ARGV[0]: $!\n";
    while (($k, $v) = each %data) {
            print '+', length $k, ',', length $v, ":$k->$v\n";
    }
    print "\n";

4. For really enormous data values, you can use C<handle>, C<datapos>,
and C<datalen>, in combination with C<sysseek> and C<sysread>, to
avoid reading the values into memory.  Here is the script F<bun-x.pl>,
which can extract uncompressed files and directories from a B<bun>
file.

    use CDB_File;

    sub unnetstrings {
        my($netstrings) = @_;
        my @result;
        while ($netstrings =~ s/^([0-9]+)://) {
                push @result, substr($netstrings, 0, $1, '');
                $netstrings =~ s/^,//;
        }
        return @result;
    }

    my $chunk = 8192;

    sub extract {
        my($file, $t, $b) = @_;
        my $head = $$b{"H$file"};
        my ($code, $type) = $head =~ m/^([0-9]+)(.)/;
        if ($type eq "/") {
                mkdir $file, 0777;
        } elsif ($type eq "_") {
                my ($total, $now, $got, $x);
                open OUT, ">$file" or die "open for output: $!\n";
                exists $$b{"D$code"} or die "corrupt bun file\n";
                my $fh = $t->handle;
                sysseek $fh, $t->datapos, 0;
                $total = $t->datalen;
                while ($total) {
                        $now = ($total > $chunk) ? $chunk : $total;
                        $got = sysread $fh, $x, $now;
                        if (not $got) { die "read error\n"; }
                        $total -= $got;
                        print OUT $x;
                }
                close OUT;
        } else {
                print STDERR "warning: skipping unknown file type\n";
        }
    }

    die "usage\n" if @ARGV != 1;

    my (%b, $t);
    $t = tie %b, 'CDB_File', $ARGV[0] or die "tie: $!\n";
    map { extract $_, $t, \%b } unnetstrings $b{""};

5. Although a B<cdb> file is constant, you can simulate updating it
in Perl.  This is an expensive operation, as you have to create a
new database, and copy into it everything that's unchanged from the
old database.  (As compensation, the update does not affect database
readers.  The old database is available for them, till the moment the
new one is C<finish>ed.)

    use CDB_File;

    $file = 'data.cdb';
    $new = new CDB_File($file, "$file.$$") or
            die "$0: new CDB_File failed: $!\n";

    # Add the new values; remember which keys we've seen.
    while (<>) {
            chomp;
            ($k, $v) = split;
            $new->insert($k, $v);
            $seen{$k} = 1;
    }

    # Add any old values that haven't been replaced.
    tie %old, 'CDB_File', $file or die "$0: can't tie to $file: $!\n";
    while (($k, $v) = each %old) {
            $new->insert($k, $v) unless $seen{$k};
    }

    $new->finish or die "$0: CDB_File finish failed: $!\n";

=head1 REPEATED KEYS

Most users can ignore this section.

A B<cdb> file can contain repeated keys.  If the C<insert> method is
called more than once with the same key during the creation of a B<cdb>
file, that key will be repeated.

Here's an example.

    $cdb = new CDB_File ("$file.cdb", "$file.$$") or die ...;
    $cdb->insert('cat', 'gato');
    $cdb->insert('cat', 'chat');
    $cdb->finish;

Normally, any attempt to access a key retrieves the first value
stored under that key.  This code snippet always prints B<gato>.

    $catref = tie %catalogue, CDB_File, "$file.cdb" or die ...;
    print "$catalogue{cat}";

However, all the usual ways of iterating over a hash---C<keys>,
C<values>, and C<each>---do the Right Thing, even in the presence of
repeated keys.  This code snippet prints B<cat cat gato chat>.

    print join(' ', keys %catalogue, values %catalogue);

And these two both print B<cat:gato cat:chat>, although the second is
more efficient.

    foreach $key (keys %catalogue) {
            print "$key:$catalogue{$key} ";
    }

    while (($key, $val) = each %catalogue) {
            print "$key:$val ";
    }

The C<multi_get> method retrieves all the values associated with a key.
It returns a reference to an array containing all the values.  This code
prints B<gato chat>.

    print "@{$catref->multi_get('cat')}";

C<multi_get> always returns an array reference.  If the key was not
found in the database, it will be a reference to an empty array.  To
test whether the key was found, you must test the array, and not the
reference.

    $x = $catref->multi_get($key);
    warn "$key not found\n" unless $x; # WRONG; message never printed
    warn "$key not found\n" unless @$x; # Correct

The C<fetch_all> method returns a hashref of all keys with the first
value in the cdb.  This is useful for quickly loading a cdb file where
there is a 1:1 key mapping.  In practice it proved to be about 400%
faster than iterating a tied hash.

    # Slow
    my %copy = %tied_cdb;

    # Much Faster
    my $copy_hashref = $catref->fetch_all();

=head1 RETURN VALUES

The routines C<tie>, C<new>, and C<finish> return B<undef> if the
attempted operation failed; C<$!> contains the reason for failure.

=head1 DIAGNOSTICS

The following fatal errors may occur.  (See L<perlfunc/eval> if
you want to trap them.)

=over 4

=item Modification of a CDB_File attempted

You attempted to modify a hash tied to a B<CDB_File>.

=item CDB database too large

You attempted to create a B<cdb> file larger than 4 gigabytes.

=item [ Write to | Read of | Seek in ] CDB_File failed: <error string>

If B<error string> is B<Protocol error>, you tried to C<use CDB_File> to
access something that isn't a B<cdb> file.  Otherwise a serious OS level
problem occurred, for example, you have run out of disk space.

=back
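
As a sketch of trapping one of these errors, the program below provokes
C<Modification of a CDB_File attempted> inside an C<eval> block and
reports it instead of dying.  The file name F<readonly.cdb> and its
contents are assumptions made up for this illustration.

    use strict;
    use warnings;
    use CDB_File;

    my $file = 'readonly.cdb';       # hypothetical file name
    my %t    = ( key => 'value' );
    CDB_File::create( %t, $file, "$file.$$" )
        or die "create failed: $!\n";

    tie( my %h, 'CDB_File', $file ) or die "tie failed: $!\n";

    # Storing into the tied hash croaks, so wrap the assignment in
    # eval and inspect $@ afterwards.
    my $ok = eval { $h{key} = 'changed'; 1 };
    warn "caught: $@" unless $ok;

    untie %h;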

=head1 PERFORMANCE

Sometimes you need to get the most performance possible out of a
library. Rumour has it that perl's tie() interface is slow. In order
to get around that you can use CDB_File in an object oriented
fashion, rather than via tie().

  my $cdb = CDB_File->TIEHASH('/path/to/cdbfile.cdb');

  if ($cdb->EXISTS('key')) {
      print "Key is: ", $cdb->FETCH('key'), "\n";
  }

For more information on the methods available on tied hashes see
L<perltie>.

=head1 THE ALGORITHM

This algorithm is described at L<http://cr.yp.to/cdb/cdb.txt>.  It is
small enough that it is included inline in the event that the
internet loses the page:

=head2 A structure for constant databases

Copyright (c) 1996 D. J. Bernstein, L<djb@pobox.com>

A cdb is an associative array: it maps strings ("keys") to strings
("data").

A cdb contains 256 pointers to linearly probed open hash tables. The
hash tables contain pointers to (key,data) pairs. A cdb is stored in
a single file on disk:

    +----------------+---------+-------+-------+-----+---------+
    | p0 p1 ... p255 | records | hash0 | hash1 | ... | hash255 |
    +----------------+---------+-------+-------+-----+---------+

Each of the 256 initial pointers states a position and a length. The
position is the starting byte position of the hash table. The length
is the number of slots in the hash table.

Records are stored sequentially, without special alignment. A record
states a key length, a data length, the key, and the data.

Each hash table slot states a hash value and a byte position. If the
byte position is 0, the slot is empty. Otherwise, the slot points to
a record whose key has that hash value.

Positions, lengths, and hash values are 32-bit quantities, stored in
little-endian form in 4 bytes. Thus a cdb must fit into 4 gigabytes.

A record is located as follows. Compute the hash value of the key in
the record. The hash value modulo 256 is the number of a hash table.
The hash value divided by 256, modulo the length of that table, is a
slot number. Probe that slot, the next higher slot, and so on, until
you find the record or run into an empty slot.

The cdb hash function is C<h = ((h << 5) + h) ^ c>, with a starting
hash of 5381.
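
The lookup procedure above can be sketched in pure Perl.  This is an
illustration under stated assumptions, not the code this module runs: the
helper names C<cdb_hash> and C<cdb_find> are hypothetical, the whole file
is assumed to be held in memory as a byte string, probing is assumed to
wrap around to slot 0 at the end of a table, and perl is assumed to have
64-bit integers.

    use strict;
    use warnings;

    # h = ((h << 5) + h) ^ c over the key's bytes, kept to 32 bits.
    sub cdb_hash {
        my ($key) = @_;
        my $h = 5381;
        for my $c ( unpack 'C*', $key ) {
            $h = ( ( ( $h << 5 ) + $h ) & 0xFFFFFFFF ) ^ $c;
        }
        return $h;
    }

    # Return the first value stored under $key, or undef if absent.
    sub cdb_find {
        my ( $cdb, $key ) = @_;
        my $h = cdb_hash($key);

        # Header entry number (h mod 256): table position, slot count.
        my ( $tpos, $tlen ) =
            unpack 'V V', substr( $cdb, ( $h & 255 ) * 8, 8 );
        return undef unless $tlen;

        my $slot = ( $h >> 8 ) % $tlen;
        for ( 1 .. $tlen ) {
            my ( $shash, $rpos ) =
                unpack 'V V', substr( $cdb, $tpos + $slot * 8, 8 );
            return undef unless $rpos;      # empty slot: key is absent
            if ( $shash == $h ) {
                my ( $klen, $dlen ) =
                    unpack 'V V', substr( $cdb, $rpos, 8 );
                return substr( $cdb, $rpos + 8 + $klen, $dlen )
                    if substr( $cdb, $rpos + 8, $klen ) eq $key;
            }
            $slot = ( $slot + 1 ) % $tlen;  # next slot, wrapping around
        }
        return undef;
    }

    # Hand-built database holding the single record cat => gato:
    # 2048-byte header, one 15-byte record, one table with one slot.
    my $h   = cdb_hash('cat');
    my $hdr = join '',
        map { pack( 'V V', 2063, ( $_ == ( $h & 255 ) ? 1 : 0 ) ) } 0 .. 255;
    my $cdb = $hdr . pack( 'V V', 3, 4 ) . 'catgato' . pack( 'V V', $h, 2048 );
    print cdb_find( $cdb, 'cat' ) . "\n";    # prints "gato"

The table number and the starting slot both come from the same 32-bit
hash: the low 8 bits select the table and the remaining bits, modulo the
table length, select the slot.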


=head1 BUGS

The C<create()> interface could be done with C<TIEHASH>.

=head1 SEE ALSO

cdb(3)

=head1 AUTHOR

Tim Goodwin, <tjg@star.le.ac.uk>.  B<CDB_File> began on 1997-01-08.

Work provided through 2008 by Matt Sergeant, <matt@sergeant.org>

Now maintained by Todd Rinaldo, <toddr@cpan.org>

=cut

XSLoader::load( 'CDB_File', $VERSION );

sub CLEAR {
    require Carp;
    Carp::croak("Modification of a CDB_File attempted");
}

sub DELETE {
    goto &CLEAR;
}

sub STORE {
    goto &CLEAR;
}

# Must be preloaded for the prototype.

sub create(\%$$;$$) {
    my ( $RHdata, $fn, $fntemp, $option_key, $is_utf8 ) = @_;

    die("utf8 CDB_Files are not supported below Perl 5.14") if $option_key && $option_key eq 'utf8' && $is_utf8 && $] < "5.014";

    my $cdb = CDB_File->new( $fn, $fntemp, $option_key || '', $is_utf8 || 0 ) or return undef;
    {
        $cdb->insert(%$RHdata);
    }
    $cdb->finish;
    return 1;
}

1;