.\" dirfile-encoding.5. The dirfile-encoding man page.
.\"
.\" Copyright (C) 2008, 2009, 2010, 2012, 2013, 2014, 2015 D. V. Wiebe
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
.\" This file is part of the GetData project.
.\"
.\" Permission is granted to copy, distribute and/or modify this document
.\" under the terms of the GNU Free Documentation License, Version 1.2 or
.\" any later version published by the Free Software Foundation; with no
.\" Invariant Sections, with no Front-Cover Texts, and with no Back-Cover
.\" Texts. A copy of the license is included in the `COPYING.DOC' file
.\" as part of this distribution.
.\"
.TH dirfile-encoding 5 "15 October 2015" "Standards Version 9" "DATA FORMATS"
.SH NAME
dirfile-encoding \(em dirfile database encoding schemes
.SH DESCRIPTION
The
.I Dirfile Standards
indicate that
.B RAW
fields defined in the database are accompanied by binary files containing the
field data in the specified simple data type. In certain situations, it may be
advantageous to convert the binary files in the database into a more convenient
form. This is accomplished by
.I encoding
the binary file into the alternate form. A common use-case for encoding a
binary file is to compress it to save disk space. Only data is modified by an
encoding scheme. Database metadata is never encoded.
.PP
Support for encoding schemes is optional. An implementation need not support
any particular encoding scheme, or may only support certain operations with it,
but should expect to encounter unknown encoding schemes and fail gracefully in
such situations.
.PP
Additionally, how a particular encoding is implemented is not specified by the
Dirfile Standards, but, for purposes of interoperability, all dirfile
implementations are encouraged to support the encoding implementation used by
the GetData dirfile reference implementation, elaborated below.
.PP
An encoding scheme is local to the particular
.I format specification fragment
in which it is indicated. This allows a single dirfile to have binary files
which are stored using multiple encodings, by having them defined in multiple
fragments.
.PP
The rest of this manual page discusses specifics of the encoding framework
implemented in the GetData library, and does not constitute part of the
Dirfile Standards.
.SH THE GETDATA ENCODING FRAMEWORK
The GetData library provides an encoding framework which abstracts binary file
I/O, allowing for generic support for a wide variety of encoding schemes.
Functions which may make use of the encoding framework are:
.IP
.BR gd_add "(3), " gd_add_raw "(3), " gd_add_spec (3),
.BR gd_alter_encoding "(3), " gd_alter_endianness (3),
.BR gd_alter_frameoffset "(3), " gd_alter_entry (3),
.BR gd_alter_raw "(3), " gd_alter_spec "(3), " gd_flush (3),
.BR gd_getdata "(3), " gd_malter_spec "(3), " gd_move (3),
.BR gd_nframes "(3), " gd_putdata "(3), " gd_raw_close (3),
.BR gd_rename (3),
and
.BR gd_sync (3).
.P
Most of the encodings supported by GetData are implemented through external
libraries which handle the actual file I/O and data translation. All such
libraries are optional; a build of the library which omits an external library
will lack support for the associated encoding scheme. In this case, GetData
will still properly identify the encoding scheme, but attempts to use GetData
for file I/O via the encoding will fail with the
.B GD_E_UNSUPPORTED
error code.
.PP
GetData discovers the encoding scheme of a particular RAW field by noting the
filename extension of files associated with the field. Binary files which form
an unencoded dirfile have no file extension. The file extensions used by the
other encodings are noted below. Encoding discovery proceeds by searching for
files with the known list of file extensions (in an unspecified order) and
stopping when the first successful match is made. Because of this, when a
field has multiple data files with different, supported file extensions which
could legitimately be associated with it, the encoding scheme discovered by
GetData is not well defined.
.PP
In addition to raw (unencoded) data, GetData supports nine other encoding
schemes:
.I text
encoding,
.I bzip2
encoding,
.I flac
encoding,
.I gzip
encoding,
.I lzma
encoding,
.I sie
(sample-index encoding),
.I slim
encoding,
.I zzip
encoding, and
.I zzslim
encoding, all discussed below.
.PP
The text encoding and the sample-index encoding are implemented by GetData
natively and need no external library. As a result, they are always present in
the library.
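.PP
As a rough illustration of the extension search described above, the following
Python sketch (not part of GetData) probes a dirfile directory for the data
file of a field. The function name discover_encoding and the KNOWN_EXTENSIONS
table are hypothetical. The ZIP-based encodings are left out because their
archives are named per fragment rather than per field, and the probing order
shown here is arbitrary.
.PP
.nf
```python
import os
import tempfile

# Extensions are those listed on this page; the table and the function
# name are illustrative, not GetData internals.
KNOWN_EXTENSIONS = {
    ".txt": "text",
    ".bz2": "bzip2",
    ".flac": "flac",
    ".gz": "gzip",
    ".xz": "lzma",
    ".lzma": "lzma",
    ".sie": "sie",
    ".slm": "slim",
}


def discover_encoding(dirfile_path, field_name):
    """Return the encoding implied by the first matching file, or None."""
    # The data file of an unencoded field has no extension at all.
    if os.path.isfile(os.path.join(dirfile_path, field_name)):
        return "none"
    # Stop at the first extension for which a data file exists.
    for extension, encoding in KNOWN_EXTENSIONS.items():
        if os.path.isfile(os.path.join(dirfile_path, field_name + extension)):
            return encoding
    return None


with tempfile.TemporaryDirectory() as dirfile:
    open(os.path.join(dirfile, "temperature.gz"), "wb").close()
    found = discover_encoding(dirfile, "temperature")
    missing = discover_encoding(dirfile, "pressure")
```
.fi
.PP
If both temperature.gz and temperature.bz2 existed, the result would depend
only on the probing order, which is the ambiguity noted above.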
.SS Out-of-place writes
Some of the encodings listed below only support writing via out-of-place writes;
that is, raw files are written in a temporary location and only moved into place
when closed. As a result, writing to these encodings requires making a copy of
the whole binary data file. A further side effect is that a third party
trying to concurrently read a Dirfile which is being written using one of
these encodings will usually fail.
.PP
Within GetData, reading from a field so encoded after writing to it causes
the write to the temporary file to be completed and the file to be moved into
place before the read occurs, which may take some time. Encodings which perform
out-of-place writes are:
.IR bzip2 ", " flac ", " gzip ", and " lzma .
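.PP
The general write-then-move pattern described above can be sketched in Python
as follows. This is a generic illustration under stated assumptions, not
GetData code: the helper name write_out_of_place is hypothetical, and how
GetData names or places its temporary files is not specified on this page.
.PP
.nf
```python
import os
import tempfile


def write_out_of_place(final_path, payload):
    """Write payload to a temporary file, then move it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Keep the temporary file beside the target so that the final move
    # stays on one filesystem.
    handle, temp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(handle, "wb") as stream:
            stream.write(payload)
        # The complete file replaces the old one in a single step.
        os.replace(temp_path, final_path)
    except BaseException:
        # Do not leave a partial temporary file behind on failure.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "field.gz")
    write_out_of_place(target, b"first version")
    write_out_of_place(target, b"second version")
    with open(target, "rb") as stream:
        result = stream.read()
    leftovers = sorted(os.listdir(workdir))
```
.fi
.PP
Until the move happens, a concurrent reader sees only the old file, which is
why reading a Dirfile while it is being written this way tends to fail.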
.SS BZip2 Encoding
The BZip2 Encoding compresses raw binary files using the Burrows-Wheeler
block sorting text compression algorithm and Huffman coding, as implemented in
the bzip2 format. GetData's BZip2 Encoding scheme is implemented through the
.I bzip2 compression library
written by Julian Seward. All operations are supported by the BZip2 Encoding,
but writing occurs out-of-place. See the
.B Out-of-place writes
section above for details.
.PP
GetData caches an uncompressed megabyte of data at a time to speed access times.
A call to
.BR gd_nframes (3)
requires decompression of the entire binary file to determine its uncompressed
size, and may take some time to complete.
.PP
The file extension of the BZip2 Encoding is
.BR .bz2 .
.SS FLAC Encoding
The FLAC Encoding compresses raw binary files using the Free Lossless Audio
Codec. GetData's FLAC Encoding scheme is implemented through the libFLAC
reference implementation developed by Josh Coalson and the Xiph.Org Foundation.
All operations are supported by the FLAC Encoding, but writing occurs
out-of-place. See the
.B Out-of-place writes
section above for details.
.PP
The FLAC format only permits samples up to 32 bits, but the libFLAC reference
codec can only handle samples up to 24 bits. GetData gets around this by
slicing data that is wider than 16 bits into multiple channels (2, 4, or 8,
depending on width). For big-endian data, the most significant 16 bits are
in channel 0, the second 16 bits in channel 1, and so on. For little-endian
data, this is reversed, with the least significant word in channel 0.
.PP
The sample rate specified in the FLAC header is ignored and may be any valid
value. FLAC files written by GetData use a sample rate of 1 Hz. The file
extension of the FLAC Encoding is
.BR .flac .
The Ogg container format is not supported.
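.PP
The channel slicing described above can be pictured with a short Python
sketch. It is an illustration only: the function name slice_into_channels is
hypothetical, samples are treated as plain unsigned integers, and nothing here
is taken from the GetData or libFLAC sources.
.PP
.nf
```python
def slice_into_channels(sample, width_bits, big_endian):
    """Split an unsigned sample into 16-bit words, one per channel."""
    count = width_bits // 16
    # Words listed from least significant to most significant.
    words = [(sample >> (16 * i)) & 0xFFFF for i in range(count)]
    # Big-endian data puts the most significant word in channel 0.
    return words[::-1] if big_endian else words


big = slice_into_channels(0x12345678, 32, big_endian=True)
little = slice_into_channels(0x12345678, 32, big_endian=False)
```
.fi
.PP
Here a 32-bit sample yields two channels; a 64-bit sample would yield four.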
.SS GZip Encoding
The GZip Encoding compresses raw binary files using Lempel-Ziv coding (LZ77) as
implemented in the gzip format. GetData's GZip Encoding scheme is implemented
through the
.I zlib compression library
written by Jean-loup Gailly and Mark Adler. All operations are supported by the
GZip Encoding, but writing occurs out-of-place. See the
.B Out-of-place writes
section above for details.
.PP
To speed the operation of
.BR gd_nframes (3),
the GZip Encoding reads the file's size from the gzip footer, which records the
uncompressed size in bytes, modulo 2**32. As a result,
using a field with an (uncompressed) binary file size larger than 4\~GiB as the
reference field will result in the wrong number of frames being reported.
The file extension of the GZip Encoding is
.BR .gz .
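.PP
The footer lookup and its 4\~GiB limitation can be demonstrated with the
Python standard library. This sketch is independent of GetData: the helper
name footer_size is hypothetical, and it relies only on the gzip format
storing the uncompressed size in the last four bytes of the file,
little-endian.
.PP
.nf
```python
import gzip
import struct


def footer_size(gzip_bytes):
    """Return the size field: the last four bytes, little-endian."""
    return struct.unpack("<I", gzip_bytes[-4:])[0]


payload = bytes(1000)
compressed = gzip.compress(payload)
reported = footer_size(compressed)

# The field is only 32 bits wide, so a 5 GiB file would report 1 GiB.
wrapped = (5 * 2**30) % 2**32
```
.fi
.PP
For the 1000-byte payload the footer reports the true size, while any file of
4\~GiB or more wraps around, as the last line shows.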
.SS LZMA Encoding
The LZMA Encoding compresses raw binary files using the Lempel-Ziv Markov
Chain Algorithm (LZMA) as implemented in the xz container format. GetData's
LZMA Encoding scheme is implemented through the
.IR "lzma library" ,
part of the
.I XZ Utils
suite written by Lasse Collin, Ville Koskinen, and Igor Pavlov. All operations
are supported by the LZMA Encoding, but writing occurs out-of-place. See the
.B Out-of-place writes
section above for details. Writing is supported only for the .xz container
format, and not for the obsolete .lzma format, which can still be read.
.PP
GetData caches an uncompressed megabyte of data at a time to speed access times.
A call to
.BR gd_nframes (3)
requires decompression of the entire binary file to determine its uncompressed
size, and may take some time to complete. The file extension of the LZMA
Encoding is
.BR .xz ,
or
.BR .lzma .
.SS Sample-Index Encoding
The Sample-Index Encoding (SIE) compresses raw binary data by replacing runs of
repeated data, similar to run-length encoding. SIE files contain binary records
consisting of a 64-bit sample number followed by a datum (the size and format of
which is determined by the RAW field's data type in the format metadata). The
sample number indicates the last sample of the field which has the specified
value. The first sample with the value is the sample immediately following the
data in the previous record, or sample number zero, for the first record.
Sample numbers are relative to any
.B /FRAMEOFFSET
specified in the Dirfile metadata. All operations are supported by the
Sample-Index Encoding. The file extension of the Sample-Index Encoding is
.BR .sie .
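.PP
To make the record layout above concrete, the following Python sketch expands
a small SIE byte string into one value per sample. It assumes, purely for
illustration, little-endian records and a UINT8 data type; the names RECORD
and expand_sie are hypothetical and the code is not taken from GetData.
.PP
.nf
```python
import struct

# Assumed layout for illustration: little-endian records holding a 64-bit
# sample number followed by one UINT8 datum.
RECORD = struct.Struct("<QB")


def expand_sie(raw):
    """Expand SIE records into one value per sample."""
    samples = []
    for offset in range(0, len(raw), RECORD.size):
        last_sample, value = RECORD.unpack_from(raw, offset)
        # Each record covers every sample after the previous record,
        # up to and including last_sample.
        samples.extend([value] * (last_sample + 1 - len(samples)))
    return samples


encoded = RECORD.pack(2, 7) + RECORD.pack(3, 9) + RECORD.pack(6, 7)
decoded = expand_sie(encoded)
```
.fi
.PP
Three records are enough to describe seven samples: 7, 7, 7, 9, 7, 7, 7.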
.SS Slim Encoding
The Slim Encoding reads compressed raw binary files using the slimlib
compression library written by Joseph Fowler. The slimlib library was developed
at Princeton University to compress dirfile-like data. GetData's Slim Encoding
framework currently lacks write capabilities; as a result, the Slim Encoding
does not support functions which modify binary files. The file extension of the
Slim Encoding is
.BR .slm .
Using the Slim Encoding with GetData may result in unexpected, but manageable,
memory usage. See the
.BR gd_getdata (3)
manual page for details.
.SS Text Encoding
The Text Encoding replaces the binary data files with 7-bit ASCII files
containing a decimal text encoding of the data, one sample per line. All
operations are supported by the Text Encoding. The file extension of the
Text Encoding is
.BR .txt .
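.PP
A minimal Python sketch of the one-sample-per-line layout, restricted to
integer data, is shown below. The function names encode_text and decode_text
are hypothetical, and the exact formatting GetData uses for floating-point
samples is not covered here.
.PP
.nf
```python
import io


def encode_text(samples):
    """Render integer samples as decimal text, one sample per line."""
    buffer = io.StringIO()
    for value in samples:
        print(value, file=buffer)
    return buffer.getvalue()


def decode_text(text):
    """Parse one decimal integer per line."""
    return [int(line) for line in text.splitlines()]


lines = encode_text([1, -2, 30]).splitlines()
round_trip = decode_text(encode_text([1, -2, 30]))
```
.fi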
.SS ZZip Encoding
The ZZip Encoding reads compressed raw binary files using the DEFLATE algorithm
as implemented in the PKWARE ZIP archive container format. GetData's ZZip
Encoding scheme is implemented through the
.I zzip library
written by Tomi Ollila and Guido Draheim. The ZZip Encoding framework
currently lacks write capabilities; as a result the ZZip Encoding does not
support functions which modify binary data.
.PP
Unlike most encoding schemes, the ZZip encoding merges all binary data files
defined in a given fragment into a single ZIP archive. The name of this
archive is
.I raw.zip
by default, but a different name may be specified using the second parameter to
the
.B /ENCODING
directive. For example,
.IP
.B /ENCODING zzip
archive
.PP
indicates that the ZIP archive is called
.IR archive.zip .
The file extension of the ZZip Encoding is
.BR .zip .
.SS ZZSlim Encoding
The ZZSlim Encoding is a combination of the Slim Encoding and the ZZip Encoding.
To create ZZSlim Encoded files, first the raw data are compressed using the
slim library, and then these slim-compressed files are archived (and compressed
again) into a ZIP archive. As with the ZZip Encoding, the ZIP archive is
.I raw.zip
by default, but a different name may be specified with the
.B /ENCODING
directive.
.PP
Notably, since the archives have the same name as ZZip Encoded data, automatic
encoding detection on ZZSlim Encoded data always fails: they are incorrectly
identified as simply ZZip Encoded. As a result, an
.B /ENCODING
directive in the format file or else a
.B GD_ZZSLIM_ENCODED
flag passed to
.BR gd_open (3)
is required to read ZZSlim encoded data. The file extension of the ZZSlim
Encoding is
.BR .zip .
Using the ZZSlim Encoding with GetData may result in unexpected, but manageable,
memory usage. See the
.BR gd_getdata (3)
manual page for details.
.SH AUTHOR
This manual page was written by D. V. Wiebe
.nh
<dvw@ketiltrout.net>.
.hy 1
.SH SEE ALSO
.BR bzip2 (1),
.BR flac (1),
.BR gzip (1),
.BR xz (1),
.BR zlib (3),
.BR dirfile (5),
.BR dirfile\-format (5)