<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<title>Filters</title>
<!-- #BeginLibraryItem "/ed_libs/styles_UG.lbi" -->
<!--
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Copyright by The HDF Group. *
* Copyright by the Board of Trustees of the University of Illinois. *
* All rights reserved. *
* *
* This file is part of HDF5. The full HDF5 copyright notice, including *
* terms governing use, modification, and redistribution, is contained in *
* the files COPYING and Copyright.html. COPYING can be found at the root *
* of the source code distribution tree; Copyright.html can be found at the *
* root level of an installed copy of the electronic HDF5 document set and *
* is linked from the top-level documents page. It can also be found at *
* http://hdfgroup.org/HDF5/doc/Copyright.html. If you do not have *
* access to either file, you may request a copy from help@hdfgroup.org. *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
-->
<!-- #EndLibraryItem --></head>
<body bgcolor="#FFFFFF">
<!-- #BeginLibraryItem "/ed_libs/NavBar_UG.lbi" --><hr>
<center>
<table border=0 width=98%>
<tr><td valign=top align=left>
<a href="../index.html">HDF5 documents and links</a> <br>
<a href="../H5.intro.html">Introduction to HDF5</a> <br>
<a href="../RM/RM_H5Front.html">HDF5 Reference Manual</a> <br>
<a href="../UG/index.html">HDF5 User's Guide for Release 1.8</a> <br>
<!--
<a href="Glossary.html">Glossary</a><br>
-->
</td>
<td valign=top align=right>
And in this document, the
<a href="../H5.user.html"><strong>HDF5 User's Guide from Release 1.4.5:</strong></a>
<br>
<a href="Files.html">Files</a>
<a href="Datasets.html">Datasets</a>
<a href="Datatypes.html">Datatypes</a>
<a href="Dataspaces.html">Dataspaces</a>
<a href="Groups.html">Groups</a>
<br>
<a href="References.html">References</a>
<a href="Attributes.html">Attributes</a>
<a href="Properties.html">Property Lists</a>
<a href="Errors.html">Error Handling</a>
<br>
<a href="Filters.html">Filters</a>
<a href="Caching.html">Caching</a>
<a href="Chunking.html">Chunking</a>
<a href="MountingFiles.html">Mounting Files</a>
<br>
<a href="Performance.html">Performance</a>
<a href="Debugging.html">Debugging</a>
<a href="Environment.html">Environment</a>
<a href="../ddl.html">DDL</a>
</td></tr>
</table>
</center>
<hr>
<!-- #EndLibraryItem --><h1>Filters in HDF5</h1>
<b>Note: Transient pipelines described in this document have not
been implemented.</b>
<h2>1. Introduction</h2>
<p>HDF5 allows chunked data<sup><a href="#fn1">1</a></sup>
to pass through user-defined filters
on the way to or from disk. The filters operate on chunks of an
<code>H5D_CHUNKED</code> dataset and can be arranged in a pipeline
so that the output of one filter becomes the input of the next.
<p>Each filter has a two-byte identification number (type
<code>H5Z_filter_t</code>) allocated by NCSA and can also be
passed application-defined integer resources to control its
behavior. Each filter also has an optional ASCII comment
string.
<p>
<center>
<table align=center width="80%">
<caption align=top>
<b>Values for <code>H5Z_filter_t</code></b>
</caption>
<tr>
<th width="30%">Value</th>
<th width="70%">Description</th>
</tr>
<tr valign=top>
<td><code>0-255</code></td>
<td>These values are reserved for filters predefined and
registered by the HDF5 library and of use to the general
public. They are described in a separate section
below.</td>
</tr>
<tr valign=top>
<td><code>256-511</code></td>
<td>Filter numbers in this range are used for testing only
and can be used temporarily by any organization. No
attempt is made to resolve numbering conflicts since all
definitions are by nature temporary.</td>
</tr>
<tr valign=top>
<td><code>512-65535</code></td>
<td>Reserved for future assignment. Please contact the
<a href="mailto:hdf5dev@hdfgroup.org">HDF5 development
team</a> to reserve a value or range of values for
use by your filters.</td>
</tr>
</table>
</center>
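<p>The ranges in the table can be checked mechanically. The following is
a minimal sketch, not library code: the <code>demo_</code> names are
invented for illustration and only the numeric boundaries come from the
table above.

```c
#include <assert.h>

/* Illustrative names only; none of these are part of the HDF5 API.
 * The boundaries are taken from the table of H5Z_filter_t values. */
typedef enum {
    DEMO_RANGE_PREDEFINED, /* 0-255: predefined by the library     */
    DEMO_RANGE_TESTING,    /* 256-511: temporary use, testing only */
    DEMO_RANGE_ASSIGNED,   /* 512-65535: assigned on request       */
    DEMO_RANGE_INVALID     /* does not fit a two-byte identifier   */
} demo_range_t;

/* Maps a candidate filter number to the range it falls in. */
demo_range_t demo_classify_filter(long id)
{
    if (id < 0 || id > 65535L) return DEMO_RANGE_INVALID;
    if (id <= 255) return DEMO_RANGE_PREDEFINED;
    if (id <= 511) return DEMO_RANGE_TESTING;
    return DEMO_RANGE_ASSIGNED;
}
```

<p>For instance, <code>demo_classify_filter(305)</code> reports the
testing range, so 305 is a reasonable number for an experimental filter.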
<h2>2. Defining and Querying the Filter Pipeline</h2>
<p>Two types of filters can be applied to raw data I/O: permanent
filters and transient filters. The permanent filter pipeline is
defined when the dataset is created while the transient pipeline
is defined for each I/O operation. During an
<code>H5Dwrite()</code> the transient filters are applied first
in the order defined and then the permanent filters are applied
in the order defined. For an <code>H5Dread()</code> the
opposite order is used: permanent filters in reverse order, then
transient filters in reverse order. An <code>H5Dread()</code>
must result in the same amount of data for a chunk as the
original <code>H5Dwrite()</code>.
<p>The permanent filter pipeline is defined by calling
<code>H5Pset_filter()</code> for a dataset creation property
list while the transient filter pipeline is defined by calling
that function for a dataset transfer property list.
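<p>The ordering rule can be modeled without the library. The sketch below
assumes a simplified, size-preserving filter type
(<code>demo_step_t</code>, a hypothetical name) rather than the real
filter signature, and only illustrates the order of application described
above: transient then permanent on write, and the exact reverse on read.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a filter: transforms nbytes of buf in place and
 * never changes the size. `reverse` is nonzero on the read path.
 * This is not the HDF5 filter signature. */
typedef void (*demo_step_t)(int reverse, unsigned char *buf, size_t nbytes);

/* Adds 1 to each byte on write and subtracts 1 on read (modulo 256). */
void demo_add_one(int reverse, unsigned char *buf, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        buf[i] = (unsigned char)(reverse ? buf[i] - 1 : buf[i] + 1);
}

/* XORs each byte with 0x0F; XOR is its own inverse. */
void demo_xor_mask(int reverse, unsigned char *buf, size_t nbytes)
{
    (void)reverse;
    for (size_t i = 0; i < nbytes; i++)
        buf[i] ^= 0x0F;
}

/* Write path: transient filters in order, then permanent filters in order. */
void demo_write(const demo_step_t *transient, size_t nt,
                const demo_step_t *permanent, size_t np,
                unsigned char *buf, size_t nbytes)
{
    assert(buf != NULL);
    for (size_t i = 0; i < nt; i++) transient[i](0, buf, nbytes);
    for (size_t i = 0; i < np; i++) permanent[i](0, buf, nbytes);
}

/* Read path: permanent filters in reverse order, then transient filters
 * in reverse order. */
void demo_read(const demo_step_t *transient, size_t nt,
               const demo_step_t *permanent, size_t np,
               unsigned char *buf, size_t nbytes)
{
    assert(buf != NULL);
    for (size_t i = np; i-- > 0; ) permanent[i](1, buf, nbytes);
    for (size_t i = nt; i-- > 0; ) transient[i](1, buf, nbytes);
}
```

<p>The two demo filters do not commute, so undoing them in the wrong
order would not restore the original bytes; calling
<code>demo_read()</code> after <code>demo_write()</code> with the same
pipelines does.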
<dl>
<dt><code>herr_t H5Pset_filter (hid_t <em>plist</em>,
H5Z_filter_t <em>filter</em>, unsigned int <em>flags</em>,
size_t <em>cd_nelmts</em>, const unsigned int
<em>cd_values</em>[])</code>
<dd>This function adds the specified <em>filter</em> and
corresponding properties to the end of the transient or
permanent output filter pipeline (depending on whether
<em>plist</em> is a dataset creation or dataset transfer
property list). The <em>flags</em> argument specifies certain
general properties of the filter and is documented below. The
<em>cd_values</em> is an array of <em>cd_nelmts</em> integers
which are auxiliary data for the filter. The integer values
will be stored in the dataset object header as part of the
filter information.
<br><br>
<dt><code>int H5Pget_nfilters (hid_t <em>plist</em>)</code>
<dd>This function returns the number of filters defined in the
permanent or transient filter pipeline depending on whether
<em>plist</em> is a dataset creation or dataset transfer
property list. In each pipeline the filters are numbered from
0 through <em>N</em>-1 where <em>N</em> is the value returned
by this function. During output to the file the filters of a
pipeline are applied in increasing order (the inverse is true
for input). Zero is returned if there are no filters in the
pipeline and a negative value is returned for errors.
<br><br>
<dt><code>H5Z_filter_t H5Pget_filter (hid_t <em>plist</em>,
int <em>filter_number</em>, unsigned int *<em>flags</em>,
size_t *<em>cd_nelmts</em>, unsigned int
*<em>cd_values</em>, size_t namelen, char name[])</code>
<dd>This is the query counterpart of
<code>H5Pset_filter()</code> and returns information about a
particular filter number in a permanent or transient pipeline
depending on whether <em>plist</em> is a dataset creation or
dataset transfer property list. On input, <em>cd_nelmts</em>
indicates the number of entries in the <em>cd_values</em>
array allocated by the caller while on exit it contains the
number of values defined by the filter. The
<em>filter_number</em> should be a value between zero and
<em>N</em>-1 as described for <code>H5Pget_nfilters()</code>
and the function will return failure (a negative value) if the
filter number is out of range. If <em>name</em> is a pointer
to an array of at least <em>namelen</em> bytes then the filter
name will be copied into that array. The name will be null
terminated if the <em>namelen</em> is large enough. The
filter name returned will be the name appearing in the file or
else the name registered for the filter or else an empty string.
</dl>
<p>The flags argument to the functions above is a bit vector of
the following fields:
<p>
<center>
<table align=center width="80%">
<caption align=top>
<b>Values for the <em>flags</em> argument</b>
</caption>
<tr>
<th width="30%">Value</th>
<th width="70%">Description</th>
</tr>
<tr valign=top>
<td><code>H5Z_FLAG_OPTIONAL</code></td>
<td>If this bit is set then the filter is optional. If
the filter fails (see below) during an
<code>H5Dwrite()</code> operation then the filter is
just excluded from the pipeline for the chunk for which
it failed; the filter will not participate in the
pipeline during an <code>H5Dread()</code> of the chunk.
This is commonly used for compression filters: if the
compression result would be larger than the input then
the compression filter returns failure and the
uncompressed data is stored in the file. If this bit is
clear and a filter fails then the
<code>H5Dwrite()</code> or <code>H5Dread()</code> also
fails.</td>
</tr>
</table>
</center>
<h2>3. Defining Filters</h2>
<p>Each filter is bidirectional, handling both input and output to
the file, and a flag is passed to the filter to indicate the
direction. In either case the filter reads a chunk of data from
a buffer, usually performs some sort of transformation on the
data, places the result in the same or new buffer, and returns
the buffer pointer and size to the caller. If something goes
wrong the filter should return zero to indicate a failure.
<p>During output, a filter that fails or isn't defined and is
marked as optional is silently excluded from the pipeline and
will not be used when reading that chunk of data. A required
filter that fails or isn't defined causes the entire output
operation to fail. During input, any filter that has not been
excluded from the pipeline during output and fails or is not
defined will cause the entire input operation to fail.
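<p>The failure rules can be sketched as a small driver loop. This is an
illustration under stated assumptions, not the library's implementation:
the <code>DEMO_</code> flag values and all <code>demo_</code> names are
hypothetical, a <code>NULL</code> function pointer models a filter that
is not defined, and a failing filter is assumed to leave the buffer
untouched. Excluded stages are recorded per chunk in a bit mask.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_FLAG_OPTIONAL 0x0001u /* stand-in for H5Z_FLAG_OPTIONAL */
#define DEMO_FLAG_REVERSE  0x0100u /* stand-in for H5Z_FLAG_REVERSE  */

typedef size_t (*demo_func_t)(unsigned int flags, size_t cd_nelmts,
                              const unsigned int cd_values[], size_t nbytes,
                              size_t *buf_size, void **buf);

typedef struct {
    unsigned int flags;
    demo_func_t  func;  /* NULL models a filter that is not defined */
} demo_stage_t;

/* Demo filters: one always fails, one passes data through unchanged. */
size_t demo_always_fail(unsigned int flags, size_t cd_nelmts,
                        const unsigned int cd_values[], size_t nbytes,
                        size_t *buf_size, void **buf)
{
    (void)flags; (void)cd_nelmts; (void)cd_values;
    (void)nbytes; (void)buf_size; (void)buf;
    return 0;
}

size_t demo_pass_through(unsigned int flags, size_t cd_nelmts,
                         const unsigned int cd_values[], size_t nbytes,
                         size_t *buf_size, void **buf)
{
    (void)flags; (void)cd_nelmts; (void)cd_values;
    (void)buf_size; (void)buf;
    return nbytes;
}

/* Output: returns the number of valid bytes, or 0 if a required filter
 * failed. Bit i of *excluded is set when optional stage i was skipped. */
size_t demo_write_chunk(const demo_stage_t *stages, size_t n, size_t nbytes,
                        size_t *buf_size, void **buf, unsigned int *excluded)
{
    assert(n <= sizeof(unsigned int) * CHAR_BIT);
    *excluded = 0;
    for (size_t i = 0; i < n; i++) {
        size_t out = stages[i].func
            ? stages[i].func(stages[i].flags, 0, NULL, nbytes, buf_size, buf)
            : 0;
        if (out == 0) {
            if (!(stages[i].flags & DEMO_FLAG_OPTIONAL))
                return 0;            /* required filter: whole write fails */
            *excluded |= 1u << i;    /* optional filter: skip for this chunk */
            continue;
        }
        nbytes = out;
    }
    return nbytes;
}

/* Input: stages run in reverse; excluded stages are skipped and any other
 * failure or missing filter fails the whole read. */
size_t demo_read_chunk(const demo_stage_t *stages, size_t n, size_t nbytes,
                       size_t *buf_size, void **buf, unsigned int excluded)
{
    assert(n <= sizeof(unsigned int) * CHAR_BIT);
    for (size_t i = n; i-- > 0; ) {
        if (excluded & (1u << i)) continue;
        if (stages[i].func == NULL) return 0;
        nbytes = stages[i].func(stages[i].flags | DEMO_FLAG_REVERSE, 0, NULL,
                                nbytes, buf_size, buf);
        if (nbytes == 0) return 0;
    }
    return nbytes;
}
```

<p>With an optional stage that fails followed by a pass-through stage,
<code>demo_write_chunk()</code> succeeds and sets bit 0 of the mask; the
same failing stage without the optional flag makes the write return zero.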
<p>Filters are defined in two phases. The first phase is to
define a function to act as the filter and link the function
into the application. The second phase is to register the
function, associating the function with an
<code>H5Z_filter_t</code> identification number and a comment.
<dl>
<dt><code>typedef size_t (*H5Z_func_t)(unsigned int
<em>flags</em>, size_t <em>cd_nelmts</em>, const unsigned int
<em>cd_values</em>[], size_t <em>nbytes</em>, size_t
*<em>buf_size</em>, void **<em>buf</em>)</code>
<dd>The <em>flags</em>, <em>cd_nelmts</em>, and
<em>cd_values</em> are the same as for the
<code>H5Pset_filter()</code> function with the additional flag
<code>H5Z_FLAG_REVERSE</code> which is set when the filter is
called as part of the input pipeline. The input buffer is
pointed to by <em>*buf</em> and has a total size of
<em>*buf_size</em> bytes but only <em>nbytes</em> are valid
data. The filter should perform the transformation in place if
possible and return the number of valid bytes or zero for
failure. If the transformation cannot be done in place then
the filter should allocate a new buffer with
<code>malloc()</code> and assign it to <em>*buf</em>,
assigning the allocated size of that buffer to
<em>*buf_size</em>. The old buffer should be freed
by calling <code>free()</code>.
<br><br>
<dt><code>herr_t H5Zregister (H5Z_filter_t <em>filter_id</em>,
const char *<em>comment</em>, H5Z_func_t
<em>filter</em>)</code>
<dd>The <em>filter</em> function is associated with a filter
number and a short ASCII comment which will be stored in the
HDF5 file if the filter is used as part of a permanent
pipeline during dataset creation.
</dl>
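<p>Registration amounts to a lookup table keyed by the filter number. The
sketch below is a stand-alone model of that idea, not the library's data
structure: every <code>demo_</code> name, the table size, and the choice
to let a second registration of the same number replace the first are
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short demo_filter_id_t; /* stand-in for H5Z_filter_t */

typedef size_t (*demo_filter_func_t)(unsigned int flags, size_t cd_nelmts,
                                     const unsigned int cd_values[],
                                     size_t nbytes, size_t *buf_size,
                                     void **buf);

typedef struct {
    demo_filter_id_t   id;
    const char        *comment;
    demo_filter_func_t func;
} demo_entry_t;

#define DEMO_MAX_FILTERS 32
static demo_entry_t demo_table[DEMO_MAX_FILTERS];
static size_t       demo_count = 0;

/* A do-nothing filter used only to have something to register. */
size_t demo_identity(unsigned int flags, size_t cd_nelmts,
                     const unsigned int cd_values[], size_t nbytes,
                     size_t *buf_size, void **buf)
{
    (void)flags; (void)cd_nelmts; (void)cd_values; (void)buf_size; (void)buf;
    return nbytes;
}

/* Looks up a registered filter; returns NULL if the number is unknown. */
const demo_entry_t *demo_find(demo_filter_id_t id)
{
    for (size_t i = 0; i < demo_count; i++)
        if (demo_table[i].id == id)
            return &demo_table[i];
    return NULL;
}

/* Associates a filter number with a comment and a function.
 * Returns 0 on success and -1 on failure. */
int demo_register(demo_filter_id_t id, const char *comment,
                  demo_filter_func_t func)
{
    if (func == NULL) return -1;
    for (size_t i = 0; i < demo_count; i++) {
        if (demo_table[i].id == id) {       /* assumption: replace entry */
            demo_table[i].comment = comment;
            demo_table[i].func = func;
            return 0;
        }
    }
    if (demo_count == DEMO_MAX_FILTERS) return -1;
    demo_table[demo_count].id = id;
    demo_table[demo_count].comment = comment;
    demo_table[demo_count].func = func;
    demo_count++;
    return 0;
}
```

<p>A pipeline driver would call something like <code>demo_find()</code>
for each filter number stored with the dataset and treat a
<code>NULL</code> result as "filter not defined".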
<h2>4. Predefined Filters</h2>
<p>If <code>zlib</code> version 1.1.2 or later was found
during configuration then the library will define a filter whose
<code>H5Z_filter_t</code> number is
<code>H5Z_FILTER_DEFLATE</code>. Since this compression method
has the potential for generating compressed data which is larger
than the original, the <code>H5Z_FLAG_OPTIONAL</code> flag
should be turned on so such cases can be handled gracefully by
storing the original data instead of the compressed data. The
<em>cd_nelmts</em> should be one with <em>cd_values[0]</em>
being a compression aggression level between zero and nine,
inclusive (zero is the fastest compression while nine results in
the best compression ratio).
<p>A convenience function for adding the
<code>H5Z_FILTER_DEFLATE</code> filter to a pipeline is:
<dl>
<dt><code>herr_t H5Pset_deflate (hid_t <em>plist</em>, unsigned
<em>aggression</em>)</code>
<dd>The deflate compression method is added to the end of the
permanent or transient filter pipeline depending on whether
<em>plist</em> is a dataset creation or dataset transfer
property list. The <em>aggression</em> is a number between
zero and nine (inclusive) to indicate the tradeoff between
speed and compression ratio (zero is fastest, nine is best
ratio).
</dl>
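<p>To show why a compression filter is normally marked optional, here is
a hedged sketch that swaps in a toy run-length encoder for deflate so it
needs no external library. The filter follows the function signature
from section 3, returns zero when the encoded chunk would not be smaller
than the input, and replaces the buffer on the input path. The
<code>DEMO_FLAG_REVERSE</code> value and the <code>demo_</code> names are
assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_FLAG_REVERSE 0x0100u /* stand-in for H5Z_FLAG_REVERSE */

/* Toy run-length coder: the encoded form is a sequence of
 * (run length 1..255, byte value) pairs. */
size_t demo_rle_filter(unsigned int flags, size_t cd_nelmts,
                       const unsigned int cd_values[], size_t nbytes,
                       size_t *buf_size, void **buf)
{
    (void)cd_nelmts; (void)cd_values;
    assert(buf != NULL && *buf != NULL && buf_size != NULL);
    unsigned char *in = *buf;
    if (nbytes == 0) return 0;

    if (flags & DEMO_FLAG_REVERSE) {
        /* Input: compute the decoded size, then expand into a new buffer. */
        if (nbytes % 2 != 0) return 0;
        size_t total = 0;
        for (size_t i = 0; i < nbytes; i += 2) {
            if (in[i] == 0) return 0;      /* zero-length run is invalid */
            total += in[i];
        }
        unsigned char *out = malloc(total);
        if (out == NULL) return 0;
        size_t pos = 0;
        for (size_t i = 0; i < nbytes; i += 2) {
            memset(out + pos, in[i + 1], in[i]);
            pos += in[i];
        }
        free(in);                          /* old buffer is released */
        *buf = out;
        *buf_size = total;
        return total;
    }

    /* Output: encode into scratch space so that a failure leaves the
     * caller's buffer untouched. */
    unsigned char *tmp = malloc(nbytes);
    if (tmp == NULL) return 0;
    size_t used = 0;
    for (size_t i = 0; i < nbytes; ) {
        size_t run = 1;
        while (i + run < nbytes && in[i + run] == in[i] && run < 255)
            run++;
        if (used + 2 >= nbytes) {          /* result would not be smaller */
            free(tmp);
            return 0;
        }
        tmp[used++] = (unsigned char)run;
        tmp[used++] = in[i];
        i += run;
    }
    memcpy(in, tmp, used);                 /* fits: used < nbytes */
    free(tmp);
    return used;
}
```

<p>A chunk of 100 identical bytes shrinks to a single pair, while a chunk
such as <code>abcdef</code> makes the filter return zero; with
<code>H5Z_FLAG_OPTIONAL</code> set, that failure would cause the original
bytes to be stored instead.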
<p>Even if <code>zlib</code> isn't detected during
configuration the application can define
<code>H5Z_FILTER_DEFLATE</code> as a permanent filter. If the
filter is marked as optional (as with
<code>H5Pset_deflate()</code>) then it will always fail and be
automatically removed from the pipeline. Applications that read
data will fail only if the data is actually compressed; they
won't fail if <code>H5Z_FILTER_DEFLATE</code> was part of the
permanent output pipeline but was automatically excluded because
it didn't exist when the data was written.
<p><code>zlib</code> can be acquired from
<code><a href="http://www.cdrom.com/pub/infozip/zlib/">http://www.cdrom.com/pub/infozip/zlib/</a></code>.
<h2>5. Example</h2>
<p>This example shows how to define and register a simple filter
that adds a checksum capability to the data stream.
<p>The function that acts as the filter always returns zero
(failure) if the <code>md5()</code> function was not detected at
configuration time (left as an exercise for the reader).
Otherwise the function is broken down to an input and output
half. The output half calculates a checksum, increases the size
of the output buffer if necessary, and appends the checksum to
the end of the buffer. The input half calculates the checksum
on the first part of the buffer and compares it to the checksum
already stored at the end of the buffer. If the two differ then
zero (failure) is returned, otherwise the buffer size is reduced
to exclude the checksum.
<p>
<center>
<table border align=center width="100%">
<tr>
<td>
<p><code><pre>
size_t
md5_filter(unsigned int flags, size_t cd_nelmts,
           const unsigned int cd_values[], size_t nbytes,
           size_t *buf_size, void **buf)
{
#ifdef HAVE_MD5
    unsigned char cksum[16];

    if (flags &amp; H5Z_FLAG_REVERSE) {
        /* Input: the last 16 bytes hold the stored checksum */
        assert(nbytes&gt;=16);
        md5(nbytes-16, *buf, cksum);

        /* Compare */
        if (memcmp(cksum, (char*)(*buf)+nbytes-16, 16)) {
            return 0; /*fail*/
        }

        /* Strip off checksum */
        return nbytes-16;
    } else {
        /* Output */
        md5(nbytes, *buf, cksum);

        /* Increase buffer size if necessary */
        if (nbytes+16 &gt; *buf_size) {
            void *bigger = realloc(*buf, nbytes + 16);
            if (!bigger) {
                return 0; /*fail: keep the old buffer*/
            }
            *buf = bigger;
            *buf_size = nbytes + 16;
        }

        /* Append checksum */
        memcpy((char*)(*buf)+nbytes, cksum, 16);
        return nbytes+16;
    }
#else
    return 0; /*fail*/
#endif
}
</pre></code>
</td>
</tr>
</table>
</center>
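<p>Because <code>md5()</code> is left undefined in this chapter, the
following variant may be easier to experiment with. It is a sketch that
substitutes a hand-written Fletcher-16 checksum (two bytes instead of
sixteen) for MD5 and compiles without HDF5; the
<code>DEMO_FLAG_REVERSE</code> value and the <code>demo_</code> names are
assumptions, and Fletcher-16 is far weaker than MD5.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_FLAG_REVERSE 0x0100u /* stand-in for H5Z_FLAG_REVERSE */
#define DEMO_CKSUM_SIZE   2u      /* Fletcher-16 yields two bytes  */

/* Fletcher-16 over len bytes; used here in place of md5(). */
static unsigned int demo_fletcher16(const unsigned char *data, size_t len)
{
    unsigned int sum1 = 0, sum2 = 0;
    for (size_t i = 0; i < len; i++) {
        sum1 = (sum1 + data[i]) % 255u;
        sum2 = (sum2 + sum1) % 255u;
    }
    return (sum2 << 8) | sum1;
}

size_t demo_checksum_filter(unsigned int flags, size_t cd_nelmts,
                            const unsigned int cd_values[], size_t nbytes,
                            size_t *buf_size, void **buf)
{
    (void)cd_nelmts; (void)cd_values;
    assert(buf != NULL && *buf != NULL && buf_size != NULL);
    unsigned char *bytes = *buf;

    if (flags & DEMO_FLAG_REVERSE) {
        /* Input: verify the trailing checksum, then strip it. */
        if (nbytes <= DEMO_CKSUM_SIZE) return 0;
        size_t payload = nbytes - DEMO_CKSUM_SIZE;
        unsigned int stored =
            ((unsigned int)bytes[payload] << 8) | bytes[payload + 1];
        if (stored != demo_fletcher16(bytes, payload)) return 0;
        return payload;
    }

    /* Output: grow the buffer if needed, then append the checksum. */
    if (nbytes == 0) return 0;
    unsigned int cksum = demo_fletcher16(bytes, nbytes);
    if (nbytes + DEMO_CKSUM_SIZE > *buf_size) {
        unsigned char *bigger = realloc(bytes, nbytes + DEMO_CKSUM_SIZE);
        if (bigger == NULL) return 0;     /* old buffer is still valid */
        bytes = bigger;
        *buf = bigger;
        *buf_size = nbytes + DEMO_CKSUM_SIZE;
    }
    bytes[nbytes]     = (unsigned char)(cksum >> 8);
    bytes[nbytes + 1] = (unsigned char)(cksum & 0xFFu);
    return nbytes + DEMO_CKSUM_SIZE;
}
```

<p>Writing the five bytes <code>abcde</code> yields seven bytes; reading
them back returns five, and flipping any payload bit before the read
makes the filter return zero.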
<p>Once the filter function is defined it must be registered so
the HDF5 library knows about it. Since we're testing this
filter we choose one of the <code>H5Z_filter_t</code> numbers
from the range set aside for testing (256-511). We'll arbitrarily choose 305.
<p>
<center>
<table border align=center width="100%">
<tr>
<td>
<p><code><pre>
#define FILTER_MD5 305
herr_t status = H5Zregister(FILTER_MD5, "md5 checksum", md5_filter);
</pre></code>
</td>
</tr>
</table>
</center>
<p>Now we can use the filter in a pipeline. We could have added
the filter to the pipeline before defining or registering the
filter as long as the filter was defined and registered by the time
we tried to use it (if the filter is marked as optional then we
could have used it without defining it and the library would
have automatically removed it from the pipeline for each chunk
written before the filter was defined and registered).
<p>
<center>
<table border align=center width="100%">
<tr>
<td>
<p><code><pre>
hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
hsize_t chunk_size[3] = {10,10,10};
H5Pset_chunk(dcpl, 3, chunk_size);
H5Pset_filter(dcpl, FILTER_MD5, 0, 0, NULL);
hid_t dset = H5Dcreate(file, "dset", H5T_NATIVE_DOUBLE, space, dcpl);
</pre></code>
</td>
</tr>
</table>
</center>
<h2>6. Filter Diagnostics</h2>
<p>If the library is compiled with debugging turned on for the H5Z
layer (usually as a result of <code>configure
--enable-debug=z</code>) then filter statistics are printed when
the application exits normally or the library is closed. The
statistics are written to the standard error stream and include
two lines for each filter that was used: one for input and one
for output. The following fields are displayed:
<p>
<center>
<table align=center width="80%">
<tr>
<th width="30%">Field Name</th>
<th width="70%">Description</th>
</tr>
<tr valign=top>
<td>Method</td>
<td>This is the name of the method as defined with
<code>H5Zregister()</code> with the characters
"&lt;" or "&gt;" prepended to indicate
input or output.</td>
</tr>
<tr valign=top>
<td>Total</td>
<td>The total number of bytes processed by the filter
including errors. This is the maximum of the
<em>nbytes</em> argument and the return value.</td>
</tr>
<tr valign=top>
<td>Errors</td>
<td>This field shows the number of bytes of the Total
column which can be attributed to errors.</td>
</tr>
<tr valign=top>
<td>User, System, Elapsed</td>
<td>These are the amount of user time, system time, and
elapsed time in seconds spent in the filter function.
Elapsed time is sensitive to system load. These times
may be zero on operating systems that don't support the
required operations.</td>
</tr>
<tr valign=top>
<td>Bandwidth</td>
<td>This is the filter bandwidth which is the total
number of bytes processed divided by elapsed time.
Since elapsed time is subject to system load the
bandwidth numbers cannot always be trusted.
Furthermore, the bandwidth includes bytes attributed to
errors which may significantly taint the value if the
function is able to detect errors without much
expense.</td>
</tr>
</table>
</center>
<p>
<center>
<table border align=center width="100%">
<caption align=bottom>
<b>Example: Filter Statistics</b>
</caption>
<tr>
<td>
<p><code><pre>
H5Z: filter statistics accumulated over life of library:
   Method     Total  Errors  User  System  Elapsed Bandwidth
   ------     -----  ------  ----  ------  ------- ---------
   &gt;deflate  160000   40000  0.62    0.74     1.33 117.5 kBs
   &lt;deflate  120000       0  0.11    0.00     0.12 1.000 MBs
</pre></code>
</td>
</tr>
</table>
</center>
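<p>The Bandwidth column can be reproduced from the Total and Elapsed
columns. The helpers below are a sketch with hypothetical names, not the
library's reporting code; returning zero when no elapsed time was
measured is an assumption, as is reading "kBs" as 1024 bytes per second,
which is consistent with the first row of the example.

```c
#include <assert.h>

/* Bytes per second; returns 0.0 when no elapsed time was measured
 * (an assumption made for this sketch). */
double demo_bandwidth(unsigned long total_bytes, double elapsed_seconds)
{
    assert(elapsed_seconds >= 0.0);
    if (elapsed_seconds == 0.0) return 0.0;
    return (double)total_bytes / elapsed_seconds;
}

/* Share of the processed bytes that is attributed to errors. */
double demo_error_fraction(unsigned long total_bytes,
                           unsigned long error_bytes)
{
    if (total_bytes == 0) return 0.0;
    return (double)error_bytes / (double)total_bytes;
}
```

<p>For the output row of the example,
<code>demo_bandwidth(160000, 1.33) / 1024.0</code> is roughly 117.5, and
<code>demo_error_fraction(160000, 40000)</code> shows that a quarter of
those bytes came from failed calls, which is why the bandwidth figure
should be read with care.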
<hr>
<p><a name="fn1">Footnote 1:</a> Dataset chunks can be compressed
through the use of filters. Developers should be aware that
reading and rewriting compressed chunked data can result in holes
in an HDF5 file. In time, enough such holes can increase the
file size enough to impair application or library performance
when working with that file. See
“<a href="Performance.html#Freespace">Freespace Management</a>”
in the chapter
“<a href="Performance.html">Performance Analysis and Issues</a>.”</p>
<hr>
<address>
THG Help Desk: <img src="../Graphics/help.png" align=top height=16>
<br>
Describes HDF5 Release 1.4.5, February 2003
</address>
<!-- Created: Fri Apr 17 13:39:35 EDT 1998 -->
<!-- hhmts start -->
Last modified: 2 August 2001
<!-- hhmts end -->
</body>
</html>