1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
|
---
title: "HDF5 Compression Filters"
author:
- name: Mike L. Smith
affiliation: de.NBI & EMBL Heidelberg
package: rhdf5filters
output:
BiocStyle::html_document
vignette: |
%\VignetteIndexEntry{HDF5 Compression Filters}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Motivation
One of the advantages of using HDF5 is that data stored on disk can be compressed, reducing both the space required to store them and the time needed to read those data. This data compression is applied as part of the HDF5 "filter pipeline" that modifies data during I/O operations. HDF5 includes several filter algorithms as standard, and the version of the HDF5 library found in `r BiocStyle::Biocpkg("Rhdf5lib")` is additionally compiled with support for the *deflate* and *szip* compression filters which rely on third-party compression libraries. Collectively HDF5 refer to these as the "internal" filters. It is possible to use any combination of these (including none) when writing data using `r BiocStyle::Biocpkg("rhdf5")`. The default filter pipeline is shown in Figure \@ref(fig:filter-pipeline).
```{r filter-pipeline, echo = FALSE, fig.cap="The default compression pipeline used by rhdf5"}
knitr::include_graphics("filter_pipeline.png")
```
This pipeline approach has been designed so that filters can be chained together (as in the diagram above) or easily substituted for alternative filters. This allows tailoring the compression approach to best suit the data or application.
It may be case that for a specific usecase an alternative, third-party, compression algorithm would be the most appropriate to use. Such filters, which are not part of the standard HDF5 distribution, are referred to as "external" filters. In order to allow their use without requiring either the HDF5 library or applications to be built with support for all possible filters HDF5 is able to use dynamically loaded filters. These are compiled independently from the HDF5 library, but are available to an application at run time.
This package currently distributes external HDF5 filters employing [**bzip2**](https://sourceware.org/bzip2/), [LZF](http://oldhome.schmorp.de/marc/liblzf.html), [VBZ](https://github.com/nanoporetech/vbz_compression), [Zstandard](https://github.com/facebook/zstd) and the [**Blosc**](https://blosc.org/) meta-compressor. In total `r BiocStyle::Biocpkg("rhdf5filters")` provides access to 11^[zlib & Zstandard compression are available as standalone filters and also available via Blosc.] compression filters than can be applied to HDF5 datasets. The full list of filters currently provided by the package is:
- bzip2
- blosclz
- lz4
- lz4hc
- snappy
- Zstandard
- zlib
- LZF
- VBZ
# Usage
## With **rhdf5**
`r BiocStyle::Biocpkg("rhdf5filters")` is principally designed to be used via the `r BiocStyle::Biocpkg("rhdf5")` package, where several functions are able to utilise the compression filters. For completeness those functions are described here and are also documented in the `r BiocStyle::Biocpkg("rhdf5")` vignette.
### Writing data
The function `h5createDataset()` within `r BiocStyle::Biocpkg("rhdf5")` takes the argument `filter` which specifies which compression filter should be used when a new dataset is created.
Also available in `r BiocStyle::Biocpkg("rhdf5")` are the functions `H5Pset_bzip2()`, ``H5Pset_lzf()` and `H5Pset_blosc()`. These are not part of the standard HDF5 interface, but are modeled on the `H5Pset_deflate()` function and allow the *bzip2*, *lzf* and *blosc* filters to be set on dataset create property lists.
### Reading data
As long as `r BiocStyle::Biocpkg("rhdf5filters")` is installed, `r BiocStyle::Biocpkg("rhdf5")` will be able to transparently read data compressed using any of the filters available in the package without requiring any action on your part.
## With external applications
The dynamic loading design of the HDF5 compression filters means that you can use the versions distributed with `r BiocStyle::Biocpkg("rhdf5filters")` with other applications, including other R packages that interface HDF5 as well as external applications not written in R e.g. HDFVIEW. The function `hdf5_plugin_path()` will return the location of in your packages library where the compiled plugins are stored. You can pass this location to the HDF5 function `H5PLprepend()` or you can the set the environment variable `HDF5_PLUGIN_PATH`, and other applications will be able to dynamically load the compression plugins found there if needed.
```{r, plugin-path, eval = TRUE}
rhdf5filters::hdf5_plugin_path()
```
### **h5dump** example
The next example demonstrates how the filters distributed by `r BiocStyle::Biocpkg("rhdf5filters")` can be used by external applications to decompress data. Do do this we'll use the version of **h5dump** installed on the system^[If **h5dump** is not found on your system these example will fail.] and a file distributed with this package that has been compressed using the *blosc* filter.
```{r, warning = FALSE, eval = FALSE}
## blosc compressed file
blosc_file <- system.file("h5examples/h5ex_d_blosc.h5",
package = "rhdf5filters")
```
Now we use `system2()` to call the system version of **h5dump** and capture the output, which is then printed below. The most important parts to note are the `FILTERS` section, which shows the dataset was indeed compressed with *blosc*, and `DATA`, where the error shows that **h5dump** is currently unable to read the dataset.
```{r, h5dump-1, warning = FALSE, eval = FALSE}
h5dump_out <- system2('h5dump',
args = c('-p', '-d /dset', blosc_file),
stdout = TRUE, stderr = TRUE)
cat(h5dump_out, sep = "\n")
```
```{r, h5dump-1-out, eval = TRUE, echo = FALSE}
cat(
'HDF5 "rhdf5filters/h5examples/h5ex_d_blosc.h5" {
DATASET "/dset" {
DATATYPE H5T_IEEE_F32LE
DATASPACE SIMPLE { ( 30, 10, 20 ) / ( 30, 10, 20 ) }
STORAGE_LAYOUT {
CHUNKED ( 10, 10, 20 )
SIZE 3347 (7.171:1 COMPRESSION)
}
FILTERS {
USER_DEFINED_FILTER {
FILTER_ID 32001
COMMENT blosc
PARAMS { 2 2 4 8000 4 1 0 }
}
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_IFSET
VALUE H5D_FILL_VALUE_DEFAULT
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_INCR
}
DATA {h5dump error: unable to print data
}
}
}'
)
```
Next we set `HDF5_PLUGIN_PATH` to the location where `r BiocStyle::Biocpkg("rhdf5filters")` has stored the filters and re-run the call to **h5dump**. Printing the output^[The dataset is quite large, so we only show a few lines here.] no longer returns an error in the `DATA` section, indicating that the *blosc* filter plugin was found and used by **h5dump**.
```{r h5dump-2, eval = FALSE}
## set environment variable to hdf5filter location
Sys.setenv("HDF5_PLUGIN_PATH" = rhdf5filters::hdf5_plugin_path())
h5dump_out <- system2('h5dump',
args = c('-p', '-d /dset', '-w 50', blosc_file),
stdout = TRUE, stderr = TRUE)
## find the data entry and print the first few lines
DATA_line <- grep(h5dump_out, pattern = "DATA \\{")
cat( h5dump_out[ (DATA_line):(DATA_line+2) ], sep = "\n" )
```
```{r h5dump-2-out, eval = TRUE, echo = FALSE}
cat(
' DATA {
(0,0,0): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
(0,0,11): 11, 12, 13, 14, 15, 16, 17, 18,'
)
```
# Compiling the compression libraries
In order to compile the compression filters **rhdf5filters** needs to link them
against libraries for the relevant compression tools. These libraries include `bzip2`,
`blosc` and `zstd`. By default **rhdf5filters** will look for system installations
of those libraries and link against them. If they are not found then versions
that are bundled with the package will be compiled and used instead. It is possible
to use system libraries for some filters and the bundled versions for others -
**rhdf5filters** should be able to determine this at compile time.
If you wish to force the use of the bundled libraries, perhaps to ensure linking
against the static versions, this can be done by providing the configure argument
`"--with-bundled-libraries"` during package installation e.g.
```r
BiocManager::install('rhdf5filters',
configure.args = "--with-bundled-libraries")
```
# Session info {.unnumbered}
```{r sessionInfo, echo=FALSE}
sessionInfo()
```
|