# indexed_gzip


[![Build Status](https://travis-ci.org/pauldmccarthy/indexed_gzip.svg?branch=master)](https://travis-ci.org/pauldmccarthy/indexed_gzip)

[![PyPi version](https://img.shields.io/pypi/v/indexed_gzip.svg)](https://pypi.python.org/pypi/indexed_gzip/)


 *Fast random access of gzip files in Python*


 * [Overview](#overview)
 * [Installation](#installation)
 * [Usage](#usage)
 * [Using with `nibabel`](#using-with-nibabel)
 * [Index import/export](#index-importexport)
 * [Write support](#write-support)
 * [Performance](#performance)
 * [Acknowledgements](#acknowledgements)
 * [License](#license)


## Overview


The `indexed_gzip` project is a Python extension which provides the
`IndexedGzipFile` class, intended as a drop-in replacement for the built-in
Python `gzip.GzipFile` class.


`indexed_gzip` was written to allow fast random access of compressed
[NIFTI](http://nifti.nimh.nih.gov/) image files (for which GZIP is the
de-facto compression standard), but will work with any GZIP file.
`indexed_gzip` is easy to use with `nibabel` (http://nipy.org/nibabel/).


The standard `gzip.GzipFile` class exposes a random access-like interface (via
its `seek` and `read` methods), but every time you seek to a new point in the
uncompressed data stream, the `GzipFile` instance has to start decompressing
from the beginning of the file, until it reaches the requested location.
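

The seek-and-read interface of the standard class can be seen with nothing
but the standard library. The sketch below builds a small in-memory GZIP
stream (the payload and the offset are arbitrary values chosen for this
example), seeks into the uncompressed data, and reads a few bytes. The
result is correct, but the data before the requested offset still has to be
decompressed to reach it:


```python
import gzip
import io

# Build a 1 MiB payload and compress it into an in-memory buffer.
# The payload contents and sizes are arbitrary example values.
payload = bytes(range(256)) * 4096
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as f:
    f.write(payload)

# Re-open the compressed stream for reading, and seek to an
# offset in the *uncompressed* data. GzipFile reaches the offset
# by decompressing (and discarding) everything that precedes it.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode='rb') as f:
    f.seek(500000)
    chunk = f.read(16)
```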


An `IndexedGzipFile` instance gets around this performance limitation by
building an index, which contains *seek points*, mappings between
corresponding locations in the compressed and uncompressed data streams. Each
seek point is accompanied by a chunk (32KB) of uncompressed data which is used
to initialise the decompression algorithm, allowing us to start reading from
any seek point. If the index is built with a seek point spacing of 1MB, we
only have to decompress (on average) 512KB of data to read from any location
in the file.
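

The arithmetic behind that figure can be sketched in a few lines. This is a
conceptual illustration only, not the C code that `indexed_gzip` uses: the
`nearest_seek_point` helper and the evenly spaced list of offsets are
assumptions made for this example, and only uncompressed offsets are tracked
(a real seek point also records the matching compressed offset and its 32KB
chunk of data):


```python
import bisect

def nearest_seek_point(seek_points, offset):
    """Return the last seek point located at or before ``offset``.

    ``seek_points`` is a sorted list of uncompressed offsets.
    """
    i = bisect.bisect_right(seek_points, offset) - 1
    return seek_points[max(i, 0)]

# Hypothetical index with a seek point every 1MB
spacing = 1048576
seek_points = list(range(0, 10 * spacing, spacing))

# To read from this offset, decompression starts at the
# preceding seek point rather than at the start of the file.
offset = 3 * spacing + 234195
start = nearest_seek_point(seek_points, offset)
to_decompress = offset - start

# Averaged over every possible offset between two seek
# points, the distance to decompress is about half the spacing.
average = sum(range(spacing)) / spacing
```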


## Installation


`indexed_gzip` is available on [PyPI](https://pypi.python.org/pypi) - to
install, simply type:
```sh
pip install indexed_gzip
```

To compile `indexed_gzip`, make sure you have [cython](http://cython.org/)
installed (and `numpy` if you want to compile the tests), and then run:
```sh
python setup.py develop
```


To run the tests, type the following; you will need `numpy` and `pytest`
installed:
```sh
pytest
```

## Usage


You can use the `indexed_gzip` module directly:


```python
import indexed_gzip as igzip

# You can create an IndexedGzipFile instance
# by specifying a file name, or an open file
# handle. For the latter use, the file handle
# must be opened in read-only binary mode.
# Write support is currently non-existent.
myfile = igzip.IndexedGzipFile('big_file.gz')

some_offset_into_uncompressed_data = 234195

# The index will be automatically
# built on-demand when seeking or
# reading.
myfile.seek(some_offset_into_uncompressed_data)
data = myfile.read(1048576)
```
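

Because `IndexedGzipFile` is intended as a drop-in replacement, one possible
pattern is to fall back to `gzip.GzipFile` when `indexed_gzip` is not
installed. The following is a minimal sketch under that assumption; the
`GzipReader` alias, the `read_at` helper and the example file are
hypothetical names introduced here, not part of `indexed_gzip`:


```python
import gzip
import os
import tempfile

# Prefer IndexedGzipFile when it is available, otherwise
# fall back to the (slower to seek) standard library class.
try:
    import indexed_gzip as igzip
    GzipReader = igzip.IndexedGzipFile
except ImportError:
    GzipReader = gzip.GzipFile

def read_at(path, offset, nbytes):
    """Read ``nbytes`` of uncompressed data starting at ``offset``."""
    f = GzipReader(path)
    try:
        f.seek(offset)
        return f.read(nbytes)
    finally:
        f.close()

# Write a small example file, then read from the middle of it
payload = b'0123456789' * 1000
path = os.path.join(tempfile.mkdtemp(), 'example.gz')
with gzip.open(path, 'wb') as out:
    out.write(payload)

chunk = read_at(path, 5005, 10)
```

The calling code is the same whichever class is chosen; only the cost of
each seek differs.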


## Using with `nibabel`


You can use `indexed_gzip` with `nibabel`. `nibabel` >= 2.3.0 will
automatically use `indexed_gzip` if it is present:


```python
import nibabel as nib

image = nib.load('big_image.nii.gz')
```


If you are using `nibabel` 2.2.x, you need to explicitly set the `keep_file_open`
flag:


```python
import nibabel as nib

image = nib.load('big_image.nii.gz', keep_file_open='auto')
```


To use `indexed_gzip` with `nibabel` 2.1.0 or older, you need to do a little
more work:


```python
import nibabel      as nib
import indexed_gzip as igzip

# Here we are using 4MB spacing between
# seek points, and using a larger read
# buffer (than the default size of 16KB).
fobj = igzip.IndexedGzipFile(
    filename='big_image.nii.gz',
    spacing=4194304,
    readbuf_size=131072)

# Create a nibabel image using
# the existing file handle.
fmap = nib.Nifti1Image.make_file_map()
fmap['image'].fileobj = fobj
image = nib.Nifti1Image.from_file_map(fmap)

# Use the image ArrayProxy to access the
# data - the index will automatically be
# built as data is accessed.
vol3 = image.dataobj[:, :, :, 3]
```


## Index import/export


If you have a large file, you may wish to pre-generate the index once, and
save it out to an index file:


```python
import indexed_gzip as igzip

# Load the file, pre-generate the
# index, and save it out to disk.
fobj = igzip.IndexedGzipFile('big_file.gz')
fobj.build_full_index()
fobj.export_index('big_file.gzidx')
```


The next time you open the same file, you can load in the index:


```python
import indexed_gzip as igzip
fobj = igzip.IndexedGzipFile('big_file.gz', index_file='big_file.gzidx')
```


## Write support


`indexed_gzip` does not currently support writing. If you wish to write to a
file, you will need to save it by other means (e.g. via `gzip` or `nibabel`),
and then create a new `IndexedGzipFile` instance.
For example:


```python

import nibabel as nib

# Load the entire image into memory
image = nib.load('big_image.nii.gz')
data = image.get_data()

# Make changes to the data
data[:, :, :, 5] *= 100

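# Save the image using nibabel. nib.save expects
# an image object rather than a bare array, so wrap
# the modified data in a new image first.
newimage = nib.Nifti1Image(data, image.affine, image.header)
nib.save(newimage, 'big_image.nii.gz')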

# Re-load the image
image = nib.load('big_image.nii.gz')
```


## Performance


A small [test script](indexed_gzip/tests/benchmark.py) is included with
`indexed_gzip`; this script compares the performance of the `IndexedGzipFile`
class with the `gzip.GzipFile` class. This script does the following:

  1. Generates a test file.

  2. Generates a specified number of seek locations, uniformly spaced
     throughout the test file.

  3. Randomly shuffles these locations.

  4. Seeks to each location, and reads a chunk of data from the file.
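

The four steps above can be sketched with the standard library alone. This
is not the bundled `benchmark.py` script: the function names, the file size
and the number of seeks are assumptions made for this example, and only
`gzip.GzipFile` is timed:


```python
import gzip
import io
import random
import time

def make_test_file(nbytes):
    """Step 1: generate test data, returned as (raw, compressed)."""
    raw = bytes(i % 251 for i in range(nbytes))
    return raw, gzip.compress(raw)

def seek_locations(nbytes, nseeks, seed=0):
    """Steps 2 and 3: uniformly spaced offsets, randomly shuffled."""
    step = nbytes // nseeks
    locs = [i * step for i in range(nseeks)]
    random.Random(seed).shuffle(locs)
    return locs

def run_benchmark(compressed, locs, chunk=16):
    """Step 4: seek to each location and read a chunk of data.

    Returns the chunks that were read and the elapsed time in seconds.
    """
    chunks = []
    start = time.perf_counter()
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as f:
        for loc in locs:
            f.seek(loc)
            chunks.append(f.read(chunk))
    return chunks, time.perf_counter() - start

raw, compressed = make_test_file(200000)
locs = seek_locations(len(raw), 20)
chunks, elapsed = run_benchmark(compressed, locs)
```

Timing the same loop against an `IndexedGzipFile` opened on the same data
would give the comparison that the script reports.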


This plot shows the results of this test for a few compressed files of varying
sizes, with 500 seeks:


![Indexed gzip performance](./performance.png)


## Acknowledgements


The `indexed_gzip` project is based upon the `zran.c` example (written by Mark
Adler) which ships with the [zlib](http://www.zlib.net/) source code.


`indexed_gzip` was originally inspired by Zalan Rajna's (@zrajna)
[zindex](https://github.com/zrajna/zindex) project:

    Z. Rajna, A. Keskinarkaus, V. Kiviniemi and T. Seppanen
    "Speeding up the file access of large compressed NIfTI neuroimaging data"
    Engineering in Medicine and Biology Society (EMBC), 2015 37th Annual
    International Conference of the IEEE, Milan, 2015, pp. 654-657.

    https://sourceforge.net/projects/libznzwithzindex/


Initial work on `indexed_gzip` took place at
[Brainhack](http://www.brainhack.org/) Paris, at the Institut Pasteur,
24th-26th February 2016, with the support of the
[FMRIB Centre](https://www.ndcn.ox.ac.uk/divisions/fmrib/), at the
University of Oxford, UK.


Many thanks to the following contributors (listed chronologically):

 - Zalan Rajna (@zrajna): bug fixes (#2)
 - Martin Craig (@mcraig-ibme): porting `indexed_gzip` to Windows (#3)
 - Chris Markiewicz (@effigies): Option to drop file handles (#6)
 - Omer Ozarslan (@ozars): Index import/export (#8)


## License


`indexed_gzip` inherits the [zlib](http://www.zlib.net) license, available for
perusal in the [LICENSE](LICENSE) file.