File: introduction.rst

.. currentmodule:: hdf5storage

============
Introduction
============

Getting Started
===============

Most of the functionality that one will use is contained in the main
module ::

    import hdf5storage

Lower level functionality, needed mostly for extending this package to
work with more datatypes, is in its submodules.

The main functions in this module are :py:func:`write` and
:py:func:`read` which write a single Python variable to an HDF5 file or
read the specified contents at one location in an HDF5 file and convert
to Python types.

HDF5 files are structured much like a Unix filesystem, so everything can
be referenced with a POSIX style path, which looks like
``'/pyth/hf'``. Unlike a Windows path, forward slashes (``'/'``) are
used as directory separators instead of back slashes (``'\'``), and the
base of the file system is just ``'/'`` instead of something like
``'C:\'``. In the language of HDF5, what we call directories and files
in filesystems are called groups and datasets.

:py:func:`write` has many options for controlling how the data is
stored, and what metadata is stored, but we can ignore that for now. If
we have a variable named ``foo`` that we want to write to an HDF5 file
named ``data.h5``, we would write it by ::

    hdf5storage.write(foo, path='/foo', filename='data.h5')

And then we can read it back from the file with the :py:func:`read`
function, which returns the read data. Here, we will put the data we
read back into the variable ``bar`` ::

    bar = hdf5storage.read(path='/foo', filename='data.h5')

Writing And Reading Several Python Variables at Once
====================================================

To write and read more than one Python variable, one could use
:py:func:`write` and :py:func:`read` for each variable individually.
This can incur a major performance penalty, especially for large HDF5
files, since each call opens and closes the HDF5 file (sometimes more
than once).

Version ``0.1.10`` added a way to do this without incurring this
performance penalty by adding two new functions: :py:func:`writes` and
:py:func:`reads`.

They can write and read more than one Python variable at once, though
they can still work with a single variable. In fact, :py:func:`write`
and :py:func:`read` are now wrappers around them. :py:func:`savemat`
and :py:func:`loadmat` currently use them for the improved performance.

.. versionadded:: 0.1.10
   
   Ability to write and read more than one Python variable at a time
   without opening and closing the HDF5 file each time.

Main Options Controlling Writing/Reading Data
=============================================

There are many individual options that control how data is written and
read to/from file. These can be set by passing an :py:class:`Options`
object to :py:func:`write` and :py:func:`read` by ::

    options = hdf5storage.Options(...)
    hdf5storage.write(... , options=options)
    hdf5storage.read(... , options=options)

or passing the individual keyword arguments used by the
:py:class:`Options` constructor to :py:func:`write` and
:py:func:`read`. The two methods should not be mixed; if both are
given, the functions give precedence to the :py:class:`Options` object.

.. note::

   Functions in the various submodules only support the
   :py:class:`Options` object method of passing options.

The two main options are :py:attr:`Options.store_python_metadata` and
:py:attr:`Options.matlab_compatible`. A more minor option is
:py:attr:`Options.oned_as`.


.. versionadded:: 0.1.9

   Support for the transparent compression of data has been added. It
   is enabled by default, compressing all python objects resulting in
   HDF5 Datasets larger than 16 KB with the GZIP/Deflate algorithm.


store_python_metadata
---------------------

``bool``

Setting this option causes metadata to be written so that the written
objects can be read back into Python accurately. As HDF5 does not
natively support many Python data types (essentially only Numpy types),
most Python data types have to be converted before being written. If
metadata isn't also written, the data cannot be read back to its
original form and will instead be read back as the Python type most
closely resembling how it is stored, which will be a Numpy type of some
sort.

.. note::

   This option is especially important when we consider that when
   ``matlab_compatible == True``, many additional conversions and
   manipulations will be done to the data that cannot be reversed
   without this metadata.

matlab_compatible
-----------------

``bool``

Setting this option causes HDF5 files to be written in a way that is
compatible with MATLAB v7.3 MAT files. This consists of writing some
file metadata so that MATLAB recognizes the file, adding specific
metadata to every stored object so that MATLAB recognizes them, and
transforming the data into the form that MATLAB expects for certain
types (for example, MATLAB expects everything to be at least a 2D array
and strings to be stored in UTF-16 with no doublets).

.. note::

   There are many individual small options in the :py:class:`Options`
   class that this option sets to specific values. Setting
   ``matlab_compatible`` automatically sets them, while changing their
   values to something else automatically turns ``matlab_compatible``
   off.

action_for_matlab_incompatible
------------------------------

{``'ignore'``, ``'discard'``, ``'error'``}

The action to perform when writing MATLAB compatible files
(``matlab_compatible == True``) and a type being written is not MATLAB
compatible. The actions are to write the data anyway (``'ignore'``),
skip writing the incompatible data (``'discard'``), or raise a
:py:exc:`lowlevel.TypeNotMatlabCompatibleError` exception
(``'error'``). The default is ``'error'``.

oned_as
-------

{``'row'``, ``'column'``}

This option is only relevant when ``matlab_compatible == True``. MATLAB
only supports arrays of 2 or more dimensions, while Numpy also supports
1D arrays. 1D arrays therefore have to be made 2D, becoming either row
vectors or column vectors. This option sets which they become when
imported into MATLAB.
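
Conceptually (this sketch uses plain Numpy rather than hdf5storage
itself), the two choices correspond to the following reshapes:

```python
import numpy as np

v = np.arange(3)           # a 1D array, shape (3,)

row = v.reshape((1, -1))   # what oned_as='row' produces: shape (1, 3)
col = v.reshape((-1, 1))   # what oned_as='column' produces: shape (3, 1)
```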


compress
--------

.. versionadded:: 0.1.9

``bool``

Whether to use compression when writing data. Enabled (``True``) by
default. See :ref:`Compression` for more information.


Convenience Functions for MATLAB MAT Files
==========================================

Two functions are provided for reading and writing to MATLAB MAT files
in a convenient way. They are :py:func:`savemat` and :py:func:`loadmat`,
which are modelled after the SciPy functions of the same name
(:py:func:`scipy.io.savemat` and :py:func:`scipy.io.loadmat`), which
work with non-HDF5 based MAT files. They not only take the same
options, but also dispatch calls automatically to the SciPy versions
when instructed to write to, or read from, a MAT file that is not HDF5
based. SciPy must be installed to take advantage of this
functionality.

:py:func:`savemat` takes a ``dict`` having data (values) and the names
to give each piece of data (keys), and writes them to a MATLAB
compatible MAT file. The `format` keyword sets the MAT file format, with
``'7.3'`` being the HDF5 based format supported by this package and
``'5'`` and ``'4'`` being the non HDF5 based formats supported by
SciPy. If you want the data to be read back accurately into Python, you
should set ``store_python_metadata=True``. Writing a couple of
variables to a file looks like ::

    hdf5storage.savemat('data.mat', {'foo': 2.3, 'bar': (1+2j)},
                        format='7.3', oned_as='column',
                        store_python_metadata=True)

Then, to read variables back, we can either explicitly name the
variables we want ::

    out = hdf5storage.loadmat('data.mat', variable_names=['foo', 'bar'])

or grab all variables by either not giving the `variable_names` option
or setting it to ``None``. ::

    out = hdf5storage.loadmat('data.mat')


Example: Write And Readback Including Different Metadata
========================================================

Making The Data
---------------

Make a ``dict`` containing many different types in it that we want to
store to disk in an HDF5 file. The initialization method depends on
the Python version.

Python 3
^^^^^^^^

The ``dict`` keys must be ``str`` (the unicode string type).

    >>> import numpy as np
    >>> import hdf5storage
    >>> a = {'a': True,
    ...      'b': None,
    ...      'c': 2,
    ...      'd': -3.2,
    ...      'e': (1-2.3j),
    ...      'f': 'hello',
    ...      'g': b'goodbye',
    ...      'h': ['list', 'of', 'stuff', [30, 2.3]],
    ...      'i': np.zeros(shape=(2,), dtype=[('bi', 'uint8')]),
    ...      'j':{'aa': np.bool_(False),
    ...           'bb': np.uint8(4),
    ...           'cc': np.uint32([70, 8]),
    ...           'dd': np.int32([]),
    ...           'ee': np.float32([[3.3], [5.3e3]]),
    ...           'ff': np.complex128([[3.4, 3], [9+2j, 0]]),
    ...           'gg': np.array(['one', 'two', 'three'], dtype='str'),
    ...           'hh': np.bytes_(b'how many?'),
    ...           'ii': np.object_(['text', np.int8([1, -3, 0])])}}

Python 2
^^^^^^^^

The same thing but in Python 2 where the ``dict`` keys must be
``unicode``. The other datatypes are translated from the Python 3
example appropriately. The rest of the examples on this page run
identically in Python 2 and 3, but the outputs are listed as they are
returned in Python 3.

    >>> import numpy as np
    >>> import hdf5storage
    >>> a = {u'a': True,
    ...      u'b': None,
    ...      u'c': 2,
    ...      u'd': -3.2,
    ...      u'e': (1-2.3j),
    ...      u'f': u'hello',
    ...      u'g': 'goodbye',
    ...      u'h': [u'list', u'of', u'stuff', [30, 2.3]],
    ...      u'i': np.zeros(shape=(2,), dtype=[('bi', 'uint8')]),
    ...      u'j':{u'aa': np.bool_(False),
    ...            u'bb': np.uint8(4),
    ...            u'cc': np.uint32([70, 8]),
    ...            u'dd': np.int32([]),
    ...            u'ee': np.float32([[3.3], [5.3e3]]),
    ...            u'ff': np.complex128([[3.4, 3], [9+2j, 0]]),
    ...            u'gg': np.array([u'one', u'two', u'three'], dtype='unicode'),
    ...            u'hh': np.str_('how many?'),
    ...            u'ii': np.object_([u'text', np.int8([1, -3, 0])])}}

Using No Metadata
-----------------

Write it to a file at the ``'/a'`` directory, but include no Python or
MATLAB metadata. Then, read it back and notice that many objects come
back quite different from what was written. Namely, everything was
converted to Numpy types. This even includes the dictionaries, which
were converted to structured ``np.ndarray`` objects. This happens
because all other types (other than ``dict``) must be converted to
these types before being written to the HDF5 file, and without
metadata, the conversion cannot be reversed (and while ``dict`` isn't
converted, it is stored in the same form and thus cannot be read back
reversibly either).

    >>> hdf5storage.write(data=a, path='/a', filename='data.h5',
    ...                   store_python_metadata=False,
    ...                   matlab_compatible=False)
    >>> hdf5storage.read(path='/a', filename='data.h5')
    array([ (True,
             [],
             2,
             -3.2,
             (1-2.3j),
             b'hello',
             b'goodbye',
             [array(b'list', dtype='|S4'),
              array(b'of', dtype='|S2'),
              array(b'stuff', dtype='|S5'),
              array([array(30), array(2.3)], dtype=object)],
             [(0,), (0,)],
             [(False,
               4,
               array([70,  8], dtype=uint32),
               array([], dtype=int32),
               array([[  3.29999995e+00], [  5.30000000e+03]], dtype=float32),
               array([[ 3.4+0.j,  3.0+0.j], [ 9.0+2.j,  0.0+0.j]]),
               array([111, 110, 101,   0,   0, 116, 119, 111,   0,   0, 116, 104, 114,
                      101, 101], dtype=uint32),
               b'how many?',
               array([array(b'text', dtype='|S4'),
                      array([ 1, -3,  0], dtype=int8)],
                     dtype=object))])], 
          dtype=[('a', '?'),
                 ('b', '<f8', (0,)),
                 ('c', '<i8'),
                 ('d', '<f8'),
                 ('e', '<c16'),
                 ('f', 'S5'),
                 ('g', 'S7'), ('h', 'O', (4,)),
                 ('i', [('bi', 'u1')], (2,)),
                 ('j', [('aa', '?'),
                        ('bb', 'u1'),
                        ('cc', '<u4', (2,)),
                        ('dd', '<i4', (0,)),
                        ('ee', '<f4', (2, 1)),
                        ('ff', '<c16', (2, 2)),
                        ('gg', '<u4', (15,)),
                        ('hh', 'S9'),
                        ('ii', 'O', (2,))],
                  (1,))])


Including Python Metadata
-------------------------

Do the same thing, but now include Python metadata
(``store_python_metadata == True``). This time, everything is read back
the same (or at least, it should be) as it was written.

    >>> hdf5storage.write(data=a, path='/a', filename='data_typeinfo.h5',
    ...                   store_python_metadata=True,
    ...                   matlab_compatible=False)
    >>> hdf5storage.read(path='/a', filename='data_typeinfo.h5')
    {'a': True,
     'b': None,
     'c': 2,
     'd': -3.2,
     'e': (1-2.3j),
     'f': 'hello',
     'g': b'goodbye',
     'h': ['list', 'of', 'stuff', [30, 2.3]],
     'i': array([(0,), (0,)], 
          dtype=[('bi', 'u1')]),
     'j': {'aa': False,
      'bb': 4,
      'cc': array([70,  8], dtype=uint32),
      'dd': array([], dtype=int32),
      'ee': array([[  3.29999995e+00],
           [  5.30000000e+03]], dtype=float32),
      'ff': array([[ 3.4+0.j,  3.0+0.j],
           [ 9.0+2.j,  0.0+0.j]]),
      'gg': array(['one', 'two', 'three'], 
          dtype='<U5'),
      'hh': b'how many?',
      'ii': array(['text', array([ 1, -3,  0], dtype=int8)], dtype=object)}}

Including MATLAB Metadata
-------------------------

Do the same thing, but this time including only MATLAB metadata
(``matlab_compatible == True``). This time, the data that is read back
is different from what was written, but in a different way than when no
metadata was used. The biggest differences are that everything was
turned into an at least 2D array, all arrays are transposed, and all
string types got converted to ``numpy.str_``. This happens because
MATLAB can only work with 2D and higher dimensional arrays, uses
Fortran array ordering instead of the C ordering that Python uses, and
stores strings in a subset of UTF-16 (no doublets) in version 7.3 MAT
files.

    >>> hdf5storage.write(data=a, path='/a', filename='data.mat',
    ...                   store_python_metadata=False,
    ...                   matlab_compatible=True)
    >>> hdf5storage.read(path='/a', filename='data.mat')
    array([ ([[True]],
             [[]],
             [[2]],
             [[-3.2]],
             [[(1-2.3j)]],
             [['hello']],
             [['goodbye']],
             [[array([['list']], dtype='<U4'),
               array([['of']], dtype='<U2'),
               array([['stuff']], dtype='<U5'),
               array([[array([[30]]), array([[ 2.3]])]], dtype=object)]],
             [[(array([[0]], dtype=uint8),)],
              [(array([[0]], dtype=uint8),)]],
             [(array([[False]], dtype=bool),
               array([[4]], dtype=uint8),
               array([[70,  8]], dtype=uint32),
               array([], shape=(1, 0), dtype=int32),
               array([[  3.29999995e+00], [  5.30000000e+03]], dtype=float32),
               array([[ 3.4+0.j,  3.0+0.j], [ 9.0+2.j,  0.0+0.j]]),
               array([['one\x00\x00two\x00\x00three']], dtype='<U15'),
               array([['how many?']], dtype='<U9'),
               array([[array([['text']], dtype='<U4'),
                       array([[ 1, -3,  0]], dtype=int8)]], dtype=object))])], 
          dtype=[('a', '?', (1, 1)),
                 ('b', '<f8', (1, 0)),
                 ('c', '<i8', (1, 1)),
                 ('d', '<f8', (1, 1)),
                 ('e', '<c16', (1, 1)),
                 ('f', '<U5', (1, 1)),
                 ('g', '<U7', (1, 1)),
                 ('h', 'O', (1, 4)),
                 ('i', [('bi', 'u1', (1, 1))], (2, 1)),
                 ('j', [('aa', '?', (1, 1)),
                        ('bb', 'u1', (1, 1)),
                        ('cc', '<u4', (1, 2)),
                        ('dd', '<i4', (1, 0)),
                        ('ee', '<f4', (2, 1)),
                        ('ff', '<c16', (2, 2)),
                        ('gg', '<U15', (1, 1)),
                        ('hh', '<U9', (1, 1)),
                        ('ii', 'O', (1, 2))],
                  (1,))])


Including both Python And MATLAB Metadata
-----------------------------------------

Do the same thing, but now include both Python metadata
(``store_python_metadata == True``) and MATLAB metadata
(``matlab_compatible == True``). This time, everything is read back the
same (or at least, it should be) as it was written. The Python metadata
makes the transformations done to store the data in a MATLAB compatible
form reversible.

    >>> hdf5storage.write(data=a, path='/a', filename='data_typeinfo.mat',
    ...                   store_python_metadata=True,
    ...                   matlab_compatible=True)
    >>> hdf5storage.read(path='/a', filename='data_typeinfo.mat')
    {'a': True,
     'b': None,
     'c': 2,
     'd': -3.2,
     'e': (1-2.3j),
     'f': 'hello',
     'g': b'goodbye',
     'h': ['list', 'of', 'stuff', [30, 2.3]],
     'i': array([(0,), (0,)], 
          dtype=[('bi', 'u1')]),
     'j': {'aa': False,
      'bb': 4,
      'cc': array([70,  8], dtype=uint32),
      'dd': array([], dtype=int32),
      'ee': array([[  3.29999995e+00],
           [  5.30000000e+03]], dtype=float32),
      'ff': array([[ 3.4+0.j,  3.0+0.j],
           [ 9.0+2.j,  0.0+0.j]]),
      'gg': array(['one', 'two', 'three'], 
          dtype='<U5'),
      'hh': b'how many?',
      'ii': array(['text', array([ 1, -3,  0], dtype=int8)], dtype=object)}}