File: plot_generic_data_chunk_tutorial.py

"""

.. _genericdci-tutorial:

GenericDataChunkIterator Tutorial
==================================

This is a tutorial for interacting with :py:class:`~hdmf.data_utils.GenericDataChunkIterator` objects. It
is written for beginners and does not describe the full capabilities and nuances
of the functionality. It is designed to give
you basic familiarity with how :py:class:`~hdmf.data_utils.GenericDataChunkIterator` works and to help you get started
with creating a specific instance for your data format or API access pattern.

Introduction
------------
The :py:class:`~hdmf.data_utils.GenericDataChunkIterator` class is a semi-abstract
specialization of :py:class:`~hdmf.data_utils.AbstractDataChunkIterator` that automatically handles the selection
of buffer regions
and resolves communication of compatible chunk regions within an H5DataIO wrapper. It does not,
however, know how data (values) or metadata (data type, full shape) ought to be directly
accessed. This is intentional: the class stays fully agnostic to a range of indexing methods and
format-independent APIs, rather than making strong assumptions about how data ranges are to be sliced.

Constructing a simple child class
---------------------------------
We will begin with a simple example case of data access to a standard Numpy array.
To create a :py:class:`~hdmf.data_utils.GenericDataChunkIterator` that accomplishes this,
we begin by defining our child class.
"""

# sphinx_gallery_thumbnail_path = 'figures/gallery_thumbnail_generic_data_chunk_tutorial.png'
import numpy as np

from hdmf.data_utils import GenericDataChunkIterator


class NumpyArrayDataChunkIterator(GenericDataChunkIterator):
    def __init__(self, array: np.ndarray, **kwargs):
        self.array = array
        super().__init__(**kwargs)

    def _get_data(self, selection):
        # Return the values of the wrapped array for the requested selection (a tuple of slices)
        return self.array[selection]

    def _get_maxshape(self):
        # Return the full shape of the wrapped array
        return self.array.shape

    def _get_dtype(self):
        # Return the data type of the wrapped array
        return self.array.dtype


# To instantiate this class on an array and iterate over it in buffer-sized pieces,
my_array = np.random.randint(low=0, high=10, size=(12, 6), dtype="int16")
my_custom_iterator = NumpyArrayDataChunkIterator(array=my_array)

# and this iterator now behaves as a standard Python generator (i.e., it can only be exhausted once)
# that returns DataChunk objects for each buffer.
for buffer in my_custom_iterator:
    print(buffer.data)
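
###############################################################################
# Because the iterator above is already exhausted, we create a fresh one and inspect the shape and
# dtype information that the parent class resolved from our ``_get_maxshape`` and ``_get_dtype``
# implementations and from its default buffering settings. This is a minimal sketch: ``chunk_shape``
# is used later in this tutorial, while ``maxshape``, ``dtype``, and ``buffer_shape`` are assumed here
# to be exposed under the same names as the corresponding abstract methods and constructor arguments.

my_custom_iterator = NumpyArrayDataChunkIterator(array=my_array)
print(my_custom_iterator.maxshape)      # full shape of the underlying array
print(my_custom_iterator.dtype)         # data type reported by _get_dtype
print(my_custom_iterator.chunk_shape)   # shape of each HDF5-style chunk
print(my_custom_iterator.buffer_shape)  # shape of each buffer returned per iteration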

###############################################################################
# Intended use for advanced data I/O
# ----------------------------------
# Of course, the real use case for this class is when the amount of data stored on
# disk is larger than what can be read into RAM at once. The goal is therefore to read, on each
# iteration, only an amount of data at or below the `buffer_gb` argument (in gigabytes; defaults to 1 GB).

# This design can be seen if we increase the amount of data in our example code
my_array = np.random.randint(low=0, high=10, size=(20000, 5000), dtype="int32")
my_custom_iterator = NumpyArrayDataChunkIterator(array=my_array, buffer_gb=0.2)

for j, buffer in enumerate(my_custom_iterator, start=1):
    print(f"Buffer number {j} returns data from selection: {buffer.selection}")

###############################################################################
# .. note::
#   Technically, in this example the total data is still fully loaded into RAM by the initial NumPy array.
#   A more realistic use case would be to write ``my_array`` to a temporary file on your system
#   and load it back with ``np.memmap``, which returns an array-like object that does not immediately
#   load the data into memory.
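
###############################################################################
# Below is a minimal sketch of that suggestion (the file name here is illustrative): write the array
# to a temporary binary file and reopen it with ``np.memmap`` so that values are only read from disk
# when the iterator requests a selection.

import os

memmap_file = "my_temporary_memmap_file.dat"
my_array.tofile(memmap_file)
my_memmap = np.memmap(memmap_file, dtype=my_array.dtype, mode="r", shape=my_array.shape)
my_lazy_iterator = NumpyArrayDataChunkIterator(array=my_memmap, buffer_gb=0.2)
# ``my_lazy_iterator`` can now be used exactly like the iterator above, but the values are only
# read from disk when each buffer is requested.

# Clean up the memory map and its backing file
del my_lazy_iterator, my_memmap
os.remove(memmap_file)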

###############################################################################
# Writing to an HDF5 file with full control of shape arguments
# ------------------------------------------------------------
# The true intention of returning data selections of this form, wrapped in DataChunk objects,
# is to write them piecewise to an HDF5 dataset.

# This is where the underlying `chunk_shape` becomes important, and why it is critical to performance
# that it evenly subdivides the `buffer_shape`.
import h5py

maxshape = (20000, 5000)
buffer_shape = (10000, 2500)
chunk_shape = (1000, 250)

my_array = np.random.randint(low=0, high=10, size=maxshape, dtype="int16")
my_custom_iterator = NumpyArrayDataChunkIterator(array=my_array, buffer_shape=buffer_shape, chunk_shape=chunk_shape)
out_file = "my_temporary_test_file.hdf5"
with h5py.File(name=out_file, mode="w") as f:
    dset = f.create_dataset(name="test", shape=maxshape, dtype="int16", chunks=my_custom_iterator.chunk_shape)
    for buffer in my_custom_iterator:
        dset[buffer.selection] = buffer.data
# Remember to remove the temporary file after running this and exploring the contents!

###############################################################################
# .. note::
#   Here we explicitly set the `chunks` value in the HDF5 dataset object; however, a nice part of the design of this
#   iterator is that when wrapped in a ``hdmf.backends.hdf5.h5_utils.H5DataIO`` that is called within a
#   ``hdmf.backends.hdf5.h5tools.HDF5IO`` context with a corresponding ``hdmf.container.Container``, these details
#   will be automatically parsed.
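
###############################################################################
# For illustration only, the wrapping itself is a one-liner. This is a sketch: a complete write
# through ``HDF5IO`` additionally requires a type map and a ``Container`` to hold the dataset,
# which is beyond the scope of this tutorial.

from hdmf.backends.hdf5.h5_utils import H5DataIO

wrapped_data = H5DataIO(data=NumpyArrayDataChunkIterator(array=my_array))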

###############################################################################
# .. note::
#   There is some overlap in nomenclature between HDMF and HDF5. The term *chunk* in both
#   HDMF and HDF5 refers to a subset of a dataset; however, in HDF5 a chunk is a piece of the dataset on disk,
#   whereas a :py:class:`~hdmf.data_utils.DataChunk` in HDMF is a block of data in memory.
#   As such, the requirements on the shape and size of chunks are different. In HDF5, chunks are pieces
#   of a dataset that get compressed and cached together, and they should usually be small for
#   optimal performance (typically 1 MB or less). In contrast, a :py:class:`~hdmf.data_utils.DataChunk` in
#   HDMF acts as a block of data for writing to a dataset and typically spans multiple HDF5 chunks to improve
#   performance. To avoid repeated updates to the same chunk in the HDF5 file,
#   :py:class:`~hdmf.data_utils.DataChunk` objects used for writing should align with the chunks in the HDF5
#   file, i.e., the ``DataChunk.selection`` should fully cover one or more HDF5 chunks. This is what the
#   `buffer` of the :py:class:`~hdmf.data_utils.GenericDataChunkIterator` does: on each iteration it returns a
#   single :py:class:`~hdmf.data_utils.DataChunk` object (by default up to 1 GB) that perfectly spans many HDF5
#   chunks (by default < 1 MB each) to reduce the number of small I/O operations and improve performance.
#   In practice, the `buffer` should often be even larger than the default, i.e., as much free RAM as can
#   safely be used.
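
###############################################################################
# In our example above, this alignment can be verified directly: every axis of ``buffer_shape`` is an
# integer multiple of the corresponding axis of ``chunk_shape``, so each buffer covers whole HDF5 chunks
# and no chunk is written to more than once.

assert all(
    buffer_axis % chunk_axis == 0
    for buffer_axis, chunk_axis in zip(buffer_shape, chunk_shape)
)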

###############################################################################
# Remove the test file
import os
if os.path.exists(out_file):
    os.remove(out_file)