|
.. toctree::
:maxdepth: 2
Storing data
============
When storing data, we have two main requirements:

1. We want to be able to store **n-dimensional** data structures.
2. The data structures must be **self-explanatory**, that is, they must
   contain sufficient information to draw a basic plot of the data.
|sampled_plot|
.. _sampled_plot:
Considering the simple plot above, we can list all the information that
it shows and, by extension, that needs to be stored in order to
reproduce it:
- the data (voltage measurements)
- the y-axis labeling, i.e. label (voltage) and unit (mV)
- the x-axis labeling, i.e. label (time) and unit (s)
- the x-position for each data point
- a title/legend
In this, as in most cases, it would be inefficient to store the x- and
y-position of each plotted point. The voltage measurements were taken at
regular (time) intervals. We therefore only need to store the measured
values and a definition of the x-axis consisting of an offset, the
sampling interval, a label, and a unit.
This is exactly the approach chosen in *NIX*. For each dimension of the
data a **dimension descriptor** must be given. In *NIX* we define three
dimension descriptors; each is attached to a *DataArray* as sketched
after the list:
1. *SampledDimension*: Used if a dimension is sampled at regular
intervals.
2. *RangeDimension*: Used if a dimension is sampled at irregular
   intervals. The instances at which the data has been sampled are
   stored as *ticks*. These ticks may also be stored in a *DataArray*
   or *DataFrame* to which the *RangeDimension* can link.
3. *SetDimension*: Used for dimensions that represent categories rather
than physical quantities.
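The following is a minimal sketch of how each descriptor is attached;
``regular``, ``irregular`` and ``categories`` are hypothetical
*DataArrays* and ``ticks`` is a placeholder list of sampling times, none
of which come from the example files.

.. code-block:: python

    # pick the descriptor that matches how the axis was sampled
    regular.append_sampled_dimension(0.001)   # SampledDimension: fixed 1 ms interval
    irregular.append_range_dimension(ticks)   # RangeDimension: explicit, ascending ticks
    categories.append_set_dimension()         # SetDimension: categorical axis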
The DataArray
-------------
The *DataArray* is the most central entity of the *NIX* data model. Like
almost all other *NIX* entities, it requires a *name* and a *type*.
Neither is restricted, but names must be **unique** within a *Block*.
The *type* information can be used to introduce semantic meaning and
domain-specificity. Upon creation, a unique ID is assigned to the
*DataArray*.
The *DataArray* stores the actual data together with label and unit. In
addition, the *DataArray* needs a dimension descriptor for each data
dimension. The following snippet shows how to create a *DataArray* and
store data in it.
.. literalinclude:: examples/regularlySampledData.py
:caption: Storing data in a DataArray (:download:`example code <examples/regularlySampledData.py>`).
:language: python
:lines: 57 - 61
As promised, the *DataArray* contains all the information needed to create a basic plot (see the figure above).
.. literalinclude:: examples/regularlySampledData.py
:language: python
:lines: 29 - 40
:emphasize-lines: 2, 3, 7-8
The highlighted lines emphasize how information from the dimension descriptor (a *SampledDimension*) and the *DataArray* itself is used for labeling the plot.
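A minimal sketch of the same plotting idea, assuming ``data_array`` is
the *DataArray* created above and its first dimension is a
*SampledDimension*; matplotlib is used purely for illustration:

.. code-block:: python

    import matplotlib.pyplot as plt

    dim = data_array.dimensions[0]      # the SampledDimension descriptor
    x = dim.axis(data_array.shape[0])   # x-positions computed from offset and interval
    y = data_array[:]                   # the stored values

    plt.plot(x, y)
    plt.xlabel("%s [%s]" % (dim.label, dim.unit))                 # e.g. "time [s]"
    plt.ylabel("%s [%s]" % (data_array.label, data_array.unit))   # e.g. "voltage [mV]"
    plt.show()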
In the storing example above, the *NIX* library figures out the
dimensionality, shape, and data type of the passed data. The data type
and the dimensionality (i.e. the number of dimensions) are fixed once
the *DataArray* has been created. The actual size of the *DataArray*,
however, can be changed during the lifetime of the entity.
In case you need more control, *DataArrays* can be created empty and
filled later, e.g. during data acquisition.
.. code-block:: python
data_array = block.create_data_array("sinewave", "nix.sampled", dtype=nixio.DataType.Double, shape=(100));
The resulting *DataArray* has an initial size (100 elements) and will be
automatically resized if required. The data type is set to
``nixio.DataType.Double``. The *NIX* library will also try to convert any
passed data to the defined data type, if possible. **Note:** The data
type and the rank (i.e. the number of dimensions) cannot be altered after
the *DataArray* has been created.
Data is then set by calling
.. code-block:: python
array.write_direct(voltage)
Data is read from a DataArray by accessing it in numpy style.
.. code-block:: python
# read the full data
data = array[:]
# reading only the first 100 points
data = array[:100]
# replacing data
array[:100] = np.zeros(100)
*DataArrays* can also be extended by appending data:
.. code-block:: python
array = block.create_data_array("test data", "test", data=np.random.randn(100))
print(array.shape)
array.append(np.ones(100))
print(array.shape)
Dimensions
----------
Within the *DataArray* we can store n-dimensional data. For each
dimension we must provide a *dimension descriptor*. The following
sections introduce the individual descriptors.
SampledDimension
****************
.. figure:: ./images/regular_sampled.png
:alt: 1-D regular sampled data
Here we have the same situation as before: the data has been sampled at
regular intervals, that is, the time between successive data points is
always the same. The x-axis can be fully described with just a few
parameters:
1. sampling interval
2. offset
3. label
4. unit
The *SampledDimension* is used in such situations and needs to be added
to the *DataArray* upon creation:
.. literalinclude:: examples/regularlySampledData.py
:language: python
:lines: 57 - 60
:emphasize-lines: 4
    :caption: A dimension in which the data has been gathered at regular intervals is described using a *SampledDimension* (:download:`example code <examples/regularlySampledData.py>`).
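A compact sketch covering all four parameters listed above (sampling
interval, offset, label, unit); the interval of 0.001 s and the name
``data_array`` are assumptions made for illustration:

.. code-block:: python

    dim = data_array.append_sampled_dimension(0.001)  # sampling interval
    dim.offset = 0.0      # x-position of the first sample (optional)
    dim.label = "time"
    dim.unit = "s"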
SetDimension
************
.. figure:: ./images/set_dimension.png
:alt: simple plot with categories
If we need to store data that falls into categories, i.e. the
x-positions are not numeric or the dimension does not have a natural
order, a *SetDimension* is used. It stores a label for each entry along
the described dimension.
.. literalinclude:: examples/categoryData.py
:language: python
:lines: 16 - 17
:caption: A categorical data dimension is described with a *SetDimension* which (optionally) stores labels for each category (:download:`example code <examples/categoryData.py>`)
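A short sketch, assuming a *DataArray* ``data_array`` that holds one
value per category; the labels are made up for illustration:

.. code-block:: python

    dim = data_array.append_set_dimension()
    dim.labels = ["pre", "stimulus", "post"]  # one label per entry along this dimension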
RangeDimension
**************
.. figure:: ./images/irregular_sampled.png
:alt: 1-D irregularly sampled data
A signal similar to the one shown before is recorded, but this time the
temporal distance between the measurements is not regular. Storing this
kind of data is less efficient than in the regularly sampled case, since
we need to store the time of each measurement as well as the measured
value. The following information needs to be stored to describe the
dimension:
1. x-positions of the data points, i.e. *ticks*
2. label
3. unit
In this kind of dimension we store a *range* of ticks, therefore the
name *RangeDimension*. It needs to be added to the *DataArray* when it
is created.
.. literalinclude:: examples/irregularlySampledData.py
:caption: Storing irregularly sampled data (:download:`example code <examples/irregularlySampledData.py>`).
:language: python
:lines: 45 - 58
:emphasize-lines: 14
**Note:** The *ticks* of a *RangeDimension* must be numeric and ascending.
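A short sketch, assuming ``data_array`` holds one value per measurement
and ``times`` is the ascending list of measurement times (both names are
placeholders):

.. code-block:: python

    dim = data_array.append_range_dimension(times)  # ticks: numeric and ascending
    dim.label = "time"
    dim.unit = "s"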
The *RangeDimension* can do more. Consider the case in which the times of events are stored:
.. figure:: ./images/alias_range.png
:alt: 1-D event data
For example, these might be the times of action potentials (aka spikes)
recorded from a nerve. In such a case it is basically the x-values
themselves that are of interest, and it would be inefficient to store
them twice, first as values in the *DataArray* and then again as ticks
in the dimension descriptor. In such cases the *RangeDimension* can be
set up to link to the *DataArray* itself.
.. literalinclude:: examples/rangeDimensionLink.py
:lines: 12 - 17
:language: python
    :caption: A *RangeDimension* can link to a *DataArray* as a source of the ticks (:download:`example code <examples/rangeDimensionLink.py>`).
:emphasize-lines: 6
In the highlighted line we use a convenience function to establish the
link between the dimension descriptor and the data stored in the
*DataArray* itself. This function is a shortcut for:
.. code-block:: python
rdim = data_array.append_range_dimension()
rdim.create_link(data_array, index=[-1])
In the same way one can set up the *RangeDimension* to use a
*DataFrame*. The ``index`` defines which slice of the linked *DataArray*
or *DataFrame* holds the ticks. The slice must be a 1-D vector within
the linked data. If, for example, the *DataArray* is 2-D and the ticks
are in the 3rd column, the index would be ``[-1, 3]``. The ``-1`` marks
the dimension along which to look for the ticks. When linking
*DataFrames*, the index denotes the column.
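A sketch of the 2-D case described above, reusing the ``create_link``
call from the shortcut; ``data_array`` and ``ticks_array`` are
hypothetical entities, with the ticks assumed to sit in the third column
of ``ticks_array``:

.. code-block:: python

    rdim = data_array.append_range_dimension()
    # -1: search for the ticks along this dimension, 3: the column that holds them
    rdim.create_link(ticks_array, index=[-1, 3])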
Advanced storing
----------------
Data compression
****************
By default, data is stored uncompressed. Data compression can be enabled
by providing the ``nixio.Compression.DeflateNormal`` flag when opening
the file:
.. code-block:: python
import nixio
f = nixio.File.open("test.nix", nixio.FileMode.Overwrite, compression=nixio.Compression.DeflateNormal)
With this, all data will be stored compressed unless explicitly stated
otherwise. At any time you can select or deselect compression by
providing a ``nixio.Compression`` flag during *DataArray* creation.
Available flags are:
- ``nixio.Compression.Auto``: compression as defined during
file-opening.
- ``nixio.Compression.DeflateNormal``: use compression (fixed level).
- ``nixio.Compression.No``: no compression.
.. code-block:: python
data_array = b.create_data_array("some data", "nix.sampled", data, compression=nixio.Compression.DeflateNormal);
Note the following:

1. Compression comes at a small cost in read and write performance.
2. The compression setting is fixed once the *DataArray* has been
   created; it cannot be changed afterwards.
3. Opening and extending a compressed *DataArray* is possible even if
   the file has not been opened with the
   ``nixio.Compression.DeflateNormal`` flag (see the sketch below).
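A minimal sketch of the third point, assuming the file and the
compressed *DataArray* from the example above already exist on disk:

.. code-block:: python

    import numpy as np
    import nixio

    f = nixio.File.open("test.nix", nixio.FileMode.ReadWrite)  # no compression flag needed
    array = f.blocks[0].data_arrays["some data"]
    array.append(np.zeros(10))   # extending the compressed DataArray works as usual
    f.close()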
Supported DataTypes
*******************
*DataArrays* can store a multitude of different data types. The
supported data types are defined in the ``nixio.DataType`` enumeration:
- ``nixio.DataType.Bool``: 1 bit boolean value.
- ``nixio.DataType.Char``: 8 bit character.
- ``nixio.DataType.Float``: floating point number.
- ``nixio.DataType.Double``: double precision floating point number.
- ``nixio.DataType.Int8``: 8 bit integer, signed.
- ``nixio.DataType.Int16``: 16 bit integer, signed.
- ``nixio.DataType.Int32``: 32 bit integer, signed.
- ``nixio.DataType.Int64``: 64 bit integer, signed.
- ``nixio.DataType.UInt8``: 8 bit unsigned int.
- ``nixio.DataType.UInt16``: 16 bit unsigned int.
- ``nixio.DataType.UInt32``: 32 bit unsigned int.
- ``nixio.DataType.UInt64``: 64 bit unsigned int.
- ``nixio.DataType.String``: string value.
- ``nixio.DataType.Opaque``: data type for binary data.
The data type of a *DataArray* must be specified at creation time and
cannot be changed. In many cases, the *NIX* library will try to handle
data types transparently and cast data to the data type specified for
the *DataArray* in which it is supposed to be stored.
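A short sketch of fixing the data type explicitly at creation time; the
names and values are made up for illustration:

.. code-block:: python

    import numpy as np

    # the data type is fixed here and cannot be changed later
    counts = block.create_data_array("spike counts", "nix.sampled",
                                     dtype=nixio.DataType.Int16, shape=(5,))
    # compatible data is cast to the array's data type on write
    counts.write_direct(np.array([0, 2, 5, 3, 1]))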
Extending datasets on the fly
*****************************
The dimensionality (aka rank) and the stored *DataType* of a *DataArray*
are fixed. The actual size of the stored dataset, however, can be
changed. This is often needed when data is acquired continuously, e.g.
while recording during an experiment. In nixpy the resizing is handled
transparently.
The workflow would be:
1. Preparations: open a nix file in ``nixio.FileMode.ReadWrite`` or
   ``nixio.FileMode.Overwrite`` mode and create or open the *DataArray*.
2. Acquire more data.
3. Append the acquired data to the data array.
4. Repeat steps 2 and 3 for as long as data is being acquired.
The following code shows how this works.
.. literalinclude:: examples/continuousRecording.py
    :caption: Extending the *DataArray* size, e.g. while recording continuously (:download:`example code <examples/continuousRecording.py>`)
:lines: 49-65
:language: python
**Note!** Selecting the initial shape defines the chunk size used to
write the data to file. Choose it appropriately for the expected size
increment. Selecting a size that is too small can severely affect
efficiency.
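A minimal sketch of this workflow; ``acquire_chunk`` is a made-up
stand-in for a real acquisition call and the chunk size of 1000 samples
is an arbitrary choice:

.. code-block:: python

    import numpy as np

    def acquire_chunk(n=1000):
        # stand-in for a real data acquisition call
        return np.random.randn(n)

    # the first chunk fixes data type and rank; its size also determines
    # the chunk size used for writing (see the note above)
    array = block.create_data_array("continuous recording", "nix.sampled",
                                    data=acquire_chunk())
    dim = array.append_sampled_dimension(0.001)
    dim.label = "time"
    dim.unit = "s"

    for _ in range(10):                # e.g. ten further acquisition cycles
        array.append(acquire_chunk())  # grow the DataArray on the fly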
.. figure:: ./images/multiple_channels.png
:alt: recording multiple channels
.. |sampled_plot| image:: ./images/regular_sampled.png
The DataFrame
-------------
The DataFrame stores tabular data in a rows-and-columns format. Each column has a name, a data type, and, optionally, a unit and a definition.
.. literalinclude:: examples/tabulardata.py
:lines: 8 - 25
    :caption: A DataFrame is created by providing the column definition, which consists of a name and the data type that should be stored in each column. Units can be added later on. It is best to use OrderedDicts to make sure that the column order is as defined (plain dictionaries did not guarantee item order before Python 3.7). :download:`example code <examples/tabulardata.py>`
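A minimal sketch of the same pattern, assuming ``block`` already exists,
that ``Block.create_data_frame`` accepts a column dictionary via
``col_dict``, and that the ``units`` property is writable (as the
caption above suggests); names and values are made up:

.. code-block:: python

    from collections import OrderedDict

    # column definition: column name -> data type (OrderedDict keeps the order stable)
    columns = OrderedDict([("name", str), ("time", float), ("amplitude", float)])
    values = [("alpha", 20.18, 5.0), ("beta", 20.09, 5.5), ("gamma", 20.05, 5.1)]

    df = block.create_data_frame("measurements", "nix.table",
                                 col_dict=columns, data=values)
    df.units = [None, "s", "mV"]   # assumption: units can be set after creation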
For a quick overview one can use the ``DataFrame.print_table()`` method:
.. code-block:: text
column: name id time amplitude frequency
unit: s mV Hz
[0]: alpha e28e0735-1 20.18 5.0 100.0
[1]: beta 74942839-1 20.09 5.5 101.0
[2]: gamma 0886ef1a-2 20.05 5.1 100.0
[3]: delta 0ac53fd2-7 20.15 5.3 150.0
[4]: epsilon a54b35df-1 20.23 5.7 200.0
[5]: fi 70eb952c-0 20.07 5.2 300.0
[6]: zeta 130ad527-d 20.12 5.1 39.0
[7]: eta 042a460c-e 20.27 5.1 600.0
[8]: theta 1f2c1df8-f 20.15 5.6 400.0
[9]: iota c4598a1e-b 20.08 5.1 200.0
The DataFrame offers several methods to get basic information about the table:
.. literalinclude:: examples/tabulardata.py
:lines: 30-33
This leads to the following output:
.. code-block:: text
size, aka number of rows: 10
column names: ('name', 'id', 'time', 'amplitude', 'frequency')
column definition: [('name', dtype('O'), None), ('id', dtype('O'), None), ('time', dtype('<f8'), 's'), ('amplitude', dtype('<f8'), 'mV'), ('frequency', dtype('<f8'), 'Hz')]
column units: [None None 's' 'mV' 'Hz']
Table data can be read as individual cells, entire columns, or selected rows:
.. literalinclude:: examples/tabulardata.py
:lines: 35-41
This yields the following output:
.. code-block:: text
single cell by position: alpha
single cell by name and row index: alpha
Entire ID column: ['f0299d50-b' 'c74d3b7c-0' 'e27bb6f7-2' '8bac251b-7' 'aecc18d4-a'
'268c8dea-f' '88de352a-d' '4bdac82c-b' '5671ed6e-b' '0a7bf5e4-9']
Two columns, id and name, joined: [('f0299d50-b', 'alpha') ('c74d3b7c-0', 'beta') ('e27bb6f7-2', 'gamma')
('8bac251b-7', 'delta') ('aecc18d4-a', 'epsilon') ('268c8dea-f', 'fi')
('88de352a-d', 'zeta') ('4bdac82c-b', 'eta') ('5671ed6e-b', 'theta')
('0a7bf5e4-9', 'iota')]
Entire rows with indices 0 and 2: [('alpha', 'f0299d50-b', 20.18, 5. , 100.)
('gamma', 'e27bb6f7-2', 20.05, 5.1, 100.)]
|