# A Python 3 API for ODB2 data

ODB2 is a format for encoding tabular observation data. It is formed of a sequence of Frames. Each Frame consists of a header followed by a compressed binary format for encoded data. An ODB2 file may contain any number of frames which may or may not be *compatible* with each other (that is, having the same columns and data types).

The ODB-API package was the supported package for encoding and decoding ODB2 data at ECMWF. This package will shortly be replaced by the newly developed ODC package, and will soon be deprecated.

The ODB-API package contained two different Python APIs, which did slightly different things:

  * One thinly wrapped the C/Fortran API in ODB-API to present a row-based view of ODB2 data.
  * One presented an sqlite-like interface to interrogate the contents of ODB2 files.

Neither of these APIs was especially satisfactory. They were slow, buggy, and inconsistent, and they were poorly compatible with existing Python tooling for handling data sets.

Both of these APIs have been deprecated, and in the ODC package they have been removed. We present here a first attempt at a new, much simplified, Python API for handling ODB2 data. This has a number of properties:

  * It is a pure Python implementation: the `pyodc` module.
  * In the same manner as `pickle`/`cPickle`, there is also a `codc` module. This module comes with *much* better performance.
  * It is designed to interact with the `pandas` package. Data is encoded to, and decoded from pandas `DataFrame` objects.
  * It is a thin decoder and encoder. We have no functionality for SQL-like queries.
  * There is an API for exploring the structure and contents of an ODB2 file without decoding it.

## Preconfiguration of notebook

To use this notebook you only need a cloned copy of the repo that contains it, and to run the following cell.

If you wish to use the `codc` module, you need to start `ipython notebook` with `libodccore.so` in the `LD_PRELOAD` path, and swap the import statements over in the next cell.

In [1]:
import pandas as pd
import datetime
from itertools import islice, cycle
import sys
import os
sys.path.insert(0, os.path.abspath(''))
import pyodc as odc
#import codc as odc

## Exploring ODB2 data

We anticipate that almost all of the use of the pyodc module will be for encoding and decoding ODB2 data. The module also contains a number of classes that facilitate the exploration of the ODB2 data. In particular the `Reader` class provides the access point to further details.

We are not going to elaborate on the use of that class in this document, unless there turns out to be sufficient demand for it. However, we make use of elements of it to demonstrate the functionality.

## Simple encoding and decoding ODB2 data

The encoding API is designed to be as straightforward as possible. Given a pandas `DataFrame`, a single call encodes the data. We provide here a very simple, cut-down example.

In [2]:
df = pd.DataFrame({
    'expver': ['0001'] * 10,
    'date@hdr': [int(datetime.datetime.now().strftime("%Y%m%d"))] * 10,
    'statid@hdr': ['stat{:02d}'.format(x) for x in range(10)],
    'wigos@hdr': ['0-12345-0-678{:02d}'.format(x+90) for x in range(10)],
    'obsvalue@body': [12.3456 * x for x in range(10)],
    'integer_missing': list(islice(cycle([1234, 4321, None]), 10)),
    'double_missing': list(islice(cycle([12.34, 43.21, None]), 10)),
})

print(df)

  expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0   0001  20200602     stat00  0-12345-0-67890         0.0000   
1   0001  20200602     stat01  0-12345-0-67891        12.3456   
2   0001  20200602     stat02  0-12345-0-67892        24.6912   
3   0001  20200602     stat03  0-12345-0-67893        37.0368   
4   0001  20200602     stat04  0-12345-0-67894        49.3824   
5   0001  20200602     stat05  0-12345-0-67895        61.7280   
6   0001  20200602     stat06  0-12345-0-67896        74.0736   
7   0001  20200602     stat07  0-12345-0-67897        86.4192   
8   0001  20200602     stat08  0-12345-0-67898        98.7648   
9   0001  20200602     stat09  0-12345-0-67899       111.1104   

   integer_missing  double_missing  
0           1234.0           12.34  
1           4321.0           43.21  
2              NaN             NaN  
3           1234.0           12.34  
4           4321.0           43.21  
5              NaN             NaN  
6           1234.0       

In [3]:
odc.encode_odb(df, 'example-file1.odb')

We can now see that this data has been correctly encoded by decoding it again directly.

In [4]:
df_decoded = odc.read_odb('example-file1.odb', single=True)
print(df_decoded)

  expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0   0001  20200602     stat00  0-12345-0-67890         0.0000   
1   0001  20200602     stat01  0-12345-0-67891        12.3456   
2   0001  20200602     stat02  0-12345-0-67892        24.6912   
3   0001  20200602     stat03  0-12345-0-67893        37.0368   
4   0001  20200602     stat04  0-12345-0-67894        49.3824   
5   0001  20200602     stat05  0-12345-0-67895        61.7280   
6   0001  20200602     stat06  0-12345-0-67896        74.0736   
7   0001  20200602     stat07  0-12345-0-67897        86.4192   
8   0001  20200602     stat08  0-12345-0-67898        98.7648   
9   0001  20200602     stat09  0-12345-0-67899       111.1104   

   integer_missing  double_missing  
0           1234.0           12.34  
1           4321.0           43.21  
2              NaN             NaN  
3           1234.0           12.34  
4           4321.0           43.21  
5              NaN             NaN  
6           1234.0       

Both the encoding and decoding of ODB2 data work on file-like objects as well as on file names. File-like objects have the advantage that you can encode multiple frames of data into the same file sequentially.

In this case, we create an ODB file with frames of two different structures to demonstrate what can be done as a result.

In [5]:
df2 = pd.DataFrame({
    'expver': ['0002'] * 10,
    'date@hdr': [int(datetime.datetime.now().strftime("%Y%m%d"))] * 10,
    'statid@hdr': ['stat{:02d}'.format(20-x) for x in range(10)],
    'obsvalue@body': [12.3456 * x for x in range(10)],
})

with open('example-file2.odb', 'wb') as f:
    odc.encode_odb(df, f)
    odc.encode_odb(df2, f)

The trivial decoder will now result in a `DataFrame` with a substantial number of missing values. In a later section we will see how to extract these different sections separately.

In [6]:
with open('example-file2.odb', 'rb') as f:
    df_decoded = odc.read_odb(f, single=True)
print(df_decoded)

   expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0    0001  20200602     stat00  0-12345-0-67890         0.0000   
1    0001  20200602     stat01  0-12345-0-67891        12.3456   
2    0001  20200602     stat02  0-12345-0-67892        24.6912   
3    0001  20200602     stat03  0-12345-0-67893        37.0368   
4    0001  20200602     stat04  0-12345-0-67894        49.3824   
5    0001  20200602     stat05  0-12345-0-67895        61.7280   
6    0001  20200602     stat06  0-12345-0-67896        74.0736   
7    0001  20200602     stat07  0-12345-0-67897        86.4192   
8    0001  20200602     stat08  0-12345-0-67898        98.7648   
9    0001  20200602     stat09  0-12345-0-67899       111.1104   
10   0002  20200602     stat20             None         0.0000   
11   0002  20200602     stat19             None        12.3456   
12   0002  20200602     stat18             None        24.6912   
13   0002  20200602     stat17             None        37.0368   
14   0002 
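
The missing values in the combined result above come from merging frames whose columns differ. As a rough illustration only, the following sketch models that merge with plain Python dictionaries. The `combine` helper is hypothetical and is not part of `pyodc`; it simply pads any column that a frame does not supply with `None`.

```python
# Plain-Python model of combining two frames with different columns.
# The values mirror the df and df2 examples; None marks a missing value.
frame1 = [{"expver": "0001", "statid@hdr": "stat00", "wigos@hdr": "0-12345-0-67890"}]
frame2 = [{"expver": "0002", "statid@hdr": "stat20"}]


def combine(*frames):
    """Concatenate rows, padding columns absent from a frame with None."""
    columns = []
    for frame in frames:
        for row in frame:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Build every row against the union of all column names.
    return [{name: row.get(name) for name in columns}
            for frame in frames for row in frame]


combined = combine(frame1, frame2)
for row in combined:
    print(row)
```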

## Configuring the details of encoding

For most operational ODB2 data, the encoding used is lossy. In particular, we encode most values as 4-byte REAL values rather than 8-byte DOUBLES.
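
To get a feel for how much precision is lost when moving from 8 bytes to 4, the sketch below round-trips a value through a 4-byte float using only the standard library. This models the precision loss alone; it is not how the encoder itself stores data, and `to_real` is a hypothetical helper name.

```python
import struct


def to_real(value):
    """Round-trip a Python float (8-byte double) through a 4-byte float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


original = 12.3456 * 4
as_real = to_real(original)

print(f"double: {original:.6f}")
print(f"real:   {as_real:.6f}")
print("difference:", abs(original - as_real))
```

Values that are exactly representable in 4 bytes, such as `0.5`, survive the round trip unchanged. Most decimal fractions do not.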

Typically the encoder will automatically select a datatype and corresponding encoder to use. This datatype can be overridden by supplying a types dictionary.

In [7]:
odc.encode_odb(df, 'example-file3.odb', types={'obsvalue@body': odc.REAL})

On interrogating the frame headers, we can see that the data type has changed for the newly encoded file.

In [8]:
r1 = odc.Reader('example-file1.odb', aggregated=False)
r3 = odc.Reader('example-file3.odb', aggregated=False)

In [9]:
print("original:", r1.frames[0].column_dict['obsvalue@body'].dtype)
print("updated: ", r3.frames[0].column_dict['obsvalue@body'].dtype)

original: DataType.DOUBLE
updated:  DataType.REAL


When we decode the data, we can see that its precision has been reduced accordingly.

In [10]:
df_decoded = odc.read_odb('example-file3.odb', single=True)
print(df_decoded)

  expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0   0001  20200602     stat00  0-12345-0-67890       0.000000   
1   0001  20200602     stat01  0-12345-0-67891      12.345600   
2   0001  20200602     stat02  0-12345-0-67892      24.691200   
3   0001  20200602     stat03  0-12345-0-67893      37.036800   
4   0001  20200602     stat04  0-12345-0-67894      49.382401   
5   0001  20200602     stat05  0-12345-0-67895      61.728001   
6   0001  20200602     stat06  0-12345-0-67896      74.073601   
7   0001  20200602     stat07  0-12345-0-67897      86.419197   
8   0001  20200602     stat08  0-12345-0-67898      98.764801   
9   0001  20200602     stat09  0-12345-0-67899     111.110397   

   integer_missing  double_missing  
0           1234.0           12.34  
1           4321.0           43.21  
2              NaN             NaN  
3           1234.0           12.34  
4           4321.0           43.21  
5              NaN             NaN  
6           1234.0       

## Configuring the frame structure

ODB2 data is broken down into frames. By default a maximum of 10000 rows of data will be encoded into each frame. If more than 10000 rows are supplied then the data will be split into a sequence of frames, each with at most 10000 rows.

This threshold can be modified with the `rows_per_frame` parameter.

In [11]:
odc.encode_odb(df, 'example-file4.odb', rows_per_frame=3)

Examining the structure of this file clearly shows that it now contains multiple frames.

In [12]:
r4 = odc.Reader('example-file4.odb', aggregated=False)

print('original frames:', r1.frames)
print('updated  frames:', r4.frames)

print('original row counts:', [f.nrows for f in r1.frames])
print('updated  row counts:', [f.nrows for f in r4.frames])

original frames: [<pyodc.frame.Frame object at 0x7f447a047d60>]
updated  frames: [<pyodc.frame.Frame object at 0x7f4479fcba30>, <pyodc.frame.Frame object at 0x7f4479fcbd90>, <pyodc.frame.Frame object at 0x7f4479fcba60>, <pyodc.frame.Frame object at 0x7f447a047790>]
original row counts: [10]
updated  row counts: [3, 3, 3, 1]
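
The row counts reported above are what simple in-order chunking would give. The sketch below reproduces them with a hypothetical helper, `frame_row_counts`, which is not part of the library. It assumes rows are split in order, and that every frame except possibly the last is filled to the limit.

```python
def frame_row_counts(nrows, rows_per_frame=10000):
    """Return the number of rows in each frame when nrows are split in order."""
    if rows_per_frame < 1:
        raise ValueError("rows_per_frame must be positive")
    return [min(rows_per_frame, nrows - start)
            for start in range(0, nrows, rows_per_frame)]


print(frame_row_counts(10))                    # one frame holds everything
print(frame_row_counts(10, rows_per_frame=3))  # three full frames and a remainder
print(frame_row_counts(25000))                 # default limit of 10000 rows
```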


Despite these differences in frame structure, the decoded data is the same.

In [13]:
df_decoded = odc.read_odb('example-file4.odb', single=True)
print(df_decoded)

  expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0   0001  20200602     stat00  0-12345-0-67890         0.0000   
1   0001  20200602     stat01  0-12345-0-67891        12.3456   
2   0001  20200602     stat02  0-12345-0-67892        24.6912   
3   0001  20200602     stat03  0-12345-0-67893        37.0368   
4   0001  20200602     stat04  0-12345-0-67894        49.3824   
5   0001  20200602     stat05  0-12345-0-67895        61.7280   
6   0001  20200602     stat06  0-12345-0-67896        74.0736   
7   0001  20200602     stat07  0-12345-0-67897        86.4192   
8   0001  20200602     stat08  0-12345-0-67898        98.7648   
9   0001  20200602     stat09  0-12345-0-67899       111.1104   

   integer_missing  double_missing  
0           1234.0           12.34  
1           4321.0           43.21  
2              NaN             NaN  
3           1234.0           12.34  
4           4321.0           43.21  
5              NaN             NaN  
6           1234.0       

## Decoding a subset of the data

Especially for large ODB2 files it can be very valuable to not decode all of the data. The decode functions accept a list or tuple specifying the `columns` to decode.

This is especially helpful when the structure of ODB2 frames in a file is not constant, but all of the frames supply the data that is desired.

In [14]:
df_decoded = odc.read_odb("example-file2.odb", single=True, columns=('statid@hdr', 'obsvalue@body'))
print(df_decoded)

   statid@hdr  obsvalue@body
0      stat00         0.0000
1      stat01        12.3456
2      stat02        24.6912
3      stat03        37.0368
4      stat04        49.3824
5      stat05        61.7280
6      stat06        74.0736
7      stat07        86.4192
8      stat08        98.7648
9      stat09       111.1104
10     stat20         0.0000
11     stat19        12.3456
12     stat18        24.6912
13     stat17        37.0368
14     stat16        49.3824
15     stat15        61.7280
16     stat14        74.0736
17     stat13        86.4192
18     stat12        98.7648
19     stat11       111.1104
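
The selection above works because both frame structures in the file supply the requested columns. As a minimal sketch, the columns shared by every frame can be found with a set intersection. The `common_columns` helper is hypothetical, and the column lists are copied by hand from the `df` and `df2` examples rather than read from the file.

```python
def common_columns(frames_columns):
    """Return the columns present in every frame, in first-frame order."""
    if not frames_columns:
        return []
    shared = set(frames_columns[0]).intersection(*frames_columns[1:])
    return [name for name in frames_columns[0] if name in shared]


cols_df = ["expver", "date@hdr", "statid@hdr", "wigos@hdr", "obsvalue@body",
           "integer_missing", "double_missing"]
cols_df2 = ["expver", "date@hdr", "statid@hdr", "obsvalue@body"]

print(common_columns([cols_df, cols_df2]))
```

Any subset of the printed names should be safe to request via `columns` without introducing missing values from absent columns.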


## Decoding a sequence of frames

If an ODB file is extremely large, it is undesirable to attempt to decode it into memory in its entirety. Further, if the frames within the file are not *compatible* it may be better to consider each of the frames separately.

By default the `read_odb` function returns an iterable sequence that lazily decodes ODB2 frames as they are needed.
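
The general idea of lazy decoding can be sketched with an ordinary Python generator. This only illustrates the concept and makes no claim about how `read_odb` is implemented; `decode_frame` and `lazy_frames` are hypothetical stand-ins.

```python
from itertools import islice

decoded = []  # records which frames have actually been decoded


def decode_frame(index):
    """Stand-in for an expensive frame decode (hypothetical helper)."""
    decoded.append(index)
    return {"frame": index}


def lazy_frames(nframes):
    """Yield decoded frames one at a time, only when requested."""
    for index in range(nframes):
        yield decode_frame(index)


# Only the frames that are actually consumed get decoded.
first_two = list(islice(lazy_frames(1000), 2))
print(first_two)
print("frames decoded so far:", decoded)
```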

In [15]:
for idx, df_decoded in enumerate(odc.read_odb('example-file2.odb')):
    if idx > 0: print()
    print("Decoded data frame:", idx)
    print(df_decoded)

Decoded data frame: 0
  expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0   0001  20200602     stat00  0-12345-0-67890         0.0000   
1   0001  20200602     stat01  0-12345-0-67891        12.3456   
2   0001  20200602     stat02  0-12345-0-67892        24.6912   
3   0001  20200602     stat03  0-12345-0-67893        37.0368   
4   0001  20200602     stat04  0-12345-0-67894        49.3824   
5   0001  20200602     stat05  0-12345-0-67895        61.7280   
6   0001  20200602     stat06  0-12345-0-67896        74.0736   
7   0001  20200602     stat07  0-12345-0-67897        86.4192   
8   0001  20200602     stat08  0-12345-0-67898        98.7648   
9   0001  20200602     stat09  0-12345-0-67899       111.1104   

   integer_missing  double_missing  
0           1234.0           12.34  
1           4321.0           43.21  
2              NaN             NaN  
3           1234.0           12.34  
4           4321.0           43.21  
5              NaN             NaN  
6  

## Aggregated or non-aggregated reading

There are two different reasons that decoding to a sequence of dataframes may be useful.

1. To page data through memory without consuming more resources than are available.
2. To handle a sequence of frames that do not have the same structure.

Conceptually in the first case, a sequence of frames may be considered to be one frame that has been split for technical reasons. The library is able to logically group these frames together into one logical, aggregated frame (and, indeed, it does this by default). Decoding aggregated logical frames in one step significantly improves performance of the decoder if offloading to ODC.

Note that frames do not have to have columns in the same *order* to be considered compatible.

Both the `Reader` and `read_odb` functionality take two arguments:

* `aggregated` - (default True) enables or disables aggregation of compatible frames.
* `max_aggregated` - (default None) sets a maximum number of rows to be combined into one logical frame before the library will split them anyway (for pagination purposes).
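
The effect of these two arguments can be modelled in plain Python. The sketch below is an illustrative model only, not the library's implementation. It assumes that only consecutive compatible frames are merged, and that a new logical frame is started whenever adding another frame would exceed `max_aggregated`; the library may split differently. The `aggregate_row_counts` helper and the type labels are hypothetical.

```python
def aggregate_row_counts(frames, aggregated=True, max_aggregated=None):
    """Group consecutive compatible frames and return rows per logical frame.

    Each frame is a (columns, nrows) pair, where columns maps name -> type.
    Dict equality ignores key order, so column order does not matter.
    """
    groups = []  # list of [columns, total_rows]
    for columns, nrows in frames:
        can_merge = (
            aggregated
            and groups
            and groups[-1][0] == columns
            and (max_aggregated is None
                 or groups[-1][1] + nrows <= max_aggregated)
        )
        if can_merge:
            groups[-1][1] += nrows
        else:
            groups.append([columns, nrows])
    return [total for _, total in groups]


wide = {"statid@hdr": "string", "wigos@hdr": "string", "obsvalue@body": "double"}
wide_reordered = {"obsvalue@body": "double", "statid@hdr": "string", "wigos@hdr": "string"}
narrow = {"statid@hdr": "string", "obsvalue@body": "double"}

frames = [(wide, 3), (wide_reordered, 3), (wide, 3), (wide, 1),
          (narrow, 3), (narrow, 3), (narrow, 3), (narrow, 1)]

print(aggregate_row_counts(frames))
print(aggregate_row_counts(frames, aggregated=False))
print(aggregate_row_counts(frames, max_aggregated=6))
```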

To demonstrate, we first build a file with several real frames and a smaller number of logical frames.

In [16]:
with open('example-file5.odb', 'wb') as f:
    odc.encode_odb(df, f, rows_per_frame=3)
    odc.encode_odb(df2, f, rows_per_frame=3)

We can interrogate the structure using two different readers

In [17]:
r5a = odc.Reader('example-file5.odb')
r5b = odc.Reader('example-file5.odb', aggregated=False)

print('aggregated row counts:', [f.nrows for f in r5a.frames])
print('separate   row counts:', [f.nrows for f in r5b.frames])

aggregated row counts: [10, 10]
separate   row counts: [3, 3, 3, 1, 3, 3, 3, 1]


By default we decode data in an aggregated fashion

In [18]:
for idx, df_decoded in enumerate(odc.read_odb('example-file5.odb')):
    if idx > 0: print()
    print("Decoded data frame:", idx)
    print(df_decoded)

Decoded data frame: 0
  expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0   0001  20200602     stat00  0-12345-0-67890         0.0000   
1   0001  20200602     stat01  0-12345-0-67891        12.3456   
2   0001  20200602     stat02  0-12345-0-67892        24.6912   
0   0001  20200602     stat03  0-12345-0-67893        37.0368   
1   0001  20200602     stat04  0-12345-0-67894        49.3824   
2   0001  20200602     stat05  0-12345-0-67895        61.7280   
0   0001  20200602     stat06  0-12345-0-67896        74.0736   
1   0001  20200602     stat07  0-12345-0-67897        86.4192   
2   0001  20200602     stat08  0-12345-0-67898        98.7648   
0   0001  20200602     stat09  0-12345-0-67899       111.1104   

   integer_missing  double_missing  
0           1234.0           12.34  
1           4321.0           43.21  
2              NaN             NaN  
0           1234.0           12.34  
1           4321.0           43.21  
2              NaN             NaN  
0  

But we can also decode the real frames separately

In [19]:
for idx, df_decoded in enumerate(odc.read_odb('example-file5.odb', aggregated=False)):
    if idx > 0: print()
    print("Decoded data frame:", idx)
    print(df_decoded)

Decoded data frame: 0
  expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0   0001  20200602     stat00  0-12345-0-67890         0.0000   
1   0001  20200602     stat01  0-12345-0-67891        12.3456   
2   0001  20200602     stat02  0-12345-0-67892        24.6912   

   integer_missing  double_missing  
0           1234.0           12.34  
1           4321.0           43.21  
2              NaN             NaN  

Decoded data frame: 1
  expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0   0001  20200602     stat03  0-12345-0-67893        37.0368   
1   0001  20200602     stat04  0-12345-0-67894        49.3824   
2   0001  20200602     stat05  0-12345-0-67895        61.7280   

   integer_missing  double_missing  
0           1234.0           12.34  
1           4321.0           43.21  
2              NaN             NaN  

Decoded data frame: 2
  expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0   0001  20200602     stat06  0-12345-0-67896   