Quick Start
************
This is a run-through example of how to use this package. We scan a set of netCDF4/HDF5 files
and create a single virtual ensemble dataset, which can be read in parallel from remote
storage using ``zarr``.
Single file JSONs
=================
This creates a set of references for each of the files listed in ``urls``. In this case,
we simply keep the resulting reference sets in memory, but we could have written them to
JSON files. Writing to files is useful because it lets us access the individual datasets or
redo the combine step (described in the next section).

.. code-block:: python

    import kerchunk.hdf
    import fsspec

    urls = ["s3://" + p for p in [
        'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010000.CHRTOUT_DOMAIN1.comp',
        'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010100.CHRTOUT_DOMAIN1.comp',
        'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010200.CHRTOUT_DOMAIN1.comp',
        'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010300.CHRTOUT_DOMAIN1.comp',
        'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010400.CHRTOUT_DOMAIN1.comp',
        'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010500.CHRTOUT_DOMAIN1.comp',
        'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010600.CHRTOUT_DOMAIN1.comp',
        'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010700.CHRTOUT_DOMAIN1.comp',
        'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010800.CHRTOUT_DOMAIN1.comp',
        'noaa-nwm-retro-v2.0-pds/full_physics/2017/201704010900.CHRTOUT_DOMAIN1.comp'
    ]]
    so = dict(
        anon=True, default_fill_cache=False, default_cache_type='first'
    )
    singles = []
    for u in urls:
        with fsspec.open(u, **so) as inf:
            h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, u, inline_threshold=100)
            singles.append(h5chunks.translate())

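
As noted at the start of this section, the reference sets could be written to JSON files
rather than kept in memory. The following is a minimal sketch of that step using only the
standard library. It assumes each reference set is a JSON-serializable dict, as the ones
returned by ``translate()`` are expected to be. The helper name ``write_reference_sets``,
the output directory and the file-naming scheme are hypothetical choices for illustration,
and the toy dicts stand in for real reference sets.

```python
import json
import tempfile
from pathlib import Path


def write_reference_sets(reference_sets, names, out_dir):
    """Write each reference set to ``<out_dir>/<name>.json`` and return the paths.

    Hypothetical helper: the naming scheme is an illustrative assumption.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for refs, name in zip(reference_sets, names):
        path = out_dir / f"{name}.json"
        path.write_text(json.dumps(refs))
        paths.append(str(path))
    return paths


# Toy stand-ins for the dicts produced by translate()
demo_sets = [
    {"version": 1, "refs": {".zgroup": '{"zarr_format": 2}'}},
    {"version": 1, "refs": {".zgroup": '{"zarr_format": 2}'}},
]
demo_dir = tempfile.mkdtemp()
demo_paths = write_reference_sets(demo_sets, ["file0", "file1"], demo_dir)
```

With the real data, one would pass ``singles`` together with names derived from ``urls``.
The resulting list of JSON paths can then be used as the input to the combine step in
place of the in-memory dicts.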
Multi-file JSONs
================
This code uses the output generated above to create a single ensemble dataset, with
one set of references pointing to all of the chunks in the individual files.

.. code-block:: python

    from kerchunk.combine import MultiZarrToZarr

    mzz = MultiZarrToZarr(
        singles,
        remote_protocol="s3",
        remote_options={'anon': True},
        concat_dims=["time"]
    )
    out = mzz.translate()

Again, ``out`` could be written to a JSON file by providing arguments to
``translate()``. Crucially, there is no restriction on where
this file lives: it can be anywhere that fsspec can read from.
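
To get a feel for what "one set of references pointing to all of the chunks" means, it can
help to inspect the combined output. The sketch below assumes the version 1 layout of the
reference format, in which the output is a dict with a ``refs`` mapping whose values are
either inline data (strings) or ``[url, offset, length]`` lists pointing into the original
files. The helper name ``summarize_references`` and the bucket and key names in the toy
data are hypothetical.

```python
def summarize_references(reference_set):
    """Count inline entries and chunk pointers, and collect the files pointed to."""
    # Assumes the version 1 layout; falls back to a flat mapping otherwise.
    refs = reference_set.get("refs", reference_set)
    summary = {"inline": 0, "pointers": 0, "targets": set()}
    for value in refs.values():
        if isinstance(value, (list, tuple)):
            # A pointer: the first element is the URL of the original file
            summary["pointers"] += 1
            summary["targets"].add(value[0])
        else:
            # Inline data such as zarr metadata or small chunks
            summary["inline"] += 1
    return summary


# Toy stand-in for a combined reference set
demo = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "velocity/0": ["s3://example-bucket/file0", 1000, 500],
        "velocity/1": ["s3://example-bucket/file1", 1000, 500],
    },
}
summary = summarize_references(demo)
```

Applied to the real ``out``, the set of targets would be expected to contain each of the
original file URLs, since the combined references point back into every input file.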
Using the output
================
This is what a user of the generated dataset would do. This person does not need to have
``kerchunk`` installed, or even ``h5py`` (the library we used to initially scan the files).

.. code-block:: python

    import xarray as xr

    ds = xr.open_dataset(
        "reference://", engine="zarr",
        backend_kwargs={
            "storage_options": {
                "fo": out,
                "remote_protocol": "s3",
                "remote_options": {"anon": True}
            },
            "consolidated": False
        }
    )
    # do analysis...
    ds.velocity.mean()

Since the invocation for xarray to read this data is a little involved, we recommend
declaring the data set in an ``intake`` catalog. Alternatively, you might split the command
into multiple lines by first constructing the filesystem or mapper (you will see this in some
examples).
Note that, if the combining was done previously and saved to a JSON file, then the path to
it should replace ``out``, above, along with a ``target_options`` for any additional
arguments fsspec might need to access it.
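
To make that substitution concrete, the following sketch builds the ``backend_kwargs``
dict for the case where the combined references were saved to a JSON file. The helper name
``reference_backend_kwargs``, the path ``s3://example-bucket/combined.json`` and the
``target_options`` value are hypothetical; only the key names are taken from the example
above.

```python
def reference_backend_kwargs(fo, remote_options=None, target_options=None):
    """Build backend_kwargs for opening a reference dataset with xarray.

    ``fo`` is either an in-memory reference dict or the path to a references JSON.
    """
    storage_options = {
        "fo": fo,
        "remote_protocol": "s3",
        "remote_options": remote_options or {},
    }
    if target_options is not None:
        # Extra arguments needed to read the JSON file itself
        storage_options["target_options"] = target_options
    return {"storage_options": storage_options, "consolidated": False}


kwargs = reference_backend_kwargs(
    "s3://example-bucket/combined.json",  # hypothetical location of the saved JSON
    remote_options={"anon": True},
    target_options={"anon": True},  # assumption: the JSON is publicly readable
)
```

The result would be passed as ``backend_kwargs=kwargs`` to ``xr.open_dataset``, with
``"reference://"`` and ``engine="zarr"`` unchanged from the example above.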
Example/Tutorial Notebook
=========================
A set of tutorial notebooks, presented at the Earth Science Information Partners 2022 Winter Meeting, can be found at the following link, along with links to run the code in free cloud-based notebook environments: https://github.com/lsterzinger/2022-esip-kerchunk-tutorial