Making Data Packages
====================
Intake can be used to create :term:`Data packages`, so that you can easily distribute
your catalogs - others can just "install data". Since you may also want to distribute
custom catalogs, perhaps with visualisations and driver code, packaging these things
together is a great convenience. Indeed, packaging gives you the opportunity to
version-tag your distribution and to declare the requirements needed to be able to
use the data. This is a common pattern for distributing code for Python and other
languages, but not commonly seen for data artifacts.
The current version of Intake allows making data packages using standard Python
tools (to be installed, for example, using ``pip``).
The previous, now deprecated, technique is still described below, under
:ref:`condapack`, and is specific to the ``conda`` packaging system.
Python packaging solution
-------------------------
Intake allows you to register data artifacts (catalogs and data sources) in the
metadata of a Python package. This means that, when you install that package, Intake
will automatically know of the registered items, and they will appear within the
"builtin" catalog ``intake.cat``.
Here we assume that you understand what is meant by a Python package (i.e., a
folder containing ``__init__.py`` and other code, config and data files).
Furthermore, you should familiarise yourself with what is required for
bundling such a package into a *distributable* package (one with a ``setup.py``)
by reading the `official packaging documentation`_.

.. _official packaging documentation: https://packaging.python.org/tutorials/packaging-projects/
The `intake examples`_ repository contains a full tutorial for packaging and distributing
Intake data and/or catalogs for ``pip`` and ``conda``; see the directory
"data_package/".

.. _intake examples: https://github.com/intake/intake-examples
Entry points definition
'''''''''''''''''''''''
Intake uses the concept of `entry points` to define the entries that are provided
by a given package. Entry points provide a mechanism to register metadata about a
package at install time, so that it can easily be found by other packages such as Intake.
Entry points were originally provided by a `separate package`_, but the functionality is
included in the standard library as of Python 3.8 (you will not need to install it
yourself, as Intake requires it).
All you need to do to register an entry in ``intake.cat`` is:
- define a data source somewhere in your package. This object can
  be of any type that makes sense to Intake, including Catalogs, and sources
  that have drivers defined in the very same package. Naturally, if you can have
  catalogs, you can populate these however you wish, including with more catalogs;
  you need not be restricted to simply loading in YAML files. A minimal sketch of
  such a module appears after this list.
- include a block in your call to ``setup()`` in ``setup.py`` with code something like:
  .. code-block:: python

      entry_points={
          'intake.catalogs': [
              'sea_cat = intake_example_package:cat',
              'sea_data = intake_example_package:data'
          ]
      }
  Here only the lines with "sea_cat" and "sea_data" are specific to the example
  package; the rest is required boilerplate. Each of those two lines defines a name
  for the data entry (before the "=" sign) and the location to load from, in
  module:object format.
- install the package using ``pip`` or ``python setup.py install``, or package it for ``conda``
.. _separate package: https://github.com/takluyver/entrypoints
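For concreteness, here is a minimal sketch (mentioned in the list above) of what the
module referenced by those entry points might contain. The file names are hypothetical
stand-ins for your real assets; any way of constructing a catalog or source object
works equally well.

.. code-block:: python

    # intake_example_package/__init__.py -- a hypothetical sketch; "sea.yaml"
    # and "sea_surface_temp.csv" are placeholder files shipped in the package
    import os
    import intake

    _here = os.path.dirname(__file__)

    # object referenced by the entry point "sea_cat = intake_example_package:cat"
    cat = intake.open_catalog(os.path.join(_here, "sea.yaml"))

    # object referenced by the entry point "sea_data = intake_example_package:data"
    data = intake.open_csv(os.path.join(_here, "sea_surface_temp.csv"))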
Intake's process
''''''''''''''''
When Intake is imported, it investigates all registered entry points in the
``"intake.catalogs"`` group and assigns each name to the location of its target
object. In the above example, ``intake.cat.sea_cat`` would be associated with the
``cat`` object in the ``intake_example_package`` package, and so on.
Note that Intake does **not** immediately import the given package or module, because imports
can sometimes be expensive, and if you have a lot of data packages, it might cause
a slow-down every time that Intake is imported. Instead, a placeholder entry is
created, and whenever the entry is accessed, that's when the particular package
will be imported.
.. code-block:: python

    In [1]: import intake

    In [2]: intake.cat.sea_cat  # does not import yet
    Out[2]: <Entry containing Catalog named sea_cat>

    In [3]: cat = intake.cat.sea_cat()  # imports now

    In [4]: cat  # this data source happens to be a catalog
    Out[4]: <Intake catalog: sea>
(Note the parentheses here: calling the entry explicitly initialises the source;
normally you don't have to do this.)
.. _condapack:
Pure conda solution
-------------------
This packaging method is deprecated, but still available.
Combined with the `Conda Package Manager <https://conda.io/docs/>`_, Intake
makes it possible to create :term:`Data packages` which can be installed and upgraded just like
software packages. This offers several advantages:
* Distributing Catalogs and Drivers becomes as easy as ``conda install``
* Data packages can be versioned, improving reproducibility in some cases
* Data packages can depend on the libraries required for reading
* Data packages can be self-describing using Intake catalog files
* Applications that need certain Catalogs can include data packages in their dependency list
In this tutorial, we give a walk-through to enable you to distribute any
Catalogs to others, so that they can access the data using Intake without worrying about where it
resides or how it should be loaded.
Implementation
''''''''''''''
The function ``intake.catalog.default.load_combo_catalog`` searches for YAML catalog files in a number
of places at import time. All entries in these catalogs are flattened and placed in the "builtin"
``intake.cat``.
The places searched are:
* a platform-specific user directory as given by the `appdirs`_ package
* in the environment's "/share/intake" data directory, where the location of the current environment
  is found from virtualenv or conda environment variables
* in directories listed in the "INTAKE_PATH" environment variable or "catalog_path" config parameter
.. _appdirs: https://github.com/ActiveState/appdirs
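For example, a minimal sketch of using the "INTAKE_PATH" mechanism (the directory path
below is purely illustrative):

.. code-block:: python

    import os

    # extra directory for Intake to search for catalog YAML files; must be
    # set before Intake is imported, since the search happens at import time
    os.environ["INTAKE_PATH"] = "/data/shared/catalogs"

    import intake

    list(intake.cat)  # entries from all discovered catalog files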
Defining a Package
''''''''''''''''''
The steps involved in creating a data package are:
1. Identifying a dataset, which can be accessed via a URL or included directly as one or more files in the package.
2. Creating a package containing:

   * an intake catalog file
   * a ``meta.yaml`` file (description of the data, version, requirements, etc.)
   * a script to copy the data

3. Building the package using the command ``conda build``.
4. Uploading the package to a package repository such as `Anaconda Cloud <https://anaconda.org>`_ or your own private
   repository.
Data packages are standard conda packages that install an Intake catalog file into the user's conda environment
(``$CONDA_PREFIX/share/intake``). A data package does not necessarily imply there are data files inside the package.
A data package could describe remote data sources (such as files in S3) and take up very little space on disk.
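As a sketch of that idea, the catalog shipped in such a package might point only at
remote locations; the bucket and file names below are made up:

.. code-block:: python

    import intake

    # hypothetical remote source: the package installs only a catalog entry
    # like this, so it contains no data files itself
    source = intake.open_csv(
        "s3://example-bucket/us-states/*.csv",
        storage_options={"anon": True},  # public bucket, no credentials
    )
    df = source.read()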
These packages are considered ``noarch`` packages, so that one package can be installed on any platform, with any
version of Python (or no Python at all). The easiest way to create such a package is using a
`conda build <https://conda.io/docs/commands/build/conda-build.html>`_ recipe.
Conda-build recipes are stored in a directory that contains files like:
* ``meta.yaml`` - description of package metadata
* ``build.sh`` - script for building/installing package contents (on Linux/macOS)
* other files needed by the package (catalog files and data files for data packages)
An example that packages up data from a GitHub repository would look like this:
.. code-block:: yaml

    # meta.yaml
    package:
      version: '1.0.0'
      name: 'data-us-states'

    source:
      git_rev: v1.0.0
      git_url: https://github.com/CivilServiceUSA/us-states

    build:
      number: 0
      noarch: generic

    requirements:
      run:
        - intake
      build: []

    about:
      description: Data about US states from CivilServices (https://civil.services/)
      license: MIT
      license_family: MIT
      summary: Data about US states from CivilServices
The key part of a data package recipe (different from typical conda recipes) is the ``build`` section:
.. code-block:: yaml

    build:
      number: 0
      noarch: generic
This will create a package that can be installed on any platform, regardless of the platform where the package is
built. If you need to rebuild a package, the build number can be incremented to ensure users get the latest version
when they run ``conda update``.
The corresponding ``build.sh`` file in the recipe looks like this:
.. code-block:: bash

    #!/bin/bash

    mkdir -p $PREFIX/share/intake/civilservices
    cp $SRC_DIR/data/states.csv $PREFIX/share/intake/civilservices/
    cp $RECIPE_DIR/us_states.yaml $PREFIX/share/intake/
The ``$SRC_DIR`` variable refers to any source tree checked out (from GitHub or another service), and
``$RECIPE_DIR`` refers to the directory where the ``meta.yaml`` is located.
Finishing out this example, the catalog file for this data source looks like this:
.. code-block:: yaml

    sources:
      states:
        description: US state information from [CivilServices](https://civil.services/)
        driver: csv
        args:
          urlpath: '{{ CATALOG_DIR }}/civilservices/states.csv'
        metadata:
          origin_url: 'https://github.com/CivilServiceUSA/us-states/blob/v1.0.0/data/states.csv'
The ``{{ CATALOG_DIR }}`` Jinja2 variable is used to construct a path relative to where the catalog file was installed.
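Once the package is installed, its entries are merged into the builtin catalog, so
end-user code might look like the following sketch (assuming the ``states`` entry above):

.. code-block:: python

    import intake

    # "states" is the source name from the installed catalog file above
    df = intake.cat.states.read()  # load states.csv as a pandas DataFrame
    print(df.head())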
To build the package, you must have conda-build installed:
.. code-block:: bash

    conda install conda-build
Building the package requires no special arguments:
.. code-block:: bash

    conda build my_recipe_dir
Conda-build will display the path of the built package, which you will need in order to upload it.
If you want your data package to be publicly available on `Anaconda Cloud <https://anaconda.org>`_, you can install
the anaconda-client utility:
.. code-block:: bash

    conda install anaconda-client
Then you can register your Anaconda Cloud credentials and upload the package:
.. code-block:: bash

    anaconda login
    anaconda upload /Users/intake_user/anaconda/conda-bld/noarch/data-us-states-1.0.0-0.tar.bz2
Best Practices
--------------
Versioning
''''''''''
* Versions for data packages should be used to indicate changes in the data values or schema. This allows applications
  to easily pin to the specific data version they depend on.
* Putting data files into a package ensures reproducibility by allowing a version number to be associated with files
  on disk. This can consume quite a bit of disk space for the user, however. Large data files are not generally
  included in ``pip`` or ``conda`` packages, so, if possible, you should reference the data assets in an external
  location from which they can be loaded.
Packaging
'''''''''
* Packages that refer to remote data sources (such as databases and REST APIs) need to consider authentication.
  Do not include authentication credentials inside a data package; they should be obtained from the environment.
* Data packages should depend on the Intake plugins required to read the data, or on Intake itself.
* You may well want to break any driver code out into a separate package so that it can be updated
  independently of the data. The data package would then depend on the driver package.
Nested catalogs
'''''''''''''''
As noted above, entries will appear in the user's builtin
catalog as ``intake.cat.*``. In the case that the catalog has multiple entries, it may be desirable
to put the entries below a namespace, as ``intake.cat.data_package.*``. This can be achieved by having
one catalog containing the (several) data sources, with only a single top-level entry pointing to
it. This catalog could be defined in a YAML file, created using any other catalog driver, or constructed
in code, e.g.:
.. code-block:: python

    from intake.catalog import Catalog
    from intake.catalog.local import LocalCatalogEntry as Entry

    # my_input_list is assumed to be a sequence of (name, url) pairs, and
    # descr a description string for each entry
    cat = Catalog()
    cat._entries = {name: Entry(name, descr, driver='package.module.driver',
                                args={"urlpath": url})
                    for name, url in my_input_list}
If your package contains many sources of different types, you may even nest the catalogs, i.e.,
have a top-level catalog whose contents are also catalogs:
.. code-block:: python

    e = Entry('first_cat', 'sample', driver='catalog')
    e._default_source = cat
    top_level = Catalog()
    top_level._entries = {'first_cat': e, ...}
where your entry point might look something like: ``"my_cat = my_package:top_level"``. You could achieve the same
with multiple YAML files.
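Tying this together, a minimal sketch of the corresponding ``setup.py`` (the package
name and version are placeholders):

.. code-block:: python

    # setup.py -- hypothetical sketch registering the nested top-level catalog
    from setuptools import setup, find_packages

    setup(
        name="my_package",
        version="0.1.0",
        packages=find_packages(),
        install_requires=["intake"],
        entry_points={
            "intake.catalogs": [
                "my_cat = my_package:top_level",
            ],
        },
    )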