Quickstart
==========
This guide will show you how to get started using Intake to read data, and give you a flavour
of how Intake feels to the :term:`Data User`.
It assumes you are working in either a conda or a virtualenv/pip environment. For notebooks with
executable code, see the :doc:`examples`. This walk-through can be run from a notebook or an interactive
Python session.
Installation
------------
If you are using `Anaconda`_ or Miniconda, install Intake with the following command::

    conda install -c conda-forge intake

If you are using virtualenv/pip, run the following command::

    pip install intake
Note that this installs only the minimal set of optional requirements. If you want a more complete
install, use ``intake[complete]`` instead.
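For example, with pip a more complete install would look like this (quoting guards against shell
globbing of the square brackets)::

    pip install "intake[complete]"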
.. _Anaconda: https://www.anaconda.com/download/
Creating Sample Data
--------------------
Let's begin by creating a sample data set and catalog. At the command line, run the ``intake example`` command.
This will create an example data :term:`Catalog` and two CSV data files. These files contain some basic facts about the 50
US states, and the catalog includes a specification of how to load them.
Loading a Data Source
---------------------
:term:`Data sources<Data-set>` can be created directly with the ``open_*()`` functions in the ``intake``
module. To read our example data::

    >>> import intake
    >>> ds = intake.open_csv('states_*.csv')
    >>> print(ds)
    <intake.source.csv.CSVSource object at 0x1163882e8>
Each open function has different arguments, specific to the data format or service being used.
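For example, the CSV driver accepts a ``csv_kwargs`` argument (described further below) that is handed to the
underlying pandas reader. A minimal sketch, selecting two columns from the example data::

    >>> ds = intake.open_csv('states_*.csv', csv_kwargs={'usecols': ['state', 'slug']})
    >>> ds.read().head(2)
         state     slug
    0  Alabama  alabama
    1   Alaska   alaska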
Reading Data
------------
Intake reads data into memory using :term:`containers<Container>` you are already familiar with:
* Tables: Pandas DataFrames
* Multidimensional arrays: NumPy arrays
* Semistructured data: Python lists of objects (usually dictionaries)
To find out what kind of container a data source will produce, inspect the ``container`` attribute::

    >>> ds.container
    'dataframe'
The result will be ``dataframe``, ``ndarray``, or ``python``. (New container types will be added in
the future.)
For data that fits in memory, you can ask Intake to load it directly::

    >>> df = ds.read()
    >>> df.head()
            state        slug code                nickname  ...
    0     Alabama     alabama   AL      Yellowhammer State
    1      Alaska      alaska   AK       The Last Frontier
    2     Arizona     arizona   AZ  The Grand Canyon State
    3    Arkansas    arkansas   AR       The Natural State
    4  California  california   CA            Golden State
Many data sources will also have quick-look plotting available. The attribute ``.plot`` lists
a number of built-in plotting methods, such as ``.scatter()``; see :doc:`plotting`.
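A minimal sketch, assuming the optional ``hvplot`` plotting dependency is installed, using two numeric columns
from the example data::

    >>> ds.plot.scatter(x='population_rank', y='population')  # renders an interactive plot in a notebook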
Intake data sources can have *partitions*. A partition refers to a contiguous chunk of data that can be loaded
independently of any other partition. The partitioning scheme is entirely up to the plugin author. In
the case of the CSV plugin, each ``.csv`` file is one partition.
To read data from a data source one chunk at a time, the ``read_chunked()`` method returns an iterator::

    >>> for chunk in ds.read_chunked(): print('Chunk: %d' % len(chunk))
    ...
    Chunk: 24
    Chunk: 26
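A single partition can also be fetched by its index with the generic ``read_partition()`` method of data
sources; a minimal sketch, where the first partition corresponds to the first CSV file and matches the first
chunk size above::

    >>> first = ds.read_partition(0)  # load only the first file
    >>> len(first)
    24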
Working with Dask
-----------------
Working with large datasets is much easier with a parallel, out-of-core computing library like
`Dask <https://dask.pydata.org/en/latest/>`_. Intake can create Dask containers (like ``dask.dataframe``)
from data sources that will load their data only when required::

    >>> ddf = ds.to_dask()
    >>> ddf
    Dask DataFrame Structure:
                  admission_date admission_number capital_city capital_url    code constitution_url facebook_url landscape_background_url map_image_url nickname population population_rank skyline_background_url    slug   state state_flag_url state_seal_url twitter_url website
    npartitions=2
                          object            int64       object      object  object           object       object                   object        object   object      int64           int64                 object  object  object         object         object      object  object
    ...                      ...              ...          ...         ...     ...              ...          ...                      ...           ...      ...        ...             ...                    ...     ...     ...            ...            ...         ...     ...
    ...                      ...              ...          ...         ...     ...              ...          ...                      ...           ...      ...        ...             ...                    ...     ...     ...            ...            ...         ...     ...
    Dask Name: from-delayed, 4 tasks
The Dask containers will be partitioned in the same way as the Intake data source, allowing different chunks
to be processed in parallel. Please read the Dask documentation to understand the differences when
working with Dask collections (Bag, Array, or DataFrame).
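Operations on the Dask DataFrame are lazy and only execute when a result is requested. A minimal sketch, using
the ``population`` column shown in the structure above::

    >>> total = ddf['population'].sum()  # builds a task graph; nothing is read yet
    >>> total.compute()                  # triggers the parallel read and the computation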
Opening a Catalog
-----------------
A :term:`Catalog` is an inventory of data sources, with the type and arguments prescribed for each, and
arbitrary metadata about each source.
In the simplest case, a catalog can be described by a file in YAML format, a
":term:`Catalog file`". In real usage, catalogs can be defined in a number of ways, such as remote files,
by connecting to a third-party data service (e.g., an SQL server), or through an Intake :term:`Server` protocol, which
can implement any number of ways to search and deliver data sources.
The ``intake example`` command, above, created a catalog file
with the following :term:`YAML`-syntax content:

.. code-block:: yaml

    sources:
      states:
        description: US state information from [CivilServices](https://civil.services/)
        driver: csv
        args:
          urlpath: '{{ CATALOG_DIR }}/states_*.csv'
        metadata:
          origin_url: 'https://github.com/CivilServiceUSA/us-states/blob/v1.0.0/data/states.csv'
To load a :term:`Catalog` from a :term:`Catalog file`::

    >>> cat = intake.open_catalog('us_states.yml')
    >>> list(cat)
    ['states']
This catalog contains one data source, called ``states``. It can be accessed by attribute::

    >>> cat.states.to_dask()[['state','slug']].head()
            state        slug
    0     Alabama     alabama
    1      Alaska      alaska
    2     Arizona     arizona
    3    Arkansas    arkansas
    4  California  california
Placing data source specifications into a catalog like this enables declaring data sets in a single canonical place,
and avoids boilerplate code in each notebook/script that makes use of the data. The catalogs can also
reference one another (as sketched below), be stored remotely, and include extra metadata such as a set of named quick-look plots that
are appropriate for the particular data source. Note that catalogs are **not** restricted
to being stored in YAML files; that just happens to be the simplest way to display them.
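For instance, one catalog can include another catalog file as one of its sources. An illustrative sketch only,
assuming the ``yaml_file_cat`` driver bundled with Intake (argument names may differ between versions):

.. code-block:: yaml

    sources:
      more_states:
        description: nested reference to the catalog file created earlier
        driver: yaml_file_cat
        args:
          path: '{{ CATALOG_DIR }}/us_states.yml'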
Many catalog entries will also contain "user_parameters" blocks, which are indications of options explicitly
allowed by the catalog author, or for validation of the values passed. The user can customise how a data
source is accessed by providing values for the user_parameters, overriding the arguments specified in
the entry, or passing extra keyword arguments to the driver. The keywords that may
be passed are limited to the user_parameters defined and the inputs expected by the specific
driver - such usage is expected only from those already familiar with the specifics of the given
format. In the following example, the user overrides the "csv_kwargs" keyword, which is described
in the documentation for :class:`CSVSource <intake.source.csv.CSVSource>` and gets passed down to the CSV reader::

    # pass extra kwargs understood by the csv driver
    >>> cat.states(csv_kwargs={'header': None, 'skiprows': 1}).read().head()
             0           1  ...                              17
    0  Alabama     alabama  ...  https://twitter.com/alabamagov
    1   Alaska      alaska  ...      https://twitter.com/alaska
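For reference, the catalog author declares such options with a parameters block in the entry itself. An
illustrative sketch only (the file pattern and parameter below are hypothetical; see the catalog documentation
for the exact schema):

.. code-block:: yaml

    sources:
      states:
        driver: csv
        args:
          urlpath: '{{ CATALOG_DIR }}/states_{{ part }}.csv'
        parameters:
          part:
            description: hypothetical parameter choosing which of the files to load
            type: str
            default: "1"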
Note that, if you are *creating* such catalogs, you may well start by trying the ``open_csv`` command,
above, and then use ``print(ds.yaml())``. If you do this now, you will see that the output is very
similar to the catalog file we have provided.
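A sketch of that workflow (the exact fields in the generated YAML can vary between Intake versions)::

    >>> ds = intake.open_csv('states_*.csv')
    >>> print(ds.yaml())  # emits a YAML entry resembling the catalog file shown above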
Installing Data Source Packages
-------------------------------
Intake makes it possible to create :term:`Data packages` (``pip`` or ``conda``)
that install data sources into a
global catalog. For example, we can
install a data package containing the same data we have been working with::

    conda install -c intake data-us-states
:term:`Conda` installs the catalog file in this package to ``$CONDA_PREFIX/share/intake/us_states.yml``.
Now, when we import
``intake``, we will see the data from this package appear as part of a global catalog called ``intake.cat``. In this
particular case we use Dask to do the reading (which can handle larger-than-memory data and parallel
processing), but ``read()`` would also work::

    >>> import intake
    >>> intake.cat.states.to_dask()[['state','slug']].head()
            state        slug
    0     Alabama     alabama
    1      Alaska      alaska
    2     Arizona     arizona
    3    Arkansas    arkansas
    4  California  california
The global catalog is a union of all catalogs installed in the conda/virtualenv environment and also any catalogs
installed in user-specific locations.
Adding Data Source Packages using the Intake path
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Intake checks the Intake config file for ``catalog_path``, or the environment variable ``INTAKE_PATH``, for a colon-separated
list of paths (semicolon-separated on Windows) to search for catalog files.
When you import ``intake``, you will see all entries from all of the catalogs referenced in this way as part of a global
catalog called ``intake.cat``.
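For example, on a Unix shell (the directory names below are placeholders)::

    export INTAKE_PATH="$HOME/my_catalogs:/shared/intake_catalogs"
    python -c "import intake; print(list(intake.cat))"  # entries from those catalogs now appear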
Using the GUI
-------------
A graphical data browser is available in the Jupyter notebook environment or as a standalone web server.
It shows the
contents of any installed catalogs, and also allows you to select local and remote catalogs
and to browse and select entries from them. See :doc:`gui`.
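A minimal sketch of the notebook case, assuming the optional ``panel`` dependency is installed::

    >>> import intake
    >>> intake.gui  # displaying this object in a notebook cell opens the data browser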