.. py:currentmodule:: Orange.data.io
################################
Loading and saving data (``io``)
################################
:obj:`Orange.data.Table` supports loading from several file formats:
* Comma-separated values (\*.csv) file,
* Tab-separated values (\*.tab, \*.tsv) file,
* Excel spreadsheet (\*.xls, \*.xlsx),
* Python pickle.
In addition, the text-based files (CSV, TSV) can be compressed with gzip,
bzip2 or xz (e.g. \*.csv.gz).
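
As a rough illustration of how a compressed text file can be recognized by its
extension, the sketch below uses only the Python standard library. The helper
name ``open_text`` and the suffix table ``OPENERS`` are assumptions made for
this example; they are not part of Orange's API and do not describe how Orange
implements the feature.

.. code-block:: python

    import bz2
    import gzip
    import lzma
    import os
    import tempfile

    # Hypothetical suffix table for this sketch; not taken from Orange's sources.
    OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


    def open_text(path, mode="rt"):
        """Open a plain or compressed text file based on its last extension."""
        _, ext = os.path.splitext(path)
        opener = OPENERS.get(ext, open)
        return opener(path, mode, encoding="utf-8")


    # Round trip through a gzip-compressed CSV file in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.csv.gz")
        with open_text(path, "wt") as f:
            f.write("a,b\n1,2\n")
        with open_text(path) as f:
            content = f.read()

Files without a recognized compression suffix fall through to the built-in
``open``, so the same helper can serve plain \*.csv and \*.tab files.
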
Header format
=============
The data in CSV, TSV, and Excel files can be described in an extended
three-line header format, or a condensed single-line header format.
Three-line header format
------------------------
A three-line header consists of:
1. **Feature names** on the first line. Feature names can include any combination
of characters.
2. **Feature types** on the second line. If a type is left empty, it is
determined automatically; otherwise it can be any of the following:
* ``discrete`` (or ``d``) — imported as :obj:`Orange.data.DiscreteVariable`,
* a space-separated **list of discrete values**, like "``male female``",
which results in an :obj:`Orange.data.DiscreteVariable` with those values,
in that order. If a value contains a space character, the space must be
escaped by prefixing it with a backslash ('\\').
* ``continuous`` (or ``c``) — imported as :obj:`Orange.data.ContinuousVariable`,
* ``string`` (or ``s``, or ``text``) — imported as :obj:`Orange.data.StringVariable`,
* ``time`` (or ``t``) — imported as :obj:`Orange.data.TimeVariable`, if the
values parse as `ISO 8601 <https://en.wikipedia.org/wiki/ISO_8601>`_ date/time formats.
3. **Flags** (optional) on the third header line. A feature's flag cell can be
empty, or it can contain a space-separated, consistent combination of:
* ``class`` (or ``c``) — feature will be imported as a class variable.
Most algorithms expect a single class variable.
* ``meta`` (or ``m``) — feature will be imported as a meta-attribute, just
describing the data instance but not actually used for learning,
* ``weight`` (or ``w``) — the feature marks the weight of examples (in
algorithms that support weighted examples),
* ``ignore`` (or ``i``) — feature will not be imported,
* ``<key>=<value>`` pairs, which are custom attributes recognized in specific
contexts, for instance ``color``, which defines the color palette used when
the variable is visualized, or ``type=image``, which signals that the
variable contains a path to an image.
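
The cell syntax described above (space-separated items, backslash-escaped
spaces, and ``<key>=<value>`` attributes) can be sketched in plain Python as
follows. The function names ``split_escaped`` and ``parse_flags`` are
hypothetical helpers written for this illustration; Orange's own parser may
behave differently in edge cases.

.. code-block:: python

    import re


    def split_escaped(cell):
        """Split a cell on spaces that are not preceded by a backslash."""
        parts = re.split(r"(?<!\\) ", cell.strip())
        # Turn escaped spaces ("\ ") back into plain spaces; drop empty parts.
        return [p.replace("\\ ", " ") for p in parts if p]


    def parse_flags(cell):
        """Separate plain flags from <key>=<value> attributes."""
        flags, attributes = [], {}
        for item in split_escaped(cell):
            if "=" in item:
                key, value = item.split("=", 1)
                attributes[key] = value
            else:
                flags.append(item)
        return flags, attributes


    # A type cell listing three discrete values, one containing a space.
    values = split_escaped(r"male female not\ stated")

    # A flag cell combining a plain flag with a custom attribute.
    flags, attributes = parse_flags("meta type=image")

Here ``values`` holds the three discrete values with the escaped space
restored, while ``flags`` and ``attributes`` keep the plain flag and the
custom attribute apart.
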
An example of the iris dataset in Orange's three-line format
(:download:`iris.tab <../../../../Orange/datasets/iris.tab>`):
.. literalinclude:: ../../../../Orange/datasets/iris.tab
:lines: 1-7
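
A file with this layout can also be produced with the standard ``csv`` module.
The following is a minimal sketch using a tiny, made-up dataset; the column
names and values are illustrative assumptions and are not taken from
``iris.tab``.

.. code-block:: python

    import csv
    import io

    # A tiny, made-up dataset; names and values are illustrative only.
    header = [
        ["length", "color", "species"],           # 1. feature names
        ["continuous", "red green", "discrete"],  # 2. feature types
        ["", "meta", "class"],                    # 3. flags
    ]
    rows = [["5.1", "red", "setosa"], ["6.3", "green", "virginica"]]

    # Write the table as tab-separated text into an in-memory buffer.
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(header + rows)

    # Read it back and split the three header lines from the data lines.
    buffer.seek(0)
    names, types, flags, *data = csv.reader(buffer, delimiter="\t")
    columns = list(zip(names, types, flags))

Writing the same text to a file with a \*.tab extension would give a file in
the three-line layout described in this section.
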
Single-line header format
-------------------------
A single-line header consists of feature names, each optionally prefixed by a
"``<flags>#``" string, i.e. flags followed by a hash ('#') sign. The flags can
be a consistent combination of:
* ``c`` for class feature (also known as a target variable or dependent variable),
* ``i`` for feature to be ignored,
* ``m`` for meta attributes (not used in learning),
* ``C`` for features that are continuous (numeric),
* ``D`` for features that are discrete (categorical),
* ``T`` for features that represent date and/or time in one of the ISO 8601
formats,
* ``S`` for string features.
If some or all names or flags are omitted, the missing names, types, and flags
are inferred automatically. The inference is correct in most cases, but it is
not guaranteed.
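
To make the "``<flags>#``" prefix concrete, the sketch below splits a header
cell into its name, role, and type. The function ``parse_column`` and the
descriptive labels in the two dictionaries are assumptions of this example,
not Orange's API. As a simplification, it treats any cell containing '#' as
flagged, so it would misread an unflagged name that itself contains a hash
sign.

.. code-block:: python

    # Flag letters as listed above; the descriptive labels are this sketch's own.
    ROLE_FLAGS = {"c": "class", "i": "ignore", "m": "meta"}
    TYPE_FLAGS = {"C": "continuous", "D": "discrete", "T": "time", "S": "string"}


    def parse_column(cell):
        """Return (name, role, type) for one single-line header cell."""
        flags, sep, name = cell.partition("#")
        if not sep:
            # No '#' present: the whole cell is the name, nothing is declared.
            return cell, None, None
        role = next((ROLE_FLAGS[f] for f in flags if f in ROLE_FLAGS), None)
        kind = next((TYPE_FLAGS[f] for f in flags if f in TYPE_FLAGS), None)
        return name, role, kind


    header = ["C#length", "mS#name", "cD#species", "width"]
    parsed = [parse_column(cell) for cell in header]

A ``None`` role or type in the result marks a property that the header does
not declare and that would therefore have to be inferred from the data.
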