.. py:currentmodule:: Orange.data.io

################################
Loading and saving data (``io``)
################################

:obj:`Orange.data.Table` supports loading from several file formats:

* Comma-separated values (\*.csv) file,
* Tab-separated values (\*.tab, \*.tsv) file,
* Excel spreadsheet (\*.xls, \*.xlsx),
* Python pickle.

In addition, the text-based files (CSV, TSV) can be compressed with gzip,
bzip2 or xz (e.g. \*.csv.gz).
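
As a minimal sketch of the compressed case, the snippet below writes a tiny CSV
file through Python's standard ``gzip`` and ``csv`` modules and reads it back.
The file name ``tiny.csv.gz`` and its two columns are made up for illustration.
The last step assumes that Orange is installed and passes the path to
``Table.from_file``; it is skipped otherwise.

.. code-block:: python

   import csv
   import gzip
   import os
   import tempfile

   # Illustrative data: a single-line header followed by two rows.
   rows = [
       ["sepal length", "iris"],
       ["5.1", "Iris-setosa"],
       ["7.0", "Iris-versicolor"],
   ]

   # The double extension marks a gzip-compressed CSV file.
   path = os.path.join(tempfile.mkdtemp(), "tiny.csv.gz")
   with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
       csv.writer(f).writerows(rows)

   # Reading the file back with the standard library checks the round trip.
   with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
       read_back = list(csv.reader(f))

   try:
       from Orange.data import Table
   except ImportError:
       table = None  # Orange is not installed; only the file was written
   else:
       # Assumption: Orange picks the reader from the file extension.
       table = Table.from_file(path)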


Header Format
=============

The data in CSV, TSV, and Excel files can be described in an extended
three-line header format, or a condensed single-line header format.


Three-line header format
------------------------

A three-line header consists of:

1. **Feature names** on the first line. Feature names can include any combination
   of characters.

2. **Feature types** on the second line. If a type is left empty, it is determined
   automatically; otherwise it can be any of the following:

   * ``discrete`` (or ``d``) — imported as :obj:`Orange.data.DiscreteVariable`,
   * a space-separated **list of discrete values**, like "``male female``",
     which results in a :obj:`Orange.data.DiscreteVariable` with those values
     in that order. A space character inside an individual value must be
     escaped by prefixing it with a backslash ('\\').
   * ``continuous`` (or ``c``) — imported as :obj:`Orange.data.ContinuousVariable`,
   * ``string`` (or ``s``, or ``text``) — imported as :obj:`Orange.data.StringVariable`,
   * ``time`` (or ``t``) — imported as :obj:`Orange.data.TimeVariable`, if the
     values parse as `ISO 8601 <https://en.wikipedia.org/wiki/ISO_8601>`_ date/time formats.

3. **Flags** (optional) on the third header line. A feature's flag can be empty,
   or it can contain a space-separated, consistent combination of:

   * ``class`` (or ``c``) — feature will be imported as a class variable.
     Most algorithms expect a single class variable.
   * ``meta`` (or ``m``) — feature will be imported as a meta-attribute, just
     describing the data instance but not actually used for learning,
   * ``weight`` (or ``w``) — the feature marks the weight of examples (in
     algorithms that support weighted examples),
   * ``ignore`` (or ``i``) — feature will not be imported,
   * ``<key>=<value>`` are custom attributes recognized in specific contexts,
     for instance ``color``, which defines the color palette used when the
     variable is visualized, or ``type=image``, which signals that the
     variable contains a path to an image.

An example of the iris dataset in Orange's three-line format
(:download:`iris.tab <../../../../Orange/datasets/iris.tab>`):

.. literalinclude:: ../../../../Orange/datasets/iris.tab
   :lines: 1-7
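
A three-line header can also be produced programmatically. The sketch below
uses only Python's standard ``csv`` module to build a small tab-separated table.
The column names, the data rows, and the helper ``discrete_values_cell`` are
hypothetical and serve only to illustrate the layout, including the backslash
escape for a discrete value that contains a space.

.. code-block:: python

   import csv
   import io


   def discrete_values_cell(values):
       """Join discrete values with spaces, escaping spaces inside a value."""
       return " ".join(value.replace(" ", "\\ ") for value in values)


   header = [
       # Line 1: feature names.
       ["name", "age", "city", "survived"],
       # Line 2: feature types (an explicit value list for "city").
       ["string", "continuous",
        discrete_values_cell(["New York", "Boston"]), "d"],
       # Line 3: flags (empty cells mean an ordinary feature).
       ["meta", "", "", "class"],
   ]
   data = [
       ["Ann", "34", "New York", "yes"],
       ["Bob", "51", "Boston", "no"],
   ]

   buffer = io.StringIO()
   writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
   writer.writerows(header + data)
   tab_text = buffer.getvalue()
   print(tab_text)

Writing ``tab_text`` to a file with a ``.tab`` extension gives a file in the
three-line layout described above.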


Single-line header format
-------------------------

A single-line header consists of feature names, each optionally prefixed by a
"``<flags>#``" string, i.e. flags followed by a hash ('#') sign. The flags can
be a consistent combination of:

* ``c`` for class feature (also known as a target variable or dependent variable),
* ``i`` for feature to be ignored,
* ``m`` for meta attributes (not used in learning),
* ``C`` for features that are continuous (numeric),
* ``D`` for features that are discrete (categorical),
* ``T`` for features that represent date and/or time in one of the ISO 8601
  formats,
* ``S`` for string features.

If some or all of the names or flags are omitted, the missing names, types, and
flags are inferred automatically; the inference is usually, but not always,
correct.
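
To make the "``<flags>#``" prefix concrete, the following sketch splits a
header cell into its flags and its feature name. It is a simplified
illustration and not Orange's actual parser: the function
``split_header_cell`` and the table ``FLAG_MEANINGS`` are hypothetical, and a
prefix is treated as flags only if every character in it is one of the letters
listed above.

.. code-block:: python

   # Single-letter flags from the list above, mapped to readable labels.
   FLAG_MEANINGS = {
       "c": "class",
       "i": "ignore",
       "m": "meta",
       "C": "continuous",
       "D": "discrete",
       "T": "time",
       "S": "string",
   }


   def split_header_cell(cell):
       """Return (list of flag labels, feature name) for one header cell."""
       prefix, sep, name = cell.partition("#")
       if sep and prefix and all(ch in FLAG_MEANINGS for ch in prefix):
           return [FLAG_MEANINGS[ch] for ch in prefix], name
       # No recognizable flag prefix: the whole cell is the feature name.
       return [], cell


   header = "C#sepal length,cD#iris,mS#note,petal width"
   parsed = [split_header_cell(cell) for cell in header.split(",")]
   for flags, name in parsed:
       print(name, flags)

A cell without a prefix, such as ``petal width`` here, yields an empty flag
list, which corresponds to the case where the type and role are left to
automatic detection.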