.. currentmodule:: Orange.data.storage

##########################
Data Storage (``storage``)
##########################

:obj:`Orange.data.storage.Storage` is an abstract class representing a data object
in which rows represent data instances (examples, in machine learning
terminology) and columns represent variables (features, attributes, classes,
targets, meta attributes).

Data is divided into three parts that represent independent variables (`X`),
dependent variables (`Y`) and meta data (`metas`). If practical, the class
should expose those parts as properties. In the associated domain
(:obj:`Orange.data.Domain`), the three parts correspond to lists of variable
descriptors `attributes`, `class_vars` and `metas`.

Any of those parts may be missing, dense, sparse or sparse boolean. The
difference between the latter two is that sparse data can be seen as a list
of (variable, value) pairs, while in sparse boolean data a variable (item) is
only present or absent, as in market basket analysis. The actual storage of
sparse data depends upon the storage type.
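
As a rough illustration of the three layouts, the same row can be written in
each of them. Plain Python containers stand in for the real storage here, and
the variable names are made up for the example.

```python
# Hypothetical variables, used only for illustration.
variables = ["bread", "milk", "eggs", "beer"]

# Dense: one value per variable, in domain order.
dense_row = [1.0, 0.0, 0.0, 3.0]

# Sparse: only the non-zero entries, as (variable index, value) pairs.
sparse_row = [(i, v) for i, v in enumerate(dense_row) if v != 0]

# Sparse boolean: only the presence of an item is recorded.
sparse_bool_row = {variables[i] for i, _ in sparse_row}

print(sparse_row)               # [(0, 1.0), (3, 3.0)]
print(sorted(sparse_bool_row))  # ['beer', 'bread']
```

Note that the sparse boolean form no longer carries the value 3.0 for `beer`;
it only says that the item is present.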

There is no uniform constructor signature: every derived class provides one or
more specific constructors.

There are currently two derived classes, :obj:`Orange.data.Table` and
:obj:`Orange.data.sql.Table`; the former stores the data in memory, in numpy
objects, and the latter in SQL (currently, only PostgreSQL is supported).

Derived classes must implement at least the methods for getting rows and the
number of instances (`__getitem__` and `__len__`). To make storage fast enough
to be practically useful, it must also reimplement a number of filters,
preprocessors and aggregators. For instance, the method
`_filter_values(self, filter)` returns a new storage which only contains the
rows that match the criteria given in the filter. :obj:`Orange.data.Table`
implements an efficient method based on numpy indexing, and
:obj:`Orange.data.sql.Table`, which "stores" a table as an SQL query, converts
the filter into a WHERE clause.
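
The two strategies can be sketched with a toy example. `ListStorage` and
`where_clause` below are hypothetical stand-ins, not Orange classes: the
filter is simplified to a plain predicate or to (column, operator, value)
triples, and the `%s` placeholder style is an assumption about the database
driver.

```python
class ListStorage:
    """Toy row storage; a hypothetical stand-in, not an Orange class."""

    def __init__(self, columns, rows):
        self.columns = list(columns)          # column names
        self.rows = [tuple(r) for r in rows]  # one tuple per instance

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        if isinstance(index, int):
            return self.rows[index]           # a single row
        if isinstance(index, slice):
            return ListStorage(self.columns, self.rows[index])
        return ListStorage(self.columns, [self.rows[i] for i in index])

    def _filter_values(self, predicate):
        # In-memory strategy: evaluate the predicate row by row.
        kept = [r for r in self.rows if predicate(r)]
        return ListStorage(self.columns, kept)


def where_clause(conditions):
    """SQL-style strategy: build a WHERE clause from condition triples.

    Values are returned separately as query parameters.
    """
    allowed = {"=", "<", ">", "<=", ">=", "<>"}
    parts, params = [], []
    for column, op, value in conditions:
        if op not in allowed:
            raise ValueError(f"unsupported operator: {op}")
        parts.append(f'"{column}" {op} %s')
        params.append(value)
    return "WHERE " + " AND ".join(parts), params


data = ListStorage(["age", "gender"], [(31, "f"), (45, "m"), (27, "f")])
young = data._filter_values(lambda row: row[0] < 40)
print(len(young))                        # 2
print(where_clause([("age", "<", 40)]))  # ('WHERE "age" < %s', [40])
```

In both cases the result describes a new set of rows and the original storage
is left unchanged.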

.. attribute:: domain (:obj:`Orange.data.Domain`)

    The domain describing the columns of the data.


Data access
-----------

.. method:: __getitem__(self, index)

    Return one or more rows of the data.

    - If the index is an int, e.g. `data[7]`, the corresponding row is
      returned as an instance of :obj:`~Orange.data.instance.Instance`. Concrete
      implementations of `Storage` use specific derived classes for instances.

    - If the index is a slice or a sequence of ints (e.g. `data[7:10]` or
      `data[[7, 42, 15]]`), indexing returns a new storage with the selected
      rows.

    - If there are two indices, where the first is an int (a row number) and
      the second can be interpreted as columns, e.g. `data[3, 5]` or
      `data[3, 'gender']` or `data[3, y]` (where `y` is an instance of
      :obj:`~Orange.data.Variable`), a single value is returned as an instance
      of :obj:`~Orange.data.Value`.

    - In all other cases, the first index should be a row index, a slice or
      a sequence, and the second index, which represents a set of columns,
      should be an int, a slice, a sequence or a numpy array. The result is
      a new storage with a new domain.
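
The rules above can be summarized in a small dispatch sketch. It is only an
illustration of what `__getitem__` receives from Python, not the code of any
Orange class: `describe_index` and `IndexProbe` are hypothetical names, and
column descriptors are reduced to ints and strings.

```python
def describe_index(index):
    """Return which kind of result the indexing rules describe."""
    if isinstance(index, int):
        return "row instance"
    if not isinstance(index, tuple):
        # A slice or a sequence of row numbers.
        return "storage with selected rows"
    row, col = index  # two indices arrive as a single tuple
    if isinstance(row, int) and isinstance(col, (int, str)):
        return "single value"
    return "storage with a new domain"


class IndexProbe:
    """Hypothetical helper that reports how an index would be handled."""

    def __getitem__(self, index):
        return describe_index(index)


probe = IndexProbe()
print(probe[7])               # row instance
print(probe[7:10])            # storage with selected rows
print(probe[[7, 42, 15]])     # storage with selected rows
print(probe[3, "gender"])     # single value
print(probe[:10, [0, 1]])     # storage with a new domain
```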

.. method:: __len__(self)

    Return the number of data instances (rows).


Inspection
----------

.. method:: Storage.X_density, Storage.Y_density, Storage.metas_density

    Indicate whether the attributes, classes and meta attributes are dense
    (`Storage.DENSE`) or sparse (`Storage.SPARSE`). If they are sparse and
    all values are 0 or 1, they are marked as `Storage.SPARSE_BOOL`. The
    `Storage` class provides a default, `Storage.DENSE`. If the data has no
    attributes, classes or meta attributes, the corresponding method should re


Filters
-------

Storage should define the following methods to optimize the filtering
operations as allowed by the underlying data structure.
:obj:`Orange.data.Table` executes them directly through numpy (or bottleneck
or related) methods, while :obj:`Orange.data.sql.Table` appends them to the
WHERE clause of the query that defines the data.

These methods should not be called directly but through the classes defined in
:obj:`Orange.data.filter`. The classes in :obj:`Orange.data.filter` also provide
slower fallback implementations for the methods not defined by the storage.
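
The fast-path-with-fallback pattern can be sketched as follows. This is an
assumption about the general shape of such a wrapper, not the code in
:obj:`Orange.data.filter`: `filter_is_defined` is a hypothetical function,
unknown values are represented by NaN, and the fallback returns a plain list
of rows rather than a storage.

```python
import math


def filter_is_defined(storage, columns=None, negate=False):
    """Hypothetical wrapper: use the storage's fast path if it has one."""
    fast = getattr(storage, "_filter_is_defined", None)
    if fast is not None:
        return fast(columns=columns, negate=negate)

    # Slow fallback: inspect every row through __len__ and __getitem__.
    selected = []
    for i in range(len(storage)):
        row = storage[i]
        cols = range(len(row)) if columns is None else columns
        complete = not any(
            isinstance(row[c], float) and math.isnan(row[c]) for c in cols
        )
        if complete != negate:  # negate inverts the selection
            selected.append(row)
    return selected


# A plain list of tuples has no fast path, so the fallback is used.
rows = [(1.0, 2.0), (float("nan"), 5.0), (3.0, float("nan"))]
print(filter_is_defined(rows))                    # [(1.0, 2.0)]
print(len(filter_is_defined(rows, columns=[1])))  # 2
```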

.. method:: _filter_is_defined(self, columns=None, negate=False)

    Extract rows without undefined values.

    :param columns: optional list of columns that are checked for unknowns
    :type columns: sequence of ints, variable names or descriptors
    :param negate: invert the selection
    :type negate: bool
    :return: a new storage of the same type or :obj:`~Orange.data.Table`
    :rtype: Orange.data.storage.Storage


.. method:: _filter_has_class(self, negate=False)

    Return rows with known value of the target attribute. If there are multiple
    classes, all must be defined.

    :param negate: invert the selection
    :type negate: bool
    :return: a new storage of the same type or :obj:`~Orange.data.Table`
    :rtype: Orange.data.storage.Storage


.. method:: _filter_same_value(self, column, value, negate=False)

    Select rows based on a value of the given variable.

    :param column: the column that is checked
    :type column: int, str or Orange.data.Variable
    :param value: the value of the variable
    :type value: int, float or str
    :param negate: invert the selection
    :type negate: bool
    :return: a new storage of the same type or :obj:`~Orange.data.Table`
    :rtype: Orange.data.storage.Storage


.. method:: _filter_values(self, filter)

    Apply the given filter to the data.

    :param filter: A filter for selecting the rows
    :type filter: Orange.data.Filter
    :return: a new storage of the same type or :obj:`~Orange.data.Table`
    :rtype: Orange.data.storage.Storage


Aggregators
-----------

Similarly to filters, storage classes should provide several methods for fast
computation of statistics. These methods are not called directly but by modules
within :obj:`Orange.statistics`.

.. method:: _compute_basic_stats(self, columns=None, include_metas=False, compute_variance=False)

    Compute basic statistics for the specified variables: the minimal and
    maximal value, the mean and the variance (or a zero placeholder), and the
    number of missing and defined values.

    :param columns: a list of columns for which the statistics are computed;
        if `None`, the function computes the statistics for all variables
    :type columns: list of ints, variable names or descriptors of type
        :obj:`Orange.data.Variable`
    :param include_metas: a flag which tells whether to include meta attributes
        (applicable only if `columns` is `None`)
    :type include_metas: bool
    :param compute_variance: a flag which tells whether to compute the variance
    :type compute_variance: bool
    :return: a list with a tuple (min, max, mean, variance, #nans, #non-nans)
        for each variable
    :rtype: list
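
A minimal pure-Python sketch of the returned structure is shown below. It is
not the Orange implementation: `compute_basic_stats` is a hypothetical
function that takes columns as lists of floats with NaN marking a missing
value, uses the population variance (division by the number of defined
values), and picks arbitrary placeholders for columns without defined values.

```python
import math


def compute_basic_stats(columns, compute_variance=False):
    """Return (min, max, mean, variance, #nans, #non-nans) per column."""
    stats = []
    for values in columns:
        known = [v for v in values if not math.isnan(v)]
        n = len(known)
        if n == 0:
            # Placeholder choice for an all-missing column (an assumption).
            stats.append((math.inf, -math.inf, 0.0, 0.0, len(values), 0))
            continue
        mean = sum(known) / n
        if compute_variance:
            variance = sum((v - mean) ** 2 for v in known) / n
        else:
            variance = 0.0  # zero placeholder when variance is not requested
        stats.append((min(known), max(known), mean, variance,
                      len(values) - n, n))
    return stats


column = [1.0, 2.0, float("nan"), 3.0]
print(compute_basic_stats([column]))  # [(1.0, 3.0, 2.0, 0.0, 1, 3)]
```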

.. method:: _compute_distributions(self, columns=None)

    Compute the distribution for the specified variables. The result is a list
    of pairs containing the distribution and the number of rows for which the
    variable value was missing.

    For discrete variables, the distribution is represented as a vector with
    absolute frequency of each value. For continuous variables, the result is
    a 2-d array of shape (2, number-of-distinct-values); the first row contains
    (distinct) values of the variables and the second has their absolute
    frequencies.

    :param columns: a list of columns for which the distributions are computed;
        if `None`, the function runs over all variables
    :type columns: list of ints, variable names or descriptors of type
        :obj:`Orange.data.Variable`
    :return: a list of distributions
    :rtype: list of numpy arrays
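
The two result shapes can be illustrated with a small sketch. The functions
below are hypothetical and use plain lists instead of numpy arrays. Discrete
values are assumed to be stored as value indices, NaN marks a missing value,
and sorting the distinct continuous values is also an assumption.

```python
import math
from collections import Counter


def discrete_distribution(column, n_values):
    """Return (absolute frequencies per value index, number of missing)."""
    counts = [0] * n_values
    missing = 0
    for v in column:
        if math.isnan(v):
            missing += 1
        else:
            counts[int(v)] += 1
    return counts, missing


def continuous_distribution(column):
    """Return ([distinct values, their frequencies], number of missing)."""
    known = [v for v in column if not math.isnan(v)]
    freq = Counter(known)
    values = sorted(freq)
    dist = [values, [freq[v] for v in values]]  # shape (2, n distinct)
    return dist, len(column) - len(known)


nan = float("nan")
print(discrete_distribution([0.0, 2.0, 2.0, nan], n_values=3))
# ([1, 0, 2], 1)
print(continuous_distribution([1.5, 0.5, 1.5, nan]))
# ([[0.5, 1.5], [1, 2]], 1)
```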

.. automethod:: Orange.data.storage.Storage._compute_contingency