.. |add_index| replace:: :func:`~astropy.table.Table.add_index`
.. |index_mode| replace:: :func:`~astropy.table.Table.index_mode`

.. _table-indexing:

Table Indexing
**************

Once a |Table| has been created, it is possible to create indices on one or
more columns of the table. An index internally sorts the rows of a table based
on the index column(s), allowing for element retrieval by column value and
improved performance for certain table operations.
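
The idea can be illustrated without astropy. The following is a conceptual
sketch only, using the standard-library ``bisect`` module; it is not how
astropy implements its index engines, and the names ``values``, ``order``,
``sorted_keys``, and ``rows_equal_to`` are made up for illustration:

```python
from bisect import bisect_left, bisect_right

# Column values in table (unsorted) order.
values = [2, 3, 2, 1]

# "Index": row numbers ordered by column value, plus the sorted keys.
order = sorted(range(len(values)), key=lambda i: values[i])
sorted_keys = [values[i] for i in order]


def rows_equal_to(key):
    """Return the original row numbers whose value equals ``key``."""
    lo = bisect_left(sorted_keys, key)
    hi = bisect_right(sorted_keys, key)
    return order[lo:hi]


print(rows_equal_to(2))  # row numbers 0 and 2
```

Because the keys are kept sorted, a lookup needs only a binary search rather
than a scan over every row.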

Creating an Index
=================

.. EXAMPLE START: Creating Indexes on Table Columns

To create an index on a table, use the |add_index| method::

   >>> from astropy.table import Table
   >>> t = Table([(2, 3, 2, 1), (8, 7, 6, 5)], names=('a', 'b'))
   >>> t.add_index('a')

Pass the optional argument ``unique=True`` to create an index whose values
must be unique.

To create a composite index on multiple columns, pass a list of columns
instead::

   >>> t.add_index(['a', 'b'])

The first index created with the |add_index| method is treated as the default
index, or "primary key". To retrieve an index from a table, use the
`~astropy.table.Table.indices` property::

   >>> t.indices['a']
   <SlicedIndex original=True index=<Index columns=('a',) data=<SortedArray length=4>
    a  rows
   --- ----
     1    3
     2    0
     2    2
     3    1>>
   >>> t.indices['a', 'b']
   <SlicedIndex original=True index=<Index columns=('a', 'b') data=<SortedArray length=4>
    a   b  rows
   --- --- ----
     1   5    3
     2   6    2
     2   8    0
     3   7    1>>

.. EXAMPLE END

Row Retrieval using Indices
===========================

.. EXAMPLE START: Retrieving Table Rows using Indices

Rows can be retrieved using two table properties:
`~astropy.table.Table.loc` and `~astropy.table.Table.iloc`. The
`~astropy.table.Table.loc` property can be indexed by a column value, a
range of column values (*including* the bounds), or a :class:`list` or
|ndarray| of column values::

   >>> t = Table([(1, 2, 3, 4), (10, 1, 9, 9)], names=('a', 'b'), dtype=['i8', 'i8'])
   >>> t.add_index('a')
   >>> t.loc[2]  # the row(s) where a == 2
   <Row index=1>
     a     b
   int64 int64
   ----- -----
       2     1
   >>> t.loc[[1, 4]]  # the row(s) where a in [1, 4]
   <Table length=2>
     a     b
   int64 int64
   ----- -----
       1    10
       4     9
   >>> t.loc[1:3]  # the row(s) where a in [1, 2, 3]
   <Table length=3>
     a     b
   int64 int64
   ----- -----
       1    10
       2     1
       3     9
   >>> t.loc[:]
   <Table length=4>
     a     b
   int64 int64
   ----- -----
       1    10
       2     1
       3     9
       4     9

Note that by default, `~astropy.table.Table.loc` uses the primary index, which
here is column ``'a'``. To use a different index, pass the indexed column name
before the retrieval data::

   >>> t.add_index('b')
   >>> t.loc['b', 8:10]
   <Table length=3>
     a     b
   int64 int64
   ----- -----
       3     9
       4     9
       1    10

The property `~astropy.table.Table.iloc` works similarly, except that the
retrieval information must be either an integer or a :class:`slice`, and
relates to the sorted order of the index rather than column values. For
example::

   >>> t.iloc[0]  # row with the smallest value of 'a'
   <Row index=0>
     a     b
   int64 int64
   ----- -----
       1    10
   >>> t.iloc['b', 1:]  # all rows except the one with the smallest value of 'b'
   <Table length=3>
     a     b
   int64 int64
   ----- -----
       3     9
       4     9
       1    10

.. EXAMPLE END

Effects on Performance
======================

Table operations change somewhat when indices are present, and there are a
number of factors to consider when deciding whether the use of indices will
improve performance. In general, indexing offers the following advantages:

* Table grouping and sorting based on indexed column(s) both become faster.
* Retrieving rows by indexed value is faster than a hand-written search over
  the column.

There are certain caveats, however:

* Creating an index requires time and memory.
* Table modifications become slower due to automatic index updates.
* Slicing a table becomes slower due to index relabeling.

See `this IPython notebook
<https://nbviewer.jupyter.org/github/mdmueller/astropy-notebooks/blob/master/table/indexing-profiling.ipynb>`_
for profiling of various aspects of table indexing.

Index Modes
===========

The |index_mode| method allows for some flexibility in the behavior of table
indexing by allowing the user to enter a specific indexing mode via a context
manager. There are currently three indexing modes: ``'freeze'``,
``'copy_on_getitem'``, and ``'discard_on_copy'``.

.. EXAMPLE START: Table Indexing with the "freeze" Index Mode

The ``'freeze'`` mode prevents automatic index updates whenever a column of the
index is modified, and all indices refresh themselves after the context ends::

  >>> with t.index_mode('freeze'):
  ...    t['a'][0] = 0
  ...    print(t.indices['a']) # unmodified
  <SlicedIndex original=True index=<Index columns=('a',) data=<SortedArray length=4>
   a  rows
  --- ----
    1    0
    2    1
    3    2
    4    3>>
  >>> print(t.indices['a']) # modified
  <SlicedIndex original=True index=<Index columns=('a',) data=<SortedArray length=4>
   a  rows
  --- ----
    0    0
    2    1
    3    2
    4    3>>

.. EXAMPLE END

.. EXAMPLE START: Table Indexing with the "copy_on_getitem" Index Mode

The ``'copy_on_getitem'`` mode forces columns to copy and relabel their indices
upon slicing. In the absence of this mode, table slices will preserve
indices while column slices will not::

  >>> ca = t['a'][[1, 3]]
  >>> ca.info.indices
  []
  >>> with t.index_mode('copy_on_getitem'):
  ...     ca = t['a'][[1, 3]]
  ...     print(ca.info.indices)
  [<SlicedIndex original=True index=<Index columns=('a',) data=<SortedArray length=2>
   a  rows
  --- ----
    2    0
    4    1>>]

.. EXAMPLE END

.. EXAMPLE START: Table Indexing with the "discard_on_copy" Index Mode

The ``'discard_on_copy'`` mode prevents indices from being copied whenever a
column or table is copied::

  >>> t2 = Table(t)
  >>> t2.indices['a']
  <SlicedIndex original=True index=<Index columns=('a',) data=<SortedArray length=4>
   a  rows
  --- ----
    0    0
    2    1
    3    2
    4    3>>
  >>> with t.index_mode('discard_on_copy'):
  ...    t2 = Table(t)
  ...    print(t2.indices)
  []

.. EXAMPLE END

Updating Rows using Indices
===========================

.. EXAMPLE START: Updating Table Rows using Indices

To update rows, assign a complete row or a list of rows to the table property
`~astropy.table.Table.loc`::

   >>> t = Table([('w', 'x', 'y', 'z'), (10, 1, 9, 9)], names=('a', 'b'), dtype=['str', 'i8'])
   >>> t.add_index('a')
   >>> t.loc['x']
   <Row index=1>
    a     b
   str1 int64
   ---- -----
      x     1
   >>> t.loc['x'] = ['a', 12]
   >>> t
   <Table length=4>
    a     b
   str1 int64
   ---- -----
      w    10
      a    12
      y     9
      z     9
   >>> t.loc[['w', 'y']]
   <Table length=2>
    a     b
   str1 int64
   ---- -----
      w    10
      y     9
   >>> t.loc[['w', 'z']] = [['b', 23], ['c', 56]]
   >>> t
   <Table length=4>
    a     b
   str1 int64
   ---- -----
      b    23
      a    12
      y     9
      c    56

.. EXAMPLE END

Retrieving the Location of Rows using Indices
=============================================

.. EXAMPLE START: Retrieving the Location of Table Rows using Indices

The positions of rows within the table can be retrieved using the
`~astropy.table.Table.loc_indices` table property. It can be indexed by a
column value, a range of column values (*including* the bounds), or a
:class:`list` or |ndarray| of column values::

   >>> t = Table([('w', 'x', 'y', 'z'), (10, 1, 9, 9)], names=('a', 'b'), dtype=['str', 'i8'])
   >>> t.add_index('a')
   >>> t.loc_indices['x']
   1

.. EXAMPLE END

Engines
=======

When creating an index via |add_index|, the keyword argument ``engine`` may be
specified to use a particular indexing engine. The available engines are:

* `~astropy.table.SortedArray`, a sorted array engine using an underlying
  sorted |Table|.
* `~astropy.table.SCEngine`, a sorted list engine using the `Sorted Containers
  <https://pypi.org/project/sortedcontainers/>`_ package.
* `~astropy.table.BST`, a Python-based binary search tree engine (not recommended).

`~astropy.table.SCEngine` requires the optional ``sortedcontainers`` package.
`~astropy.table.SortedArray` (the default engine) is usually the best choice,
although `~astropy.table.SCEngine` may be more appropriate for an index created
on an empty column, since adding new values is quicker.

The `~astropy.table.BST` engine demonstrates a simple pure-Python implementation
of a search tree engine, but its performance is poor for larger tables. It
is kept in the code largely as an implementation reference.