.. _whatsnew_0700:

Version 0.7.0 (February 9, 2012)
--------------------------------

{{ header }}


New features
~~~~~~~~~~~~

- New unified :ref:`merge function <merging.join>` for efficiently performing
  the full gamut of database / relational-algebra operations. Refactored existing
  join methods to use the new infrastructure, resulting in substantial
  performance gains (:issue:`220`, :issue:`249`, :issue:`267`)

- New :ref:`unified concatenation function <merging.concat>` for concatenating
  Series, DataFrame or Panel objects along an axis. Can form union or
  intersection of the other axes. Improves performance of ``Series.append`` and
  ``DataFrame.append`` (:issue:`468`, :issue:`479`, :issue:`273`)

- Can pass multiple DataFrames to
  ``DataFrame.append`` to concatenate (stack) and multiple Series to
  ``Series.append`` too

- :ref:`Can<basics.dataframe.from_list_of_dicts>` pass list of dicts (e.g., a
  list of JSON objects) to DataFrame constructor (:issue:`526`)

- You can now :ref:`set multiple columns <indexing.columns.multiple>` in a
  DataFrame via ``__setitem__``, useful for transformation (:issue:`342`)

- Handle differently-indexed output values in ``DataFrame.apply`` (:issue:`498`)

.. code-block:: ipython

   In [1]: df = pd.DataFrame(np.random.randn(10, 4))
   In [2]: df.apply(lambda x: x.describe())
   Out[2]:
                  0          1          2          3
   count  10.000000  10.000000  10.000000  10.000000
   mean    0.190912  -0.395125  -0.731920  -0.403130
   std     0.730951   0.813266   1.112016   0.961912
   min    -0.861849  -2.104569  -1.776904  -1.469388
   25%    -0.411391  -0.698728  -1.501401  -1.076610
   50%     0.380863  -0.228039  -1.191943  -1.004091
   75%     0.658444   0.057974  -0.034326   0.461706
   max     1.212112   0.577046   1.643563   1.071804

   [8 rows x 4 columns]
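
As a minimal sketch of the merge, concatenation, and constructor features
listed above, the following uses small hypothetical frames (the column names
``key``, ``lval``, ``rval``, ``x`` and ``y`` are illustrative and not taken
from the release notes). It is written against a recent pandas release, so
details may differ from 0.7.0:

.. code-block:: python

   import pandas as pd

   # Hypothetical example data; the column names are illustrative only.
   left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
   right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [4, 5, 6]})

   # Inner join keeps only the keys present in both frames.
   inner = pd.merge(left, right, on="key", how="inner")

   # Outer join keeps the union of keys and fills the gaps with NaN.
   outer = pd.merge(left, right, on="key", how="outer")

   # Concatenate along the rows; join="inner" keeps only the shared columns.
   stacked = pd.concat([left, right], axis=0, join="inner")

   # A list of dicts (for example, parsed JSON objects) yields one row each.
   records = pd.DataFrame([{"x": 1, "y": 2}, {"x": 3, "y": 4}])

   # Assign several columns at once by passing a list of column labels.
   records[["x", "y"]] = records[["y", "x"]].to_numpy()

Here ``stacked`` contains only the shared ``key`` column, and the last line
swaps the contents of ``x`` and ``y`` in a single assignment.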

- :ref:`Add<advanced.reorderlevels>` ``reorder_levels`` method to Series and
  DataFrame (:issue:`534`)

- :ref:`Add<indexing.dictionarylike>` dict-like ``get`` function to DataFrame
  and Panel (:issue:`521`)

- :ref:`Add<basics.iterrows>` ``DataFrame.iterrows`` method for efficiently
  iterating through the rows of a DataFrame

- Add ``DataFrame.to_panel`` with code adapted from
  ``LongPanel.to_long``

- :ref:`Add <basics.reindexing>` ``reindex_axis`` method to DataFrame

- :ref:`Add <basics.stats>` ``level`` option to binary arithmetic functions on
  ``DataFrame`` and ``Series``

- :ref:`Add <advanced.advanced_reindex>` ``level`` option to the ``reindex``
  and ``align`` methods on Series and DataFrame for broadcasting values across
  a level (:issue:`542`, :issue:`552`, others)

- Add attribute-based item access to
  ``Panel`` and add IPython completion (:issue:`563`)

- :ref:`Add <visualization.basic>` ``logy`` option to ``Series.plot`` for
  log-scaling on the Y axis

- :ref:`Add <io.formatting>` ``index`` and ``header`` options to
  ``DataFrame.to_string``

- :ref:`Can <merging.multiple_join>` pass multiple DataFrames to
  ``DataFrame.join`` to join on index (:issue:`115`)

- :ref:`Can <merging.multiple_join>` pass multiple Panels to ``Panel.join``
  (:issue:`115`)

- :ref:`Added <io.formatting>` ``justify`` argument to ``DataFrame.to_string``
  to allow different alignment of column headers

- :ref:`Add <groupby.attributes>` ``sort`` option to GroupBy to allow disabling
  sorting of the group keys for potential speedups (:issue:`595`)

- :ref:`Can <basics.dataframe.from_series>` pass MaskedArray to Series
  constructor (:issue:`563`)

- Add Panel item access via attributes
  and IPython completion (:issue:`554`)

- Implement ``DataFrame.lookup``, fancy-indexing analogue for retrieving values
  given a sequence of row and column labels (:issue:`338`)

- Can pass a :ref:`list of functions <groupby.aggregate.multifunc>` to
  aggregate with groupby on a DataFrame, yielding an aggregated result with
  hierarchical columns (:issue:`166`)

- Can call ``cummin`` and ``cummax`` on Series and DataFrame to get cumulative
  minimum and maximum, respectively (:issue:`647`)

- ``value_range`` added as utility function to get min and max of a DataFrame
  (:issue:`288`)

- Added ``encoding`` argument to ``read_csv``, ``read_table``, ``to_csv`` and
  ``from_csv`` for non-ascii text (:issue:`717`)

- :ref:`Added <basics.stats>` ``abs`` method to pandas objects

- :ref:`Added <reshaping.pivot>` ``crosstab`` function for easily computing frequency tables

- :ref:`Added <indexing.set_ops>` ``isin`` method to index objects

- :ref:`Added <advanced.xs>` ``level`` argument to ``xs`` method of DataFrame.
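
Several of the additions listed above can be sketched together. This is a
rough illustration on hypothetical data (the column names ``group``, ``kind``
and ``value`` are made up), assuming a recent pandas release in which these
calls are still available with the spellings shown:

.. code-block:: python

   import pandas as pd

   # Hypothetical example data.
   df = pd.DataFrame(
       {
           "group": ["b", "a", "b", "a"],
           "kind": ["x", "x", "y", "y"],
           "value": [3.0, 1.0, 4.0, 2.0],
       }
   )

   # Cumulative minimum and maximum of a Series.
   running_min = df["value"].cummin()
   running_max = df["value"].cummax()

   # A list of functions produces hierarchical (two-level) columns.
   stats = df.groupby("group")[["value"]].agg(["min", "max"])

   # sort=False keeps the groups in order of first appearance.
   first_seen_order = (
       df.groupby("group", sort=False)["value"].sum().index.tolist()
   )

   # Frequency table of group against kind.
   table = pd.crosstab(df["group"], df["kind"])

   # Cross-section on one level of a hierarchical index.
   indexed = df.set_index(["group", "kind"]).sort_index()
   kind_x = indexed.xs("x", level="kind")

   # Membership test on an Index returns a boolean array.
   mask = pd.Index(["a", "b", "c"]).isin(["a", "c"])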


API changes to integer indexing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One of the potentially riskiest API changes in 0.7.0, but also one of the most
important, was a complete review of how **integer indexes** are handled with
regard to label-based indexing. Here is an example:

.. code-block:: ipython

    In [3]: s = pd.Series(np.random.randn(10), index=range(0, 20, 2))
    In [4]: s
    Out[4]:
    0    -1.294524
    2     0.413738
    4     0.276662
    6    -0.472035
    8    -0.013960
    10   -0.362543
    12   -0.006154
    14   -0.923061
    16    0.895717
    18    0.805244
    Length: 10, dtype: float64

    In [5]: s[0]
    Out[5]: -1.2945235902555294

    In [6]: s[2]
    Out[6]: 0.41373810535784006

    In [7]: s[4]
    Out[7]: 0.2766617129497566

All of this matches the previous behavior. However, if you ask for a key
**not** contained in the Series, versions 0.6.1 and prior would *fall back* on
a location-based lookup. This now raises a ``KeyError``:

.. code-block:: ipython

   In [2]: s[1]
   KeyError: 1

This change also has the same impact on DataFrame:

.. code-block:: ipython

   In [3]: df = pd.DataFrame(np.random.randn(8, 4), index=range(0, 16, 2))

   In [4]: df
       0        1       2       3
   0   0.88427  0.3363 -0.1787  0.03162
   2   0.14451 -0.1415  0.2504  0.58374
   4  -1.44779 -0.9186 -1.4996  0.27163
   6  -0.26598 -2.4184 -0.2658  0.11503
   8  -0.58776  0.3144 -0.8566  0.61941
   10  0.10940 -0.7175 -1.0108  0.47990
   12 -1.16919 -0.3087 -0.6049 -0.43544
   14 -0.07337  0.3410  0.0424 -0.16037

   In [5]: df.ix[3]
   KeyError: 3

In order to support purely integer-based indexing, the following methods have
been added:

.. csv-table::
    :header: "Method","Description"
    :widths: 40,60

        ``Series.iget_value(i)``, Retrieve value stored at location ``i``
        ``Series.iget(i)``, Alias for ``iget_value``
        ``DataFrame.irow(i)``, Retrieve the ``i``-th row
        ``DataFrame.icol(j)``, Retrieve the ``j``-th column
        "``DataFrame.iget_value(i, j)``", Retrieve the value at row ``i`` and column ``j``

API tweaks regarding label-based slicing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Label-based slicing using ``ix`` now requires that the index be sorted
(monotonic) **unless** both the start and endpoint are contained in the index:

.. code-block:: ipython

   In [1]: s = pd.Series(np.random.randn(6), index=list('gmkaec'))

   In [2]: s
   Out[2]:
   g   -1.182230
   m   -0.276183
   k   -0.243550
   a    1.628992
   e    0.073308
   c   -0.539890
   dtype: float64

Then this is OK:

.. code-block:: ipython

   In [3]: s.ix['k':'e']
   Out[3]:
   k   -0.243550
   a    1.628992
   e    0.073308
   dtype: float64

But this is not:

.. code-block:: ipython

   In [12]: s.ix['b':'h']
   KeyError: 'b'

If the index had been sorted, the "range selection" would have been possible:

.. code-block:: ipython

   In [4]: s2 = s.sort_index()

   In [5]: s2
   Out[5]:
   a    1.628992
   c   -0.539890
   e    0.073308
   g   -1.182230
   k   -0.243550
   m   -0.276183
   dtype: float64

   In [6]: s2.ix['b':'h']
   Out[6]:
   c   -0.539890
   e    0.073308
   g   -1.182230
   dtype: float64
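
The same rule can be checked programmatically. The sketch below assumes a
recent pandas release, where ``.loc`` is the label-based indexer in place of
``ix``; the values are hypothetical and only the index labels follow the
example above:

.. code-block:: python

   import pandas as pd

   s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=list("gmkaec"))

   # Both endpoints exist in the unsorted index, so the slice is well defined.
   between = s.loc["k":"e"]

   # "b" and "h" are missing and the index is not monotonic, so this fails.
   try:
       s.loc["b":"h"]
       unsorted_slice_ok = True
   except KeyError:
       unsorted_slice_ok = False

   # After sorting, missing endpoints are located by their sort position.
   s2 = s.sort_index()
   in_range = s2.loc["b":"h"]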

Changes to Series ``[]`` operator
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As a notational convenience, you can pass a sequence of labels or a label
slice to a Series when getting and setting values via ``[]`` (i.e. the
``__getitem__`` and ``__setitem__`` methods). The behavior will be the same as
passing similar input to ``ix`` **except in the case of integer indexing**:

.. code-block:: ipython

  In [8]: s = pd.Series(np.random.randn(6), index=list('acegkm'))

  In [9]: s
  Out[9]:
  a   -1.206412
  c    2.565646
  e    1.431256
  g    1.340309
  k   -1.170299
  m   -0.226169
  Length: 6, dtype: float64

  In [10]: s[['m', 'a', 'c', 'e']]
  Out[10]:
  m   -0.226169
  a   -1.206412
  c    2.565646
  e    1.431256
  Length: 4, dtype: float64

  In [11]: s['b':'l']
  Out[11]:
  c    2.565646
  e    1.431256
  g    1.340309
  k   -1.170299
  Length: 4, dtype: float64

  In [12]: s['c':'k']
  Out[12]:
  c    2.565646
  e    1.431256
  g    1.340309
  k   -1.170299
  Length: 4, dtype: float64

In the case of integer indexes, the behavior will be exactly as before
(shadowing ``ndarray``):

.. code-block:: ipython

  In [13]: s = pd.Series(np.random.randn(6), index=range(0, 12, 2))

  In [14]: s[[4, 0, 2]]
  Out[14]:
  4    0.132003
  0    0.410835
  2    0.813850
  Length: 3, dtype: float64

  In [15]: s[1:5]
  Out[15]:
  2    0.813850
  4    0.132003
  6   -0.827317
  8   -0.076467
  Length: 4, dtype: float64

If you wish to do indexing with sequences and slicing on an integer index with
label semantics, use ``ix``.
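
A short sketch of both cases follows, assuming a recent pandas release. There
``.loc`` and ``.iloc`` make the choice between label and position semantics
explicit on an integer index (they stand in for ``ix`` here and are not part
of 0.7.0); the values are hypothetical:

.. code-block:: python

   import pandas as pd

   s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=list("acegkm"))

   picked = s[["m", "a"]]   # sequence of labels, in the order requested
   sliced = s["b":"l"]      # label slice; the endpoints need not exist

   # Setting through [] accepts the same kinds of keys.
   s[["a", "c"]] = 0.0

   t = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0], index=range(0, 12, 2))
   by_position = t.iloc[1:5]   # positions 1 through 4
   by_label = t.loc[1:5]       # labels between 1 and 5, inclusive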

Other API changes
~~~~~~~~~~~~~~~~~

- The deprecated ``LongPanel`` class has been completely removed

- If ``Series.sort`` is called on a column of a DataFrame, an exception will
  now be raised. Previously, it was possible to accidentally mutate a
  DataFrame's column by doing ``df[col].sort()`` instead of the
  side-effect-free method ``df[col].order()`` (:issue:`316`)

- Miscellaneous renames and deprecations which will (harmlessly) raise
  ``FutureWarning``

- ``drop`` added as an optional parameter to ``DataFrame.reset_index`` (:issue:`699`)
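
As a small illustration of the ``drop`` parameter, assuming a recent pandas
release and a hypothetical single-column frame:

.. code-block:: python

   import pandas as pd

   df = pd.DataFrame({"value": [1, 2, 3]}, index=["x", "y", "z"])

   # By default the old index is kept as a new column named "index".
   kept = df.reset_index()

   # With drop=True the old index is discarded instead.
   dropped = df.reset_index(drop=True)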

Performance improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- :ref:`Cythonized GroupBy aggregations <groupby.aggregate.builtin>` no longer
  presort the data, thus achieving a significant speedup (:issue:`93`).  GroupBy
  aggregations with Python functions significantly sped up by clever
  manipulation of the ndarray data type in Cython (:issue:`496`).
- Better error message in DataFrame constructor when passed column labels
  don't match data (:issue:`497`)
- Substantially improve performance of multi-GroupBy aggregation when a
  Python function is passed, reuse ndarray object in Cython (:issue:`496`)
- Can store objects indexed by tuples and floats in HDFStore (:issue:`492`)
- Don't print length by default in ``Series.to_string``, add ``length`` option (:issue:`489`)
- Improve Cython code for multi-groupby to aggregate without having to sort
  the data (:issue:`93`)
- Improve MultiIndex reindexing speed by storing tuples in the MultiIndex,
  test for backwards unpickling compatibility
- Improve column reindexing performance by using specialized Cython take
  function
- Further performance tweaking of ``Series.__getitem__`` for standard use cases
- Avoid Index dict creation in some cases (e.g., when getting slices),
  regression from prior versions
- Friendlier error message in setup.py if NumPy not installed
- Use common set of NA-handling operations (sum, mean, etc.) in Panel class
  also (:issue:`536`)
- Default name assignment when calling ``reset_index`` on DataFrame with a
  regular (non-hierarchical) index (:issue:`476`)
- Use Cythonized groupers when possible in Series/DataFrame stat ops with
  ``level`` parameter passed (:issue:`545`)
- Ported skiplist data structure to C to speed up ``rolling_median`` by about
  5-10x in most typical use cases (:issue:`374`)


.. _whatsnew_0.7.0.contributors:

Contributors
~~~~~~~~~~~~

.. contributors:: v0.6.1..v0.7.0