File: v0.10.1.rst

.. _whatsnew_0101:

Version 0.10.1 (January 22, 2013)
---------------------------------

{{ header }}


This is a minor release from 0.10.0 and includes new features, enhancements,
and bug fixes. In particular, there is substantial new HDFStore functionality
contributed by Jeff Reback.

An undesired API breakage with functions taking the ``inplace`` option has been
reverted and deprecation warnings added.

API changes
~~~~~~~~~~~

- Functions taking an ``inplace`` option return the calling object as before. A
  deprecation message has been added.
- Groupby aggregations Max/Min no longer exclude non-numeric data (:issue:`2700`)
- Resampling an empty DataFrame now returns an empty DataFrame instead of
  raising an exception (:issue:`2640`)
- The file reader will now raise an exception when NA values are found in an
  explicitly specified integer column instead of converting the column to float
  (:issue:`2631`)
- ``DatetimeIndex.unique`` now returns a ``DatetimeIndex`` with the same name and
  timezone instead of an array (:issue:`2563`)
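
As a minimal sketch of the resampling change above (written with the modern
``.resample(...).mean()`` spelling rather than the 0.10-era ``how=`` keyword):

```python
import pandas as pd

# Resampling an empty DataFrame yields an empty DataFrame rather
# than raising (GH2640); a minimal illustration.
empty = pd.DataFrame(
    {"value": pd.Series(dtype="float64")},
    index=pd.DatetimeIndex([], name="ts"),
)
result = empty.resample("D").mean()
print(result.empty)  # True
```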

New features
~~~~~~~~~~~~

- MySQL database support (contributed by Dan Allan)

HDFStore
~~~~~~~~

You may need to upgrade your existing data files. Please visit the
**compatibility** section in the main docs.


.. ipython:: python
   :suppress:
   :okexcept:

   import os

   os.remove("store.h5")

You can designate (and index) certain columns of a table that you want to be
able to perform queries on, by passing a list to ``data_columns``

.. ipython:: python

   store = pd.HDFStore("store.h5")
   df = pd.DataFrame(
       np.random.randn(8, 3),
       index=pd.date_range("1/1/2000", periods=8),
       columns=["A", "B", "C"],
   )
   df["string"] = "foo"
   df.loc[df.index[4:6], "string"] = np.nan
   df.loc[df.index[7:9], "string"] = "bar"
   df["string2"] = "cool"
   df

   # on-disk operations
   store.append("df", df, data_columns=["B", "C", "string", "string2"])
   store.select("df", "B>0 and string=='foo'")

   # this is in-memory version of this type of selection
   df[(df.B > 0) & (df.string == "foo")]

Retrieving unique values in an indexable or data column.

.. code-block:: python

   # note that this is deprecated as of 0.14.0
   # can be replicated by: store.select_column('df','index').unique()
   store.unique("df", "index")
   store.unique("df", "string")

You can now store ``datetime64`` in data columns

.. ipython:: python

    df_mixed = df.copy()
    df_mixed["datetime64"] = pd.Timestamp("20010102")
    df_mixed.loc[df_mixed.index[3:4], ["A", "B"]] = np.nan

    store.append("df_mixed", df_mixed)
    df_mixed1 = store.select("df_mixed")
    df_mixed1
    df_mixed1.dtypes.value_counts()

You can pass the ``columns`` keyword to ``select`` to filter the returned
columns; this is equivalent to passing a
``Term('columns', list_of_columns_to_filter)``

.. ipython:: python

   store.select("df", columns=["A", "B"])

``HDFStore`` now serializes MultiIndex DataFrames when appending tables.

.. code-block:: ipython

    In [19]: index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
       ....:                               ['one', 'two', 'three']],
       ....:                       labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
       ....:                               [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
       ....:                       names=['foo', 'bar'])
       ....:

    In [20]: df = pd.DataFrame(np.random.randn(10, 3), index=index,
       ....:                   columns=['A', 'B', 'C'])
       ....:

    In [21]: df
    Out[21]:
                      A         B         C
    foo bar
    foo one   -0.116619  0.295575 -1.047704
        two    1.640556  1.905836  2.772115
        three  0.088787 -1.144197 -0.633372
    bar one    0.925372 -0.006438 -0.820408
        two   -0.600874 -1.039266  0.824758
    baz two   -0.824095 -0.337730 -0.927764
        three -0.840123  0.248505 -0.109250
    qux one    0.431977 -0.460710  0.336505
        two   -3.207595 -1.535854  0.409769
        three -0.673145 -0.741113 -0.110891

    In [22]: store.append('mi', df)

    In [23]: store.select('mi')
    Out[23]:
                      A         B         C
    foo bar
    foo one   -0.116619  0.295575 -1.047704
        two    1.640556  1.905836  2.772115
        three  0.088787 -1.144197 -0.633372
    bar one    0.925372 -0.006438 -0.820408
        two   -0.600874 -1.039266  0.824758
    baz two   -0.824095 -0.337730 -0.927764
        three -0.840123  0.248505 -0.109250
    qux one    0.431977 -0.460710  0.336505
        two   -3.207595 -1.535854  0.409769
        three -0.673145 -0.741113 -0.110891

    # the levels are automatically included as data columns
    In [24]: store.select('mi', "foo='bar'")
    Out[24]:
                    A         B         C
    foo bar
    bar one  0.925372 -0.006438 -0.820408
        two -0.600874 -1.039266  0.824758

Multiple tables can be created via ``append_to_multiple`` and selected from via
``select_as_multiple``, returning a combined result by applying a ``where``
clause to a selector table.

.. ipython:: python

   df_mt = pd.DataFrame(
       np.random.randn(8, 6),
       index=pd.date_range("1/1/2000", periods=8),
       columns=["A", "B", "C", "D", "E", "F"],
   )
   df_mt["foo"] = "bar"

   # you can also create the tables individually
   store.append_to_multiple(
       {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
   )
   store

   # individual tables were created
   store.select("df1_mt")
   store.select("df2_mt")

   # as a multiple
   store.select_as_multiple(
       ["df1_mt", "df2_mt"], where=["A>0", "B>0"], selector="df1_mt"
   )

.. ipython:: python
   :suppress:

   store.close()
   os.remove("store.h5")

**Enhancements**

- ``HDFStore`` now can read native PyTables table format tables

- You can pass ``nan_rep = 'my_nan_rep'`` to ``append``, to change the default
  NaN representation on disk (which converts to/from ``np.nan``); this defaults
  to ``nan``.

- You can pass ``index`` to ``append``. This defaults to ``True``. This will
  automagically create indices on the *indexables* and *data columns* of the
  table.

- You can pass ``chunksize=an integer`` to ``append``, to change the writing
  chunksize (default is 50000). This will significantly lower your memory usage
  on writing.

- You can pass ``expectedrows=an integer`` to the first ``append`` to set the
  TOTAL number of rows that ``PyTables`` will expect. This will optimize
  read/write performance.

- ``select`` now supports passing ``start`` and ``stop`` to limit the selection
  space.

- Greatly improved ISO8601 (e.g., yyyy-mm-dd) date parsing for file parsers (:issue:`2698`)
- Allow ``DataFrame.merge`` to handle combinatorial sizes too large for 64-bit
  integer (:issue:`2690`)
- Series now has unary negation (-series) and inversion (~series) operators (:issue:`2686`)
- DataFrame.plot now includes a ``logx`` parameter to change the x-axis to log scale (:issue:`2327`)
- Series arithmetic operators can now handle constant and ndarray input (:issue:`2574`)
- ExcelFile now takes a ``kind`` argument to specify the file type (:issue:`2613`)
- A faster implementation for Series.str methods (:issue:`2602`)
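
The new Series unary operators can be sketched with a minimal example:

```python
import pandas as pd

# Unary negation and inversion on Series (GH2686): ``-series``
# negates numeric values element-wise, and ``~series`` inverts
# boolean values element-wise.
s = pd.Series([1, -2, 3])
print((-s).tolist())   # [-1, 2, -3]

b = pd.Series([True, False, True])
print((~b).tolist())   # [False, True, False]
```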

**Bug Fixes**

- ``HDFStore`` tables can now store ``float32`` types correctly (cannot be
  mixed with ``float64`` however)
- Fixed Google Analytics prefix when specifying request segment (:issue:`2713`).
- Function to reset Google Analytics token store so users can recover from
  improperly setup client secrets (:issue:`2687`).
- Fixed groupby bug resulting in segfault when passing in MultiIndex (:issue:`2706`)
- Fixed bug where passing a Series with datetime64 values into ``to_datetime``
  results in bogus output values (:issue:`2699`)
- Fixed bug in ``pattern in HDFStore`` expressions when pattern is not a valid
  regex (:issue:`2694`)
- Fixed performance issues while aggregating boolean data (:issue:`2692`)
- When given a boolean mask key and a Series of new values, Series __setitem__
  will now align the incoming values with the original Series (:issue:`2686`)
- Fixed MemoryError caused by performing counting sort on sorting MultiIndex
  levels with a very large number of combinatorial values (:issue:`2684`)
- Fixed bug that causes plotting to fail when the index is a DatetimeIndex with
  a fixed-offset timezone (:issue:`2683`)
- Corrected business day subtraction logic when the offset is more than 5 bdays
  and the starting date is on a weekend (:issue:`2680`)
- Fixed C file parser behavior when the file has more columns than data
  (:issue:`2668`)
- Fixed file reader bug that misaligned columns with data in the presence of an
  implicit column and a specified ``usecols`` value
- DataFrames with numerical or datetime indices are now sorted prior to
  plotting (:issue:`2609`)
- Fixed DataFrame.from_records error when passed columns, index, but empty
  records (:issue:`2633`)
- Several bugs fixed for Series operations when dtype is datetime64 (:issue:`2689`,
  :issue:`2629`, :issue:`2626`)
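
The ``__setitem__`` alignment behavior above can be illustrated with a small
sketch: when the value assigned through a boolean mask is itself a Series, it
is aligned on index, not taken positionally.

```python
import pandas as pd

# Assigning a Series through a boolean mask aligns on index (GH2686):
# the values land at labels "c" and "d" by label, not by position.
s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
s[s > 2] = pd.Series([40, 30], index=["d", "c"])
print(s.tolist())  # [1, 2, 30, 40]
```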


See the :ref:`full release notes
<release>` or issue tracker
on GitHub for a complete list.


.. _whatsnew_0.10.1.contributors:

Contributors
~~~~~~~~~~~~

.. contributors:: v0.10.0..v0.10.1