.. _whatsnew_0130:

v0.13.0 (January 3, 2014)
-------------------------

This is a major release from 0.12.0 and includes a number of API changes, several new features and
enhancements along with a large number of bug fixes.

Highlights include:

- support for a new index type ``Float64Index``, and other Indexing enhancements
- ``HDFStore`` has a new string based syntax for query specification
- support for new methods of interpolation
- updated ``timedelta`` operations
- a new string manipulation method ``extract``
- Nanosecond support for Offsets
- ``isin`` for DataFrames

Several experimental features are added, including:

- new ``eval/query`` methods for expression evaluation
- support for ``msgpack`` serialization
- an i/o interface to Google's ``BigQuery``

There are several new or updated docs sections including:

- :ref:`Comparison with SQL`, which should be useful for those familiar with SQL but still learning pandas.
- :ref:`Comparison with R`, idiom translations from R to pandas.
- :ref:`Enhancing Performance`, ways to enhance pandas performance with ``eval/query``.

.. warning::

   In 0.13.0 ``Series`` has internally been refactored to no longer sub-class ``ndarray``
   but instead subclass ``NDFrame``, similar to the rest of the pandas containers. This should be
   a transparent change with only very limited API implications. See :ref:`Internal Refactoring`

API changes
~~~~~~~~~~~

- ``read_excel`` now supports an integer in its ``sheetname`` argument giving
  the index of the sheet to read in (:issue:`4301`).
- Text parser now treats anything that reads like inf ("inf", "Inf", "-Inf",
  "iNf", etc.) as infinity. (:issue:`4220`, :issue:`4219`), affecting
  ``read_table``, ``read_csv``, etc.
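  For example, a minimal sketch of the new behavior (the file contents here
  are hypothetical and constructed inline for illustration):

  .. code-block:: python

     from pandas.compat import StringIO

     # mixed-case spellings of infinity are all parsed as float inf
     data = 'A,B\n1.0,inf\n2.0,-Inf\n3.0,iNf'
     df = pd.read_csv(StringIO(data))
     df['B']  # a float64 column containing inf / -inf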
- ``pandas`` now is Python 2/3 compatible without the need for 2to3 thanks to
  @jtratner. As a result, pandas now uses iterators more extensively. This
  also led to the introduction of substantive parts of the Benjamin
  Peterson's ``six`` library into compat. (:issue:`4384`, :issue:`4375`,
  :issue:`4372`)
- ``pandas.util.compat`` and ``pandas.util.py3compat`` have been merged into
  ``pandas.compat``. ``pandas.compat`` now includes many functions allowing
  2/3 compatibility. It contains both list and iterator versions of range,
  filter, map and zip, plus other necessary elements for Python 3
  compatibility. ``lmap``, ``lzip``, ``lrange`` and ``lfilter`` all produce
  lists instead of iterators, for compatibility with ``numpy``, subscripting
  and ``pandas`` constructors. (:issue:`4384`, :issue:`4375`, :issue:`4372`)
- ``Series.get`` with negative indexers now returns the same as ``[]`` (:issue:`4390`)
- Changes to how ``Index`` and ``MultiIndex`` handle metadata (``levels``,
  ``labels``, and ``names``) (:issue:`4039`):

  .. code-block:: python

     # previously, you would have set levels or labels directly
     index.levels = [[1, 2, 3, 4], [1, 2, 4, 4]]

     # now, you use the set_levels or set_labels methods
     index = index.set_levels([[1, 2, 3, 4], [1, 2, 4, 4]])

     # similarly, for names, you can rename the object
     # but setting names is not deprecated
     index = index.set_names(["bob", "cranberry"])

     # and all methods take an inplace kwarg - but return None
     index.set_names(["bob", "cranberry"], inplace=True)

- **All** division with ``NDFrame`` objects is now *truedivision*, regardless
  of the future import. This means that operating on pandas objects will by
  default use *floating point* division, and return a floating point dtype.
  You can use ``//`` and ``floordiv`` to do integer division.

  Integer division

  .. code-block:: ipython

     In [3]: arr = np.array([1, 2, 3, 4])

     In [4]: arr2 = np.array([5, 3, 2, 1])

     In [5]: arr / arr2
     Out[5]: array([0, 0, 1, 4])

     In [6]: Series(arr) // Series(arr2)
     Out[6]:
     0    0
     1    0
     2    1
     3    4
     dtype: int64

  True Division

  .. code-block:: ipython

     In [7]: pd.Series(arr) / pd.Series(arr2)  # no future import required
     Out[7]:
     0    0.200000
     1    0.666667
     2    1.500000
     3    4.000000
     dtype: float64

- Infer and downcast dtype if ``downcast='infer'`` is passed to ``fillna/ffill/bfill`` (:issue:`4604`)
- ``__nonzero__`` for all NDFrame objects will now raise a ``ValueError``; this
  reverts to the (:issue:`1073`, :issue:`4633`) behavior. See :ref:`gotchas`
  for a more detailed discussion.

  This prevents doing boolean comparison on *entire* pandas objects, which is
  inherently ambiguous. These all will raise a ``ValueError``.

  .. code-block:: python

     if df:
        ....
     df1 and df2
     s1 and s2

  Added the ``.bool()`` method to ``NDFrame`` objects to facilitate evaluating
  single-element boolean Series:

  .. ipython:: python

     Series([True]).bool()
     Series([False]).bool()
     DataFrame([[True]]).bool()
     DataFrame([[False]]).bool()

- All non-Index NDFrames (``Series``, ``DataFrame``, ``Panel``, ``Panel4D``,
  ``SparsePanel``, etc.), now support the entire set of arithmetic operators
  and arithmetic flex methods (add, sub, mul, etc.). ``SparsePanel`` does not
  support ``pow`` or ``mod`` with non-scalars. (:issue:`3765`)
- ``Series`` and ``DataFrame`` now have a ``mode()`` method to calculate the
  statistical mode(s) by axis/Series. (:issue:`5367`)
- Chained assignment will now by default warn if the user is assigning to a
  copy. This can be changed with the option ``mode.chained_assignment``,
  allowed options are ``raise/warn/None``. See :ref:`the docs`.

  .. ipython:: python

     dfc = DataFrame({'A': ['aaa', 'bbb', 'ccc'], 'B': [1, 2, 3]})
     pd.set_option('chained_assignment', 'warn')

  The following warning / exception will show if this is attempted.

  .. ipython:: python
     :okwarning:

     dfc.loc[0]['A'] = 1111

  ::

     Traceback (most recent call last)
          ...
     SettingWithCopyWarning:
          A value is trying to be set on a copy of a slice from a DataFrame.
          Try using .loc[row_index,col_indexer] = value instead

  Here is the correct method of assignment.

  .. ipython:: python

     dfc.loc[0, 'A'] = 11
     dfc

- ``Panel.reindex`` has the following call signature
  ``Panel.reindex(items=None, major_axis=None, minor_axis=None, **kwargs)``
  to conform with other ``NDFrame`` objects. See :ref:`Internal Refactoring`
  for more information.
- ``Series.argmin`` and ``Series.argmax`` are now aliased to ``Series.idxmin``
  and ``Series.idxmax``. These return the *index* of the min or max element
  respectively. Prior to 0.13.0 these would return the position of the
  min/max element. (:issue:`6214`)
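  For illustration, a minimal sketch of the new label-based behavior:

  .. code-block:: python

     s = Series([30, 10, 20], index=['a', 'b', 'c'])

     s.argmin()  # 'b' -- the index *label* of the minimum, same as s.idxmin()
     s.argmax()  # 'a' -- the index *label* of the maximum, same as s.idxmax()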
Prior Version Deprecations/Changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These were announced changes in 0.12 or prior that are taking effect as of 0.13.0

- Remove deprecated ``Factor`` (:issue:`3650`)
- Remove deprecated ``set_printoptions/reset_printoptions`` (:issue:`3046`)
- Remove deprecated ``_verbose_info`` (:issue:`3215`)
- Remove deprecated ``read_clipboard/to_clipboard/ExcelFile/ExcelWriter`` from
  ``pandas.io.parsers`` (:issue:`3717`) These are available as functions in
  the main pandas namespace (e.g. ``pd.read_clipboard``)
- default for ``tupleize_cols`` is now ``False`` for both ``to_csv`` and
  ``read_csv``. Fair warning in 0.12 (:issue:`3604`)
- default for ``display.max_seq_len`` is now 100 rather than ``None``. This
  activates truncated display ("...") of long sequences in various places.
  (:issue:`3391`)

Deprecations
~~~~~~~~~~~~

Deprecated in 0.13.0

- deprecated ``iterkv``, which will be removed in a future release (this was
  an alias of iteritems used to bypass ``2to3``'s changes).
  (:issue:`4384`, :issue:`4375`, :issue:`4372`)
- deprecated the string method ``match``, whose role is now performed more
  idiomatically by ``extract``. In a future release, the default behavior of
  ``match`` will change to become analogous to ``contains``, which returns a
  boolean indexer. (Their distinction is strictness: ``match`` relies on
  ``re.match`` while ``contains`` relies on ``re.search``.) In this release,
  the deprecated behavior is the default, but the new behavior is available
  through the keyword argument ``as_indexer=True``.
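  For example, a minimal sketch of the old default versus the new opt-in
  behavior (the pattern and data are illustrative only):

  .. code-block:: python

     s = Series(['a1', 'b2', 'c3'])

     # deprecated default: returns the match groups (use extract instead)
     s.str.match('([ab])(\d)')

     # new behavior: a boolean indexer, analogous to contains
     s.str.match('([ab])(\d)', as_indexer=True)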
Indexing API Changes
~~~~~~~~~~~~~~~~~~~~

Prior to 0.13, it was impossible to use a label indexer (``.loc/.ix``) to set
a value that was not contained in the index of a particular axis.
(:issue:`2578`). See :ref:`the docs`

In the ``Series`` case this is effectively an appending operation

.. ipython:: python

   s = Series([1, 2, 3])
   s
   s[5] = 5.
   s

.. ipython:: python

   dfi = DataFrame(np.arange(6).reshape(3, 2),
                   columns=['A', 'B'])
   dfi

This would previously ``KeyError``

.. ipython:: python

   dfi.loc[:, 'C'] = dfi.loc[:, 'A']
   dfi

This is like an ``append`` operation.

.. ipython:: python

   dfi.loc[3] = 5
   dfi

A Panel setting operation on an arbitrary axis aligns the input to the Panel

.. ipython:: python

   p = pd.Panel(np.arange(16).reshape(2, 4, 2),
                items=['Item1', 'Item2'],
                major_axis=pd.date_range('2001/1/12', periods=4),
                minor_axis=['A', 'B'], dtype='float64')
   p
   p.loc[:, :, 'C'] = Series([30, 32], index=p.items)
   p
   p.loc[:, :, 'C']

Float64Index API Change
~~~~~~~~~~~~~~~~~~~~~~~

- Added a new index type, ``Float64Index``. This will be automatically created
  when passing floating values in index creation. This enables a pure
  label-based slicing paradigm that makes ``[],ix,loc`` for scalar indexing
  and slicing work exactly the same. See :ref:`the docs`, (:issue:`263`)

  Construction is by default for floating type values.

  .. ipython:: python

     index = Index([1.5, 2, 3, 4.5, 5])
     index
     s = Series(range(5), index=index)
     s

  Scalar selection for ``[],.ix,.loc`` will always be label based. An integer
  will match an equal float index (e.g. ``3`` is equivalent to ``3.0``)

  .. ipython:: python

     s[3]
     s.ix[3]
     s.loc[3]

  The only positional indexing is via ``iloc``

  .. ipython:: python

     s.iloc[3]

  A scalar index that is not found will raise ``KeyError``

  Slicing is ALWAYS on the values of the index, for ``[],ix,loc`` and ALWAYS
  positional with ``iloc``

  .. ipython:: python

     s[2:4]
     s.ix[2:4]
     s.loc[2:4]
     s.iloc[2:4]

  In float indexes, slicing using floats is allowed

  .. ipython:: python

     s[2.1:4.6]
     s.loc[2.1:4.6]

- Indexing on other index types is preserved (and positional fallback for
  ``[],ix``), with the exception that floating point slicing on indexes on
  non-``Float64Index`` will now raise a ``TypeError``.

  .. code-block:: ipython

     In [1]: Series(range(5))[3.5]
     TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)

     In [1]: Series(range(5))[3.5:4.5]
     TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)

  Using a scalar float indexer will be deprecated in a future version, but is
  allowed for now.

  .. code-block:: ipython

     In [3]: Series(range(5))[3.0]
     Out[3]: 3

HDFStore API Changes
~~~~~~~~~~~~~~~~~~~~

- Query Format Changes. A much more string-like query format is now supported.
  See :ref:`the docs`.

  .. ipython:: python

     path = 'test.h5'
     dfq = DataFrame(randn(10, 4),
                     columns=list('ABCD'),
                     index=date_range('20130101', periods=10))
     dfq.to_hdf(path, 'dfq', format='table', data_columns=True)

  Use boolean expressions, with in-line function evaluation.

  .. ipython:: python

     read_hdf(path, 'dfq',
              where="index>Timestamp('20130104') & columns=['A', 'B']")

  Use an inline column reference

  .. ipython:: python

     read_hdf(path, 'dfq',
              where="A>0 or C>0")

  .. ipython:: python
     :suppress:

     import os
     os.remove(path)

- the ``format`` keyword now replaces the ``table`` keyword; allowed values
  are ``fixed(f)`` or ``table(t)``, the same defaults as prior to 0.13.0
  remain, e.g. ``put`` implies ``fixed`` format and ``append`` implies
  ``table`` format. This default format can be set as an option by setting
  ``io.hdf.default_format``.

  .. ipython:: python

     path = 'test.h5'
     df = DataFrame(randn(10, 2))
     df.to_hdf(path, 'df_table', format='table')
     df.to_hdf(path, 'df_table2', append=True)
     df.to_hdf(path, 'df_fixed')
     with get_store(path) as store:
         print(store)

  .. ipython:: python
     :suppress:

     import os
     os.remove(path)

- Significant table writing performance improvements
- handle a passed ``Series`` in table format (:issue:`4330`)
- can now serialize a ``timedelta64[ns]`` dtype in a table (:issue:`3577`),
  See :ref:`the docs`.
- added an ``is_open`` property to indicate if the underlying file handle
  is_open; a closed store will now report 'CLOSED' when viewing the store
  (rather than raising an error) (:issue:`4409`)
- a close of a ``HDFStore`` now will close that instance of the ``HDFStore``
  but will only close the actual file if the ref count (by ``PyTables``)
  w.r.t. all of the open handles is 0. Essentially you have a local instance
  of ``HDFStore`` referenced by a variable. Once you close it, it will report
  closed. Other references (to the same file) will continue to operate until
  they themselves are closed. Performing an action on a closed file will
  raise ``ClosedFileError``

  .. ipython:: python

     path = 'test.h5'
     df = DataFrame(randn(10, 2))
     store1 = HDFStore(path)
     store2 = HDFStore(path)
     store1.append('df', df)
     store2.append('df2', df)

     store1
     store2

     store1.close()
     store2

     store2.close()
     store2

  .. ipython:: python
     :suppress:

     import os
     os.remove(path)

- removed the ``_quiet`` attribute, replaced by a ``DuplicateWarning`` if
  retrieving duplicate rows from a table (:issue:`4367`)
- removed the ``warn`` argument from ``open``. Instead a
  ``PossibleDataLossError`` exception will be raised if you try to use
  ``mode='w'`` with an OPEN file handle (:issue:`4367`)
- allow a passed locations array or mask as a ``where`` condition
  (:issue:`4467`). See :ref:`the docs` for an example.
- add the keyword ``dropna=True`` to ``append`` to change whether ALL nan
  rows are not written to the store (default is ``True``, ALL nan rows are
  NOT written), also settable via the option ``io.hdf.dropna_table``
  (:issue:`4625`)
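  For example, a minimal sketch (``store`` and ``df_with_missing`` are
  illustrative names, assuming a frame that contains some all-nan rows):

  .. code-block:: python

     store = HDFStore('test_dropna.h5')

     # default: rows in which ALL values are nan are dropped on write
     store.append('df', df_with_missing)

     # keep the all-nan rows as well
     store.append('df_all_rows', df_with_missing, dropna=False)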
- pass thru store creation arguments; can be used to support in-memory stores

DataFrame repr Changes
~~~~~~~~~~~~~~~~~~~~~~

The HTML and plain text representations of :class:`DataFrame` now show
a truncated view of the table once it exceeds a certain size, rather
than switching to the short info view (:issue:`4886`, :issue:`5550`).
This makes the representation more consistent as small DataFrames get
larger.

.. image:: _static/df_repr_truncated.png
   :alt: Truncated HTML representation of a DataFrame

To get the info view, call :meth:`DataFrame.info`. If you prefer the
info view as the repr for large DataFrames, you can set this by running
``set_option('display.large_repr', 'info')``.

Enhancements
~~~~~~~~~~~~

- ``df.to_clipboard()`` learned a new ``excel`` keyword that lets you
  paste df data directly into excel (enabled by default). (:issue:`5070`).
- ``read_html`` now raises a ``URLError`` instead of catching and raising a
  ``ValueError`` (:issue:`4303`, :issue:`4305`)
- Added a test for ``read_clipboard()`` and ``to_clipboard()`` (:issue:`4282`)
- Clipboard functionality now works with PySide (:issue:`4282`)
- Added a more informative error message when plot arguments contain
  overlapping color and style arguments (:issue:`4402`)
- ``to_dict`` now takes ``records`` as a possible outtype. Returns an array
  of column-keyed dictionaries. (:issue:`4936`)
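  A minimal sketch (assuming the ``outtype`` keyword of this release; the
  output shown in the comment is illustrative):

  .. code-block:: python

     df = DataFrame({'A': [1, 2], 'B': ['x', 'y']})
     df.to_dict(outtype='records')
     # [{'A': 1, 'B': 'x'}, {'A': 2, 'B': 'y'}]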
- ``NaN`` handling in get_dummies (:issue:`4446`) with ``dummy_na``

  .. ipython:: python

     # previously, nan was erroneously counted as 2 here
     # now it is not counted at all
     get_dummies([1, 2, np.nan])

     # unless requested
     get_dummies([1, 2, np.nan], dummy_na=True)

- ``timedelta64[ns]`` operations. See :ref:`the docs`.

  .. warning::

     Most of these operations require ``numpy >= 1.7``

  Using the new top-level ``to_timedelta``, you can convert a scalar or array
  from the standard timedelta format (produced by ``to_csv``) into a
  timedelta type (``np.timedelta64`` in ``nanoseconds``).

  .. ipython:: python

     to_timedelta('1 days 06:05:01.00003')
     to_timedelta('15.5us')
     to_timedelta(['1 days 06:05:01.00003', '15.5us', 'nan'])
     to_timedelta(np.arange(5), unit='s')
     to_timedelta(np.arange(5), unit='d')

  A Series of dtype ``timedelta64[ns]`` can now be divided by another
  ``timedelta64[ns]`` object, or astyped to yield a ``float64`` dtyped Series.
  This is frequency conversion. See :ref:`the docs` for the docs.

  .. ipython:: python

     from datetime import timedelta
     td = Series(date_range('20130101', periods=4)) - Series(date_range('20121201', periods=4))
     td[2] += np.timedelta64(timedelta(minutes=5, seconds=3))
     td[3] = np.nan
     td

     # to days
     td / np.timedelta64(1, 'D')
     td.astype('timedelta64[D]')

     # to seconds
     td / np.timedelta64(1, 's')
     td.astype('timedelta64[s]')

  Dividing or multiplying a ``timedelta64[ns]`` Series by an integer or
  integer Series

  .. ipython:: python

     td * -1
     td * Series([1, 2, 3, 4])

  Absolute ``DateOffset`` objects can act equivalently to ``timedeltas``

  .. ipython:: python

     from pandas import offsets
     td + offsets.Minute(5) + offsets.Milli(5)

  Fillna is now supported for timedeltas

  .. ipython:: python

     td.fillna(0)
     td.fillna(timedelta(days=1, seconds=5))

  You can do numeric reduction operations on timedeltas.

  .. ipython:: python

     td.mean()
     td.quantile(.1)

- ``plot(kind='kde')`` now accepts the optional parameters ``bw_method`` and
  ``ind``, passed to scipy.stats.gaussian_kde() (for scipy >= 0.11.0) to set
  the bandwidth, and to gkde.evaluate() to specify the indices at which it
  is evaluated, respectively. See scipy docs. (:issue:`4298`)
- DataFrame constructor now accepts a numpy masked record array (:issue:`3478`)
- The new vectorized string method ``extract`` returns regular expression
  matches more conveniently.

  .. ipython:: python
     :okwarning:

     Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')

  Elements that do not match return ``NaN``. Extracting a regular expression
  with more than one group returns a DataFrame with one column per group.

  .. ipython:: python
     :okwarning:

     Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)')

  Elements that do not match return a row of ``NaN``. Thus, a Series of messy
  strings can be *converted* into a like-indexed Series or DataFrame of
  cleaned-up or more useful strings, without necessitating ``get()`` to
  access tuples or ``re.match`` objects.

  Named groups like

  .. ipython:: python
     :okwarning:

     Series(['a1', 'b2', 'c3']).str.extract(
         '(?P<letter>[ab])(?P<digit>\d)')

  and optional groups can also be used.

  .. ipython:: python
     :okwarning:

     Series(['a1', 'b2', '3']).str.extract(
         '(?P<letter>[ab])?(?P<digit>\d)')

- ``read_stata`` now accepts Stata 13 format (:issue:`4291`)
- ``read_fwf`` now infers the column specifications from the first 100 rows
  of the file if the data has correctly separated and properly aligned
  columns using the delimiter provided to the function (:issue:`4488`).
- support for nanosecond times as an offset

  .. warning::

     These operations require ``numpy >= 1.7``

  Period conversions in the range of seconds and below were reworked and
  extended up to nanoseconds. Periods in the nanosecond range are now
  available.

  .. ipython:: python

     date_range('2013-01-01', periods=5, freq='5N')

  or with frequency as offset

  .. ipython:: python

     date_range('2013-01-01', periods=5, freq=pd.offsets.Nano(5))

  Timestamps can be modified in the nanosecond range

  .. ipython:: python

     t = Timestamp('20130101 09:01:02')
     t + pd.tseries.offsets.Nano(123)

- A new method, ``isin`` for DataFrames, which plays nicely with boolean
  indexing. The argument to ``isin``, what we're comparing the DataFrame to,
  can be a DataFrame, Series, dict, or array of values. See :ref:`the docs`
  for more.

  To get the rows where any of the conditions are met:

  .. ipython:: python

     dfi = DataFrame({'A': [1, 2, 3, 4], 'B': ['a', 'b', 'f', 'n']})
     dfi
     other = DataFrame({'A': [1, 3, 3, 7], 'B': ['e', 'f', 'f', 'e']})
     mask = dfi.isin(other)
     mask
     dfi[mask.any(1)]

- ``Series`` now supports a ``to_frame`` method to convert it to a
  single-column DataFrame (:issue:`5164`)
- All R datasets listed here
  http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
  can now be loaded into Pandas objects

  .. code-block:: python

     # note that pandas.rpy was deprecated in v0.16.0
     import pandas.rpy.common as com
     com.load_data('Titanic')

- ``tz_localize`` can infer a fall daylight savings transition based on the
  structure of the unlocalized data (:issue:`4230`), see :ref:`the docs`
- ``DatetimeIndex`` is now in the API documentation, see :ref:`the docs`
- :meth:`~pandas.io.json.json_normalize` is a new method to allow you to
  create a flat table from semi-structured JSON data. See :ref:`the docs`
  (:issue:`1067`)
- Added PySide support for the qtpandas DataFrameModel and DataFrameWidget.
- Python csv parser now supports usecols (:issue:`4335`)
- Frequencies gained several new offsets:

  * ``LastWeekOfMonth`` (:issue:`4637`)
  * ``FY5253``, and ``FY5253Quarter`` (:issue:`4511`)
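  For instance, a minimal sketch of ``LastWeekOfMonth`` (the anchoring
  argument is assumed from the offset's API; ``FY5253`` takes more involved
  fiscal-calendar arguments, see its docstring):

  .. code-block:: python

     from pandas.tseries.offsets import LastWeekOfMonth

     # roll forward to the last Monday (weekday=0) of the month
     Timestamp('2013-09-01') + LastWeekOfMonth(weekday=0)
     # Timestamp('2013-09-30 00:00:00')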
- DataFrame has a new ``interpolate`` method, similar to Series
  (:issue:`4434`, :issue:`1892`)

  .. ipython:: python

     df = DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                     'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
     df.interpolate()

  Additionally, the ``method`` argument to ``interpolate`` has been expanded
  to include ``'nearest', 'zero', 'slinear', 'quadratic', 'cubic',
  'barycentric', 'krogh', 'piecewise_polynomial', 'pchip', 'polynomial',
  'spline'``. The new methods require scipy_. Consult the Scipy reference
  guide_ and documentation_ for more information about when the various
  methods are appropriate. See :ref:`the docs`.

  Interpolate now also accepts a ``limit`` keyword argument. This works
  similar to ``fillna``'s limit:

  .. ipython:: python

     ser = Series([1, 3, np.nan, np.nan, np.nan, 11])
     ser.interpolate(limit=2)

- Added ``wide_to_long`` panel data convenience function. See :ref:`the docs`.

  .. ipython:: python

     np.random.seed(123)
     df = pd.DataFrame({"A1970" : {0 : "a", 1 : "b", 2 : "c"},
                        "A1980" : {0 : "d", 1 : "e", 2 : "f"},
                        "B1970" : {0 : 2.5, 1 : 1.2, 2 : .7},
                        "B1980" : {0 : 3.2, 1 : 1.3, 2 : .1},
                        "X"     : dict(zip(range(3), np.random.randn(3)))
                       })
     df["id"] = df.index
     df
     wide_to_long(df, ["A", "B"], i="id", j="year")

  .. _scipy: http://www.scipy.org
  .. _documentation: http://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation
  .. _guide: http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html

- ``to_csv`` now takes a ``date_format`` keyword argument that specifies how
  output datetime objects should be formatted. Datetimes encountered in the
  index, columns, and values will all have this formatting applied.
  (:issue:`4313`)
- ``DataFrame.plot`` will scatter plot x versus y by passing
  ``kind='scatter'`` (:issue:`2215`)
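  A minimal sketch (the column names here are illustrative):

  .. code-block:: python

     df = DataFrame(randn(20, 2), columns=['a', 'b'])

     # plot column 'a' against column 'b' as a scatter plot
     df.plot(kind='scatter', x='a', y='b')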
- Added support for Google Analytics v3 API segment IDs that also supports v2
  IDs. (:issue:`5271`)

.. _whatsnew_0130.experimental:

Experimental
~~~~~~~~~~~~

- The new :func:`~pandas.eval` function implements expression evaluation
  using ``numexpr`` behind the scenes. This results in large speedups for
  complicated expressions involving large DataFrames/Series. For example,

  .. ipython:: python

     nrows, ncols = 20000, 100
     df1, df2, df3, df4 = [DataFrame(randn(nrows, ncols))
                           for _ in range(4)]

  .. ipython:: python

     # eval with NumExpr backend
     %timeit pd.eval('df1 + df2 + df3 + df4')

  .. ipython:: python

     # pure Python evaluation
     %timeit df1 + df2 + df3 + df4

  For more details, see :ref:`the docs`

- Similar to ``pandas.eval``, :class:`~pandas.DataFrame` has a new
  ``DataFrame.eval`` method that evaluates an expression in the context of
  the ``DataFrame``. For example,

  .. ipython:: python
     :suppress:

     try:
         del a
     except NameError:
         pass

     try:
         del b
     except NameError:
         pass

  .. ipython:: python

     df = DataFrame(randn(10, 2), columns=['a', 'b'])
     df.eval('a + b')

- :meth:`~pandas.DataFrame.query` method has been added that allows you to
  select elements of a ``DataFrame`` using a natural query syntax nearly
  identical to Python syntax. For example,

  .. ipython:: python
     :suppress:

     try:
         del a
     except NameError:
         pass

     try:
         del b
     except NameError:
         pass

     try:
         del c
     except NameError:
         pass

  .. ipython:: python

     n = 20
     df = DataFrame(np.random.randint(n, size=(n, 3)), columns=['a', 'b', 'c'])
     df.query('a < b < c')

  selects all the rows of ``df`` where ``a < b < c`` evaluates to ``True``.
  For more details see :ref:`the docs`.

- ``pd.read_msgpack()`` and ``pd.to_msgpack()`` are now a supported method of
  serialization of arbitrary pandas (and python objects) in a lightweight
  portable binary format. See :ref:`the docs`

  .. warning::

     Since this is an EXPERIMENTAL LIBRARY, the storage format may not be
     stable until a future release.

  .. ipython:: python

     df = DataFrame(np.random.rand(5, 2), columns=list('AB'))
     df.to_msgpack('foo.msg')
     pd.read_msgpack('foo.msg')

     s = Series(np.random.rand(5), index=date_range('20130101', periods=5))
     pd.to_msgpack('foo.msg', df, s)
     pd.read_msgpack('foo.msg')

  You can pass ``iterator=True`` to iterate over the unpacked results

  .. ipython:: python

     for o in pd.read_msgpack('foo.msg', iterator=True):
         print(o)

  .. ipython:: python
     :suppress:
     :okexcept:

     os.remove('foo.msg')

- ``pandas.io.gbq`` provides a simple way to extract from, and load data
  into, Google's BigQuery Data Sets by way of pandas DataFrames. BigQuery is
  a high performance SQL-like database service, useful for performing ad-hoc
  queries against extremely large datasets. :ref:`See the docs`

  .. code-block:: python

     import pandas as pd
     from pandas.io import gbq

     # A query to select the average monthly temperatures
     # in the year 2000 across the USA. The dataset,
     # publicdata:samples.gsod, is available on all BigQuery accounts,
     # and is based on NOAA gsod data.

     query = """SELECT station_number as STATION,
     month as MONTH, AVG(mean_temp) as MEAN_TEMP
     FROM publicdata:samples.gsod
     WHERE YEAR = 2000
     GROUP BY STATION, MONTH
     ORDER BY STATION, MONTH ASC"""

     # Fetch the result set for this query

     # Your Google BigQuery Project ID
     # To find this, see your dashboard:
     # https://console.developers.google.com/iam-admin/projects?authuser=0

     projectid = 'xxxxxxxxx'
     df = gbq.read_gbq(query, project_id=projectid)

     # Use pandas to process and reshape the dataset
     df2 = df.pivot(index='STATION', columns='MONTH', values='MEAN_TEMP')
     df3 = pd.concat([df2.min(), df2.mean(), df2.max()],
                     axis=1, keys=["Min Temp", "Mean Temp", "Max Temp"])

  The resulting DataFrame is::

     > df3
             Min Temp   Mean Temp    Max Temp
     MONTH
     1     -53.336667   39.827892   89.770968
     2     -49.837500   43.685219   93.437932
     3     -77.926087   48.708355   96.099998
     4     -82.892858   55.070087   97.317240
     5     -92.378261   61.428117  102.042856
     6     -77.703334   65.858888  102.900000
     7     -87.821428   68.169663  106.510714
     8     -89.431999   68.614215  105.500000
     9     -86.611112   63.436935  107.142856
     10    -78.209677   56.880838   92.103333
     11    -50.125000   48.861228   94.996428
     12    -50.332258   42.286879   94.396774

  .. warning::

     To use this module, you will need a BigQuery account. See the Google
     BigQuery documentation for details.

     As of 10/10/13, there is a bug in Google's API preventing result sets
     from being larger than 100,000 rows. A patch is scheduled for the week
     of 10/14/13.

.. _whatsnew_0130.refactoring:

Internal Refactoring
~~~~~~~~~~~~~~~~~~~~

In 0.13.0 there is a major refactor primarily to subclass ``Series`` from
``NDFrame``, which is the base class currently for ``DataFrame`` and
``Panel``, to unify methods and behaviors. Series formerly subclassed
directly from ``ndarray``. (:issue:`4080`, :issue:`3862`, :issue:`816`)

.. warning::

   There are two potential incompatibilities from < 0.13.0

   - Using certain numpy functions would previously return a ``Series`` if
     passed a ``Series`` as an argument. This seems only to affect
     ``np.ones_like``, ``np.empty_like``, ``np.diff`` and ``np.where``. These
     now return ``ndarrays``.

     .. ipython:: python

        s = Series([1, 2, 3, 4])

     Numpy Usage

     .. ipython:: python

        np.ones_like(s)
        np.diff(s)
        np.where(s > 1, s, np.nan)

     Pandonic Usage

     .. ipython:: python

        Series(1, index=s.index)
        s.diff()
        s.where(s > 1)

   - Passing a ``Series`` directly to a cython function expecting an
     ``ndarray`` type will no longer work directly, you must pass
     ``Series.values``, See :ref:`Enhancing Performance`
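     A minimal sketch (``some_cython_function`` is a hypothetical routine
     that expects a plain ``ndarray``):

     .. code-block:: python

        s = Series([1, 2, 3, 4])

        # previously this worked, because Series *was* an ndarray
        # some_cython_function(s)

        # now pass the underlying ndarray explicitly
        some_cython_function(s.values)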
   - ``Series(0.5)`` would previously return the scalar ``0.5``, instead this
     will return a 1-element ``Series``
   - This change breaks ``rpy2<=2.3.8``. An issue has been opened against rpy2
     and a workaround is detailed in :issue:`5698`. Thanks @JanSchulz.
   - Pickle compatibility is preserved for pickles created prior to 0.13.
     These must be unpickled with ``pd.read_pickle``, see :ref:`Pickling`.

- Refactor of series.py/frame.py/panel.py to move common code to generic.py

  - added ``_setup_axes`` to create generic NDFrame structures
  - moved methods

    - ``from_axes, _wrap_array, axes, ix, loc, iloc, shape, empty, swapaxes, transpose, pop``
    - ``__iter__, keys, __contains__, __len__, __neg__, __invert__``
    - ``convert_objects, as_blocks, as_matrix, values``
    - ``__getstate__, __setstate__`` (compat remains in frame/panel)
    - ``__getattr__, __setattr__``
    - ``_indexed_same, reindex_like, align, where, mask``
    - ``fillna, replace`` (``Series`` replace is now consistent with ``DataFrame``)
    - ``filter`` (also added axis argument to selectively filter on a different axis)
    - ``reindex, reindex_axis, take``
    - ``truncate`` (moved to become part of ``NDFrame``)

- These are API changes which make ``Panel`` more consistent with ``DataFrame``

  - ``swapaxes`` on a ``Panel`` with the same axes specified now returns a copy
  - support attribute access for setting
  - filter supports the same API as the original ``DataFrame`` filter
  - Reindex called with no arguments will now return a copy of the input object

- ``TimeSeries`` is now an alias for ``Series``. The property
  ``is_time_series`` can be used to distinguish (if desired)
- Refactor of Sparse objects to use BlockManager

  - Created a new block type in internals, ``SparseBlock``, which can hold
    multi-dtypes and is non-consolidatable. ``SparseSeries`` and
    ``SparseDataFrame`` now inherit more methods from their hierarchy
    (Series/DataFrame), and no longer inherit from ``SparseArray`` (which
    instead is the object of the ``SparseBlock``)
  - Sparse suite now supports integration with non-sparse data. Non-float
    sparse data is supportable (partially implemented)
  - Operations on sparse structures within DataFrames should preserve
    sparseness, merging type operations will convert to dense (and back to
    sparse), so might be somewhat inefficient
  - enable setitem on ``SparseSeries`` for boolean/integer/slices
  - ``SparsePanels`` implementation is unchanged (e.g. not using BlockManager,
    needs work)

- added ``ftypes`` method to Series/DataFrame, similar to ``dtypes``, but
  indicates if the underlying is sparse/dense (as well as the dtype)
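  For example, a minimal sketch (the output shown in the comments is
  illustrative):

  .. code-block:: python

     df = DataFrame({'A': [1, 2], 'B': [1., 2.]})
     df.dtypes    # A      int64
                  # B    float64
     df.ftypes    # A      int64:dense
                  # B    float64:dense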
- All ``NDFrame`` objects can now use ``__finalize__()`` to specify various
  values to propagate to new objects from an existing one (e.g. ``name`` in
  ``Series`` will follow more automatically now)
- Internal type checking is now done via a suite of generated classes,
  allowing ``isinstance(value, klass)`` without having to directly import
  the klass, courtesy of @jtratner
- Bug in Series update where the parent frame is not updating its cache
  based on changes (:issue:`4080`) or types (:issue:`3217`), fillna
  (:issue:`3386`)
- Indexing with dtype conversions fixed (:issue:`4463`, :issue:`4204`)
- Refactor ``Series.reindex`` to core/generic.py (:issue:`4604`,
  :issue:`4618`), allow ``method=`` in reindexing on a Series to work
- ``Series.copy`` no longer accepts the ``order`` parameter and is now
  consistent with ``NDFrame`` copy
- Refactor ``rename`` methods to core/generic.py; fixes ``Series.rename``
  for (:issue:`4605`), and adds ``rename`` with the same signature for
  ``Panel``
- Refactor ``clip`` methods to core/generic.py (:issue:`4798`)
- Refactor of ``_get_numeric_data/_get_bool_data`` to core/generic.py,
  allowing Series/Panel functionality
- ``Series`` (for index) / ``Panel`` (for items) now allow attribute access
  to its elements (:issue:`1903`)

  .. ipython:: python

     s = Series([1, 2, 3], index=list('abc'))
     s.b
     s.a = 5
     s

Bug Fixes
~~~~~~~~~

See :ref:`V0.13.0 Bug Fixes` for an extensive list of bugs that have been
fixed in 0.13.0.

See the :ref:`full release notes` or issue tracker on GitHub for a complete
list of all API changes, Enhancements and Bug Fixes.