File: pandas.rst

package info (click to toggle)
astropy 7.0.1-3
  • links: PTS, VCS
  • area: main
  • in suites: trixie
  • size: 35,328 kB
  • sloc: python: 233,437; ansic: 55,264; javascript: 17,680; lex: 8,621; sh: 3,317; xml: 2,287; makefile: 191
file content (176 lines) | stat: -rw-r--r-- 6,418 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
.. doctest-skip-all

.. _pandas:

Interfacing with the Pandas Package
***********************************

The `pandas <https://pandas.pydata.org/>`__ package is a package for high
performance data analysis of table-like structures that is complementary to the
:class:`~astropy.table.Table` class in ``astropy``.

In order to exchange data between the :class:`~astropy.table.Table` class and
the :class:`pandas.DataFrame` class (the main data structure in ``pandas``),
the |Table| class includes two methods, :meth:`~astropy.table.Table.to_pandas`
and :meth:`~astropy.table.Table.from_pandas`.

Basic Example
-------------

.. EXAMPLE START: Interfacing Tables with the Pandas Package

To demonstrate, we can create a minimal table::

    >>> from astropy.table import Table
    >>> t = Table()
    >>> t['a'] = [1, 2, 3, 4]
    >>> t['b'] = ['a', 'b', 'c', 'd']

Which we can then convert to a :class:`~pandas.DataFrame`::

    >>> df = t.to_pandas()
    >>> df
       a  b
    0  1  a
    1  2  b
    2  3  c
    3  4  d
    >>> type(df)
    <class 'pandas.core.frame.DataFrame'>

It is also possible to create a table from a :class:`~pandas.DataFrame`::

    >>> t2 = Table.from_pandas(df)
    >>> t2
    <Table length=4>
      a      b
    int64 string8
    ----- -------
        1       a
        2       b
        3       c
        4       d

.. EXAMPLE END

Details
-------
The conversions to and from ``pandas`` are subject to the following caveats:

* The :class:`~pandas.DataFrame` structure does not support multidimensional
  columns, so |Table| objects with multidimensional columns cannot be converted
  to :class:`~pandas.DataFrame`.

* Masked tables can be converted, but pandas uses a sentinel such as `numpy.nan` to
  represent missing values. Astropy uses a separate mask array to represent masked
  values, where the value "under the mask" is preserved. When converting any astropy
  masked column to a pandas DataFrame, the original values are lost.

* Tables with :ref:`mixin_columns` such as `~astropy.time.Time`,
  `~astropy.coordinates.SkyCoord`, and |Quantity| can be converted, but
  *with loss of information or fidelity*. For instance, `~astropy.time.Time` columns
  will be converted to a `pandas TimeSeries
  <https://pandas.pydata.org/docs/user_guide/timeseries.html>`_, but this object has
  only 64-bit precision and does not support leap seconds or time scales.

These issues are highlighted below in a more complex example with a table that includes
masked and mixin columns.

.. EXAMPLE START: Interfacing Tables with the Pandas Package (Complex Example)

First we create a table with a masked columns and a mixin column::

    >>> import numpy as np
    >>> from astropy.table import MaskedColumn, QTable
    >>> from astropy.time import Time
    >>> from astropy.coordinates import SkyCoord
    >>> import astropy.units as u
    >>> t = QTable()
    >>> t['a'] = MaskedColumn([1, 2, 3], mask=[False, True, False])
    >>> t['b'] = MaskedColumn([1.0, 2.0, 3.0], mask=[False, False, True])
    >>> t['c'] = MaskedColumn(["a", "b", "c"], mask=[True, False, False])
    >>> t['tm'] = Time(["2021-01-01", "2021-01-02", "2021-01-03"])
    >>> t['sc'] = SkyCoord(ra=[1, 2, 3] * u.deg, dec=[4, 5, 6] * u.deg)
    >>> t['q'] = [1, 2, 3] * u.m

    >>> t
    <QTable length=3>
      a      b     c              tm              sc       q
                                               deg,deg     m
    int64 float64 str1           Time          SkyCoord float64
    ----- ------- ---- ----------------------- -------- -------
        1     1.0   -- 2021-01-01 00:00:00.000  1.0,4.0     1.0
       --     2.0    b 2021-01-02 00:00:00.000  2.0,5.0     2.0
        3      --    c 2021-01-03 00:00:00.000  3.0,6.0     3.0

Now we convert this table to a :class:`~pandas.DataFrame`::

    >>> df = t.to_pandas()
    >>> df
          a    b    c         tm  sc.ra  sc.dec    q
    0     1  1.0  NaN 2021-01-01    1.0     4.0  1.0
    1  <NA>  2.0    b 2021-01-02    2.0     5.0  2.0
    2     3  NaN    c 2021-01-03    3.0     6.0  3.0

    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 7 columns):
     #   Column  Non-Null Count  Dtype
    ---  ------  --------------  -----
     0   a       2 non-null      Int64
     1   b       2 non-null      float64
     2   c       2 non-null      object
     3   tm      3 non-null      datetime64[ns]
     4   sc.ra   3 non-null      float64
     5   sc.dec  3 non-null      float64
     6   q       3 non-null      float64
    dtypes: Int64(1), datetime64[ns](1), float64(4), object(1)
    memory usage: 303.0+ bytes

Notice a few things:

- The masked values in the original table are replaced with sentinel values
  in pandas. The integer column ``a`` is converted to a nullable integer column, and
  the string column ``c`` is converted to an ``object`` column.
- The `~astropy.time.Time` object is converted to a pandas TimeSeries using
  ``datetime64[ns]``.
- The `~astropy.coordinates.SkyCoord` object is converted to two float columns
  ``sc.ra`` and ``sc.dec``, and the unit is lost.
- The `~astropy.units.Quantity` object is converted to a float column and the unit is
  lost.

Now convert back to a table::

    >>> t_df = QTable.from_pandas(df)
    >>> t_df
    <QTable length=3>
      a      b     c              tm            sc.ra   sc.dec    q
    int64 float64 str1           Time          float64 float64 float64
    ----- ------- ---- ----------------------- ------- ------- -------
        1     1.0   -- 2021-01-01T00:00:00.000     1.0     4.0     1.0
       --     2.0    b 2021-01-02T00:00:00.000     2.0     5.0     2.0
        3      --    c 2021-01-03T00:00:00.000     3.0     6.0     3.0

The `~astropy.time.Time` column is restored (subject to the limitations discussed
previously), but the `~astropy.coordinates.SkyCoord` and `~astropy.units.Quantity`
columns are not restored as they were in the original table.

Finally see that the masked values in the original table are replaced with zero or "" in
the round-trip conversion::

    # Original data values
    >>> for nm in 'a', 'b', 'c':
    ...     print(t[nm].data.data)
    [1 2 3]
    [1. 2. 3.]
    ['a' 'b' 'c']

    # Data values after round-trip conversion
    >>> for nm in 'a', 'b', 'c':
    ...     print(t_df[nm].data.data)
    [1 0 3]
    [ 1.  2. nan]
    ['' 'b' 'c']

.. EXAMPLE END