File: transform.rst

package info (click to toggle)
python-agate 1.13.0-1~exp1
links: PTS, VCS
area: main
in suites: experimental
size: 2,008 kB
sloc: python: 8,578; makefile: 126
file content (205 lines) | stat: -rw-r--r-- 6,301 bytes
parent folder | download | duplicates (5)
=========
Transform
=========

Pivot by a single column
========================

The :meth:`.Table.pivot` method is a general process for grouping data by row and, optionally, by column, and then calculating some aggregation for each group. Consider the following table:

+---------+---------+--------+-------+
|  name   | race    | gender | age   |
+=========+=========+========+=======+
|  Joe    | white   | female | 20    |
+---------+---------+--------+-------+
|  Jane   | asian   | male   | 20    |
+---------+---------+--------+-------+
|  Jill   | black   | female | 20    |
+---------+---------+--------+-------+
|  Jim    | latino  | male   | 25    |
+---------+---------+--------+-------+
|  Julia  | black   | female | 25    |
+---------+---------+--------+-------+
|  Joan   | asian   | female | 25    |
+---------+---------+--------+-------+

In the very simplest case, this table can be pivoted to count the number occurences of values in a column:

.. code-block:: python

    transformed = table.pivot('race')

Result:

+---------+--------+
| race    | pivot  |
+=========+========+
| white   | 1      |
+---------+--------+
| asian   | 2      |
+---------+--------+
| black   | 2      |
+---------+--------+
| latino  | 1      |
+---------+--------+

Pivot by multiple columns
=========================

You can pivot by multiple columns either as additional row-groups, or as intersecting columns. For example, given the table in the previous example:

.. code-block:: python

    transformed = table.pivot(['race', 'gender'])

Result:

+---------+--------+-------+
| race    | gender | pivot |
+=========+========+=======+
| white   | female | 1     |
+---------+--------+-------+
| asian   | male   | 1     |
+---------+--------+-------+
| black   | female | 2     |
+---------+--------+-------+
| latino  | male   | 1     |
+---------+--------+-------+
| asian   | female | 1     |
+---------+--------+-------+

For the column, version you would do:

.. code-block:: python

    transformed = table.pivot('race', 'gender')

Result:

+---------+--------+--------+
| race    | male   | female |
+=========+========+========+
| white   | 0      | 1      |
+---------+--------+--------+
| asian   | 1      | 1      |
+---------+--------+--------+
| black   | 0      | 2      |
+---------+--------+--------+
| latino  | 1      | 0      |
+---------+--------+--------+

Pivot to sum
============

The default pivot aggregation is :class:`.Count` but you can also supply other operations. For example, to aggregate each group by :class:`.Sum` of their ages:

.. code-block:: python

    transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'))

+---------+--------+--------+
| race    | male   | female |
+=========+========+========+
| white   | 0      | 20     |
+---------+--------+--------+
| asian   | 20     | 25     |
+---------+--------+--------+
| black   | 0      | 45     |
+---------+--------+--------+
| latino  | 25     | 0      |
+---------+--------+--------+

Pivot to percent of total
=========================

Pivot allows you to apply a :class:`.Computation` to each row of aggregated results prior to returning the table. Use the stringified name of the aggregation as the column argument to your computation:

.. code-block:: python

    transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'), computation=agate.Percent('sum'))

+---------+--------+--------+
| race    | male   | female |
+=========+========+========+
| white   | 0      | 14.8   |
+---------+--------+--------+
| asian   | 14.8   | 18.4   |
+---------+--------+--------+
| black   | 0      | 33.3   |
+---------+--------+--------+
| latino  | 18.4   | 0      |
+---------+--------+--------+

*Note: actual computed percentages will be much more precise.*

It's helpful when constructing these cases to think of all the cells in the pivot table as a single sequence.

Denormalize key/value columns into separate columns
===================================================

It's common for very large datasets to be distributed in a "normalized" format, such as:

+---------+-----------+---------+
|  name   | property  | value   |
+=========+===========+=========+
|  Jane   | gender    | female  |
+---------+-----------+---------+
|  Jane   | race      | black   |
+---------+-----------+---------+
|  Jane   | age       | 24      |
+---------+-----------+---------+
|  ...    |  ...      |  ...    |
+---------+-----------+---------+

The :meth:`.Table.denormalize` method can be used to transform the table so that each unique property has its own column.

.. code-block:: python

    transformed = table.denormalize('name', 'property', 'value')

Result:

+---------+----------+--------+-------+
|  name   | gender   | race   | age   |
+=========+==========+========+=======+
|  Jane   | female   | black  | 24    |
+---------+----------+--------+-------+
|  Jack   | male     | white  | 35    |
+---------+----------+--------+-------+
|  Joe    | male     | black  | 28    |
+---------+----------+--------+-------+

Normalize separate columns into key/value columns
=================================================

Sometimes you have a dataset where each property has its own column, but your analysis would be easier if all properties were stored together. Consider this table:

+---------+----------+--------+-------+
|  name   | gender   | race   | age   |
+=========+==========+========+=======+
|  Jane   | female   | black  | 24    |
+---------+----------+--------+-------+
|  Jack   | male     | white  | 35    |
+---------+----------+--------+-------+
|  Joe    | male     | black  | 28    |
+---------+----------+--------+-------+

The :meth:`.Table.normalize` method can be used to transform the table so that all the properties and their values share two columns.

.. code-block:: python

    transformed = table.normalize('name', ['gender', 'race', 'age'])

Result:

+---------+-----------+---------+
|  name   | property  | value   |
+=========+===========+=========+
|  Jane   | gender    | female  |
+---------+-----------+---------+
|  Jane   | race      | black   |
+---------+-----------+---------+
|  Jane   | age       | 24      |
+---------+-----------+---------+
|  ...    |  ...      |  ...    |
+---------+-----------+---------+