File: compute.rst

package info (click to toggle)
python-agate 1.13.0-1~exp1
  • links: PTS, VCS
  • area: main
  • in suites: experimental
  • size: 2,008 kB
  • sloc: python: 8,578; makefile: 126
file content (289 lines) | stat: -rw-r--r-- 8,901 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
==================
Compute new values
==================

Change
======

.. code-block:: python

    new_table = table.compute([
        ('2000_change', agate.Change('2000', '2001')),
        ('2001_change', agate.Change('2001', '2002')),
        ('2002_change', agate.Change('2002', '2003'))
    ])

Or, better yet, compute the whole decade using a loop:

.. code-block:: Python

    computations = []

    for year in range(2000, 2010):
        change = agate.Change(year, year + 1)
        computations.append(('%i_change' % year, change))

    new_table = table.compute(computations)

Percent
=======

Calculate the percentage for each value in a column with :class:`.Percent`.
Values are divided into the sum of the column by default.

.. code-block:: python

    columns = ('value',)
    rows = ([1],[2],[2],[5])
    new_table = agate.Table(rows, columns)

    new_table = new_table.compute([
        ('percent', agate.Percent('value'))
    ])

    new_table.print_table()

    | value | percent |
    | ----- | ------- |
    |     1 |      10 |
    |     2 |      20 |
    |     2 |      20 |
    |     5 |      50 |

Override the denominator with a keyword argument.

.. code-block:: python

    new_table = new_table.compute([
        ('percent', agate.Percent('value', 5))
    ])

    new_table.print_table()

    | value | percent |
    | ----- | ------- |
    |     1 |      20 |
    |     2 |      40 |
    |     2 |      40 |
    |     5 |     100 |

Percent change
==============

Want percent change instead of value change? Just swap out the :class:`.Computation`:

.. code-block:: Python

    computations = []

    for year in range(2000, 2010):
        change = agate.PercentChange(year, year + 1)
        computations.append(('%i_change' % year, change))

    new_table = table.compute(computations)

Indexed/cumulative change
=========================

Need your change indexed to a starting year? Just fix the first argument:

.. code-block:: Python

    computations = []

    for year in range(2000, 2010):
        change = agate.Change(2000, year + 1)
        computations.append(('%i_change' % year, change))

    new_table = table.compute(computations)

Of course you can also use :class:`.PercentChange` if you need percents rather than values.

Round to two decimal places
===========================

agate stores numerical values using Python's :class:`decimal.Decimal` type. This data type ensures numerical precision beyond what is supported by the native :func:`float` type, however, because of this we can not use Python's builtin :func:`round` function. Instead we must use :meth:`decimal.Decimal.quantize`.

We can use :meth:`.Table.compute` to apply the quantize to generate a rounded column from an existing one:

.. code-block:: python

    from decimal import Decimal

    number_type = agate.Number()

    def round_price(row):
        return row['price'].quantize(Decimal('0.01'))

    new_table = table.compute([
        ('price_rounded', agate.Formula(number_type, round_price))
    ])

To round to one decimal place you would simply change :code:`0.01` to :code:`0.1`.

.. _difference_between_dates:

Difference between dates
========================

Calculating the difference between dates (or dates and times) works exactly the same as it does for numbers:

.. code-block:: python

    new_table = table.compute([
        ('age_at_death', agate.Change('born', 'died'))
    ])

Levenshtein edit distance
=========================

The Levenshtein edit distance is a common measure of string similarity. It can be used, for instance, to check for typos between manually-entered names and a version that is known to be spelled correctly.

Implementing Levenshtein requires writing a custom :class:`.Computation`. To save ourselves building the whole thing from scratch, we will lean on the `python-Levenshtein <https://pypi.python.org/pypi/python-Levenshtein/>`_ library for the actual algorithm.

.. code-block:: python

    import agate
    from Levenshtein import distance
    
    class LevenshteinDistance(agate.Computation):
        """
        Computes Levenshtein edit distance between the column and a given string.
        """
        def __init__(self, column_name, compare_string):
            self._column_name = column_name
            self._compare_string = compare_string

        def get_computed_data_type(self, table):
            """
            The return value is a numerical distance.
            """
            return agate.Number()

        def validate(self, table):
            """
            Verify the column is text.
            """
            column = table.columns[self._column_name]

            if not isinstance(column.data_type, agate.Text):
                raise agate.DataTypeError('Can only be applied to Text data.')

        def run(self, table):
            """
            Find the distance, returning null when the input column was null.
            """
            new_column = []

            for row in table.rows:
              val = row[self._column_name]

              if val is None:
                  new_column.append(None)
              else:
                  new_column.append(distance(val, self._compare_string))

            return new_column

This code can now be applied to any :class:`.Table` just as any other :class:`.Computation` would be:

.. code-block:: python

    new_table = table.compute([
        ('distance', LevenshteinDistance('column_name', 'string to compare'))
    ])

The resulting column will contain an integer measuring the edit distance between the value in the column and the comparison string.

USA Today Diversity Index
=========================

The `USA Today Diversity Index <https://www.usatoday.com/story/news/nation/2014/10/21/diversity-index-data-how-we-did-report/17432103/>`_ is a widely cited method for evaluating the racial diversity of a given area. Using a custom :class:`.Computation` makes it simple to calculate.

Assuming that your data has a column for the total population, another for the population of each race and a final column for the hispanic population, you can implement the diversity index like this:

.. code-block:: python

    class USATodayDiversityIndex(agate.Computation):
        def get_computed_data_type(self, table):
            return agate.Number()

        def run(self, table):
            new_column = []

            for row in table.rows:
              race_squares = 0

              for race in ['white', 'black', 'asian', 'american_indian', 'pacific_islander']:
                  race_squares += (row[race] / row['population']) ** 2

              hispanic_squares = (row['hispanic'] / row['population']) ** 2
              hispanic_squares += (1 - (row['hispanic'] / row['population'])) ** 2

              new_column.append((1 - (race_squares * hispanic_squares)) * 100)

            return new_column

We apply the diversity index like any other computation:

.. code-block:: Python

    with_index = table.compute([
        ('diversity_index', USATodayDiversityIndex())
    ])

Simple Moving Average
=====================

A simple moving average is the average of some number of prior values in a series. It is typically used to smooth out variation in time series data.

The following custom :class:`.Computation` will compute a simple moving average. This example assumes your data is already sorted.

.. code-block:: python

    class SimpleMovingAverage(agate.Computation):
        """
        Computes the simple moving average of a column over some interval.
        """
        def __init__(self, column_name, interval):
            self._column_name = column_name
            self._interval = interval

        def get_computed_data_type(self, table):
            """
            The return value is a numerical average.
            """
            return agate.Number()

        def validate(self, table):
            """
            Verify the column is numerical.
            """
            column = table.columns[self._column_name]

            if not isinstance(column.data_type, agate.Number):
                raise agate.DataTypeError('Can only be applied to Number data.')

        def run(self, table):
            new_column = []

            for i, row in enumerate(table.rows):
                if i < self._interval:
                    new_column.append(None)
                else:
                    values = tuple(r[self._column_name] for r in table.rows[i - self._interval:i])

                    if None in values:
                        new_column.append(None)
                    else:
                        new_column.append(sum(values) / self._interval)

            return new_column

You would use the simple moving average like so:

.. code-block:: Python

    with_average = table.compute([
        ('six_month_moving_average', SimpleMovingAverage('price', 6))
    ])