1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289
|
==================
Compute new values
==================
Change
======
.. code-block:: python
new_table = table.compute([
('2000_change', agate.Change('2000', '2001')),
('2001_change', agate.Change('2001', '2002')),
('2002_change', agate.Change('2002', '2003'))
])
Or, better yet, compute the whole decade using a loop:
.. code-block:: Python
computations = []
for year in range(2000, 2010):
change = agate.Change(year, year + 1)
computations.append(('%i_change' % year, change))
new_table = table.compute(computations)
Percent
=======
Calculate the percentage for each value in a column with :class:`.Percent`.
Values are divided into the sum of the column by default.
.. code-block:: python
columns = ('value',)
rows = ([1],[2],[2],[5])
new_table = agate.Table(rows, columns)
new_table = new_table.compute([
('percent', agate.Percent('value'))
])
new_table.print_table()
| value | percent |
| ----- | ------- |
| 1 | 10 |
| 2 | 20 |
| 2 | 20 |
| 5 | 50 |
Override the denominator with a keyword argument.
.. code-block:: python
new_table = new_table.compute([
('percent', agate.Percent('value', 5))
])
new_table.print_table()
| value | percent |
| ----- | ------- |
| 1 | 20 |
| 2 | 40 |
| 2 | 40 |
| 5 | 100 |
Percent change
==============
Want percent change instead of value change? Just swap out the :class:`.Computation`:
.. code-block:: Python
computations = []
for year in range(2000, 2010):
change = agate.PercentChange(year, year + 1)
computations.append(('%i_change' % year, change))
new_table = table.compute(computations)
Indexed/cumulative change
=========================
Need your change indexed to a starting year? Just fix the first argument:
.. code-block:: Python
computations = []
for year in range(2000, 2010):
change = agate.Change(2000, year + 1)
computations.append(('%i_change' % year, change))
new_table = table.compute(computations)
Of course you can also use :class:`.PercentChange` if you need percents rather than values.
Round to two decimal places
===========================
agate stores numerical values using Python's :class:`decimal.Decimal` type. This data type ensures numerical precision beyond what is supported by the native :func:`float` type, however, because of this we can not use Python's builtin :func:`round` function. Instead we must use :meth:`decimal.Decimal.quantize`.
We can use :meth:`.Table.compute` to apply the quantize to generate a rounded column from an existing one:
.. code-block:: python
from decimal import Decimal
number_type = agate.Number()
def round_price(row):
return row['price'].quantize(Decimal('0.01'))
new_table = table.compute([
('price_rounded', agate.Formula(number_type, round_price))
])
To round to one decimal place you would simply change :code:`0.01` to :code:`0.1`.
.. _difference_between_dates:
Difference between dates
========================
Calculating the difference between dates (or dates and times) works exactly the same as it does for numbers:
.. code-block:: python
new_table = table.compute([
('age_at_death', agate.Change('born', 'died'))
])
Levenshtein edit distance
=========================
The Levenshtein edit distance is a common measure of string similarity. It can be used, for instance, to check for typos between manually-entered names and a version that is known to be spelled correctly.
Implementing Levenshtein requires writing a custom :class:`.Computation`. To save ourselves building the whole thing from scratch, we will lean on the `python-Levenshtein <https://pypi.python.org/pypi/python-Levenshtein/>`_ library for the actual algorithm.
.. code-block:: python
import agate
from Levenshtein import distance
class LevenshteinDistance(agate.Computation):
"""
Computes Levenshtein edit distance between the column and a given string.
"""
def __init__(self, column_name, compare_string):
self._column_name = column_name
self._compare_string = compare_string
def get_computed_data_type(self, table):
"""
The return value is a numerical distance.
"""
return agate.Number()
def validate(self, table):
"""
Verify the column is text.
"""
column = table.columns[self._column_name]
if not isinstance(column.data_type, agate.Text):
raise agate.DataTypeError('Can only be applied to Text data.')
def run(self, table):
"""
Find the distance, returning null when the input column was null.
"""
new_column = []
for row in table.rows:
val = row[self._column_name]
if val is None:
new_column.append(None)
else:
new_column.append(distance(val, self._compare_string))
return new_column
This code can now be applied to any :class:`.Table` just as any other :class:`.Computation` would be:
.. code-block:: python
new_table = table.compute([
('distance', LevenshteinDistance('column_name', 'string to compare'))
])
The resulting column will contain an integer measuring the edit distance between the value in the column and the comparison string.
USA Today Diversity Index
=========================
The `USA Today Diversity Index <https://www.usatoday.com/story/news/nation/2014/10/21/diversity-index-data-how-we-did-report/17432103/>`_ is a widely cited method for evaluating the racial diversity of a given area. Using a custom :class:`.Computation` makes it simple to calculate.
Assuming that your data has a column for the total population, another for the population of each race and a final column for the hispanic population, you can implement the diversity index like this:
.. code-block:: python
class USATodayDiversityIndex(agate.Computation):
def get_computed_data_type(self, table):
return agate.Number()
def run(self, table):
new_column = []
for row in table.rows:
race_squares = 0
for race in ['white', 'black', 'asian', 'american_indian', 'pacific_islander']:
race_squares += (row[race] / row['population']) ** 2
hispanic_squares = (row['hispanic'] / row['population']) ** 2
hispanic_squares += (1 - (row['hispanic'] / row['population'])) ** 2
new_column.append((1 - (race_squares * hispanic_squares)) * 100)
return new_column
We apply the diversity index like any other computation:
.. code-block:: Python
with_index = table.compute([
('diversity_index', USATodayDiversityIndex())
])
Simple Moving Average
=====================
A simple moving average is the average of some number of prior values in a series. It is typically used to smooth out variation in time series data.
The following custom :class:`.Computation` will compute a simple moving average. This example assumes your data is already sorted.
.. code-block:: python
class SimpleMovingAverage(agate.Computation):
"""
Computes the simple moving average of a column over some interval.
"""
def __init__(self, column_name, interval):
self._column_name = column_name
self._interval = interval
def get_computed_data_type(self, table):
"""
The return value is a numerical average.
"""
return agate.Number()
def validate(self, table):
"""
Verify the column is numerical.
"""
column = table.columns[self._column_name]
if not isinstance(column.data_type, agate.Number):
raise agate.DataTypeError('Can only be applied to Number data.')
def run(self, table):
new_column = []
for i, row in enumerate(table.rows):
if i < self._interval:
new_column.append(None)
else:
values = tuple(r[self._column_name] for r in table.rows[i - self._interval:i])
if None in values:
new_column.append(None)
else:
new_column.append(sum(values) / self._interval)
return new_column
You would use the simple moving average like so:
.. code-block:: Python
with_average = table.compute([
('six_month_moving_average', SimpleMovingAverage('price', 6))
])
|