1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
|
===========
Filter rows
===========
By regex
========
You can use Python's builtin :mod:`re` module to introduce a regular expression into a :meth:`.Table.where` query.
For example, here we find all states that start with "C".
.. code-block:: python
import re
new_table = table.where(lambda row: re.match('^C', str(row['state'])))
This can also be useful for finding values that **don't** match your expectations. For example, finding all values in the "phone number" column that don't look like phone numbers:
.. code-block:: python
new_table = table.where(lambda row: not re.match('\d{3}-\d{3}-\d{4}', str(row['phone'])))
By glob
=======
Hate regexes? You can use glob (:mod:`fnmatch`) syntax too!
.. code-block:: python
from fnmatch import fnmatch
new_table = table.where(lambda row: fnmatch('C*', row['state']))
Values within a range
=====================
This snippet filters the dataset to incomes between 100,000 and 200,000.
.. code-block:: python
new_table = table.where(lambda row: 100000 < row['income'] < 200000)
Dates within a range
====================
This snippet filters the dataset to events during the summer of 2015:
.. code-block:: python
import datetime
new_table = table.where(lambda row: datetime.datetime(2015, 6, 1) <= row['date'] <= datetime.datetime(2015, 8, 31))
If you want to filter to events during the summer of any year:
.. code-block:: python
new_table = table.where(lambda row: 6 <= row['date'].month <= 8)
Top N percent
=============
To filter a dataset to the top 10% percent of values we first compute the percentiles for the column and then use the result in the :meth:`.Table.where` truth test:
.. code-block:: python
percentiles = table.aggregate(agate.Percentiles('salary'))
top_ten_percent = table.where(lambda r: r['salary'] >= percentiles[90])
Random sample
=============
By combining a random sort with limiting, we can effectively get a random sample from a table.
.. code-block:: python
import random
randomized = table.order_by(lambda row: random.random())
sampled = table.limit(10)
Ordered sample
==============
With can also get an ordered sample by simply using the :code:`step` parameter of the :meth:`.Table.limit` method to get every Nth row.
.. code-block:: python
sampled = table.limit(step=10)
Distinct values
===============
You can retrieve a distinct list of values in a column using :meth:`.Column.values_distinct` or :meth:`.Table.distinct`.
:meth:`.Table.distinct` returns the entire row so it's necessary to chain a select on the specific column.
.. code-block:: python
columns = ('value',)
rows = ([1],[2],[2],[5])
new_table = agate.Table(rows, columns)
new_table.columns['value'].values_distinct()
# or
new_table.distinct('value').columns['value'].values()
(Decimal('1'), Decimal('2'), Decimal('5'))
|