File: filter.rst

package info (click to toggle)
python-agate 1.13.0-1~exp1
  • links: PTS, VCS
  • area: main
  • in suites: experimental
  • size: 2,008 kB
  • sloc: python: 8,578; makefile: 126
file content (108 lines) | stat: -rw-r--r-- 2,901 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
===========
Filter rows
===========

By regex
========

You can use Python's builtin :mod:`re` module to introduce a regular expression into a :meth:`.Table.where` query.

For example, here we find all states that start with "C".

.. code-block:: python

    import re

    new_table = table.where(lambda row: re.match('^C', str(row['state'])))

This can also be useful for finding values that **don't** match your expectations. For example, finding all values in the "phone number" column that don't look like phone numbers:

.. code-block:: python

    new_table = table.where(lambda row: not re.match('\d{3}-\d{3}-\d{4}', str(row['phone'])))

By glob
=======

Hate regexes? You can use glob (:mod:`fnmatch`) syntax too!

.. code-block:: python

    from fnmatch import fnmatch

    new_table = table.where(lambda row: fnmatch('C*', row['state']))

Values within a range
=====================

This snippet filters the dataset to incomes between 100,000 and 200,000.

.. code-block:: python

    new_table = table.where(lambda row: 100000 < row['income'] < 200000)

Dates within a range
====================

This snippet filters the dataset to events during the summer of 2015:

.. code-block:: python

    import datetime

    new_table = table.where(lambda row: datetime.datetime(2015, 6, 1) <= row['date'] <= datetime.datetime(2015, 8, 31))

If you want to filter to events during the summer of any year:

.. code-block:: python

    new_table = table.where(lambda row: 6 <= row['date'].month <= 8)

Top N percent
=============

To filter a dataset to the top 10% percent of values we first compute the percentiles for the column and then use the result in the :meth:`.Table.where` truth test:

.. code-block:: python

    percentiles = table.aggregate(agate.Percentiles('salary'))
    top_ten_percent = table.where(lambda r: r['salary'] >= percentiles[90])

Random sample
=============

By combining a random sort with limiting, we can effectively get a random sample from a table.

.. code-block:: python

    import random

    randomized = table.order_by(lambda row: random.random())
    sampled = table.limit(10)

Ordered sample
==============

With can also get an ordered sample by simply using the :code:`step` parameter of the :meth:`.Table.limit` method to get every Nth row.

.. code-block:: python

    sampled = table.limit(step=10)

Distinct values
===============

You can retrieve a distinct list of values in a column using :meth:`.Column.values_distinct` or :meth:`.Table.distinct`.

:meth:`.Table.distinct` returns the entire row so it's necessary to chain a select on the specific column.

.. code-block:: python

    columns = ('value',)
    rows = ([1],[2],[2],[5])
    new_table = agate.Table(rows, columns)

    new_table.columns['value'].values_distinct()
    # or
    new_table.distinct('value').columns['value'].values()
    (Decimal('1'), Decimal('2'), Decimal('5'))