File: loading_tabular.rst

.. jupyter-execute::
    :hide-code:

    import set_working_directory

Loading a csv file
==================

We load a tab-separated data file using the ``load_table()`` function. The format is inferred from the filename suffix; note that, despite the section title, this is not actually a ``csv`` file.

.. jupyter-execute::

    from cogent3 import load_table

    table = load_table("data/stats.tsv")
    table

.. note:: The known filename suffixes for reading are ``.csv``, ``.tsv`` and ``.pkl`` or ``.pickle`` (Python's pickle format).
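
As a rough illustration of what inferring the format from the suffix amounts to, the following standard-library sketch maps a suffix to a delimiter. The names ``SUFFIX_TO_SEP`` and ``infer_sep()`` are hypothetical and are not part of ``cogent3``; the sketch only covers the two delimited suffixes listed above and assumes an explicit ``sep`` takes priority.

```python
from pathlib import Path

# Hypothetical lookup table covering only the delimited suffixes named above.
SUFFIX_TO_SEP = {".csv": ",", ".tsv": "\t"}


def infer_sep(path, sep=None):
    """Return the delimiter for path, preferring an explicit sep if given."""
    if sep is not None:
        return sep
    suffix = Path(path).suffix
    if suffix not in SUFFIX_TO_SEP:
        raise ValueError(f"cannot infer a delimiter from suffix {suffix!r}")
    return SUFFIX_TO_SEP[suffix]


print(repr(infer_sep("data/stats.tsv")))
print(repr(infer_sep("data/stats.txt", sep="\t")))
```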

.. note:: If you invoke the static column types argument, i.e. ``load_table(..., static_column_types=True)``, and the column data are not static, those columns will be left as a string type.
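
The fallback to strings can be pictured with a small sketch. It reads "static" as "every value in the column converts to the same type", which is an assumption, and ``cast_column()`` is a hypothetical helper rather than anything ``cogent3`` provides.

```python
def cast_column(values):
    """Cast a column of strings to int or float if every value allows it."""
    for cast in (int, float):
        try:
            return [cast(v) for v in values]
        except ValueError:
            # at least one value did not convert, try the next type
            continue
    # mixed content: leave the column as strings
    return list(values)


print(cast_column(["1", "2", "3"]))
print(cast_column(["1.5", "2"]))
print(cast_column(["1", "n/a"]))
```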

Loading from a URL
==================

The ``cogent3`` load functions support loading from a URL. We load the above ``.tsv`` file directly from GitHub.

.. jupyter-execute::

    from cogent3 import load_table

    table = load_table("https://raw.githubusercontent.com/cogent3/cogent3/develop/doc/data/stats.tsv")

Loading delimited specifying the format
=======================================

Although unnecessary in this case, it's possible to override the suffix by specifying the delimiter using the ``sep`` argument.

.. jupyter-execute::

    from cogent3 import load_table

    table = load_table("data/stats.tsv", sep="\t")
    table

Loading delimited data without a header line
============================================

To create a table from the following examples, specify your own header and use ``make_table()``.

Using ``load_delimited()``
--------------------------

This is a standard parsing function that does not filter rows or convert elements to non-string types.

.. jupyter-execute::

    from cogent3.parse.table import load_delimited

    header, rows, title, legend = load_delimited("data/CerebellumDukeDNaseSeq.pk", header=False, sep="\t")
    rows[:4]

Using ``FilteringParser``
-------------------------

.. jupyter-execute::

    from cogent3.parse.table import FilteringParser
    
    reader = FilteringParser(with_header=False, sep="\t")
    rows = list(reader("data/CerebellumDukeDNaseSeq.pk"))
    rows[:4]
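
To show how headerless rows and a caller-supplied header fit together, here is a minimal sketch using only the standard library ``csv`` module. The data and the column names ``chrom``, ``start`` and ``end`` are invented for illustration and do not describe the contents of the file used above.

```python
import csv
import io

# Invented three-column, tab-delimited content with no header line.
raw = "chr1\t100\t200\nchr1\t300\t450\nchr2\t50\t75\n"

# Every element comes back as a string, as with the parsers above.
rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))

# Column names are supplied by the caller because the data has none.
header = ["chrom", "start", "end"]
columns = {name: [row[i] for row in rows] for i, name in enumerate(header)}

print(rows[0])
print(columns["start"])
```

A ``header`` and ``rows`` pair of this shape is what ``make_table(header=..., data=...)`` takes in the sections below.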

Selectively loading parts of a big file
=======================================

Loading a set number of lines from a file
-----------------------------------------

The ``limit`` argument specifies the number of lines to read.

.. jupyter-execute::

    from cogent3 import load_table

    table = load_table("data/stats.tsv", limit=2)
    table
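
The benefit of a line limit is that the rest of the file is never read. The sketch below shows the general idea with ``itertools.islice`` on invented data. It counts data rows only, after the header; whether ``cogent3`` counts the header line toward ``limit`` is not stated here, so treat that detail as an assumption.

```python
import csv
import io
from itertools import islice

# Invented tab-delimited content with a header line.
raw = "Locus\tRatio\nlocus_a\t1.5\nlocus_b\t12.0\nlocus_c\t3.2\n"

reader = csv.reader(io.StringIO(raw), delimiter="\t")
header = next(reader)

# islice stops pulling from the reader after two data rows.
rows = list(islice(reader, 2))

print(header)
print(rows)
```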

Loading only some rows
----------------------

If you only want a subset of the contents of a file, use ``FilteringParser``. It skips lines based on a callback function: a row is kept when the callback returns ``True``. We illustrate this with ``stats.tsv``, skipping any rows with ``Ratio > 10``.

.. jupyter-execute::

    from cogent3.parse.table import FilteringParser

    reader = FilteringParser(
        lambda line: float(line[2]) <= 10, with_header=True, sep="\t"
    )
    table = load_table("data/stats.tsv", reader=reader, digits=1)
    table

You can also ``negate`` a condition, which is useful if the condition is complex. In this example, it means keep the rows for which ``Ratio > 10``.

.. jupyter-execute::

    reader = FilteringParser(
        lambda line: float(line[2]) <= 10, with_header=True, sep="\t", negate=True
    )
    table = load_table("data/stats.tsv", reader=reader, digits=1)
    table
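
The relationship between the callback and ``negate`` can be sketched in plain Python. ``filter_rows()`` is a hypothetical stand-in, not the ``FilteringParser`` implementation, and the rows are invented; only the column order (Locus, Region, Ratio) follows the examples above.

```python
def filter_rows(rows, condition, negate=False):
    """Yield rows where condition(row) is true, or false when negate is set."""
    for row in rows:
        if bool(condition(row)) != negate:
            yield row


# Invented rows, all strings, as a parser would return them.
rows = [
    ["locus_a", "Con", "2.5"],
    ["locus_b", "NonCon", "14.0"],
    ["locus_c", "Con", "10.0"],
]


def ratio_at_most_10(row):
    return float(row[2]) <= 10


kept = list(filter_rows(rows, ratio_at_most_10))
inverted = list(filter_rows(rows, ratio_at_most_10, negate=True))

print([row[0] for row in kept])
print([row[0] for row in inverted])
```

Writing the simple condition once and flipping it with ``negate`` avoids having to restate a complicated condition in its inverted form.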

Loading only some columns
-------------------------

Specify the columns by their names.

.. jupyter-execute::

    from cogent3.parse.table import FilteringParser

    reader = FilteringParser(columns=["Locus", "Ratio"], with_header=True, sep="\t")
    table = load_table("data/stats.tsv", reader=reader)
    table

Or, by their index.

.. jupyter-execute::

    from cogent3.parse.table import FilteringParser

    reader = FilteringParser(columns=[0, -1], with_header=True, sep="\t")
    table = load_table("data/stats.tsv", reader=reader)
    table

.. note:: The ``negate`` argument does not affect the columns evaluated.
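
Selecting by name and selecting by index are two ways of writing the same thing, as this sketch suggests. ``select_columns()`` is a hypothetical helper and the rows are invented; it assumes names are resolved against the header and that negative indices count from the end, as in ordinary Python indexing.

```python
def select_columns(header, rows, columns):
    """Keep only the requested columns, given as names or integer indices."""
    indices = [c if isinstance(c, int) else header.index(c) for c in columns]
    new_header = [header[i] for i in indices]
    new_rows = [[row[i] for i in indices] for row in rows]
    return new_header, new_rows


header = ["Locus", "Region", "Ratio"]
rows = [["locus_a", "Con", "2.5"], ["locus_b", "NonCon", "14.0"]]

by_name = select_columns(header, rows, ["Locus", "Ratio"])
by_index = select_columns(header, rows, [0, -1])

print(by_name)
print(by_name == by_index)
```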

Load raw data as a list of lists of strings
-------------------------------------------

We just use ``FilteringParser``.

.. jupyter-execute::

    from cogent3.parse.table import FilteringParser

    reader = FilteringParser(with_header=True, sep="\t")
    data = list(reader("data/stats.tsv"))

We just display the first two lines.

.. jupyter-execute::

    data[:2]

.. note:: The individual elements are all ``str``.

Make a table from header and rows
=================================

.. jupyter-execute::

    from cogent3 import make_table

    header = ["A", "B", "C"]
    rows = [range(3), range(3, 6), range(6, 9), range(9, 12)]
    table = make_table(header=header, data=rows)
    table

Make a table from a ``dict``
============================

For a ``dict`` with keys as column headers.

.. jupyter-execute::

    from cogent3 import make_table

    data = dict(A=[0, 3, 6], B=[1, 4, 7], C=[2, 5, 8])
    table = make_table(data=data)
    table

Specify the column order when creating from a ``dict``
======================================================

.. jupyter-execute::

    table = make_table(header=["C", "A", "B"], data=data)
    table
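
To see what the column order means for the rows, the following sketch transposes the same ``dict`` of columns into rows by hand. It only illustrates the resulting layout and makes no claim about how ``make_table()`` stores the data internally.

```python
data = dict(A=[0, 3, 6], B=[1, 4, 7], C=[2, 5, 8])
header = ["C", "A", "B"]

# zip(*columns) transposes the column lists, taken in header order, into rows.
rows = [list(row) for row in zip(*(data[name] for name in header))]

print(header)
print(rows)
```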

Create the table with an index
==============================

A ``Table`` can be indexed like a dict if you designate a column as the index (and that column has a unique value for every row).

.. jupyter-execute::

    table = load_table("data/stats.tsv", index_name="Locus")
    table["NP_055852"]

.. jupyter-execute::

    table["NP_055852", "Region"]

.. note:: The ``index_name`` argument also applies when using ``make_table()``.
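
The uniqueness requirement follows from the ``dict``-like lookup: two rows with the same index value could not both be returned for one key. A minimal sketch of the idea, with a hypothetical ``build_index()`` helper and invented rows:

```python
def build_index(header, rows, index_name):
    """Map each value in the index column to its row as a name -> value dict."""
    position = header.index(index_name)
    index = {}
    for row in rows:
        key = row[position]
        if key in index:
            raise ValueError(f"duplicate value {key!r} in column {index_name!r}")
        index[key] = dict(zip(header, row))
    return index


header = ["Locus", "Region", "Ratio"]
rows = [["locus_a", "Con", 2.5], ["locus_b", "NonCon", 14.0]]

index = build_index(header, rows, "Locus")
print(index["locus_b"])
print(index["locus_b"]["Region"])
```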

Create a table from a ``pandas.DataFrame``
==========================================

.. jupyter-execute::

    from pandas import DataFrame

    from cogent3 import make_table

    data = dict(a=[0, 3], b=["a", "c"])
    df = DataFrame(data=data)
    table = make_table(data_frame=df)
    table

Create a table from header and rows
===================================

.. jupyter-execute::

    from cogent3 import make_table

    table = make_table(header=["a", "b"], data=[[0, "a"], [3, "c"]])
    table

Create a table from dict
========================

``make_table()`` is the utility function for creating ``Table`` objects from standard Python objects.

.. jupyter-execute::

    from cogent3 import make_table

    data = dict(a=[0, 3], b=["a", "c"])
    table = make_table(data=data)
    table

Create a table from a 2D dict
=============================

.. jupyter-execute::

    from cogent3 import make_table

    d2D = {
        "edge.parent": {
            "NineBande": "root",
            "edge.1": "root",
            "DogFaced": "root",
            "Human": "edge.0",
        },
        "x": {
            "NineBande": 1.0,
            "edge.1": 1.0,
            "DogFaced": 1.0,
            "Human": 1.0,
        },
        "length": {
            "NineBande": 4.0,
            "edge.1": 4.0,
            "DogFaced": 4.0,
            "Human": 4.0,
        },
    }
    table = make_table(
        data=d2D,
    )
    table

Create a table that has complex Python objects as elements
==========================================================

.. jupyter-execute::

    from cogent3 import make_table

    table = make_table(
        header=["abcd", "data"],
        data=[[range(1, 6), "0"], ["x", 5.0], ["y", None]],
        missing_data="*",
        digits=1,
    )
    table

Create an empty table
=====================

.. jupyter-execute::

    from cogent3 import make_table

    table = make_table()
    table