File: loading_tabular.rst

package info (click to toggle)
python-cogent 2024.5.7a1%2Bdfsg-3
links: PTS, VCS
area: main
in suites: sid
size: 74,600 kB
sloc: python: 92,479; makefile: 117; sh: 16
file content (299 lines) | stat: -rw-r--r-- 7,615 bytes
parent folder | download | duplicates (2)
.. jupyter-execute::
    :hide-code:

    import set_working_directory

Loading a csv file
==================

We load a tab separated data file using the ``load_table()`` function. The format is inferred from the filename suffix and you will note, in this case, it's not actually a `csv` file.

.. jupyter-execute::

    from cogent3 import load_table

    table = load_table("data/stats.tsv")
    table

.. note:: The known filename suffixes for reading are ``.csv``, ``.tsv`` and ``.pkl`` or ``.pickle`` (Python's pickle format).

.. note:: If you invoke the static column types argument, i.e.``load_table(..., static_column_types=True)`` and the column data are not static, those columns will be left as a string type.

Loading from a url
==================

The ``cogent3`` load functions support loading from a url. We load the above ``.tsv`` file directly from GitHub.

.. jupyter-execute::

    from cogent3 import load_table

    table = load_table("https://raw.githubusercontent.com/cogent3/cogent3/develop/doc/data/stats.tsv")

Loading delimited specifying the format
=======================================

Although unnecessary in this case, it's possible to override the suffix by specifying the delimiter using the ``sep`` argument.

.. jupyter-execute::

    from cogent3 import load_table

    table = load_table("data/stats.tsv", sep="\t")
    table

Loading delimited data without a header line
============================================

To create a table from the follow examples, you specify your header and use ``make_table()``.

Using ``load_delimited()``
--------------------------

This is just a standard parsing function which does not do any filtering or converting elements to non-string types.

.. jupyter-execute::

    from cogent3.parse.table import load_delimited

    header, rows, title, legend = load_delimited("data/CerebellumDukeDNaseSeq.pk", header=False, sep="\t")
    rows[:4]

Using ``FilteringParser``
-------------------------

.. jupyter-execute::

    from cogent3.parse.table import FilteringParser
    
    reader = FilteringParser(with_header=False, sep="\t")
    rows = list(reader("data/CerebellumDukeDNaseSeq.pk"))
    rows[:4]

Selectively loading parts of a big file
=======================================

Loading a set number of lines from a file
-----------------------------------------

The ``limit`` argument specifies the number of lines to read.

.. jupyter-execute::

    from cogent3 import load_table

    table = load_table("data/stats.tsv", limit=2)
    table

Loading only some rows
----------------------

If you only want a subset of the contents of a file, use the ``FilteringParser``. This allows skipping certain lines by using a callback function. We illustrate this with ``stats.tsv``, skipping any rows with ``"Ratio"`` > 10.

.. jupyter-execute::

    from cogent3.parse.table import FilteringParser

    reader = FilteringParser(
        lambda line: float(line[2]) <= 10, with_header=True, sep="\t"
    )
    table = load_table("data/stats.tsv", reader=reader, digits=1)
    table

You can also ``negate`` a condition, which is useful if the condition is complex. In this example, it means keep the rows for which ``Ratio > 10``.

.. jupyter-execute::

    reader = FilteringParser(
        lambda line: float(line[2]) <= 10, with_header=True, sep="\t", negate=True
    )
    table = load_table("data/stats.tsv", reader=reader, digits=1)
    table

Loading only some columns
-------------------------

Specify the columns by their names.

.. jupyter-execute::

    from cogent3.parse.table import FilteringParser

    reader = FilteringParser(columns=["Locus", "Ratio"], with_header=True, sep="\t")
    table = load_table("data/stats.tsv", reader=reader)
    table

Or, by their index.

.. jupyter-execute::

    from cogent3.parse.table import FilteringParser

    reader = FilteringParser(columns=[0, -1], with_header=True, sep="\t")
    table = load_table("data/stats.tsv", reader=reader)
    table

.. note:: The ``negate`` argument does not affect the columns evaluated.

Load raw data as a list of lists of strings
-------------------------------------------

We just use ``FilteringParser``.

.. jupyter-execute::

    from cogent3.parse.table import FilteringParser

    reader = FilteringParser(with_header=True, sep="\t")
    data = list(reader("data/stats.tsv"))

We just display the first two lines.

.. jupyter-execute::

    data[:2]

.. note:: The individual elements are all ``str``.

Make a table from header and rows
=================================

.. jupyter-execute::

    from cogent3 import make_table

    header = ["A", "B", "C"]
    rows = [range(3), range(3, 6), range(6, 9), range(9, 12)]
    table = make_table(header=["A", "B", "C"], data=rows)
    table

Make a table from a ``dict``
============================

For a ``dict`` with key's as column headers.

.. jupyter-execute::

    from cogent3 import make_table

    data = dict(A=[0, 3, 6], B=[1, 4, 7], C=[2, 5, 8])
    table = make_table(data=data)
    table

Specify the column order when creating from a ``dict``.
=======================================================

.. jupyter-execute::

    table = make_table(header=["C", "A", "B"], data=data)
    table

Create the table with an index
==============================

A ``Table`` can be indexed like a dict if you designate a column as the index (and that column has a unique value for every row).

.. jupyter-execute::

    table = load_table("data/stats.tsv", index_name="Locus")
    table["NP_055852"]

.. jupyter-execute::

    table["NP_055852", "Region"]

.. note:: The ``index_name`` argument also applies when using ``make_table()``.

Create a table from a ``pandas.DataFrame``
==========================================

.. jupyter-execute::

    from pandas import DataFrame

    from cogent3 import make_table

    data = dict(a=[0, 3], b=["a", "c"])
    df = DataFrame(data=data)
    table = make_table(data_frame=df)
    table

Create a table from header and rows
===================================

.. jupyter-execute::

    from cogent3 import make_table

    table = make_table(header=["a", "b"], data=[[0, "a"], [3, "c"]])
    table

Create a table from dict
========================

``make_table()`` is the utility function for creating ``Table`` objects from standard python objects.

.. jupyter-execute::

    from cogent3 import make_table

    data = dict(a=[0, 3], b=["a", "c"])
    table = make_table(data=data)
    table

Create a table from a 2D dict
=============================

.. jupyter-execute::

    from cogent3 import make_table

    d2D = {
        "edge.parent": {
            "NineBande": "root",
            "edge.1": "root",
            "DogFaced": "root",
            "Human": "edge.0",
        },
        "x": {
            "NineBande": 1.0,
            "edge.1": 1.0,
            "DogFaced": 1.0,
            "Human": 1.0,
        },
        "length": {
            "NineBande": 4.0,
            "edge.1": 4.0,
            "DogFaced": 4.0,
            "Human": 4.0,
        },
    }
    table = make_table(
        data=d2D,
    )
    table

Create a table that has complex python objects as elements
==========================================================

.. jupyter-execute::

    from cogent3 import make_table

    table = make_table(
        header=["abcd", "data"],
        data=[[range(1, 6), "0"], ["x", 5.0], ["y", None]],
        missing_data="*",
        digits=1,
    )
    table

Create an empty table
=====================

.. jupyter-execute::

    from cogent3 import make_table

    table = make_table()
    table