.. jupyter-execute::
    :hide-code:

    import set_working_directory

.. _apps:

**********************
Overview of using apps
**********************

Three top-level functions are the main way to find out which apps are installed, learn what an app does, and obtain an app for use. They are:

- ``available_apps()`` (see :ref:`available_apps`)
- ``app_help()`` (see :ref:`app_help`)
- ``get_app()`` (see :ref:`get_app`)

Two other crucial concepts concern :ref:`data stores <data_stores>` and :ref:`tracking failures <not_completed>`.

.. _app_types:

Types of apps
=============

There are 3 types of apps:

#. loaders (by convention, names start with ``load_<data type>``)
#. writers (by convention, names start with ``write_<data type>``)
#. generic (no naming convention)

As their names imply, loaders load, writers write and generic apps do other operations on data.
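The naming convention can be sketched in a few lines of Python. This is only an illustration: ``app_type_from_name`` is a hypothetical helper, not part of ``cogent3``, and because the prefixes are a convention rather than a guarantee, the result is a guess.

```python
def app_type_from_name(name):
    """Guess an app's type from its name, using only the naming convention."""
    if name.startswith("load_"):
        return "loader"
    if name.startswith("write_"):
        return "writer"
    # Generic apps have no naming convention, so everything else lands here
    return "generic"


# App names taken from the example later in this document
names = ["load_aligned", "take_codon_positions", "write_seqs"]
guessed = {name: app_type_from_name(name) for name in names}
```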

.. _app_composability:

Composability
=============

Most ``cogent3`` apps are "composable", meaning that multiple apps can be combined into a single function by addition. For example, say we have an app (``fit_model``) that performs a molecular evolutionary analysis on an alignment, and another app (``extract_stats``) that gets the statistics from the result. We could perform these steps sequentially as follows

.. code-block:: python
    
    fitted = fit_model(alignment)
    stats = extract_stats(fitted)

Composability allows us to simplify this as follows

.. code-block:: python
    
    app = fit_model + extract_stats
    stats = app(alignment)

We can have many more apps in a composed function than just the two shown here.
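To make composing by addition concrete, here is a minimal pure-Python sketch. It does not show how ``cogent3`` implements apps. The ``ToyApp`` class and the two stand-in functions are hypothetical, and serve only to show that ``a + b`` can produce a single callable that runs ``a`` and then ``b``.

```python
class ToyApp:
    """Hypothetical stand-in for a composable app (not the cogent3 class)."""

    def __init__(self, func):
        self.func = func

    def __call__(self, value):
        return self.func(value)

    def __add__(self, other):
        # Build a new app that feeds the output of self into other
        return ToyApp(lambda value: other(self(value)))


# Stand-ins for fit_model and extract_stats, invented for illustration only
fit_model = ToyApp(lambda aln: {"lnL": -float(len(aln)), "source": aln})
extract_stats = ToyApp(lambda fitted: fitted["lnL"])

alignment = "ACGTACGT"
app = fit_model + extract_stats

# The composed app gives the same answer as calling the steps in sequence
sequential = extract_stats(fit_model(alignment))
composed = app(alignment)
```

Because ``__add__`` returns another ``ToyApp``, the result can itself be added to further apps, which is what allows longer chains.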

.. _composability_rules:

Composability rules
-------------------

There are rules around app composition, starting with app types. Loaders and writers are special cases. If included, a loader must always be first, e.g.

.. code-block:: python
    
    app = a_loader + a_generic

If included, a writer must always be last, e.g.

.. code-block:: python
    
    app = a_generic + a_writer

Changing the order for either of the above will result in a ``TypeError``.

The next constraint on app composition is the input and output types of the apps involved. Each app defines the type of input it works on and the type of output it produces. For two apps to be composed, the output (or return) type of the app on the left (e.g. ``a_loader``) must overlap with the input type of the app on the right (e.g. ``a_generic``). If they don't overlap, a ``TypeError`` is raised.
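The rules above can be sketched with a toy class that checks them when two apps are added. Everything here is an assumption made for illustration: ``TypedToyApp``, its attributes, the type names and the error messages are hypothetical and do not reflect the ``cogent3`` internals.

```python
class TypedToyApp:
    """Hypothetical app that knows its app type and its input/output types."""

    def __init__(self, name, app_type, input_types, output_types):
        self.name = name
        self.app_type = app_type  # "loader", "writer" or "generic"
        self.input_types = frozenset(input_types)
        self.output_types = frozenset(output_types)
        self.ends_with_writer = app_type == "writer"

    def __add__(self, other):
        if other.app_type == "loader":
            raise TypeError(f"{other.name!r} is a loader and must be first")
        if self.ends_with_writer:
            raise TypeError(f"{self.name!r} ends with a writer, which must be last")
        if not self.output_types & other.input_types:
            raise TypeError(
                f"output of {self.name!r} does not overlap input of {other.name!r}"
            )
        # The composed app accepts the left input and produces the right output
        composed = TypedToyApp(
            f"{self.name} + {other.name}",
            self.app_type,
            self.input_types,
            other.output_types,
        )
        composed.ends_with_writer = other.ends_with_writer
        return composed


a_loader = TypedToyApp("a_loader", "loader", {"str"}, {"Alignment"})
a_generic = TypedToyApp("a_generic", "generic", {"Alignment"}, {"Alignment"})
a_writer = TypedToyApp("a_writer", "writer", {"Alignment"}, {"DataMember"})
needs_tree = TypedToyApp("needs_tree", "generic", {"Tree"}, {"Tree"})

ok = a_loader + a_generic + a_writer  # valid order, overlapping types

errors = []
for left, right in [
    (a_generic, a_loader),  # loader is not first
    (a_writer, a_generic),  # writer is not last
    (a_loader, needs_tree),  # output and input types do not overlap
]:
    try:
        left + right
    except TypeError as err:
        errors.append(str(err))
```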

An example
==========

.. jupyter-execute::
    :hide-code:

    from pathlib import Path
    from tempfile import TemporaryDirectory
    
    tmpdir = TemporaryDirectory(dir=".")
    path_to_dir = tmpdir.name

I illustrate the general approach with a simple example: extracting third codon positions. As I'm defining a writer, I also need to define the destination (a directory in this case) that it will write to.

.. jupyter-execute::

    from cogent3 import get_app, open_data_store

    out_dstore = open_data_store(path_to_dir, suffix="fa", mode="w")

    loader = get_app("load_aligned", format="fasta", moltype="dna")
    cpos3 = get_app("take_codon_positions", 3)
    writer = get_app("write_seqs", out_dstore, format="fasta")

Using apps sequentially like functions
--------------------------------------

.. jupyter-execute::

    data = loader("data/primate_brca1.fasta")
    just3rd = cpos3(data)
    m = writer(just3rd)

The resulting alignment ``just3rd`` will be written into the ``out_dstore`` directory in fasta format with the same filename as the original data (``"primate_brca1.fasta"``).

.. note:: ``m`` is a ``DataMember`` (:ref:`described here <data_member>`).

Composing a multi-step process from several apps
------------------------------------------------

We can make this simpler by creating a single composed function.

.. jupyter-execute::

    process = loader + cpos3 + writer
    m = process("data/primate_brca1.fasta")

Applying a process to multiple data records
-------------------------------------------

We use a data store to identify all data files in a directory that we want to analyse. ``process`` can then be applied to all records in the data store without having to loop.

.. jupyter-execute::

    dstore = open_data_store("data", suffix="fasta", mode="r")
    result = process.apply_to(dstore)

.. note:: ``result`` is ``out_dstore``.

Other important features
========================

The settings and data analysed will be logged
---------------------------------------------

A log file will be written into the same data store as the output. The log includes information on the conditions under which the analysis was run and fingerprints of all input and output files.

.. jupyter-execute::

    out_dstore.summary_logs

Failures are recorded
---------------------

Any "failures" (see :ref:`not_completed`) are saved. The data store class provides methods for interrogating those. First, a general summary of the output data store indicates we have 6 records that did not complete.

.. jupyter-execute::

    out_dstore.describe

These occur in this example primarily because some of the files contain sequences that are not aligned.

.. jupyter-execute::

    out_dstore.summary_not_completed
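The idea of recording failures rather than stopping can be sketched in plain Python. This is a rough illustration under stated assumptions, not the ``cogent3`` implementation: ``ToyNotCompleted``, ``take_third_positions`` and ``run`` are hypothetical names, and each record is modelled as a ``(name, sequences)`` tuple.

```python
class ToyNotCompleted:
    """Hypothetical record of a failed step; falsy so it is easy to detect."""

    def __init__(self, origin, message, source):
        self.origin = origin  # which step failed
        self.message = message  # why it failed
        self.source = source  # which input record it came from

    def __bool__(self):
        return False


def take_third_positions(record):
    """Keep every third character, or report failure if lengths differ."""
    name, seqs = record
    if len({len(s) for s in seqs}) != 1:
        return ToyNotCompleted(
            "take_third_positions", "sequences not aligned", name
        )
    return name, [s[2::3] for s in seqs]


def run(records):
    """Apply the step to every record, keeping failures instead of raising."""
    completed, not_completed = [], []
    for record in records:
        result = take_third_positions(record)
        (completed if result else not_completed).append(result)
    return completed, not_completed


records = [
    ("aligned.fasta", ["ATGCCA", "ATGCCG"]),
    ("ragged.fasta", ["ATGCCA", "ATG"]),
]
completed, not_completed = run(records)
```

Here one unaligned record does not interrupt the others, and the failure keeps enough information (origin, message, source) to be summarised afterwards.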

You can track progress
----------------------

.. jupyter-execute::

    result = process.apply_to(dstore, show_progress=True)

You can do parallel computation
-------------------------------

.. code-block:: python

    result = process.apply_to(dstore, parallel=True)

By default, this will use all available processors on your machine. (See :ref:`parallel` for more details plus how to take advantage of multiple machines using MPI.)

All of the above
----------------

.. code-block:: python

    process.apply_to(dstore, parallel=True, show_progress=True)