File: write-seqs.rst

.. jupyter-execute::
    :hide-code:

    import set_working_directory

Writing sequences and sequence alignments
-----------------------------------------

Writing a sequence alignment to disk
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Writing an alignment to disk is easily achieved with the ``write_seqs`` app. First, let's load the alignment we want to write.

.. jupyter-execute::
    :raises:

    from cogent3 import get_app

    load_aligned_app = get_app("load_aligned", moltype="dna", format="fasta")
    aln = load_aligned_app("data/primate_brca1.fasta")
    aln

When creating the ``write_seqs`` app, we must provide the data store to which the data will be written. Optionally, we can also specify the format in which the sequences should be written.

.. jupyter-execute::
    :hide-code:

    from tempfile import TemporaryDirectory

    tmpdir = TemporaryDirectory(dir=".")
    path_to_dir = tmpdir.name

.. jupyter-execute::
    :raises:

    from cogent3 import get_app, open_data_store

    seq_dstore = open_data_store(path_to_dir, suffix="phylip", mode="w")

    write_seqs_app = get_app("write_seqs", data_store=seq_dstore, format="phylip")
    result = write_seqs_app(aln)

    result.read()[:50]

Writing many sequence collections to a data store
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Typically, the final step of a data-processing pipeline is writing out the filtered data. When ``write_seqs`` is composed into a process, that process writes multiple sequence collections to a data store.

.. jupyter-execute::
    :hide-code:

    from tempfile import TemporaryDirectory

    tmpdir = TemporaryDirectory(dir=".")
    path_to_dir = tmpdir.name

We can create our input data store, containing all files in the data directory with the ``.fasta`` suffix, using ``open_data_store``.

.. jupyter-execute::
    :raises:

    from cogent3 import open_data_store

    fasta_seq_dstore = open_data_store("data", suffix="fasta", mode="r")

Let's define a process. In this example, our process loads the sequences, keeps only those that are translatable, translates them, and writes the translated sequences to a data store.

.. jupyter-execute::
    :raises:
    
    from cogent3 import get_app, open_data_store

    out_dstore = open_data_store(path_to_dir, suffix="fa", mode="w")

    loader = get_app("load_unaligned", format="fasta", moltype="dna")
    keep_translatable = get_app("select_translatable")
    translate = get_app("translate_seqs")
    writer = get_app("write_seqs", out_dstore, format="fasta")

    process = loader + keep_translatable + translate + writer

.. tip:: When running this code on your machine, remember to replace ``path_to_dir`` with an actual directory path.

We apply ``process`` to our input data store, and assign the resulting data store to ``result``. 

.. jupyter-execute::
    :raises:

    result = process.apply_to(fasta_seq_dstore)

Accessing an overview of our process
""""""""""""""""""""""""""""""""""""

We can interrogate ``result`` to see an overview of the process. 

.. jupyter-execute::
    :raises:

    result.describe

The process was successfully applied to ten data files. However, for three files the process did not complete. We can see a summary of the failures by accessing the ``summary_not_completed`` property.

.. jupyter-execute::
    :raises:

    result.summary_not_completed

It looks like the first two failed because they contain protein sequences, while ``load_unaligned`` expected DNA sequences.

Interestingly, another file failed at the ``keep_translatable`` step. By design, these failures did not stop the rest of the pipeline from running. Instead, the data store collects the :ref:`NotCompleted objects <not_completed>`, which store traceback information, allowing you to interrogate any failures.
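
To illustrate what these records carry, we can construct a ``NotCompleted`` instance by hand (the origin, message, and source below are hypothetical values for demonstration; in a real pipeline, cogent3 creates these records for you whenever a step fails).

.. jupyter-execute::
    :raises:

    from cogent3.app.composable import NotCompleted

    # a hand-made record for illustration only; pipelines create these
    # automatically when a step fails
    nc = NotCompleted(
        "ERROR",
        "select_translatable",
        "sequences contained stop codons",
        source="demo.fasta",
    )

    # NotCompleted objects evaluate as False, which is how downstream
    # apps know to skip them
    assert not nc
    nc.type, nc.origin, nc.message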