.. jupyter-execute::
    :hide-code:

    import set_working_directory

Loading aligned sequence data
-----------------------------

We can load aligned sequence data using the ``load_aligned`` app. When creating the app, you can optionally specify the molecular type of the sequences and the format of the data.

Loading aligned DNA sequences from a single fasta file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here we load an alignment of the BRCA1 gene from bats, providing the molecular type (``moltype="dna"``) and file format (``format="fasta"``).

.. jupyter-execute::
    :raises:
    
    from cogent3 import get_app

    load_aligned_app = get_app("load_aligned", moltype="dna", format="fasta")
    aln = load_aligned_app("data/brca1-bats.fasta")
    aln

Loading aligned protein sequences from a single phylip file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here we load a globin alignment, providing the molecular type (``moltype="protein"``) and file format (``format="phylip"``). 

.. jupyter-execute::
    :raises:
    
    from cogent3 import get_app

    load_aligned_app = get_app("load_aligned", moltype="protein", format="phylip")
    aln = load_aligned_app("data/abglobin_aa.phylip")
    aln

Loading aligned DNA sequences from multiple fasta files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the examples above, each result is a single alignment, which could equally have been loaded with the standard ``cogent3`` function ``load_aligned_seqs()``. The real power of apps lies in batch processing large numbers of files.

To apply apps to multiple files, we need to set up two things:

.. jupyter-execute::
    :hide-code:

    from tempfile import TemporaryDirectory

    tmpdir = TemporaryDirectory(dir=".")
    path_to_dir = tmpdir.name

1. A data store that identifies the files we are interested in 
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Here, we create a data store containing all the files with the ".fasta" suffix in the ``data`` directory, limiting the data store to two members to keep the example minimal.

.. jupyter-execute::
    :raises:

    from cogent3 import open_data_store

    fasta_seq_dstore = open_data_store("data", suffix="fasta", mode="r", limit=2)
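
As a rough mental model only (this is plain Python, not how ``cogent3`` implements data stores), selecting members by suffix with a limit resembles globbing a directory and truncating the result. The helper ``list_members`` and the file names below are hypothetical and exist purely for illustration.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def list_members(directory, suffix, limit=None):
    """Return sorted names of files ending in .<suffix>, at most `limit` of them."""
    names = sorted(p.name for p in Path(directory).glob(f"*.{suffix}"))
    return names if limit is None else names[:limit]


# Build a throwaway directory so the sketch is self-contained.
with TemporaryDirectory() as tmp:
    for name in ("a.fasta", "b.fasta", "c.fasta", "notes.txt"):
        (Path(tmp) / name).write_text(">s1\nACGT\n")
    members = list_members(tmp, suffix="fasta", limit=2)

print(members)
```

Only files matching the suffix are considered, and the limit caps how many of them become members; the non-matching ``notes.txt`` is ignored.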

2. A composed process that defines our workflow 
"""""""""""""""""""""""""""""""""""""""""""""""

In this example, our process loads the sequences, keeps only the third codon position, and then writes the filtered sequences to a data store.

.. note:: Apps that are "writers" require a data store to write to. Learn more about writers :ref:`here <writers>`.

.. jupyter-execute::
    :raises:
    
    from cogent3 import get_app, open_data_store

    out_dstore = open_data_store(path_to_dir, suffix="fa", mode="w")

    loader = get_app("load_aligned", format="fasta", moltype="dna")
    cpos3 = get_app("take_codon_positions", 3)
    writer = get_app("write_seqs", out_dstore, format="fasta")

    process = loader + cpos3 + writer

.. tip:: When running this code on your machine, remember to replace ``path_to_dir`` with an actual directory path.
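
To make the middle step of the workflow concrete: keeping the third codon position amounts to keeping every third column of an in-frame alignment. The following is a plain-Python sketch on made-up sequences, assuming the alignment starts in frame; ``third_positions`` is a hypothetical helper and not part of ``cogent3``.

```python
def third_positions(seq):
    """Return the characters at codon position 3 (indices 2, 5, 8, ...)."""
    return seq[2::3]


# Made-up, in-frame aligned sequences for illustration only.
aligned = {
    "seq1": "ATGGCCTTA",
    "seq2": "ATGGCTTTG",
}

filtered = {name: third_positions(seq) for name, seq in aligned.items()}
print(filtered)
```

Each filtered sequence is one third the length of its input, and because the same columns are kept for every sequence, the result is still an alignment.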

Now we're good to go: we can apply ``process`` to our data store!
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

``result`` is a data store whose members are our filtered alignments; you can index it to access individual members. We can take a closer look by calling the ``.read()`` method on a member (here truncated to the first 50 characters).

.. jupyter-execute::
    :raises:

    result = process.apply_to(fasta_seq_dstore)
    print(result[0].read()[:50])