# File: taxdump.py

"""
Taxdump format (:mod:`skbio.io.format.taxdump`)
===============================================

.. currentmodule:: skbio.io.format.taxdump

The NCBI Taxonomy database dump (``taxdump``) format stores information about
organism names, classifications and other properties. It is a tabular format
in which columns are separated by ``<tab><pipe><tab>`` and each line ends with
``<tab><pipe>``. The file name usually ends with ``.dmp``.

Format Support
--------------
**Has Sniffer: No**

+------+------+---------------------------------------------------------------+
|Reader|Writer|                          Object Class                         |
+======+======+===============================================================+
|Yes   |No    |:class:`pandas.DataFrame`                                      |
+------+------+---------------------------------------------------------------+

Format Specification
--------------------
**State: Experimental as of 0.5.8.**

The NCBI taxonomy database [1]_ [2]_ hosts organism names and classifications.
It has a web portal [3]_ and an FTP download server [4]_. It is also accessible
using E-utilities [5]_. The database is updated daily, and an archive is
generated every month. The data release has the file name ``taxdump``. It
consists of multiple ``.dmp`` files. These files serve different purposes, but
they follow a common format pattern:

- It is a tabular format.
- Column delimiter is ``<tab><pipe><tab>``.
- Line end is ``<tab><pipe>``.
- The first column is a numeric identifier, which usually represents a taxon
  (i.e., "TaxID"), but can also represent a genetic code, citation or other
  entry.

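The pattern above can be illustrated with a minimal, standard-library-only
sketch that splits one line into its fields. It is not how scikit-bio parses
the file: the names ``TAB``, ``DELIM``, ``LINE_END`` and ``split_dmp_line`` are
hypothetical, and ``chr(9)`` stands for the tab character only to keep
backslash escapes out of this docstring.

```python
TAB = chr(9)  # the tab character, spelled without a backslash escape
DELIM = TAB + "|" + TAB  # column delimiter: <tab><pipe><tab>
LINE_END = TAB + "|"  # line end: <tab><pipe>


def split_dmp_line(line):
    # Expects one line without its trailing newline,
    # e.g. an item of str.splitlines().
    if not line.endswith(LINE_END):
        raise ValueError("line does not end with <tab><pipe>")
    # Drop the line end, then split the rest on the column delimiter.
    return line[:-len(LINE_END)].split(DELIM)


# Build a sample row equivalent to: 2<tab>|<tab>131567<tab>|<tab>superkingdom<tab>|
row = DELIM.join(["2", "131567", "superkingdom"]) + LINE_END
fields = split_dmp_line(row)
tax_id = int(fields[0])  # the first column is the numeric identifier
```

Empty fields, such as a blank unique name in ``names.dmp``, come back as empty
strings.
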
The two most important files of the data release are ``nodes.dmp`` and
``names.dmp``. They store the hierarchical structure of the classification
system (i.e., taxonomy) and the names of organisms, respectively. They can be
used to construct the taxonomy tree of organisms.
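
As a rough sketch of how the two tables combine into a tree, the snippet below
walks parent pointers to recover a lineage. It uses hand-written toy
dictionaries rather than parsed files, and the function name ``lineage`` is
hypothetical. It assumes that the root node is listed as its own parent, as in
the first row of the example later in this document.

```python
# Illustrative toy tables keyed by TaxID (hand-written, not read from a file).
# nodes: tax_id -> (parent tax_id, rank); names: tax_id -> name
nodes = {
    1: (1, "no rank"),
    131567: (1, "no rank"),
    2: (131567, "superkingdom"),
}
names = {1: "root", 131567: "cellular organisms", 2: "Bacteria"}


def lineage(tax_id, nodes, names):
    # Follow parent pointers upward. Stop when a node repeats: the root is
    # its own parent, and this also guards against accidental cycles.
    path = []
    seen = set()
    while tax_id not in seen:
        seen.add(tax_id)
        path.append(names[tax_id])
        tax_id = nodes[tax_id][0]
    # The walk ran leaf-to-root; reverse it to read root-to-leaf.
    return path[::-1]


bacteria_lineage = lineage(2, nodes, names)
```

Here ``bacteria_lineage`` lists the names from the root down to the queried
node.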

The column definitions of each ``.dmp`` file type are taken from [6]_ and [7]_.

``nodes.dmp``
^^^^^^^^^^^^^
+----------------+-------------------------------------+
|Name            |Description                          |
+================+=====================================+
|tax_id          |node id in GenBank taxonomy database |
+----------------+-------------------------------------+
|parent tax_id   |parent node id in GenBank taxonomy   |
|                |database                             |
+----------------+-------------------------------------+
|rank            |rank of this node (superkingdom,     |
|                |kingdom, ...)                        |
+----------------+-------------------------------------+
|embl code       |locus-name prefix; not unique        |
+----------------+-------------------------------------+
|division id     |see division.dmp file                |
+----------------+-------------------------------------+
|inherited div   |1 if node inherits division from     |
|flag (1 or 0)   |parent                               |
+----------------+-------------------------------------+
|genetic code id |see gencode.dmp file                 |
+----------------+-------------------------------------+
|inherited GC    |1 if node inherits genetic code from |
|flag (1 or 0)   |parent                               |
+----------------+-------------------------------------+
|mitochondrial   |see gencode.dmp file                 |
|genetic code id |                                     |
+----------------+-------------------------------------+
|inherited MGC   |1 if node inherits mitochondrial     |
|flag (1 or 0)   |gencode from parent                  |
+----------------+-------------------------------------+
|GenBank hidden  |1 if name is suppressed in GenBank   |
|flag (1 or 0)   |entry lineage                        |
+----------------+-------------------------------------+
|hidden subtree  |1 if this subtree has no sequence    |
|root flag       |data yet                             |
|(1 or 0)        |                                     |
+----------------+-------------------------------------+
|comments        |free-text comments and citations     |
+----------------+-------------------------------------+

Since 2018, NCBI has released "new taxonomy files" [8]_ (``new_taxdump``). The
new ``nodes.dmp`` format is compatible with the classical format and appends
five extra columns after those listed above.

+----------------+-------------------------------------+
|Name            |Description                          |
+================+=====================================+
|plastid genetic |see gencode.dmp file                 |
|code id         |                                     |
+----------------+-------------------------------------+
|inherited PGC   |1 if node inherits plastid gencode   |
|flag (1 or 0)   |from parent                          |
+----------------+-------------------------------------+
|specified\\_     |1 if species in the node's lineage   |
|species         |has formal name                      |
+----------------+-------------------------------------+
|hydrogenosome   |see gencode.dmp file                 |
|genetic code id |                                     |
+----------------+-------------------------------------+
|inherited HGC   |1 if node inherits hydrogenosome     |
|flag (1 or 0)   |gencode from parent                  |
+----------------+-------------------------------------+

``names.dmp``
^^^^^^^^^^^^^
+----------------+-------------------------------------+
|Name            |Description                          |
+================+=====================================+
|tax_id          |the id of node associated with this  |
|                |name                                 |
+----------------+-------------------------------------+
|name_txt        |name itself                          |
+----------------+-------------------------------------+
|unique name     |the unique variant of this name if   |
|                |name not unique                      |
+----------------+-------------------------------------+
|name class      |(synonym, common name, ...)          |
+----------------+-------------------------------------+

``division.dmp``
^^^^^^^^^^^^^^^^
+----------------+-------------------------------------+
|Name            |Description                          |
+================+=====================================+
|division id     |taxonomy database division id        |
+----------------+-------------------------------------+
|division cde    |GenBank division code (three         |
|                |characters)                          |
+----------------+-------------------------------------+
|division name   |e.g. BCT, PLN, VRT, MAM, PRI...      |
+----------------+-------------------------------------+
|comments        |                                     |
+----------------+-------------------------------------+

``gencode.dmp``
^^^^^^^^^^^^^^^
+----------------+-------------------------------------+
|Name            |Description                          |
+================+=====================================+
|genetic code id |GenBank genetic code id              |
+----------------+-------------------------------------+
|abbreviation    |genetic code name abbreviation       |
+----------------+-------------------------------------+
|name            |genetic code name                    |
+----------------+-------------------------------------+
|cde             |translation table for this genetic   |
|                |code                                 |
+----------------+-------------------------------------+
|starts          |start codons for this genetic code   |
+----------------+-------------------------------------+

Other types of ``.dmp`` files are currently not supported by scikit-bio.
However, the user may supply custom column definitions when using this
utility. See below for details.

Format Parameters
-----------------
The following format parameters are available in ``taxdump`` format:

- ``scheme``: The column definition scheme name of the input ``.dmp`` file.
  Available options are listed below. Alternatively, one can provide a custom
  scheme as a dictionary that maps column names to data types.

  1. ``nodes``: The classical ``nodes.dmp`` scheme. It is also compatible with
     new ``nodes.dmp`` format, in which case only the columns defined by the
     classical format will be read.

  2. ``nodes_new``: The new ``nodes.dmp`` scheme.

  3. ``nodes_slim``: Only the first three columns: tax_id, parent_tax_id and
     rank, which are the minimum required information for constructing the
     taxonomy tree. It can be applied to both classical and new ``nodes.dmp``
     files. It can also handle custom files that contain only these three
     columns.

  4. ``names``: The ``names.dmp`` scheme.

  5. ``division``: The ``division.dmp`` scheme.

  6. ``gencode``: The ``gencode.dmp`` scheme.

.. note:: scikit-bio reads columns from the leftmost one up to the number of
   columns defined in the scheme. Extra columns are ignored.
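
To make the custom scheme and the cropping rule concrete, here is a
standard-library-only sketch. It does not call scikit-bio and is not its
implementation. ``custom_scheme`` and ``read_rows`` are hypothetical names, and
``chr(9)`` stands for the tab character to keep backslash escapes out of this
docstring. With scikit-bio itself, a dictionary of this shape would be passed
as the ``scheme`` parameter.

```python
TAB = chr(9)  # the tab character
DELIM = TAB + "|" + TAB  # column delimiter: <tab><pipe><tab>
LINE_END = TAB + "|"  # line end: <tab><pipe>

# Hypothetical custom scheme: column name -> data type, in column order
# (dictionaries preserve insertion order in Python 3.7+).
custom_scheme = {"tax_id": int, "parent_tax_id": int}


def read_rows(lines, scheme):
    rows = []
    for line in lines:
        fields = line[:-len(LINE_END)].split(DELIM)
        if len(fields) < len(scheme):
            raise ValueError("fewer columns than the scheme defines")
        # zip() stops at the shorter input, so columns beyond those named
        # in the scheme are dropped.
        rows.append({
            name: cast(value)
            for (name, cast), value in zip(scheme.items(), fields)
        })
    return rows


# Three-column input read with a two-column scheme: "rank" is cropped.
lines = [
    DELIM.join(["2", "131567", "superkingdom"]) + LINE_END,
    DELIM.join(["6", "335928", "genus"]) + LINE_END,
]
rows = read_rows(lines, custom_scheme)
```

Each resulting row holds only ``tax_id`` and ``parent_tax_id``, both converted
to ``int``.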

Examples
--------

>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\\n'.join([
...     '1\\t|\\t1\\t|\\tno rank\\t|',
...     '2\\t|\\t131567\\t|\\tsuperkingdom\\t|',
...     '6\\t|\\t335928\\t|\\tgenus\\t|'
... ])
>>> fh = StringIO(fs)

Read the file into a ``pd.DataFrame`` and specify that the "nodes_slim" scheme
should be used:

>>> df = skbio.io.read(fh, format="taxdump", into=pd.DataFrame,
...                    scheme="nodes_slim")
>>> df # doctest: +NORMALIZE_WHITESPACE
        parent_tax_id          rank
tax_id
1                   1       no rank
2              131567  superkingdom
6              335928         genus

References
----------
.. [1] Federhen, S. (2012). The NCBI taxonomy database. Nucleic acids
   research, 40(D1), D136-D143.
.. [2] Schoch, C. L., Ciufo, S., Domrachev, M., Hotton, C. L., Kannan, S.,
   Khovanskaya, R., ... & Karsch-Mizrachi, I. (2020). NCBI Taxonomy: a
   comprehensive update on curation, resources and tools. Database, 2020.
.. [3] https://www.ncbi.nlm.nih.gov/taxonomy
.. [4] https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
.. [5] Kans, J. (2022). Entrez direct: E-utilities on the UNIX command line.
   In Entrez Programming Utilities Help [Internet]. National Center for
   Biotechnology Information (US).
.. [6] https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_readme.txt
.. [7] https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/taxdump_readme.txt
.. [8] https://ncbiinsights.ncbi.nlm.nih.gov/2018/02/22/new-taxonomy-files-available-with-lineage-type-and-host-information/
"""

# ----------------------------------------------------------------------------
# Copyright (c) 2013--, scikit-bio development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE.txt, distributed with this software.
# ----------------------------------------------------------------------------

import pandas as pd

from skbio.io import create_format


taxdump = create_format('taxdump')

_taxdump_column_schemes = {
    'nodes_slim': {
        'tax_id': int,
        'parent_tax_id': int,
        'rank': str
    },
    'nodes': {
        'tax_id': int,
        'parent_tax_id': int,
        'rank': str,
        'embl_code': str,
        'division_id': int,
        'inherited_div_flag': bool,
        'genetic_code_id': int,
        'inherited_GC_flag': bool,
        'mitochondrial_genetic_code_id': int,
        'inherited_MGC_flag': bool,
        'GenBank_hidden_flag': bool,
        'hidden_subtree_root_flag': bool,
        'comments': str
    },
    'names': {
        'tax_id': int,
        'name_txt': str,
        'unique_name': str,
        'name_class': str
    },
    'division': {
        'division_id': int,
        'division_cde': str,
        'division_name': str,
        'comments': str
    },
    'gencode': {
        'genetic_code_id': int,
        'abbreviation': str,
        'name': str,
        'cde': str,
        'starts': str
    }
}

_taxdump_column_schemes['nodes_new'] = dict(
    _taxdump_column_schemes['nodes'], **{
        'plastid_genetic_code_id': int,
        'inherited_PGC_flag': bool,
        'specified_species': bool,
        'hydrogenosome_genetic_code_id': int,
        'inherited_HGC_flag': bool
    })


@taxdump.reader(pd.DataFrame, monkey_patch=False)
def _taxdump_to_data_frame(fh, scheme):
    '''Read a taxdump file into a data frame.

    Parameters
    ----------
    fh : file handle
        Input taxdump file
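    scheme : str or dict
        Name of a predefined column scheme, or a custom scheme given as a
        dictionary that maps column names to data types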

    Returns
    -------
    pd.DataFrame
        Parsed table
    '''
    if isinstance(scheme, str):
        if scheme not in _taxdump_column_schemes:
            raise ValueError(f'Invalid taxdump column scheme: "{scheme}".')
        scheme = _taxdump_column_schemes[scheme]
    names = list(scheme.keys())
    try:
        return pd.read_csv(
            fh, sep='\t\\|(?:\t|$)', engine='python', index_col=0,
            names=names, dtype=scheme, usecols=range(len(names)))
    except ValueError:
        raise ValueError('Invalid taxdump file format.')