r"""Input and Output (:mod:`skbio.io`)
==================================
.. currentmodule:: skbio.io
This module provides input/output (I/O) functionality for scikit-bio.
Supported file formats
----------------------
scikit-bio provides parsers for the following file formats. For details on what objects
are supported by each format, see the associated documentation.
.. currentmodule:: skbio.io.format

.. autosummary::
   :toctree: generated/

   binary_dm
   biom
   blast6
   blast7
   clustal
   embl
   embed
   fasta
   fastq
   genbank
   gff3
   lsmat
   newick
   ordination
   phylip
   qseq
   stockholm
   taxdump
   sample_metadata

Read/write files
----------------
.. rubric:: Generic I/O functions

.. currentmodule:: skbio.io.registry

.. autosummary::
   :toctree: generated/

   write
   read
   sniff

.. rubric:: Additional I/O utilities

.. currentmodule:: skbio.io

.. autosummary::
   :toctree: generated/

   util

Develop custom formats
----------------------
.. rubric:: Developer documentation on extending I/O

.. autosummary::
   :toctree: generated/

   registry

Exceptions and warnings
^^^^^^^^^^^^^^^^^^^^^^^
.. currentmodule:: skbio.io

.. rubric:: General exceptions and warnings

.. autosummary::

   FormatIdentificationWarning
   ArgumentOverrideWarning
   UnrecognizedFormatError
   IOSourceError
   FileFormatError

.. rubric:: Format-specific exceptions and warnings

.. autosummary::

   BIOMFormatError
   BLAST7FormatError
   ClustalFormatError
   EMBLFormatError
   EmbedFormatError
   FASTAFormatError
   FASTQFormatError
   GenBankFormatError
   GFF3FormatError
   LSMatFormatError
   NewickFormatError
   OrdinationFormatError
   PhylipFormatError
   QSeqFormatError
   QUALFormatError
   StockholmFormatError

Tutorial
--------
Reading and writing files (I/O) can be a complicated task:

* A file format can sometimes be read into more than one in-memory representation
  (i.e., object). For example, a FASTA file can be read into an
  :class:`skbio.alignment.TabularMSA` or :class:`skbio.sequence.DNA`, depending on
  what operations you'd like to perform on your data.
* A single object might be writable to more than one file format. For example, an
  :class:`skbio.alignment.TabularMSA` object could be written to FASTA, FASTQ,
  CLUSTAL, or PHYLIP formats, just to name a few.
* You might not know the exact format of your file, but you want to read
  it into an appropriate object.
* You might want to read multiple files into a single object, or write an
  object to multiple files.
* Instead of reading a file into an object, you might want to stream the file
  using a generator (e.g., if the file cannot be fully loaded into memory).

To address these issues (and others), scikit-bio provides a simple, powerful
interface for dealing with I/O, built on a single I/O registry.

What kinds of files scikit-bio can use
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To see a complete list of file-like inputs that can be used for reading,
writing, and sniffing, see the documentation for :func:`skbio.io.util.open`.
Reading files into scikit-bio
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are two ways to read files. The first is to use the procedural
interface:

.. code-block:: python

   my_obj = skbio.io.read(file, format='someformat', into=SomeSkbioClass)

The second is to use the object-oriented (OO) interface, which is automatically
constructed from the procedural interface:

.. code-block:: python

   my_obj = SomeSkbioClass.read(file, format='someformat')

For example, to read a ``newick`` file using the procedural interface:

>>> from skbio import read
>>> from skbio import TreeNode
>>> from io import StringIO
>>> open_filehandle = StringIO('(a, b);')
>>> tree = read(open_filehandle, format='newick', into=TreeNode)
>>> tree
<TreeNode, name: unnamed, internal node count: 0, tips count: 2>

For the OO interface:

>>> open_filehandle = StringIO('(a, b);')
>>> tree = TreeNode.read(open_filehandle, format='newick')
>>> tree
<TreeNode, name: unnamed, internal node count: 0, tips count: 2>

If ``into`` is not provided to :func:`skbio.io.registry.read`, a generator is
returned. What the generator yields depends on the format being read.
When ``into`` is provided, ``format`` may be omitted, and the registry will use its
knowledge of the formats available for the requested class to infer the correct
format. This format inference is also available in the OO interface, so
``format`` may be omitted there as well.

As an example:

>>> open_filehandle = StringIO('(a, b);')
>>> tree = TreeNode.read(open_filehandle)
>>> tree
<TreeNode, name: unnamed, internal node count: 0, tips count: 2>

We call format inference **sniffing**, much like the :class:`csv.Sniffer`
class of Python's standard library. A ``sniffer`` has two goals: to determine
whether a file is in a specific format and, if it is, to provide ``**kwargs``
that can be used to better parse the file.

.. note:: There is a built-in ``sniffer`` that produces a useful error message
   if an empty file is provided as input and the format is omitted.
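
To make the two goals of a sniffer concrete, here is a toy sketch in plain
Python. It is not scikit-bio's registry code: the function name
``sniff_tab_delimited`` and the ``n_columns`` hint are hypothetical, chosen only
to show a sniffer reporting both a match and parser hints.

```python
def sniff_tab_delimited(lines):
    """Hypothetical sniffer: return (matches, kwargs) for tab-delimited text."""
    non_empty = [line for line in lines if line.strip()]
    if not non_empty:
        # Nothing to inspect, so this format cannot be confirmed.
        return False, {}
    # Count the columns on each line; a consistent table has a single count.
    column_counts = {line.count("\t") + 1 for line in non_empty}
    if len(column_counts) == 1 and column_counts != {1}:
        # Goal 1: the format matches. Goal 2: pass hints on to the parser.
        return True, {"n_columns": column_counts.pop()}
    return False, {}


matches, kwargs = sniff_tab_delimited(["id\tlength\n", "seq1\t4\n"])
no_match, no_kwargs = sniff_tab_delimited(["(a, b);\n"])
```

The first call reports a match together with a hint for the parser, while the
newick-like input is rejected with no hints.
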
Writing files from scikit-bio
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Just as when reading files, there are two ways to write files.

Procedural interface:

.. code-block:: python

   skbio.io.write(my_obj, format='someformat', into=file)

OO interface:

.. code-block:: python

   my_obj.write(file, format='someformat')

In the procedural interface, ``format`` is required. Without it, scikit-bio does
not know how you want to serialize an object. OO interfaces define a default
``format``, so you can often omit it there.
""" # noqa: D205, D415
# ----------------------------------------------------------------------------
# Copyright (c) 2013--, scikit-bio development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE.txt, distributed with this software.
# ----------------------------------------------------------------------------

from importlib import import_module

from ._warning import FormatIdentificationWarning, ArgumentOverrideWarning
from ._exception import (
    UnrecognizedFormatError,
    FileFormatError,
    BLAST7FormatError,
    ClustalFormatError,
    FASTAFormatError,
    GenBankFormatError,
    IOSourceError,
    FASTQFormatError,
    LSMatFormatError,
    NewickFormatError,
    OrdinationFormatError,
    PhylipFormatError,
    QSeqFormatError,
    QUALFormatError,
    StockholmFormatError,
    GFF3FormatError,
    EMBLFormatError,
    BIOMFormatError,
    EmbedFormatError,
)
from .registry import write, read, sniff, create_format, io_registry
from .util import open

__all__ = [
    "write",
    "read",
    "sniff",
    "open",
    "io_registry",
    "create_format",
    "FormatIdentificationWarning",
    "ArgumentOverrideWarning",
    "UnrecognizedFormatError",
    "IOSourceError",
    "FileFormatError",
    "BLAST7FormatError",
    "ClustalFormatError",
    "EMBLFormatError",
    "FASTAFormatError",
    "FASTQFormatError",
    "GenBankFormatError",
    "GFF3FormatError",
    "LSMatFormatError",
    "NewickFormatError",
    "OrdinationFormatError",
    "PhylipFormatError",
    "QSeqFormatError",
    "QUALFormatError",
    "StockholmFormatError",
    "BIOMFormatError",
    "EmbedFormatError",
]

# Each file format module must be imported so that it is added to the I/O
# registry. We use import_module instead of a typical import to avoid flake8
# unused-import errors.
import_module("skbio.io.format.blast6")
import_module("skbio.io.format.blast7")
import_module("skbio.io.format.clustal")
import_module("skbio.io.format.embl")
import_module("skbio.io.format.fasta")
import_module("skbio.io.format.fastq")
import_module("skbio.io.format.lsmat")
import_module("skbio.io.format.newick")
import_module("skbio.io.format.ordination")
import_module("skbio.io.format.phylip")
import_module("skbio.io.format.qseq")
import_module("skbio.io.format.genbank")
import_module("skbio.io.format.gff3")
import_module("skbio.io.format.stockholm")
import_module("skbio.io.format.binary_dm")
import_module("skbio.io.format.taxdump")
import_module("skbio.io.format.sample_metadata")
import_module("skbio.io.format.biom")
import_module("skbio.io.format.embed")
# This is meant to be a handy indicator to the user that they have done
# something wrong.
import_module("skbio.io.format.emptyfile")
# Now that all of our I/O has loaded, we can add the object-oriented methods
# (read and write) to each class that has registered I/O operations.
io_registry.monkey_patch()