1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
|
.. _semantics:
Understanding the semantics of loom files
-----------------------------------------
Connecting, not loading and saving
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Loom files are stored on disk and are never loaded entirely. They are
more like databases: you connect, retrieve some subset of the data,
maybe update some attributes.
When you connect, only some metadata is loaded,
but the main matrix, additional layers, graphs and
attribute values remain on disk. They are loaded only as
needed when you access them through the connection object.
The canonical way to connect to a Loom file is:
.. code:: python
with loompy.connect("filename.loom") as ds:
# do something with the connection object ds
This ensures that the file is automatically closed when no longer needed.
For interactive usage, it may be more convenient to connect directly:
.. code:: python
ds = loompy.connect("filename.loom")
# do something with the connection object ds
ds.close()
In that case, you need to close the connection explicitly.
Reading and writing
~~~~~~~~~~~~~~~~~~~
Loom files are based on
`HDF5 <https://en.wikipedia.org/wiki/Hierarchical_Data_Format>`__, a
file format suitable for large multidimensional datasets. Loom files
are great for distribution of large datasets, and for use
in computational pipelines. They **do
not** support writing and reading concurrently. They also do not support
journalling, so if something happens during a write, the **entire file
can be lost**. Therefore, do not use loom files as your primary data
storage. They are for working with data, not keeping it safe.
Loom files do support concurrent reads, but only from separate processes
(not threads), and (we think) only from a single compute node. On a
compute cluster, you may encounter bugs if you try to read the same Loom
file from different compute nodes concurrently.
Efficient indexing and compression
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The main matrix is stored in *chunked* format. That is, instead of being
stored by rows or by columns, it is stored as a sequence of little
rectangles. As a consequence, both rows and columns (as well as
submatrices) can be efficiently accessed.
By default, chunks are compressed and decompressed for you
automatically. This makes Loom a space-efficient format for storing
large sparse datasets. We have found that Loom often uses the same
amount of space as standard sparse matrix formats.
Matrix and attributes
~~~~~~~~~~~~~~~~~~~~~
Loom files represent a two-dimensional matrix with named row and column
attributes. If the main matrix has N rows and M columns, then each column
attribute has M elements (one per column) and vice versa.
By convention, the matrix represents measurements of some property (e.g.
gene expression), genes are stored in the rows and columns represent
samples (e.g. cells).
The file can grow by the addition of attributes and columns (but not
rows). For this reason, it's a good idea to put genes in the rows (since
you will likely always work with the same gene set).
Data types
~~~~~~~~~~
Loom supports the following data types:
- The main matrix, and any additional layers, are
two-dimensional arrays of numbers, which can be any 8, 16, 32 or
64-bit integer (signed or unsigned), or 16, 32 or 64-bit float.
- Attributes are N-dimensional arrays, and the elements can be any 8, 16, 32 or
64-bit integer (signed or unsigned), or 16, 32 or 64-bit float,
or string. The size of the first dimension must match the
corresponding matrix dimension.
See the file format specification for detailed information on data types,
including a discussion on string handling, ASCII and Unicode support.
|