.. _version-3:

+------------------------+-----+
| **Schema Version**     |  3  |
+------------------------+-----+

The following document describes a `compressed sparse row (CSR) <https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_.28CSR.2C_CRS_or_Yale_format.29>`_ storage scheme for a matrix (i.e., a quantitative heatmap) with genomically labeled dimensions/axes.

HDF5 does not natively implement sparse arrays or relational data structures: its datasets are dense multidimensional arrays. We implement tables and sparse array indexes in HDF5 using groups of 1D arrays. The descriptions of tables and indexes in this document specify required groups and arrays, conventional column orders, and default data types.

.. admonition:: Summary of changes

    * Version 3 introduces the ``storage-mode`` metadata attribute to accommodate square matrices that are non-symmetric. Version 2 files that lack the ``storage-mode`` attribute should be interpreted as using the "symmetric-upper" storage mode. See `Storage mode`_.
    * The multi-resolution cooler file layout has been standardized. See `File flavors`_.



Data collection
===============

We refer to the object hierarchy describing a single matrix as a cooler *data collection*. A cooler data collection consists of **tables**, **indexes** and **metadata** describing a genomically-labelled sparse matrix.

A typical data collection has the following structure. At the top level, there are four `HDF5 Groups <http://docs.h5py.org/en/stable/high/group.html>`_, each containing 1D arrays (`HDF5 Datasets <http://docs.h5py.org/en/stable/high/dataset.html>`_). The depiction below shows an example group hierarchy as a tree, with arrays at the leaves, printed with their shapes in parentheses and their data type symbols.

::

  /
   ├── chroms
   │   ├── length (24,) int32
   │   └── name (24,) |S64
   ├── bins
   │   ├── chrom (3088281,) int32
   │   ├── start (3088281,) int32
   │   ├── end (3088281,) int32
   │   └── weight (3088281,) float64
   ├── pixels
   │   ├── bin1_id (271958554,) int64
   │   ├── bin2_id (271958554,) int64
   │   └── count (271958554,) int32
   └── indexes
       ├── bin1_offset (3088282,) int64
       └── chrom_offset (25,) int64

URI syntax
==========

We identify a cooler data collection using a **URI string** to its top-level group, separating the system path to the container file from the **group path** within the container file by a double colon ``::``.

::

  path/to/container.cool::/path/to/cooler/group

For any URI, the leading slash after the ``::`` may be omitted. To reference the root group ``/``, the entire ``::/`` suffix may be omitted (i.e., just a file path).
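
The URI rules above can be sketched as a small parser. This is only an illustration of the convention, not cooler's own implementation; the function name ``split_cooler_uri`` is hypothetical.

.. code-block:: python

    def split_cooler_uri(uri):
        """Split a cooler URI into (file_path, group_path)."""
        # Everything before the first "::" is the container file path.
        file_path, _, group_path = uri.partition("::")
        # The leading slash is optional, and a bare file path refers to
        # the root group, so normalize both cases to an absolute path.
        if not group_path.startswith("/"):
            group_path = "/" + group_path
        return file_path, group_path


    print(split_cooler_uri("path/to/container.cool::/path/to/cooler/group"))
    print(split_cooler_uri("XYZ.1000.mcool::resolutions/10000"))
    print(split_cooler_uri("XYZ.1000.cool"))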

Tables
======

A **table** is a group of equal-length 1D arrays representing **columns**.

Additional groups and tables may be added to a data collection as long as they are not nested under the group of another table.

This storage mode does not enforce specific **column orders**, but conventional orders for *required* columns are provided in the listings below.

This storage mode does not set limits on the **number or length of columns**. Additional arrays may be inserted into a table to form new columns, but they must conform to the common length of the table.

The table descriptions below are given in the `datashape <http://datashape.readthedocs.org/en/latest/>`_ layout language. The column **data types** are given as numpy equivalents. They are only defaults and may be altered as desired.

GZIP is chosen as the default **compression** filter for all columns. This is for portability reasons, since all versions of the HDF5 library ship with it.

chroms
------

::

    chroms: {
      # REQUIRED
      name:     typevar['Nchroms'] * string['ascii'],
      length:   typevar['Nchroms'] * int32
    }

In HDF5, ``name`` is a null-padded, fixed-length ASCII array, which maps to numpy's ``S`` dtype.

bins
----

::

    bins: {
      # REQUIRED
      chrom:    typevar['Nbins'] * categorical[typevar['name'], type=string, ordered=True],
      start:    typevar['Nbins'] * int32,
      end:      typevar['Nbins'] * int32,

      # RESERVED
      weight:   typevar['Nbins'] * float64
    }

In HDF5, we use the integer-backed ENUM type to encode the ``chrom`` column. For data collections with a very large number of scaffolds, the ENUM type information may be too large to fit in the object's metadata header. In that case, the ``chrom`` column is stored using raw integers and the enumeration is inferred from the ``chroms`` table.

Genomic intervals are stored using a `0-start, half-open <http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems>`_ representation. The first interval in a scaffold should have ``start`` = 0 and the last interval should have ``end`` = the chromosome length. Intervals are sorted by ``chrom``, then by ``start``.
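
As an illustration of the half-open convention, the following sketch builds the three required columns for fixed-size bins. It is not cooler's own binning code: ``make_fixed_bins`` and the toy chromosome sizes are hypothetical, and ``chrom`` is shown as integer codes that index into the ``chroms`` table.

.. code-block:: python

    def make_fixed_bins(chromsizes, binsize):
        """Return parallel lists (chrom, start, end) of fixed-size bins."""
        chrom_col, start_col, end_col = [], [], []
        for chrom_id, (_name, length) in enumerate(chromsizes):
            for start in range(0, length, binsize):
                chrom_col.append(chrom_id)
                start_col.append(start)
                # The last bin of a chromosome is clipped to its length.
                end_col.append(min(start + binsize, length))
        return chrom_col, start_col, end_col


    # Hypothetical chromosome table: (name, length) pairs in storage order.
    chromsizes = [("chr1", 2500), ("chr2", 1000)]
    chrom, start, end = make_fixed_bins(chromsizes, 1000)
    print(chrom)  # [0, 0, 0, 1]
    print(start)  # [0, 1000, 2000, 0]
    print(end)    # [1000, 2000, 2500, 1000]

Each chromosome starts at 0 and ends at its own length, and the rows come out sorted by ``chrom`` and then by ``start``.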

Because they measure the same quantity in the same units, the coordinate columns ``chroms/length``, ``bins/start`` and ``bins/end`` should be encoded using the same data type.

The :command:`cooler balance` command stores balancing weights in a column called ``weight`` by default. NaN values indicate genomic bins that were blacklisted during the balancing procedure.

pixels
------

::

    pixels: {
      # REQUIRED
      bin1_id:  typevar['Nnz'] * int64,
      bin2_id:  typevar['Nnz'] * int64,

      # RESERVED
      count:    typevar['Nnz'] * int32
    }

In the matrix coordinate system, ``bin1_id`` refers to the ith axis and ``bin2_id`` refers to the jth. Bin IDs are zero-based, i.e. we start counting at 0. Pixels are sorted by ``bin1_id`` then by ``bin2_id``.

The ``count`` column is integer by default, but floating point types can be substituted. Additional columns are to be interpreted as supplementary value columns.

.. warning:: `float16 <https://github.com/hetio/hetio/pull/15>`_ has limited support from 3rd party libraries and is not recommended. For floating point value columns consider using either single- (float32) or double-precision (float64).

Indexes
=======

Indexes are stored as 1D arrays in a separate group called ``indexes``. They can be thought of as run-length encodings of the ``bins/chrom`` and ``pixels/bin1_id`` columns, respectively. Both arrays are required.

::

    indexes: {
      chrom_offset:  (typevar['Nchroms'] + 1) * int64,
      bin1_offset:   (typevar['Nbins'] + 1) * int64
    }

* ``chrom_offset``: indicates the row in the bin table at which each chromosome first appears. The last element stores the length of the bin table.
* ``bin1_offset``: indicates the row in the pixel table at which each bin1 ID first appears. The last element stores the length of the pixel table. This index is usually called *indptr* in CSR data structures.
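
A minimal sketch of how both offset arrays relate to their sorted source columns is given below. The helper ``build_offsets`` and the toy columns are hypothetical and use plain Python lists rather than HDF5 datasets.

.. code-block:: python

    def build_offsets(sorted_ids, n_labels):
        """Return n_labels + 1 offsets for a sorted column of integer IDs."""
        counts = [0] * n_labels
        for label in sorted_ids:
            counts[label] += 1
        # Cumulative sum of the run lengths; the last entry is the
        # total length of the column.
        offsets = [0]
        for count in counts:
            offsets.append(offsets[-1] + count)
        return offsets


    bins_chrom = [0, 0, 0, 1]          # chromosome ID of each bin
    pixels_bin1_id = [0, 0, 1, 3, 3]   # bin1 ID of each pixel (bin 2 is empty)

    chrom_offset = build_offsets(bins_chrom, 2)
    bin1_offset = build_offsets(pixels_bin1_id, 4)
    print(chrom_offset)  # [0, 3, 4]
    print(bin1_offset)   # [0, 2, 3, 3, 5]

    # Rows of the pixel table that belong to matrix row 3:
    lo, hi = bin1_offset[3], bin1_offset[4]
    print(lo, hi)        # 3 5

A bin without pixels simply repeats the next offset, so the slice ``lo:hi`` is empty for it.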

Storage mode
============

Storing a symmetric matrix requires only the *upper triangular part, including the diagonal*, since the remaining elements can be reconstructed from the former ones. To indicate the use of this **mode of matrix storage** to client software, the value of the metadata attribute ``storage-mode`` must be set to ``"symmetric-upper"`` (see `Metadata`_).

.. versionadded:: 3

    To indicate the absence of a special storage mode, e.g. for **non-symmetric** matrices, ``storage-mode`` must be set to ``"square"``.  This storage mode indicates to client software that 2D range queries should not be symmetrized.

.. warning:: In schema v2 and earlier, the symmetric-upper storage mode is always assumed.
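
The effect of the two storage modes on a reader can be sketched as follows. This is an illustrative dense expansion on toy data, not how cooler answers range queries; ``to_dense`` is a hypothetical helper.

.. code-block:: python

    def to_dense(n_bins, bin1_id, bin2_id, count, storage_mode):
        """Expand pixel columns into a dense n_bins x n_bins list of lists."""
        matrix = [[0] * n_bins for _ in range(n_bins)]
        for i, j, value in zip(bin1_id, bin2_id, count):
            matrix[i][j] = value
            # Only symmetric-upper storage implies a mirrored lower triangle.
            if storage_mode == "symmetric-upper" and i != j:
                matrix[j][i] = value
        return matrix


    bin1_id = [0, 0, 1]
    bin2_id = [0, 2, 1]
    count = [5, 2, 7]

    print(to_dense(3, bin1_id, bin2_id, count, "symmetric-upper"))
    # [[5, 0, 2], [0, 7, 0], [2, 0, 0]]
    print(to_dense(3, bin1_id, bin2_id, count, "square"))
    # [[5, 0, 2], [0, 7, 0], [0, 0, 0]]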


Metadata
========

Essential key-value properties are stored as `HDF5 attributes <http://docs.h5py.org/en/stable/high/attr.html>`_ at the top-level group of the data collection. Note that depending on where the data collection is located in the file, this can be different from the root group of the entire file ``/``.

.. rubric:: Required attributes

.. describe:: format : string (constant)

    "HDF5::Cooler"

.. describe:: format-version : int

    The schema version used.

.. describe:: bin-type : { "fixed", "variable" }

    Indicates whether the resolution is constant along both axes.

.. describe:: bin-size : int or "null"

    Size of genomic bins in base pairs if bin-type is "fixed". Otherwise, "null".

.. describe:: storage-mode : { "symmetric-upper", "square" }

    Indicates whether ordinary sparse matrix encoding is used ("square") or whether a symmetric matrix is encoded by storing only the upper triangular elements ("symmetric-upper").

.. rubric:: Reserved, but optional

.. describe:: assembly : string

    Name of the genome assembly, e.g. "hg19".

.. describe:: generated-by : string

    Agent that created the file, e.g. "cooler-x.y.z".

.. describe:: creation-date : datetime string

    The moment the collection was created.

.. describe:: metadata : JSON

    Arbitrary JSON-compatible **user metadata** about the experiment.


All scalar string attributes, including serialized JSON, must be stored as **variable-length UTF-8 encoded strings**.

.. warning:: When assigning scalar string attributes in Python 2, always store values having ``unicode`` type. In h5py, assigning a Python text string (Python 3 ``str`` or Python 2 ``unicode``) to an HDF5 attribute results in variable-length UTF-8 storage.

Additional metadata may be stored in other top-level attributes and the attributes of table groups and columns.
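
The attribute rules above can be checked with a few lines of code. The sketch below works on a plain dictionary of attribute values; ``normalize_attrs`` is a hypothetical helper and not part of the cooler API. It also applies the rule that files older than version 3 default to the symmetric-upper storage mode.

.. code-block:: python

    REQUIRED_ATTRS = ("format", "format-version", "bin-type", "bin-size",
                      "storage-mode")


    def normalize_attrs(attrs):
        """Return a checked copy of ``attrs`` or raise ValueError."""
        attrs = dict(attrs)
        # Schema v2 and earlier always imply symmetric-upper storage.
        if attrs.get("format-version", 0) < 3:
            attrs.setdefault("storage-mode", "symmetric-upper")
        missing = [key for key in REQUIRED_ATTRS if key not in attrs]
        if missing:
            raise ValueError("missing attributes: " + ", ".join(missing))
        if attrs["format"] != "HDF5::Cooler":
            raise ValueError("not a cooler data collection")
        if attrs["bin-type"] == "fixed":
            if not isinstance(attrs["bin-size"], int):
                raise ValueError("fixed bins require an integer bin-size")
        elif attrs["bin-type"] == "variable":
            if attrs["bin-size"] != "null":
                raise ValueError('variable bins require bin-size "null"')
        else:
            raise ValueError("unknown bin-type")
        if attrs["storage-mode"] not in ("symmetric-upper", "square"):
            raise ValueError("unknown storage-mode")
        return attrs


    v2_attrs = {"format": "HDF5::Cooler", "format-version": 2,
                "bin-type": "fixed", "bin-size": 1000}
    print(normalize_attrs(v2_attrs)["storage-mode"])  # symmetric-upper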


File flavors
============

Many cooler data collections can be stored in a single file. We recognize two conventional **layouts**:


Single-resolution
-----------------

* A single-resolution cooler file that contains a single data collection under the ``/`` group. Conventional file extension: ``.cool``.

::

  XYZ.1000.cool
  /
   ├── bins
   ├── chroms
   ├── pixels
   └── indexes


Multi-resolution
----------------

* A multi-resolution cooler file that contains multiple "coarsened" resolutions or "zoom-levels" derived from the same dataset. Multires cooler files should store each data collection underneath a group called ``/resolutions`` within a sub-group whose name is the bin size (e.g., ``XYZ.1000.mcool::resolutions/10000``). If the base cooler has variable-length bins, then use ``1`` to designate the base resolution, and use the coarsening multiplier (e.g. ``2``, ``4``, ``8``, etc.) to name the lower resolutions. Conventional file extension: ``.mcool``.

::

  XYZ.1000.mcool
  /
   └── resolutions
       ├── 1000
       │   ├── bins
       │   ├── chroms
       │   ├── pixels
       │   └── indexes
       ├── 2000
       │   ├── bins
       │   ├── chroms
       │   ├── pixels
       │   └── indexes
       ├── 5000
       │   ├── bins
       │   ├── chroms
       │   ├── pixels
       │   └── indexes
       ├── 10000
       │   ├── bins
       │   ├── chroms
       │   ├── pixels
       │   └── indexes
       .
       .
       .
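
The naming rule for the sub-groups can be sketched as follows. The helper ``resolution_group_paths`` and its parameters are hypothetical and only illustrate the convention described above.

.. code-block:: python

    def resolution_group_paths(base_binsize, multipliers, variable_bins=False):
        """Return the group path for each coarsening multiplier."""
        paths = []
        for multiplier in multipliers:
            # Fixed bins are named by bin size; variable-length bins are
            # named by the coarsening multiplier itself.
            name = multiplier if variable_bins else base_binsize * multiplier
            paths.append("/resolutions/{}".format(name))
        return paths


    print(resolution_group_paths(1000, [1, 2, 5, 10]))
    # ['/resolutions/1000', '/resolutions/2000', '/resolutions/5000', '/resolutions/10000']
    print(resolution_group_paths(None, [1, 2, 4], variable_bins=True))
    # ['/resolutions/1', '/resolutions/2', '/resolutions/4']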

In addition, a multi-resolution cooler file may indicate to clients that it is using this layout with the following ``/``-level attributes:

.. describe:: format : string (constant)

    "HDF5::MCOOL"

.. describe:: format-version : int

    2

.. describe:: bin-type : { "fixed", "variable" }

    Indicates whether the resolution is constant along both axes.


.. note::

  The old multi-resolution layout used resolutions strictly in increments of *powers of 2*. In this layout (MCOOL version 2), the data collections are named by zoom level, starting with ``XYZ.1000.mcool::0`` being the coarsest resolution up until the finest or "base" resolution (e.g., ``XYZ.1000.mcool::14`` for 14 levels of coarsening).

  .. versionchanged:: 0.8
    Both the legacy layout and the new mcool layout are supported by `HiGlass <http://higlass.io/app/>`_. Prior to cooler 0.8, the new layout was produced only when requesting a specific list of resolutions. As of cooler 0.8, the new layout is always produced by the :command:`cooler zoomify` command unless the ``--legacy`` option is given. Files produced by :py:func:`cooler.zoomify_cooler`, `hic2cool <https://github.com/4dn-dcic/hic2cool/>`_, and the mcools from the `4DN data portal <https://data.4dnucleome.org/>`_ also follow the new layout.



Single-cell (single-resolution)
-------------------------------

A single-cell cooler file contains all the matrices of a single-cell Hi-C data set. All cells are stored under a group called ``/cells``, and all cells share the primary bin table columns,
i.e. ``bins['chrom']``, ``bins['start']`` and ``bins['end']``, which are `hardlinked <http://docs.h5py.org/en/stable/high/group.html#hard-links>`_ to the root-level bin table. Any individual cell can be accessed using the regular :class:`cooler.Cooler` interface.
Conventional file extension: ``.scool``.

::

  XYZ.scool
  /
   ├── bins
   ├── chroms
   └── cells
       ├── cell_id1
       │   ├── bins
       │   ├── chroms
       │   ├── pixels
       │   └── indexes
       ├── cell_id2
       │   ├── bins
       │   ├── chroms
       │   ├── pixels
       │   └── indexes
       ├── cell_id3
       │   ├── bins
       │   ├── chroms
       │   ├── pixels
       │   └── indexes
       ├── cell_id4
       │   ├── bins
       │   ├── chroms
       │   ├── pixels
       │   └── indexes
       .
       .
       .

In addition, a single-cell single-resolution cooler file may indicate to clients that it is using this layout with the following ``/``-level attributes:

.. describe:: format : string (constant)

    "HDF5::SCOOL"

.. describe:: format-version : int

    1

.. describe:: bin-type : { "fixed", "variable" }

    Indicates whether the resolution is constant along both axes.

.. describe:: bin-size : int

    The bin resolution

.. describe:: nbins : int

    The number of bins

.. describe:: nchroms : int

    The number of chromosomes of the cells

.. describe:: ncells : int

    The number of stored cells