.. _version-3:
+------------------------+-----+
| **Schema Version**     | 3   |
+------------------------+-----+
The following document describes a `compressed sparse row (CSR) <https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_.28CSR.2C_CRS_or_Yale_format.29>`_ storage scheme for a matrix (i.e., a quantitative heatmap) with genomically labeled dimensions/axes.
HDF5 does not natively implement sparse arrays or relational data structures: its datasets are dense multidimensional arrays. We implement tables and sparse array indexes in HDF5 using groups of 1D arrays. The descriptions of tables and indexes in this document specify required groups and arrays, conventional column orders, and default data types.
.. admonition:: Summary of changes
    * Version 3 introduces the ``storage-mode`` metadata attribute to accommodate square matrices that are non-symmetric. Version 2 files that lack the ``storage-mode`` attribute should be interpreted as using the "symmetric-upper" storage mode. See `Storage mode`_.
    * The multi-resolution cooler file layout has been standardized. See `File flavors`_.
Data collection
===============
We refer to the object hierarchy describing a single matrix as a cooler *data collection*. A cooler data collection consists of **tables**, **indexes** and **metadata** describing a genomically labeled sparse matrix.
A typical data collection has the following structure. At the top level, there are four `HDF5 Groups <http://docs.h5py.org/en/stable/high/group.html>`_, each containing 1D arrays (`HDF5 Datasets <http://docs.h5py.org/en/stable/high/dataset.html>`_). The depiction below shows an example group hierarchy as a tree, with arrays at the leaves, printed with their shapes in parentheses and their data type symbols.
::
    /
    ├── chroms
    │   ├── length          (24,)           int32
    │   └── name            (24,)           |S64
    ├── bins
    │   ├── chrom           (3088281,)      int32
    │   ├── start           (3088281,)      int32
    │   ├── end             (3088281,)      int32
    │   └── weight          (3088281,)      float64
    ├── pixels
    │   ├── bin1_id         (271958554,)    int64
    │   ├── bin2_id         (271958554,)    int64
    │   └── count           (271958554,)    int32
    └── indexes
        ├── bin1_offset     (3088282,)      int64
        └── chrom_offset    (25,)           int64
URI syntax
==========
We identify a cooler data collection using a **URI string** to its top-level group, separating the system path to the container file from the **group path** within the container file by a double colon ``::``.
::
    path/to/container.cool::/path/to/cooler/group
For any URI, the leading slash after the ``::`` may be omitted. To reference the root group ``/``, the entire ``::/`` suffix may be omitted (i.e., just a file path).
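
The URI rules above can be sketched as a small parser. This is an illustrative helper, not the cooler library's own implementation: the name ``split_cooler_uri`` is hypothetical, and the sketch assumes that file paths never contain ``::`` themselves.

.. code-block:: python

    def split_cooler_uri(uri):
        """Split a cooler URI into (file_path, group_path)."""
        # Split on the first "::" separator (assumption: no "::" in file paths).
        file_path, sep, group_path = uri.partition("::")
        if not sep or not group_path:
            # No "::" suffix (or an empty one): the root group is meant.
            return file_path, "/"
        if not group_path.startswith("/"):
            # The leading slash after "::" is optional, so restore it.
            group_path = "/" + group_path
        return file_path, group_path

Under these assumptions, a bare file path such as ``XYZ.1000.cool`` resolves to the root group ``/``, and ``XYZ.1000.mcool::resolutions/10000`` resolves to the group ``/resolutions/10000``.
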
Tables
======
A **table** is a group of equal-length 1D arrays representing **columns**.
Additional groups and tables may be added to a data collection as long as they are not nested under the group of another table.
This storage mode does not enforce specific **column orders**, but conventional orders for *required* columns are provided in the listings below.
This storage mode does not set limits on the **number or length of columns**. Additional arrays may be inserted into a table to form new columns, but they must conform to the common length of the table.
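
To illustrate the equal-length rule, the sketch below models a table as a plain Python dictionary of columns and rejects a new column whose length does not match. The helpers ``table_length`` and ``add_column`` are hypothetical and operate on in-memory lists rather than on HDF5 datasets.

.. code-block:: python

    def table_length(table):
        """Return the common column length, raising if columns disagree."""
        lengths = {len(col) for col in table.values()}
        if len(lengths) > 1:
            raise ValueError(f"columns have unequal lengths: {sorted(lengths)}")
        return lengths.pop() if lengths else 0


    def add_column(table, name, values):
        """Add a column to the table if it matches the table's length."""
        if table and len(values) != table_length(table):
            raise ValueError(
                f"column {name!r} must have length {table_length(table)}"
            )
        table[name] = list(values)


    # A toy bin table with three rows, extended by a supplementary column.
    bins = {"chrom": [0, 0, 1], "start": [0, 1000, 0], "end": [1000, 1500, 800]}
    add_column(bins, "weight", [0.9, 1.1, float("nan")])
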
The table descriptions below are given in the `datashape <http://datashape.readthedocs.org/en/latest/>`_ layout language. The column **data types** are given as numpy equivalents. They are only defaults and may be altered as desired.
GZIP is chosen as the default **compression** filter for all columns. This is for portability reasons, since all versions of the HDF5 library ship with it.
chroms
------
::
    chroms: {
      # REQUIRED
      name:     typevar['Nchroms'] * string['ascii'],
      length:   typevar['Nchroms'] * int32
    }
In HDF5, ``name`` is a null-padded, fixed-length ASCII array, which maps to numpy's ``S`` dtype.
bins
----
::
    bins: {
      # REQUIRED
      chrom:    typevar['Nbins'] * categorical[typevar['name'], type=string, ordered=True],
      start:    typevar['Nbins'] * int32,
      end:      typevar['Nbins'] * int32,

      # RESERVED
      weight:   typevar['Nbins'] * float64
    }
In HDF5, we use the integer-backed ENUM type to encode the ``chrom`` column. For data collections with a very large number of scaffolds, the ENUM type information may be too large to fit in the object's metadata header. In that case, the ``chrom`` column is stored using raw integers and the enumeration is inferred from the ``chroms`` table.
Genomic intervals are stored using a `0-start, half-open <http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems>`_ representation. The first interval in a scaffold should have ``start`` = 0 and the last interval should have ``end`` = the chromosome length. Intervals are sorted by ``chrom``, then by ``start``.
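
The interval conventions above can be illustrated with a minimal sketch that generates fixed-size bins. The function ``make_fixed_bins`` is a hypothetical helper written for this example, and the chromosome sizes are made up.

.. code-block:: python

    def make_fixed_bins(chromsizes, binsize):
        """Return (chrom, start, end) columns for fixed-size bins.

        chromsizes: list of (name, length) pairs in the desired order.
        """
        chrom, start, end = [], [], []
        for chrom_id, (_name, length) in enumerate(chromsizes):
            for s in range(0, length, binsize):
                chrom.append(chrom_id)                # integer code for bins/chrom
                start.append(s)                       # 0-based, inclusive
                end.append(min(s + binsize, length))  # exclusive; last bin clipped
        return chrom, start, end


    chrom, start, end = make_fixed_bins([("chr1", 2500), ("chr2", 1000)], 1000)

Each chromosome starts at 0, its last bin ends at the chromosome length, and the rows come out already sorted by ``chrom`` and then by ``start``.
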
Because they measure the same quantity in the same units, the coordinate columns ``chroms/length``, ``bins/start`` and ``bins/end`` should be encoded using the same data type.
The :command:`cooler balance` command stores balancing weights in a column called ``weight`` by default. NaN values indicate genomic bins that were blacklisted during the balancing procedure.
pixels
------
::
    pixels: {
      # REQUIRED
      bin1_id:  typevar['Nnz'] * int64,
      bin2_id:  typevar['Nnz'] * int64,

      # RESERVED
      count:    typevar['Nnz'] * int32
    }
In the matrix coordinate system, ``bin1_id`` refers to the ith axis and ``bin2_id`` refers to the jth. Bin IDs are zero-based, i.e. we start counting at 0. Pixels are sorted by ``bin1_id`` then by ``bin2_id``.
The ``count`` column is integer by default, but floating point types can be substituted. Additional columns are to be interpreted as supplementary value columns.
.. warning:: `float16 <https://github.com/hetio/hetio/pull/15>`_ has limited support from 3rd party libraries and is not recommended. For floating point value columns consider using either single- (float32) or double-precision (float64).
Indexes
=======
Indexes are stored as 1D arrays in a separate group called ``indexes``. They can be thought of as run-length encodings of the ``bins/chrom`` and ``pixels/bin1_id`` columns, respectively. Both arrays are required.
::
    indexes: {
      chrom_offset:  (typevar['Nchroms'] + 1) * int64,
      bin1_offset:   (typevar['Nbins'] + 1) * int64
    }
* ``chrom_offset``: gives, for each chromosome, the row in the bin table at which it first appears. The last element stores the length of the bin table.
* ``bin1_offset``: gives, for each bin1 ID, the row in the pixel table at which it first appears. The last element stores the length of the pixel table. This index is usually called *indptr* in CSR data structures.
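
As a minimal sketch of how such offset arrays can be derived from a sorted ID column, the example below counts rows per ID and takes a running sum. The helper ``build_offsets`` and the toy data are assumptions made for illustration; they are not part of the cooler API.

.. code-block:: python

    def build_offsets(sorted_ids, n_groups):
        """Return an offset array of length n_groups + 1 for a sorted ID column.

        offsets[k] is the first row whose ID is >= k, so the rows with ID k
        occupy the half-open range offsets[k]:offsets[k + 1].
        """
        offsets = [0] * (n_groups + 1)
        # Count rows per ID, shifted by one position...
        for i in sorted_ids:
            offsets[i + 1] += 1
        # ...then take the running sum to turn counts into start positions.
        for k in range(n_groups):
            offsets[k + 1] += offsets[k]
        return offsets


    bins_chrom = [0, 0, 0, 1, 1]    # bins/chrom: 2 chromosomes, 5 bins
    pixels_bin1 = [0, 0, 1, 3, 3]   # pixels/bin1_id, sorted
    pixels_bin2 = [0, 2, 1, 3, 4]   # pixels/bin2_id

    chrom_offset = build_offsets(bins_chrom, 2)
    bin1_offset = build_offsets(pixels_bin1, 5)

    # All pixels in matrix row 3 sit in one contiguous slice of the pixel table.
    lo, hi = bin1_offset[3], bin1_offset[4]
    row3_columns = pixels_bin2[lo:hi]

A bin with no pixels, such as bin 2 here, yields an empty slice because its two neighboring offsets are equal.
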
Storage mode
============
Storing a symmetric matrix requires only the *upper triangular part, including the diagonal*, since the remaining elements can be reconstructed from the former ones. To indicate the use of this **mode of matrix storage** to client software, the value of the metadata attribute ``storage-mode`` must be set to ``"symmetric-upper"`` (see `Metadata`_).
.. versionadded:: 3
To indicate the absence of a special storage mode, e.g. for **non-symmetric** matrices, ``storage-mode`` must be set to ``"square"``. This storage mode indicates to client software that 2D range queries should not be symmetrized.
.. warning:: In schema v2 and earlier, the symmetric-upper storage mode is always assumed.
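
The difference between the two storage modes can be sketched by expanding a few pixel records into a small dense matrix. The function ``to_dense`` is a hypothetical illustration using nested lists; real client software would operate on sparse or array data structures instead.

.. code-block:: python

    def to_dense(n_bins, bin1_id, bin2_id, count, storage_mode):
        """Expand pixel records into a dense n_bins x n_bins list of lists."""
        matrix = [[0] * n_bins for _ in range(n_bins)]
        for i, j, v in zip(bin1_id, bin2_id, count):
            matrix[i][j] = v
            if storage_mode == "symmetric-upper" and i != j:
                # Mirror off-diagonal elements into the lower triangle.
                matrix[j][i] = v
        return matrix


    bin1_id = [0, 0, 1]
    bin2_id = [0, 2, 1]
    count = [5, 2, 7]

    symmetric = to_dense(3, bin1_id, bin2_id, count, "symmetric-upper")
    square = to_dense(3, bin1_id, bin2_id, count, "square")

With the same pixel table, only the ``"symmetric-upper"`` interpretation fills in the element at row 2, column 0.
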
Metadata
========
Essential key-value properties are stored as `HDF5 attributes <http://docs.h5py.org/en/stable/high/attr.html>`_ at the top-level group of the data collection. Note that depending on where the data collection is located in the file, this can be different from the root group of the entire file ``/``.
.. rubric:: Required attributes
.. describe:: format : string (constant)
"HDF5::Cooler"
.. describe:: format-version : int
The schema version used.
.. describe:: bin-type : { "fixed", "variable" }
Indicates whether the resolution is constant along both axes.
.. describe:: bin-size : int or "null"
Size of genomic bins in base pairs if bin-type is "fixed". Otherwise, "null".
.. describe:: storage-mode : { "symmetric-upper", "square" }
Indicates whether ordinary sparse matrix encoding is used ("square") or whether a symmetric matrix is encoded by storing only the upper triangular elements ("symmetric-upper").
.. rubric:: Reserved, but optional
.. describe:: assembly : string
Name of the genome assembly, e.g. "hg19".
.. describe:: generated-by : string
Agent that created the file, e.g. "cooler-x.y.z".
.. describe:: creation-date : datetime string
The moment the collection was created.
.. describe:: metadata : JSON
Arbitrary JSON-compatible **user metadata** about the experiment.
All scalar string attributes, including serialized JSON, must be stored as **variable-length UTF-8 encoded strings**.
.. warning:: When assigning scalar string attributes in Python 2, always store values having ``unicode`` type. In h5py, assigning a Python text string (Python 3 ``str`` or Python 2 ``unicode``) to an HDF5 attribute results in variable-length UTF-8 storage.
Additional metadata may be stored in other top-level attributes and the attributes of table groups and columns.
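
The required attributes listed above can be checked with a short validation routine. This is a sketch for schema version 3 only, operating on a plain dictionary; ``check_attrs`` and the example values are hypothetical, and the sketch does not read attributes from an actual HDF5 file.

.. code-block:: python

    REQUIRED_ATTRS = (
        "format", "format-version", "bin-type", "bin-size", "storage-mode",
    )


    def check_attrs(attrs):
        """Return a list of problems found in a mapping of top-level attributes."""
        problems = [
            f"missing attribute: {key}" for key in REQUIRED_ATTRS if key not in attrs
        ]
        if problems:
            return problems
        if attrs["format"] != "HDF5::Cooler":
            problems.append("format must be 'HDF5::Cooler'")
        if attrs["bin-type"] not in ("fixed", "variable"):
            problems.append("bin-type must be 'fixed' or 'variable'")
        if attrs["bin-type"] == "variable" and attrs["bin-size"] != "null":
            problems.append("bin-size must be 'null' for variable bins")
        if attrs["storage-mode"] not in ("symmetric-upper", "square"):
            problems.append("storage-mode must be 'symmetric-upper' or 'square'")
        return problems


    example_attrs = {
        "format": "HDF5::Cooler",
        "format-version": 3,
        "bin-type": "fixed",
        "bin-size": 1000,
        "storage-mode": "symmetric-upper",
    }
    problems = check_attrs(example_attrs)
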
File flavors
============
Many cooler data collections can be stored in a single file. We recognize the following conventional **layouts**:
Single-resolution
-----------------
* A single-resolution cooler file that contains a single data collection under the ``/`` group. Conventional file extension: ``.cool``.
::
    XYZ.1000.cool
    /
    ├── bins
    ├── chroms
    ├── pixels
    └── indexes
Multi-resolution
----------------
* A multi-resolution cooler file that contains multiple "coarsened" resolutions or "zoom-levels" derived from the same dataset. Multires cooler files should store each data collection underneath a group called ``/resolutions`` within a sub-group whose name is the bin size (e.g., ``XYZ.1000.mcool::resolutions/10000``). If the base cooler has variable-length bins, then use ``1`` to designate the base resolution, and use the coarsening multiplier (e.g. ``2``, ``4``, ``8``, etc.) to name the lower resolutions. Conventional file extension: ``.mcool``.
::
    XYZ.1000.mcool
    /
    └── resolutions
        ├── 1000
        │   ├── bins
        │   ├── chroms
        │   ├── pixels
        │   └── indexes
        ├── 2000
        │   ├── bins
        │   ├── chroms
        │   ├── pixels
        │   └── indexes
        ├── 5000
        │   ├── bins
        │   ├── chroms
        │   ├── pixels
        │   └── indexes
        ├── 10000
        │   ├── bins
        │   ├── chroms
        │   ├── pixels
        │   └── indexes
        .
        .
        .
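
The naming convention for the sub-groups can be sketched as follows. The function ``resolution_uris`` is a hypothetical helper written for this example; it only builds URI strings and does not touch any file.

.. code-block:: python

    def resolution_uris(mcool_path, base_binsize, factors, variable_bins=False):
        """Return URIs for the data collections of a multi-resolution file."""
        uris = []
        for factor in factors:
            # Fixed bins: name the group by bin size.
            # Variable bins: name the group by the coarsening multiplier.
            name = factor if variable_bins else base_binsize * factor
            uris.append(f"{mcool_path}::/resolutions/{name}")
        return uris


    uris = resolution_uris("XYZ.1000.mcool", 1000, [1, 2, 5, 10])

For a base cooler with variable-length bins, passing ``variable_bins=True`` names the groups ``1``, ``2``, ``4`` and so on, matching the multipliers supplied in ``factors``.
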
In addition, a multi-resolution cooler file may indicate to clients that it is using this layout with the following ``/``-level attributes:
.. describe:: format : string (constant)
"HDF5::MCOOL"
.. describe:: format-version : int
2
.. describe:: bin-type : { "fixed", "variable" }
Indicates whether the resolution is constant along both axes.
.. note::
The old multi-resolution layout used resolutions strictly in increments of *powers of 2*. In this layout (MCOOL version 2), the data collections are named by zoom level, starting with ``XYZ.1000.mcool::0`` being the coarsest resolution up until the finest or "base" resolution (e.g., ``XYZ.1000.mcool::14`` for 14 levels of coarsening).
.. versionchanged:: 0.8
Both the legacy layout and the new mcool layout are supported by `HiGlass <http://higlass.io/app/>`_. Prior to cooler 0.8, the new layout was produced only when requesting a specific list of resolutions. As of cooler 0.8, the new layout is always produced by the :command:`cooler zoomify` command unless the ``--legacy`` option is given. Files produced by :py:func:`cooler.zoomify_cooler`, `hic2cool <https://github.com/4dn-dcic/hic2cool/>`_, and the mcools from the `4DN data portal <https://data.4dnucleome.org/>`_ also follow the new layout.
Single-cell (single-resolution)
-------------------------------
A single-cell cooler file contains all the matrices of a single-cell Hi-C data set. All cells are stored under a group called ``/cells``, and all cells share the primary bin table columns, i.e. ``bins['chrom']``, ``bins['start']`` and ``bins['end']``, which are `hardlinked <http://docs.h5py.org/en/stable/high/group.html#hard-links>`_ to the root-level bin table. Any individual cell can be accessed using the regular :class:`cooler.Cooler` interface.
Conventional file extension: ``.scool``.
::
    XYZ.scool
    /
    ├── bins
    ├── chroms
    └── cells
        ├── cell_id1
        │   ├── bins
        │   ├── chroms
        │   ├── pixels
        │   └── indexes
        ├── cell_id2
        │   ├── bins
        │   ├── chroms
        │   ├── pixels
        │   └── indexes
        ├── cell_id3
        │   ├── bins
        │   ├── chroms
        │   ├── pixels
        │   └── indexes
        ├── cell_id4
        │   ├── bins
        │   ├── chroms
        │   ├── pixels
        │   └── indexes
        .
        .
        .
In addition, a single-cell single-resolution cooler file may indicate to clients that it is using this layout with the following ``/``-level attributes:
.. describe:: format : string (constant)
"HDF5::SCOOL"
.. describe:: format-version : int
1
.. describe:: bin-type : { "fixed", "variable" }
Indicates whether the resolution is constant along both axes.
.. describe:: bin-size : int
The bin resolution.
.. describe:: nbins : int
The number of bins.
.. describe:: nchroms : int
The number of chromosomes of the cells.
.. describe:: ncells : int
The number of stored cells.