File: hdf5_format.rst

package info (click to toggle)
mdtraj 1.11.0-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 79,324 kB
  • sloc: python: 25,217; ansic: 6,266; cpp: 5,685; xml: 1,252; makefile: 192
file content (371 lines) | stat: -rw-r--r-- 18,008 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
.. _HDF5FormatSpec:

MDTraj HDF5 Format Specification
================================

Overview
--------

This is the specification for a molecular dynamics trajectory file
format based on HDF5 which is supported by the MDTraj package.

Why HDF5?
---------

Design Goals
~~~~~~~~~~~~

In storing MD trajectory data for the for purposes including very large
scale analysis, there are a few design goals. (1) The trajectories should
be small and space efficient on disk. (2) The trajectories should be fast to
write and fast to read. (3) The data format should support flexible read
options. For instance, random access to different frames in the
trajectory should be possible. It should be possible to easily query the
dimensions of the trajectory (``n_frames``, ``n_atoms``, etc) without
loading the file into memory. It should be possible to load only every
n-th frame, or to directly only a subset of the atoms with limited
memory overhead. (5) The trajectory format should be easily extensible
in a backward compatible manner. For instance, it should be possible to
add new arrays/fields like the potential energy or the topology without
breaking backwards compatibility.

Other Formats
~~~~~~~~~~~~~

Currently, MDTraj is able to read and write trajectories in DCD, XTC,
TRR, and AMBER NetCDF formats, in addition to HDF5. This
presents an opportunity to compare these formats and see how they fit
our design goals. The most space efficient is XTC, because it uses 16 bit
fixed precision encoding. For some reason, the XTC read times are quite
slow though. DCD is fast to read and write, but relatively inflexible.
NetCDF is fast and flexible. BINPOS and MDCRD are garbage -- they're
neither fast, small, nor flexible.

What's lacking?
~~~~~~~~~~~~~~~

Of the formats we currently have, AMBER NetCDF is the best, in that it
it satisfies all of the design goals except for the first. But the
trajectories are twice as big on disk as XTC, which is really quite
unfortunate. For dealing with large data sets, size matters. So let's
define a HDF5 standard that has the benefits of AMBER NetCDF and the
benefits of XTC mixed together. We'll use an extensible data format
(HDF5), we'll provide options for lossy and lossless compression, and
we'll store the topology inside the trajectory, so that a single
trajectory file always contains the information needed to understand
(and visualize) the system.

Details
-------

This specification is heavily influenced by the AMBER NetCDF
`standard <http://ambermd.org/netcdf/nctraj.html>`__. Significant
portions of the text are copied verbatim.

Encoding
~~~~~~~~

-  Files will be encoded in HDF5, `a data model, library, and file
   format for storing and managing
   data <http://www.hdfgroup.org/HDF5/>`__ produced at NCSA.
-  Arrays may be encoded with zlib compression.
-  Libraries implementing this standard may, at their desecration, round
   the data to an appropriate number of significant digits, which can
   significantly enhance zlib compression ratios.

Recommended file extensions
~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  The ``.h5`` extension will be preferred, although not required.

Global Attributes
~~~~~~~~~~~~~~~~~

-  **Conventions (required)** Contents of this attribute are a comma or
   space delimited list of tokens representing all of the conventions to
   which the file conforms. Creators shall include the string Pande as
   one of the tokens in this list. In the usual case, where the file
   conforms only to this convention, the value of the attribute will
   simply be \`\`Pande''. Readers may fail if this attribute is not
   present or none of the tokens in the list are Pande. Optionally, if
   the reader does not expect HDF5 files other than those conforming to
   the Pande convention, it may emit a warning and attempt to read the
   file even when the Conventions attribute is missing.
-  **ConventionVersion (required)** Contents are a string representation
   of the version number of this convention. Future revisions of this
   convention having the same version number may include definitions of
   additional variables, dimensions or attributes, but are guaranteed to
   have no incompatible changes to variables, dimensions or attributes
   specified in previous revisions. Creators shall set this attribute to
   "1.1". If this attribute is present and has a value other than "1.1",
   readers may fail or may emit a warning and continue. It is expected
   that the version of this convention will change rarely, if ever.
-  **application (optional)** If the creator is part of a suite of
   programs or modules, this attribute shall be set to the name of the
   suite.
-  **program (required)** Creators shall set this attribute to the name
   of the creating program or module.
-  **programVersion (required)** Creators shall set this attribute to
   the preferred textual formatting of the current version number of the
   creating program or module.
-  **title (optional)** Creators may set use this attribute to represent
   a user-defined title for the data represented in the file. Absence of
   a title may be indicated by omitting the attribute or by including it
   with an empty string value.
-  **randomState (optional), ASCII-encoded string** Creators may
   optionally describe the state of their internal random number
   generators at the start of their simulation. The semantics of this
   string are specific to the MD code and are not specified by this
   standard.
-  **forcefield (optional), ASCII-encoded string** For data from a
   molecular dynamics simulation, creators may optionally describe the
   Hamiltonian used. This should be a short, human readable string, like
   "AMBER99sbildn".
-  **reference (optional), ASCII-encoded string** Creators may
   optionally specify published a reference that documents the program
   or parameters used to generate the data. The reference should be
   listed in a simple, human readable format. Multiple references may be
   listed simply by separating the references with a human readable
   delimiter within the string, like a newline.

Arrays
~~~~~~

-  **coordinates (required) shape=(n\_frames, n\_atoms, 3),
   type=Float32, units="nanometers"**. This variable shall contain the
   Cartesian coordinates of the specified particle for the specified.
-  **time (optional), shape=(n\_frames), dtype=Float32,
   units="picoseconds"** When coordinates on the frame dimension have a
   temporal sequence (e.g. they form a molecular dynamics trajectory),
   creators shall define this dimension and write a float for each frame
   coordinate representing the simulated time value in picoseconds
   associated with the frame. Time zero is arbitrary, but typically will
   correspond to the start of the simulation. When the file stores a
   collection of conformations having no temporal sequence, creators
   shall omit this variable.
-  **cell\_lengths (optional), shape=(n\_frames, 3, 3), dtype=Float32,
   units="nanometers"** When the data in the coordinates variable come
   from a simulation with periodic boundaries, creators shall include
   this variable. his variable shall represent the lengths (a,b,c) of
   the unit cell for each frame. The edge with length a lies along the x
   axis; the edge with length b lies in the x-y plane. The origin (point
   of invariance under scaling) of the unit cell is defined as (0,0,0).
   If the simulation has one or two dimensional periodicity, then the
   length(s) corresponding to spatial dimensions in which there is no
   periodicity shall be set to zero.
-  **cell\_angles shape=(n\_frames, 3, 3), dtype=Float32,
   units="degrees"** Creators shall include this variable if and only if
   they include the cell\_lengths variable. This variable shall
   represent the angles (, , ) defining the unit cell for each frame.
   defines the angle between the b and c vectors, defines the angle
   between the a and c vectors and defines the angle between the a and b
   vectors. Angles that are undefined due to less than three dimensional
   periodicity shall be set to zero.
-  **velocities (optional), shape=(n\_frames, n\_atoms, 3),
   type=Float32, units="nanometers/picosecond"** When the velocities
   variable is present, it shall represent the cartesian components of
   the velocity for the specified particle and frame. It is recognized
   that due to the nature of commonly used integrators in molecular
   dynamics, it may not be possible for the creator to write a set of
   velocities corresponding to exactly the same point in time as defined
   by the time variable and represented in the coordinates variable. In
   such cases, the creator shall write a set of velocities from the
   nearest point in time to that represented by the specified frame.
-  **kineticEnergy (optional), shape=(n\_frames), type=Float32,
   units="kJ/mol"** Creators may optionally specify the kinetic energy
   of the system at each frame.
-  **potentialEnergy (optional), shape=(n\_frames), type=Float32,
   units="kJ/mol"** Creators may optionally specify the potential energy
   of the system at each frame.
-  **temperature (optional), shape=(n\_frames), type=Float32,
   units="Kelvin"** Creators may optionally specify the temperature of
   the system at each frame.
-  **lambda (optional), shape=(n\_frames), type=Floa32 units=""** For
   describing an alchemical free energy simulation, a creator may
   optionally notate each frame in the simulation with a value of
   lambda.
-  **constraints (optional), shape=(n\_constraints, 3),
   type=CompoundType(int, int, float) units=[None, None, "nanometers"]**
   Creators may optionally describe any constraints applied to the bond
   lengths. ``constraints`` shall be a compound-type table (referred to
   a table as opposed to an array in the pytables documentation), such
   that the first two entries are the indices of the two atoms involved
   in the constant, and the final entry is the distance those atoms are
   constrained to.
-  **topology (optional, but highly recommended), shape=(1,
   length\_as\_needed) type=string** For protein systems, creators shall
   describe the topology of the system in ASCII encoded JSON. The format
   for the topology definition is described in the topology subsection
   of this document. The JSON string encoding the topology shall be
   stored as the sole row in an array of strings.


Array Metadata
~~~~~~~~~~~~~~

-  For arrays that contain naturally unitted numbers (which is all of
   them except for 'topology'), creators shall explicitly declare their
   units. The unit system of length=nanometers, time=picoseconds,
   mass=daltons, temperature=Kelvin, energy=kJ/mol, force=kJ/mol/nm
   shall be used everywhere. For angles, degrees shall be used. The
   units shall be set as an "attribute", on the array, under the key
   "units", within the parlance of HDF5. It shall be a string.

-  For arrays that contain numbers which have been rounded to a certain
   number of significant digits, creators shall declare the number of
   significant digits by setting the "least\_significant\_digit"
   attribue, which should be a positive integer.


Extended Arrays
~~~~~~~~~~~~~~~
Creators may extend this format by adding new arrays. Arrays containing
per-atom and per-frame data that naturally possesses physical units should
declare those units explicitly in the array attributes. Readers should be
flexible, ignoring the presence of arrays that they are not equiped to handle.


Topology
--------

Rational
~~~~~~~~

It is our experience that not having the topology stored in the same
file as the the trajectory's coordinate data is a pain -- it's just really
inconvenient. And generally, the trajectories are long enough that it
doesn't take up much incremental storage space to store the topology in
there too. The topology is not that complicated.

Format
~~~~~~

The topology will be stored in JSON. The JSON will then be serialized as
a string and stored in the HDF5 file with an ASCII encoding.

The topology stores a hierarchical description of the chains, residues,
and atoms in the system. Each chain is associated with an ``index`` and
a list of residues. Each residue is associated with a ``name``, an
``index``, a ``resSeq`` index (not zero-indexed), and a list of ``atom``\ s.
Each ``atom`` is associated with a
``name``, an ``element``, and an ``index``. All of the indicies should
be zero-based.

The ``name`` of a residue is not strictly proscribed, but should
generally follow PDB 3.0 nomenclature. The ``element`` of an atom
shall be one of the one or two letter element abbreviations from the
periodic table. The ``name`` of an atom shall indicate some information
about the type of the atom beyond just its element, such as 'CA' for
the alpha carbom, 'HG' for a gamma hydrogen, etc. This format
does not specify exactly what atom names are allowed -- creators should
follow the conventions from the forcefield they are using.

In addition to the chains, the topology shall also contain a list of the
bonds. The bonds shall be a list of length-2 lists of integers, where
the integers refer to the index of the two ``atoms`` that are bonded.

Example
~~~~~~~

The following shows the topology of alanine dipeptide in this format.
Since it's JSON, the whitespace is optional and just for readability.

::

    {'bonds': [[4, 1],
               [4, 5],
               [1, 0],
               [1, 2],
               [1, 3],
               [4, 6],
               [14, 8],
               [14, 15],
               [8, 10],
               [8, 9],
               [8, 6],
               [10, 11],
               [10, 12],
               [10, 13],
               [7, 6],
               [14, 16],
               [18, 19],
               [18, 20],
               [18, 21],
               [18, 16],
               [17, 16]],
     'chains': [{'index': 0,
                 'residues': [{'atoms': [{'element': 'H',
                                          'index': 0,
                                          'name': 'H1'},
                                         {'element': 'C',
                                          'index': 1,
                                          'name': 'CH3'},
                                         {'element': 'H',
                                          'index': 2,
                                          'name': 'H2'},
                                         {'element': 'H',
                                          'index': 3,
                                          'name': 'H3'},
                                         {'element': 'C',
                                          'index': 4,
                                          'name': 'C'},
                                         {'element': 'O',
                                          'index': 5,
                                          'name': 'O'}],
                               'index': 0,
                               'resSeq': 1,
                               'name': 'ACE'},
                              {'atoms': [{'element': 'N',
                                          'index': 6,
                                          'name': 'N'},
                                         {'element': 'H',
                                          'index': 7,
                                          'name': 'H'},
                                         {'element': 'C',
                                          'index': 8,
                                          'name': 'CA'},
                                         {'element': 'H',
                                          'index': 9,
                                          'name': 'HA'},
                                         {'element': 'C',
                                          'index': 10,
                                          'name': 'CB'},
                                         {'element': 'H',
                                          'index': 11,
                                          'name': 'HB1'},
                                         {'element': 'H',
                                          'index': 12,
                                          'name': 'HB2'},
                                         {'element': 'H',
                                          'index': 13,
                                          'name': 'HB3'},
                                         {'element': 'C',
                                          'index': 14,
                                          'name': 'C'},
                                         {'element': 'O',
                                          'index': 15,
                                          'name': 'O'}],
                               'index': 1,
                               'resSeq': 2,
                               'name': 'ALA'},
                              {'atoms': [{'element': 'N',
                                          'index': 16,
                                          'name': 'N'},
                                         {'element': 'H',
                                          'index': 17,
                                          'name': 'H'},
                                         {'element': 'C',
                                          'index': 18,
                                          'name': 'C'},
                                         {'element': 'H',
                                          'index': 19,
                                          'name': 'H1'},
                                         {'element': 'H',
                                          'index': 20,
                                          'name': 'H2'},
                                         {'element': 'H',
                                          'index': 21,
                                          'name': 'H3'}],
                               'index': 2,
                               'resSeq': 3,
                               'name': 'NME'}]}]}