Low-level file layout
=====================
The overall structure of a file is as follows (in order):
- :ref:`header`
- :ref:`comments`, optional
- :ref:`tree`, optional
- Zero or more :ref:`block`
- :ref:`block-index`, optional
ASDF is a hybrid text and binary format. The header, tree and block
index are text (specifically, UTF-8 with DOS or UNIX-style newlines),
while the blocks are raw binary.
The low-level file layout is designed in such a way that the tree
section can be edited by hand, possibly changing its size, without
requiring changes in other parts of the file. While such an operation
may invalidate the :ref:`block-index`, the format is designed so that
if the block index is removed or invalid, it may be regenerated by
"skipping along" the blocks in the file.
The same is not true for resizing a block, which has an explicit size
stored in the block header (except for, optionally, the last block).
Note that, by design, an ASDF file containing no binary blocks is
also a completely standard and valid YAML file.
Additionally, the spec allows for extra unallocated space after the
tree and between blocks. This allows libraries to more easily update
the files in place, since it allows expansion of certain areas without
rewriting of the entire file.
.. _header:
Header
------
All ASDF files must start with a short one-line header. For example::
#ASDF 1.0.0
It is made up of two parts, separated by white space characters:
- **ASDF token**: The constant string ``#ASDF``. This can be used to
quickly identify the file as an ASDF file by reading the first 5
bytes. It begins with a ``#`` so it will be treated as a YAML
comment such that the :ref:`header` and the :ref:`tree` together
form a valid YAML file.
- **File format version**: The version of the low-level file format
that this file was written with. This version may differ from the
version of the ASDF specification, and is only updated when a
change is made that affects the layout of the file. It follows the
`Semantic Versioning 2.0.0 <http://semver.org/spec/v2.0.0.html>`__
specification. See :ref:`versioning-conventions` for more
information about these versions.
The header in EBNF form::
asdf_token = "#ASDF"
header = asdf_token " " format_version ["\r"] "\n"
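As an illustration of the grammar above, the following is a minimal Python sketch that recognizes the header line. The EBNF shown here does not define ``format_version``, so the pattern assumes it is three dot-separated decimal integers (the Semantic Versioning core form); ``parse_header`` is a hypothetical helper name, not part of any ASDF library.

```python
import re

# Assumption: format_version is MAJOR.MINOR.PATCH with decimal integers only.
HEADER_RE = re.compile(rb"#ASDF (\d+)\.(\d+)\.(\d+)\r?\n")


def parse_header(data: bytes):
    """Return (major, minor, patch) from the first line, or None if absent."""
    match = HEADER_RE.match(data)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


print(parse_header(b"#ASDF 1.0.0\n%YAML 1.1\n"))
```

Matching on bytes rather than decoded text keeps the check independent of whatever binary content follows later in the file.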
.. _comments:
Comments
--------
Additional comment lines may appear between the Header and the Tree.
The use of comments here is intended for information for the ASDF
parser, and not information of general interest to the end user. All
data of interest to the end user should be in the Tree.
Each line must begin with a ``#`` character.
.. _tree:
Tree
----
The tree stores structured information using a subset of `YAML Ain’t Markup
Language (YAML™) 1.1 <http://yaml.org/spec/1.1/>`__ syntax (see :ref:`yaml_subset` for
details on YAML features that are excluded from ASDF). While it
is the main part of most ASDF files, it is entirely optional, and an
ASDF file may skip it completely. This is useful for creating files
in :ref:`exploded`. Interpreting the contents of this section is
described in greater detail in :ref:`tree-in-depth`. This section
only deals with the serialized representation of the tree, not its
logical contents.
The tree is always encoded in UTF-8, without an explicit byte order
mark (BOM). Newlines in the tree may be either DOS (``"\r\n"``) or
UNIX (``"\n"``) format.
In ASDF |version|, the tree must be encoded in `YAML version 1.1
<http://yaml.org/spec/1.1/>`__. At the time of this writing, the
latest version of the YAML specification is 1.2; however, most YAML
parsers only support YAML 1.1, and the benefits of YAML 1.2 are minor.
Therefore, for maximum portability, ASDF requires that the YAML is
encoded in YAML 1.1. To declare that YAML 1.1 is being used, the tree
must begin with the following line::
%YAML 1.1
The tree must contain exactly one YAML document, starting with ``---``
(YAML document start marker) and ending with ``...`` (YAML document
end marker), each on their own line. Between these two markers is the
YAML content. For example::
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.0.0
data: !core/ndarray-1.0.0
source: 0
datatype: float64
shape: [1024, 1024]
...
The size of the tree is not explicitly specified in the file, so that
it can easily be edited by hand. Therefore, ASDF parsers must search
for the end of the tree by looking for the end-of-document marker
(``...``) on its own line. For example, the following regular
expression may be used to find the end of the tree::
\r?\n\.\.\.\r?\n
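A minimal Python sketch of that search follows. It operates on the raw bytes of the file, and the dots are escaped so that they match literal periods rather than any character. ``find_tree_end`` is a hypothetical helper name.

```python
import re

# Escaped dots: only a line consisting of exactly "..." should match.
END_OF_TREE_RE = re.compile(rb"\r?\n\.\.\.\r?\n")


def find_tree_end(data: bytes, start: int = 0):
    """Return the offset just past the end-of-document marker, or None."""
    match = END_OF_TREE_RE.search(data, start)
    return None if match is None else match.end()


sample = b"#ASDF 1.0.0\n%YAML 1.1\n---\nfoo: a...b\n...\n   "
print(find_tree_end(sample))
```

In the sample, the ``...`` inside ``a...b`` is not matched because it is not on its own line.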
Though not required, the tree should be followed by some unused space
to allow for the tree to be updated and increased in size without
performing an insertion operation in the file. It also may be
desirable to align the start of the first block to a filesystem block
boundary. This empty space may be filled with any content (as long as
it doesn't contain the ``block_magic_token`` described in
:ref:`block`). It is recommended that the content is made up of space
characters (``0x20``) so it appears as empty space when viewing the
file.
.. _block:
Blocks
------
Following the tree and some empty space, or immediately following the
header, there are zero or more binary blocks.
Blocks represent a contiguous chunk of binary data and nothing more.
Information about how to interpret the block, such as the data type or
array shape, is stored entirely in ``ndarray`` structures in the tree,
as described in :ref:`ndarray <core/ndarray-1.0.0>`. This allows for a very
flexible type system on top of a very simple approach to memory management
within the file. It also allows for new extensions to ASDF that might
interpret the raw binary data in ways that are yet to be defined.
There may be an arbitrary amount of unused space between the end of
the tree and the first block. To find the beginning of the first
block, ASDF parsers should search from the end of the tree for the
first occurrence of the ``block_magic_token``. If the file contains
no tree, the first block must begin immediately after the header with
no padding.
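The search for the first block can be sketched in Python as below, assuming the file contents are available as a ``bytes`` object and that the end of the tree has already been located. ``find_first_block`` is a hypothetical helper name.

```python
BLOCK_MAGIC = b"\xd3BLK"  # block_magic_token: d3 42 4c 4b


def find_first_block(data: bytes, tree_end: int):
    """Return the offset of the first block at or after tree_end, or None."""
    offset = data.find(BLOCK_MAGIC, tree_end)
    return None if offset < 0 else offset


padded = b"...\n" + b" " * 10 + BLOCK_MAGIC + b"\x00\x30"
print(find_first_block(padded, 4))
```

Starting the search at the end of the tree, rather than at the start of the file, avoids false matches on tree content.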
.. _block-header:
Block header
^^^^^^^^^^^^
Each block begins with the following header:
- ``block_magic_token`` (4 bytes): Indicates the start of the block.
This allows the file to contain some unused space in which to grow
the tree, and to perform consistency checks when jumping from one
block to the next. It is made up of the following 4 8-bit characters:
- in hexadecimal: d3, 42, 4c, 4b
- as a byte string with an octal escape: ``"\323BLK"``
- ``header_size`` (16-bit unsigned integer, big-endian): Indicates the
size of the remainder of the header (not including the length of the
``header_size`` entry itself or the ``block_magic_token``), in bytes.
It is stored explicitly in the header itself so that the header may be
enlarged in a future version of the ASDF standard while retaining
backward compatibility. Importantly, ASDF parsers should not assume
a fixed size of the header, but should obey the ``header_size``
defined in the file. In ASDF version 0.1, this should be at least
48 (the combined size of the fields listed below), but may be larger,
for example to align the beginning of the block content with a file
system block boundary.
- ``flags`` (32-bit unsigned integer, big-endian): A bit field
containing flags (described below).
- ``compression`` (4-byte byte string): The name of the compression
algorithm, if any. Should be ``\0\0\0\0`` to indicate no
compression. See :ref:`compression` for valid values.
- ``allocated_size`` (64-bit unsigned integer, big-endian): The amount
of space allocated for the block (not including the header), in
bytes.
- ``used_size`` (64-bit unsigned integer, big-endian): The amount of
used space for the block on disk (not including the header), in
bytes.
- ``data_size`` (64-bit unsigned integer, big-endian): The size of the
block when decoded, in bytes. If ``compression`` is all zeros
(indicating no compression), it **must** be equal to ``used_size``.
If compression is being used, this is the size of the decoded block
data.
- ``checksum`` (16-byte string): An optional MD5 checksum of the used
data in the block. The special value of all zeros indicates that no
checksum verification should be performed.
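The field list above maps directly onto a fixed big-endian layout. The following Python sketch decodes it with the standard ``struct`` module; ``BlockHeader``, ``read_block_header`` and ``content_offset`` are hypothetical names chosen for illustration. Bytes beyond the 48 known ones are skipped by honoring ``header_size``.

```python
import struct
from typing import NamedTuple

BLOCK_MAGIC = b"\xd3BLK"
# flags, compression, allocated_size, used_size, data_size, checksum
FIELDS = struct.Struct(">I4sQQQ16s")  # big-endian, no padding, 48 bytes


class BlockHeader(NamedTuple):
    header_size: int
    flags: int
    compression: bytes
    allocated_size: int
    used_size: int
    data_size: int
    checksum: bytes


def read_block_header(data: bytes, offset: int) -> BlockHeader:
    """Decode the block header that starts at offset."""
    if data[offset:offset + 4] != BLOCK_MAGIC:
        raise ValueError("no block magic token at offset %d" % offset)
    (header_size,) = struct.unpack_from(">H", data, offset + 4)
    if header_size < FIELDS.size:
        raise ValueError("header_size %d is smaller than 48" % header_size)
    return BlockHeader(header_size, *FIELDS.unpack_from(data, offset + 6))


def content_offset(offset: int, header: BlockHeader) -> int:
    """Offset of the block content: magic (4) + header_size field (2) + header."""
    return offset + 4 + 2 + header.header_size
```

Computing the content offset from ``header_size`` instead of a constant keeps the reader compatible with files whose headers are larger than 48 bytes.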
Flags
^^^^^
The following bit flags are understood in the ``flags`` field:
- ``STREAMED`` (0x1): When set, the block is in streaming mode: it
extends to the end of the file, and the ``allocated_size``,
``used_size`` and ``data_size`` fields are ignored. By necessity,
any block with the ``STREAMED`` bit set must be the last block in
the file.
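A reader might apply the flag as in the sketch below, which is only one possible arrangement. ``content_length`` is a hypothetical helper; ``content_start`` and ``file_size`` are assumed to be known to the caller.

```python
STREAMED = 0x1


def content_length(flags: int, used_size: int, content_start: int, file_size: int) -> int:
    """Number of meaningful content bytes for a block."""
    if flags & STREAMED:
        # Size fields are ignored; the block runs to the end of the file.
        return file_size - content_start
    return used_size


print(content_length(STREAMED, 0, 100, 1000))
print(content_length(0, 64, 100, 1000))
```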
.. _compression:
Compression
^^^^^^^^^^^
Currently, two block compression types are supported:
- ``zlib``: The zlib lossless compression algorithm. It is widely
used, patent-unencumbered, and has an implementation released under
a permissive license in `zlib <http://www.zlib.net/>`__.
- ``bzp2``: The bzip2 lossless compression algorithm. It is widely
used, assumed to be patent-unencumbered, and has an implementation
released under a permissive license in the `bzip2 library
<http://www.bzip.org/>`__.
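Dispatching on the ``compression`` field could look like the following Python sketch. It assumes the used bytes of a block form one complete zlib or bzip2 stream, which this section does not state explicitly, and ``decode_block`` is a hypothetical helper name.

```python
import bz2
import zlib

# Maps the 4-byte compression field to a decoding function.
DECOMPRESSORS = {
    b"\0\0\0\0": lambda raw: raw,  # no compression
    b"zlib": zlib.decompress,
    b"bzp2": bz2.decompress,
}


def decode_block(compression: bytes, used: bytes, data_size: int) -> bytes:
    """Decode the used bytes of a block and check the result against data_size."""
    try:
        decompress = DECOMPRESSORS[compression]
    except KeyError:
        raise ValueError("unknown compression %r" % compression) from None
    decoded = decompress(used)
    if len(decoded) != data_size:
        raise ValueError("decoded size does not match data_size")
    return decoded
```

Checking the decoded length against ``data_size`` catches both truncated input and a mismatched header early.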
Block content
^^^^^^^^^^^^^
Immediately following the block header, there are exactly
``used_size`` bytes of meaningful data, followed by
``allocated_size - used_size`` bytes of unused data. The exact
content of the unused data is not enforced. The ability to have gaps
of unused space allows an ASDF writer to reduce the number of disk
operations when updating the file.
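The relationship between used and allocated space can be sketched in Python as follows, including the optional MD5 check described for the ``checksum`` field. Both function names are hypothetical, and the sketch assumes a non-streamed block whose header has already been decoded.

```python
import hashlib

NO_CHECKSUM = b"\0" * 16


def read_used_data(data: bytes, content_start: int, used_size: int, checksum: bytes) -> bytes:
    """Return the meaningful bytes of a block, verifying the MD5 if one is set."""
    used = data[content_start:content_start + used_size]
    if len(used) != used_size:
        raise ValueError("file is truncated inside the block")
    if checksum != NO_CHECKSUM and hashlib.md5(used).digest() != checksum:
        raise ValueError("block checksum mismatch")
    return used


def next_block_offset(content_start: int, allocated_size: int) -> int:
    """Where the next block (or the block index) may begin."""
    return content_start + allocated_size
```

Skipping by the allocated size, not the used size, is what lets a reader "skip along" the blocks without inspecting the unused bytes.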
.. _block-index:
Block index
-----------
The block index allows for fast random access to each of the blocks in
the file. It is completely optional: if not present, libraries may
"skip along" the block headers to find the location of each block in
the file. Libraries should detect invalid or obsolete block indices,
ignore them, and regenerate the index by skipping along the block
headers.
The block index appears at the end of the file to make streaming an
ASDF file possible without needing to determine the size of all blocks
up front, which is non-trivial in the case of compression. It also
allows for updating the index without an expensive insertion operation
earlier in the file.
The block index must appear immediately after the allocated space for
the last block in the file. If the last block is a streaming block,
no block index may be present -- the streaming block feature and block
index are incompatible.
If no blocks are present in the file, the block index must also be
absent.
The block index consists of a header, followed by a YAML document
containing the indices of each block in the file.
The header must be exactly::
#ASDF BLOCK INDEX
followed by a DOS or UNIX newline.
Following the header is a YAML document (in YAML version 1.1, like the
:ref:`tree`), containing a list of integers indicating the byte offset
of each block in the file.
The following is an example block index::
#ASDF BLOCK INDEX
%YAML 1.1
--- [2043, 16340]
...
The offsets in the block index must be monotonically increasing, and
must, by definition, be at least "block header size" apart. If they
were allowed to appear in any order, it would be impossible to rebuild
the index by skipping blocks were the index to become damaged or
out-of-sync.
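The ordering constraint can be checked on an already-parsed list of offsets, as in this Python sketch. The minimum gap of 54 bytes is an assumption: it is the smallest complete block header (4-byte magic token, 2-byte ``header_size``, and 48 bytes of fields). ``offsets_are_plausible`` is a hypothetical helper name.

```python
# Assumption: smallest possible block header, 4 + 2 + 48 bytes.
MIN_BLOCK_HEADER = 4 + 2 + 48


def offsets_are_plausible(offsets) -> bool:
    """Check that offsets are non-negative integers, increasing, and spaced apart."""
    if not offsets:
        return False  # an index must be absent when there are no blocks
    for value in offsets:
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            return False
    return all(b - a >= MIN_BLOCK_HEADER for a, b in zip(offsets, offsets[1:]))


print(offsets_are_plausible([2043, 16340]))
```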
Additional zero-valued bytes may appear after the block index. This
is mainly to support operating systems, such as Microsoft Windows,
where truncating the file may not be easily possible.
Implementation recommendations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Libraries should look for the block index by reading backward from the
end of the file.
Libraries should be conservative about what is an acceptable index,
since addressing incorrect parts of the file could result in undefined
behavior.
The following checks are recommended:
- Always ensure that the first offset entry matches the location of
the first block in the file. This will catch the common case
where the YAML tree was edited by hand without updating the index.
If they do not match, discard the entire block index.
- Ensure that the last entry in the index refers to a block magic
token, and that the end of the allocated space in the last block is
immediately followed by the block index. If either check fails,
discard the entire block index.
- When using an offset in the block index, always ensure that the
block magic token exists at that offset before reading data.
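The first two checks might be combined as in the Python sketch below. It assumes the offsets have already been parsed from the YAML document by some YAML library, and that the whole file is available as ``bytes``. The function names and the 64 KiB backward search window are hypothetical choices, not part of the format.

```python
import struct

BLOCK_MAGIC = b"\xd3BLK"
INDEX_HEADER = b"#ASDF BLOCK INDEX"


def find_index_header(data: bytes, window: int = 65536):
    """Look for the index header within the last `window` bytes of the file."""
    position = data.rfind(INDEX_HEADER, max(0, len(data) - window))
    return None if position < 0 else position


def index_is_trustworthy(data: bytes, offsets, first_block: int, index_pos: int) -> bool:
    """Apply the first-entry and last-entry checks to a parsed block index."""
    if not offsets or offsets[0] != first_block:
        return False
    last = offsets[-1]
    if data[last:last + 4] != BLOCK_MAGIC or len(data) < last + 22:
        return False
    (header_size,) = struct.unpack_from(">H", data, last + 4)
    # allocated_size follows magic (4), header_size (2), flags (4), compression (4).
    (allocated_size,) = struct.unpack_from(">Q", data, last + 14)
    return last + 6 + header_size + allocated_size == index_pos
```

The third check, confirming the magic token at an offset before each read, is left to the code that actually reads a block.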
.. _exploded:
Exploded form
-------------
Exploded form expands a self-contained ASDF file into multiple files:
- An ASDF file containing only the header and tree, which by design is
also a valid YAML file.
- *n* ASDF files, each containing a single block.
Exploded form is useful in the following scenarios:
- Not all text editors may handle the hybrid text and binary nature of
the ASDF file, and therefore either can't open an ASDF file or would
break an ASDF file upon saving. In this scenario, a user may explode
the ASDF file, edit the YAML portion as a pure YAML file, and
implode the parts back together.
- Over a network protocol, such as HTTP, a client may only need to
access some of the blocks. While reading a subset of the file can
be done using HTTP ``Range`` headers, not all web servers support
this HTTP feature. Exploded form allows each block to be requested
directly by a specific URI.
- An ASDF writer may stream a table to disk, when the size of the table
is not known at the outset. Using exploded form simplifies this,
since a standalone file containing a single table can be iteratively
appended to without worrying about any blocks that may follow it.
Exploded form describes a convention for storing ASDF file content in
multiple files, but it does not require any additions to the file
format itself. There is nothing indicating that an ASDF file is in
exploded form, other than the fact that some or all of its blocks come
from external files. The exact way in which a file is exploded is up
to the library and tools implementing the standard. In the simplest
scenario, to explode a file, each :ref:`ndarray source property
<core/ndarray-1.0.0>` in the tree is converted from a local block reference
into a relative URI.
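As a rough illustration of that simplest scenario, the Python sketch below rewrites integer ``source`` values in a tree that has been loaded into plain dictionaries and lists. Treating every mapping with an integer ``source`` key as an ndarray is an assumption made for brevity, and the ``<basename><index>.asdf`` naming scheme is hypothetical; the standard leaves the naming to the implementation.

```python
def explode_sources(node, basename: str):
    """Return a copy of the tree with local block references replaced by URIs."""
    if isinstance(node, dict):
        result = {key: explode_sources(value, basename) for key, value in node.items()}
        source = node.get("source")
        if isinstance(source, int) and not isinstance(source, bool):
            # Hypothetical naming scheme for the per-block files.
            result["source"] = "%s%04d.asdf" % (basename, source)
        return result
    if isinstance(node, list):
        return [explode_sources(item, basename) for item in node]
    return node


tree = {"data": {"source": 0, "datatype": "float64", "shape": [1024, 1024]}}
print(explode_sources(tree, "example"))
```

The function returns a new tree and leaves its input unchanged, so the original references remain available for imploding the parts again.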