Low-level file layout
=====================

The overall structure of a file is as follows (in order):

- :ref:`header`

- :ref:`comments`, optional

- :ref:`tree`, optional

- Zero or more :ref:`block`

- :ref:`block-index`, optional

ASDF is a hybrid text and binary format.  The header, tree and block
index are text (specifically, UTF-8 with DOS or UNIX-style newlines),
while the blocks are raw binary.

The low-level file layout is designed in such a way that the tree
section can be edited by hand, possibly changing its size, without
requiring changes in other parts of the file.  While such an operation
may invalidate the :ref:`block-index`, the format is designed so that
if the block index is removed or invalid, it may be regenerated by
"skipping along" the blocks in the file.

The same is not true for resizing a block, which has an explicit size
stored in the block header (except for, optionally, the last block).

Note also that, by design, an ASDF file containing no binary blocks is
also a completely standard and valid YAML file.

Additionally, the spec allows for extra unallocated space after the
tree and between blocks.  This allows libraries to more easily update
the files in place, since it allows expansion of certain areas without
rewriting of the entire file.

.. _header:

Header
------

All ASDF files must start with a short one-line header.  For example::

  #ASDF 1.0.0

It is made up of two parts, separated by white space characters:

  - **ASDF token**: The constant string ``#ASDF``. This can be used to
    quickly identify the file as an ASDF file by reading the first 5
    bytes.  It begins with a ``#`` so it will be treated as a YAML
    comment such that the :ref:`header` and the :ref:`tree` together
    form a valid YAML file.

  - **File format version**: The version of the low-level file format
    that this file was written with.  This version may differ from the
    version of the ASDF specification, and is only updated when a
    change is made that affects the layout of the file.  It follows the
    `Semantic Versioning 2.0.0 <http://semver.org/spec/v2.0.0.html>`__
    specification. See :ref:`versioning-conventions` for more
    information about these versions.

The header in EBNF form::

    asdf_token = "#ASDF"
    header     = asdf_token " " format_version ["\r"] "\n"
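
As a minimal sketch of the grammar above, the header line could be
checked and split as follows.  The function name ``parse_header`` is
hypothetical, and the pattern assumes a plain ``major.minor.patch``
version with no pre-release or build suffix, which is narrower than
what Semantic Versioning permits.

```python
import re

# Mirrors the EBNF: "#ASDF", one space, a version, optional "\r", then "\n".
# Assumption: the version is three dot-separated decimal numbers.
HEADER_RE = re.compile(rb"#ASDF (\d+)\.(\d+)\.(\d+)\r?\n")


def parse_header(first_line: bytes) -> tuple[int, int, int]:
    """Return (major, minor, patch) from the first line of an ASDF file."""
    match = HEADER_RE.fullmatch(first_line)
    if match is None:
        raise ValueError("not an ASDF header line")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch
```

For example, ``parse_header(b"#ASDF 1.0.0\n")`` yields the version as a
tuple of three integers, and a line that does not start with ``#ASDF``
raises ``ValueError``.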

.. _comments:

Comments
--------

Additional comment lines may appear between the Header and the Tree.

The use of comments here is intended for information for the ASDF
parser, and not information of general interest to the end user.  All
data of interest to the end user should be in the Tree.

Each line must begin with a ``#`` character.

.. _tree:

Tree
----

The tree stores structured information using a subset of `YAML Ain’t Markup
Language (YAML™) 1.1 <http://yaml.org/spec/1.1/>`__ syntax (see :ref:`yaml_subset` for
details on YAML features that are excluded from ASDF).  While it
is the main part of most ASDF files, it is entirely optional, and an
ASDF file may skip it completely.  This is useful for creating files
in :ref:`exploded`.  Interpreting the contents of this section is
described in greater detail in :ref:`tree-in-depth`.  This section
only deals with the serialized representation of the tree, not its
logical contents.

The tree is always encoded in UTF-8, without an explicit byte order
mark (BOM). Newlines in the tree may be either DOS (``"\r\n"``) or
UNIX (``"\n"``) format.

In ASDF |version|, the tree must be encoded in `YAML version 1.1
<http://yaml.org/spec/1.1/>`__.  At the time of this writing, the
latest version of the YAML specification is 1.2.  However, most YAML
parsers only support YAML 1.1, and the benefits of YAML 1.2 are minor.
Therefore, for maximum portability, ASDF requires that the YAML is
encoded in YAML 1.1.  To declare that YAML 1.1 is being used, the tree
must begin with the following line::

    %YAML 1.1

The tree must contain exactly one YAML document, starting with ``---``
(YAML document start marker) and ending with ``...`` (YAML document
end marker), each on their own line.  Between these two markers is the
YAML content.  For example::

      %YAML 1.1
      %TAG ! tag:stsci.edu:asdf/
      --- !core/asdf-1.0.0
      data: !core/ndarray-1.0.0
        source: 0
        datatype: float64
        shape: [1024, 1024]
      ...

The size of the tree is not explicitly specified in the file, so that
it can easily be edited by hand.  Therefore, ASDF parsers must search
for the end of the tree by looking for the end-of-document marker
(``...``) on its own line.  For example, the following regular
expression may be used to find the end of the tree::

   \r?\n\.\.\.\r?\n
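
A minimal sketch of that search in Python, operating on the raw bytes
of the file.  The dots are escaped here so that they match literal
periods rather than arbitrary characters.  The name ``find_tree_end``
is hypothetical.

```python
import re

# End-of-document marker "..." on its own line, with DOS or UNIX newlines.
END_OF_TREE = re.compile(rb"\r?\n\.\.\.\r?\n")


def find_tree_end(data: bytes, start: int = 0) -> int:
    """Return the offset of the first byte after the end-of-tree marker."""
    match = END_OF_TREE.search(data, start)
    if match is None:
        raise ValueError("no YAML end-of-document marker found")
    return match.end()
```

Everything from the returned offset onward is either padding, a block,
or the end of the file.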

Though not required, the tree should be followed by some unused space
to allow for the tree to be updated and increased in size without
performing an insertion operation in the file.  It also may be
desirable to align the start of the first block to a filesystem block
boundary.  This empty space may be filled with any content (as long as
it doesn't contain the ``block_magic_token`` described in
:ref:`block`).  It is recommended that the content is made up of space
characters (``0x20``) so it appears as empty space when viewing the
file.

.. _block:

Blocks
------

Following the tree and some empty space, or immediately following the
header, there are zero or more binary blocks.

Blocks represent a contiguous chunk of binary data and nothing more.
Information about how to interpret the block, such as the data type or
array shape, is stored entirely in ``ndarray`` structures in the tree,
as described in :ref:`ndarray <core/ndarray-1.0.0>`.  This allows for a very
flexible type system on top of a very simple approach to memory management
within the file.  It also allows for new extensions to ASDF that might
interpret the raw binary data in ways that are yet to be defined.

There may be an arbitrary amount of unused space between the end of
the tree and the first block.  To find the beginning of the first
block, ASDF parsers should search from the end of the tree for the
first occurrence of the ``block_magic_token``.  If the file contains
no tree, the first block must begin immediately after the header with
no padding.
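
The two cases in the paragraph above can be sketched as follows.  The
function name and its parameters are hypothetical; ``header_end`` is
assumed to be the offset just past the header line, and ``tree_end``
the offset just past the end-of-tree marker, or ``None`` when the file
has no tree.

```python
from typing import Optional

BLOCK_MAGIC = b"\xd3BLK"  # the 4-byte block_magic_token


def find_first_block(
    data: bytes, header_end: int, tree_end: Optional[int] = None
) -> Optional[int]:
    """Return the offset of the first block, or None if there are no blocks."""
    if tree_end is None:
        # No tree: the block must start right after the header, unpadded.
        return header_end if data.startswith(BLOCK_MAGIC, header_end) else None
    # With a tree: skip any padding by searching for the magic token.
    offset = data.find(BLOCK_MAGIC, tree_end)
    return None if offset == -1 else offset
```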

.. _block-header:

Block header
^^^^^^^^^^^^

Each block begins with the following header:

- ``block_magic_token`` (4 bytes): Indicates the start of the block.
  This allows the file to contain some unused space in which to grow
  the tree, and to perform consistency checks when jumping from one
  block to the next.  It is made up of the following four bytes:

  - in hexadecimal: d3, 42, 4c, 4b
  - as a byte string: ``"\323BLK"`` (``\323`` is the octal escape for 0xd3)

- ``header_size`` (16-bit unsigned integer, big-endian): Indicates the
  size of the remainder of the header (not including the length of the
  ``header_size`` entry itself or the ``block_magic_token``), in bytes.
  It is stored explicitly in the header itself so that the header may be
  enlarged in a future version of the ASDF standard while retaining
  backward compatibility.  Importantly, ASDF parsers should not assume
  a fixed size of the header, but should obey the ``header_size``
  defined in the file.  In ASDF version 0.1, this should be at least
  48, but may be larger, for example to align the beginning of the
  block content with a file system block boundary.

- ``flags`` (32-bit unsigned integer, big-endian): A bit field
  containing flags (described below).

- ``compression`` (4-byte byte string): The name of the compression
  algorithm, if any.  Should be ``\0\0\0\0`` to indicate no
  compression.  See :ref:`compression` for valid values.

- ``allocated_size`` (64-bit unsigned integer, big-endian): The amount
  of space allocated for the block (not including the header), in
  bytes.

- ``used_size`` (64-bit unsigned integer, big-endian): The amount of
  used space for the block on disk (not including the header), in
  bytes.

- ``data_size`` (64-bit unsigned integer, big-endian): The size of the
  block when decoded, in bytes.  If ``compression`` is all zeros
  (indicating no compression), it **must** be equal to ``used_size``.
  If compression is being used, this is the size of the decoded block
  data.

- ``checksum`` (16-byte string): An optional MD5 checksum of the used
  data in the block.  The special value of all zeros indicates that no
  checksum verification should be performed.
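
The field list above maps directly onto fixed-width big-endian
records.  The following is a minimal sketch using Python's ``struct``
module; the names ``BlockHeader`` and ``read_block_header`` are
hypothetical, and ``content_offset`` is a convenience value computed
from ``header_size`` rather than a field stored in the file.

```python
import struct
from typing import NamedTuple

BLOCK_MAGIC = b"\xd3BLK"
PREFIX = struct.Struct(">4sH")        # block_magic_token, header_size
FIELDS = struct.Struct(">I4sQQQ16s")  # the 48 bytes of fields that follow


class BlockHeader(NamedTuple):
    flags: int
    compression: bytes
    allocated_size: int
    used_size: int
    data_size: int
    checksum: bytes
    content_offset: int  # where the block content begins in the file


def read_block_header(data: bytes, offset: int) -> BlockHeader:
    """Decode the block header that starts at ``offset``."""
    magic, header_size = PREFIX.unpack_from(data, offset)
    if magic != BLOCK_MAGIC:
        raise ValueError("block magic token not found at offset")
    if header_size < FIELDS.size:
        raise ValueError("header_size is smaller than the known fields")
    fields = FIELDS.unpack_from(data, offset + PREFIX.size)
    # Obey header_size rather than assuming a fixed header length.
    content_offset = offset + PREFIX.size + header_size
    return BlockHeader(*fields, content_offset)
```

Computing ``content_offset`` from ``header_size`` means a header that
was enlarged for alignment is still read correctly.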

Flags
^^^^^

The following bit flags are understood in the ``flags`` field:

- ``STREAMED`` (0x1): When set, the block is in streaming mode, and it
  extends to the end of the file.  When set, the ``allocated_size``,
  ``used_size`` and ``data_size`` fields are ignored.  By necessity,
  any block with the ``STREAMED`` bit set must be the last block in
  the file.
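
As a small illustration, a reader might decide how many content bytes
to expect like this.  The function name and parameters are
hypothetical; ``content_offset`` is assumed to be the offset of the
first content byte and ``file_size`` the total length of the file.

```python
STREAMED = 0x1  # bit flag from the ``flags`` field


def content_length(
    flags: int, used_size: int, content_offset: int, file_size: int
) -> int:
    """Return the number of meaningful content bytes in a block."""
    if flags & STREAMED:
        # The size fields are ignored; the block runs to the end of the file.
        return file_size - content_offset
    return used_size
```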

.. _compression:

Compression
^^^^^^^^^^^

Currently, two block compression types are supported:

- ``zlib``: The zlib lossless compression algorithm.  It is widely
  used, patent-unencumbered, and has an implementation released under
  a permissive license in `zlib <http://www.zlib.net/>`__.

- ``bzp2``: The bzip2 lossless compression algorithm.  It is widely
  used, assumed to be patent-unencumbered, and has an implementation
  released under a permissive license in the `bzip2 library
  <http://www.bzip.org/>`__.
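
A minimal sketch of decoding by compression code, using the ``zlib``
and ``bz2`` modules from the Python standard library.  The names
``DECODERS`` and ``decode_block`` are hypothetical.

```python
import bz2
import zlib

# Map each 4-byte compression code to a decoder.
# Four zero bytes mean the data is stored uncompressed.
DECODERS = {
    b"\0\0\0\0": bytes,
    b"zlib": zlib.decompress,
    b"bzp2": bz2.decompress,
}


def decode_block(compression: bytes, used: bytes, data_size: int) -> bytes:
    """Decode the used bytes of a block and check the decoded length."""
    try:
        decoder = DECODERS[compression]
    except KeyError:
        raise ValueError(f"unsupported compression code {compression!r}") from None
    decoded = decoder(used)
    if len(decoded) != data_size:
        raise ValueError("decoded length does not match data_size")
    return decoded
```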

Block content
^^^^^^^^^^^^^

Immediately following the block header, there are exactly
``used_size`` bytes of meaningful data, followed by
``allocated_size - used_size`` bytes of unused data.  The exact
content of the unused data is not enforced.  The ability to have gaps
of unused space allows an ASDF writer to reduce the number of disk
operations when updating the file.
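
A sketch of extracting the meaningful bytes of an uncompressed block
and verifying the optional checksum.  The function name and parameters
are hypothetical, and the sketch deliberately covers only the
uncompressed case, where the bytes on disk and the decoded data are
the same.

```python
import hashlib

NO_CHECKSUM = b"\0" * 16  # all zeros: skip verification


def read_used_data(
    data: bytes, content_offset: int, used_size: int, checksum: bytes
) -> bytes:
    """Return the used bytes of a block, verifying the MD5 sum if present."""
    used = data[content_offset:content_offset + used_size]
    if len(used) != used_size:
        raise ValueError("file is truncated inside the block content")
    if checksum != NO_CHECKSUM and hashlib.md5(used).digest() != checksum:
        raise ValueError("block checksum mismatch")
    return used
```

Any bytes between ``content_offset + used_size`` and the end of the
allocated space are padding and are simply not read.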

.. _block-index:

Block index
-----------

The block index allows for fast random access to each of the blocks in
the file.  It is completely optional: if not present, libraries may
"skip along" the block headers to find the location of each block in
the file.  Libraries should detect invalid or obsolete block indices,
ignore them, and regenerate the index by skipping along the block
headers.
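
"Skipping along" can be sketched as a loop that reads just enough of
each block header to jump to the next one.  The function name
``rebuild_block_index`` is hypothetical, and ``first_block`` is assumed
to be the already-known offset of the first block.

```python
import struct

BLOCK_MAGIC = b"\xd3BLK"
STREAMED = 0x1


def rebuild_block_index(data: bytes, first_block: int) -> list[int]:
    """Collect the offset of every block by walking the block headers."""
    offsets = []
    offset = first_block
    while data.startswith(BLOCK_MAGIC, offset):
        offsets.append(offset)
        (header_size,) = struct.unpack_from(">H", data, offset + 4)
        flags, _compression, allocated_size = struct.unpack_from(
            ">I4sQ", data, offset + 6
        )
        if flags & STREAMED:
            break  # a streamed block is always the last block
        # Skip magic token (4), header_size (2), rest of header, and content.
        offset += 6 + header_size + allocated_size
    return offsets
```

The walk stops at the first position that does not hold a block magic
token, which is where a block index, trailing padding, or the end of
the file begins.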

The block index appears at the end of the file to make streaming an
ASDF file possible without needing to determine the size of all blocks
up front, which is non-trivial in the case of compression.  It also
allows for updating the index without an expensive insertion operation
earlier in the file.

The block index must appear immediately after the allocated space for
the last block in the file.  If the last block is a streaming block,
no block index may be present -- the streaming block feature and block
index are incompatible.

If no blocks are present in the file, the block index must also be
absent.

The block index consists of a header, followed by a YAML document
containing the indices of each block in the file.

The header must be exactly::

    #ASDF BLOCK INDEX

followed by a DOS or UNIX newline.

Following the header is a YAML document (in YAML version 1.1, like the
:ref:`tree`), containing a list of integers indicating the byte offset
of each block in the file.

The following is an example block index::

    #ASDF BLOCK INDEX
    %YAML 1.1
    --- [2043, 16340]
    ...

The offsets in the block index must be monotonically increasing, and
must, by definition, be at least "block header size" apart.  If they
were allowed to appear in any order, it would be impossible to rebuild
the index by skipping blocks were the index to become damaged or
out-of-sync.
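
A block index in the form of the example above could be produced with
a sketch like the following.  The function name ``format_block_index``
is hypothetical; it emits UNIX newlines and a single-line YAML flow
sequence, which is one of several valid ways to write the list.

```python
def format_block_index(offsets: list[int]) -> bytes:
    """Serialize block offsets as an ASDF block index."""
    if not offsets:
        # With no blocks, the block index must be absent.
        raise ValueError("a file without blocks has no block index")
    if any(later <= earlier for earlier, later in zip(offsets, offsets[1:])):
        raise ValueError("block offsets must be strictly increasing")
    listing = ", ".join(str(offset) for offset in offsets)
    text = f"#ASDF BLOCK INDEX\n%YAML 1.1\n--- [{listing}]\n...\n"
    return text.encode("utf-8")
```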

Additional zero-valued bytes may appear after the block index.  This
is mainly to support operating systems, such as Microsoft Windows,
where truncating the file may not be easily possible.

Implementation recommendations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Libraries should look for the block index by reading backward from the
end of the file.

Libraries should be conservative about what is an acceptable index,
since addressing incorrect parts of the file could result in undefined
behavior.

The following checks are recommended:

- Always ensure that the first offset entry matches the location of
  the first block in the file.  This will catch the common case
  where the YAML tree was edited by hand without updating the index.
  If they do not match, discard the entire block index.

- Ensure that the last entry in the index refers to a block magic
  token, and that the end of the allocated space in the last block is
  immediately followed by the block index.  If either check fails,
  discard the entire block index.

- When using an offset in the block index, always ensure that the
  block magic token exists at that offset before reading data.
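
Part of these checks can be sketched as below.  The function name
``index_is_usable`` is hypothetical.  It covers the first-offset check
and the magic-token check for every entry (including the last), but
not the check that the index immediately follows the allocated space
of the last block.

```python
BLOCK_MAGIC = b"\xd3BLK"


def index_is_usable(data: bytes, offsets: list[int], first_block: int) -> bool:
    """Return True if the block index passes basic consistency checks."""
    # The first entry must point at the first block actually in the file.
    if not offsets or offsets[0] != first_block:
        return False
    # Every entry must point at a block magic token.
    return all(data.startswith(BLOCK_MAGIC, offset) for offset in offsets)
```

When this returns ``False``, a conservative reader would fall back to
skipping along the block headers instead of trusting any entry.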

.. _exploded:

Exploded form
-------------

Exploded form expands a self-contained ASDF file into multiple files:

- An ASDF file containing only the header and tree, which by design is
  also a valid YAML file.

- *n* ASDF files, each containing a single block.

Exploded form is useful in the following scenarios:

- Not all text editors may handle the hybrid text and binary nature of
  the ASDF file, and therefore either can't open an ASDF file or would
  break an ASDF file upon saving.  In this scenario, a user may explode
  the ASDF file, edit the YAML portion as a pure YAML file, and
  implode the parts back together.

- Over a network protocol, such as HTTP, a client may only need to
  access some of the blocks.  While reading a subset of the file can
  be done using HTTP ``Range`` headers, not all web servers support
  this HTTP feature.  Exploded form allows each block to be requested
  directly by a specific URI.

- An ASDF writer may stream a table to disk, when the size of the table
  is not known at the outset.  Using exploded form simplifies this,
  since a standalone file containing a single table can be iteratively
  appended to without worrying about any blocks that may follow it.

Exploded form describes a convention for storing ASDF file content in
multiple files, but it does not require any additions to the file
format itself.  There is nothing indicating that an ASDF file is in
exploded form, other than the fact that some or all of its blocks come
from external files.  The exact way in which a file is exploded is up
to the library and tools implementing the standard.  In the simplest
scenario, to explode a file, each :ref:`ndarray source property
<core/ndarray-1.0.0>` in the tree is converted from a local block reference
into a relative URI.