1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155
|
<html>
<head>
<title>
Variable-Length Datatypes in HDF5
</title>
<STYLE TYPE="text/css">
P { text-indent: 2em}
P.item { margin-left: 2em; text-indent: -2em}
P.item2 { margin-left: 2em; text-indent: 2em}
TABLE.format { border:solid; border-collapse:collapse; caption-side:top; text-align:center; width:80%;}
TABLE.format TH { border:ridge; padding:4px; width:25%;}
TABLE.format TD { border:ridge; padding:4px; }
TABLE.format CAPTION { font-weight:bold; font-size:larger;}
TABLE.note {border:none; text-align:right; width:80%;}
TABLE.desc { border:solid; border-collapse:collapse; caption-size:top; text-align:left; width:80%;}
TABLE.desc TR { vertical-align:top;}
TABLE.desc TH { border-style:ridge; font-size:larger; padding:4px; text-decoration:underline;}
TABLE.desc TD { border-style:ridge; padding:4px; }
TABLE.desc CAPTION { font-weight:bold; font-size:larger;}
TABLE.list { border:none; }
TABLE.list TR { vertical-align:top;}
TABLE.list TH { border:none; text-decoration:underline;}
TABLE.list TD { border:none; }
</STYLE>
<link href="../ed_styles/FormatElect.css" rel="stylesheet" type="text/css">
</head>
<body bgcolor="#FFFFFF">
<!-- #BeginLibraryItem "/ed_libs/styles_Format.lbi" -->
<!--
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Copyright by The HDF Group. *
* Copyright by the Board of Trustees of the University of Illinois. *
* All rights reserved. *
* *
* This file is part of HDF5. The full HDF5 copyright notice, including *
* terms governing use, modification, and redistribution, is contained in *
* the files COPYING and Copyright.html. COPYING can be found at the root *
* of the source code distribution tree; Copyright.html can be found at the *
* root level of an installed copy of the electronic HDF5 document set and *
* is linked from the top-level documents page. It can also be found at *
* http://www.hdfgroup.org/HDF5/doc/Copyright.html. If you do not have *
* access to either file, you may request a copy from help@hdfgroup.org. *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
-->
<!-- #EndLibraryItem --><H3>Introduction</H3>
<P>Variable-length (VL) datatypes have a great deal of flexibility, but can
be over- or mis-used. VL datatypes are ideal at capturing the notion
that elements in an HDF5 dataset (or attribute) can have different
amounts of information (VL strings are the canonical example),
but they have some drawbacks that this document attempts
to address.
</P>
<H3>Background</H3>
<P>Because fast random access to dataset elements requires that each
element be a fixed size, the information stored for VL datatype elements
is actually information to locate the VL information, not
the information itself.
</P>
<H3>When to use VL datatypes</H3>
<P>VL datatypes are designed allow the amount of data stored in each
element of a dataset to vary. This change could be
over time as new values, with different lengths, were written to the
element. Or, the change can be over "space" - the dataset's space,
with each element in the dataset having the same fundamental type, but
different lengths. "Ragged arrays" are the classic example of elements
that change over the "space" of the dataset. If the elements of a
dataset are not going to change over "space" or time, a VL datatype
should probably not be used.
</P>
<H3>Access Time Penalty</H3>
<P>Accessing VL information requires reading the element in the file, then
using that element's location information to retrieve the VL
information itself.
In the worst case, this obviously doubles the number of disk accesses
required to access the VL information.
</P>
<P>However, in order to avoid this extra disk access overhead, the HDF5
library groups VL information together into larger blocks on disk and
performs I/O only on those larger blocks. Additionally, these blocks of
information are cached in memory as long as possible. For most access
patterns, this amortizes the extra disk accesses over enough pieces of
VL information to hide the extra overhead involved.
</P>
<H3>Storage Space Penalty</H3>
<P>Because VL information must be located and retrieved from another
location in the file, extra information must be stored in the file to
locate
each item of VL information (i.e. each element in a dataset or each
VL field in a compound datatype, etc.).
Currently, that extra information amounts to 32 bytes per VL item.
</P>
<P>
With some judicious re-architecting of the library and file format,
this could be reduced to 18 bytes per VL item with no loss in
functionality or additional time penalties. With some additional
effort, the space could perhaps could be pushed down as low as 8-10
bytes per VL item with no loss in functionality, but potentially a
small time penalty.
</P>
<H3>Chunking and Filters</H3>
<P>Storing data as VL information has some affects on chunked storage and
the filters that can be applied to chunked data. Because the data that
is stored in each chunk is the location to access the VL information,
the actual VL information is not broken up into chunks in the same way
as other data stored in chunks. Additionally, because the
actual VL information is not stored in the chunk, any filters which
operate on a chunk will operate on the information to
locate the VL information, not the VL information itself.
</P>
<H3>File Drivers</H3>
<P>Because the parallel I/O file drivers (MPI-I/O and MPI-posix) don't
allow objects with varying sizes to be created in the file, attemping
to create
a dataset or attribute with a VL datatype in a file managed by those
drivers will cause the creation call to fail.
</P>
<P>Additionally, using
VL datatypes and the 'multi' and 'split' file drivers may not operate
in the manner desired. The HDF5 library currently categorizes the
"blocks of VL information" stored in the file as a type of metadata,
which means that they may not be stored with the other raw data for
the file.
</P>
<H3>Rewriting</H3>
<P>When VL information in the file is re-written, the old VL information
must be released, space for the new VL information allocated and
the new VL information must be written to the file. This may cause
additional I/O accesses.
</P>
<hr>
<address>
Last modified: 9 September 2003
</address>
</body>
</html>
|