Metadata
########
EDL metadata and data compliance are what turn a regular directory hierarchy into a structure
readable by tools supporting EDL.
In order for a directory to be recognized as part of an EDL hierarchy, a ``manifest.toml`` file
must be present.
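The recognition rule above can be sketched in Python as a recursive walk that only descends into directories containing a ``manifest.toml``.
This is a minimal illustration, not a reference implementation: the function name ``iter_edl_units`` is hypothetical, the standard-library
``tomllib`` module (Python 3.11 or newer) is assumed, and skipping the whole subtree of a directory without a manifest is an assumption.

.. code-block:: python

    import tomllib
    from pathlib import Path


    def iter_edl_units(root):
        """Yield (path, unit type) for every directory that carries a manifest.toml.

        Hypothetical helper: a directory without a manifest is not treated as
        an EDL unit, and its subdirectories are not visited (an assumption).
        """
        root = Path(root)
        manifest_path = root / "manifest.toml"
        if not manifest_path.is_file():
            return
        # tomllib expects a file opened in binary mode
        with manifest_path.open("rb") as f:
            manifest = tomllib.load(f)
        yield root, manifest.get("type")
        # Visit children in a stable, sorted order
        for child in sorted(p for p in root.iterdir() if p.is_dir()):
            yield from iter_edl_units(child)

Calling ``list(iter_edl_units("path/to/collection"))`` would then list every unit below the given root together with its declared ``type``.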
Directory/Unit Names
====================
A directory name is also the name of the EDL unit, be it a collection, group or dataset.
In order to be somewhat sane and also portable to non-UNIX systems, some constraints apply
to directory naming:
* Only printable characters may be used.
* No special characters or punctuation are permitted, except for dots, hyphens, underscores and plus signs (``.-_+``).
* Names must not start or end with a dot.
* Unicode characters are permitted, but users are encouraged to stick to ASCII characters if possible.
  EDL implementations must convert from the system's native encoding to UTF-8 when reading names.
* The maximum length of a name is 255 characters.
* MS-DOS device names (such as ``AUX``) are forbidden, as Windows will not permit directories of that name.
* EDL names are handled in a case-sensitive way, yet two units in the same directory must not have the same
  name when both are converted to lowercase letters.
* It is recommended not to start unit names with a number, and encouraged to use only lowercase names.
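The naming constraints above could be checked roughly as follows. This is a sketch under stated assumptions, not part of the specification:
the function names are hypothetical, "special characters" is interpreted as anything that is not a Unicode letter, combining mark or digit
(so spaces are rejected), and the list of reserved device names beyond ``AUX`` is the commonly documented Windows set.

.. code-block:: python

    import unicodedata

    # Assumption: the usual reserved Windows device names; the text only names AUX.
    DOS_DEVICE_NAMES = (
        {"CON", "PRN", "AUX", "NUL"}
        | {f"COM{i}" for i in range(1, 10)}
        | {f"LPT{i}" for i in range(1, 10)}
    )
    ALLOWED_PUNCTUATION = set(".-_+")


    def is_valid_unit_name(name):
        """Return True if the name passes the mandatory naming rules."""
        if not name or len(name) > 255:
            return False
        if name.startswith(".") or name.endswith("."):
            return False
        if name.upper() in DOS_DEVICE_NAMES:
            return False
        for ch in name:
            if ch in ALLOWED_PUNCTUATION:
                continue
            # Accept letters (L*), combining marks (M*) and digits (N*) of any script
            if unicodedata.category(ch)[0] in "LMN":
                continue
            return False
        return True


    def find_case_collisions(names):
        """Return pairs of sibling names that are equal once lowercased."""
        seen = {}
        collisions = []
        for name in names:
            key = name.lower()
            if key in seen:
                collisions.append((seen[key], name))
            else:
                seen[key] = name
        return collisions

The recommendations (no leading digit, lowercase only) are deliberately not enforced here, since the text states them as advice rather than requirements.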
Common Metadata
===============
The ``manifest.toml`` file contains basic information about an organizational unit in EDL.
It must be a valid `TOML <https://github.com/toml-lang/toml>`_ 1.0 file and pass TOML
compliance tests.
A minimal file may look like this:
.. code-block:: toml

    format_version = "1"
    type = "group"
    collection_id = "49db9875-c0a2-4f70-8ba4-ec00a4e6be9c"
    time_created = 2020-05-08T17:23:06+02:00

The following toplevel keys can or must be present in every ``manifest.toml`` file:
``format_version``
------------------
*[required, string]* A string containing the version of the EDL specification used to create this unit and its metadata. It may be used in the
future to distinguish revisions if changes are made to EDL.
``type``
--------
*[required, string]* A string denoting the type of this organizational unit. May be one of ``collection``, ``group`` or ``dataset``, depending
on the respective unit type. Some unit types may require the presence of additional metadata.
``collection_id``
-----------------
*[required, string]* A universally unique identifier (UUID) version 4 which is unique for the collection that the respective unit is part of.
This ID is purposefully not human-readable. It is intended as a unique identifier for the given experiment, to be used
by data processing pipelines and other programs handling the data.
Within EDL, the last characters of this UUID may also be used in names for data, to make it possible to infer where the
data originated from in case a user copied it out of the EDL layout and sent it somewhere else without its metadata.
In case no collection UUID exists (yet), the ID string may be an invalid UUID consisting only of zeros.
``time_created``
----------------
*[required, datetime]* The creation time of this EDL unit, as an `RFC 3339 <https://tools.ietf.org/html/rfc3339>`_ formatted date-time with offset.
The offset must be present so that the time stays unambiguous when data is passed between group members in different time zones. For this field, TOML's native
date-time type is used.
``generator``
-------------
*[optional, string]* Name and version of the tool that generated this metadata/data/EDL unit.
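As an illustration of the common keys described above, the following sketch checks a parsed manifest and reports problems as a list of strings.
The function name ``check_common_keys`` and the wording of the messages are hypothetical, and the standard-library ``tomllib`` module
(Python 3.11 or newer) is assumed for parsing.

.. code-block:: python

    import datetime
    import tomllib
    import uuid

    UNIT_TYPES = ("collection", "group", "dataset")


    def check_common_keys(manifest):
        """Return a list of problems found in the common manifest keys."""
        problems = []
        if not isinstance(manifest.get("format_version"), str):
            problems.append("format_version must be a string")
        if manifest.get("type") not in UNIT_TYPES:
            problems.append("type must be collection, group or dataset")

        cid = manifest.get("collection_id")
        try:
            parsed = uuid.UUID(cid) if isinstance(cid, str) else None
        except ValueError:
            parsed = None
        if parsed is None:
            problems.append("collection_id must be a UUID string")
        elif parsed.int != 0 and parsed.version != 4:
            # An all-zero placeholder is allowed while no collection UUID exists
            problems.append("collection_id must be a version 4 UUID")

        created = manifest.get("time_created")
        if not isinstance(created, datetime.datetime) or created.tzinfo is None:
            problems.append("time_created must be a date-time with offset")

        if "generator" in manifest and not isinstance(manifest["generator"], str):
            problems.append("generator must be a string")
        return problems


    MINIMAL = """
    format_version = "1"
    type = "group"
    collection_id = "49db9875-c0a2-4f70-8ba4-ec00a4e6be9c"
    time_created = 2020-05-08T17:23:06+02:00
    """
    problems = check_common_keys(tomllib.loads(MINIMAL))

Because ``tomllib`` maps a TOML offset date-time to a timezone-aware ``datetime.datetime``, a missing offset shows up as ``tzinfo`` being ``None``.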
Collections
===========
Collections are the root of each EDL directory tree. Their ``type`` is ``collection``.
In addition to the common keys in their ``manifest.toml``, they may also contain the following entries:
``generator``
-------------
*[recommended, string]* Name and version of the tool that generated this data collection, if there was one.
This is usually used by tools like Syntalos.
``authors``
-----------
*[optional, array of tables]* Array of tables with the author names as string values in ``name`` keys and author
email addresses as string values in ``email`` keys. This can be useful to track who generated
the data originally, or who edited it later.
Manifest Example:
-----------------
.. code-block:: toml

    collection_id = "49db9875-c0a2-4f70-8ba4-ec00a4e6be9c"
    format_version = "1"
    generator = "Syntalos 1.0"
    time_created = 2020-05-08T17:23:06.000662+02:00
    type = "collection"

    [[authors]]
    email = "rick@c137.local"
    name = "Rick Sanchez"

    [[authors]]
    email = "morty@c137.local"
    name = "Morty Smith"

Groups
======
Groups are named containers for more groups or datasets. Their ``type`` is ``group``.
They may contain any of the common keys in their ``manifest.toml`` metadata.
Datasets
========
Datasets are EDL units which contain the actual experiment data. They are leaves in the directory
hierarchy. Their ``type`` is ``dataset``.
In addition to the common keys in their ``manifest.toml``, they may also contain the following entries:
``data``
--------
*[required, table]* This block briefly describes the data of the dataset. It may have the following keys:
``media_type``
^^^^^^^^^^^^^^
*[maybe-optional, string]* The MIME/`MediaType <https://en.wikipedia.org/wiki/Media_type>`_ of the contained data, if one is associated
with the given media. In case no media type can be determined, the ``file_type`` key becomes a required key. Either the ``media_type``
or ``file_type`` key or both must be present.
``file_type``
^^^^^^^^^^^^^
*[maybe-optional, string]* The file-type of the contained data. This is usually the file extension of the contained data without the dot,
but may be any agreed-on string to indicate a specific type of data. In case a ``media_type`` could be determined, this key becomes optional,
otherwise it is required. Either the ``media_type`` or ``file_type`` key or both must be present.
``summary``
^^^^^^^^^^^^^
*[optional, string]* This optional field can contain a human-readable description string that can provide some information about
what the files are about. Values could be for example "Videos recorded from the overview camera" or "Electrophysiology data from silicon probes".
``parts``
^^^^^^^^^
*[required, array of tables]* Array of tables with one entry per data part. Since the data is potentially very big, DAQ tools may decide to chunk
it into smaller bits to make the impact of data corruption while writing less severe and to permit data processing in smaller chunks.
This is especially common with video files.
In case data is not chunked, this array is still present, but contains only one entry. Each table entry must have a ``fname`` key with the filename
of the respective chunk as string value. The filename must be a path relative to the dataset directory (which almost always means it is the file
base name, without any path segment). An entry may also have an optional zero-based ``index`` key with an integer value, to make
the ordering of the individual chunks explicit. In case the index is not explicitly defined, data will be read in list order. Explicit indexing is
occasionally useful when whole chunks may be transparently taken out of the data analysis.
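The ordering rule for ``parts`` can be sketched as below. The helper name ``ordered_part_names`` is hypothetical, and the handling of a
mixed list (some entries with an ``index``, some without) is an assumption, since the text does not define that case: here the explicit
order is only used when every entry carries an ``index``.

.. code-block:: python

    def ordered_part_names(data):
        """Return the part filenames of a ``data`` table in reading order."""
        parts = data["parts"]
        # Use explicit indices only if every part has one (assumption);
        # otherwise fall back to the order in which the parts are listed.
        if parts and all("index" in part for part in parts):
            parts = sorted(parts, key=lambda part: part["index"])
        return [part["fname"] for part in parts]

Given the table parsed from a dataset manifest, ``ordered_part_names(manifest["data"])`` returns the chunk filenames in the order in which they should be read.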
``data_aux``
------------
*[optional, table]* This block describes auxiliary data to the primary data of this dataset. This may for example be a frame-number to timestamp mapping
file for a video file, or time-sync information files. Auxiliary data is usually so tightly coupled to its primary data that you will never want to have it
separate from the primary data in its own dataset.
A ``data_aux`` table follows the same semantics as a ``data`` table, with the same key names and permitted values.
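Since ``data`` and ``data_aux`` share the same semantics, one check can serve both. The following is a rough sketch: the function name
``check_data_table`` and the message texts are hypothetical, and only the rules stated above are tested (a type key is present, ``parts``
has at least one entry, and every part has a relative ``fname``).

.. code-block:: python

    from pathlib import PurePosixPath


    def check_data_table(table):
        """Return a list of problems found in a ``data`` or ``data_aux`` table."""
        problems = []
        if "media_type" not in table and "file_type" not in table:
            problems.append("either media_type or file_type must be present")

        parts = table.get("parts")
        if not isinstance(parts, list) or not parts:
            problems.append("parts must be a non-empty array of tables")
            return problems

        for i, part in enumerate(parts):
            fname = part.get("fname") if isinstance(part, dict) else None
            if not isinstance(fname, str) or not fname:
                problems.append(f"parts[{i}] needs a fname string")
            elif PurePosixPath(fname).is_absolute():
                problems.append(f"parts[{i}].fname must be a relative path")
        return problems

A caller would typically run it on ``manifest["data"]`` and, if present, on ``manifest["data_aux"]``.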
Manifest Example:
-----------------
.. code-block:: toml

    collection_id = "49db9875-c0a2-4f70-8ba4-ec00a4e6be9c"
    format_version = "1"
    time_created = 2020-05-08T17:23:06+02:00
    type = "dataset"

    [data]
    media_type = "video/x-matroska"

    [[data.parts]]
    fname = "video_1.mkv"
    index = 0

    [[data.parts]]
    fname = "video_2.mkv"
    index = 1

    [data_aux]
    media_type = "text/csv"

    [[data_aux.parts]]
    fname = "video_1_timestamps.csv"
    index = 0

    [[data_aux.parts]]
    fname = "video_2_timestamps.csv"
    index = 1

Custom Metadata
===============
Custom metadata follows no defined specification. Users and programs may add it arbitrarily as TOML 1.0 data to ``attributes.toml`` files which are
shipped alongside the well-defined ``manifest.toml`` files in the same directory. Usually attributes files contain additional metadata describing an
actual dataset (such as explanations for an array dataset, or additional information for an experiment run).
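A reader might load both files of a unit together, treating the attributes file as optional. The sketch below assumes Python 3.11 or
newer for ``tomllib``; the helper name ``load_unit_metadata`` is hypothetical, and returning an empty dictionary when no
``attributes.toml`` exists is a design choice of this example rather than something the text prescribes.

.. code-block:: python

    import tomllib
    from pathlib import Path


    def load_unit_metadata(unit_dir):
        """Return (manifest, attributes) for one EDL unit directory."""
        unit_dir = Path(unit_dir)
        with (unit_dir / "manifest.toml").open("rb") as f:
            manifest = tomllib.load(f)

        # Custom metadata is free-form, so it is returned as parsed, unchecked
        attributes = {}
        attributes_path = unit_dir / "attributes.toml"
        if attributes_path.is_file():
            with attributes_path.open("rb") as f:
                attributes = tomllib.load(f)
        return manifest, attributes
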
Syntalos Metadata
=================
The `Syntalos <https://github.com/bothlab/syntalos>`_ DAQ system uses the ``attributes.toml`` file of the ``collection`` root node it creates to store
additional metadata.
This behavior is restricted to the ``attributes.toml`` file of the main collection. All other attributes files are exclusively in the domain of Syntalos
modules, without interference from the main engine.
The additional metadata includes the following fields:
``machine_node``
----------------
*[required, string]* A string consisting of the recording machine's hostname followed by the operating system name and version in square brackets.
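The ``machine_node`` layout described above can be taken apart with a small regular expression. This is an illustrative sketch only:
the function name ``split_machine_node`` is hypothetical, and the pattern assumes that the bracketed operating system part is the last
thing in the string.

.. code-block:: python

    import re

    # hostname, optional whitespace, then "[<OS name and version>]" at the end
    _MACHINE_NODE = re.compile(r"^(?P<host>.+?)\s*\[(?P<os>.*)\]$")


    def split_machine_node(value):
        """Split a machine_node string into (hostname, operating system)."""
        match = _MACHINE_NODE.match(value.strip())
        if match is None:
            raise ValueError(f"unexpected machine_node format: {value!r}")
        return match["host"], match["os"]

For a value such as ``"glados [Debian 10]"`` this yields the hostname ``glados`` and the operating system string ``Debian 10``.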
``recording_length_msec``
-------------------------
*[required, number]* Full length of the recording run in milliseconds.
``subject_id``
--------------
*[optional, string]* Name of the test subject, as entered in the "Subject" form in Syntalos.
``subject_group``
-----------------
*[optional, string]* Group of the test subject.
``subject_comment``
-------------------
*[optional, string]* Experimenter comment for the test subject.
``success``
-----------
*[required, boolean]* Boolean, indicating whether the run was successful or failed.
``failure_reason``
------------------
*[optional, string]* In case ``success`` was ``false``, this field contains a string with the last error message received by the system,
along with the module that emitted it.
``modules``
-----------
*[required, array of tables]* List of all Syntalos modules that were active during the run. The ``id`` key contains the machine-readable
string ID of the respective module, while the ``name`` key contains the user-defined name that was given to the module during the run.
Example Attributes File
-----------------------
.. code-block:: toml

    machine_node = "glados [Debian 10]"
    recording_length_msec = 1078556.0
    subject_id = "TAX-010"
    success = true

    [[modules]]
    id = "camera-tis"
    name = "TIS Camera"

    [[modules]]
    id = "miniscope"
    name = "Miniscope"

    [[modules]]
    id = "canvas"
    name = "MS Canvas"

    [[modules]]
    id = "videorecorder"
    name = "Overview Recorder"

    [[modules]]
    id = "videorecorder"
    name = "Scope Recorder"

    [[modules]]
    id = "canvas"
    name = "OV Canvas"