I improved and lengthened the descriptions of netCDF and PDB binary
data file formats, and added a description of the HDF format. The
new descriptions are available by anonymous FTP from:
ftp-icf.llnl.gov: /pub/Yorick/comp-formats.txt
The "clog" file format which Yorick uses internally is described here.
I have a partial design for a better and more ecumenical file format,
so I'm not sure how long I'm going to continue supporting this clog
file concept.
4. A Generic Binary Data Description Language
---------------------------------------------
Unidata has provided a plain-text representation for netCDF files
(CDL format) which is extremely useful. The following plain-text
format can describe either a netCDF or a PDB file (as well as HDF
and the majority of one-of-a-kind binary file formats which have been
designed to store scientific data).
The PDB format provides two capabilities lacking in the netCDF
format: non-XDR primitive data formats, and definable data types
including compound data structures and additional primitive types.
The netCDF format offers two capabilities lacking in the PDB format:
history records, and variable attributes.
In a PDB file, the want of a formal provision for history records
or for variable attributes can be fulfilled by variable naming
conventions. The trick is to use special naming conventions to
associate related variables ("x-units" might be the variable used to
store the "units" attribute of the variable "x"), or to imbue a
special significance to some of the variables in the file (instances
of a history record sequence might be named "rec0000", "rec0001", and
so on). Such tricks result in data which is accessed nearly as
efficiently as with the netCDF interface to a netCDF file.
Similar tricks allow single instances of data structures, such as
FORTRAN common blocks, to be accessed in a netCDF file. However, an
array of structure instances is very difficult to handle in netCDF
format, and any file with primitive data formats other than XDR is
impossible.
Of course, the underlying simplicity of the netCDF format is really
a virtue, not a weakness. A physics code written in FORTRAN can't
really generate any data the netCDF format can't handle. In fact, in
designing a PDB format suitable for holding restart or post-processing
information for a physics code, one of the primary objectives is to
remain within the bounds of the data describable by the netCDF format.
Two considerations beyond the scope of the format of an individual
file are important. The first is robustness against unexpected
program or machine crashes when part, but not all, of the data in a
file has been written. The second is the difficulty of handling
extremely large files as a single unit -- it is much easier to deal
with a few dozen smaller files than one monster. Both problems
commonly arise with files used for storing history data, which is
generally stored one record at a time with a long pause between the
writing of one record and the next, and which may grow to a very large
size.
The usual PDB format, which places the self-descriptive information
at the end of the file, is not very robust. Unless the structure
chart, symbol table, and extras are written after each history record,
only to be overwritten by the next, a crash causes the data in the
file to be uninterpretable. The file format itself allows the
structure chart, symbol table, and extras section to precede the data,
as in a netCDF file; perhaps such a strategy should be adopted to deal
with this problem.
A very general way to deal with the problem of robustness is to
keep two files open. Whenever new data is written at the end of the
data file, its description is written at the end of the description
file. The files can be merged when they are really finished, or left
as separate components of a single whole. This two-file scheme has the
advantage that new data can be declared after some data has been
written, without sacrificing robustness in the face of program or
machine crashes. The plain-text binary data description language
defined below is designed to be usable as the format for the
description member of such a pair.
The problem of dealing with very large files is easy to handle in
the case of history data -- just produce a family of files each having
a restricted length, instead of a single giant file. Splitting
history data across several files has a major impact on the
programming interface used to access the data, but no effect on the
format of an individual file. The most important thing to notice here
is that an attractive feature of the netCDF programming interface --
that you can retrieve values of a particular variable over a range of
times with a single subroutine call -- becomes much more difficult to
implement if the data extends over several files. Another remark is
that it is wise to copy the non-record data into each file in a
history family, so that each file "makes sense" by itself.
Bearing all of these remarks in mind, here is a generic binary data
description language, hereby christened "Clog", which can describe the
contents of any PDB or netCDF file, as well as many other binary file
formats:
4A. Notation
-----
The extended Backus-Naur Form notation used in the RFC1014 XDR
standard is adopted here:
1. The characters |, (, ), [, ], and * are special.
2. Terminal symbols are strings surrounded by double quotes.
3. Non-terminal symbols are strings of non-special characters.
4. Alternative items are separated by the vertical bar character |.
5. Optional items are enclosed in square brackets [ ... ].
6. Items are grouped by enclosing them in parentheses ( ... ... ).
7. An item followed by * means zero or more occurrences of that item.
8. A non-terminal followed by : and a set of alternatives constitutes
the definition of that non-terminal.
Comments in a binary data description file begin with /* and end
with */, and are treated as whitespace. An identifier is a letter or
underscore "A-Za-z_", followed by zero or more letters, digits,
underscores, pluses, minuses, periods, or commas "A-Za-z_0-9,.+-". An
identifier may also consist of a quoted string, which is interpreted
as the characters within the quotes, recognizing the following escape
sequences:
\" double quote
\\ one backslash
\ooo an arbitrary 8-bit byte, except that \000 and any following
characters are ignored
An identifier may not be more than 1023 characters long, in its printed
form including the open and close quotes, if any.
A number is a sequence of one or more decimal digits optionally
preceded by a minus "-". A float_number is anything readable by the
standard C library "%e" format directive. All control characters and
spaces are treated as whitespace, principally to allow for any
differences in the newline character among various operating systems.
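To make the lexical rules concrete, here is a sketch in Python of the
quoted-identifier escape processing (unquote_identifier is a hypothetical
helper, not part of any existing Clog implementation):

```python
def unquote_identifier(text):
    """Decode a quoted Clog identifier (text includes the surrounding
    double quotes), recognizing the escape sequences for double quote,
    backslash, and arbitrary octal bytes.  The 1023-character length
    limit is not checked here."""
    assert text[0] == '"' and text[-1] == '"'
    out = []
    i, end = 1, len(text) - 1
    while i < end:
        c = text[i]
        if c != '\\':
            out.append(c)
            i += 1
        elif text[i + 1] in '"\\':
            out.append(text[i + 1])          # \" or \\
            i += 2
        else:
            code = int(text[i + 1:i + 4], 8)
            if code == 0:                    # \000: it and following chars ignored
                break
            out.append(chr(code))            # \ooo: an arbitrary 8-bit byte
            i += 4
    return ''.join(out)
```

For example, unquote_identifier(r'"char\040*"') yields the identifier
char * (octal 040 is a space).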
4B. Overview of language
-----
The binary data description language Clog is modeled on C variable
declaration and structure definition syntax. C declarations relate a
variable name, data type and dimension information. Clog variable
declarations must additionally specify the disk address for the
variable. Clog structure definitions are similarly extended to allow
the offset of each member to be specified. This extension makes it
easier to automatically generate a Clog description of some binary
file formats.
Following the PDB file format, Clog allows the bit-by-bit format of
the primitive data types to be specified. This greatly increases the
set of binary files describable using Clog. This same mechanism
enables new primitive data types to be declared.
Following the netCDF file format, Clog has a formal mechanism for
describing a sequence of history records. This allows natural
descriptions of an important class of binary files (including netCDF
files).
Finally, Clog contains a formal means for including new types
of descriptive information not envisioned at its inception. (This has
been a valuable feature of the PDB file format.) Separate "trial" and
"standard" extension syntax is provided. The rules for Clog
extensions are simple: No Clog extension can alter the meaning of any
previously defined part of Clog (thus, nothing like the "Major-Order"
extra block in the PDB file format is acceptable). And any Clog
extensions should be conceived as supplying supplemental information,
rather than as wholesale replacements of existing features to get
additional functionality.
4C. Basic primitive data types
-----
There are six basic primitive data types:
char an 8-bit byte
short a signed integer of at least 2 bytes
- used when small size is important
int a signed integer of at least 2 bytes
- most efficient integer type, used for boolean values
long a signed integer of at least 4 bytes
- most commonly used integer type, e.g.- an array index
float a floating point number of at least 4 bytes, range is
at least 10^(+-38), precision at least 6 decimal digits
double a floating point number usually 8 bytes, range is
at least 10^(+-38), precision at least 9 decimal digits,
but usually at least 10^(+-307) and 14 decimal digits
No data structure (compound data type) or additional primitive may
have one of these six names. Any one of these six names may be used
as a data type without any definition. All other identifiers used as
data type names must be previously defined, with the following two
exceptions:
string a string of 8-bit ASCII characters not containing '\0'
- a string is represented as a long containing the
disk address of the string; the string itself is
a long (aligned as a long) with a non-negative
count of the non-0 characters in the string (i.e.-
the result of the ANSI C strlen function), followed
by that many characters.
pointer a pointer to an array of any type
- the pointee contains the data type and dimension
information; the pointer is represented as a long
containing the disk address of the pointee
If string or pointer is used as a data type without defining it, the
default meaning is as shown. Once used, the string or pointer data
type may not be redefined. A NULL pointer or string is represented by
a disk address of -1; there is no associated pointee.
4D. Clog file layout
-----
clog_description:
"Contents Log"
basic_statement*
[ record_initializer basic_statement* record_declaration* ]
[ end_of_data ]
basic_statement:
primitive_definition
| structure_definition
| variable_declaration
| alignment_spec
| other_information
record_initializer:
record_declaration
| record_begin
end_of_data:
"+" "eod" "@" disk_address
The clog_description must begin with the QUOTED string "Contents
Log" -- if it is not quoted, the Clog lexical rules will divide it
into two tokens. Note that comments and white space may precede the
"Contents Log" token.
The order of the clog_statements is restricted by a general "definition
before use" rule, as described in detail below. Basically, this means
a data type must be defined before it can be used. If the
record_initializer is present, clog_statements before it describe
non-record variables, while clog_statements afterwards describe record
variables. Notice that all record_declaration statements, except
possibly the first, follow the clog_statements which declare the
record variables. Thus, the structure of the records in a Clog
description cannot change.
If present, the +eod statement must be the very last thing in a
Clog description of a file, even beyond any comments. The
disk_address specified is the address of the first byte beyond all of
the data in the file. This may be beyond the end of any variable
declared in the Clog for any number of reasons; the most obvious is
that there may be pointees beyond the last declared data. The entire
+eod statement from the initial "+" to the final digit of the
disk_address must not occupy more than 80 characters.
In addition to specifying a safe address at which to begin adding
data to the binary file, the +eod statement allows the entire Clog to
be appended to the end of the binary file itself to make a single,
self-descriptive package. (Note that this procedure does not damage
either a netCDF or a PDB file.) This is done as follows:
At the address specified in the +eod statement, write the entire
plain-text Clog description of the file, including the final +eod
statement. Then close and truncate the file. A program which opens
the file can scan the last 80 bytes; if it finds a +eod statement, it
can check that "Contents Log" is the first token after the specified
address, and, if so, interpret the Clog to determine the layout of the
binary data in the file.
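The scan just described might be sketched as follows (Python; find_clog
is a hypothetical helper, and for brevity it only skips leading
whitespace rather than full comments before the "Contents Log" token):

```python
import re

def find_clog(f):
    """Scan the last 80 bytes of an open binary file for a +eod statement;
    if one is found, return the plain-text Clog appended to the file,
    otherwise return None."""
    f.seek(0, 2)
    size = f.tell()
    f.seek(max(0, size - 80))
    tail = f.read()
    m = re.search(rb'\+\s*eod\s*@\s*(\d+)\s*$', tail)
    if not m:
        return None
    f.seek(int(m.group(1)))      # address of the first byte beyond the data
    clog = f.read().decode('ascii')
    # simplified check: the real rule also allows comments to precede it
    if not clog.lstrip().startswith('"Contents Log"'):
        return None
    return clog
```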
4E. Variable declaration
-----
variable_declaration:
type_name variable_name dimension_spec* ["@" disk_address]
("," variable_name dimension_spec* ["@" disk_address])*
type_name: identifier
variable_name: identifier
dimension_spec:
"[" dimension_length [dimension_name] "]"
| "[" minimum_index ":" maximum_index [dimension_name] "]"
dimension_name: identifier
dimension_length: number
minimum_index: number
maximum_index: number
disk_address: number
The disk_address is a byte address. If omitted, the default is the
next available address after all the variables previously declared.
The "next available" address may be rounded up for alignment purposes,
as discussed in more detail below.
If more than one dimension_spec is present, the slowest varying
dimension is first in the list, and the fastest varying dimension
last. This is the C convention. As in C, a multidimensional array is
best regarded as an array of arrays -- the first index specifies which
array, so the second index must vary faster.
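Concretely, the disk address of a single element follows from Horner's
rule (a sketch; element_address is a hypothetical helper):

```python
def element_address(base_address, indices, dim_lengths, elem_size):
    """Disk address of one array element; dim_lengths (and indices) are
    given slowest-varying first, as in a Clog dimension_spec list."""
    offset = 0
    for index, length in zip(indices, dim_lengths):
        assert 0 <= index < length
        offset = offset * length + index   # Horner's rule: last index fastest
    return base_address + offset * elem_size
```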
The optional dimension_name is provided for easier compatibility
with the netCDF format. Behavior is undefined if the same
dimension_name is used for dimension_spec's with different lengths.
The idea is to distinguish between dimension lengths which are
accidentally equal, and those which are equal by virtue of their
variables' meanings. If not supplied, the default dimension name is
"_%ld", where "%ld" is the decimal representation of the dimension
length.
The alternative minimum_index:maximum_index syntax is intended to
suggest the preferred range of the index values. The equivalent
dimension_length is maximum_index-minimum_index+1. This information
is far less important than the dimension_length, since the
dimension_length values (in their proper order!) specify the topology
of the array -- that is, how to find nearest neighbors along the
various dimensions.
The type_name and variable_name identifiers must be unique among
all other type_name and variable_name identifiers, respectively.
However, there are two separate name spaces, so a type_name identifier
may match a variable_name identifier without conflict. The
dimension_name identifiers, if any, form a third independent name
space.
4F. Primitive data type definition
-----
primitive_definition:
"+" "define" type_name
"[" size_value "]" "[" alignment_value "]"
[ "[" order_value "]"
["{" sign_address exponent_address exponent_size
mantissa_address mantissa_size mantissa_flag
exponent_bias "}"] ]
| "+" "define" "string" "standard"
| "+" "define" "pointer" "standard"
size_value: number
alignment_value: number
order_value:
number
| "sequential"
| "pdbpointer"
sign_address: number
exponent_address: number
exponent_size: number
mantissa_address: number
mantissa_size: number
mantissa_flag: number
exponent_bias: number
The type_name must neither have been defined nor referenced
previously.
All primitive type definitions must have a size_value (the number
of bytes one instance occupies) and an alignment_value (the largest
number by which the byte offset a structure member of this type is
guaranteed to be divisible).
The order_value, if a number, determines the byte ordering within
the size_value. If order_value is not present, the primitive type is
to be regarded as opaque. If order_value is present, but { ... } is
not present, the primitive type is an integer value. If { ... } is
present, the primitive type is a floating point value (order_value
must be present also in this case).
If the order_value is the identifier "sequential", then any
instance of this primitive requires sequential I/O; that is, portions
of arrays of this type may not be read or written. If the size_value
is 0, then the size of an instance is indeterminate. Otherwise, an
instance of the object occupies the specified size and has the
specified alignment as a structure member. The code reading or
writing the file is responsible for recognizing the name of a
sequential primitive and taking appropriate action to read or write
it.
The parameterization of a floating point format is the same as
described above for the PDB file format. The order_value is a
simplification of the general byte permutation provided by the PDB
file format. The meaning of the order value is as follows:
order_value ==> 1. The magnitude of the order_value represents
the number of bytes per "word". Within a
word, the byte order is monotone (either from
most significant to least significant or vice
versa). The size_value is thus a multiple of
the magnitude of the order_value. If the
entire object has monotone byte order, then
the magnitude of the order_value is one. In
practice, this is always the case, except for
the VAX floating point formats.
2. The sign of the order value determines the
word order. The byte order within a word is
always opposite to the word order. (Otherwise
the entire word is monotone, the word size is
one, and the word order is the byte order.)
The sign is positive if the most significant
word is first, negative if the least
significant word is first.
3. An order_value of zero is equivalent to omitting
the order_value altogether; it indicates opaque
data with the specified size and alignment.
In brief, the vast majority of numeric formats fall into one of the
following four categories:
order_value= 1 for big-endian (MSB first) machines
order_value= -1 for little-endian (LSB first) machines
order_value= 0 for opaque data
order_value= 2 for VAX floating point formats
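A sketch of undoing these orderings (to_big_endian is a hypothetical
helper; it reorders the bytes of one primitive instance into monotone
MSB-first order):

```python
def to_big_endian(data, order_value):
    """Reorder the bytes of one primitive instance into monotone
    MSB-first (big-endian) order, according to its order_value."""
    if order_value == 0:
        return bytes(data)                       # opaque data: leave alone
    word = abs(order_value)                      # bytes per "word"
    words = [bytes(data[i:i + word]) for i in range(0, len(data), word)]
    if order_value > 0:
        # most significant word first; bytes within a word are LSB-first
        return b"".join(w[::-1] for w in words)
    # least significant word first; bytes within a word are MSB-first
    return b"".join(reversed(words))
```

For a 4-byte VAX float (order_value 2), the stored bytes appear in the
order 2,1,4,3 relative to monotone MSB-first order, and the function
restores them.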
The precise order of definition of primitive types is significant if
the file contains pointers to objects of non-predefined types. In
this case, the data type of the pointee will be encoded as an ordinal
based on the order of +struct and +define definitions. The exception
is a +define with a type_name of one of the predefined types: char,
short, int, long, float, double, string, or pointer. The order of
such a +define is unimportant, provided only that it precede the first
use of that predefined type. As explained in the next section, a
+define of type long must also precede any uses of the predefined
string or pointer types.
To use the default definitions of string and pointer, you must NOT
specify them using a +define (doing so would redefine them to the
meaning specified in the +define), OR you must use the special
"standard" forms of +define. With their default meanings, the a
string and a pointer are represented as a long.
In general, +defines of the six basic primitive data types should
precede any other statements in the Clog description of a file. If
these are not defined, the default is the standard for the machine on
which the binary data file was written. Without the basic primitive
type definitions, therefore, a Clog description of a file is not
portable across different machine architectures. As an example, a
Clog describing a netCDF file (in XDR format) would begin:
+define char [1][4][1]
+define short [2][4][1]
+define int [4][4][1]
+define long [4][4][1]
+define float [4][4][1] {0 1 8 9 23 0 127}
+define double [8][4][1] {0 1 11 12 52 0 1023}
Note that, since a netCDF file does not support data structures, the
alignment_value is not really significant. For the same reason, the
definition of int is unnecessary. Furthermore, a definition of a
synonym for char would be appropriate:
+define byte [1][4][1]
4G. Compound data structure definition
-----
structure_definition:
"+" "struct" type_name "{"
full_member_definition member_definition* "}"
full_member_definition:
type_name member_name dimension_spec* ["@" byte_offset]
member_definition:
full_member_definition
| "," member_name dimension_spec* ["@" byte_offset]
member_name: identifier
The leading "+" permits immediate recognition of a structure_definition
as opposed to a variable_declaration, without the necessity for making
"struct" a reserved word in the context of a type_name.
If present, the byte_offset specifies the byte offset of the member
from the beginning of an instance of the data structure. If absent,
the byte offset is the next available byte beyond all previously
declared members; the first member has a byte_offset of 0 by default.
The "next available" byte offset is always rounded up to the nearest
multiple of the alignment value for the type_name of the member.
The alignment value of a compound structure is the largest alignment
value for any of its members (see the discussion of alignment within
data structures below). Despite the fact that the byte_offset syntax
allows it, no two members of a data structure may overlap.
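The default layout rule can be sketched as follows (default_offsets is a
hypothetical helper; each member is summarized by its total size in
bytes and its alignment value):

```python
def default_offsets(members):
    """Default byte offsets for structure members; each member is a
    (total_size, alignment) pair, in declaration order."""
    offsets, next_free = [], 0
    for size, align in members:
        offset = -(-next_free // align) * align  # round up to multiple of align
        offsets.append(offset)
        next_free = offset + size
    return offsets
```

The structure's own alignment value would then be the maximum of the
member alignments.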
The body of each structure definition has its own name space, so
the member_name need only be unique among all the member_names for
the structure currently being defined.
As usual, type_name in a member_definition must be either a
predefined primitive, or must have been previously defined.
Obviously, a structure may not contain members which are instances of
itself.
Beyond this "define before using" requirement, the precise order of
definition of structures is significant if the file contains pointers
to objects which are structures. In this case, the data type of the
pointee will be encoded as an ordinal based on the order of +struct
and +define definitions.
4H. Record definition
-----
record_begin:
"+" "record" "begin"
record_declaration:
"+" "record" "{" [time_value] "," [cycle_value] "}"
["@" disk_address]
time_value: float_number
cycle_value: number
The first occurrence of +record changes the meaning of subsequent
variable_declaration statements from declaring non-record variables to
declaring record variables. This first occurrence may actually declare
the first history record itself, or it may merely be the record_begin
marker.
In any record_declaration, the disk_address defaults to the next
available address, as for a variable declaration.
The time_value and the cycle_value need not actually represent the
time and cycle number associated with the record; they can be any
double and long value which characterizes a record. Note that the
time_value is specified only to the accuracy it is printed; if a
precise time is required, the time should be made a record variable.
The intent of time_value and cycle_value is to provide more
informative "names" for the record than merely its position in the
sequence of records. If either time_value or cycle_value is omitted
in the first history record instance declaration, it must be omitted
in all following declarations; similarly, if present in the first
declaration, it must be present in all subsequent declarations.
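To make the record grammar concrete, a small (hypothetical) record
section, with all disk addresses defaulted, might read:
+record begin
double time
double rho[100]
+record { 0.0, 0 }
+record { 1.25e-04, 50 }
Each +record declaration following the variable declarations creates one
instance of the record variables (here time and rho) at the next
available address; since both time_value and cycle_value appear in the
first declaration, both must appear in every subsequent one.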
Because of the possibility of families of time history files, it is
very difficult to realize a programming interface which allows
non-record data to be written after the writing of records has begun.
This practical difficulty is the reason for the division of the Clog
description into a non-record section, followed by a record section.
The restriction of history data to a sequence of a single type of
record, rather than allowing several interleaved sequences of records
of various types, is also deliberate. If several types of record need
to be written, several output files or file families should be used.
Note that a member of the history record structure may be a pointer
to a block of data whose size changes from record to record. For this
reason, Clog records are not necessarily laid out end-to-end in the
file, as are netCDF records. If the record addresses are random, the
efficiency of collection of a portion of the record data across
several or all records is reduced. More importantly, the Clog
description of a file will fail in practice if the storage of an
exceedingly large number of exceedingly small records is attempted.
As a rule of thumb, you're in trouble if your record length is less
than the number of bytes in the "+record" statement necessary to
declare the record (this can't be less than 8 bytes). If you are in
this category, you should strongly consider buffering several records
and writing them as an array for the sake of efficiency anyway.
4I. Additional alignment information
-----
alignment_spec:
"+" "align" alignment_type "[" alignment_value "]"
alignment_type:
"variables"
| "structs"
Some files, for example netCDF files, impose additional alignment
restrictions on variables. This can be specified using the "+align
variables" syntax in Clog. As a special case, alignment_value of 0
means that the same alignment should be applied to a variable in a
file as if it were a member of a data structure. An alignment_value of 1
means that there is no padding between variables; every variable starts
on the byte immediately following the previous variable. The
"@address" syntax in a variable declaration overrides the "+align
variables" statement. The special value 0 is the Clog default; netCDF
files would have "+align variables 4", and PDB files would have "+align
variables 1".
Some C compilers place an additional alignment restriction on
struct members which are themselves struct instances (beyond the usual
restriction that the alignment of a struct instance is the same as the
alignment of its most restrictively aligned member). Such an
additional alignment restriction may be expressed in Clog via the
"+align structs" syntax. The value 1 is the default, meaning that
there is no additional alignment restriction on struct instances.
At most "+align" statement of each type is allowed in a Clog
description. If present, these must come before any variables,
compound data structures, or records have been declared.
4J. Predefined string and pointer formats
-----
As indicated above, the type_names "string" and "pointer" have an
optional predefined meaning, which is designed to map relatively
easily into the C language "char *" and "void *" data types. In order
to do this, descriptive information unavoidably leaks from the
description of the binary data into the data itself. The following
design minimizes this leakage; nevertheless, a binary file designer
should avoid gratuitous use of indirect data types.
An instance of either string or pointer is represented on disk as a
long. Hence, no use of the default string or pointer types may
precede a +define of the long primitive. The value of that long is
interpreted as a disk address (as always, in bytes, with 0 meaning the
address of the first byte). A disk address of -1 is taken as a NULL
pointer, meaning that there is no data associated with the pointer.
In the case of a string, a character count of type long will be
found at the disk address specified in the (non-negative) pointer.
The characters of the string, if any, begin with the byte immediately
following the character count. The terminating NULL character is not
included in either the character count or the string itself.
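A sketch of this representation (Python; it assumes, as in the XDR
example above, that long was defined as 4 bytes big-endian -- a real
reader must use the file's own +define of long; append_string and
read_string are hypothetical helpers):

```python
import struct

LONG = ">i"        # assumed: "long" defined as 4 bytes, big-endian
LONG_SIZE = 4

def append_string(f, s):
    """Append a string pointee at the end of the file and return the disk
    address to store in the pointer long (-1 would represent NULL)."""
    f.seek(0, 2)
    addr = f.tell()
    pad = (-addr) % LONG_SIZE        # the count must be aligned as a long
    f.write(b"\x00" * pad)
    data = s.encode("ascii")         # must not contain '\0'
    f.write(struct.pack(LONG, len(data)) + data)
    return addr + pad

def read_string(f, addr):
    """Read back a string pointee; -1 is a NULL string."""
    if addr == -1:
        return None
    f.seek(addr)
    count, = struct.unpack(LONG, f.read(LONG_SIZE))
    return f.read(count).decode("ascii")
```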
In the case of a pointer, the (non-zero) pointer is the disk address
of a small header describing the pointee, followed by the pointee data
itself. The header is an array of long integers encoded as follows:
long type_number number representing the data type
of the pointee data:
0 char, 1 short, 2 int, 3 long,
4 float, 5 double,
6 string, 7 pointer,
>=8 for the data types defined using
+struct or +define in the order of
definition in the Clog
long n_dims number of dimensions (0 if scalar)
long[n_dims] length[n_dims] (not present if n_dims==0)
the number of elements along each
dimension, in order from slowest
varying to fastest varying dimension
<garbage> pad any pad necessary to align the
data to a disk address which would
be acceptable if it were an ordinary
variable
<type_number> data the pointee itself
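A sketch of building the header-plus-data block for a pointee
(encode_pointee is a hypothetical helper; as before, it assumes a
4-byte big-endian long):

```python
import struct

LONG = ">i"   # assumed: "long" defined as 4 bytes, big-endian

def encode_pointee(header_address, type_number, dim_lengths, data_align, payload):
    """Build the header-plus-data block for a pointee to be written at
    header_address; the pointer long in the parent data stores
    header_address itself."""
    header = struct.pack(LONG, type_number)          # ordinal data type number
    header += struct.pack(LONG, len(dim_lengths))    # n_dims (0 if scalar)
    for length in dim_lengths:                       # slowest varying first
        header += struct.pack(LONG, length)
    pad = (-(header_address + len(header))) % data_align  # align the data
    return header + b"\x00" * pad + payload
```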
4K. Generic extension syntax
-----
The preceding sections cover the basic requirements for being able
to decipher the contents of a binary data file. Of course, you can't
necessarily do anything with a bunch of numbers just because you can
read them. In general, the meaning of the numbers in a binary file
emerges only from careful documentation. The required level of
documentation is appropriate to a user's manual for the program which
wrote the file, not to the Clog description of the file.
Nevertheless, sometimes it is appropriate to carry a higher level
of meaning around with a binary data file. The Clog therefore provides
a generic syntax for such information:
other_information:
"+" public_extension [extension_id]
"{" extension_data "}" ["@" disk_address]
| "-" private_extension [extension_id]
"{" extension_data "}" ["@" disk_address]
public_extension: identifier
private_extension: identifier
extension_id: identifier
extension_data: <any sequence of tokens with balanced "{" and "}">
The notion of a public_extension is that considerable effort be
expended to ensure that the associated identifier be unique across all
implementations of the Clog. A private_extension, on the other hand,
can be used immediately and freely at a single site.
The private_extension syntax should not be used as a substitute for
the comment syntax /* ... */.
A public_extension cannot have the identifiers "struct", "define",
"history", or "eod". Furthermore, the following public_extensions are
hereby defined, in order to prevent the corresponding inevitable
private extensions:
"+" "pedigree" "{" pedigree_spec ("," pedigree_spec)* "}"
pedigree_spec:
"created_by" "=" identifier
| "creation_date" "=" identifier
| "modified_by" "=" identifier
| "modification_date" "=" identifier
| "revision" "=" number
| "archive_id" "=" identifier
| "format_version" "=" number
| identifier "=" identifier
| identifier "=" number
Blue bloods and bureaucrats demand pedigrees for their data.
"+" "attributes" [variable_name]
"{" attribute_spec (";" attribute_spec)* "}"
attribute_spec:
attribute_name ["=" attribute_value]
attribute_name: identifier
attribute_value:
number ("," number)*
| float_number ("," float_number)*
| identifier
The attribute extension handles netCDF-style attributes. The
number and float_number tokens are extended by the suffix notation
described in the netCDF User's Guide from Unidata, in the section on
CDL format. An identifier as an attribute_value covers the case of a
string valued attribute; it should normally be a quoted string. If
the variable_name is not present, the attributes apply to the whole
binary file. If present, the variable_name must specify a previously
declared variable.
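For example (a hypothetical case), a netCDF-style units attribute on a
variable x and a global title could be recorded as:
+attributes x { units = "cm" ; valid_range = 0, 100 }
+attributes { title = "shock tube problem" }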
"+" "value" variable_name "{" variable_value "}"
variable_value:
number ("," variable_value)*
| float_number ("," variable_value)*
| "{" variable_value "}"
This extension is provided in order to be able to directly
translate Unidata CDL files into Clog files. Just as the string and
pointer data types cause a leakage of descriptive information into the
data, the +value extension amounts to a leakage of data into the
descriptive information. Each level of { ... } descends one level
into a structure instance.
"+" "PDBpointer" ("variable" | "member") "{" type_name "}"
The type_name must specify an opaque data type previously defined
by +define. Two separate types should be supplied; one for variables,
which has size==0, and one for structure members, which has non-zero
size. Both are sequential data types, since any object containing a
PDB-style pointer must be read sequentially as a complete block. The
following declarations would be reasonable:
+define "char *" [0][4][sequential]
+PDBpointer variable { "char *" }
+define "char *" [4][4][sequential] /* note extra space */
+PDBpointer member { "char *" }
These definitions assume that the sizeof(void *) specified in the PDB
prim_info section was 4, and the ptr_align specified in the PDB
Alignment extra block was 4. There is no limit on the number of
different type_names which can be declared to be PDBpointer, but all
of the corresponding +define statements must be identical.
"+" "PDBcast" type_name "{" member_pair ("," member_pair)* "}"
member_pair: cast_member "," type_member
cast_member: identifier
type_member: identifier
The PDBcast extension handles the information in the PDB Casts
extra block for PDB-style derived classes. Given the clumsy nature of
the PDBpointer public_extension, this does not really work very
well...