1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564
|
<!-- =defdoc funtext funtext n -->
<HTML>
<HEAD>
<TITLE>Column-based Text Files</TITLE>
</HEAD>
<BODY>
<!-- =section funtext NAME -->
<H2><A NAME="funtext">Funtext: Support for Column-based Text Files</A></H2>
<!-- =section funtext SYNOPSIS -->
<H2>Summary</H2>
<P>
This document contains a summary of the options for processing column-based
text files.
<!-- =section funtext DESCRIPTION -->
<H2>Description</H2>
<P>
Funtools will automatically sense and process "standard"
column-based text files as if they were FITS binary tables without any
change in Funtools syntax. In particular, you can filter text files
using the same syntax as FITS binary tables:
<pre>
fundisp foo.txt'[cir 512 512 .1]'
fundisp -T foo.txt > foo.rdb
funtable foo.txt'[pha=1:10,cir 512 512 10]' foo.fits
</pre>
<P>
The first example displays a filtered selection of a text file. The
second example converts a text file to an RDB file. The third example
converts a filtered selection of a text file to a FITS binary table.
<P>
Text files can also be used in Funtools image programs. In this case,
you must provide binning parameters (as with raw event files), using
the bincols keyword specifier:
<pre>
bincols=([xname[:tlmin[:tlmax:[binsiz]]]],[yname[:tlmin[:tlmax[:binsiz]]]
</pre>
For example:
<pre>
funcnts foo'[bincols=(x:1024,y:1024)]' "ann 512 512 0 10 n=10"
</pre>
<H2>Standard Text Files</H2>
<P>
Standard text files have the following characteristics:
<ul>
<li> Optional comment lines start with #
<li> Optional blank lines are considered comments
<li> An optional table header consists of the following (in order):
<ul>
<li> a single line of alpha-numeric column names
<li> an optional line of unit strings containing the same number of cols
<li> an optional line of dashes containing the same number of cols
</ul>
<li> Data lines follow the optional header and (for the present) consist of
the same number of columns as the header.
<li> Standard delimiters such as space, tab, comma, semi-colon, and bar.
</ul>
<P>
Examples:
<pre>
# rdb file
foo1 foo2 foo3 foos
---- ---- ---- ----
1 2.2 3 xxxx
10 20.2 30 yyyy
# multiple consecutive whitespace and dashes
foo1 foo2 foo3 foos
--- ---- ---- ----
1 2.2 3 xxxx
10 20.2 30 yyyy
# comma delims and blank lines
foo1,foo2,foo3,foos
1,2.2,3,xxxx
10,20.2,30,yyyy
# bar delims with null values
foo1|foo2|foo3|foos
1||3|xxxx
10|20.2||yyyy
# header-less data
1 2.2 3 xxxx
10 20.2 30 yyyy
</pre>
<P>
The default set of token delimiters consists of spaces, tabs, commas,
semi-colons, and vertical bars. Several parsers are used
simultaneously to analyze a line of text in different ways. One way
of analyzing a line is to allow a combination of spaces, tabs, and
commas to be squashed into a single delimiter (no null values between
consecutive delimiters). Another way is to allow tab, semi-colon, and
vertical bar delimiters to support null values, i.e. two consecutive
delimiters implies a null value (e.g. RDB file). A successful parser
is one which returns a consistent number of columns for all rows, with
each column having a consistent data type. More than one parser can
be successful. For now, it is assumed that successful parsers all
return the same tokens for a given line. (Theoretically, there are
pathological cases, which will be taken care of as needed). Bad parsers
are discarded on the fly.
<P>
If the header does not exist, then names "col1", "col2", etc. are
assigned to the columns to allow filtering. Furthermore, data types
for each column are determined by the data types found in the columns
of the first data line, and can be one of the following: string, int,
and double. Thus, all of the above examples return the following
display:
<pre>
fundisp foo'[foo1>5]'
FOO1 FOO2 FOO3 FOOS
---------- --------------------- ---------- ------------
10 20.20000000 30 yyyy
</pre>
<H2>Comments Convert to Header Params</H2>
<P>
Comments which precede data rows are converted into header parameters and
will be written out as such using funimage or funhead. Two styles of comments
are recognized:
<P>
1. FITS-style comments have an equal sign "=" between the keyword and
value and an optional slash "/" to signify a comment. The strict FITS
rules on column positions are not enforced. In addition, strings only
need to be quoted if they contain whitespace. For example, the following
are valid FITS-style comments:
<pre>
# fits0 = 100
# fits1 = /usr/local/bin
# fits2 = "/usr/local/bin /opt/local/bin"
# fits3c = /usr/local/bin /opt/local/bin /usr/bin
# fits4c = "/usr/local/bin /opt/local/bin" / path dir
</pre>
Note that the fits3c comment is not quoted and therefore its value is the
single token "/usr/local/bin" and the comment is "opt/local/bin /usr/bin".
This is different from the quoted comment in fits4c.
<P>
2. Free-form comments can have an optional colon separator between the
keyword and value. In the absence of quote, all tokens after the
keyword are part of the value, i.e. no comment is allowed. If a string
is quoted, then slash "/" after the string will signify a comment.
For example:
<pre>
# com1 /usr/local/bin
# com2 "/usr/local/bin /opt/local/bin"
# com3 /usr/local/bin /opt/local/bin /usr/bin
# com4c "/usr/local/bin /opt/local/bin" / path dir
# com11: /usr/local/bin
# com12: "/usr/local/bin /opt/local/bin"
# com13: /usr/local/bin /opt/local/bin /usr/bin
# com14c: "/usr/local/bin /opt/local/bin" / path dir
</pre>
<P>
Note that com3 and com13 are not quoted, so the whole string is part of
the value, while comz4c and com14c are quoted and have comments following
the values.
<P>
Some text files have column name and data type information in the header.
You can specify the format of column information contained in the
header using the "hcolfmt=" specification. See below for a detailed
description.
<H2>Multiple Tables in a Single File</H2>
<P>
Multiple tables are supported in a single file. If an RDB-style file
is sensed, then a ^L (vertical tab) will signify end of
table. Otherwise, an end of table is sensed when a new header (i.e.,
all alphanumeric columns) is found. (Note that this heuristic does not
work for single column tables where the column type is ASCII and the
table that follows also has only one column.) You also can specify
characters that signal an end of table condition using the <b>eot=</b>
keyword. See below for details.
<P>
You can access the nth table (starting from 1) in a multi-table file
by enclosing the table number in brackets, as with a FITS extension:
<pre>
fundisp foo'[2]'
</pre>
The above example will display the second table in the file.
(Index values start at 1 in oder to maintain logical compatibility
with FITS files, where extension numbers also start at 1).
<H2>TEXT() Specifier</H2>
<P>
As with ARRAY() and EVENTS() specifiers for raw image arrays and raw
event lists respectively, you can use TEXT() on text files to pass
key=value options to the parsers. An empty set of keywords is
equivalent to not having TEXT() at all, that is:
<pre>
fundisp foo
fundisp foo'[TEXT()]'
</pre>
are equivalent. A multi-table index number is placed before the TEXT()
specifier as the first token, when indexing into a multi-table:
fundisp foo'[2,TEXT(...)]'
<P>
The filter specification is placed after the TEXT() specifier, separated
by a comma, or in an entirely separate bracket:
<pre>
fundisp foo'[TEXT(...),circle 512 512 .1]'
fundisp foo'[2,TEXT(...)][circle 512 512 .1]'
</pre>
<H2>Text() Keyword Options</H2>
<P>
The following is a list of keywords that can be used within the TEXT()
specifier (the first three are the most important):
<DL>
<P>
<DT> delims="[delims]"
<DD>Specify token delimiters for this file. Only a single parser having these
delimiters will be used to process the file.
<pre>
fundisp foo.fits'[TEXT(delims="!")]'
fundisp foo.fits'[TEXT(delims="\t%")]'
</pre>
<P>
<DT> comchars="[comchars]"
<DD> Specify comment characters. You must include "\n" to allow blank lines.
These comment characters will be used for all standard parsers (unless delims
are also specified).
<pre>
fundisp foo.fits'[TEXT(comchars="!\n")]'
</pre>
<P>
<DT> cols="[name1:type1 ...]"
<DD> Specify names and data type of columns. This overrides header
names and/or data types in the first data row or default names and
data types for header-less tables.
<pre>
fundisp foo.fits'[TEXT(cols="x:I,y:I,pha:I,pi:I,time:D,dx:E,dy:e")]'
</pre>
<P>
If the column specifier is the only keyword, then the cols= is not
required (in analogy with EVENTS()):
<pre>
fundisp foo.fits'[TEXT(x:I,y:I,pha:I,pi:I,time:D,dx:E,dy:e)]'
</pre>
Of course, an index is allowed in this case:
<pre>
fundisp foo.fits'[2,TEXT(x:I,y:I,pha:I,pi:I,time:D,dx:E,dy:e)]'
</pre>
<P>
<DT> eot="[eot delim]"
<DD> Specify end of table string specifier for multi-table files. RDB
files support ^L. The end of table specifier is a string and the whole
string must be found alone on a line to signify EOT. For example:
<pre>
fundisp foo.fits'[TEXT(eot="END")]'
</pre>
will end the table when a line contains "END" is found. Multiple lines
are supported, so that:
<pre>
fundisp foo.fits'[TEXT(eot="END\nGAME")]'
</pre>
will end the table when a line contains "END" followed by a line
containing "GAME".
<P>
In the absence of an EOT delimiter, a new table will be sensed when a new
header (all alphanumeric columns) is found.
<P>
<DT> null1="[datatype]"
<DD> Specify data type of a single null value in row 1.
Since column data types are determined by the first row, a null value
in that row will result in an error and a request to specify names and
data types using cols=. If you only have a one null in row 1, you don't
need to specify all names and columns. Instead, use null1="type" to
specify its data type.
<P>
<DT> alen=[n]
<DD>Specify size in bytes for ASCII type columns.
FITS binary tables only support fixed length ASCII columns, so a
size value must be specified. The default is 16 bytes.
<P>
<DT> nullvalues=["true"|"false"]
<DD>Specify whether to expect null values.
Give the parsers a hint as to whether null values should be allowed. The
default is to try to determine this from the data.
<P>
<DT> whitespace=["true"|"false"]
<DD> Specify whether surrounding white space should be kept as part of
string tokens. By default surrounding white space is removed from
tokens.
<P>
<DT> header=["true"|"false"]
<DD>Specify whether to require a header. This is needed by tables
containing all string columns (and with no row containing dashes), in
order to be able to tell whether the first row is a header or part of
the data. The default is false, meaning that the first row will be
data. If a row dashes are present, the previous row is considered the
column name row.
<P>
<DT> units=["true"|"false"]
<DD>Specify whether to require a units line.
Give the parsers a hint as to whether a row specifying units should be
allowed. The default is to try to determine this from the data.
<P>
<DT> i2f=["true"|"false"]
<DD>Specify whether to allow int to float conversions.
If a column in row 1 contains an integer value, the data type for that
column will be set to int. If a subsequent row contains a float in
that same column, an error will be signaled. This flag specifies that,
instead of an error, the float should be silently truncated to
int. Usually, you will want an error to be signaled, so that you can
specify the data type using cols= (or by changing the value of
the column in row 1).
<P>
<DT> comeot=["true"|"false"|0|1|2]
<DD>Specify whether comment signifies end of table.
If comeot is 0 or false, then comments do not signify end of table and
can be interspersed with data rows. If the value is true or 1 (the
default for standard parsers), then non-blank lines (e.g. lines
beginning with '#') signify end of table but blanks are allowed
between rows. If the value is 2, then all comments, including blank
lines, signify end of table.
<P>
<DT> lazyeot=["true"|"false"]
<DD>Specify whether "lazy" end of table should be permitted (default is
true for standard formats, except rdb format where explicit ^L is required
between tables). A lazy EOT can occur when a new table starts directly
after an old one, with no special EOT delimiter. A check for this EOT
condition is begun when a given row contains all string tokens. If, in
addition, there is a mismatch between the number of tokens in the
previous row and this row, or a mismatch between the number of string
tokens in the prev row and this row, a new table is assumed to have
been started. For example:
<PRE>
ival1 sval3
----- -----
1 two
3 four
jval1 jval2 tval3
----- ----- ------
10 20 thirty
40 50 sixty
</PRE>
Here the line "jval1 ..." contains all string tokens. In addition,
the number of tokens in this line (3) differs from the number of
tokens in the previous line (2). Therefore a new table is assumed
to have started. Similarly:
<PRE>
ival1 ival2 sval3
----- ----- -----
1 2 three
4 5 six
jval1 jval2 tval3
----- ----- ------
10 20 thirty
40 50 sixty
</PRE>
Again, the line "jval1 ..." contains all string tokens. The number of
string tokens in the previous row (1) differs from the number of
tokens in the current row(3). We therefore assume a new table as been
started. This lazy EOT test is not performed if lazyeot is explicitly
set to false.
<P>
<DT> hcolfmt=[header column format]
<DD> Some text files have column name and data type information in the header.
For example, VizieR catalogs have headers containing both column names
and data types:
<PRE>
#Column e_Kmag (F6.3) ?(k_msigcom) K total magnitude uncertainty (4) [ucd=ERROR]
#Column Rflg (A3) (rd_flg) Source of JHK default mag (6) [ucd=REFER_CODE]
#Column Xflg (I1) [0,2] (gal_contam) Extended source contamination (10) [ucd=CODE_MISC]
</PRE>
while Sextractor files have headers containing column names alone:
<PRE>
# 1 X_IMAGE Object position along x [pixel]
# 2 Y_IMAGE Object position along y [pixel]
# 3 ALPHA_J2000 Right ascension of barycenter (J2000) [deg]
# 4 DELTA_J2000 Declination of barycenter (J2000) [deg]
</PRE>
The hcolfmt specification allows you to describe which header lines
contain column name and data type information. It consists of a string
defining the format of the column line, using "$col" (or "$name") to
specify placement of the column name, "$fmt" to specify placement of the
data format, and "$skip" to specify tokens to ignore. You also can
specify tokens explicitly (or, for those users familiar with how
sscanf works, you can specify scanf skip specifiers using "%*").
For example, the VizieR hcolfmt above might be specified in several ways:
<PRE>
Column $col ($fmt) # explicit specification of "Column" string
$skip $col ($fmt) # skip one token
%*s $col ($fmt) # skip one string (using scanf format)
</PRE>
while the Sextractor format might be specified using:
<PRE>
$skip $col # skip one token
%*d $col # skip one int (using scanf format)
</PRE>
You must ensure that the hcolfmt statement only senses actual column
definitions, with no false positives or negatives. For example, the
first Sextractor specification, "$skip $col", will consider any header
line containing two tokens to be a column name specifier, while the
second one, "%*d $col", requires an integer to be the first token. In
general, it is preferable to specify formats as explicitly as
possible.
<P>
Note that the VizieR-style header info is sensed automatically by the
funtools standard VizieR-like parser, using the hcolfmt "Column $col
($fmt)". There is no need for explicit use of hcolfmt in this case.
<P>
<DT> debug=["true"|"false"]
<DD>Display debugging information during parsing.
</DL>
<H2>Environment Variables</H2>
<P>
Environment variables are defined to allow many of these TEXT() values to be
set without having to include them in TEXT() every time a file is processed:
<pre>
keyword environment variable
------- --------------------
delims TEXT_DELIMS
comchars TEXT_COMCHARS
cols TEXT_COLUMNS
eot TEXT_EOT
null1 TEXT_NULL1
alen TEXT_ALEN
bincols TEXT_BINCOLS
hcolfmt TEXT_HCOLFMT
</pre>
<H2>Restrictions and Problems</H2>
<P>
As with raw event files, the '+' (copy extensions) specifier is not
supported for programs such as funtable.
<P>
String to int and int to string data conversions are allowed by the
text parsers. This is done more by force of circumstance than by
conviction: these transitions often happens with VizieR catalogs,
which we want to support fully. One consequence of allowing these
transitions is that the text parsers can get confused by columns which
contain a valid integer in the first row and then switch to a
string. Consider the following table:
<PRE>
xxx yyy zzz
---- ---- ----
111 aaa bbb
ccc 222 ddd
</PRE>
The xxx column has an integer value in row one a string in row two,
while the yyy column has the reverse. The parser will erroneously
treat the first column as having data type int:
<PRE>
fundisp foo.tab
XXX YYY ZZZ
---------- ------------ ------------
111 'aaa' 'bbb'
1667457792 '222' 'ddd'
</PRE>
while the second column is processed correctly. This situation can be avoided
in any number of ways, all of which force the data type of the first column
to be a string. For example, you can edit the file and explicitly quote the
first row of the column:
<PRE>
xxx yyy zzz
---- ---- ----
"111" aaa bbb
ccc 222 ddd
[sh] fundisp foo.tab
XXX YYY ZZZ
------------ ------------ ------------
'111' 'aaa' 'bbb'
'ccc' '222' 'ddd'
</PRE>
You can edit the file and explicitly set the data type of the first column:
<PRE>
xxx:3A yyy zzz
------ ---- ----
111 aaa bbb
ccc 222 ddd
[sh] fundisp foo.tab
XXX YYY ZZZ
------------ ------------ ------------
'111' 'aaa' 'bbb'
'ccc' '222' 'ddd'
</PRE>
You also can explicitly set the column names and data types of all columns,
without editing the file:
<PRE>
[sh] fundisp foo.tab'[TEXT(xxx:3A,yyy:3A,zzz:3a)]'
XXX YYY ZZZ
------------ ------------ ------------
'111' 'aaa' 'bbb'
'ccc' '222' 'ddd'
</PRE>
The issue of data type transitions (which to allow and which to disallow)
is still under discussion.
<!-- =section funtext SEE ALSO -->
<!-- =text See funtools(n) for a list of Funtools help pages -->
<!-- =stop -->
<P>
<A HREF="./help.html">Go to Funtools Help Index</A>
<H5>Last updated: August 3, 2007</H5>
</BODY>
</HTML>
|