1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585
|
% This file is part of the Stanford GraphBase (c) Stanford University 1993
@i boilerplate.w %<< legal stuff: PLEASE READ IT BEFORE MAKING ANY CHANGES!
\def\title{GB\_\,IO}
@* Introduction. This is {\sc GB\_\,IO}, the input/output module used
by all GraphBase routines to access data~files. It doesn't actually do
any output; but somehow `input/output' sounds like a more useful title
than just `input'.
All files of GraphBase data are designed to produce identical results on
almost all existing computers and operating systems. Each line of each file
contains at most 79 characters. Each character is either a blank or a
digit or an uppercase letter or a lowercase letter or a standard punctuation
mark. Blank characters at the end of each line are ``invisible''; that is,
they have no perceivable effect. Hence identical results will be obtained on
record-oriented systems that pad every line with blanks.
The data is carefully sum-checked so that defective input files have little
chance of being accepted.
@ Changes might be needed when these routines are ported to different
systems. Sections of the program that are most likely to require such changes
are listed under `system dependencies' in the index.
A validation program is provided so that installers can tell if {\sc GB\_\,IO}
is working properly. To make the test, simply run \.{test\_io}.
@(test_io.c@>=
#include "gb_io.h"
/* all users of {\sc GB\_\,IO} should include this header file */
#define exit_test(m) /* we invoke this macro if something goes wrong */\
{@+fprintf(stderr,"%s!\n(Error code = %ld)\n",m,io_errors);@+return -1;@+}
@t\2@>@/
int main()
{
@<Test the |gb_open| routine; exit if there's trouble@>;
@<Test the sample data lines; exit if there's trouble@>;
@<Test the |gb_close| routine; exit if there's trouble@>;
printf("OK, the gb_io routines seem to work!\n");
return 0;
}
@ The external variable |io_errors| mentioned in the previous section
will be set nonzero if any anomalies are detected. Errors won't occur
in normal use of GraphBase programs, so no attempt has been made to
provide a user-friendly way to decode the nonzero values that
|io_errors| might assume. Information is simply gathered in binary
form; system wizards who might need to do a bit of troubleshooting
should be able to decode |io_errors| without great pain.
@d cant_open_file 0x1 /* bit set in |io_errors| if |fopen| fails */
@d cant_close_file 0x2 /* bit set if |fclose| fails */
@d bad_first_line 0x4 /* bit set if the data file's first line isn't legit */
@d bad_second_line 0x8 /* bit set if the second line doesn't pass muster */
@d bad_third_line 0x10 /* bit set if the third line is awry */
@d bad_fourth_line 0x20 /* guess when this bit is set */
@d file_ended_prematurely 0x40 /* bit set if |fgets| fails */
@d missing_newline 0x80 /* bit set if line is too long or |'\n'| is missing */
@d wrong_number_of_lines 0x100 /* bit set if the line count is wrong */
@d wrong_checksum 0x200 /* bit set if the checksum is wrong */
@d no_file_open 0x400 /* bit set if user tries to close an unopened file */
@d bad_last_line 0x800 /* bit set if final line has incorrect form */
@ The \CEE/ code for {\sc GB\_\,IO} doesn't have a main routine; it's just a
bunch of subroutines to be incorporated into programs at a higher level
via the system loading routine. Here is the general outline of \.{gb\_io.c}:
@p
@<Header files to include@>@;
@h
@<External declarations@>@;
@<Private declarations@>@;
@<Internal functions@>@;
@<External functions@>
@ Every external variable is declared twice in this \.{CWEB} file:
once for {\sc GB\_\,IO} itself (the ``real'' declaration for storage
allocation purposes) and once in \.{gb\_io.h} (for cross-references
by {\sc GB\_\,IO} users).
@<External declarations@>=
long io_errors; /* record of anomalies noted by {\sc GB\_\,IO} routines */
@ @(gb_io.h@>=
@<Header...@>@;
extern long io_errors;
/* record of anomalies noted by {\sc GB\_\,IO} routines */
@ We will stick to standard \CEE/-type input conventions. We'll also have
occasion to use some of the standard string operations.
@<Header...@>=
#include <stdio.h>
#ifdef SYSV
#include <string.h>
#else
#include <strings.h>
#endif
@* Inputting a line. The {\sc GB\_\,IO} routines get their input from
an array called |buffer|. This array is internal to {\sc
GB\_\,IO}---its contents are hidden from user programs. We make it 81
characters long, since the data is supposed to have at most 79
characters per line, followed by newline and null.
@<Private...@>=
static char buffer[81]; /* the current line of input */
static char *cur_pos=buffer; /* the current character of interest */
static FILE *cur_file; /* current file, or |NULL| if none is open */
@ Here's a basic subroutine to fill the |buffer|. The main feature of interest
is the removal of trailing blanks. We assume that |cur_file| is open.
Notice that a line of 79 characters (followed by |'\n'|) will just fit into
the buffer, and will cause no errors. A line of 80 characters will
be split into two lines and the |missing_newline|
message will occur, because of the way |fgets| is defined. A |missing_newline|
error will also occur if the file ends in the middle of a line, or if
a null character (|'\0'|) occurs within a line.
@<Internal...@>=
static void fill_buf()
{@+register char *p;
if (!fgets(buffer,sizeof(buffer),cur_file)) {
io_errors |= file_ended_prematurely; buffer[0]=more_data=0;
}
for (p=buffer; *p; p++) ; /* advance to first null character */
if (p--==buffer || *p!='\n') {
io_errors |= missing_newline; p++;
}
while (--p>=buffer && *p==' ') ; /* move back over trailing blanks */
*++p='\n'; *++p=0; /* newline and null are always present at end of line */
cur_pos=buffer; /* get ready to read |buffer[0]| */
}
@* Checksums. Each data file has a ``magic number,'' which is defined to be
$$\biggl(\sum_l 2^l c_l\biggr) \bmod p\,.$$
Here $p$ is a large prime number, and $c_l$ denotes the internal code
corresponding to the $l$th-from-last
data character read (including newlines but not nulls).
The ``internal codes'' $c_l$ are computed in a system-independent way:
Each character |c| in the actual encoding scheme being used has a
corresponding |icode|, which is the same on all systems. For example,
the |icode| of |'0'| is zero, regardless of whether |'0'| is actually
represented in ASCII or EBCDIC or some other scheme. (We assume that
every modern computer system is capable of printing at least 95
different characters, including a blank space.)
We will accept a data file as error-free if it has the correct number of
lines and ends with the proper magic number.
@<Private...@>=
static char icode[256]; /* mapping of characters to internal codes */
static long checksum_prime=(1L<<30)-83;
/* large prime such that $2p+|unexpected_char|$ won't overflow */
static long magic; /* current checksum value */
static long line_no; /* current line number in file */
static long final_magic; /* desired final magic number */
static long tot_lines; /* total number of data lines */
static char more_data; /* is there data still waiting to be read? */
@ The |icode| mapping is defined by a single string, |imap|, such that
character |imap[k]| has |icode| value~|k|. There are 96 characters
in |imap|, namely the 94 standard visible ASCII codes plus space
and newline. If EBCDIC code is used instead of ASCII, the
cents sign \rlap{\.{\kern.05em/}}\.c should take the place of single-left-quote
\.{\char`\`}, and \.{\char5}~should take the place of\/~\.{\char`\~}.
All characters that don't appear in |imap| are given the same |icode|
value, called |unexpected_char|. Such characters should be avoided in
GraphBase files whenever possible. (If they do appear, they can still
get into a user's data, but we don't distinguish them from each other
for checksumming purposes.)
The |icode| table actually plays a dual role, because we've rigged it so that
codes 0--15 come from the characters |"0123456789ABCDEF"|. This facilitates
conversion of decimal and hexadecimal data. We can also use it for
radices higher than 16.
@d unexpected_char 127 /* default |icode| value */
@<Private...@>=
static char *imap="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\
abcdefghijklmnopqrstuvwxyz_^~&@@,;.:?!%#$+-*/|\\<=>()[]{}`'\" \n";
@ Users of {\sc GB\_\,IO} can look at the |imap|, but they can't change it.
@<External fun...@>=
char imap_chr(d)
long d;
{
return d<0 || d>strlen(imap)? '\0': imap[d];
}
@#
long imap_ord(c)
char c;
{
@<Make sure that |icode| has been initialized@>;
return (c<0||c>255)? unexpected_char: icode[c];
}
@ @(gb_io.h@>=
#define unexpected_char @t\quad@> 127
extern char imap_chr(); /* the character that maps to a given character */
extern long imap_ord(); /* the ordinal number of a given character */
@ @<Make sure that |icode| has been initialized@>=
if (!icode['1']) icode_setup();
@ @<Internal...@>=
static void icode_setup()
{@+register long k;
register char *p;
for (k=0;k<256;k++) icode[k]=unexpected_char;
for (p=imap,k=0; *p; p++,k++) icode[*p]=k;
}
@ Now we're ready to specify some external subroutines that do
input. Calling |gb_newline()| will read the next line of
data into |buffer| and update the magic number accordingly.
@(gb_io.h@>=
extern void gb_newline(); /* advance to next line of the data file */
extern long new_checksum(); /* compute change in magic number */
@ Users can compute checksums as |gb_newline| does, but they can't
change the (private) value of |magic|.
@<External f...@>=
long new_checksum(s,old_checksum)
char *s; /* a string */
long old_checksum;
{@+register long a=old_checksum;
register char*p;
for (p=s; *p; p++)
a=(a+a+imap_ord(*p)) % checksum_prime;
return a;
}
@ The magic checksum is not affected by lines that begin with \.*.
@<External f...@>=
void gb_newline()
{
if (++line_no>tot_lines) more_data=0;
if (more_data) {
fill_buf();
if (buffer[0]!='*')
magic=new_checksum(buffer,magic);
}
}
@ Another simple routine allows a user to read (but not write) the
variable |more_data|.
@(gb_io.h@>=
extern long gb_eof(); /* has the data all been read? */
@ @<External f...@>=
long gb_eof() { return !more_data; }
@* Parsing a line. The user can input characters from the buffer in several
ways. First, there's a basic |gb_char()| routine, which returns
a single character. The character is |'\n'| if the last character on the
line has already been read (and it continues to be |'\n'| until the user calls
|gb_newline|).
The current position in the line, |cur_pos|, always advances when |gb_char|
is called, unless |cur_pos| was already at the end of the line.
There's also a |gb_backup()| routine, which moves |cur_pos| one place
to the left unless it was already at the beginning.
@(gb_io.h@>=
extern char gb_char(); /* get next character of current line, or |'\n'| */
extern void gb_backup(); /* move back ready to scan a character again */
@ @<External f...@>=
char gb_char()
{
if (*cur_pos) return (*cur_pos++);
return '\n';
}
@#
void gb_backup()
{
if (cur_pos>buffer)
cur_pos--;
}
@ There are two ways to read numerical data. The first, |gb_digit(d)|,
expects to read a single character in radix~|d|, using |icode| values
to specify digits greater than~9. (Thus, for example, |'A'| represents
the hexadecimal digit for decimal~10.)
If the next character is a valid |d|-git,
|cur_pos| moves to the next character and the numerical value is returned.
Otherwise |cur_pos| stays in the same place and $-1$ is returned.
The second routine, |gb_number(d)|, reads characters and forms an
unsigned radix-|d| number until the first non-digit is encountered.
The resulting number is returned; it is zero if no digits were found.
No errors are possible with this routine, because it uses
|unsigned long| arithmetic.
@(gb_io.h@>=
extern long gb_digit(); /* |gb_digit(d)| reads a digit between 0 and |d-1| */
extern unsigned long gb_number(); /* |gb_number(d)| reads a radix-|d| number */
@ The value of |d| should be at most 127, if users want their programs to be
portable, because \CEE/ does not treat larger |char| values in a
well-defined manner. In most applications, |d| is of course either 10 or 16.
@<External f...@>=
long gb_digit(d)
char d;
{
icode[0]=d; /* make sure |'\0'| is a nondigit */
if (imap_ord(*cur_pos)<d) return icode[*cur_pos++];
return -1;
}
@#
unsigned long gb_number(d)
char d;
{@+register unsigned long a=0;
icode[0]=d; /* make sure |'\0'| is a nondigit */
while (imap_ord(*cur_pos)<d)
a=a*d+icode[*cur_pos++];
return a;
}
@ The final subroutine for fetching data is |gb_string(p,c)|, which
stores a null-terminated string into locations starting at~|p|.
The string starts at |cur_pos| and ends just before the first appearance
of character |c|. If |c=='\n'|, the string will stop at the end of the line.
If |c| doesn't appear in the buffer at or after |cur_pos|, the last character
of the string will be the |'\n'| that is always inserted at the end
of a line, unless the entire line has already been read. (If the entire
line has previously been read, the empty string is always returned.)
After the string has been copied, |cur_pos| advances past it.
In order to use this routine safely, the user should first check that
there is room to store up to 81 characters beginning at location~|p|.
A suitable place to put the result, called |str_buf|, is provided
for the user's convenience.
The location following the stored string is returned. Thus, if the
stored string has length~|l| (not counting the null character that is
stored at the end), the value returned will be |p+l+1|.
@(gb_io.h@>=
#define STR_BUF_LENGTH 160
extern char str_buf[]; /* safe place to receive output of |gb_string| */
extern char *gb_string(); /* |gb_string(p,c)| reads a string delimited by |c|
into bytes starting at |p| */
@ @d STR_BUF_LENGTH 160
@<External f...@>=
char str_buf[STR_BUF_LENGTH]; /* users can put strings here if they wish */
char *gb_string(p,c)
char *p; /* where to put the result */
char c; /* character following the string */
{
while (*cur_pos && *cur_pos!=c)
*p++=*cur_pos++;
*p++=0;
return p;
}
@ Here's how we test those routines in \.{test\_io}: The first line of test
data consists of 79 characters, beginning with 64 zeroes and ending with
`\.{123456789ABCDEF}'. The second line is completely blank. The third
and final line says `\.{Oops:(intentional mistake)}'.
@<Test the sample data lines...@>=
if (gb_number(10)!=123456789)
io_errors |= 1L<<20; /* decimal number not working */
if (gb_digit(16)!=10)
io_errors |= 1L<<21; /* we missed the \.A following the decimal number */
gb_backup();@+ gb_backup(); /* get set to read `\.{9A}' again */
if (gb_number(16)!=0x9ABCDEF)
io_errors |= 1L<<22; /* hexadecimal number not working */
gb_newline(); /* now we should be scanning a blank line */
if (gb_char()!='\n')
io_errors |= 1L<<23; /* newline not inserted at end */
if (gb_char()!='\n')
io_errors |= 1L<<24; /* newline not implied after end */
if (gb_number(60)!=0)
io_errors |= 1L<<25; /* number should stop at null character */
{@+char temp[100];
if (gb_string(temp,'\n')!=temp+1)
io_errors |= 1L<<26; /* string should be null after end of line */
gb_newline();
if (gb_string(temp,':')!=temp+5 || strcmp(temp,"Oops"))
io_errors |= 1L<<27; /* string not read properly */
}
if (io_errors)
exit_test("Sorry, it failed. Look at the error code for clues");
if (gb_digit(10)!=-1) exit_test("Digit error not detected");
if (gb_char()!=':')
io_errors |= 1L<<28; /* lost synch after |gb_string| and |gb_digit| */
if (gb_eof())
io_errors |= 1L<<29; /* premature end-of-file indication */
gb_newline();
if (!gb_eof())
io_errors |= 1L<<30; /* postmature end-of-file indication */
@* Opening a file. The call |gb_raw_open("foo")| will open file |"foo"| and
initialize the checksumming process. If the file cannot be opened,
|io_errors| will be set to |cant_open_file|, otherwise
|io_errors| will be initialized to zero.
The call |gb_open("foo")| is a stronger version of |gb_raw_open|, which
is used for standard GraphBase data files like |"words.dat"| to make
doubly sure that they have not been corrupted. It returns the current value
of |io_errors|, which will be nonzero if any problems were detected
at the beginning of the file.
@<Test the |gb_open| routine...@>=
if (gb_open("test.dat")!=0)
exit_test("Can't open test.dat");
@ @d gb_raw_open gb_r_open /* abbreviation for Procrustean external linkage */
@(gb_io.h@>=
#define gb_raw_open gb_r_open
extern void gb_raw_open(); /* open a file for GraphBase input */
extern long gb_open(); /* open a GraphBase data file; return 0 if OK */
@ @<External f...@>=
void gb_raw_open(f)
char *f;
{
@<Make sure that |icode|...@>;
@<Try to open |f|@>;
if (cur_file) {
io_errors=0;
more_data=1;
line_no=magic=0;
tot_lines=0x7fffffff; /* allow ``infinitely many'' lines */
fill_buf();
}@+else io_errors=cant_open_file;
}
@ Here's a possibly system-dependent part of the code: We try first to
open the data file by using the file name itself as the path name;
failing that, we try to prefix the file name with the name of the
standard directory for GraphBase data, if the program has been compiled
with |DATA_DIRECTORY| defined.
@^system dependencies@>
@<Try to open |f|@>=
cur_file=fopen(f,"r");
@^system dependencies@>
#ifdef DATA_DIRECTORY
if (!cur_file && (strlen(DATA_DIRECTORY)+strlen(f)<STR_BUF_LENGTH)) {
sprintf(str_buf,"%s%s",DATA_DIRECTORY,f);
cur_file=fopen(str_buf,"r");
}
#endif
@ @<External f...@>=
long gb_open(f)
char *f;
{
strncpy(file_name,f,sizeof(file_name)-1);
/* save the name for use by |gb_close| */
gb_raw_open(f);
if (cur_file) {
@<Check the first line; return if unsuccessful@>;
@<Check the second line; return if unsuccessful@>;
@<Check the third line; return if unsuccessful@>;
@<Check the fourth line; return if unsuccessful@>;
gb_newline(); /* the first line of real data is now in the buffer */
}
return io_errors;
}
@ @<Private...@>=
static char file_name[20]; /* name of the data file, without a prefix */
@ The first four lines of a typical data file should look something like this:
$$\halign{\hskip5em\.{#}\hfill\cr
* File "words.dat" from the Stanford GraphBase (C) 1993 Stanford University\cr
* A database of English five-letter words\cr
* This file may be freely copied but please do not change it in any way!\cr
* (Checksum parameters 5757,526296596)\cr}$$
We actually verify only that the first four lines of a data file named |"foo"|
begin respectively with the characters
$$\halign{\hskip5em\.{#}\hfill\cr
* File "foo"\cr
*\cr
*\cr
* (Checksum parameters $l$,$m$)\cr}$$
where $l$ and $m$ are decimal numbers. The values of $l$ and~$m$
are stored away as |tot_lines| and |final_magic|, to be matched at the
end of the file.
@<Check the first line...@>=
sprintf(str_buf,"* File \"%s\"",f);
if (strncmp(buffer,str_buf,strlen(str_buf)))
return (io_errors |= bad_first_line);
@ @<Check the second line...@>=
fill_buf();
if (*buffer!='*') return (io_errors |= bad_second_line);
@ @<Check the third line...@>=
fill_buf();
if (*buffer!='*') return (io_errors |= bad_third_line);
@ @<Check the fourth line; return if unsuccessful@>=
fill_buf();
if (strncmp(buffer,"* (Checksum parameters ",23))
return (io_errors |= bad_fourth_line);
cur_pos +=23;
tot_lines=gb_number(10);
if (gb_char()!=',')
return (io_errors |= bad_fourth_line);
final_magic=gb_number(10);
if (gb_char()!=')')
return (io_errors |= bad_fourth_line);
@* Closing a file. After all data has been input, or should have been input,
we check that the file was open and that it had the correct number of
lines, the correct magic number, and a correct final line. The
subroutine |gb_close|, like |gb_open|, returns the value of
|io_errors|, which will be nonzero if at least one problem was noticed.
@<Test the |gb_close| routine; exit if there's trouble@>=
if (gb_close()!=0)
exit_test("Bad checksum, or difficulty closing the file");
@ @<External f...@>=
long gb_close()
{
if (!cur_file)
return (io_errors |= no_file_open);
fill_buf();
sprintf(str_buf,"* End of file \"%s\"",file_name);
if (strncmp(buffer,str_buf,strlen(str_buf)))
io_errors |= bad_last_line;
more_data=buffer[0]=0;
/* now the {\sc GB\_\,IO} routines are effectively shut down */
/* we have |cur_pos=buffer| */
if (fclose(cur_file)!=0)
return (io_errors |= cant_close_file);
cur_file=NULL;
if (line_no!=tot_lines+1)
return (io_errors |= wrong_number_of_lines);
if (magic!=final_magic)
return (io_errors |= wrong_checksum);
return io_errors;
}
@ There is also a less paranoid routine, |gb_raw_close|, that
closes user-generated files. It simply closes the current file, if any,
and returns the value of the |magic| checksum.
Example: The |restore_graph| subroutine in {\sc GB\_\,SAVE} uses
|gb_raw_open| and |gb_raw_close| to provide system-independent input
that is almost as foolproof as the reading of standard GraphBase data.
@ @d gb_raw_close gb_r_close /* for Procrustean external linkage */
@(gb_io.h@>=
#define gb_raw_close gb_r_close
extern long gb_close(); /* close a GraphBase data file; return 0 if OK */
extern long gb_raw_close(); /* close file and return the checksum */
@ @<External f...@>=
long gb_raw_close()
{
if (cur_file) {
fclose(cur_file);
more_data=buffer[0]=0;
cur_pos=buffer;
cur_file=NULL;
}
return magic;
}
@* Index. Here is a list that shows where the identifiers of this program are
defined and used.
|