1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413
|
/* ------------------------------------------------------------------------
@NAME : bibtex.g
@DESCRIPTION: PCCTS-based lexer and parser for BibTeX files. (Or rather,
for the BibTeX data description language. This parser
enforces nothing about the structure and contents of
entries; that's up to higher-level processors. Thus, there's
nothing either particularly bibliographic or TeXish about
the language accepted by this parser, apart from the affinity
for curly braces.)
There are a few minor differences from the language accepted
by BibTeX itself, but these are generally improvements over
BibTeX's behaviour. See the comments in the grammar, at least
until I write a decent description of the language.
I have used Gerd Neugebauer's BibTool (yet another BibTeX
parser, along with a prettyprinter and specialized language
for a common set of bibhacks) as another check of correctness
-- there are a few screwball things that BibTeX accepts and
BibTool doesn't, so I felt justified in rejecting them as
well. In general, this parser is a little stricter than
BibTeX, but a little looser than BibTool. YMMV.
Another source of inspiration is Nelson Beebe's bibclean, or
rather Beebe's article describing bibclean (from TUGboat
vol. 14 no. 4; also included with the bibclean distribution).
The product of the parser is an abstract syntax tree that can
be traversed to be printed in a simple form (see
print_entry() in bibparse.c) or perhaps transformed to a
format more convenient for higher-level languages (see my
Text::BibTeX Perl module for an example).
Whole files may be parsed by entering the parser at `bibfile';
in this case, the parser really returns a forest (list of
ASTs, one per entry). Alternately, you can enter the parser
at `entry', which reads and parses a single entry.
@GLOBALS : the usual DLG and ANTLR cruft
@CALLS :
@CREATED : first attempt: May 1996, Greg Ward
second attempt (complete rewrite): July 25-28 1996, Greg Ward
@MODIFIED : Sep 1996, GPW: changed to generate an AST rather than print
out each entry as it's encountered
Jan 1997, GPW: redid the above, because it was lost when
my !%&$#!@ computer was stolen
Jun 1997, GPW: greatly simplified the lexer, and added handling
of %-comments, @comment and @preamble entries,
and proper scanning of between-entry junk
@VERSION : $Id: bibtex.g 640 1999-11-29 01:13:10Z greg $
@COPYRIGHT : Copyright (c) 1996-99 by Gregory P. Ward. All rights reserved.
This file is part of the btparse library. This library is
free software; you can redistribute it and/or modify it under
the terms of the GNU Library General Public License as
published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
-------------------------------------------------------------------------- */
#header
<<
#define ZZCOL
#define USER_ZZSYN
#include "config.h"
#include "btparse.h"
#include "attrib.h"
#include "lex_auxiliary.h"
#include "error.h"
#include "my_dmalloc.h"
extern char * InputFilename; /* for zzcr_ast call in pccts/ast.c */
>>
/*
* The lexer has three modes -- START (between entries, hence it's what
* we're in initially), LEX_ENTRY (entered once we see an '@' at
* top-level), and LEX_STRING (for scanning quoted strings). Note that all
* the functions called from lexer actions can be found in lex_auxiliary.c.
*
* The START mode just looks for '@', discards comments and whitespace,
* counts lines, and keeps track of any other junk. The "keeping track"
* just consists of counting the number of junk characters, which is then
* reported at the next '@' sign. This will hopefully let users clean up
* "old style" implicit comments, and possibly catch some legitimate errors
* in their files (eg. a complete entry that's missing an '@').
*/
#token AT "\@" << at_sign (); >>
#token "\n" << newline (); >>
#token COMMENT "\%~[\n]*\n" << comment (); >>
#token "[\ \r\t]+" << zzskip (); >>
#token "~[\@\n\ \r\t]+"<< toplevel_junk (); >>
#lexclass LEX_ENTRY
/*
* The LEX_ENTRY mode is where most of the interesting stuff is -- these
* tokens control most of the syntax of BibTeX. First, we duplicate most
* of the START lexer, in order to handle newlines, comments, and
* whitespace.
*
* Next comes a "number", which is trivial. This is needed because a
* BibTeX simple value may be an unquoted digit string; it has to precede
* the definition of "name" tokens, because otherwise a digit string would
* be a legitimate "name", which would cause an ambiguity inside entries
* ("is this a macro or a number?")
*
* Then comes the regexp for a BibTeX "name", which is used for entry
* types, entry keys, field names, and macro names. This is basically the
* same as BibTeX's definition of such "names", with two differences. The
* key, fundamental difference is that I have defined names by inclusion
* rather than exclusion: this regex lists all characters allowed in a
* type/key/field name/macro name, rather than listing those characters not
* allowed (as the BibTeX documentation does). The trivial difference is
* that I have disallowed a few extra characters: @ \ ~. Allowing @ could
* cause confusing BibTeX syntax, and allowing \ or ~ can cause bogus TeX
* code: try putting "\cite{foo\bar}" in your LaTeX document and see what
* happens! I'm also rather skeptical about some of the more exotic
* punctuation characters being allowed, but since people have been using
* BibTeX's definition of "names" for a decade or so now, I guess we're
* stuck with it. I could always amend name() to warn about any exotic
* punctuation that offends me, but that should be an option -- and I don't
* have a mechanism for user selectable warnings yet, so it'll have to
* wait.
*
* Also note that defining "number" ahead of "name" precludes a string of
* digits from being a name. This is usually a good thing; we don't want
* to accept digit strings as article types or field names (BibTeX
* doesn't). However -- dubious as it may seem -- digit strings are
* legitimate entry keys, so we should accept them there. This is handled
* by the grammar; see the `contents' rule below.
*
* Finally, it should be noted that BibTeX does not seem to apply the same
* lexical rules to entry types, entry keys, and field names -- so perhaps
* doing so here is not such a great idea. One immediate manifestation of
* this is that my grammar in its unassisted state would accept a field
* name with leading digits; BibTeX doesn't accept this. I correct this
* with the check_field_name() function, called from the `field' rule in
* the grammar and defined in parse_auxiliary.c.
*/
#token "\n" << newline (); >>
#token COMMENT "\%~[\n]*\n" << comment (); >>
#token "[\ \r\t]+" << zzskip (); >>
#token NUMBER "[0-9]+"
#token NAME "[a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+"
<< name (); >>
/*
* Now come the (apparently) easy tokens, i.e. punctuation. There are a
* number of tricky bits here, though. First, '{' can have two very
* different meanings: at top-level, it's an entry delimiter, and inside an
* entry it's a string delimiter. This is handled (in lbrace()) by keeping
* track of the "entry state" (top-level, after '@', after type, in
* comment, or in entry) and using that to determine what to do on a '{'.
* If we're in an entry, lbrace() will switch to the string lexer by
* calling start_string(); if we're immediately after an entry type token
* (which is just a name following a top-level '@'), then we force the
* current token to ENTRY_OPEN, so that '{' and '(' appear identical to the
* parser. (This works because the scanner generated by DLG just happens
* to assign the token number first, and then executes the action.)
* Anywhere else (ie. at top level or immediately after an '@', we print a
* warning and leave the token as LBRACE, which will cause a syntax error
* (because LBRACE is not used anywhere in the grammar).
*
* '(' has some similarities to '{', but it's different enough that it
* has its own function. In particular, it may be an entry opener just
* like '{', but in one particular case it may be a string opener. That
* particular case is where it follows '@' and 'comment'; in that case,
* lparen() will call start_string() to enter the string lexer.
*
* The other delimiter characters are easier, but still warrant an
* explanation. '}' should only occur inside an entry, and if found there
* the token is forced to ENTRY_CLOSER; anywhere else, a warning is printed
* and the parser should find a syntax error. ')' should only occur inside
* an entry, and likewise will trigger a warning if seen elsewhere.
* (String-closing '}' and ')' are handled by the string lexer, below.)
*
* The other punctuation characters are trivial. Note that a double quote
* can start a string anywhere (except at top-level!), but if it occurs in
* a weird place a syntax error will eventually occur.
*/
#token LBRACE "\{" << lbrace (); >>
#token RBRACE "\}" << rbrace (); >>
#token ENTRY_OPEN "\(" << lparen (); >>
#token ENTRY_CLOSE "\)" << rparen (); >>
#token EQUALS "="
#token HASH "\#"
#token COMMA ","
#token "\"" << start_string ('"'); >>
#lexclass LEX_STRING
/*
* Here's a reasonably decent attempt at lexing BibTeX strings. There are
* a couple of sneaky tricks going on here that aren't strictly necessary,
* but can make the user's life a lot easier.
*
* First, here's what a simple and straightforward BibTeX string lexer
* would do:
* - keep track of brace-depth by incrementing/decrementing a counter
* whenever it sees `{' or `}'
* - if the string was started with a `{' and it sees a `}' which
* brings the brace-depth to 0, end the string
* - if the string was started with a `"' and it sees another `"' at
* brace-depth 0, end the string
* - any other characters are left untouched and become part of the
* string
*
* (Note that the simple act of counting braces makes this lexer
* non-regular -- there's a bit more going on here than you might
* think from reading the regexps. So sue me.)
*
* The first, most obvious refinement to this is to check for newlines
* and other whitespace -- we should convert either one to a single
* space (to simplify future processing), as well as increment zzline on
* newline. Note that we don't do any collapsing of whitespace yet --
* newlines surrounded by spaces make that rather tricky to handle
* properly in the lexer (because newlines are handled separately, in
* order to increment zzline), so I put it off to a later stage. (That
* also gives us the flexibility to collapse whitespace or not,
* according to the user's whim.)
*
* A PCCTS lexer to handle these requirements would look something like this:
*
* #token "\n" << newline_in_string (); >>
* #token "[\r\t]" << zzreplchar (' '); zzmore (); >>
* #token "\{" << open_brace(); >>
* #token "\}" << close_brace(); >>
* #token "\"" << quote_in_string (); >>
* #token "~[\n\{\}\"]+" << zzmore (); >>
*
* where the functions called are the same as currently in lex_auxiliary.c.
*
* However, I've added some trickery here that lets us heuristically detect
* runaway strings. The heuristic is as follows: anytime we have a newline
* in a string, that's reason to suspect a runaway. We follow up on this
* suspicion by slurping everything that could reasonably be part of the
* string and still be in the same line (i.e., a string of anything except
* newline, braces, parentheses, double-quote, and backslash), and then
* calling check_runaway_string(). This function then "backs up" to the
* beginning of the slurped string (the newline), and scans ahead looking
* for one of two patterns: "@name[{(]", or "name=" (with optional
* whitespace between the "tokens"). (Actually, it first makes a pass over
* the string to convert all whitespace characters -- including the sole
* newline -- to spaces. So, it's effectively looking for "\ *\@\ *NAME\
* *[\{\(]" (DLG regexp syntax) or "\ *NAME\ *=", where
* NAME="[a-z][a-z0-9+/:'.-]*" -- that is, something that looks like the
* start of an entry or a new field, but in a string (where they almost
* certainly shouldn't occur). Of course, there are no explicit regexps
* there -- it's all coded as a little hand-crafted automaton in C.
*
* At any rate, if either one of these patterns is matched,
* check_runaway_string() prints a warning and sets a flag so that we don't
* print that warning -- or indeed, even scan for the suspect patterns --
* more than once for the current string. (Because chances are if it
* occurs once, it'll occur again and again and again.)
*
* There is also some trickery going on to deal with '@comment' entries.
* Syntactically, these are just AT NAME STRING, where NAME must be
* 'comment'. This means that an '@comment' entry has no delimiters, it
* just has a string. To make them look a bit more like the other kinds of
* entries (which are delimited with '{' ... '}' or '(' ... ')', the STRING
* here is special: it's delimited either by braces or parentheses, rather
* than by the usual braces or double-quotes. Thus, we treat parentheses
* much like braces in this lexer, to handle the '@comment(...)' case. And
* there's an explicit check for the erroneous '@comment"..."' case in
* start_string(), just to be complete.
*
* So that explains all the regexps in this lexer: the first one (starting
* with newline) triggers the check for a runaway string. Then, we have a
* pattern to convert any single whitespace char (apart from newline) to a
* space; note that any whitespace chars that are matched in the
* newline-regexp will be converted by check_runaway_string(), and won't be
* matched by the whitespace regexp here. Then, we check for braces;
* open_brace() and close_brace() take care of counting brace-depth and
* determining if we have hit the end of the string. lparen_in_string()
* and rparen_in_string() do the same for parentheses, to handle
* '@comment(...)'. Then, if a double quote is seen, we call
* quote_in_string(); this takes care of ending strings quoted by double
* quotes. Finally, the "fall-through" regexp handles most strings (except
* for stuff that comes after a newline).
*/
#token "\n~[\n\{\}\(\)\"\\]*" << check_runaway_string (); >>
#token "[\r\t]" << zzreplchar (' '); zzmore (); >>
#token "\{" << open_brace (); >>
#token "\}" << close_brace (); >>
#token "\(" << lparen_in_string (); >>
#token "\)" << rparen_in_string (); >>
#token STRING "\"" << quote_in_string (); >>
#token "~[\n\{\}\(\)\"]+" << zzmore (); >>
#lexclass START
/* At last, the grammar! After that lexer, this is a snap. */
/*
* `bibfile' is the rule to recognize an entire BibTeX file. Note that I
* don't actually use this as the start rule myself; I have a function
* bt_parse_entry() (in input.c), which takes care of setting up the lexer
* and parser state in such a way that the parser can be entered multiple
* times (at the `entry' rule) on the same input stream. Then, the user
* calls bt_parse_entry() until end of file is reached, at which point it
* cleans up its mess. The `bibfile' rule should work, but I never
* actually use it, so it hasn't been tested in quite a while.
*/
bibfile! : << AST *last; #0 = NULL; >>
( entry
<< /* a little creative forestry... */
if (#0 == NULL)
#0 = #1;
else
last->right = #1;
last = #1;
>>
)* ;
/*
* `entry' is the rule that I actually use to enter the parser -- it parses
* a single entry from the input stream (that is, the lexer scans past
* junk until an '@' is seen at top-level, and that '@' becomes the AT
* token which starts an entry).
*
* `entry_metatype()' returns the value of a global variable maintained by
* lex_auxiliary.c that tells us how to parse the entry. This is needed
* because, while the different things that look like BibTeX entries
* (string definition, preamble, actual entry, etc.) have a similar lexical
* makeup, the syntax is different. In `entry', we just use the entry
* metatype to determine the nodetype field of the AST node for the entry;
* below, in `body' and `contents', we'll actually use it (in the form of
* semantic predicates) to select amongst the various syntax options.
*/
entry : << bt_metatype metatype; >>
AT! NAME^
<<
metatype = entry_metatype();
#1->nodetype = BTAST_ENTRY;
#1->metatype = metatype;
>>
body[metatype]
;
/*
* `body' is what comes after AT NAME: either a single string, delimited by
* {} or () (where NAME == 'comment'), or the more usual case of the entry
* contents, delimited by an entry 'opener' and 'closer' (either
* parentheses or braces).
*/
body [bt_metatype metatype]
: << metatype == BTE_COMMENT >>?
STRING << #1->nodetype = BTAST_STRING; >>
| ENTRY_OPEN! contents[metatype] ENTRY_CLOSE!
;
/*
* `contents' is where we select and accept the syntax for the guts of the
* entry, based on the type of entry that we're parsing. We find this
* out from the `nodetype' field of the top AST node for the entry, which
* is passed in as `entry_type'. General entries (ie. any unrecognized
* entry type) and `modify' entries have a name (the key), a comma, and
* list of "field = value" assignments. Macro definitions ('@string') are
* similar, but without the key-comma pair. Preambles have just a single
* value, and aliases have a single "field = value" assignment. (Note that
* '@modify' and '@alias' are BibTeX 1.0 additions -- I'll have to check
* the compatibility of my syntax with BibTeX 1.0 when it is released.)
* '@comment' entries are handled differently, by the `body' rule above.
*/
contents [bt_metatype metatype]
: << metatype == BTE_REGULAR /* || metatype == BTE_MODIFY */ >>?
( NAME | NUMBER ) << #1->nodetype = BTAST_KEY; >>
COMMA!
fields
| << metatype == BTE_MACRODEF >>?
fields
| << metatype == BTE_PREAMBLE >>?
value
// | << metatype == BTE_ALIAS >>?
// field
;
/*
* `fields' is a comma-separated list of fields. Note that BibTeX has a
* little wart in that it allows a single extra comma after the last field
* only. This is easy enough to handle, we just have to do it in the
* traditional BNFish way (loop by recursion) rather than use EBNF
* trickery.
*/
fields : field { COMMA! fields }
| /* epsilon */
;
/* `field' recognizes a single "field = value" assignment. */
field : NAME^
<< #1->nodetype = BTAST_FIELD; check_field_name (#1); >>
EQUALS! value
<<
#if DEBUG > 1
printf ("field: fieldname = %p (%s)\n"
" first val = %p (%s)\n",
#1->text, #1->text, #2->text, #2->text);
#endif
>>
;
/* `value' is a sequence of simple_values, joined by the '#' operator. */
value : simple_value ( HASH! simple_value )* ;
/* `simple_value' is a single string, number, or macro invocation. */
simple_value : STRING << #1->nodetype = BTAST_STRING; >>
| NUMBER << #1->nodetype = BTAST_NUMBER; >>
| NAME << #1->nodetype = BTAST_MACRO; >>
;
|