File: bibtex.g

/* ------------------------------------------------------------------------
@NAME       : bibtex.g
@DESCRIPTION: PCCTS-based lexer and parser for BibTeX files.  (Or rather,
              for the BibTeX data description language.  This parser
              enforces nothing about the structure and contents of
              entries; that's up to higher-level processors.  Thus, there's
              nothing either particularly bibliographic or TeXish about
              the language accepted by this parser, apart from the affinity
              for curly braces.)

              There are a few minor differences from the language accepted
              by BibTeX itself, but these are generally improvements over
              BibTeX's behaviour.  See the comments in the grammar, at least
              until I write a decent description of the language.

              I have used Gerd Neugebauer's BibTool (yet another BibTeX
              parser, along with a prettyprinter and specialized language
              for a common set of bibhacks) as another check of correctness
              -- there are a few screwball things that BibTeX accepts and
              BibTool doesn't, so I felt justified in rejecting them as
              well.  In general, this parser is a little stricter than
              BibTeX, but a little looser than BibTool.  YMMV.

              Another source of inspiration is Nelson Beebe's bibclean, or
              rather Beebe's article describing bibclean (from TUGboat
              vol. 14 no. 4; also included with the bibclean distribution).

              The product of the parser is an abstract syntax tree that can
              be traversed to be printed in a simple form (see
              print_entry() in bibparse.c) or perhaps transformed to a
              format more convenient for higher-level languages (see my
              Text::BibTeX Perl module for an example).

              Whole files may be parsed by entering the parser at `bibfile';
              in this case, the parser really returns a forest (list of
              ASTs, one per entry).  Alternatively, you can enter the parser
              at `entry', which reads and parses a single entry.
@GLOBALS    : the usual DLG and ANTLR cruft
@CALLS      : 
@CREATED    : first attempt: May 1996, Greg Ward
              second attempt (complete rewrite): July 25-28 1996, Greg Ward
@MODIFIED   : Sep 1996, GPW: changed to generate an AST rather than print
                             out each entry as it's encountered
              Jan 1997, GPW: redid the above, because it was lost when
                             my !%&$#!@ computer was stolen
              Jun 1997, GPW: greatly simplified the lexer, and added handling
                             of %-comments, @comment and @preamble entries,
                             and proper scanning of between-entry junk
@VERSION    : $Id: bibtex.g 640 1999-11-29 01:13:10Z greg $
@COPYRIGHT  : Copyright (c) 1996-99 by Gregory P. Ward.  All rights reserved.

              This file is part of the btparse library.  This library is
              free software; you can redistribute it and/or modify it under
              the terms of the GNU Library General Public License as
              published by the Free Software Foundation; either version 2
              of the License, or (at your option) any later version.
-------------------------------------------------------------------------- */

#header
<<
#define ZZCOL
#define USER_ZZSYN

#include "config.h"
#include "btparse.h"
#include "attrib.h"
#include "lex_auxiliary.h"
#include "error.h"
#include "my_dmalloc.h"

extern char * InputFilename;            /* for zzcr_ast call in pccts/ast.c */
>>

/*
 * The lexer has three modes -- START (between entries, hence it's what
 * we're in initially), LEX_ENTRY (entered once we see an '@' at
 * top-level), and LEX_STRING (for scanning quoted strings).  Note that all
 * the functions called from lexer actions can be found in lex_auxiliary.c.
 *
 * The START mode just looks for '@', discards comments and whitespace,
 * counts lines, and keeps track of any other junk.  The "keeping track"
 * just consists of counting the number of junk characters, which is then
 * reported at the next '@' sign.  This will hopefully let users clean up
 * "old style" implicit comments, and possibly catch some legitimate errors
 * in their files (e.g. a complete entry that's missing an '@').
 */

#token AT           "\@"            << at_sign (); >>
#token              "\n"            << newline (); >>
#token COMMENT      "\%~[\n]*\n"    << comment (); >>
#token              "[\ \r\t]+"     << zzskip (); >>
#token              "~[\@\n\ \r\t]+"<< toplevel_junk (); >>
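By way of illustration, the junk-counting behaviour described above can be modelled in a few lines of Python. This is a simplified, hypothetical sketch, not the generated DLG scanner; in particular it stays in START mode after the '@', where the real lexer switches to LEX_ENTRY.

```python
def scan_toplevel(text):
    """Model of the START lexer: skip whitespace and %-comments,
    count junk characters, and report the count at each '@' sign.
    (The real scanner switches to LEX_ENTRY at the '@'; this sketch
    just keeps going, so feed it between-entry text only.)"""
    junk, reports, i = 0, [], 0
    while i < len(text):
        c = text[i]
        if c == '@':                   # entry start: report accumulated junk
            if junk:
                reports.append(junk)   # the real lexer warns here
                junk = 0
            i += 1
        elif c == '%':                 # %-comment runs to end of line
            nl = text.find('\n', i)
            i = len(text) if nl < 0 else nl + 1
        elif c in ' \t\r\n':           # whitespace is discarded
            i += 1
        else:                          # anything else is top-level junk
            junk += 1
            i += 1
    return reports
```

Feeding it an "old style" implicit comment followed by an '@' yields one junk report, which is exactly what prompts the warning in the real scanner.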

#lexclass LEX_ENTRY

/*
 * The LEX_ENTRY mode is where most of the interesting stuff is -- these
 * tokens control most of the syntax of BibTeX.  First, we duplicate most
 * of the START lexer, in order to handle newlines, comments, and
 * whitespace.
 *
 * Next comes a "number", which is trivial.  This is needed because a
 * BibTeX simple value may be an unquoted digit string; it has to precede
 * the definition of "name" tokens, because otherwise a digit string would
 * be a legitimate "name", which would cause an ambiguity inside entries
 * ("is this a macro or a number?").
 * 
 * Then comes the regexp for a BibTeX "name", which is used for entry
 * types, entry keys, field names, and macro names.  This is basically the
 * same as BibTeX's definition of such "names", with two differences.  The
 * key, fundamental difference is that I have defined names by inclusion
 * rather than exclusion: this regex lists all characters allowed in a
 * type/key/field name/macro name, rather than listing those characters not
 * allowed (as the BibTeX documentation does).  The trivial difference is
 * that I have disallowed a few extra characters: @ \ ~.  Allowing @ could
 * cause confusing BibTeX syntax, and allowing \ or ~ can cause bogus TeX
 * code: try putting "\cite{foo\bar}" in your LaTeX document and see what
 * happens!  I'm also rather skeptical about some of the more exotic
 * punctuation characters being allowed, but since people have been using
 * BibTeX's definition of "names" for a decade or so now, I guess we're
 * stuck with it.  I could always amend name() to warn about any exotic
 * punctuation that offends me, but that should be an option -- and I don't
 * have a mechanism for user selectable warnings yet, so it'll have to
 * wait.
 * 
 * Also note that defining "number" ahead of "name" precludes a string of
 * digits from being a name.  This is usually a good thing; we don't want
 * to accept digit strings as article types or field names (BibTeX
 * doesn't).  However -- dubious as it may seem -- digit strings are
 * legitimate entry keys, so we should accept them there.  This is handled
 * by the grammar; see the `contents' rule below.
 * 
 * Finally, it should be noted that BibTeX does not seem to apply the same
 * lexical rules to entry types, entry keys, and field names -- so perhaps
 * doing so here is not such a great idea.  One immediate manifestation of
 * this is that my grammar in its unassisted state would accept a field
 * name with leading digits; BibTeX doesn't accept this.  I correct this
 * with the check_field_name() function, called from the `field' rule in
 * the grammar and defined in parse_auxiliary.c.
 */
#token              "\n"            << newline (); >>
#token COMMENT      "\%~[\n]*\n"    << comment (); >>
#token              "[\ \r\t]+"     << zzskip (); >>
#token NUMBER       "[0-9]+"
#token NAME         "[a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+"
                                    << name (); >>

/* 
 * Now come the (apparently) easy tokens, i.e. punctuation.  There are a
 * number of tricky bits here, though.  First, '{' can have two very
 * different meanings: at top-level, it's an entry delimiter, and inside an
 * entry it's a string delimiter.  This is handled (in lbrace()) by keeping
 * track of the "entry state" (top-level, after '@', after type, in
 * comment, or in entry) and using that to determine what to do on a '{'.
 * If we're in an entry, lbrace() will switch to the string lexer by
 * calling start_string(); if we're immediately after an entry type token
 * (which is just a name following a top-level '@'), then we force the
 * current token to ENTRY_OPEN, so that '{' and '(' appear identical to the
 * parser.  (This works because the scanner generated by DLG just happens
 * to assign the token number first, and then executes the action.)
 * Anywhere else (i.e. at top level or immediately after an '@'), we print
 * a warning and leave the token as LBRACE, which will cause a syntax error
 * (because LBRACE is not used anywhere in the grammar).
 *
 * '(' has some similarities to '{', but it's different enough that it
 * has its own function.  In particular, it may be an entry opener just
 * like '{', but in one case it may be a string opener: when it follows
 * '@' and 'comment', lparen() will call start_string() to enter the
 * string lexer.
 *
 * The other delimiter characters are easier, but still warrant an
 * explanation.  '}' should only occur inside an entry, and if found there
 * the token is forced to ENTRY_CLOSER; anywhere else, a warning is printed
 * and the parser should find a syntax error.  ')' should only occur inside
 * an entry, and likewise will trigger a warning if seen elsewhere.
 * (String-closing '}' and ')' are handled by the string lexer, below.)
 *
 * The other punctuation characters are trivial.  Note that a double quote
 * can start a string anywhere (except at top-level!), but if it occurs in
 * a weird place a syntax error will eventually occur.
 */
#token LBRACE       "\{"            << lbrace (); >>
#token RBRACE       "\}"            << rbrace (); >>
#token ENTRY_OPEN   "\("            << lparen (); >>
#token ENTRY_CLOSE  "\)"            << rparen (); >>
#token EQUALS       "="
#token HASH         "\#"
#token COMMA        ","
#token              "\""            << start_string ('"'); >>


#lexclass LEX_STRING

/*
 * Here's a reasonably decent attempt at lexing BibTeX strings.  There are
 * a couple of sneaky tricks going on here that aren't strictly necessary,
 * but can make the user's life a lot easier.
 *
 * First, here's what a simple and straightforward BibTeX string lexer 
 * would do:
 *   - keep track of brace-depth by incrementing/decrementing a counter
 *     whenever it sees `{' or `}'
 *   - if the string was started with a `{' and it sees a `}' which
 *     brings the brace-depth to 0, end the string
 *   - if the string was started with a `"' and it sees another `"' at
 *     brace-depth 0, end the string
 *   - any other characters are left untouched and become part of the
 *     string
 *
 * (Note that the simple act of counting braces makes this lexer
 * non-regular -- there's a bit more going on here than you might
 * think from reading the regexps.  So sue me.)
 *
 * The first, most obvious refinement to this is to check for newlines
 * and other whitespace -- we should convert either one to a single
 * space (to simplify future processing), as well as increment zzline on
 * newline.  Note that we don't do any collapsing of whitespace yet --
 * newlines surrounded by spaces make that rather tricky to handle
 * properly in the lexer (because newlines are handled separately, in
 * order to increment zzline), so I put it off to a later stage.  (That
 * also gives us the flexibility to collapse whitespace or not,
 * according to the user's whim.)
 * 
 * A PCCTS lexer to handle these requirements would look something like this:
 * 
 * #token     "\n"             << newline_in_string (); >>
 * #token     "[\r\t]"         << zzreplchar (' '); zzmore (); >>
 * #token     "\{"             << open_brace(); >>
 * #token     "\}"             << close_brace(); >>
 * #token     "\""             << quote_in_string (); >>
 * #token     "~[\n\{\}\"]+"   << zzmore (); >>
 *
 * where the functions called are the same as currently in lex_auxiliary.c.
 * 
 * However, I've added some trickery here that lets us heuristically detect
 * runaway strings.  The heuristic is as follows: anytime we have a newline
 * in a string, that's reason to suspect a runaway.  We follow up on this
 * suspicion by slurping everything that could reasonably be part of the
 * string and still be in the same line (i.e., a string of anything except
 * newline, braces, parentheses, double-quote, and backslash), and then
 * calling check_runaway_string().  This function then "backs up" to the
 * beginning of the slurped string (the newline), and scans ahead looking
 * for one of two patterns: "@name[{(]", or "name=" (with optional
 * whitespace between the "tokens").  (Actually, it first makes a pass over
 * the string to convert all whitespace characters -- including the sole
 * newline -- to spaces.)  So, it's effectively looking for "\ *\@\ *NAME\
 * *[\{\(]" (DLG regexp syntax) or "\ *NAME\ *=", where
 * NAME="[a-z][a-z0-9+/:'.-]*" -- that is, something that looks like the
 * start of an entry or a new field, but in a string (where they almost
 * certainly shouldn't occur).  Of course, there are no explicit regexps
 * there -- it's all coded as a little hand-crafted automaton in C.
 *
 * At any rate, if either one of these patterns is matched,
 * check_runaway_string() prints a warning and sets a flag so that we don't
 * print that warning -- or indeed, even scan for the suspect patterns --
 * more than once for the current string.  (Because chances are if it
 * occurs once, it'll occur again and again and again.)
 *
 * There is also some trickery going on to deal with '@comment' entries.
 * Syntactically, these are just AT NAME STRING, where NAME must be
 * 'comment'.  This means that an '@comment' entry has no delimiters, it
 * just has a string.  To make them look a bit more like the other kinds of
 * entries (which are delimited with '{' ... '}' or '(' ... ')'), the STRING
 * here is special: it's delimited either by braces or parentheses, rather
 * than by the usual braces or double-quotes.  Thus, we treat parentheses
 * much like braces in this lexer, to handle the '@comment(...)' case.  And
 * there's an explicit check for the erroneous '@comment"..."' case in
 * start_string(), just to be complete.
 *
 * So that explains all the regexps in this lexer: the first one (starting
 * with newline) triggers the check for a runaway string.  Then, we have a
 * pattern to convert any single whitespace char (apart from newline) to a
 * space; note that any whitespace chars that are matched in the
 * newline-regexp will be converted by check_runaway_string(), and won't be
 * matched by the whitespace regexp here.  Then, we check for braces;
 * open_brace() and close_brace() take care of counting brace-depth and
 * determining if we have hit the end of the string.  lparen_in_string()
 * and rparen_in_string() do the same for parentheses, to handle
 * '@comment(...)'.  Then, if a double quote is seen, we call
 * quote_in_string(); this takes care of ending strings quoted by double
 * quotes.  Finally, the "fall-through" regexp handles most strings (except
 * for stuff that comes after a newline).
 */
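The two suspect patterns that check_runaway_string() scans for can be approximated with explicit regexes. The real code is a hand-crafted C automaton with no regexps at all; this Python mirror is purely illustrative, using the NAME pattern quoted in the comment above.

```python
import re

NAME = r"[a-z][a-z0-9+/:'.\-]*"
ENTRY_START = re.compile(r"\A *@ *%s *[{(]" % NAME)   # looks like "@name{" or "@name("
NEW_FIELD   = re.compile(r"\A *%s *=" % NAME)         # looks like "name ="

def looks_runaway(line):
    """Return True if a line inside a string looks like the start of a
    new entry or a new field -- a hint that a closing quote or brace
    was missed somewhere above."""
    line = re.sub(r'\s', ' ', line)   # the real code converts whitespace first
    return bool(ENTRY_START.match(line) or NEW_FIELD.match(line))
```

An ordinary line of running text matches neither pattern, so the heuristic stays quiet for well-formed multi-line strings.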
#token        "\n~[\n\{\}\(\)\"\\]*" << check_runaway_string (); >>
#token        "[\r\t]"           << zzreplchar (' '); zzmore (); >>
#token        "\{"               << open_brace (); >>
#token        "\}"               << close_brace (); >>
#token        "\("               << lparen_in_string (); >>
#token        "\)"               << rparen_in_string (); >>
#token STRING "\""               << quote_in_string (); >>
#token        "~[\n\{\}\(\)\"]+" << zzmore (); >>
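The brace-depth bookkeeping that makes this lexer non-regular can be modelled directly. This is a hypothetical sketch of the "simple and straightforward" lexer described in the comment above: no runaway detection, no whitespace collapsing, and no @comment parentheses.

```python
def scan_string(text, pos):
    """Scan a string value starting at text[pos] (a '{' or '"').
    Returns (contents, index just past the closing delimiter)."""
    opener = text[pos]
    depth = 1 if opener == '{' else 0   # '"'-strings start at depth 0
    out = []
    i = pos + 1
    while i < len(text):
        c = text[i]
        if c == '{':
            depth += 1
            out.append(c)
        elif c == '}':
            depth -= 1
            if opener == '{' and depth == 0:
                return ''.join(out), i + 1      # brace-delimited string ends
            out.append(c)
        elif c == '"' and opener == '"' and depth == 0:
            return ''.join(out), i + 1          # quote-delimited string ends
        else:
            out.append(c)                       # everything else is content
        i += 1
    raise ValueError('unterminated string')
```

Interior braces nest (so `{Knuth {\TeX}}` keeps its inner group), and a `"` at non-zero depth inside a quote-delimited string is ordinary content.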

#lexclass START


/* At last, the grammar!  After that lexer, this is a snap. */

/* 
 * `bibfile' is the rule to recognize an entire BibTeX file.  Note that I
 * don't actually use this as the start rule myself; I have a function
 * bt_parse_entry() (in input.c), which takes care of setting up the lexer
 * and parser state in such a way that the parser can be entered multiple
 * times (at the `entry' rule) on the same input stream.  Then, the user
 * calls bt_parse_entry() until end of file is reached, at which point it
 * cleans up its mess.  The `bibfile' rule should work, but I never
 * actually use it, so it hasn't been tested in quite a while.
 */
bibfile!     : << AST *last; #0 = NULL; >>
               ( entry
                 <<                       /* a little creative forestry... */
                    if (#0 == NULL)
                       #0 = #1;
                    else
                       last->right = #1;
                    last = #1;
                 >>    
               )* ;

/*
 * `entry' is the rule that I actually use to enter the parser -- it parses
 * a single entry from the input stream (that is, the lexer scans past
 * junk until an '@' is seen at top-level, and that '@' becomes the AT 
 * token which starts an entry).
 *
 * `entry_metatype()' returns the value of a global variable maintained by
 * lex_auxiliary.c that tells us how to parse the entry.  This is needed
 * because, while the different things that look like BibTeX entries
 * (string definition, preamble, actual entry, etc.) have a similar lexical
 * makeup, the syntax is different.  In `entry', we just use the entry
 * metatype to determine the nodetype field of the AST node for the entry;
 * below, in `body' and `contents', we'll actually use it (in the form of
 * semantic predicates) to select amongst the various syntax options.
 */
entry        : << bt_metatype metatype; >>
               AT! NAME^
               <<
                  metatype = entry_metatype();
                  #1->nodetype = BTAST_ENTRY;
                  #1->metatype = metatype;
               >>
               body[metatype]
             ;

/*
 * `body' is what comes after AT NAME: either a single string, delimited by
 * {} or () (where NAME == 'comment'), or the more usual case of the entry
 * contents, delimited by an entry 'opener' and 'closer' (either
 * parentheses or braces).
 */
body [bt_metatype metatype]
             : << metatype == BTE_COMMENT >>?
               STRING     << #1->nodetype = BTAST_STRING; >>
             | ENTRY_OPEN! contents[metatype] ENTRY_CLOSE!
             ;

/* 
 * `contents' is where we select and accept the syntax for the guts of the
 * entry, based on the type of entry that we're parsing.  We find this
 * out from the `nodetype' field of the top AST node for the entry, which
 * is passed in as `entry_type'.  General entries (ie. any unrecognized
 * entry type) and `modify' entries have a name (the key), a comma, and
 * list of "field = value" assignments.  Macro definitions ('@string') are
 * similar, but without the key-comma pair.  Preambles have just a single
 * value, and aliases have a single "field = value" assignment.  (Note that
 * '@modify' and '@alias' are BibTeX 1.0 additions -- I'll have to check
 * the compatibility of my syntax with BibTeX 1.0 when it is released.)
 * '@comment' entries are handled differently, by the `body' rule above.
 */
contents [bt_metatype metatype]
             : << metatype == BTE_REGULAR /* || metatype == BTE_MODIFY */ >>?
               ( NAME | NUMBER ) << #1->nodetype = BTAST_KEY; >> 
               COMMA!
               fields
             | << metatype == BTE_MACRODEF >>?
               fields
             | << metatype == BTE_PREAMBLE >>?
               value
//           | << metatype == BTE_ALIAS >>?
//             field
             ;

/*
 * `fields' is a comma-separated list of fields.  Note that BibTeX has a
 * little wart in that it allows a single extra comma after the last field
 * only.  This is easy enough to handle, we just have to do it in the
 * traditional BNFish way (loop by recursion) rather than use EBNF
 * trickery.
 */
fields       : field { COMMA! fields }
             | /* epsilon */
             ;
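The loop-by-recursion trick can be transcribed almost literally. This sketch runs over a hypothetical pre-tokenized list in which 'FIELD' stands in for a whole "name = value" assignment and 'COMMA' for the separator.

```python
def fields(toks, i):
    """fields : field { COMMA fields } | epsilon  -- transcribed directly.
    Returns the position reached, starting from i."""
    if i < len(toks) and toks[i] == 'FIELD':
        i += 1
        if i < len(toks) and toks[i] == 'COMMA':  # optional comma, then recurse
            return fields(toks, i + 1)
    return i                                       # epsilon (or no trailing comma)

def accepts(toks):
    """True iff the whole token list is a valid `fields'."""
    return fields(toks, 0) == len(toks)
```

Because each recursion consumes exactly one field and at most one comma, a single trailing comma is absorbed by the final epsilon step, but two in a row are not -- precisely BibTeX's wart.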

/* `field' recognizes a single "field = value" assignment. */
field        : NAME^
               << #1->nodetype = BTAST_FIELD; check_field_name (#1); >>
               EQUALS! value
               << 
#if DEBUG > 1
                  printf ("field: fieldname = %p (%s)\n"
                          "       first val = %p (%s)\n",
                          #1->text, #1->text, #2->text, #2->text);
#endif
               >>
             ;

/* `value' is a sequence of simple_values, joined by the '#' operator. */
value        : simple_value ( HASH! simple_value )* ;

/* `simple_value' is a single string, number, or macro invocation. */
simple_value : STRING      << #1->nodetype = BTAST_STRING; >>
             | NUMBER      << #1->nodetype = BTAST_NUMBER; >>
             | NAME        << #1->nodetype = BTAST_MACRO; >>
             ;
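The `value` rule's shape (simple_values joined by '#') can be shown over a hypothetical token list, where each element is either a simple_value lexeme or the HASH separator.

```python
def parse_value(toks):
    """value : simple_value ( HASH simple_value )* -- collect the
    simple_value tokens, checking the '#' separators between them."""
    vals = [toks[0]]
    i = 1
    while i < len(toks):
        assert toks[i] == '#', "simple values must be joined by '#'"
        vals.append(toks[i + 1])
        i += 2
    return vals
```

So a value like `jan # " 1, " # 1993` yields three simple_values -- a macro, a string, and a number -- each of which would get the corresponding nodetype above.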