1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380
|
@* Parsing the program code. @:title@>
The most intricate part of \.{\me.} is its mechanism for converting
\Cee-like code into \TeX\ code, and we shall consider this aspect of the
program now. This parsing mechanism must be able to deal with fragmentary
constructions whose overall ``part of speech'' is not known (e.g., those
within~`\pb'), which implies that recognition must in principle be able to
proceed without any context information. Therefore a ``bottom-up'' approach
is used, that collects successive tokens, and only takes action once it has
seen enough to recognise a complete match with a syntax rule. Bottom-up
parsing is a very powerful technique (it can handle grammars that are not
tractable by many other parsing methods); it is one of the many significant
contributions to computer science that the designer of the original \.{WEB}
system, D.~E. Knuth, has made. Even so, the technique used here is less
powerful than that of traditional bottom-up parsers produced by parser
generators, since those can use information derived from the full
left-context (and a small amount of right-context) to help decide difficult
situations. In practice such context is not often required to unambiguously
parse program fragments, and most situations where there is a possibility of
incorrect recognition can be avoided be careful specification of the syntax
rules to be used; this does however make formulation of the grammar a
slightly subtle matter.
As we have already seen, the input is represented as a sequence of
{\sl scraps}, each of which specifies a {\sl category\/} and a
{\sl translation}. The category defines a syntactic class, and the
translation is a token list that represents \TeX\ code; the latter is
effectively just a single string of characters, although it is stored in a
different form for efficiency reasons. Rules of syntax and semantics tell us
how to combine adjacent scraps into larger ones, and if we are lucky an
entire \Cee\ text that starts out as hundreds of small scraps will join
together into one gigantic scrap whose translation is the desired
\TeX\ code. If we are unlucky (i.e., if the input is not properly formed
for recognition by our parser), we will be left with several scraps that
don't combine; their translations will simply be output, one by one.
@*1 Introduction to the grammar.
The parsing process is governed by a set of reduction rules, each of which
specifies how a sequence of consecutive scraps with certain given categories
can be combined into a single new scrap, and it also tells what category the
new scrap has and how its translation is obtained from those of the original
scraps. The set of reduction rules is given by a table, which is actually
an initialiser for a static array, starting in section@#rules@>.
In many cases a rule is simply given by listing the category codes
forming its pattern (its left hand side) and the category code of the
resulting scrap (the right hand side). For example, one of the reduction
rules could be represented as
$$
\hbox{|expression| |binop| |expression|}\Rightarrow\hbox{|expression|}.
$$
For such simple rules the translations of the original scraps are simply
concatenated to form the translation of the resulting scrap. The line in the
table corresponding to this rule actually looks like this:
$$
\hbox{|{2,{{expression,binop,expression}},{expression, NULL}}|}.
$$
The `2' is the number used to identify the rule, which is useful when
tracing the activities of the parser, and the |NULL| signifies that the
default translation applies; the braces reflect the structure used to
represent syntax rules, which has some additional components that are not
specified explicitly here, and that are therefore initialised to~0.
For other rules specific additional items are to be inserted in between the
translations of the scraps that match the pattern. For instance, here is
another rule present in the table
$$
\hbox{|{ 6,{{expression, comma, expression}},{expression,"__p1_"}}|}.
$$
Here the translation is specified by the ``format string'' |"__p1_"|, where
the underscores stand for the translations of the three participating
scraps, the `\.{p1}' for an |opt| token with argument~`\.1' (eventually
producing `\.{\\penalty10}') that comes after the translation of the second
scrap, with category |comma|. That translation will almost always be |","|,
since there are no rules with |comma| as their right hand side, but if a
comment follows the comma, it will have been incorporated into its scrap
while preparing for the reduction, and in this case the optional break will
follow the comment (and have no effect, because there is already a forced
break after comments). The interpretation of all characters that can appear
in the format strings can be found in section@#format@>.
To handle all cases that can arise in programs, in combination with a large
number of formatting options that might be selected, our mechanism for rules
of a somewhat more general form than those presented above, but the
extensions are used only for a few rules. Consider the following rule:
$$
\hbox{|{182,{{lpar,statement,statement},1},
{statement,"_B_"},forced_statements}|}.
$$
Here, the `1' indicates that the first scrap from the left in the pattern,
|lpar|, serves as context only, and does not take part in the actual
reduction. The rule therefore reduces a sequence of two consecutive
statements into a single one, inserting a |break_space| (from the `\.B'
format character) between their translations, but only if the statements are
preceded by a left parenthesis (category~|lpar|); this restriction means
that the rule will only apply to the first two semicolon-separated
expressions following `|for|~|(|'. The entry |forced_statements| means
moreover that the rule will not be used at all unless a `\.{+f}'~or~`\.{+a}'
command line option was specified, which forces line breaks between
consecutive statements; this rule then avoids such line breaks between the
controlling expressions of a |for|-loop, since it takes precedence (see
below) over the rule that normally combines statements.
@ The rules are applied from left to right, as follows. Suppose we are
currently working on the sequence of scraps with categories $c_1\,c_2\ldots
c_n$. We try first to find a rule whose pattern matches an initial substring
$c_1\,c_2\ldots\,$, and if no such rule exists, we try to find a rule
applicable to the next substring $c_2\,c_3\ldots\,$; if that fails, we try
to match $c_3\,c_4\ldots\,$, etc. When several patterns match, starting at
the same scrap, the longest one is selected (no two rules can have identical
patterns). For instance, there is a rule that reduces `|struct_like|
|expression| |lbrace|' to `|struct_head|', and another one that reduces
`|struct_like| |expression|' to `|int_like|'; the latter is only applied
when no |lbrace| scrap follows. This is a sensible rule, since the longer
pattern is the more specific one, and without such a rule it could never
match. One should be aware however that this only works because the scrap
with category |lbrace| represents a single token that requires no reduction
to create it, for otherwise the two-scrap reduction would be applied before
the three-scrap reduction would have a chance to match.
One might say that we use a left-to-right eager strategy for choosing
reductions; this strategy is chosen on heuristic grounds, and there is no
guarantee that it will find a successful sequence of reductions if one
exists. In other words, if we interchange left and right hand sides of the
rules and view them as {\sl production\/} rules of a context free grammar,
then the language generated by this grammar can be much larger than the one
recognised by our parser. For instance, the context free grammar would, from
the rules given above, generate expressions with multiple operators where
implicit parentheses are placed arbitrarily (and therefore it would be
ambiguous); our parsing strategy however will always place implicit
parentheses to the left. (We don't care if implicit parentheses are placed
incorrectly, because it does not influence typesetting.) It is an
interesting theoretic problem to find an algorithm that will transform a
set of reductions rules into a context free grammar generating exactly the
language recognised; that would help to verify that the rules will work as
intended. We must be cautious in formulating the rules, not only to give
rules that are sufficient to reduce all desired programs, but also not to
specify rules that could in certain circumstances match with precedence over
the ones we intended for the situation.
Some of the formatting of the output is controlled by the \TeX~format in the
file \.{cwebxmac.tex}, rather than directly by \.{\me.}. For instance, the
`\.{\\penalty10}' mentioned above is actually written as `\.{\\31}'; the
macro `\.{\\3}' will in fact raise the penalty for any enclosing pair of
braces or parentheses; thus breaks in argument lists or initialiser lists
will be avoided in favour of ones outside those lists. There are other
effects not visible in the grammar, like optional breaks automatically
associated with certain operators. In the whole it is a delicate interplay
between the set of reduction rules, the reduction mechanism, the macro
definitions, and the (mathematical) typesetting rules of \TeX\ that
determines how programs will be typeset. This division of labour is quite
convenient (e.g., it would tremendously complicate the task of \.{\me.} if
it had to decide which optional breaks should actually be taken), but it has
the unfortunate consequence that it is not easy to gain a comprehensive
understanding of the translation process. Indeed there have been quite a few
surprises during the construction of the system, and we cannot be sure that
there are no more to come; yet we believe the situation is more transparent
than in the original |WEAVE|, since we have avoided rules that are oriented
towards the target language (\TeX) rather than towards the source language
(\Cee).
@*2 Categories of scraps.
Here is a list of the category codes that scraps can have. The category codes
that apply to reserved words (e.g., |while_like|, but also |declaration| for
|va_dcl|) as well as |expression| (that is used for |type_defined| identifiers
in their typedef declaration) are also used as their |ilk| value, and they are
sufficiently high that they can be distinguished from |ilk| values that are
not category codes, like |type_defined|, |TeX_like|, |NULL_like|,
|const_like|, |typedef_like| and |namespace_like|; thus reserved words will
never be mistaken for identifiers with such an ilk, and we can safely
substitute certain such |ilk| values by a different category code as happens
in section@#ilk change@>. A number of categories can occur in scraps, but do
not occur in any of the reduction rules, since they are handled by other
means; they have values $\geq{}$|min_nonrule_cat|. The macro |valid_cat|
checks whether |c| is a category that might match in a rule; it uses its
argument twice, so its argument should not cause a side effect. If this
section is changed, section@#category output@> should be changed
correspondingly.
@d max_category end_expr /* largest scrap category */
@d min_nonrule_cat lproc /* first category forbidden in rules */
@d valid_cat(c) ((c)>0 && (c)<min_nonrule_cat)
@<Typedef and enum...@>=
enum @/ @:categories@>
{ unop = 1, /* a unary operator like `|!|' */
binop, /* a binary operator like `|<|' */
unorbinop, /* an operator that can be either, like `|-|' */
selectx, /* structure selection: `|.|' or `|->|' */
question, /* a question mark operator */
lbrace, rbrace, lpar, rpar, lbrack, rbrack,
/* `|{|', \ `|}|', `(', \ `)', `[', \ `]' */
comma, semi, colon, colcol, magic, /* `,', `;', `:', `$\CC$', \:; */
subscript, /* an array subscript, like `|[]|' or `|[i++]|' */
struct_head, /* the beginning of a struct specifier, like `|struct s{|' */
short_lbrace, short_struct_head, /* from `\.{\{@@;}', for one-liners */
compound_statement, /* a complete compound statement */
statement, /* a complete statement, possibly compound */
function, /* a complete function definition */
function_head, /* a function identifier followed by formal parameters */
parameters,
/* parameters in function declaration, or casting operator like `|(int)|' */
label, /* a statement label */
if_head, /* `|if|' followed by a (parenthesised) expression */
if_else_head, /* \lq|if @t\dots@>@; else|', \lq|while(@t\dots@>)|'
or \lq|switch(@t\dots@>)|' */
do_head, /* `|do @t\dots@>@; while|' */
mod_scrap, /* module name */
declarator, /* abstract declarator, like `|(*)(int,char*[])|' */
declaration, /* a complete declaration */
expression, /* an expression, possibly a single identifier */
while_like, /* `|for|', `|while|', `|switch|' */
do_like,
if_like,
else_like,
extern_like,
throw_like, try_like, catch_like, /* single tokens */
int_like, /* `|int|', `|char|', \dots */
case_like, /* `|case|', `|default|' */
sizeof_like, /* `|sizeof|', `|const|', `\&{new}', `\&{delete} */
struct_like, /* `|struct|', `|union|', `|enum|',
`\&{class}', `\&{typename}' */
return_like, /* `|return|', `|break|', `|continue|', `|goto|' */
template_like, langle, rangle,
/* `\&{template}', `$\langle$', `$\rangle$', for \Cpp */
templ_params, /* template parameters */
lproc, /* `\&\#' and following identifier starting preprocessor directive */
rproc, /* end of a preprocessor directive */
insert, /* comment or other syntactically inert item */
begin_expr, end_expr /* \:[ and \:] */
};
@ As we have already seen, tokens are converted to elementary scraps by
the function |C_read|; these scraps form the `terminal symbols' of our
grammar. The translation of tokens to scraps is largely governed by the
static array |trans_ini|, whose initialisation values we shall now give.
@d yes_math 1 /* should be in math mode */
@d no_math 2 /* should be in horizontal mode */
@d maybe_math 0 /* works in either horizontal or math mode */
@< Initialiser for |trans_ini| @>=
{ '!', unop, yes_math, "\\R" }, @/ @.\\R@>
{ '~', unop, yes_math, "\\CM" }, @/ @.\\CM@>
{ '/', binop, yes_math, "/" }, @/
{ '<', binop, yes_math, "<" }, @/
{ '>', binop, yes_math, ">" }, @/
{ '.', selectx, yes_math, "." }, @/
{ '=', binop, yes_math, "\\K" }, @/ @.\\K@>
{ '|', binop, yes_math, "\\OR" }, @.\\OR@>
{ or, binop, yes_math, "\\OR" }, @/
{ '^', binop, yes_math, "\\XOR" }, @/ @.\\XOR@>
{ '%', binop, yes_math, "\\MOD" }, @/ @.\\MOD@>
{ '+', unorbinop, yes_math, "+" }, @/
{ '-', unorbinop, yes_math, "-" }, @/
{ '*', unorbinop, yes_math, "*" }, @/
{ '&', unorbinop, yes_math, "\\AND" }, @/ @.\\AND@>
{ '?', question, yes_math, "\\?" }, @/
{ '(', lpar, yes_math, "(" }, @/
{ ')', rpar, yes_math, ")" }, @/
{ '[', lbrack, maybe_math, "[" }, @/
{ ']', rbrack, maybe_math, "]" }, @/
{ '{', lbrace, yes_math, "\\{" }, @/
{ '}', rbrace, yes_math, "\\}" }, @/
{ ',', comma, yes_math, "," }, @/
{ ';', semi, yes_math, ";" }, @/
{ ':', colon, maybe_math, ":" }, @/
{ '#', insert, maybe_math, "\\#" },
/* this should occur only in macro definitions */ @/ @.\\\#@>
{ at_sign_image, insert, maybe_math, "@@" },
/* this should not occur in legal \Cee~text */
@)
{ not_eq, binop, yes_math, "\\I" }, @/ @.\\I@>
{ lt_eq, binop, yes_math, "\\Z" }, @/ @.\\Z@>
{ gt_eq, binop, yes_math, "\\G" }, @/ @.\\G@>
{ eq_eq, binop, yes_math, "\\E" }, @/ @.\\E@>
{ and_and, binop, yes_math, "\\W" }, @/ @.\\W@>
{ or_or, binop, yes_math, "\\V" }, @/ @.\\V@>
{ plus_plus, unop, yes_math, "\\PP" }, @/ @.\\PP@>
{ minus_minus, unop, yes_math, "\\MM" }, @/ @.\\MM@>
{ minus_gt, selectx, yes_math, "\\MG" }, @/ @.\\MG@>
{ gt_gt, binop, yes_math, "\\GG" }, @/ @.\\GG@>
{ lt_lt, binop, yes_math, "\\LL" }, @/ @.\\LL@>
{ mul_assign, binop, yes_math, "\\KK*" }, @/ @.\\KK@>
{ div_assign, binop, yes_math, "\\KK/" }, @/
{ mod_assign, binop, yes_math, "\\KK\\MOD" }, @/ @.\\MOD@>
{ plus_assign, binop, yes_math, "\\KK+" }, @/
{ minus_assign, binop, yes_math, "\\KK-" }, @/
{ left_assign, binop, yes_math, "\\KK\\LL" }, @/
{ right_assign, binop, yes_math, "\\KK\\GG" }, @/
{ and_assign, binop, yes_math, "\\KK\\AND" }, @/ @.\\AND@>
{ xor_assign, binop, yes_math, "\\KK\\XOR" }, @/ @.\\XOR@>
{ or_assign, binop, yes_math, "\\KK\\OR" }, @/ @.\\OR@>
{ thin_space, insert, yes_math, "\\," }, @/ @.\\,@>
{ pseudo_semi, magic, maybe_math, "" }, @/
{ force_expr_open, begin_expr, maybe_math, "" }, @/
{ force_expr_close, end_expr, maybe_math, "" }, @/
{ join, insert, no_math, "\\J" }, @/ @.\\J@>
{ ellipsis, int_like, yes_math, "\\ldots" }, @/
{ sh_sh, binop, yes_math, "\\SS" }, @/ @.\\SS@>
{ colon_colon, colcol, yes_math, "\\CC" }
@ For \Cpp\ we give `\.<' and `\.>' separate categories, which is necessary to
be able to recognise template arguments.
@< Fix the categories of angle brackets for \Cpp @>=
{@; token_trans['<'].cat=langle; token_trans['>'].cat=rangle; }
@ Certain tokens that lead to fixed scraps are not included in the
|trans_ini| array because their translations involve non-character tokens.
Since there are only a few of them the easiest solution is to install each
one explicitly into the |token_trans| array.
@d start_scrap(s,c,m) p=&token_trans[s],p->cat=c, p->mathness=5*(m)
@d end_scrap p->trans=text_ptr, freeze_text();
@< Install the translations of tokens involving line breaks @>=
{ scrap* p;
start_scrap(math_break,insert,maybe_math); /* \:\v */
app(opt), app('0'); end_scrap;
@/start_scrap(line_break,insert,no_math); /* \:/ */
app(force); end_scrap;
@/start_scrap(end_preproc,rproc,no_math); /* end of preprocessor directive */
app(force); end_scrap;
@/start_scrap(' ',insert,no_math); /* space within preprocessor directive */
app(break_space); end_scrap;
@/start_scrap(big_line_break,insert,no_math); /* \:) */
app(big_force); end_scrap;
@/start_scrap(backup_line,insert,no_math); /* \:\\ */
app(backup); end_scrap;
@/start_scrap(no_line_break,insert,no_math); /* \:+ */
app(cancel),app(relax),app(break_space),app(relax),app(cancel); end_scrap;
@/start_scrap(include_preproc,insert,yes_math); /* \:p */
app(force),app_str("\\ATP"),app(force); end_scrap; @.\\ATP@>
}
@ When \.{\me.} is compiled with the |DEBUG| switch, it can display
its parsing steps. The order of strings in |cat_name| must match that
in the |enum| declaration in section@#categories@>.
@c
@:category output@>
#ifdef DEBUG
void print_cat (int c) /* symbolic printout of a category */
{ static char* cat_name[]=
{ "unop", "binop", "op", "selectx"
, "?", "{", "}", "(", ")", "[", "]", ",", ";", ":", "::", "@@;"
, "subscr", "struct_head", "short_{", "short_struct_head"
, "cmp_stmt", "stmt"
, "function", "function_head", "params", "label"
, "if_head", "if_else_head", "do_head"
, "mod_name", "declarator", "decl", "exp"
, "for", "do", "if", "else", "extern"
, "throw", "try", "catch,"
, "int", "case", "sizeof", "struct", "return"
, "template", "<", ">", "templ_params"
, "#{", "#}", "insert", "@@[", "@@]"
};
if (c<=max_category && c>0) printf("%s",cat_name[c-1]);
else printf ("IMPOSSIBLE");
}
#endif /* |DEBUG| */
@ Another major class of terminal symbols is formed by the reserved words.
If a name exists in the hash table with an |ilk| specifying a reserved word,
then |id_lookup| will return the reserved word when called with that name,
and |C_read| will use the |ilk| to set the category of the resulting scrap.
So all that has to be done is to get all the reserved words into the hash
table with the right ilks initially. The simplest way to do this is to call
|id_lookup| for all reserved words with the proper |ilk| at the beginning of
each run of \.{\me.}. Fortunately there are not too many reserved words.
This code below uses the fact that instead of using pointers to beginning
and end of a string for |id_lookup|, one may also pass a single pointer to a
null-terminated string provided the other pointer is null.
@^reserved words@>
@<Store all the reserved words@>=
{ int i; static char* int_likes[]=
{ "auto","char","double","float","int","long","register"
, "short","signed","static","unsigned","void" };
static char* defined_types[] =
{ "FILE", "size_t", "ptrdiff_t", "wchar_t"
, "jmp_buf", "sig_atomic_t", "fpos_t", "div_t", "ldiv_t"
, "clock_t","time_t"
, "va_list"
};
static char* return_likes[]=
{"break","continue","goto","return"};
int int_like_nr=array_size(int_likes),
defined_type_nr=array_size(defined_types),
return_like_nr=array_size(return_likes);
for (i=0; i<int_like_nr; ++i) id_lookup(int_likes[i],NULL,int_like);
for (i=0; i<defined_type_nr; ++i)
id_lookup(defined_types[i],NULL,type_defined);
for (i=0; i<return_like_nr; ++i) id_lookup(return_likes[i],NULL,return_like);
id_lookup("case", NULL, case_like);
id_lookup("const", NULL, const_like);
id_lookup("default", NULL, case_like);
id_lookup("do", NULL, do_like);
id_lookup("else", NULL, else_like);
id_lookup("enum", NULL, struct_like);
id_lookup("extern", NULL, C_plus_plus ? extern_like : int_like);
id_lookup("for", NULL, while_like);
id_lookup("if", NULL, if_like);
id_lookup("sizeof", NULL, sizeof_like);
id_lookup("struct", NULL, struct_like);
id_lookup("switch", NULL, while_like);
id_lookup("typedef", NULL, typedef_like);
id_lookup("union", NULL, struct_like);
id_lookup("va_dcl",NULL, declaration);
id_lookup("volatile", NULL, const_like);
id_lookup("while", NULL, while_like);
id_lookup("NULL", NULL, NULL_like);
id_lookup("TeX", NULL, TeX_like);
if (C_plus_plus) @<Store reserved words for \Cpp@>
}
@ One main difference between \Cee\ and \Cpp, as far as \.{\me.} is
concerned, is that the latter has a number of additional reserved words. Most
of them are sufficiently like some \Cee-reserved word (or category) that we
can simply make it behave like that \Cee~symbol, without changing the syntax.
For `\&{new}', `\&{delete}', and `\&{operator}', some additional syntax rules
will be needed however; nevertheless we do not need to extend the set of
syntactic categories. For `\&{operator}' we abuse the category |case_like|,
since its proper use is rather restricted (`|case|' it is always followed by
an expression, while `|default|', `\&{private}' and its relatives are always
followed by a colon), so there will be no confusion with `\&{operator}', which
is always followed by an operator symbol. Similarly, the different cast
operators are always followed by an operator (`$<$') which |sizeof| is never,
so they should not cause confusion.
@<Store reserved words...@>=
{ int i;
static char* cpp_types[] =
{ "exception", "bad_exception", "bad_cast", "bad_typeid", "logic_error",
"domain_error", "invalid_argument", "length_error", "out_of_range",
"bad_alloc", "runtime_error", "range_error", "overflow_error",
"underflow_error",
"string",
"iterator", "const_iterator", "reverse_iterator", "const_reverse_iterator",
"size_type", "value_type",
"ios_base","ios", "istream", "ostream", "iostream",
"istringstream", "ifstream", "ostringstream", "ofstream",
"stringstream", "fstream",
"streambuf", "stringbuf", "filebuf",
"streamoff", "streampos",
"input_iterator_tag", "output_iterator_tag", "forward_iterator_tag",
"bidirectional_iterator_tag", "random_access_iterator_tag",
"pair", "auto_ptr", "allocator", "raw_storage_iterator",
"vector", "list", "deque",
"set", "multiset", "map", "multimap",
"stack", "queue", "priority_queue", "bitset",
"shared_ptr", "weak_ptr", "unique_ptr"
};
for (i=0; i<array_size(cpp_types); ++i)
id_lookup(cpp_types[i],NULL,type_defined);
id_lookup("asm", NULL, int_like);
id_lookup("and", NULL, and_like);
id_lookup("bool", NULL, int_like);
id_lookup("catch", NULL, catch_like);
id_lookup("class", NULL, struct_like);
id_lookup("delete", NULL, sizeof_like);
id_lookup("explicit", NULL, int_like);
id_lookup("false", NULL, expression);
id_lookup("friend", NULL, int_like);
id_lookup("inline", NULL, int_like);
id_lookup("namespace", NULL, namespace_like);
id_lookup("new", NULL, sizeof_like);
id_lookup("not", NULL, not_like);
id_lookup("operator", NULL, case_like);
id_lookup("or", NULL, and_like);
id_lookup("private", NULL, case_like);
id_lookup("protected", NULL, case_like);
id_lookup("public", NULL, case_like);
id_lookup("template", NULL, template_like);
id_lookup("this", NULL, expression);
id_lookup("throw", NULL, throw_like);
id_lookup("true", NULL, expression);
id_lookup("try", NULL, try_like);
id_lookup("typeid", NULL, sizeof_like);
id_lookup("typename", NULL, typename_like);
id_lookup("using", NULL, int_like);
id_lookup("virtual", NULL, int_like);
id_lookup("xor", NULL, and_like);
id_lookup("const_cast", NULL, sizeof_like);
id_lookup("static_cast", NULL, sizeof_like);
id_lookup("dynamic_cast", NULL, sizeof_like);
id_lookup("reinterpret_cast", NULL, sizeof_like);
}
@ There are a few more kinds of elementary scraps that the functions we have
given before can produce, which we mention here for completeness. Ordinary
identifiers get category |expression|, and their names will be expanded on
output as argument to a control sequence that provides the proper formatting.
For strings, constants, verbatim constructions, and \TeX~strings, the
applicable control sequences and the constituent characters (escaped with
backslashes where necessary) are written explicitly into token memory;
their scraps also have category |expression|. Comments are converted to
scraps of category |insert|, and their contents are also stored literally;
in case of `\pb' fragments a reference to a text marked with
|inner_text_flag| is stored, for the production of which the parsing
mechanism has in fact already been invoked. By contrast module names are
stored by reference to the name table, just like identifiers, and their
scraps have category |mod_scrap|; in their case the parsing mechanism may
be called during the {\sl output\/} process, if any `\pb' constructions
occur.
The final \TeX\ output produced for elementary scraps will often be marked
with special control sequences. Ordinary multi-character identifiers are
enclosed in `\.{\\\\\{}$\,\ldots\,$\.\}' (single character identifiers are
merely preceded by a space; they will be set in math italic), identifiers
whose |ilk| is |TeX_like| will become control sequences that are
also enclosed in `\.{\\\\\{}$\,\ldots\,$\.\}' (to establish italic type),
reserved words are enclosed in `\.{\\\&\{}$\,\ldots\,$\.\}', strings and
all-caps identifiers in `\.{\\.\{}$\,\ldots\,$\.\}', constants in
`\.{\\T\{}$\,\ldots\,$\.\}', and verbatim constructions in
`\.{\\vb\{}$\,\ldots\,$\.\}'. Comments are enclosed in
`\.{\\C\{}$\,\ldots\,$\.\}' and usually followed by `\.{\\6}' (a forced
break), and module names take the form `\.{\\X$n$:}$\,\ldots\,$\.{\\X}'
where |n| is the section number (since module names have
|mathness==yes_math|, there is no danger that the final `\.{\\X}' will
disable a following space when coming from a `\hbox{\.{\v@@< ... @@>\v}}'
construction).
@*1 Internal representation of the grammar. @:title@>
We will now consider the how the reduction rules themselves are
represented and used. As we have seen, a rule must define a sequence of
categories for its left hand side, and for its right hand side a category
and a prescription for constructing its translation. In addition, some
categories of the left hand side may be marked as context, so that they will
not take part in the reduction, and there is a way to specify conditional
loading of rules. A few more pieces of information are included for
convenience and efficiency.
Individual reduction rules are stored in a structure called |reduction|.
It is organised in a way that allows for semi-static initialisation, i.e.,
the essential parts of information are stored near the beginning of the
structure or of one of its sub-structures, so that they can be defined by
an initialiser expression, while some further fields are computed from them
and assigned at startup time. Within the fields that are statically
initialised some fields that usually are~0 are put at the end, so that in
the default case they can be omitted from the initialiser.
The field |id| holds an identification number for the rule, which is used in
debugging. Then follows the left hand side information, consisting of an
array of at most |max_lhs_length| categories (which include those of a
possible context; if less than the maximal number of categories are present
they are padded with zeros), followed by integers |context| and |length|.
The field |context| specifies which categories, if any, form the context:
this can be a sequence of one or more categories at either end of the left
hand side of the rule, but not at both ends. If |context==0| (as is the
case if no explicit initialiser is specified), there are no context
categories; when |context>0|, the first |context| categories form the left
context, and when |context<0|, the last |abs(context)| categories form the
right context. (In practice it is wise to use only token categories (ones
that do not require reduction to be formed) for a right context, unless one
can be quite sure that no unintended reduction will affect the categories
taking part in the intended reduction before the right context has been
reduced.) There must be at least one category that is not part of the
context, lest a ``reduction'' would increase the number of scraps. The right
hand side of a rule specifies a category and a string used as a format to
build up the translation; for the common case that the translation is formed
by concatenating the translations of all scraps of the left hand side (not
including those those of the context), a null pointer may be given instead
of a format string. The following field |mask| can be used to specify
selective loading of rule at startup time: any bit set in it will suppress
loading under some condition dependent on the setting of option flags in the
call of \.{\me.}. The field |displacement| is computed at startup time to
record the number of positions (usually negative) by which the parsing
pointer~|pp| should be changed after application of the rule to have any
chance of subsequently matching another (or the same) rule.
@d max_lhs_length 4
@< Typedef and enumeration declarations @>=
typedef struct
{ short id; /* for debugging */
struct
{@; eight_bits category[max_lhs_length]; signed char context,length; } lhs;
struct {@; eight_bits category; char *translation; } rhs;
sixteen_bits mask;
short displacement;
} reduction;
@ We shall organise the rules in a ``trie'' structure for fast matching. Such
a structure is a set of nodes, where each node represents a sequence of
categories that occurs as an initial part of at least one rule; each node
contains links to other nodes that may be reached by adding one more category
to the sequence, and possibly to a rule if its sequence of categories is a
complete left hand side. The very first node is the root, and represents the
empty sequence of categories; each node can be reached by an unique sequence
of links from this root, so the graph structure defined by the links is that
of a directed tree. This kind of structure is suitable for rapid searching,
since at each point where some categories match, the question whether any of
the rules can extend that match with the next category found can be answered
by testing the presence of a single link. If the match fails one could imagine
having some pointer to the node that would be reached by the largest final
sub-sequence of the one already matched that could be the starting point for
the next attempt at a match, which would speed up matching even more. However
we do not include such pointers in our structure, so after a failed match we
must start again at the root of our trie after advancing the place where we
will try to match the first category.
Here are the details of our trie structure. If |q| points to a trie node
reached after matching some sequence of categories, and that sequence
corresponds to the left hand side (including context) of some rule (which
should be unique), then |q->rule| points to that rule, otherwise
|q->rule==NULL|. If that sequence of categories is a proper prefix of the left
hand side of a rule (which may happen whether or not |q->rule==NULL|), and $c$
is the next category in that left hand side (which implies |valid_cat(c)|, and
in particular $c\neq0$), then |q->next[c-1]| is the index of the trie node
reached after a further match of~$c$. The entries of |q->next| that do not
correspond to any such successor node are set to~0, which is unambiguous
because the root of the trie does not figure as a successor of any node. We do
not attempt a sparse representation (which would avoid storage of such 0's),
but we do use a relatively compact |sixteen_bits| representation for the
entries of |q->next|, which saves a considerable amount of space, since there
is a total of |(min_nonrule_cat-1)*max_no_of_nodes| such entries? (The entries
used to be |eight_bits| number, but for a decent grammar for \Cpp\ a limit of
256 trie nodes was too restrictive. One could extend the possibilities of such
small numbers a but by using the fact that the successor of a trie node always
has a strictly greater index that the node itself, so one could store the
difference and change the logic for computing successors below, hoping that
even with more than 256 nodes, one would not need strides larger than~255.
However anno~2006 when memory is measured in gigabytes, so much parsimony of
memory seems ridiculous.)
@<Typedef...@>=
typedef struct {@; reduction* rule; sixteen_bits next[min_nonrule_cat-1]; }
trie_node;
@ Trie nodes are allocated from an array |trie_node_mem|, with the root of
the trie at |trie_node_mem[0]|. The root of the trie is special in that is
allocated initially (namely |node_no| is initialised to~|1|), but it is not
initialised. It corresponds to the state in which no scraps have been matched
yet, and only its |next| field is used; its entries will be set as a
by-product of installing reduction rules. In general a rule will be determined
by the |rule| field of its final node, and appropriate entries of the |next|
fields of certain nodes, which determine the unique path to that final
node.
We introduce some macros that will help us find our way around the trie. The
address the successor of the trie node pointed to by |q| for category |c| can
be written as |successor(q,c)|. The absence of such a successor can be found
by testing |no_successor(q,c)|. If |x| is the address of any node in the tree
(except the root), then we can make that node the successor of |q| for
category |c| by invoking |set_successor(q,c,x)|.
@d trie_root (&trie_node_mem[0])
@d successor(q,c)
(&trie_node_mem[(q)->next[(c)-1]]) /* category codes start at 1 */
@d no_successor(q,c) ((q)->next[(c)-1]==0)
@d set_successor(q,c,x) ((q)->next[(c)-1]=(sixteen_bits)((x)-trie_node_mem))
@<Global...@>=
trie_node trie_node_mem[max_no_of_nodes];
int node_no = 1; /* number of trie nodes allocated */
#ifdef DEBUG
boolean install_failed=false;
#endif
@ Trie nodes are allocated in a straightforward sequential way. We don't
trust that uninitialised statically allocated pointers will be |NULL|
(although they should), especially not on machines where |NULL| is not
represented as ``all bits cleared'', so we do a bit of extra work here. For
the entries of the array |next| we expect no problems however (since
|eight_bits| is an integral type), so we do not explicitly initialise them.
@c trie_node *get_new_trie_node(void)
{ if (node_no>=max_no_of_nodes)
overflow("trie node"); @.trie node capacity exceeded@>
trie_node_mem[node_no].rule=NULL;
return &trie_node_mem[node_no++];
}
@ The function |install_rule| installs a reduction, and if \.{\me.} was
compiled with |DEBUG| set, it also performs some checks on the validity of
the rule; if any check fails the variable |install_failed| is set to |true|.
Since |print| has a variable number of arguments, the macro |rule_error|
does not incorporate them but just prepends |print| to the argument list;
the replacement text of |rule_error| can therefore not be parenthesised, and
the macro should be used with some care. Note however that a call to
|rule_error| (with argument list) can safely by used as the then-branch of an
|if|-statement.
@d rule_error install_failed=true,print
/* report a problematic syntax rule */
@ A global variable |rule_mask| is set at startup time according to the
relevant option flags; any rules~|r| for which |rule_mask & r->mask!=0| are
suppressed. For each set of mutually exclusive settings, a number of bits in
|rule_mask| is reserved equal to the size of the set; the current setting
will have its corresponding bit set while the others are cleared. Therefore
the bits set in |r->mask| specify the option settings for which the rule is
disabled, and the default state |r->rule==0| means that the rule always
applies, regardless of any optional settings.
@< Global... @>= sixteen_bits rule_mask;
@ The function |install_rule| enters a rule into the trie structure, and if
|DEBUG| is defined, performs some sanity checks on the rule.
@< Prototypes @>=
void install_rule (reduction* rule);
@~The length of rules can be found because category~0 is not used for scraps.
@c
void install_rule(reduction *rule)
{ if ((rule_mask & rule->mask)==0)
{ eight_bits* p=rule->lhs.category, i=0;
while (i<max_lhs_length && p[i]!=0) ++i;
rule->lhs.length=i;
#ifdef DEBUG
@< Check left-hand side of |rule| @>
@< Check right-hand side of |rule| @>
#endif
@< Install |rule| in the trie structure @>
@< Compute |rule->displacement|, and if context is specified modify
|rule->lhs.length| and |rule->context| @>
}
}
@ The left-hand side should not be only context, and all categories should
be legal ones.
@< Check left... @>=
{ if (rule->lhs.length<=abs(rule->lhs.context))
rule_error("\nNo scraps to replace in rule %d.\n", rule->id);
@.No scraps to replace...@>
for(i=0; i<rule->lhs.length; ++i)
if (!valid_cat(p[i]))
rule_error("\nUnknown category %d in LHS of rule %d.\n", p[i], rule->id);
@.Unknown category...@>
}
@ The right-hand side should have a valid category, and unless its
translation is |NULL| (or |""| which we treat as if it were |NULL|), it
should contain as many times |'_'| as there are non-context categories in
the left-hand side.
@< Check right... @>=
{ int c=rule->rhs.category; char* s=rule->rhs.translation;
if (!valid_cat(c))
rule_error("\nUnknown category %d in RHS of rule %d.\n", c, rule->id);
@.Unknown category...@>
if (s!=NULL)
{ if (*s=='\0') s=rule->rhs.translation=NULL; /* replace empty string */
else
{ i=0;
do
if (*s!='p') i+= *s++=='_'; /* count underscores */
else if (++s,isdigit((eight_bits)*s)) ++s; /* skip digit and advance */
else rule_error("\nDigit should follow 'p' in format of rule %d.\n"
@.Digit should follow 'p'...@> , rule->id);
while (*s!='\0');
if (i!=rule->lhs.length-abs(rule->lhs.context))
rule_error("\nCount of '_' not equal to length LHS in rule %d.\n"
@.Count of '\_' ...@> , rule->id);
}
}
}
@ Since trie nodes are not represented sparsely, insertion is easy. Just
before moving to a successor node we make sure it exists, by adding one to the
trie if necessary. It does not matter if the final nodes already existed, but
there should not already be a rule attached to it, which would mean that a
previous rule has an identical left hand side (here context is taken into
account to determine equality).
@<Install |rule|...@>=
{ trie_node* q=trie_root;
for (i=0; i<rule->lhs.length; ++i)
{ if (no_successor(q,p[i])) set_successor(q,p[i],get_new_trie_node());
q=successor(q,p[i]);
}
#ifdef DEBUG
if (q->rule!=NULL)
rule_error("\nIdentical left-hand sides in rules %d and %d.\n"
@.Identical left-hand sides...@> , q->rule->id, rule->id);
#endif
q->rule=rule;
}
@ We compute |displacement| conservatively, based on local considerations;
alternatively we might also consider the whole set of rules to find larger
(less negative) values that would make parsing go a bit faster. A rule can
have a left hand side of length |max_lhs_length|. This means that it is safe
to move |pp| so that it will afterwards be |max_lhs_length-1| positions to the
left of the first scrap that may have obtained a new category. That scrap is
the one corresponding to first non-context category in the left hand side of
the rule, unless that category is the same as the category of the right hand
side, in which case it is the next scrap (this is for reduction rules whose
effect on the category sequence is the ``absorption'' of some categories
further to the right; then the mentioned next scrap did not participate in the
reduction and we cannot predict its category (unless there was right context,
but we do not attempt to push that exceptional case any further)).
After a rule has been installed, there is no need to record the full length
of the left hand side, including context, since this is implicit from the
place in the trie where the pointer to this rule is located; rather we store
the number of scraps that will be replaced. Similarly it is more useful to
know the offset of the first scrap to be replaced (which is~0 in case of a
right context) rather than the value of |context| as stored at
initialisation. This means that if no context is specified nothing changes,
and if there are $k$ categories of context then $k$ is subtracted from
|rule->lhs.length| and the offset will be~$k$ in case of a left context,
and~$0$ otherwise. The code below calculates the offset in~|k|, and the
non-positive displacement is obtained by subtracting |max_lhs_length-1| from
|k|, or from |k+1| in the absorption case mentioned above. We check that an
absorption rule actually absorbs at least one scrap; a rule that violates this
requirement would lead to a fruitless reduction, and what is worse to a value
|displacement>0| that would invalidate the logic of the reduction code below.
@< Compute |rule->displacement|... @>=
{ int k=rule->lhs.context,d;
if (k<0) {@; rule->lhs.length+=k; k=0; } @+else rule->lhs.length-=k;
d=k-(max_lhs_length-1); /* this cannot be positive */
if (rule->lhs.category[k]==rule->rhs.category) /* no category change */
{ ++d;
#ifdef DEBUG
if (rule->lhs.length==1)
rule_error("\nNo categories change in rule %d.\n", rule->id);
@.No categories change...@>
#endif
}
rule->lhs.context=k;
rule->displacement=d; /* if positive, an error was reported */
}
@ Once the trie has been installed, the function |match| tests whether the
category pointed to by~|p|, and its successors, match the categories in the
trie structure, starting at the root of the trie, and up to a node that
contains a rule. Only valid categories are taken into account, so in
particular the category~0 used to mark the end of the scrap sequence will stop
the matching loop. Multiple matches are possible, in which case the longest
one takes precedence; this means we use the last non-null |rule| field
encountered. This need not be the one in the last trie node seen, since we may
have successfully matched a category that only occurs in rules that require
some further scrap of a category that is not present (it seems unlikely that
this actually occurs in the given set of rules, but we do not count on it);
therefore we maintain a variable |rule| to store last non-null |rule| field
seen. In any case it may happen that no rule at all has completely matched,
possibly even that no single scrap has matched; in such cases |match|
returns~|NULL|. Note the we avoid using a side effect in the argument of
|valid_cat|, as we must.
@c
reduction *match (scrap_pointer p)
{ trie_node* q=trie_root; reduction* rule=NULL; int c;
while (c=p++->cat,valid_cat(c) && !no_successor(q,c))
if ((q=successor(q,c))->rule!=NULL) rule=q->rule;
return rule;
}
@*1 The parsing mechanism. @:title@>
Conceptually the working sequence of scraps is like a deck of cards, in
which we repeatedly replace a sequence of consecutive cards by a single new
card. Since such replacements never increase the number of cards, we can use
sequential allocation for the current sequence of scraps, and our only
difficulty will be how to conveniently fill the holes that might be left after
each reduction step. Now reduction usually takes place near the beginning of
the scrap sequence (assuming that the scrap sequence makes syntactical sense)
because that is where we are looking first, and we want to avoid shifting
down the whole remainder of the scrap sequence each time. Therefore the
sequence of scraps, which initially occupies the positions from |scrap_base|
to |scrap_ptr|, is allowed to have a hole in its middle, the low and high
end of which are pointed to by variables |lo_ptr| and~|hi_ptr|. There is
also a variable that points to the place where reductions are currently
taking place, which is the parsing pointer~|pp|. It will always point into
the area below the hole, and when it approaches the hole so closely that a
potential reduction might involve scraps from above, the situation is
remedied by sliding down scraps to the lower region, effectively raising the
hole. Therefore the scraps in the higher region are those that have never
been considered for a reduction yet. Eventually all scraps have been moved
down (i.e., we have |hi_ptr==scrap_ptr|), and after that has happened a
scrap with category~0 (which is not otherwise used) is copied down to signal
the imminent end of the reduction process. When finally no more rules
match the scraps in the lower region, the parsing stops.
@< Global variables @>=
scrap_pointer pp; /* current position for reducing scraps */
scrap_pointer lo_ptr; /* end of sequence of scraps that have been examined */
scrap_pointer hi_ptr; /* first scrap that has not been examined */
@ The |mathness| is an attribute of scraps that says whether their
translation is to be processed in a math mode context or not. Since the
translation can be concatenated from a large number of other scraps, there
can be switches in and out of math inside the translation, and we need to
specify the mathness at each of the boundaries. For some scraps it either
makes no difference whether their translation is processed in math mode or
not, or the required mathness is to be determined by the grammatical context
rather than by the scrap itself. Such scraps have mathness |maybe_math| at
both ends; otherwise a definite mathness is specified at either end. The
least significant pair of bits of the |mathness| field of a scrap controls
the right boundary, and the pair of bits to its left controls the left
boundary.
@d left_math(a) (a->mathness>>2)
@d right_math(a) (a->mathness&0x3)
@ If we combine two scraps neither of which has mathness |maybe_math| at its
boundaries, then a `\.\$' is inserted in between if and only if the
mathnesses at the common boundary do not agree; if a scrap with |maybe_math|
joins one with a definite mathness, that mathness is propagated across the
former scrap to its other boundary. In order to implement this, we maintain
two mathness values while building up a text: |init_mathness| and
|cur_mathness| which represent the values at the left and right boundaries
of the part contributed so far; these are local variables of whatever
function is concatenating translations, but they should be called by these
names since they are addressed by the macros below. As a consequence of the
left-to-right order of combining translations, a |maybe_math| scrap that is
combined with scraps with definite mathnesses, will actually be set in the
mode inherited from its left (unless it appears as the leftmost scrap in a
reduction); this can be used to make certain symbols, such as colons, behave
in two slightly different ways depending on their syntactic function. (This
method is not infallible however, as a comment following the symbol will
always force it to be processed in horizontal mode; this happens because
|insert| scraps are tacked onto the scrap before them before any ordinary
reduction can affect it.)
Before scraps requiring some definite mathness are contributed, we invoke
either |set_mode(yes_math)| or |set_mode(no_math)| as appropriate; the first
time this happens will determine the value of |init_mathness|.
@d set_mode(x)
if (cur_mathness==maybe_math) cur_mathness=init_mathness=x;
else if (cur_mathness!=x) {@; app('$'); cur_mathness=x; }
@q $ emacs-cookie @>
else @; /* expect semicolon */
@ The macro |app_trans| is invoked with a |scrap_pointer| as argument,
and appends its translation as a single token; |add_trans| will in addition
to this administrate |init_mathness| and |cur_mathness|, and interpolate
any necessary math shifts.
@d app_trans(a) app_tok(text_flag+text_index((a)->trans))
@d add_trans(a)
{ scrap_pointer scr=a; /* so that evaluating |a| may have side effects */
if (left_math(scr)!=maybe_math)
{ if (cur_mathness==maybe_math) init_mathness=left_math(scr);
else if (cur_mathness!=left_math(scr)) app('$');
@q $ emacs-cookie @>
cur_mathness=right_math(scr);
}
app_trans(scr);
}
@ The function call |fuse(s,n)| will concatenate the translations of |n|
scraps starting with |*s|, taking care of the mathnesses, and install the
resulting text into |s->trans|.
@c
void fuse (scrap_pointer s, int n)
{ int cur_mathness=maybe_math, init_mathness=maybe_math; scrap_pointer p=s;
check_toks(n); check_text();
do add_trans(p++)@; while (--n>0); /* gather all the translations */
s->trans=text_ptr; freeze_text();
s->mathness=(init_mathness<<2)+cur_mathness;
}
@ When a matching rule has been found, the function |reduce| is called to
perform the corresponding actions. At that point |pp| points to the first
scrap involved in the match, and the argument |rule| to |reduce| points to
the matching rule.
If a rule has a left hand side of length~1 (not counting context) and also
the default translation (plain concatenation), then all that is to be done
is to change the category of a scrap, and part of the processing can be
skipped.
@c
@< Utility functions needed during reduction @>@;
void reduce (reduction* rule)
{ int k=rule->lhs.context, l=rule->lhs.length;
scrap_pointer s = pp+k, p=s;/* position of the new scrap */
char* f=rule->rhs.translation; /* format string for translation */
s->cat=rule->rhs.category;
if (l>1 || f!=NULL) /* otherwise ready already */
{ if (f==NULL) fuse(s,l), p+=l; /* default translation */
else @<Generate token list according to format string |f| @>
if (l>1) @<Fill vacant scrap positions@>
}
@<Print a snapshot of the scrap list if debugging@>
@<Change |pp| to |max(pp+rule->displacement,scrap_base)| @>
}
@ When we have applied a reduction to the sequence of scraps, we usually
remove scraps (we never create more scraps than we remove), thereby creating
a small ``hole'' in the sequence. We fix that hole by sliding scraps
downward across it, thereby moving the hole upwards, until it reaches the
``official'' hole at |lo_ptr|; then |lo_ptr| is adjusted so that the small
hole is incorporated in the official hole. During the translation process
the pointer~|p| was moved across all scraps that took part in the reduction,
so the scraps to move are at positions~|i| with |p<=i<lo_ptr|.
@<Fill vacant...@>=
{ scrap_pointer q=s+1; /* position after the newly formed scrap */
while (p<lo_ptr) *q++=*p++;
lo_ptr=q;
}
@ Here we are careful not to create any pointers below |scrap_base|,
since they are not guaranteed to exist (but mentioning them in the module
name will do no harm). The code below uses the fact that
|rule->displacement<=0|.
@<Change |pp| to...@>=
if (pp<scrap_base-rule->displacement) pp=scrap_base; @+
else pp+=rule->displacement;
@ We need no extensive coding mechanism for describing translations, since
they all follow a similar pattern. In all cases all the translations of the
scraps in the left hand side (not including the context scraps) are used
exactly once, in left to right order (violation of these principles would
result in very strange effects indeed for the printed output). The only
things that need to be added are formatting controls like |indent|, |force|
or |break_space|, math shifts (but these are already taken care of by
|add_trans|), and white space. We may also specify calls of
|make_underlined| or |make_unreserved| for certain scraps in the left hand
side. Although the items inserted are of a modest variety, one should
realise that their presence is the only reason we need to parse at all;
without them the translations could have been computed by purely lexical
means.
In format strings an underscore indicates the translation of the next scrap
of the left hand side. Other characters each encode a formatting control; the
character `\.p' encodes |opt| and is followed by a digit that becomes its
argument. The characters `\.!' and `\.\$' respectively cause
|make_underlined| and |make_nonreserved| to be called for the next scrap;
`\.\ ' and~`\.\~' produce a space in the translation, where the latter is
non-breakable and the former forces horizontal mode. To just force
horizontal or math mode there are `\.h'~and~`\.m'; the latter avoids the
possibility of a completely empty formula by adding a space inside math
mode. The precise meaning of these and other formatting characters is easily
read off from the code below. The `\.o' and `\.r' format characters (the
latter is used only in compatibility mode) affect the math category used by
\TeX\ for the next symbol (character or control sequence); the syntax rules
using them do not put braces around that symbol, since these could also
capture a following comment, causing a \TeX\ error.
The number of free tokens we require to be available by calling |check_toks|
is a conservative estimate, based on a hypothetic ``worst case'' reduction
rule with a left hand side of length~4 and with 8 additional items in its
translation (counting format codes with multi-character translations with
multiplicity), whose mathnesses alternate, so that a maximal number of math
shifts is required.
@< Generate token list... @>=
{ int cur_mathness=maybe_math, init_mathness=maybe_math;
check_toks(23); check_text();
do
switch (*f++) @:format@>
{ case '+': app(indent); break;
case '-': app(outdent); break;
case 'p': app(opt); app(*f++); break; /* penalty with numeric argument */
case 'f': set_mode(no_math); app(force); break;
case 'F': set_mode(no_math); app(big_force); break;
case 'b': set_mode(no_math); app(backup); break;
case 'B': set_mode(no_math); app(break_space); break;
case 't': set_mode(yes_math); app_str("\\a"); break; @.\\a@>
/* next item in tab space */
case ',': set_mode(yes_math); app_str("\\,"); break; @.\\,@>
/* thin space */
case 'h': set_mode(no_math); break; /* force horizontal mode */
case 'm': set_mode(yes_math); app(' '); break;
/* force math mode, avoid `\.{\$\$}' */
case 'o': set_mode(yes_math); app_str("\\m"); break; @.\\m@>
/* make ``mathord'' */
case 'r': set_mode(yes_math); app_str("\\MRL"); break; @.\\MRL@>
/* make ``mathrel'' */
case 'a': set_mode(yes_math); app_str("\\ang"); break; @.\\ang@>
/* change `\.<' or `\.>' to `$\ang<$' or~$\ang>$' */
case '!': make_underlined(p); break;
case '$': make_nonreserved(p); break; @q $ emacs-cookie @>
case ' ': set_mode(no_math); app(' '); break;
case '~': case '{': case '}': app(f[-1]); break;
/* insert character literally */
default: printf("%c: ",f[-1]);
confusion("illegal character in format string");
case '_': add_trans(p++);
}
while (*f!='\0');
s->trans=text_ptr; freeze_text();
s->mathness=(init_mathness<<2)+cur_mathness;
}
@ An |int_like| identifier following a |struct_like| token, a selection
operator (`|.|' or `|->|'), or the special code \:;, is to be typeset as an
ordinary identifier. The function |make_nonreserved| alters the flag of the
token representing such an identifier occurrence. The scrap representing the
|int_like| identifier should not be formed by any reduction, but come directly
from |C_read|, so in principle we expect the translation of our scrap to be an
unnested text consisting of a single token, and this text is only used as
translation of our scrap, so we can modify its token in place. However, a
comment (or other |insert|) directly following the |int_like| identifier may
complicate this picture slightly, because |insert| scraps are tacked onto the
previous scrap before it gets the chance to take part in any reduction; this
means our token may be buried inside one or more levels of text nesting, but
still is the very first token of the translation. Any such levels of nesting
are soaked off by the |while| loop below; after this process a single-token
text should remain, possibly containing a reserved word. If this is the case
we replace |res_flag| by |id_flag| to make it print as an ordinary identifier.
@< Utility functions needed during reduction @>=
void make_nonreserved (scrap_pointer p)
{ text_pointer q=p->trans; token t;
while (text_flag<=(t=text_begin(q)[0]) && t<inner_text_flag)
q=text_at(t-text_flag);
if (text_end(q)==text_begin(q)+1 && res_flag<=t && t<mod_flag)
text_begin(q)[0] = t-res_flag+id_flag;
}
@ We wish to mark defining occurrences of identifiers, triggered by the
matching of certain reduction rules. Thanks to the unsurpassed
declaration syntax of~\Cee, it is a non-trivial task to locate the
identifier in question, since it may already have been wrapped up in a more
or less complicated declarator or function heading. Fortunately the
identifier is always the first one occurring in that expression
\unskip\footnote{${}^\dagger$}
{Due to the possible occurrence of the keywords |const| and |volatile| in
strange places within declarations, this is not entirely true. We will
arrange things however in such a way that unless we are forced otherwise by
enclosing parentheses, these keywords will remain outside the text that is
passed to |first_ident|. If we ignore the possibility of specifying
redundant parentheses, this only causes problems when one wishes to
declare certain peculiar things like constant function pointers.
}, so a left-to-right scan through the, possibly nested, text forming the
translation of the expression should reveal it. We accept either an ordinary
identifier or a reserved word; the latter possibility is needed for typedef
declarations. @^recursion@>
@< Utility functions needed during reduction @>=
id_pointer first_ident(text_pointer p)
{ token_pointer q; token t;
if (p>=text_ptr) confusion("first_ident"); @.first\_ident@>
for (q=text_begin(p); q<text_end(p); ++q)
if (id_flag<=(t=*q) && t<mod_flag) return id_at(t%id_flag);
else if (text_flag<=t) /* text or inner text */
{@; id_pointer r=first_ident(text_at(t%id_flag));
if (r!=NULL) return r;
}
return NULL;
}
@ In the following situations we want to mark the occurrence of an identifier
as a definition: in a declaration of a variable, as |r| in `|xref_pointer
*r@;|', when a label is declared, as |found| in `|found:|', in the declaration
of a function, as |f| in `|char* f(int n){@t\dots@>@;}|', after a
|struct_like| token (declaring a structure-, union-, or enumeration tag) and
for all identifiers in an enumeration list. This is accomplished by the
invocation of |make_underlined| at appropriate times, which modifies the cross
reference list for this identifier.
@< Utility functions needed during reduction @>=
void make_underlined (scrap_pointer p)
/* underline entry for first identifier in |p->trans| */
{ id_pointer name=first_ident(p->trans); /* name of first identifier */
if (name==NULL) return;
/* this happens for unsyntactical things like `|int 3;|' */
{ sixteen_bits head=xref_index(name->xref),* r=&head; int n;
while ((n=xnum(*r)&num_mask)!=0 && n<section_count) r=&xlink(*r);
if (n==section_count) xnum(*r)|=def_flag;
else /* this may happen for one-letter identifiers */
{ make_xref(section_count+def_flag,*r);
if (r==&head) name->xref=xref_ptr; @+ else *r=xref_index(xref_ptr);
}
}
}
@ We have now seen how a match is made, and what is done once a matching
rule is found; we still have to consider how everything is set up properly,
and how rules are repeatedly applied until no more reduction is possible,
responding properly to successful and failing matches. All this is
performed by the function |translate|; as we have seen it is called by
|do_C| and |finish_C| after scraps have been stored from |scrap_base|
to |scrap_ptr| of the |scrap_info| array, and it returns a pointer to the
text representing the result of parsing all those scraps.
We start with appending a dummy scrap if either no scraps are present at
all, or some tokens remain that have not been packed into a scrap yet; the
latter can only be due to `\.{@@t...@@>}' as a final item in compatibility
mode. Then we set |lo_ptr| and |hi_ptr| appropriately, and begin to apply
rules as long as possible. When this is done and more than one scrap
remains, their translations are wrapped rather bluntly together to a single
text.
@:translate definition@>
@c
text_pointer translate (void) /* converts a sequence of scraps */
{ pp=lo_ptr=hi_ptr=scrap_base;
if (scrap_ptr==pp || dangling_tokens()) /* then append dummy scrap */
{@; check_scrap(); pack_scrap(insert,no_math); }
@< If tracing, print an indication of where we are @>
@< Reduce the scraps using the rules until no more rules apply @>
@< Combine the irreducible scraps that remain,
and |return| the resulting text @>
}
@ Before applying |match|, we must make sure it has good input (at least
|max_lhs_length| scraps). If a match at |pp| exists, |reduce| will perform
the required processing and updating of |pp| (in this case |pp| is
certainly not increased), if not, we move to the right and try again.
@< Reduce the scraps... @>=
do
{ reduction *rule;
@< Make sure the entries |pp| through |pp+max_lhs_length-1|
of |scrap_info| are defined, or that |lo_ptr->cat==0| @>
if ((rule=match(pp))!=NULL) reduce(rule);
else
{@; ++pp;
@< Take special action if |pp->cat| is |end_expr| or |rproc| @>
}
}
while (pp<lo_ptr);
@ If we get to the end of the scrap list, we set |lo_ptr->cat=0|, which will
prevent any matches to scraps beyond those that are actually present. All
scraps of category |insert| pass across the hole between |lo_ptr| and
|hi_ptr|, and we take the opportunity to remove them, tacking their
translation to the scrap below the hole. We even make sure that
|hi_ptr->cat!=insert|, so that the scraps with which the |insert| scrap is
combined will not have undergone any ordinary reduction yet. The only
possible remaining |insert| scrap is one at the very start of the list; it
will be handled at the end of all reductions.
@< Make sure the entries... @>=
{ scrap_pointer lo_min = pp+max_lhs_length;
while (lo_ptr<lo_min && lo_ptr->cat!=0)
if (hi_ptr>=scrap_ptr) lo_ptr->cat=0;
else
{ *lo_ptr++ = *hi_ptr++;
while (hi_ptr<scrap_ptr && hi_ptr->cat==insert)
{@; *lo_ptr = *hi_ptr++; fuse(lo_ptr-1,2); }
}
}
@ The category pairs |lproc|--|rproc| and |begin_expr|--|end_expr| are
special in that the don't occur in any rules, but rather serve only as
markers. When the material in between will not reduce any further the whole
construction will be wrapped up, their translations concatenated, and the
result treated as an |insert| respectively an |expression|. We do not
actually form an |insert| scrap if it is not at the start of the scrap
sequence, but rather combine everything directly with the scrap immediately
before the |lproc| scrap, leaving the category of that scrap as it is; this
is necessary, because the code that combines ordinary |insert| scraps with
their predecessor only looks at scraps when they cross the hole from |hi_ptr|
to~|lo_ptr|. We use here that |rproc-lproc==end_expr-begin_expr==1|. The
variable names |s| and |p| are chosen with the same meaning as in |reduce|,
so that we could reuse a module.
@< Take special action... @>=
if (pp->cat==end_expr || pp->cat==rproc)
{ int start=pp->cat-1; /* the opening category matching |pp->cat| */
scrap_pointer s=pp, p=pp+1;
while ((--s)->cat!=start && s>scrap_base) {}
if (s->cat==start) /* if opening symbol is missing, take no action */
{ if (start==begin_expr) s->cat=expression;
else if (s==scrap_base) s->cat=insert; @+
else --s; /* position of new scrap */
fuse(s,(int)(p-s));
@< Fill vacant scrap positions @> /* using values of |p| and |s| */
pp= s-scrap_base<max_lhs_length ? scrap_base : s+1-max_lhs_length;
}
}
@ If the initial sequence of scraps does not reduce to a single scrap,
we concatenate the translations of all remaining scraps, separated by
blank spaces, with dollar signs surrounding the translations of scraps
where appropriate. Because of this last action we must build a new text
even if only one scrap remains.
@< Combine the irreducible... @>=
{ scrap_pointer j;
if (scrap_base->cat==insert && lo_ptr>scrap_base+1)
{ fuse(scrap_base,2); /* merge initial |insert| into the next scrap */
j=scrap_base; j->cat=j[1].cat; --lo_ptr; while (++j<lo_ptr) *j=j[1];
}
@<If semi-tracing, show the irreducible scraps@>
check_toks(1);
for (j=scrap_base; j<lo_ptr; j++)
{ if (j!=scrap_base) app_char_tok(' ');
if (left_math(j)==yes_math) app('$');
app_trans(j);
if (right_math(j)==yes_math) app('$');
}
freeze_text();
return text_ptr-1;
}
@ If \.{\me.} is compiled with |DEBUG| defined, it can be put into
debugging mode by setting |tracing| to a positive value by means of \:1,
\:2, or \:3. When |tracing| is set to level~1 any sequence of two or more
irreducible scraps remaining at the end of a call to |translate| will be
printed out. When |tracing| is set to level~2, the rule numbers and current
scrap categories will be printed out after each reduction that takes place,
and if it is set to level~3, then mathnesses at the boundaries of scraps
will also be indicated.
@<Global...@>=
#ifdef DEBUG
int tracing; /* how much parsing details to show */
#endif
@ When parsing fails to reduce everything to a single scrap, pleasing results
will probably not be obtained; it is therefore advisable to run
\.{\me.} with |tracing==trace1| before a final version of a |CWEB|
program is fixed. In order to allow this without changing the source file
itself, we initialise |tracing| to |trace1| if the flag `\.{+d}' is supplied
to \.{\me.}.
@< Set initial values @>=
#ifdef DEBUG
tracing = flags['d'] ? trace1 : trace0;
#endif
@ The following code is activated by the `\.{+d}' flag or the \:1 control
code.
@<If semi-tracing, show the irreducible scraps@>=
#ifdef DEBUG
{ if (tracing==trace1 && lo_ptr>=scrap_base+2)
{ print("\nIrreducible scrap sequence at line %d in section %d:\n"
@.Irreducible scrap sequence...@> ,cur_line, section_count);
mark_harmless();
for (j=scrap_base; j<lo_ptr-1; j++)
print_cat(j->cat), putchar(' ');
print_cat(j->cat); new_line(); /* |term_line_empty| is still valid */
}
}
#endif
@ When full tracing is enabled the following message indicates which piece
of \Cee~text is being parsed (but for section bodies it will generally show
the first line of the next section, since that has already been fetched).
@<If tracing,...@>=
#ifdef DEBUG
{ if (tracing>=trace2)
{ print("\nTracing after l.%d:\n", cur_line); @.Tracing after...@>
if (loc>buffer+50) /* shorten long lines to keep within margins */
{@; printf("..."); term_write (loc-50,50); }
else term_write(buffer,loc-buffer);
new_line(); /* |term_line_empty| is still valid */
}
}
#endif
@ After each reduction, full tracing will print a line starting with the
rule number, followed by a display of all the categories of scraps which
have been considered until now, i.e., those at positions below |lo_ptr|.
The scrap that was produced by this reduction, which is pointed to by~|s|,
has its category highlighted by enclosing it in inverted angle brackets.
If tracing is set to~3, extra-full tracing is active, and mathnesses at the
boundaries of scraps are indicated.
@d math_char(x) ((x)==yes_math ? '+' : (x)==no_math ? '-' : '?')
@<Print a snapsh...@>=
#ifdef DEBUG
{ scrap_pointer k; /* pointer into |scrap_info| */
if (tracing>=trace2)
{ print("\n%3d:", rule->id);
for (k=scrap_base; k<lo_ptr; k++)
{ putchar (' ');
if (tracing==trace3) putchar(math_char(left_math(k)));
if (k==s) putchar('>'), print_cat(k->cat), putchar('<');
else print_cat(k->cat);
if (tracing==trace3) putchar(math_char(right_math(k)));
}
print("%s\n", hi_ptr<scrap_ptr ? " ..." : ".");
}
}
#endif
|