1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
|
PE Grammar Tree, after ExprMode Computation
===========================================
This file documents the augmentations to a Normalized PE Grammar Tree
as inserted by the transformational package 'pg::peg::emodes'. The
input to the transformation is assumed to be a 'Normalized PE Grammar
Tree' as described in 'doc_normalize.txt', possibly with augmentations
from other transformation, as long as they are not in conflict with
the augmentations specified here.
Background
----------
The purpose of the computations and transformations specified here is
to determine as much static information as we can about the generation
of semantic values from terminal and non-terminal matchers. Such
information can then be used to reduce the places in a parser which
generate and/or handle AST nodes to the bare minimum. I.e we wish to
avoid the generation of AST nodes by the matcher for which we are sure
that they will be thrown away by the matcher either immediately, or
later on.
The fundamental information from which everything else is derived are
the optional 'mode's which can specified for non-terminal symbols (See
nonterminal 'Attribute' in the PEG Grammar). They are the hints to the
matcher on how semantic values are generated and used.
The mode of a nonterminal symbol controls three things at the same
time: What it does with semantic values generated by its associated
expression, if it generates its own semantic value, and how. These
three aspects are interelated to each other, hence using one piece of
information to set all of them. For this discussion the aspect of
"how" is not relevant, except where it intersects with the other two.
For the first aspect, the nonterminal has two choices, to either keep
the semantic value from its expression, or to discard it. I.e this is
about accepting versus rejection of the value. This aspect is
'Acceptance'.
For the second aspect the nonterminal has three choices, to either
generate a value, to not generate a value at all, or to simply pass
through whatever was done by its expression (a maybe). This aspect is
'Generation'.
The interelation of these aspects shows up as impossible
combinations when selecting a value from each set, as made explicit in
the next table:
Generation x Acceptance -> Mode
---------- ---------- ----
Maybe Yes value
Yes Yes value
No Yes -- impossible --
Maybe No -- impossible --
Yes No match, leaf [1]
No No discard
---------- ---------- ----
[1] Here the third aspect, the "how" of the generation comes
into play to distinguish these two modes. As said, this
distinction is not relevant for the current discussion.
What we now wish to achieve is to define the vales for these two
aspects for all expression and definition modes, which, starting from
the settings of the modes by the input grammar, is the most definite,
restrictive and conservative as possible.
- Most definite, try to remove as many Maybe's as possible in favor of
Yes or No.
- Most restrictive, prefer a No over Yes when trying to remove a Maybe.
- Most conversative, do not try to remove a Maybe if we are not truly
sure about its value.
The two node properties computes are
* gen(X): Node -> {Yes, No, Maybe}
Yes: The node X definitely generates a semantic
value.
No: The node X definitely does not generate a
semantic value.
Maybe: The node X passes semantic values, should they
be generated by a child of X. We do not know
if values are generated or not.
A derivative function is gen', defined as
gen'(X,G): Node x {Yes, No, Pass} -> Bool
gen'(X,G) = (gen(X) == G)
Conversely we could define gen in terms of gen' as
gen (X) = G, iff gen'(X,G)
* acc(X): Node -> Bool
True: The node X will keep semantic values generated
by its children.
False: The node X will throw any semantic values
generated by its children away.
General information
-------------------
* The specification of 'doc_normalize.txt' applies fully.
Structure
---------
* The structure of the tree is unchanged.
Attributes
----------
Name Type Details
---- ---- -------
pg::peg::emodes
none
Root node only. The presence of this attribute
indicates that the emode computation has been been
executed on the grammar and that we have valid
results.
Transformations whose execution invalidates the emode
information have to remove this marker.
==== ==== =======
gen enum Expression and definition nodes only. Values are in
{yes, no, maybe}. Contains the value of gen(X).
---- ---- -------
acc bool Expression and definition nodes only. Contains the
value of acc(X).
---- ---- -------
Transformation
--------------
It is possible to specify a mode transformation based on the mode
computation. It would resolve discontinuities in the gen/acc stati
(Use of expressions generating a value where none is accepted) by
duplicating relevant nonterminal definitions and forcing them into
specific modes (discard). If expressions to be duplicated contain
calls to undefined nonterminal symbols then the new definitions will
do so as well.
This transformation will not be written for now. The reason for that
is that it essentially defeats packrat parsing.
The definitions which are duplicated are used in both contexts which
accept and such which do not accept the generated value, otherwise the
regular analysis would have mode the definition itself non-accepting
and non-generating.
While it is possible to put the match status of both the original and
duplicated definition into the packrat cache under the same label
there is one important distinction which cannot be avoided: The
duplicated definition does not generate a semantic value. And we
cannot exclude the possibilities that either a cached semantic value
is used in a non-accepting context which is not prepared to handle
this value (by discarding it), or that an accepting context uses
cached information which is short of one expected semantic value.
So the two definitions have to cache their information under different
labels to avoid a mixup. But now it becomes possible that the match
engine has to fully reparse a segment of input during backtracking
despite actually having information about matchability, just under a
different label in the cache. And this then costs us the O(n) of the
packrat parser, pushing it back exponential time-complexity.
Conclusion: The described transformation can be applied if and only if
we have ensured that the matcher will never backtrack in the input for
the grammar in question. In other words, other transformation like
left-factorisation, eliminiation of left-recursion etc. have to be
applied first. In even other words, the grammar has to be LL(1) (which
implies that it will not use any lookahead operators either).
|