File: doc_emodes.txt

package info (click to toggle)
tcllib 2.0%2Bdfsg-5
links: PTS, VCS
area: main
in suites: forky, sid
size: 83,560 kB
sloc: tcl: 306,798; ansic: 14,272; sh: 3,035; xml: 1,766; yacc: 1,157; pascal: 881; makefile: 124; perl: 84; f90: 84; python: 33; ruby: 13; php: 11
file content (180 lines) | stat: -rw-r--r-- 6,784 bytes
parent folder | download | duplicates (11)
PE Grammar Tree, after ExprMode Computation
===========================================

This file documents the augmentations to a Normalized PE Grammar Tree
as inserted by the transformational package 'pg::peg::emodes'. The
input to the transformation is assumed to be a 'Normalized PE Grammar
Tree' as described in 'doc_normalize.txt', possibly with augmentations
from other transformation, as long as they are not in conflict with
the augmentations specified here.

Background
----------

The purpose of the computations and transformations specified here is
to determine as much static information as we can about the generation
of semantic values from terminal and non-terminal matchers. Such
information can then be used to reduce the places in a parser which
generate and/or handle AST nodes to the bare minimum. I.e we wish to
avoid the generation of AST nodes by the matcher for which we are sure
that they will be thrown away by the matcher either immediately, or
later on.

The fundamental information from which everything else is derived are
the optional 'mode's which can specified for non-terminal symbols (See
nonterminal 'Attribute' in the PEG Grammar). They are the hints to the
matcher on how semantic values are generated and used.

The mode of a nonterminal symbol controls three things at the same
time: What it does with semantic values generated by its associated
expression, if it generates its own semantic value, and how. These
three aspects are interelated to each other, hence using one piece of
information to set all of them. For this discussion the aspect of
"how" is not relevant, except where it intersects with the other two.

For the first aspect, the nonterminal has two choices, to either keep
the semantic value from its expression, or to discard it. I.e this is
about accepting versus rejection of the value. This aspect is
'Acceptance'.

For the second aspect the nonterminal has three choices, to either
generate a value, to not generate a value at all, or to simply pass
through whatever was done by its expression (a maybe). This aspect is
'Generation'.

The interelation of these aspects shows up as impossible
combinations when selecting a value from each set, as made explicit in
the next table:

	Generation x	Acceptance ->	Mode
	----------	----------	----
	Maybe		Yes		value
	Yes		Yes		value
	No		Yes		-- impossible --
	Maybe		No		-- impossible --
	Yes		No		match, leaf   [1]
	No		No		discard
	----------	----------	----

	[1] Here the third aspect, the "how" of the generation comes
	    into play to distinguish these two modes. As said, this
	    distinction is not relevant for the current discussion.


What we now wish to achieve is to define the vales for these two
aspects for all expression and definition modes, which, starting from
the settings of the modes by the input grammar, is the most definite,
restrictive and conservative as possible.

- Most definite, try to remove as many Maybe's as possible in favor of
  Yes or No.

- Most restrictive, prefer a No over Yes when trying to remove a Maybe.

- Most conversative, do not try to remove a Maybe if we are not truly
  sure about its value.

The two node properties computes are

*	gen(X): Node -> {Yes, No, Maybe}

		Yes:	The node X definitely generates a semantic
			value.

		No:	The node X definitely does not generate a
			semantic value.

		Maybe:	The node X passes semantic values, should they
			be generated by a child of X. We do not know
			if values are generated or not.

	A derivative function is gen', defined as

		gen'(X,G): Node x {Yes, No, Pass} -> Bool
		gen'(X,G) = (gen(X) == G)

	Conversely we could define gen in terms of gen' as

		gen (X) = G, iff gen'(X,G)

*	acc(X): Node -> Bool

		True:	The node X will keep semantic values generated
			by its children.

		False:	The node X will throw any semantic values
			generated by its children away.

General information
-------------------

* The specification of 'doc_normalize.txt' applies fully.

Structure
---------

* The structure of the tree is unchanged.

Attributes
----------

Name	Type	Details
----	----	-------
pg::peg::emodes
	none
		Root node only. The presence of this attribute
		indicates that the emode computation has been been
		executed on the grammar and that we have valid
		results.

		Transformations whose execution invalidates the emode
		information have to remove this marker.
====	====	=======
gen	enum	Expression and definition nodes only. Values are in
		{yes, no, maybe}. Contains the value of gen(X).
----	----	-------
acc	bool	Expression and definition nodes only. Contains the
		value of acc(X).
----	----	-------

Transformation
--------------

It is possible to specify a mode transformation based on the mode
computation. It would resolve discontinuities in the gen/acc stati
(Use of expressions generating a value where none is accepted) by
duplicating relevant nonterminal definitions and forcing them into
specific modes (discard). If expressions to be duplicated contain
calls to undefined nonterminal symbols then the new definitions will
do so as well.

This transformation will not be written for now. The reason for that
is that it essentially defeats packrat parsing.

The definitions which are duplicated are used in both contexts which
accept and such which do not accept the generated value, otherwise the
regular analysis would have mode the definition itself non-accepting
and non-generating.

While it is possible to put the match status of both the original and
duplicated definition into the packrat cache under the same label
there is one important distinction which cannot be avoided: The
duplicated definition does not generate a semantic value. And we
cannot exclude the possibilities that either a cached semantic value
is used in a non-accepting context which is not prepared to handle
this value (by discarding it), or that an accepting context uses
cached information which is short of one expected semantic value.

So the two definitions have to cache their information under different
labels to avoid a mixup. But now it becomes possible that the match
engine has to fully reparse a segment of input during backtracking
despite actually having information about matchability, just under a
different label in the cache. And this then costs us the O(n) of the
packrat parser, pushing it back exponential time-complexity.

Conclusion: The described transformation can be applied if and only if
we have ensured that the matcher will never backtrack in the input for
the grammar in question. In other words, other transformation like
left-factorisation, eliminiation of left-recursion etc. have to be
applied first. In even other words, the grammar has to be LL(1) (which
implies that it will not use any lookahead operators either).