1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201
|
# Parse trees
Parse tree is a first structure you get from a successful parse.
---
Parse tree or concrete syntax tree is a tree structure built from the input
string during parsing. It represent the structure of the input string. Each
node in the parse tree is either a [terminal](#terminal-nodes) or
[non-terminal](#non-terminal-nodes). Terminals are the leafs of the tree while
the inner nodes are non-terminals.
Here is an example parse tree for the `calc` grammar and the expression
`-(4-1)*5+(2+4.67)+5.89/(.2+7)`:
<a href="../images/calc_parse_tree.dot.png" target="_blank"><img src="../images/calc_parse_tree.dot.png"/></a>
Each non-leaf node is non-terminal. The name in this nodes are the names of the
grammar PEG rules that created them.
The leaf nodes are terminals and they are matched by the _string match_ or _regex
match_ rules.
In the square brackets is the location in the input stream where the
terminal/non-terminal is recognized.
Each parse tree node has the following attributes:
- **rule** - the parsing expression that created this node.
- **rule_name** - the name of the rule if it was the root rule or empty string
otherwise.
- **position** - the position in the input stream where this node was
recognized.
- **position_end** - the end of the node in the input stream. This index is one
char behind the last char that belongs to this node. Thus, `position_end -
position == length of the node`.
If you want to get line and column from position you can use `pos_to_linecol`
parser method.
```python
line, col = parser.pos_to_linecol(node.position)
```
## Terminal nodes
Terminals in Arpeggio are created by the specializations of the parsing
expression `Match` class. There are two specialization of `Match` class:
- `StrMatch` if the literal string is matched from the input or
- `RegExMatch` if a regular expression is used to match input.
To get the matched string from the terminal object just convert it to string
(e.g. `str(t)` where `t` is of `Terminal` type).
## Non-terminal nodes
Non-terminal nodes are non-leaf nodes of the parse tree. They are created by PEG
grammar rules. Children of non-terminals can be other non-terminals or
terminals.
For example, nodes with the labels `expression`, `factor` and `term` from
the above parse tree are non-terminal nodes created by the rules with the same
names.
`NonTerminal` inherits from `list`. The elements of `NonTerminal` are its
children nodes. So, you can use index access:
```python
child = pt_node[2]
```
Or iteration:
```python
for child in pt_node:
...
```
Additionally, you can access children by the child rule name:
For example:
```python
# Grammar
def foo(): return "a", bar, "b", baz, "c", ZeroOrMore(bar)
def bar(): return "bar"
def baz(): return "baz"
# Parsing
parser = ParserPython(foo)
result = parser.parse("a bar b baz c bar bar bar")
# Accessing parse tree nodes. All asserts will pass.
# Index access
assert result[1].rule_name == 'bar'
# Access by rule name
assert result.bar.rule_name == 'bar'
# There are 8 children nodes of the root 'result' node.
# Each child is a terminal in this case.
assert len(result) == 8
# There is 4 bar matched from result (at the beginning and from ZeroOrMore)
# Dot access collect all NTs from the given path
assert len(result.bar) == 4
# You could call dot access recursively, e.g. result.bar.baz if the
# rule bar called baz. In that case all bars would be collected from
# the root and for each bar all baz will be collected.
# Verify position
# First 'bar' is at position 2 and second is at position 14
assert result.bar[0].position == 2
assert result.bar[1].position == 14
```
## Printing parse trees
Elements of the parse tree has `tree_str` method which can be used to print the
tree.
For example:
```python
...
parse_tree = parser.parse(some_input)
print(parse_tree.tree_str())
```
Result might look something like:
```
simpleLanguage=Sequence [0-191]
keyword=Kwd(function) [0-8]: function
symbol=RegExMatch(\w+) [9-12]: fak
parameterlist=Sequence [13-20]
symbol=RegExMatch(\w+) [13-14]: n
symbol=RegExMatch(\w+) [16-17]: x
symbol=RegExMatch(\w+) [19-20]: p
block=Sequence [22-191]
StrMatch({) [22-23]: {
statement=Sequence [28-189]
ifstatement=Sequence [28-188]
keyword=Kwd(if) [28-30]: if
StrMatch(() [31-32]: (
expression=OrderedChoice [32-36]
operation=Sequence [32-36]
symbol=RegExMatch(\w+) [32-33]: n
operator=RegExMatch(\+|\-|\*|\/|\=\=) [33-35]: ==
literal=RegExMatch(\d*\.\d*|\d+|".*?") [35-36]: 0
StrMatch()) [36-37]: )
block=Sequence [38-93]
StrMatch({) [38-39]: {
statement=Sequence [78-87]
returnstatement=Sequence [78-86]
keyword=Kwd(return) [78-84]: return
expression=OrderedChoice [85-86]
literal=RegExMatch(\d*\.\d*|\d+|".*?") [85-86]: 0
StrMatch(;) [86-87]: ;
```
Each line contains information about the rule and the span of the input where
the match occurred.
## Parse tree reduction
Parser can be configured to create a reduced parse tree. More information can be
found [here](configuration.md#parse-tree-reduction).
## Suppressing parse tree nodes
This feature is available for `ParserPython` only. Rule classes may have a class
level `suppress` attribute set to `True`. Parse tree nodes produced by these
rules will be removed from the resulting parse tree. This may be handy to reduce
syntax noise from the punctuation and such in the resulting parse tree.
```python
class SuppressStrMatch(StrMatch):
suppress = True
def grammar():
return "one", "two", SuppressStrMatch("three"), "four"
parser = ParserPython(grammar)
result = parser.parse("one two three four")
assert len(result) == 3
assert result[1] == "two"
assert result[2] == "four"
```
This feature can nicely be combined with [overriding of special rule
classes](grammars.md#overriding-of-special-rule-classes).
|