File: parse_trees.md

package info (click to toggle)
arpeggio 2.0.2-3
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 3,480 kB
  • sloc: python: 3,198; javascript: 54; sh: 19; makefile: 9
file content (201 lines) | stat: -rw-r--r-- 6,026 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
# Parse trees

Parse tree is a first structure you get from a successful parse.

---

Parse tree or concrete syntax tree is a tree structure built from the input
string during parsing.  It represent the structure of the input string. Each
node in the parse tree is either a [terminal](#terminal-nodes) or
[non-terminal](#non-terminal-nodes). Terminals are the leafs of the tree while
the inner nodes are non-terminals.

Here is an example parse tree for the `calc` grammar and the expression
`-(4-1)*5+(2+4.67)+5.89/(.2+7)`:

<a href="../images/calc_parse_tree.dot.png" target="_blank"><img src="../images/calc_parse_tree.dot.png"/></a>


Each non-leaf node is non-terminal. The name in this nodes are the names of the
grammar PEG rules that created them.

The leaf nodes are terminals and they are matched by the _string match_ or _regex
match_ rules.

In the square brackets is the location in the input stream where the
terminal/non-terminal is recognized.

Each parse tree node has the following attributes:

- **rule** - the parsing expression that created this node.
- **rule_name** - the name of the rule if it was the root rule or empty string
  otherwise.
- **position** - the position in the input stream where this node was
  recognized.
- **position_end** - the end of the node in the input stream. This index is one
  char behind the last char that belongs to this node. Thus, `position_end -
  position == length of the node`.

If you want to get line and column from position you can use `pos_to_linecol`
parser method.

```python
  line, col = parser.pos_to_linecol(node.position)
```


## Terminal nodes

Terminals in Arpeggio are created by the specializations of the parsing
expression `Match` class.  There are two specialization of `Match` class:

- `StrMatch` if the literal string is matched from the input or
- `RegExMatch` if a regular expression is used to match input.

To get the matched string from the terminal object just convert it to string
(e.g. `str(t)` where `t` is of `Terminal` type).


## Non-terminal nodes

Non-terminal nodes are non-leaf nodes of the parse tree. They are created by PEG
grammar rules.  Children of non-terminals can be other non-terminals or
terminals.

For example, nodes with the labels `expression`, `factor` and `term` from
the above parse tree are non-terminal nodes created by the rules with the same
names.

`NonTerminal` inherits from `list`. The elements of `NonTerminal` are its
children nodes.  So, you can use index access:

```python
child = pt_node[2]
```

Or iteration:

```python
for child in pt_node:
  ...
```

Additionally, you can access children by the child rule name:

For example:

```python
# Grammar
def foo(): return "a", bar, "b", baz, "c", ZeroOrMore(bar)
def bar(): return "bar"
def baz(): return "baz"

# Parsing
parser = ParserPython(foo)
result = parser.parse("a bar b baz c bar bar bar")

# Accessing parse tree nodes. All asserts will pass.
# Index access
assert result[1].rule_name  == 'bar'
# Access by rule name
assert result.bar.rule_name == 'bar'

# There are 8 children nodes of the root 'result' node.
# Each child is a terminal in this case.
assert len(result) == 8

# There is 4 bar matched from result (at the beginning and from ZeroOrMore)
# Dot access collect all NTs from the given path
assert len(result.bar) == 4
# You could call dot access recursively, e.g. result.bar.baz if the
# rule bar called baz. In that case all bars would be collected from
# the root and for each bar all baz will be collected.

# Verify position
# First 'bar' is at position 2 and second is at position 14
assert result.bar[0].position == 2
assert result.bar[1].position == 14

```


## Printing parse trees

Elements of the parse tree has `tree_str` method which can be used to print the
tree.

For example:

```python
...
parse_tree = parser.parse(some_input)
print(parse_tree.tree_str())
```

Result might look something like:

```
simpleLanguage=Sequence [0-191]
  keyword=Kwd(function) [0-8]: function
  symbol=RegExMatch(\w+) [9-12]: fak
  parameterlist=Sequence [13-20]
    symbol=RegExMatch(\w+) [13-14]: n
    symbol=RegExMatch(\w+) [16-17]: x
    symbol=RegExMatch(\w+) [19-20]: p
  block=Sequence [22-191]
    StrMatch({) [22-23]: {
    statement=Sequence [28-189]
      ifstatement=Sequence [28-188]
        keyword=Kwd(if) [28-30]: if
        StrMatch(() [31-32]: (
        expression=OrderedChoice [32-36]
          operation=Sequence [32-36]
            symbol=RegExMatch(\w+) [32-33]: n
            operator=RegExMatch(\+|\-|\*|\/|\=\=) [33-35]: ==
            literal=RegExMatch(\d*\.\d*|\d+|".*?") [35-36]: 0
        StrMatch()) [36-37]: )
        block=Sequence [38-93]
          StrMatch({) [38-39]: {
          statement=Sequence [78-87]
            returnstatement=Sequence [78-86]
              keyword=Kwd(return) [78-84]: return
              expression=OrderedChoice [85-86]
                literal=RegExMatch(\d*\.\d*|\d+|".*?") [85-86]: 0
            StrMatch(;) [86-87]: ;

```

Each line contains information about the rule and the span of the input where
the match occurred.


## Parse tree reduction

Parser can be configured to create a reduced parse tree. More information can be
found [here](configuration.md#parse-tree-reduction).


## Suppressing parse tree nodes

This feature is available for `ParserPython` only. Rule classes may have a class
level `suppress` attribute set to `True`. Parse tree nodes produced by these
rules will be removed from the resulting parse tree. This may be handy to reduce
syntax noise from the punctuation and such in the resulting parse tree.

```python
class SuppressStrMatch(StrMatch):
    suppress = True

def grammar():
    return "one", "two", SuppressStrMatch("three"), "four"

parser = ParserPython(grammar)

result = parser.parse("one two three four")
assert len(result) == 3
assert result[1] == "two"
assert result[2] == "four"
```

This feature can nicely be combined with [overriding of special rule
classes](grammars.md#overriding-of-special-rule-classes).