File: ParsingBasics.md

package info (click to toggle)
swiftlang 6.0.3-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 2,519,992 kB
  • sloc: cpp: 9,107,863; ansic: 2,040,022; asm: 1,135,751; python: 296,500; objc: 82,456; f90: 60,502; lisp: 34,951; pascal: 19,946; sh: 18,133; perl: 7,482; ml: 4,937; javascript: 4,117; makefile: 3,840; awk: 3,535; xml: 914; fortran: 619; cs: 573; ruby: 573
file content (202 lines) | stat: -rw-r--r-- 8,329 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
# Parsing Basics

Discover the basics of parsing the Swift grammar

## Overview

The Swift programming language follows a regular structure called a _grammar_
that is composed of rules that direct a conforming implementation on what
inputs constitute valid Swift programs. These rules also come with a set of 
conditions under which they can be executed. We refer to a rule that has syntax
as its output as a _production_. The production rule for parsing an optional
type from the Swift Book is reproduced below:

```
optional-type → type '?'
```

The `optional-type` production directs us to first parse a `type` production,
then parse a `?` token. Productions may be recursive - as in the reference to
`type` which contains `optional-type` as one of its child productions -
or conditional. This is usually denoted with a `|` as in:

```
metatype-type → type '.' 'Type' | type '.' 'Protocol'
```

This production directs us to first parse a type, then a '.' character,
and finally attempt to parse the identifier 'Type'. If 'Type' cannot be found,
we are to reset and try the right-side rule which directs us to instead parse
'Protocol'. 

- Note: The left and right conditions of the `metatype-type` production 
share the same structure up to the last identifier, it is natural to implement 
this rule as first parsing a type, then a dot, then checking to see if 'Type' or 
'Protocol' is present, rather than backing up and trying again when the first
rule fails.

A production can also account for syntactic elements that are allowed to be
absent from the input stream - denoted by `?`. Here, the `class-body` production
represents a class body that may contain zero or more `class-member` elements
surrounded by curly braces:

```
class-body → '{' class-members? '}'
```

Finally, a production can describe a sequence of syntactic elements by making
recursive reference to itself. The `class-members` production directs us to
parse a `class-member`, then to recursively parse another `class-member` if
possible:

```
class-members → class-member class-members?
```

## Productions as Syntax Nodes

The Swift Parser is built atop a framework that faithfully represents Swift
source code called SwiftSyntax. SwiftSyntax can be viewed as an encoding of the 
production rules of the Swift grammar in a tree-shaped data structure called a
syntax tree.

To take the examples above, `RawOptionalTypeSyntax` stands in for the output of the
`optional-type` production, and `RawMetatypeTypeSyntax` stands in for the output
of the `metatype-type` production. The structure of these nodes reflects the
structure of the productions that define them:

- `RawOptionalTypeSyntax` has a `wrappedType` property to retrieve its child `RawTypeSyntax`
- `RawMetatypeTypeSyntax` has a `baseType` property to retrieve its child `RawTypeSyntax`

These nodes also have accessors for associated tokens like the `?`
token or the `.`, `Type`, and `Protocol` tokens.

For sequences of syntax elements, SwiftSyntax provides corresponding
`SyntaxCollection` types.

- Note: Many sources refer to this structure as an "Abstract Syntax Tree" or 
        AST. Abstract, in these contexts, refers to the tree dropping some 
        amount of input structure that is not needed for analysis or compilation
        to proceed. Because the syntax trees in SwiftSyntax are designed to 
        faithfully represent source text, they are more accurately referred to 
        as "Concrete Syntax Trees" or just "Syntax Trees".

## Parsing Source Code into Syntax Nodes

Parsing is a fruitful sub-field of programming language research, with a wide
variety of techniques for dealing with many classes of inputs. One of the
simplest approaches to parsing languages like Swift is parsing by 
_recursive descent_. Just as SwiftSyntax encodes the results of productions,
a recursive descent parser encodes the content of production rules as a set of
mutually-recursive functions.

```swift
extension Parser {
  mutating func parseOptionalType() -> OptionalTypeSyntax {
    // First, recursively parse a type
    let base = self.parseType()
    // Then, parse a postfix question mark token
    let mark = self.eat(.postfixQuestionMark)
    // Finally, yield the optional type syntax node.
    return RawOptionalTypeSyntax(
      wrappedType: base, 
      questionMark: mark, 
      arena: self.arena
    )
  }
}
```

This simple function introduces many of the basic concepts that form the
backbone of the parser's implementation. The `Parser.eat(_:)` method
provides a function to examine the input stream and advance one step if the
provided token kind is present. To form the node, a call to the initializer
is made, which acts to wire up all of the sub-nodes into a single 
`RawOptionalTypeSyntax`.

### Unconditional Parsing

The `Parser.eat(_:)` method unconditionally consumes a token of the given
type, and an assertion is raised if the input token's kind does not match.
This function is most appropriate for encoding structural invariants during
the parse. For example, the decision to parse a `FunctionDeclSyntax` node is
made by examining the input stream for the `func` keyword. It is reasonable to
expect that the production implementing function parsing would `eat` its `func`
keyword. This ensures that callers always check for the `func` keyword before
calling the function parsing method.

### Conditional Parsing

To model conditional productions, the syntax tree uses `Optional`-typed
syntax nodes, and the parser uses the `Parser.consume(if:)` method. 
For a Swift declaration item, a trailing semicolon is optional:

```swift
extension Parser {
  mutating func parseDeclarationItem() -> RawMemberDeclListItemSyntax {
    // First, recursively parse a declaration
    let parsedDecl = self.parseDeclaration()
    // Next, consume the semicolon - if there is one.
    let semicolon = self.consume(if: .semicolon)
    return RawMemberDeclListItemSyntax(
      decl: parsedDecl, 
      semicolon: semicolon,
      arena: parser.arena)
  }
}
```

Unlike `Parser.eat(_:)`, if the parser does not encounter a token of the
given type, a `nil` token is returned and the input is left unconsumed.

### Sequence Parsing

To consume a sequence of syntax elements, a loop and a condition are needed.
Many sequences of elements in Swift are delimited by an inter-element token
like a comma or a period. A type's inheritance clause is a prime example of
a comma-delimited sequence of type syntax elements:

```swift
extension Parser {
  mutating func parseInheritance() -> RawTypeInheritanceClauseSyntax {
    // Eat the colon character.
    let colon = self.eat(.colon)
    // Start parsing a list of inherited types.
    var elements = [RawInheritedTypeSyntax]()
    do {
      var keepGoing: RawTokenSyntax? = nil
      repeat {
        let type = self.parseType()
        keepGoing = self.consume(if: .comma)
        elements.append(RawInheritedTypeSyntax(
          typeName: type, trailingComma: keepGoing, arena: self.arena))
      } while keepGoing != nil
    }
    // Construct the syntax for the list of inherited types.
    let inheritedTypes = RawInheritedTypeListSyntax(
      elements: elements, arena: self.arena)
    return RawTypeInheritanceClauseSyntax(
      colon: colon,
      inheritedTypeCollection: inheritedTypes,
      arena: self.arena)
  }
}
```

This function populates an array of `RawInheritedTypeSyntax` elements separated
by commas. Since the commas must be present in the syntax tree, the `keepGoing`
variable plays double duty both as syntax and as the loop condition. It also
composes all of the parsing elements we have seen thus far, using both
unconditional parsing to assert an invariant, and conditional parsing to 
define the loop condition.

### Putting It All Together

Recursive descent parsing applies these three parsing techniques at scale. The
resulting parser very closely mirrors the structure of the grammar that it is
parsing, is relatively fast, and easy to tweak. When adding methods to
parse new productions, it can often be helpful to work backwards from the
call to the initializer of the returned syntax node type. Each input to the 
initializer can be obtained by recursively calling functions to parse nodes of
the appropriate type and applying the parsing techniques above to fill in 
the rest.