1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
|
= Hacker's guide to the `NodePattern` compiler
This documentation is aimed at anyone wanting to understand / modify the `NodePattern` compiler.
It assumes some familiarity with the syntax of https://github.com/rubocop/rubocop-ast/blob/master/docs/modules/ROOT/pages/node_pattern.adoc[`NodePattern`], as well as the AST produced by the `parser` gem.
== High level view
The `NodePattern` compiler uses the same techniques as the `parser` gem:
* a `Lexer` that breaks source into tokens
* a `Parser` that uses tokens and a `Builder` to emit an AST
* a `Compiler` that converts this AST into Ruby code
Example:
* Pattern: `+(send nil? {:puts :p} $...)+`
* Tokens: `+'(', [:tNODE_TYPE, :send], [:tPREDICATE, :nil?], '{', ...+`
* AST: `+s(:sequence, s(:node_type, :send), s(:predicate, :nil?), s(:union, ...+`
* Ruby code:
+
[source,ruby]
----
node.is_a?(::RuboCop::AST::Node) && node.children.size >= 2 &&
node.send_type? &&
node.children[0].nil?() &&
(union2 = node.children[1]; ...
----
The different parts are described below
== Vocabulary
*"node pattern"*: something that can be matched against a single `AST::Node`.
While `(int 42)` and `#is_fun?` both correspond to node patterns, `+...+` (without the parenthesis) is not a node pattern.
*"sequence"*: a node pattern that describes the sequence of children of a node (and its type): `+(type first_child second_child ...)+`
*"variadic"*: element of a sequence that can match a variable number of children.
`+(send _ int* ...)+` has two variadic elements (`int*` and `+...+`).
`(send _ :name)` contains no variadic element.
Note that a sequence is itself never variadic.
*"atom"*: element of a pattern that corresponds with a simple Ruby object.
`(send nil?
:puts (str 'hello'))` has two atoms: `:puts` and `'hello'`.
== Lexer
The `lexer.rb` defines `Lexer` and has the few definitions needed for the lexer to work.
The bulk of the processing is in the inherited class that is generated by https://github.com/seattlerb/oedipus_lex[`oedipus_lex`]
[discrete]
==== Rules
https://github.com/seattlerb/oedipus_lex[`oedipus_lex`] generates the Ruby file `lexer.rex.rb` from the rules defined in `lexer.rex`.
These rules map a Regexp to code that emits a token.
`oedipus_lex` aims to be simple and the generated file is readable.
It uses https://ruby-doc.org/stdlib-2.7.1/libdoc/strscan/rdoc/StringScanner.html[`StringScanner`] behind the scene.
It selects the first rule that matches, contrary to many lexing tools that prioritize longest match.
[discrete]
==== Tokens
The `Lexer` emits tokens with types that are:
* string for the syntactic symbols (e.g.
`'('`, `'$'`, `+'...'+`)
* symbols of the form `:tTOKEN_TYPE` for the rest (e.g.
`:tPREDICATE`)
Tokens are stored as `[type, value]`, or `[type, [value, location]]` if locations are emitted.
[discrete]
==== Generation
Use `rake generate:lexer` to generate the `lexer.rex.rb` from `lexer.rex` file.
This is done automatically by `rake spec`.
NOTE: the `lexer.rex.rb` is not under source control, but is included in the gem.
== Parser
Similarly to the `Lexer`, the `parser.rb` defines `Parser` and has the few definitions needed for the parser to work.
The bulk of the processing is in the inherited class `parser.racc.rb` that is generated by https://ruby-doc.org/stdlib-2.7.0/libdoc/racc/parser/rdoc/Racc.html#module-Racc-label-Writing+A+Racc+Grammar+File[`racc`] from the rules in `parser.y`.
[discrete]
==== Nodes
The `Parser` emits `NodePattern::Node` which are similar to RuboCop's node.
They both inherit from ``parser``'s `Parser::AST::Source::Node`, and share additional methods too.
Like for RuboCop's nodes, some nodes have specicialized classes (e.g.
`Sequence`) while other nodes use the base class directly (e.g.
`s(:number, 42)`)
[discrete]
==== Rules
The rules follow closely the definitions above.
In particular a distinction between `node_pattern_list`, which is a list of node patterns (each term can match a single node), while the more generic `variadic_pattern_list` is a list of elements, some of which could be variadic, others simple node patterns.
[discrete]
==== Generation
Similarly to the lexer, use `rake generate:parser` to generate the `parser.racc.rb` from `parser.y` file.
This is done automatically by `rake spec`.
NOTE: the `parser.racc.rb` is not under source control, but is included in the gem.
== Compiler
The compiler's core is the `Compiler` class.
It holds the global state (e.g.
references to named arguments).
The goal of the compiler is to produce `matching_code`, Ruby code that can be run against an `AST::Node`, or any Ruby object for that matter.
Packaging of that `matching_code` into code for a `lambda`, or method `def` is handled separately by the `MethodDefiner` module.
The compilation itself is handled by three subcompilers:
* `NodePatternSubcompiler`
* `AtomSubcompiler`
* `SequenceSubcompiler`
=== Visitors
The subcompilers use the visitor pattern [https://en.wikipedia.org/wiki/Visitor_pattern]
The methods starting with `visit_` are used to process the different types of nodes.
For a node of type `:capture`, the method `visit_capture` will be called, or if none is defined then `visit_other_type` will be called.
No argument is passed, as the visited node is accessible with the `node` attribute reader.
=== NodePatternSubcompiler
Given any `NodePattern::Node`, it generates the Ruby code that can return `true` or `false` for the given node, or node type for sequence head.
==== `var` vs `access`
The subcompiler can be called with the current node stored either in a variable (provided with the `var:` keyword argument) or via a Ruby expression (e.g.
`access: 'current_node.children[2]'`).
The subcompiler will not generate code that executes this `access` expression more than once or twice.
If it might access the node more than that, `multiple_access` will store the result in a temporary variable (e.g.
`union`).
==== Sequences
Sequences are the most difficult elements to handle and are deferred to the `SequenceSubcompiler`.
==== Atoms
Atoms are handled with `visit_other_type`, which defers to the `AtomSubcompiler` and converts that result to a node pattern by appending `=== cur_node` (or `=== cur_node.type` if in sequence head).
This way, the two arguments in `(_ #func?(%1) %2)` would be compiled differently;
`%1` would be compiled as `param1`, while `%2` gets compiled as `param2 === node.children[1]`.
==== Precedence
The code generated has higher or equal precedence to `&&`, so as to make chaining convenient.
=== AtomSubcompiler
This subcompiler produces Ruby code that gets evaluated to a Ruby object.
E.g.
`"42"`, `:a_symbol`, `param1`.
A good way to think about it is when it has to be passed as arguments to a function call.
For example:
[source,ruby]
----
# Pattern '#func(42, %1)' compiles to
func(node, 42, param1)
----
Note that any node pattern can be output by this subcompiler, but those that don't correspond to a Ruby literal will be output as a lambda so they can be combined.
For example:
[source,ruby]
----
# Pattern '#func(int)' compiles to
func(node, ->(compare) { compare.is_a?(::RuboCop::AST::Node) && compare.int_type? })
----
=== SequenceSubcompiler
The subcompiler compiles the sequences' terms in turn, keeping track of which children of the `AST::Node` are being matched.
==== Variadic terms
The complexity comes from variadic elements, which have complex processing _and_ may make it impossible to know at compile time which children are matched by the subsequent terms.
*Example* (no variadic terms)
----
(_type int _ str)
----
First child must match `int`, third child must match `str`.
The subcompiler will use `children[0]` and `children[2]`.
*Example* (one variadic terms)
----
(_type int _* str)
----
First child must match `int` and _last_ child must match `str`.
The subcompiler will use `children[0]` and `children[-1]`.
*Example* (multiple variadic terms)
----
(_type int+ sym str+)
----
The subcompiler can not use any integer and `children[]` to match `sym`.
This must be tracked at runtime in a variable (`cur_index`).
The subcompiler will use fixed indices before the first variadic element and after the last one.
==== Node pattern terms
The node pattern terms are delegated to the `NodePatternSubcompiler`.
In the pattern `(:sym :sym)`, both `:sym` will be compiled differently because the first `:sym` is in "sequence head": `:sym === node.type` and `:sym == node.children[0]` respectively.
The subcompiler indicates if the pattern is in "sequence head" or not, so the `NodePatternSubcompiler` can produce the right code.
Variadic elements may not (currently) cover the sequence head.
As a convenience, `+(...)+` is understood as `+(_ ...)+`.
Other types of nodes will raise an error (e.g.
`(<will not compile>)`;
see `Node#in_sequence_head`)
==== Precedence
Like the node pattern subcompiler, it generates code that has higher or equal precedence to `&&`, so as to make chaining convenient.
== Variant: WithMeta
These variants of the Parser / Builder / Lexer generate `location` information (exactly like the `parser` gem) for AST nodes as well as comments with their locations (like the `parser` gem).
Since this information is not typically used when one only wants to define methods, it is not loaded by default.
== Variant: Debug
These variants of the Compiler / Subcompilers works by adding tracing code before and after each compilation of `NodePatternSubcompiler` and `SequenceSubcompiler`.
A unique ID is assigned to each node and the tracing code flips a corresponding switch when the expression is about to be evaluated, and after (joined with `&&` so it only flips the switch if the node was a match).
Atoms are not compiled differently as they are not really matchable (when not compiled as a node pattern)
|