File: accelerators.md

package info (click to toggle)
ruby-parslet 1.6.1-1
  • links: PTS, VCS
  • area: main
  • in suites: jessie, jessie-kfreebsd
  • size: 908 kB
  • ctags: 473
  • sloc: ruby: 5,220; makefile: 2
file content (234 lines) | stat: -rw-r--r-- 12,011 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
# Parslet Accelerator

## Synopsis

Reading this all the way is worth it. But don't despair; if your attention span is short, read this and zap away! The TLDR:

Parslet is slow because of the way it is constructed internally. Optimisation helps, but makes your parser harder to read. Don't go there. Use parser accelerators instead - optimize parts of your parser without changing its definition. This is what it looks like:

    slow_parser = something >> slow_part >> other

    include Parslet::Accelerator
    optimized_parser = apply(
      rule( slow_part ) { faster_part })

    optimized_parser.parse('"Parsing is now fully optimized! (tm)"')

Thus parslet allows you to write cristal clear parsers that are also as fast as you can make them. (But still slower than C!)

## Introduction

The goals of parslet are simple: make writing PEG parsers a predictable and straightforward endeavour. Some people have since claimed the word 'fun' for writing parsers, a connotation that we don't entirely oppose - otherwise why would we spend our time extending parslet?

### Dark Clouds ahead

Writing your first thousand line parser that works is easy – IF you use parslet. But very often, the resulting parser is rather slow - having execution times in the second range instead of the subsecond range.

You fire up your email client and drop us a mail to the mailing list, asking: "Why is parslet so slow?" You'll receive the following answers:

* Parslet is not a parser generator, but a parser engine based on Ruby. As such, it will be slower than parsers written in languages such as C.
* Parslet's internal structure is simple and clear. We've invested a lot of effort in making everything it does obvious and extendable. The downside of this somewhat OO-heavy approach is that we've got many objects juggling data and deep callstacks. Read: Bad use of caches and CPU time.
* Very few big languages have parsers written in high level languages such as Ruby. For good reasons. Depending on how serious you are about writing a new language (as opposed to fiddling around), you might want to _not start with parslet at all_.

It's not like we haven't done anything to fix the above reasons; rather, we're doing everything we can, provided the main goal of *simplicity and understandability* is not in danger! If you look up what has been done over the years you will find a lot of small and large optimisations. But we have always refused to sacrifice simplicity of design to the god of optimisation, especially when it came to making a single parser faster. We want parslet to be fast in general, and frankly, your parser of language X is not our concern - only insofar as it uses parslet.

But parslet needs to be useful for something. If not, what is the point? We would like to make parslet as useful as possible for smaller languages and for places where execution speed isn't your only concern. A lot of languages have rapidly evolving grammars and are developed by programmers that don't have the time for hand-writing parsers in C.

Still, what should you do once you've written your parser and speed becomes the issue? Until now, you had no real options short of rewriting the damn thing in C. That has changed: we've come up with Parslet Accelerator. The Accelerator will allow you to pattern match bits of your _parser_ and replace them with bits that do the same work, but faster. Really just hot spot optimisation, but without sacrificing readability of the original parser grammar.

### An Example

Let's consider the parser for quoted strings as an example, usually written to look something like this:

    quote = str('"')
    quote >> (quote.absent? >> any).repeat >> quote

If we spell out the work parslet needs to do when matching a 1000 character string using this method, the performance problems will become obvious to you. Parslet will:

* Match a quote
* In a loop:
  * Try to match a quote
  * If that fails, continue, otherwise break from the loop
  * Gobble up a single char
* Match the final quote

The inner loop will be executed a 1000 times; that's a 1000 times reading a char, checking to see if it's a quote, then reading it again, etc... As a programmer, this should disturb you deeply. And the fix, pseudo-code wise, should be obvious to you. Let's look at this revised flow:

* Match a quote
* Gobble up as many chars as possible, until we hit a quote
* Match the final quote

Ok, we've sort of cheated there in the middle - we've transformed something into a single step that is really still a loop. But as a Ruby programmer, you will not see this as a loop, but rather as a call to a more efficient library like `StringScanner`, which underlies `Parslet::Source`.

So we're pretty confident that this new parser will work faster; maybe fast even. Let's assume that we've got a `GobbleUp` atom that gobbles up chars until it hits a stop char. Our faster parser would have to look something like this:

    quote = str('"')
    quote >> GobbleUp.new('"') >> quote

And all is fine, right? We don't think so. You've chosen to use parslet, so you don't want to end up sprinkling your grammar which is as much specification as it is implementation with things like `GobbleUp`. Wouldn't it be nice if you could keep the parser as it is, but somehow replace the pattern of `(quote.absent? >> any).repeat` with `GobbleUp.new('"')` before doing any work with your parser? Well, you can.

    quote = str('"')
    parser = quote >> (quote.absent? >> any).repeat >> quote

    A = Accelerator # for making what follows a bit shorter
    optimized_parser = A.apply(parser,
      A.rule( (A.str(:x).absent? >> A.any).repeat ) { GobbleUp.new(x) })

    optimized_parser.parse('"Parsing is now fully optimized! (tm)"')

(If you're interested in a bit of history, the example that triggered the discussion around accelerators is preserved in [optimizer.rb](https://github.com/kschiess/parslet/blob/master/experiments/optimizer.rb). If you look past the hacks and the optimism, you'll recognize some of the things we talk about in this document.)

### About this Document

Now that the goal is defined, let us expose the details of the system proposed above. We'll start with explaining what these `Accelerator.rule` things are, how they match against your parser and how binding of variables work. (*Parslet Pattern Matching*) Then we'll explain what actions you can take once you've matched part of your parser. (*Binding and Actions*)

## Parser Pattern Matching

We'll demonstrate how pattern detection is constructed by showing what the smallest parts do first. Let's require needed libraries.

    require 'parslet'
    require 'parslet/accelerator'

    include Parslet

The whole optimizer code is inside the `Parslet::Accelerator` module. If you read that, read 'particle accelerator', not 'will make my code fast'. It is very possible to make things worse using `Parslet::Accelerator`.

The simplest parser I can think of would be the one matching a simple string.

    atom = str('foo')
    expression = Accelerator.str(:x)
    binding = Accelerator.match(atom, expression)

    binding[:x].assert == 'foo'

Note that the above was somewhat verbose, with all these variables and all that. We'll write shorter examples from now on.

Another simple parser is the one that matches against variants of the `match(...)` atom. Multiple forms of this exist, so we'll go through all of them. First comes a simple character range match.

    binding = Accelerator.match(
      match['a-z'],
      Accelerator.re(:x))

    binding[:x].assert == '[a-z]'

Note how the internal regular expression value used for the match atom is really bound to :x – we'll keep this as a convention. This also means that some parslet internas are leaked through the Accelerator API here. We consider that a feature, since working with accelerators will bring you into close contact with the atoms internas.

Also the Accelerator matcher for `Parslet.match` is called `Accelerator.re` - the reason for this should be obvious from what stands above.

Just to make this complete, a special case for the `match(...)` atom is the `any` atom.

    binding = Accelerator.match(any, Accelerator.any)
    binding.assert != nil

## Composite Parsers

Let's start assembling these simple parsers into more complex patterns and match those. As our first pattern, we'll consider sequences.

    binding = Accelerator.match(
      str('a') >> str('b'),
      Accelerator.str(:x) >> Accelerator.str(:y))

    binding.values_at(:x, :y).assert == %w(a b)

Also, alternatives should correctly bind.

    binding = Accelerator.match(
      str('a') | str('b'),
      Accelerator.str(:x) | Accelerator.str(:y))

    binding.values_at(:x, :y).assert == %w(a b)

Note that this means that we bind to the whole alternative subtree, not either to the left or to the right. We're matching the parslet tree, so the meaning of the alternative is irrelevant.

Let's quickly skip through the list of other composite parsers here: First off is repetition.

    Accelerator.match(
      str('a').repeat(1,2),
      A.str('a').repeat(1,2)).assert != nil

And then there is positive and negative lookahead.

    Accelerator.match(
      str('a').present?,
      A.str('a').present?).assert != nil

    Accelerator.match(
      str('a').absent?,
      A.str('a').absent?).assert != nil

And named values.

    Accelerator.match(
      str('a').as(:a),
      A.str('a').as(:a)).assert != nil

Which we also want to be able to bind.

    Accelerator.match(
      str('a').as(:a),
      A.str('a').as(:b)).assert == {:b => :a}

## Binding to Values

As a side note, our parser should also respect literal value matches in the pattern and only bind to subsequent locations when the values match up.

    binding = Accelerator.match(
      str('a') >> str('b'),
      Accelerator.str(:x) >> Accelerator.str(:x))

    binding.assert == nil

Another property should be that literal strings passed to the pattern should be matched using ===.

    binding = Accelerator.match(
      str('abc') >> str('bcd'),
      Accelerator.str(/b/) >> Accelerator.str('bcd'))

    binding.assert == {}

The binding is empty here, since no variables were given. But lets also implement constrained variable bindings, that seems useful. The way this works is that you specify a variable you want to bind to first, and then a list of constraints that are matched by `#===`.

    A.match(str('abc'), A.str(:x, /c/))[:x].assert == 'abc'
    A.match(str('abc'), A.str(:x, /d/)).assert == nil

    A.match(str('abc'), A.str(:x, /a/, /c/))[:x].assert == 'abc'
    A.match(str('abc'), A.str(:x, /a/, /d/)).assert == nil

Here's a quick demonstration that demonstrates that this feature equally applies to both `Accelerator.re` and `Parslet.match`.

    A.match(match['abc'], A.re(:x, /d/)).assert == nil
    A.match(match['abc'], A.re(:x, /c/))[:x].assert == '[abc]'

## Bindings and Actions

Matching parsers is only useful if we can take action and create a new parser that is different in some way. Let's say we have the lofty goal of creating an 'accelerator' that reverses the language understood by a parser. Where the first parser would parse something like:

    aaaaabbbb

we want the second parser to parse something like:

    bbbaaaaa

Here's how the first parser would look like:

    parser = str('a').repeat(1) >> str('b').repeat(1)

And here's how we turn this around to parse the second kind of input.

    include Parslet
    parser = str('a').repeat(1) >> str('b').repeat(1)
    new_parser = A.apply(parser,
      A.rule(A.str(:x).repeat(1) >> A.str(:y).repeat(1)) { 
        str(y).repeat(1) >> str(x).repeat(1) } )

    new_parser.parse('bbbaaaa').assert == 'bbbaaaa'

The `Accelerator.apply` method should be called with the following schema: 

    Accelerator.apply(PARSER, RULE1, RULE2, ...)

This is a more modern syntax than `Parslet::Transform` uses - maybe we'll update that class in a future version.

## Closing Note

* not a panacea