<!DOCTYPE html>
<html>
<head>
<meta content="text/html;charset=UTF-8" http-equiv="Content-type" />
<title>parslet - Parser construction</title>
<meta content="Kaspar Schiess (http://absurd.li)" name="author" />
<link href="images/favicon3.ico" rel="shortcut icon" />
<link href="/parslet/stylesheets/site.css" rel="stylesheet" /><link href="/parslet/stylesheets/sh_whitengrey.css" rel="stylesheet" /><script src="http://code.jquery.com/jquery-2.1.4.min.js"></script><script src="/parslet/javascripts/toc.js"></script><script src="/parslet/javascripts/sh_main.min.js"></script><script src="/parslet/javascripts/sh_ruby.min.js"></script>
</head>
<body class="code" onload="sh_highlightDocument(); $('#toc').toc({selectors: 'h2'});">
<div id="everything">
<div class="main_menu">
<img src="/parslet/images/parsley_logo.png" alt="Parslet Logo" />
<ul>
<li>
<a href="/parslet/">about</a>
</li>
<li>
<a href="/parslet/get-started.html">get started</a>
</li>
<li>
<a href="/parslet/install.html">install</a>
</li>
<li>
<a href="/parslet/documentation.html">docs</a>
</li>
<li>
<a href="/parslet/contribute.html">contribute</a>
</li>
<li>
<a href="/parslet/projects.html">projects</a>
</li>
</ul>
</div>
<div class="content">
<h1>
Parser construction
</h1><p>A parser is nothing more than a class that derives from
<code>Parslet::Parser</code>. The simplest parser that one could write would
look like this:</p>
<pre class="sh_ruby"><code>
class SimpleParser < Parslet::Parser
  rule(:a_rule) { str('simple_parser') }
  root(:a_rule)
end
</code></pre>
<p>The language recognized by this parser is simply the string “simple_parser”.
Parser rules do look a lot like methods and are defined by</p>
<pre class="sh_ruby"><code>
rule(name) { definition_block }
</code></pre>
<p>Behind the scenes, this really defines a method that returns whatever the
definition block returns.</p>
<p>Every parser has a root. This designates where parsing should start. It is like
an entry point to your parser. With a root defined like this:</p>
<pre class="sh_ruby"><code>
root(:my_root)
</code></pre>
<p>you create a <code>#parse</code> method in your parser that will start parsing
by calling the <code>#my_root</code> method. You’ll also have a <code>#root</code>
(instance) method that is an alias of the root method. The following things are
really one and the same:</p>
<pre class="sh_ruby"><code>
SimpleParser.new.parse(string)
SimpleParser.new.root.parse(string)
SimpleParser.new.a_rule.parse(string)
</code></pre>
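<p>As a minimal sketch of that equivalence (assuming the <code>parslet</code> gem
is installed; the variable names and the rejected input are made up for
illustration), all three calls accept the same input, and input outside the
language raises <code>Parslet::ParseFailed</code>:</p>
<pre class="sh_ruby"><code>
require 'parslet'

class SimpleParser < Parslet::Parser
  rule(:a_rule) { str('simple_parser') }
  root(:a_rule)
end

parser = SimpleParser.new
input  = 'simple_parser'

# All three entry points run the same rule on the same input.
results = [
  parser.parse(input),
  parser.root.parse(input),
  parser.a_rule.parse(input)
]

# Anything that is not in the language is rejected with an exception.
failed = begin
  parser.parse('something_else')
  false
rescue Parslet::ParseFailed
  true
end
</code></pre>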
<p>Knowing these things gives you a lot of flexibility; I’ll explain why at the
end of the chapter. For now, just let me point out that because all of this is
Ruby, your favorite editor will syntax highlight parser code just fine.</p>
<h2>Atoms: The inside of a parser</h2>
<h3>Matching strings of characters</h3>
<p>A parser is constructed from parser atoms (or parslets, hence the name). The
atoms are what appear inside your rules (and maybe elsewhere). We’ve already
encountered an atom, the string atom:</p>
<pre class="sh_ruby"><code>
str('simple_parser')
</code></pre>
<p>This returns a <code>Parslet::Atoms::Str</code> instance. These parser atoms
all derive from <code>Parslet::Atoms::Base</code> and have essentially just
one method you can call: <code>#parse</code>. So this works:</p>
<pre class="sh_ruby"><code title="parser atoms">
str('foobar').parse('foobar') # => "foobar"@0
</code></pre>
<p>The atoms are small parsers that can recognize languages and raise errors, just
like real <code>Parslet::Parser</code> subclasses.</p>
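<p>A small sketch of an atom used on its own, outside of any parser class. It
assumes the <code>parslet</code> gem is installed and that <code>include Parslet</code>
is used at the top level to make <code>str</code> available; the variable
names are illustrative only:</p>
<pre class="sh_ruby"><code>
require 'parslet'
include Parslet

atom = str('foobar')

# A successful parse returns the matched input together with its offset.
matched = atom.parse('foobar')

# A failed parse raises Parslet::ParseFailed; the message describes the problem.
message = begin
  atom.parse('foobaz')
  nil
rescue Parslet::ParseFailed => error
  error.message
end
</code></pre>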
<h3>Matching character ranges</h3>
<p>The second parser atom you will have to know about allows you to match
character ranges:</p>
<pre class="sh_ruby"><code>
match('[0-9a-f]')
</code></pre>
<p>The above atom would match the digits zero through nine and the letters ‘a’
to ‘f’ – yeah, you guessed right – hexadecimal digits, for example. The inside
of such a match parslet is essentially a regular expression that matches
a single character of input. Because we’ll be using ranges so much with
<code>#match</code> and because typing <code>('[]')</code> is tiresome, here’s another way
to write the above <code>#match</code> atom:</p>
<pre class="sh_ruby"><code>
match['0-9a-f']
</code></pre>
<p>Character matches are instances of <code>Parslet::Atoms::Re</code>. Here are
some more examples of character ranges:</p>
<pre class="sh_ruby"><code>
match['[:alnum:]'] # letters and numbers
match['\n'] # newlines
match('\w') # word characters
match('.') # any character
</code></pre>
<h3>The wild wild <code>#any</code></h3>
<p>The last example above corresponds to the regular expression <code>/./</code> that matches
any one character. There is a special atom for that:</p>
<pre class="sh_ruby"><code>
any
</code></pre>
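<p>To see the character atoms at work, here is a short sketch (assuming the
<code>parslet</code> gem is installed and <code>include Parslet</code> at the top
level; the variable names are made up). Each of these atoms consumes exactly
one character of input:</p>
<pre class="sh_ruby"><code>
require 'parslet'
include Parslet

long_form  = match('[0-9a-f]')
short_form = match['0-9a-f']

# Both spellings accept a single hexadecimal digit ...
long_digit  = long_form.parse('c')
short_digit = short_form.parse('c')

# ... and any accepts one character of whatever kind.
wildcard = any.parse('?')

# 'g' lies outside the range and is rejected.
rejected = begin
  short_form.parse('g')
  false
rescue Parslet::ParseFailed
  true
end
</code></pre>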
<h2>Composition of Atoms</h2>
<p>These basic atoms can be composed to form complex grammars. The following
few sections will tell you about the various ways atoms can be composed.</p>
<h3>Simple Sequences</h3>
<p>Match ‘foo’ and then ‘bar’:</p>
<pre class="sh_ruby"><code>
str('foo') >> str('bar') # same as str('foobar')
</code></pre>
<p>Sequences correspond to instances of the class
<code>Parslet::Atoms::Sequence</code>.</p>
<h3>Repetition and its Special Cases</h3>
<p>To model atoms that can be repeated, you should use <code>#repeat</code>:</p>
<pre class="sh_ruby"><code>
str('foo').repeat
</code></pre>
<p>This will allow ‘foo’ to repeat any number of times, including zero. If you
look at the signature for <code>#repeat</code> in <code>Parslet::Atoms::Base</code>,
you’ll see that it really takes two arguments: <em>min</em> and <em>max</em>. So all of the
following code makes sense:</p>
<pre class="sh_ruby"><code>
str('foo').repeat(1) # match 'foo' at least once
str('foo').repeat(1,3) # at least once and at most 3 times
str('foo').repeat(0, nil) # the default: same as str('foo').repeat
</code></pre>
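<p>The bounds can be checked with a quick sketch (assuming the
<code>parslet</code> gem is installed and <code>include Parslet</code> at the top
level; the helper <code>fails?</code> is hypothetical and only exists for this
example):</p>
<pre class="sh_ruby"><code>
require 'parslet'
include Parslet

# Hypothetical helper: true if parsing the input raises Parslet::ParseFailed.
def fails?(atom, input)
  atom.parse(input)
  false
rescue Parslet::ParseFailed
  true
end

foo = str('foo')

foo.repeat.parse('')                     # zero repetitions are allowed
twice = foo.repeat(1, 3).parse('foofoo') # two repetitions lie within 1..3

too_few  = fails?(foo.repeat(1), '')                 # min is 1, got 0
too_many = fails?(foo.repeat(1, 3), 'foofoofoofoo')  # max is 3, got 4
</code></pre>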
<p>Repetition has a special case that is used frequently: Matching something
once or not at all can be achieved by <code>repeat(0,1)</code>, but also
through the prettier:</p>
<pre class="sh_ruby"><code>
str('foo').maybe # same as str('foo').repeat(0,1)
</code></pre>
<p>These all map to <code>Parslet::Atoms::Repetition</code>. Please note this
little twist to <code>#maybe</code>:</p>
<pre class="sh_ruby"><code title="maybes twist">
str('foo').maybe.as(:f).parse('') # => {:f=>nil}
str('foo').repeat(0,1).as(:f).parse('') # => {:f=>[]}
</code></pre>
<p>The empty result of <code>#maybe</code> is <code>nil</code>, whereas
<code>repeat(0,1)</code> yields an empty array. This caters to the
intuition that <code>foo.maybe</code> either gives me <code>foo</code> or
nothing at all, not an empty array. But have it your way!</p>
<h3>Alternation</h3>
<p>The most important composition method for grammars is alternation. Without
it, your grammars would only vary in the amount of things matched, but not
in content. Here’s how this looks:</p>
<pre class="sh_ruby"><code>
str('foo') | str('bar') # matches 'foo' OR 'bar'
</code></pre>
<p>This reads naturally as “‘foo’ or ‘bar’”.</p>
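<p>A brief sketch of alternation in use (assuming the <code>parslet</code> gem is
installed and <code>include Parslet</code> at the top level; the variable names
are illustrative):</p>
<pre class="sh_ruby"><code>
require 'parslet'
include Parslet

foo_or_bar = str('foo') | str('bar')

# Either alternative is accepted ...
first  = foo_or_bar.parse('foo')
second = foo_or_bar.parse('bar')

# ... but input that matches neither of them is not.
neither = begin
  foo_or_bar.parse('baz')
  false
rescue Parslet::ParseFailed
  true
end
</code></pre>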
<h3>Operator precedence</h3>
<p>The operators we have chosen for parslet atom combination have the operator
precedence that you would expect. No parentheses are needed to express
alternation of sequences:</p>
<pre class="sh_ruby"><code>
str('s') >> str('equence') |
str('se') >> str('quence')
</code></pre>
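<p>The following sketch compares the expression above with a fully
parenthesized version (assuming the <code>parslet</code> gem is installed and
<code>include Parslet</code> at the top level; the names <code>implicit</code>
and <code>explicit</code> are made up). Ruby binds <code>>></code> tighter than
<code>|</code>, so both build an alternation of two sequences:</p>
<pre class="sh_ruby"><code>
require 'parslet'
include Parslet

implicit = str('s') >> str('equence') |
           str('se') >> str('quence')

explicit = (str('s') >> str('equence')) |
           (str('se') >> str('quence'))

# The outermost atom is the alternation, not a sequence.
outermost_is_alternative = implicit.is_a?(Parslet::Atoms::Alternative)

parsed = implicit.parse('sequence')
</code></pre>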
<h3>And more</h3>
<p>Parslet atoms are not as pretty as Treetop atoms. There you go, we said it.
However, there seems to be a different kind of aesthetic about them; they
are pure Ruby and integrate well with the rest of your environment. Have a
look at this:</p>
<pre class="sh_ruby"><code>
# Also consumes the space after important things like ';' or ':'. Call this
# giving the character you want to match as argument:
#
#   arg >> (spaced(',') >> arg).repeat
#
def spaced(character)
  str(character) >> match['\s']
end
</code></pre>
<p>or even this:</p>
<pre class="sh_ruby"><code>
# Turns any atom into an expression that matches a left parenthesis, the
# atom and then a right parenthesis.
#
#   bracketed(sum)
#
def bracketed(atom)
  spaced('(') >> atom >> spaced(')')
end
</code></pre>
<p>You might say that because parslet is just plain old Ruby objects itself (<span class="caps">PORO</span>
™), it allows for very tight code. Module inclusion, class inheritance, …
all your tools should work well with parslet.</p>
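<p>As a sketch of how such helpers might sit inside a parser class (assuming
the <code>parslet</code> gem is installed; the class name
<code>ArgumentListParser</code> and the rules <code>arg</code>, <code>args</code>
and <code>call</code> are hypothetical), note that <code>spaced</code> as written
expects exactly one whitespace character after each character it matches, so
the sample input ends in a space:</p>
<pre class="sh_ruby"><code>
require 'parslet'

class ArgumentListParser < Parslet::Parser
  # Helper from above: the character followed by one whitespace character.
  def spaced(character)
    str(character) >> match['\s']
  end

  # Helper from above: the atom wrapped in (spaced) parentheses.
  def bracketed(atom)
    spaced('(') >> atom >> spaced(')')
  end

  rule(:arg)  { match['a-z'].repeat(1) }
  rule(:args) { arg >> (spaced(',') >> arg).repeat }
  rule(:call) { bracketed(args) }
  root(:call)
end

result = ArgumentListParser.new.parse('( a, b) ')
</code></pre>
<p>Because the helpers are ordinary instance methods, they can be called from
within any rule block of the class.</p>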
<h2>Tree construction</h2>
<p>By default, parslet will just echo back to you the strings you feed into it.
Parslet will not generate a parser for you and neither will it generate your
abstract syntax tree for you. The method <code>#as(name)</code> allows you
to specify exactly what you want your tree to look like:</p>
<pre class="sh_ruby"><code title="using as">
str('foo').parse('foo') # => "foo"@0
str('foo').as(:bar).parse('foo') # => {:bar=>"foo"@0}
</code></pre>
<p>So you think: <code>#as(name)</code> allows me to create a hash, big deal.
That’s not all. You’ll notice that annotating everything that you want to keep
in your grammar with <code>#as(name)</code> autocreates a sensible tree
composed of hashes and arrays and strings. It’s really somewhat magic: Parslet
has a set of clever rules that merge the annotated output from your atoms into
a tree. Here are some more examples, with the atom on the left and the resulting
tree (assuming a successful parse) on the right:</p>
<pre class="sh_ruby"><code>
# Normal strings just map to strings
str('a').repeat                          # => "aaa"@0

# Arrays capture repetition of non-strings
str('a').repeat.as(:b)                   # => {:b=>"aaa"@0}
str('a').as(:b).repeat                   # => [{:b=>"a"@0}, {:b=>"a"@1}, {:b=>"a"@2}]

# Subtrees get merged - unlabeled strings discarded
str('a').as(:a) >> str('b').as(:b)       # => {:a=>"a"@0, :b=>"b"@1}
str('a') >> str('b').as(:b) >> str('c')  # => {:b=>"b"@1}

# #maybe will return nil, not the empty array
str('a').maybe.as(:a)                    # => {:a=>"a"@0}  (input 'a')
str('a').maybe.as(:a)                    # => {:a=>nil}    (input '')
</code></pre>
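<p>A few of these rows can be reproduced with a short sketch (assuming the
<code>parslet</code> gem is installed and <code>include Parslet</code> at the top
level; the variable names are illustrative):</p>
<pre class="sh_ruby"><code>
require 'parslet'
include Parslet

# Two labeled atoms in sequence are merged into one hash.
tree = (str('a').as(:a) >> str('b').as(:b)).parse('ab')

# Repeating a labeled atom yields an array with one hash per repetition.
list = str('a').as(:b).repeat.parse('aaa')

# Unlabeled strings next to a labeled atom are discarded.
partial = (str('a') >> str('b').as(:b) >> str('c')).parse('abc')

# The hash values remember where in the input they were found.
offset_of_b = partial[:b].offset
</code></pre>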
<h2>Capturing input</h2>
<p><em>Advanced reading material – feel free to skip this.</em></p>
<p>Sometimes a parser needs to match against something that was already matched
against. Think about Ruby heredocs for example:</p>
<pre class="sh_ruby"><code>
str = <<-HERE
This is part of the heredoc.
HERE
</code></pre>
<p>The key to matching this kind of document is to capture part of the input
first and then construct the rest of the parser based on the captured part.
This is what it looks like in its simplest form:</p>
<pre class="sh_ruby"><code>
match['ab'].capture(:capt) >>               # create the capture
  dynamic { |s,c| str(c.captures[:capt]) }  # and match using the capture
</code></pre>
<p>This parser matches either ‘aa’ or ‘bb’, but not mixed forms ‘ab’ or ‘ba’. The
last sample introduced two new concepts for this kind of complex parser: the
<code>#capture(name)</code> method and the <code>dynamic { ... }</code> code
block.</p>
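<p>Wrapped in a parser class, the same construct can be tried out as follows.
This is only a sketch: it assumes the <code>parslet</code> gem is installed, and
the class name <code>PairParser</code> and the rule name <code>pair</code> are
made up for the example:</p>
<pre class="sh_ruby"><code>
require 'parslet'

class PairParser < Parslet::Parser
  rule(:pair) {
    match['ab'].capture(:capt) >>
      dynamic { |source, context| str(context.captures[:capt]) }
  }
  root(:pair)
end

parser = PairParser.new

# Doubled characters are accepted ...
accepted = %w[aa bb].map { |input| parser.parse(input).to_s }

# ... while mixed forms raise Parslet::ParseFailed.
rejected = %w[ab ba].select do |input|
  begin
    parser.parse(input)
    false
  rescue Parslet::ParseFailed
    true
  end
end
</code></pre>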
<p>Appending <code>#capture(name)</code> to any parser will capture that parser’s
result in the captures hash in the parse context. If and only if the parser
<code>match['ab']</code> succeeds, it stores either ‘a’ or ‘b’ in
<code>context.captures[:capt]</code>.</p>
<p>The only way to get at that hash during the parse process is in a
<code>dynamic { ... }</code> code block (for reasons that are outside the
scope of this document). In such a block, you can:</p>
<pre class="sh_ruby"><code>
dynamic { |source, context|
  # construct parsers by using randomness
  rand < 0.5 ? str('a') : str('b')

  # Or by using context information
  str( context.captures[:capt] )

  # Or by .. doing other kind of work (consumes 100 chars and then 'a')
  source.consume(100)
  str('a')
}
</code></pre>
<h3>Scopes</h3>
<p>What if you want to parse heredocs contained within heredocs? It’s turtles all
the way down, after all. To be able to remember what string was used to
construct the outer heredoc, you would use the <code>#scope { ... }</code>
block that was introduced in parslet 1.5. Like opening a Ruby block, it allows
you to capture results (assign values to variables) to the same names you’ve
already used in outer scope – <em>without destroying the outer scope’s values for
these captures!</em></p>
<p>Here’s an example of this:</p>
<pre class="sh_ruby"><code>
str('a').capture(:a) >> scope { str('b').capture(:a) } >>
  dynamic { |s,c| str(c.captures[:a]) }
</code></pre>
<p>This parses ‘aba’ – if you understand that, you understand scopes and
captures. Congrats.</p>
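<p>To make the difference visible, this sketch puts the scoped parser next to
a variant without <code>scope</code> (assuming the <code>parslet</code> gem is
installed and <code>include Parslet</code> at the top level; the names
<code>scoped</code> and <code>unscoped</code> are made up):</p>
<pre class="sh_ruby"><code>
require 'parslet'
include Parslet

scoped = str('a').capture(:a) >> scope { str('b').capture(:a) } >>
         dynamic { |_source, context| str(context.captures[:a]) }

unscoped = str('a').capture(:a) >> str('b').capture(:a) >>
           dynamic { |_source, context| str(context.captures[:a]) }

# The inner capture vanishes when the scope closes, so :a is 'a' again.
with_scope = scoped.parse('aba')

# Without the scope, the second capture overwrites :a with 'b'.
without_scope = unscoped.parse('abb')

scoped_rejects_abb = begin
  scoped.parse('abb')
  false
rescue Parslet::ParseFailed
  true
end
</code></pre>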
<h2>And more</h2>
<p>Now you know exactly how to create parsers using Parslet. Your parsers
will output intricate structures made of endless arrays, complex hashes and
a few string leftovers. But your programming skills fail you when you try
to put all this data to use. Selecting keys upon keys in hash after hash, you
feel like a cockroach that has just read Kafka’s works. This is no fun. This
is not what you signed up for.</p>
<p>Time to introduce you to <a href="transform.html">Parslet::Transform</a> and its workings.</p>
</div>
<div class="copyright">
<p><span class="caps">MIT</span> License, 2010-2018, © <a href="http://absurd.li">Kaspar Schiess</a><br/></p>
</div>
<script>
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-16365074-2']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
</div>
</body>
</html>