File: scanning_with_simpleparse.html

package info (click to toggle)
simpleparse 2.2.0-1
links: PTS, VCS
area: main
in suites: stretch
size: 1,336 kB
ctags: 1,349
sloc: python: 6,420; ansic: 5,855; makefile: 2
file content (146 lines) | stat: -rw-r--r-- 8,680 bytes
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
  <meta http-equiv="CONTENT-TYPE"
 content="text/html; charset=windows-1252">
  <title>Text Scanning with SimpleParse 2.0</title>
  <meta name="GENERATOR" content="OpenOffice.org 641  (Win32)">
  <meta name="AUTHOR" content="Mike Fletcher">
  <meta name="CREATED" content="20020705;55800">
  <meta name="CHANGEDBY" content="Mike Fletcher">
  <meta name="CHANGED" content="20020706;20413400">
  <link rel="stylesheet" type="text/css" href="sitestyle.css">
  <meta name="author" content="Mike C. Fletcher">
</head>
<body lang="en-US">
<h1>Text Scanning with SimpleParse 2.0</h1>
<p>SimpleParse 2.0 provides a parser generator which converts an EBNF
grammar into a run-time parser for use in scanning/marking up texts.
This document describes the process of developing and using an EBNF
grammar to perform the text-scanning process.</p>
<p>Prerequisites:</p>
<ul>
  <li>Python 2.x programming</li>
  <li>Some familiarity with EBNF grammars and other parsing terminology</li>
</ul>
<h2>Creation of a Simple Grammar</h2>
<p>The primary function of SimpleParse is to convert an EBNF grammar
into an in-memory object which can do the work of scanning (and
potentially
processing) data which conforms to that grammar. Therefore, to use the
system effectively, we need to be able to create grammars.</p>
<p>For our first experiment, we'll define a simple grammar for use in
parsing an INI-file-like format. Users of SimpleParse 1.0 will
recognise the
format from the original documentation. This version uses somewhat more
features (and is shorter as a result) than was easily accomplished with
SimpleParse 1.0.</p>
<p>Here's the grammar definition:</p>
<p>____ simpleexample2_1.py ____</p>
<pre>from simpleparse.common import numbers, strings, comments<br><br>declaration = r'''# note use of raw string when embedding in python code...<br>file           :=  [ \t\n]*, section+<br>section        :=  '[',identifier!,']'!, ts,'\n', body<br>body           :=  statement*<br>statement      :=  (ts,semicolon_comment)/equality/nullline<br>nullline       :=  ts,'\n'<br>equality       :=  ts, identifier,ts,'=',ts,identified,ts,'\n'<br>identifier     :=  [a-zA-Z], [a-zA-Z0-9_]*<br>identified     :=  string/number/identifier<br>ts             :=  [ \t]*<br>'''</pre>
<p> The first line incorporates a new feature of SimpleParse 2.0,
namely the
ability to automatically include (and build your own, incidentally)
libraries of commonly used productions (rules/patterns/grammars). By
importing these three modules, I've made the productions string,
number and semicolon_comment (among others) available to all the
Parser instances I create for the rest of this session.<br>
</p>
<p><b>New Feature Note</b>: The <code>identifier!</code> and <code>']'!</code>
element tokens in the "section" production tell the parser generator to
report
a ParserSyntaxError if we attempt to parse these element tokens and
fail.
&nbsp;We could also have spelled this particular segment of the grammar:<br>
</p>
<pre>section        :=  '[',!,identifier,']', ts,'\n', body<br></pre>
<p>which spelling is often easier to use in complex grammars.</p>
<p>If you are not familiar with EBNF grammars, or would like a
reference
to the various features of the SimpleParse grammar, please see: <a
 href="simpleparse_grammars.html">SimpleParse Grammars</a> . We will
assume that you understand the grammars being presented.</p>
<h3>Checking a Grammar</h3>
<p>SimpleParse does not have a separate compilation step, but it's
useful as you're writing your grammar to set up tests both for whether
the grammar itself is syntactically correct, and for whether the
productions match
the values you expect them to (and don't match those you don't want
them
to).</p>
<p>To check that a grammar is syntactically correct, the easiest
approach is to attempt to create a Parser with the grammar. The Parser
will complain if your grammar is syntactically incorrect, generating a
ValueError which reports the last line of the declaration which parsed
correctly, and the remainder of the declaration.</p>
<pre>from simpleparse.parser import Parser<br>parser = Parser( declaration)</pre>
<p> If, for example, you had left out a comma in the section
production between the literal ']' and ts, you would get an error like
so:</p>
<pre>S:\sp\simpleparse\examples&gt;bad_declaration.py<br>Traceback (most recent call last):<br>  File "S:\sp\simpleparse\examples\bad_declaration.py", line 21, in ?<br>    parser = Parser( declaration, "file" ) # will raise ValueError<br>  File "S:\sp\simpleparse\parser.py", line 34, in __init__<br>    definitionSources = definitionSources,<br>  File "S:\sp\simpleparse\simpleparsegrammar.py", line 380, in __init__<br>    raise ValueError(<br>ValueError: Unable to complete parsing of the EBNF, stopped at line 3 (134 chars<br> of 467)<br>Unparsed:<br>ts,'\n', body<br>body           :=  statement*<br>statement      :=  (ts,semicolon_comment)/equality/nulll...</pre>
<p> You can see this for yourself by running
examples/bad_declaration.py .</p>
<p>If your grammar is correct, Parser( declaration) will simply create
the underlying generator objects which can produce a parser for your
grammar. If you want to check that particular production has all of
it's required sub-productions, you can call myparser.buildTagger(
productionname ),
but I normally leave that test to be caught during the production
checking phase below.</p>
<h3>Checking a Production</h3>
<p>Now that we have our Parser object, and know that the grammar is
syntactically correct, we can test that our productions match/don't
match the values
we expect. Depending on your particular philosophy, this may be done
using the unittest module, or merely as informal tests during
development.</p>
<p>In our grammar above, let's try checking that the equality
production
really does match some values we expect it to match:</p>
<pre>testEquality = [<br>	"s=3\n",<br>	"s = 3\n",<br>	'''  s="three\\nthere"\n''',<br>	'''  s=three\n''',<br>]<br><br>production = "equality"<br><br>for testData in testEquality:<br>	success, children, nextcharacter = parser.parse( testData, production=production)<br>	assert success and nextcharacter==len(testData), """Wasn't able to parse %s as a %s (%s chars parsed of %s), returned value was %s"""%( repr(testData), production, nextcharacter, len(testData), (success, children, nextcharacter))</pre>
<p> You should be prepared to have those tests fail a few times. It's
easy to miss the effect of a particular feature of your grammar (such
as the inclusion of newline in the equality production above). It
took 3 tries before I got the tests above properly defined. Setting up
your tests within an automated framework such as unittest is probably a
good idea. It's
also a good idea to set up tests that check that that values which
shouldn't match don't.</p>
<p>Note: You may receive an error message from the parser.parse( ) call
saying that a particular production name isn't defined within the
grammar. You'll need to figure out why that name isn't there (did you
include the common module you were planning to use, or did you mis-type
a name somewhere?)
and correct the problem before the tests will run. This error serves as
a check that the production has all required sub-productions (as noted
in
the previous section).</p>
<h2>Scanning Text with the Grammar</h2>
<p>You saw the basic approach to parsing in the section on testing
above, but there are a few differences when you're creating a real
world parser. The first is that you will likely want to define a
default root production for the parser. In the examples above, the
root was specified explicitly during the call to parse to allow us to
test any of the productions in
the grammar. In normal use, you don't want users of your parser to need
to know what production is used for parsing a buffer, so you provide a
default in the Parser's initialiser:</p>
<pre>parser = Parser( declaration, "file" )<br>parser.parse( testData)</pre>
<p> Note: the root is treated differently than all other productions,
as it
doesn't return a result-tuple in the results tree, but instead governs
the
overall operation of the parser, determining whether it succeeds or
fails
as a whole. The <b>children</b> of the root production produce the
top-level
results of the parsing pass.</p>
<p>You can see the result tree returned from the parse method by
running
examples/simpleexample2_3.py . You can read about how to process the
results
tree in <a href="./processing_result_trees.html">Processing Result
Trees</a>.</p>
<a href="index.html">Up to index...</a><br>
</body>
</html>