1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290
|
=head1 NAME
Parser::MGC::Tutorial - tutorial to build simple recursive-descent parsers
=head1 INTRODUCTION
L<Parser::MGC> is an abstract base class that provides useful features to
assist writing a parser.
A parser written using this module will be a subclass of C<Parser::MGC>, which
provides a C<parse> method. For example, the following trivial parser accepts
only input content matching the given regexp.
package ExampleParser;
use base qw( Parser::MGC );
sub parse
{
my $self = shift;
$self->expect( qr/hello world/i );
return 1;
}
A program using this parser constructs an instance of it, and uses this
instance to parse content. Content is passed either as string value to the
C<from_string> method, or from a named file or filehandle to the C<from_file>
method.
use feature qw( say );
use ExampleParser;
my $parser = ExampleParser->new;
say $parser->from_string( "Hello World" );
When run, this program outputs the return value of the C<from_string> method,
which itself is the value returned by the C<parse> method.
$ perl tut01.pl
1
Content can also be provided by C<from_file> instead:
use feature qw( say );
use ExampleParser;
my $parser = ExampleParser->new;
say $parser->from_file( \*STDIN );
=head2 Returning Values
Of course, a parser that simply returns 1 to say it matches may not be all
that useful. Almost always when a parser is being used to match some input, it
is because some values or structure are needed from this input, which should
be returned to the caller.
The C<expect> method attempts to match the next part of the input string with
the given regexp, and returns the matching substring. A subsequent C<expect>
call will continue parsing where the previous one finished. C<Parser::MGC>
will automatically skip over whitespace between these calls, so most grammars
that are whitespace-insensitive should not need to worry about this
explicitly.
By using two C<expect> calls, a parser can first examine that some given
literal is present, and then return a matched substring from the input.
sub parse
{
my $self = shift;
$self->expect( qr/hello/i );
my $name = $self->expect( qr/\w+/ );
return $name;
}
Z<>
say $parser->from_string( "Hello World" );
Z<>
$ perl tut02.pl
World
Token methods make this easier by providing convenient shortcuts to
commonly-used matching patterns.
sub parse
{
my $self = shift;
$self->expect( qr/hello/i );
return $self->token_ident;
}
If instead we pass in a value that does not match the regexp, an exception is
thrown, including details of how the parse failed.
say $parser->from_string( "Hello, world!" );
Z<>
$ perl tut03.pl
Expected (?i-xsm:hello world) on line 1 at:
Hello, world!
^
A typical use of a parser is to form an Abstract Syntax Tree (AST) from a
given input. In this case it is likely that the return value from the parse
method will be some object the application can use to inspect the syntax.
sub parse
{
my $self = shift;
my $num = $self->token_number;
return MyGrammar::Expression::Number->new( $num );
}
=head2 Indicating Failure
While the basic methods such as C<expect> and the various token methods will
indicate a failure automatically, there may be cases in the grammar that more
logic is required by the parser. If this logic wishes to indicate a failure in
the input and cause back-tracking to occur, it can use the C<fail> method.
sub parse
{
my $self = shift;
my $num = $self->token_number;
$num >= 0 or $self->fail( "Expected a non-negative number" );
return $num;
}
=head1 STRUCTURE
So far we've managed to parse simple patterns that could have been specified
with a simple regular expression. Any parser for a nontrivial grammar will
need other abilities as well; it will need to be able to choose from a list of
alternatives, to be able to repeat patterns, and to form nested scopes to
match other content within.
C<Parser::MGC> provides a set of methods that take one or more C<CODE>
references that perform some parsing step, and form a higher-level
construction out of them. These can be used to build more complex parsers out
of simple ones. It is this recursive structure that gives C<Parser::MGC> its
main power over simple one-shot regexp matching.
Any nontrivial grammar is likely to be formed from multiple named rules. It is
natural therefore to split the parser for such a grammar into methods whose
names reflect the structure of the grammar to be parsed. Each of the
structure-forming methods which takes C<CODE> references invokes each by
passing in the parser object itself as the first argument. This makes it
simple to invoke sub-rules by passing references to method subs themselves,
because the parser object will already be passed as the invocant.
The following examples will build together into a parser for a simple C-like
expression language.
=head2 Optional Rules
The simplest of the structure-forming methods, C<maybe>, attempts to run the
parser step it is given and if it succeeds, returns the value returned by that
step. If it fails by throwing an exception, then the C<maybe> call simply
returns C<undef> and resets the current parse position back to where it was
before it started. This allows writing a grammar that includes an optional
element, similar to the C<?> quantifier in a regular expression.
sub parse_type
{
my $self = shift;
my $storage = $self->maybe( sub {
$self->token_kw(qw( static auto typedef ));
} );
return MyGrammar::Type->new( $self->parse_ident, $storage );
}
=head2 Repeated Rules
The next structure-forming method, C<sequence_of>, attempts to run the parser
step it is given multiple times until it fails, and returns an C<ARRAY>
reference collecting up all the return values from each iteration that
succeeded. By itself, C<sequence_of> can never fail; if the body never matches
then it just yields an empty array and consumes nothing from the input. This
allows writing a grammar that includes a repeating element, similar to the
C<*> quantifier in a regular expression.
sub parse_statements
{
my $self = shift;
my $statements = $self->sequence_of( sub {
$self->parse_statement;
} );
return MyGrammar::Statements->new( $statements );
}
Often it is the case that the grammar requires at least one item to be
present, and should not accept an empty parse of zero elements. This can be
achieved in code by testing the size of the returned array, and using the
C<fail> method. This could be considered similar to the C<+> quantifier in a
regular expression.
sub parse_statements
{
my $self = shift;
my $statements = $self->sequence_of( sub {
$self->parse_statement;
} );
@$statements > 0 or $self->fail( "Expected at least one statement" );
return MyGrammar::Statements->new( $statements );
}
Another case that often happens it that the grammar requires some simple
separation pattern between each parsed item, such as a comma. The C<list_of>
method helps here because it automatically handles those separating patterns
between the items, returning a reference to an array containing only the
actual parsed items without the separators.
sub parse_expression_list
{
my $self = shift;
my $exprs = $self->list_of( ",", sub {
$self->parse_expression;
} );
return MyGrammar::ExpressionList->new( $exprs );
}
=head2 Alternate Rules
To handle a choice of multiple different alternatives in the grammar, the
C<any_of> method takes an ordered list of parser steps, and attempts to invoke
each in turn. It yields as its result the result of the first one of these
that didn't fail. This allows writing a grammar that allows a choice of
multiple different rules at some point, similar to the C<|> alternation in a
regular expression.
sub parse_statement
{
my $self = shift;
$self->any_of(
sub { $self->parse_declaration },
sub { $self->parse_expression; $self->expect( ';' ); },
sub { $self->parse_block_statement },
);
}
=head2 Scoping Rules
The final structure-forming method has no direct analogy to a regular
expression, though usually similar structures can be found. To handle the case
where some nested structure has to be handled between opening and closing
markers, the C<scope_of> method can be used. It takes three arguments, being
the opening marker, a parser step to handle the contents of the body, and the
closing marker. It expects to find each of these in sequence, and returns the
value that the inner parsing step returned.
However, what makes it more interesting is that during execution of the inner
parsing step, the basic token functions all take into account the closing
marker. No token function will return a result if the stream now looks like
the scope closing marker. Instead, they'll all fail claiming to be at the end
of the scope. This makes it much simpler to parse, for example, lists of
values surrounded by braces.
sub parse_array_initialiser
{
my $self = shift;
$self->scope_of( "{", sub { $self->parse_expression_list }, "}" );
}
During execution of the inner call to C<parse_expression_list>, any occurence
in the stream of the C<}> marker will appear to be the end of the stream,
causing the inner call to stop at hopefully the right place (barring other
syntax errors), and terminating correctly.
|