1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377
|
# Copyright 2022 Jeffrey Kegler
# This file is part of Marpa::R2. Marpa::R2 is free software: you can
# redistribute it and/or modify it under the terms of the GNU Lesser
# General Public License as published by the Free Software Foundation,
# either version 3 of the License, or (at your option) any later version.
#
# Marpa::R2 is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser
# General Public License along with Marpa::R2. If not, see
# http://www.gnu.org/licenses/.
=head1 NAME
Marpa::R2::Advanced::Input_Models - Alternative input models
=head1 About this document
The alternative input models described in this document are an
advanced technique.
While you are starting out with Marpa, you
probably want to ignore this document.
If you are an experienced Marpa user,
it is still safe to ignore this document,
but you might find the possibilities it discusses
interesting.
This document describes the input models at a conceptual,
rather than an implementation level.
The reader should be familiar with
L<the recognizer's external scanning
capability|Marpa::R2::Scanless::R/"External scanning">.
A reader interested in implementation is assumed to
be familiar with the methods described in
L<Marpa::R2::Scanless::R/"Mutators for external scanning">,
=head1 Marpa has two different ideas of location
In the other Marpa documentation,
we have spoken of "location",
and assumed the standard input model.
The locations actually in use by the methods
described for the standard input model were Earley set
ordinals.
(An Earley set's ordinal is also its ID.)
Marpa actually has two different ideas of location --
Earley set ordinal and earleme.
This is ignored in the other Marpa documents,
and it can be, because they
assume the standard input model.
Use of the standard input model
guarantees that earleme
and Earley set ordinal
will always be exactly the same.
This document introduces methods which make it
possible (and in fact likely) that earleme and
Earley set ordinal will differ.
From here on,
the reader will need to pay careful attention
to the distinction.
=head1 What is an alternative input model?
An alternative input model
is anything that is not the default, token-stream model.
More helpfully, Marpa allows variable-length tokens and ambiguous tokens,
and an alternative input model is any input model which
=over
=item * Allows a token whose length is not exactly 1, or
=item * Allows locations which have more than one token.
=back
To do either of these things,
a user must use
L<external scanning|Marpa::R2::Scanless::R/"External scanning">
and
L<the recognizer'x external scanning
methods|Marpa::R2::Scanless::R/"Mutators for external scanning">.
In particular,
if an application is not directly
using the recognizer's
C<lexeme_read> or
C<lexeme_alternative> method calls,
that application is not using an alternative input method.
Many concepts, such as parsing location,
parse exhaustion,
and the end of parsing,
are somewhat more complicated when alternative
input models are involved.
These concepts were explained in L<the main document for
the recognizer|Marpa::R2::Scanless:R> on the assumption
that the default input model was in use.
This document revises those explanations as necessary
to take into
account the alternative input models.
=head1 Token streams
Marpa's default input model is the traditional one --
a token stream.
Token streams are very standard in parsing applications --
so much so
that most texts do not take the trouble
of defining the term.
A B<token stream> is input structured as
a sequence of tokens,
where each token occupies one location
and every location has a token.
In the token stream model, all tokens are
of the same length.
Conventionally, all tokens are of length 1,
and the token stream starts at location 0.
Following this convention,
the I<N>th token would start at
location I<N-1> and end
at location I<N>.
For example,
the first token would start at location 0 and end at location 1.
The second token would start at location 1 and end at location 2.
=head1 Earlemes
For most parsers, position is location in a token stream.
To deal with variable-length and overlapping tokens,
Marpa needs a more flexible idea of location.
Marpa's tracks position in terms of B<earlemes>.
B<Earlemes> are named after Jay Earley,
the inventor of the first algorithm
in Marpa's lineage.
Every token has a start earleme and an end earleme.
The token stream model may also be called the token-per-earleme
model.
In the token stream model,
token location and earleme location
are exactly identical.
More formally, in the token stream model,
if the token location is I<N>,
then the earleme location is also I<N>.
If a user's application uses the token stream model,
the user can ignore the existence of earlemes,
and can think entirely in terms of
tokens and their position in a token stream.
Because of this, the main Marpa documents
often speak
simply of the "location" in the parse.
=head1 The furthest earleme
The B<furthest earleme> is the last earleme at which a token ends.
In the default input model,
the furthest earleme and the current earleme
are always the same.
As a result,
in the default input model, the furthest earleme is not an important
concept.
In alternative input models,
tokens may be longer than 1 earleme, and
the furthest earleme and the current earleme may be far apart.
This becomes an issue when
parsing is finished.
Alternative input models use
one or more calls to the recognizer's
L<C<lexeme_complete()>|Marpa::R2::Scanless::R/"lexeme_complete()">
method to ensure
that processing of input catches up to the furthest earleme.
=head1 The latest Earley set and latest earleme
The B<latest earleme> is the earleme location of the latest
Earley set.
In the default input model, the latest earleme is always the
same as the current earleme.
In alternative input models,
there may not be an Earley set at a given earleme location.
When that is the case for the current earleme,
then the latest Earley set is not at the current earleme,
and the latest earleme and current earlemes are different.
=head1 Ambiguous lexing
Marpa allows ambiguous tokens.
Several Marpa tokens can start at a single parsing location.
Ambiguous tokens can be of various lengths.
Tokens can also overlap.
B<Potentially
ambiguous lexing>
occurs when more than one token starts
at a single earleme.
When potentially ambiguous lexing occurs,
it becomes possible for there to be more
than one sequence of tokens.
An B<actual lexical ambiguity> only occurs if
more than one of the potential token sequences is consistent with
the grammar.
If there is no actual lexical ambiguity,
Marpa will use the only token choice that is
consistent with the grammar.
When lexing is B<actually ambiguous>, Marpa
will use all the alternatives
consistent with the grammar.
When the lexing in a parse is actually ambiguous,
the parse will be ambiguous.
From the point of view of Marpa's semantics,
ambiguities caused by lexing look the
same as ambiguities caused by an ambiguous grammar.
In the standard
terminology,
if a grammar produces more than one parse tree
for any input,
then that grammar must be ambiguous.
In Marpa this is not strictly true.
In Marpa,
if the input is ambiguous,
even an unambiguous grammar can produce more than one parse.
=head1 Duplicate tokens
A duplicate token is a token of the same type
and the same length as another
that was read at the same earleme.
Duplicate tokens are impossible in the default, token-stream,
model.
This is because in the token-stream model only one token can be
read at each earleme.
In alternative models, more than one token may be read at
an earleme, and duplicates B<are> possible.
Marpa detects duplicate tokens and treats them as
"hard errors" --
Marpa throws an exception
when it sees a duplicate token.
Marpa's assumption is that
duplicate tokens indicate
an error at the application level.
An application can retry input after
a duplicate token, if it
catches the exception.
In the future, if recovery from duplicate tokens is found
to be a useful technique, Marpa may provide an option to change
its behavior, so that a soft failure is returned
when there is a duplicate token.
=head1 Earlemes: the details
While scanning, Marpa keeps track of the B<current earleme>.
Earlemes in a parse start at earleme 0 and increase numerically.
The earleme immediately following earleme 0 is earleme 1,
the earleme immediately following earleme 1 is earleme 2,
and so on.
The earleme immediately following earleme I<N> is always earleme I<N+1>.
B<Distance> in the earleme stream is
what you would intuitively expect it to be.
The distance between earleme I<X> and earleme I<Y> is
the absolute value of the difference between I<X> and I<Y>,
I<|X-Y|>.
The distance from earleme 3 to earleme 6,
for example, is 3 earlemes.
Whenever a token is given to Marpa to be scanned,
it starts at the current earleme.
In addition to the type and value of the token,
Marpa must be told the token's B<length> in earlemes.
The length of a Marpa token must be greater than zero.
This earleme length will become
the distance from the start of the
token to the end of the token.
If the length of the token is I<L>,
and the current earleme is I<C>,
the end of the token will be at earleme I<C+L>.
=head1 The character-per-earleme model
Many different models of the relationship between tokens and earlemes
are possible, but two are particularly important.
One is the one-token-per-earleme model,
which is the default,
and which has already been described.
The other is the one-character-per-earleme model.
In the one-character-per-earleme model,
every character will be treated as being exactly one
earleme in length.
If a token is more than one character in length,
that token will span earlemes.
When the lexing is ambiguous, tokens may overlap.
When a one-character-per-earleme model of input is used,
there may be many earlemes at which no tokens start.
For example,
in a straightforward character-per-earleme implementation
of a grammar for a language that allows
comments,
no tokens will start at
any earlemes which correspond to character locations inside
a comment.
=head1 Other input models
So far only the token-per-earleme and
character-per-earleme models have seen any
real use in Marpa programs.
But other models are certainly possible.
Using earlemes,
you can structure your input in almost any way you like.
There are only three restrictions:
=over 4
=item 1
Scanning always starts at earleme 0.
=item 2
All tokens starting at
earleme I<N> must be scanned before
any tokens starting at earleme I<N+1>.
In other words, the tokens must be scanned in non-decreasing order
by start earleme.
=item 3
Every token must have a length, in earlemes,
which is greater than zero.
In other words,
token length can never
be zero or negative.
=back
=head1 Copyright and License
=for Marpa::R2::Display
ignore: 1
Copyright 2022 Jeffrey Kegler
This file is part of Marpa::R2. Marpa::R2 is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
Marpa::R2 is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser
General Public License along with Marpa::R2. If not, see
http://www.gnu.org/licenses/.
=for Marpa::R2::Display::End
=cut
|