1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331
|
# Grammar Reference
## Definitions
A **grammar** is a list of rules and terminals, that together define a language.
Terminals define the alphabet of the language, while rules define its structure.
In Lark, a terminal may be a string, a regular expression, or a concatenation of these and other terminals.
Each rule is a list of terminals and rules, whose location and nesting define the structure of the resulting parse-tree.
A **parsing algorithm** is an algorithm that takes a grammar definition and a sequence of symbols (members of the alphabet), and matches the entirety of the sequence by searching for a structure that is allowed by the grammar.
### General Syntax and notes
Grammars in Lark are based on [EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form) syntax, with several enhancements.
EBNF is basically a short-hand for common BNF patterns.
Optionals are expanded:
```ebnf
a b? c -> (a c | a b c)
```
Repetition is extracted into a recursion:
```ebnf
a: b* -> a: _b_tag
_b_tag: (_b_tag b)?
```
And so on.
Lark grammars are composed of a list of definitions and directives, each on its own line. A definition is either a named rule, or a named terminal, with the following syntax, respectively:
```c
rule: <EBNF EXPRESSION>
| etc.
TERM: <EBNF EXPRESSION> // Rules aren't allowed
```
**Comments** start with
either `//` (C++ style) or `#` (Python style, since version 1.1.6)
and last to the end of the line.
Lark begins the parse with the rule 'start', unless specified otherwise in the options.
Names of rules are always in lowercase, while names of terminals are always in uppercase. This distinction has practical effects, for the shape of the generated parse-tree, and the automatic construction of the lexer (aka tokenizer, or scanner).
## Terminals
Terminals are used to match text into symbols. They can be defined as a combination of literals and other terminals.
**Syntax:**
```html
<NAME> [. <priority>] : <literals-and-or-terminals>
```
Terminal names must be uppercase.
Literals can be one of:
* `"string"`
* `/regular expression+/`
* `"case-insensitive string"i`
* `/re with flags/imulx`
* Literal range: `"a".."z"`, `"1".."9"`, etc.
Terminals also support grammar operators, such as `|`, `+`, `*` and `?`.
Terminals are a linear construct, and therefore may not contain themselves (recursion isn't allowed).
### Templates
Templates are expanded when preprocessing the grammar.
Definition syntax:
```ebnf
my_template{param1, param2, ...}: <EBNF EXPRESSION>
```
Use syntax:
```ebnf
some_rule: my_template{arg1, arg2, ...}
```
Example:
```ebnf
_separated{x, sep}: x (sep x)* // Define a sequence of 'x sep x sep x ...'
num_list: "[" _separated{NUMBER, ","} "]" // Will match "[1, 2, 3]" etc.
```
### Priority
Terminals can be assigned a priority to influence lexing. Terminal priorities
are signed integers with a default value of 0.
When using a lexer, the highest priority terminals are always matched first.
When using Earley's dynamic lexing, terminal priorities are used to prefer
certain lexings and resolve ambiguity.
### Regexp Flags
You can use flags on regexps and strings. For example:
```perl
SELECT: "select"i //# Will ignore case, and match SELECT or Select, etc.
MULTILINE_TEXT: /.+/s
SIGNED_INTEGER: /
[+-]? # the sign
(0|[1-9][0-9]*) # the digits
/x
```
Supported flags are one of: `imslux`. See Python's regex documentation for more details on each one.
Regexps/strings of different flags can only be concatenated in Python 3.6+
#### Notes for when using a lexer:
When using a lexer (basic or contextual), it is the grammar-author's responsibility to make sure the literals don't collide, or that if they do, they are matched in the desired order. Literals are matched according to the following precedence:
1. Highest priority first (priority is specified as: TERM.number: ...)
2. Length of match (for regexps, the longest theoretical match is used)
3. Length of literal / pattern definition
4. Name
**Examples:**
```perl
IF: "if"
INTEGER : /[0-9]+/
INTEGER2 : ("0".."9")+ //# Same as INTEGER
DECIMAL.2: INTEGER? "." INTEGER //# Will be matched before INTEGER
WHITESPACE: (" " | /\t/ )+
SQL_SELECT: "select"i
```
### Regular expressions & Ambiguity
Each terminal is eventually compiled to a regular expression. All the operators and references inside it are mapped to their respective expressions.
For example, in the following grammar, `A1` and `A2`, are equivalent:
```perl
A1: "a" | "b"
A2: /a|b/
```
This means that inside terminals, Lark cannot detect or resolve ambiguity, even when using Earley.
For example, for this grammar:
```perl
start : (A | B)+
A : "a" | "ab"
B : "b"
```
We get only one possible derivation, instead of two:
```bash
>>> p = Lark(g, ambiguity="explicit")
>>> p.parse("ab")
Tree('start', [Token('A', 'ab')])
```
This is happening because Python's regex engine always returns the best matching option. There is no way to access the alternatives.
If you find yourself in this situation, the recommended solution is to use rules instead.
Example:
```python
>>> p = Lark("""start: (a | b)+
... !a: "a" | "ab"
... !b: "b"
... """, ambiguity="explicit")
>>> print(p.parse("ab").pretty())
_ambig
start
a ab
start
a a
b b
```
## Rules
**Syntax:**
```html
<name> : <items-to-match> [-> <alias> ]
| ...
```
Names of rules and aliases are always in lowercase.
Rule definitions can be extended to the next line by using the OR operator (signified by a pipe: `|` ).
An alias is a name for the specific rule alternative. It affects tree construction.
Each item is one of:
* `rule`
* `TERMINAL`
* `"string literal"` or `/regexp literal/`
* `(item item ..)` - Group items
* `[item item ..]` - Maybe. Same as `(item item ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match.
* `item?` - Zero or one instances of item ("maybe")
* `item*` - Zero or more instances of item
* `item+` - One or more instances of item
* `item ~ n` - Exactly *n* instances of item
* `item ~ n..m` - Between *n* to *m* instances of item (not recommended for wide ranges, due to performance issues)
**Examples:**
```perl
hello_world: "hello" "world"
mul: (mul "*")? number //# Left-recursion is allowed and encouraged!
expr: expr operator expr
| value //# Multi-line, belongs to expr
four_words: word ~ 4
```
### Priority
Like terminals, rules can be assigned a priority. Rule priorities are signed
integers with a default value of 0.
When using LALR, the highest priority rules are used to resolve collision errors.
When using Earley, rule priorities are used to resolve ambiguity.
<a name="dirs"></a>
## Directives
### %ignore
All occurrences of the terminal will be ignored, and won't be part of the parse.
Using the `%ignore` directive results in a cleaner grammar.
It's especially important for the LALR(1) algorithm, because adding whitespace (or comments, or other extraneous elements) explicitly in the grammar, harms its predictive abilities, which are based on a lookahead of 1.
**Syntax:**
```html
%ignore <TERMINAL>
```
**Examples:**
```perl
%ignore " "
COMMENT: "#" /[^\n]/*
%ignore COMMENT
```
### %import
Allows one to import terminals and rules from lark grammars.
When importing rules, all their dependencies will be imported into a namespace, to avoid collisions. It's not possible to override their dependencies (e.g. like you would when inheriting a class).
**Syntax:**
```html
%import <module>.<TERMINAL>
%import <module>.<rule>
%import <module>.<TERMINAL> -> <NEWTERMINAL>
%import <module>.<rule> -> <newrule>
%import <module> (<TERM1>, <TERM2>, <rule1>, <rule2>)
```
If the module path is absolute, Lark will attempt to load it from the built-in directory (which currently contains `common.lark`, `python.lark`, and `unicode.lark`).
If the module path is relative, such as `.path.to.file`, Lark will attempt to load it from the current working directory. Grammars must have the `.lark` extension.
The rule or terminal can be imported under another name with the `->` syntax.
**Example:**
```perl
%import common.NUMBER
%import .terminals_file (A, B, C)
%import .rules_file.rulea -> ruleb
```
Note that `%ignore` directives cannot be imported. Imported rules will abide by the `%ignore` directives declared in the main grammar.
### %declare
Declare a terminal without defining it. Useful for plugins.
### %override
Override a rule or terminals, affecting all references to it, even in imported grammars.
Useful for implementing an inheritance pattern when importing grammars.
**Example:**
```perl
%import my_grammar (start, number, NUMBER)
// Add hex support to my_grammar
%override number: NUMBER | /0x\w+/
```
### %extend
Extend the definition of a rule or terminal, e.g. add a new option on what it can match, like when separated with `|`.
Useful for splitting up a definition of a complex rule with many different options over multiple files.
Can also be used to implement a plugin system where a core grammar is extended by others.
**Example:**
```perl
%import my_grammar (start, NUMBER)
// Add hex support to my_grammar
%extend NUMBER: /0x\w+/
```
For both `%extend` and `%override`, there is not requirement for a rule/terminal to come from another file, but that is probably the most common usecase
|