ReBulk
======
ReBulk is a Python library that performs advanced searches in strings
that would be hard to implement using the [re
module](https://docs.python.org/3/library/re.html) or [String
methods](https://docs.python.org/3/library/stdtypes.html#str) only.
It includes features such as `Patterns`, `Match` and `Rule` that allow
developers to build a custom and complex string matcher using a readable
and extendable API.
This project is hosted on GitHub: <https://github.com/Toilal/rebulk>
Install
=======
```sh
$ pip install rebulk
```
Usage
=====
Regular expression, string and function based patterns are declared in a
`Rebulk` object. It uses a fluent API to chain `string`, `regex`, and
`functional` methods to define various pattern types.
```python
>>> from rebulk import Rebulk
>>> bulk = Rebulk().string('brown').regex(r'qu\w+').functional(lambda s: (20, 25))
```
When the `Rebulk` object is fully configured, you can call the `matches`
method with an input string to retrieve all `Match` objects found by the
registered patterns.
```python
>>> bulk.matches("The quick brown fox jumps over the lazy dog")
[<brown:(10, 15)>, <quick:(4, 9)>, <jumps:(20, 25)>]
```
If multiple `Match` objects are found at the same position, only the
longer one is kept.
```python
>>> bulk = Rebulk().string('lakers').string('la')
>>> bulk.matches("the lakers are from la")
[<lakers:(4, 10)>, <la:(20, 22)>]
```
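The "longer match wins" behavior can be pictured with a small standalone sketch. It does not use rebulk and is not its actual implementation: `keep_longest` is a hypothetical helper that works on plain `(start, end)` tuples and assumes that any overlap counts as a conflict.

```python
def keep_longest(spans):
    """Keep spans from longest to shortest, skipping any span that
    overlaps one already kept.

    Hypothetical helper for illustration; not part of rebulk.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    def length(span):
        return span[1] - span[0]

    kept = []
    # Visit longer spans first so that they win over the shorter ones.
    for span in sorted(spans, key=length, reverse=True):
        if not any(overlaps(span, other) for other in kept):
            kept.append(span)
    return sorted(kept)


# 'lakers' at (4, 10) conflicts with 'la' at (4, 6); 'la' at (20, 22) is free.
spans = [(4, 10), (4, 6), (20, 22)]
result = keep_longest(spans)
```

With these spans, `result` holds the same two positions as the output above.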
String Patterns
===============
String patterns are based on the
[str.find](https://docs.python.org/3/library/stdtypes.html#str.find)
method to find matches, but return all matches in the string.
`ignore_case` can be enabled to ignore case.
```python
>>> Rebulk().string('la').matches("lalalilala")
[<la:(0, 2)>, <la:(2, 4)>, <la:(6, 8)>, <la:(8, 10)>]
>>> Rebulk().string('la').matches("LalAlilAla")
[<la:(8, 10)>]
>>> Rebulk().string('la', ignore_case=True).matches("LalAlilAla")
[<La:(0, 2)>, <lA:(2, 4)>, <lA:(6, 8)>, <la:(8, 10)>]
```
You can define several patterns with a single `string` method call.
```python
>>> Rebulk().string('Winter', 'coming').matches("Winter is coming...")
[<Winter:(0, 6)>, <coming:(10, 16)>]
```
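As a rough illustration of how repeated `str.find` calls can yield every occurrence, here is a minimal sketch in plain Python. `find_all` is a hypothetical helper, not rebulk's code, and it assumes that occurrences do not overlap and that case is ignored by lower-casing both strings.

```python
def find_all(text, sub, ignore_case=False):
    """Return the (start, end) span of every non-overlapping occurrence of sub."""
    if not sub:
        return []
    if ignore_case:
        # Lower-casing both sides is an assumption; fine for ASCII input.
        text, sub = text.lower(), sub.lower()
    spans = []
    start = text.find(sub)
    while start != -1:
        end = start + len(sub)
        spans.append((start, end))
        # Resume the search right after the previous occurrence.
        start = text.find(sub, end)
    return spans


exact = find_all("LalAlilAla", "la")
relaxed = find_all("LalAlilAla", "la", ignore_case=True)
```

The two calls produce the same positions as the `string('la')` examples earlier in this section.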
Regular Expression Patterns
===========================
Regular Expression patterns are based on a compiled regular expression.
The [re.finditer](https://docs.python.org/3/library/re.html#re.finditer)
method is used to find matches.
If the [regex module](https://pypi.python.org/pypi/regex) is available, it
can be used by rebulk instead of the default [re
module](https://docs.python.org/3/library/re.html). Enable it with the `REBULK_REGEX_ENABLED=1` environment variable.
```python
>>> Rebulk().regex(r'l\w').matches("lolita")
[<lo:(0, 2)>, <li:(2, 4)>]
```
You can define several patterns with a single `regex` method call.
```python
>>> Rebulk().regex(r'Wint\wr', r'com\w{3}').matches("Winter is coming...")
[<Winter:(0, 6)>, <coming:(10, 16)>]
```
All keyword arguments from
[re.compile](https://docs.python.org/3/library/re.html#re.compile) are
supported.
```python
>>> import re # import required for flags constant
>>> Rebulk().regex('L[A-Z]KERS', flags=re.IGNORECASE) \
... .matches("The LaKeRs are from La")
[<LaKeRs:(4, 10)>]
>>> Rebulk().regex('L[A-Z]', 'L[A-Z]KERS', flags=re.IGNORECASE) \
... .matches("The LaKeRs are from La")
[<La:(20, 22)>, <LaKeRs:(4, 10)>]
>>> Rebulk().regex(('L[A-Z]', re.IGNORECASE), ('L[a-z]KeRs')) \
... .matches("The LaKeRs are from La")
[<La:(20, 22)>, <LaKeRs:(4, 10)>]
```
If [regex module](https://pypi.python.org/pypi/regex) is available, it
automatically supports repeated captures.
```python
>>> # If regex module is available, repeated_captures is True by default.
>>> matches = Rebulk().regex(r'(\d+)(?:-(\d+))+').matches("01-02-03-04")
>>> matches[0].children # doctest:+SKIP
[<01:(0, 2)>, <02:(3, 5)>, <03:(6, 8)>, <04:(9, 11)>]
>>> # If regex module is not available, or if repeated_captures is forced to False.
>>> matches = Rebulk().regex(r'(\d+)(?:-(\d+))+', repeated_captures=False) \
... .matches("01-02-03-04")
>>> matches[0].children
[<01:(0, 2)+initiator=01-02-03-04>, <04:(9, 11)+initiator=01-02-03-04>]
```
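The second output above follows from how the standard `re` module treats a capture group inside a repeated construct: each repetition overwrites the previous capture, so only the last one can be read back. A short check with `re` alone (no rebulk involved):

```python
import re

# A capture group inside a repeated construct is overwritten on each repetition.
match = re.match(r'(\d+)(?:-(\d+))+', "01-02-03-04")

whole = match.group(0)      # the full matched text
groups = match.groups()     # group 2 only holds its last repetition
last_span = match.span(2)   # position of that last repetition
```

Here `groups` contains only `'01'` and `'04'`; the intermediate `'02'` and `'03'` are not retained by `re`.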
- `abbreviations`
Defined as a list of 2-tuples, where each tuple is an abbreviation. It
simply replaces `tuple[0]` with `tuple[1]` in the expression.
```python
>>> Rebulk().regex(r'Custom-separators', abbreviations=[("-", r"[\W_]+")])\
... .matches("Custom_separators using-abbreviations")
[<Custom_separators:(0, 17)>]
```
Functional Patterns
===================
Functional Patterns are based on the evaluation of a function.
The function should have the same parameters as the `Rebulk.matches` method,
that is the input string, and must return at least the start index and end
index of the `Match` object.
```python
>>> def func(string):
...     index = string.find('?')
...     if index > -1:
...         return 0, index - 11
>>> Rebulk().functional(func).matches("Why do simple ? Forget about it ...")
[<Why:(0, 3)>]
```
You can also return a dict of keyword arguments for the `Match` object.
You can define several patterns with a single `functional` method call,
and the function used can return multiple matches.
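The two return styles mentioned above might look like the following sketch. Only the functions are shown, without calling rebulk. The exact shapes are assumptions to check against the rebulk version in use: a dict with `start` and `end` keys for the keyword-argument form, and a list of `(start, end)` tuples for multiple matches. `first_word` and `all_words` are hypothetical names.

```python
def first_word(string):
    """Return keyword arguments describing the first word, or None."""
    end = string.find(' ')
    if end > -1:
        # Assumed shape: a dict of keyword arguments for the Match object.
        return {'start': 0, 'end': end}
    return None


def all_words(string):
    """Return one (start, end) tuple per space-separated word."""
    spans = []
    position = 0
    for word in string.split(' '):
        if word:
            spans.append((position, position + len(word)))
        # Skip the word and the single space that followed it.
        position += len(word) + 1
    return spans


text = "Why do simple ? Forget about it ..."
single = first_word(text)
several = all_words(text)[:3]
```

Both functions could then be registered in one call, along the lines of `Rebulk().functional(first_word, all_words)`.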
Chain Patterns
==============
Chain Patterns are an ordered composition of string, functional and regex
patterns. A repeater can be set on each part of the chain to define how
many times that part may repeat.
```python
>>> r = Rebulk().regex_defaults(flags=re.IGNORECASE)\
... .defaults(children=True, formatter={'episode': int, 'version': int})\
... .chain()\
... .regex(r'e(?P<episode>\d{1,4})').repeater(1)\
... .regex(r'v(?P<version>\d+)').repeater('?')\
... .regex(r'[ex-](?P<episode>\d{1,4})').repeater('*')\
... .close() # .repeater(1) could be omitted as it's the default behavior
>>> r.matches("This is E14v2-15-16-17").to_dict() # converts matches to dict
MatchesDict([('episode', [14, 15, 16, 17]), ('version', 2)])
```
Patterns parameters
===================
All patterns have options that can be given as keyword arguments.
- `validator`
Function to validate the `Match` value given by the pattern. Can also be
a `dict`, to apply a `validator` to the pattern whose name matches the key.
```python
>>> def check_leap_year(match):
...     return int(match.value) in [1980, 1984, 1988]
>>> matches = Rebulk().regex(r'\d{4}', validator=check_leap_year) \
... .matches("In year 1982 ...")
>>> len(matches)
0
>>> matches = Rebulk().regex(r'\d{4}', validator=check_leap_year) \
... .matches("In year 1984 ...")
>>> len(matches)
1
```
Some base validator functions are available in `rebulk.validators`
module. Most of those functions have to be configured using
`functools.partial` to map them to function accepting a single `match`
argument.
- `formatter`
Function to convert the `Match` value given by the pattern. Can also be
a `dict`, to apply a `formatter` to the matches whose name matches the key.
```python
>>> def year_formatter(value):
...     return int(value)
>>> matches = Rebulk().regex(r'\d{4}', formatter=year_formatter) \
... .matches("In year 1982 ...")
>>> isinstance(matches[0].value, int)
True
```
- `pre_match_processor` / `post_match_processor`
Function to mutate or invalidate a match generated by a pattern.
The function has a single parameter, which is the `Match` object. If
the function returns `False`, the match is considered invalid.
If the function returns a match instance, that instance replaces the
original match in the process.
- `post_processor`
Function to change the default output of the pattern. Function
parameters are Matches list and Pattern object.
- `name`
The name of the pattern. It is automatically passed to `Match`
objects generated by this pattern.
- `tags`
A list of strings that qualify this pattern.
- `value`
Override value property for generated `Match` objects. Can also be a
`dict`, to use `value` with pattern named with key.
- `validate_all`
By default, validator is called for returned `Match` objects only.
Enable this option to validate them all, parent and children
included.
- `format_all`
By default, formatter is called for returned `Match` values only.
Enable this option to format them all, parent and children included.
- `disabled`
A `function(context)` to disable the pattern if returning `True`.
- `children`
If `True`, all children `Match` objects will be retrieved instead of
a single parent `Match` object.
- `private`
If `True`, `Match` objects generated from this pattern are available
internally only. They will be removed at the end of `Rebulk.matches`
method call.
- `private_parent`
Force parent matches to be returned and flag them as private.
- `private_children`
Force children matches to be returned and flag them as private.
- `private_names`
Match names that will be declared as private.
- `ignore_names`
Matches names that will be ignored from the pattern output, after
validation.
- `marker`
If `True`, `Match` objects generated from this pattern will be
marker matches instead of standard matches. They won't be included
in the `Matches` sequence, but will be available in the `Matches.markers`
sequence (see `Markers` section).
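The `functools.partial` technique mentioned for the `validator` option can be sketched as follows. `value_between` is a hypothetical validator written for this example, not a function from `rebulk.validators`, and `SimpleNamespace` is used as a stand-in for a `Match` object so that the snippet runs on its own.

```python
from functools import partial
from types import SimpleNamespace


def value_between(low, high, match):
    """Hypothetical two-setting validator; not part of rebulk.validators."""
    return low <= int(match.value) <= high


# Freeze the settings so that only the `match` argument remains.
eighties = partial(value_between, 1980, 1989)

# SimpleNamespace stands in for a Match object exposing a `value` property.
inside = eighties(SimpleNamespace(value="1984"))
outside = eighties(SimpleNamespace(value="1992"))
```

The resulting single-argument callable is the kind of object that could be passed as `validator=eighties` to a pattern.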
Match
=====
A `Match` object is the result created by a registered pattern.
It has a `value` property defined, and position indices are available
through `start`, `end` and `span` properties.
In some cases, it contains children `Match` objects in the `children`
property, and each child `Match` object references its parent in the `parent`
property. Also, a `name` property can be defined for the match.
If groups are defined in a Regular Expression pattern, each group match
will be converted to a single `Match` object. If a group has a name
defined (`(?P<name>group)`), it is set as the `name` property of the child
`Match` object. The whole regexp match (`re.group(0)`) will be converted
to the main `Match` object, and all subgroups (1, 2, ... n) will be
converted to `children` matches of the main `Match` object.
```python
>>> matches = Rebulk() \
... .regex(r"One, (?P<one>\w+), Two, (?P<two>\w+), Three, (?P<three>\w+)") \
... .matches("Zero, 0, One, 1, Two, 2, Three, 3, Four, 4")
>>> matches
[<One, 1, Two, 2, Three, 3:(9, 33)>]
>>> for child in matches[0].children:
...     '%s = %s' % (child.name, child.value)
'one = 1'
'two = 2'
'three = 3'
```
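The parent/children structure mirrors what the standard `re` module already exposes: group 0 is the whole match and groups 1 to n are the sub-matches. The sketch below uses `re` only, with plain tuples standing in for `Match` objects, to show where the names and positions come from. It is an illustration, not rebulk's conversion code.

```python
import re

pattern = re.compile(r"One, (?P<one>\w+), Two, (?P<two>\w+), Three, (?P<three>\w+)")
found = pattern.search("Zero, 0, One, 1, Two, 2, Three, 3, Four, 4")

# Map group numbers back to their names (unnamed groups would get None).
names = {number: name for name, number in pattern.groupindex.items()}

# Group 0 plays the role of the parent match.
parent = (found.group(0), found.span(0))

# Groups 1..n play the role of the children.
children = [
    (names.get(number), found.group(number), found.span(number))
    for number in range(1, pattern.groups + 1)
]
```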
It's possible to retrieve only children by using the `children` parameter.
You can also customize the way the structure is generated with the `every`,
`private_parent` and `private_children` parameters.
```python
>>> matches = Rebulk() \
... .regex(r"One, (?P<one>\w+), Two, (?P<two>\w+), Three, (?P<three>\w+)", children=True) \
... .matches("Zero, 0, One, 1, Two, 2, Three, 3, Four, 4")
>>> matches
[<1:(14, 15)+name=one+initiator=One, 1, Two, 2, Three, 3>, <2:(22, 23)+name=two+initiator=One, 1, Two, 2, Three, 3>, <3:(32, 33)+name=three+initiator=One, 1, Two, 2, Three, 3>]
```
The `Match` object has the following properties that can be given to
`Pattern` objects:
- `formatter`
Function to convert `Match` value given by the pattern. Can also be
a `dict`, to use `formatter` with matches named with key.
```python
>>> def year_formatter(value):
...     return int(value)
>>> matches = Rebulk().regex(r'\d{4}', formatter=year_formatter) \
... .matches("In year 1982 ...")
>>> isinstance(matches[0].value, int)
True
```
- `format_all`
By default, formatter is called for returned `Match` values only.
Enable this option to format them all, parent and children included.
- `conflict_solver`
A `function(match, conflicting_match)` used to solve a conflict.
The returned object will be removed from matches by the `ConflictSolver`
default rule. If the `__default__` string is returned, it will fall back
to the default behavior of keeping the longer match.
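A conflict solver following the contract described for `conflict_solver` might look like this sketch. `prefer_named` is a hypothetical function, and `SimpleNamespace` objects stand in for `Match` objects so that it can be tried without rebulk. It returns the match to remove, or `'__default__'` when it has no opinion.

```python
from types import SimpleNamespace


def prefer_named(match, conflicting_match):
    """Return the match to remove, or '__default__' to defer to the default rule."""
    if match.name and not conflicting_match.name:
        return conflicting_match
    if conflicting_match.name and not match.name:
        return match
    return '__default__'


# SimpleNamespace stands in for Match objects exposing a `name` property.
named = SimpleNamespace(name="year", value="1984")
unnamed = SimpleNamespace(name=None, value="984")

removed = prefer_named(named, unnamed)
fallback = prefer_named(named, named)
```

In this example the unnamed match is the one flagged for removal, and two named matches are left to the default longer-match rule.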
Matches
=======
A `Matches` object holds the result of `Rebulk.matches` method call.
It's a sequence of `Match` objects and it behaves like a list.
All methods accept a `predicate` function to filter `Match` objects
using a callable, and an `index` int to retrieve a single element from
the default returned matches.
It has the following additional methods and properties on it.
- `starting(index, predicate=None, index=None)`
Retrieves a list of `Match` objects that starts at given index.
- `ending(index, predicate=None, index=None)`
Retrieves a list of `Match` objects that ends at given index.
- `previous(match, predicate=None, index=None)`
Retrieves a list of `Match` objects that are previous and nearest to
match.
- `next(match, predicate=None, index=None)`
Retrieves a list of `Match` objects that are next and nearest to
match.
- `tagged(tag, predicate=None, index=None)`
Retrieves a list of `Match` objects that have the given tag defined.
- `named(name, predicate=None, index=None)`
Retrieves a list of `Match` objects that have the given name.
- `range(start=0, end=None, predicate=None, index=None)`
Retrieves a list of `Match` objects for given range, sorted from
start to end.
- `holes(start=0, end=None, formatter=None, ignore=None, predicate=None, index=None)`
Retrieves a list of *hole* `Match` objects for given range. A hole
match is created for each range where no match is available.
- `conflicting(match, predicate=None, index=None)`
Retrieves a list of `Match` objects that conflicts with given match.
- `chain_before(position, seps, start=0, predicate=None, index=None)`
Retrieves a list of chained matches, before position, matching
predicate and separated by characters from seps only.
- `chain_after(position, seps, end=None, predicate=None, index=None)`
Retrieves a list of chained matches, after position, matching
predicate and separated by characters from seps only.
- `at_match(match, predicate=None, index=None)`
Retrieves a list of `Match` objects at the same position as match.
- `at_span(span, predicate=None, index=None)`
Retrieves a list of `Match` objects from given (start, end) tuple.
- `at_index(pos, predicate=None, index=None)`
Retrieves a list of `Match` objects from given position.
- `names`
Retrieves a sequence of all `Match.name` properties.
- `tags`
Retrieves a sequence of all `Match.tags` properties.
- `to_dict(details=False, first_value=False, enforce_list=False)`
Convert to an ordered dict, with `Match.name` as key and
`Match.value` as value.
It's a subclass of
[OrderedDict](https://docs.python.org/2/library/collections.html#collections.OrderedDict)
that contains a `matches` property, which is a dict with `Match.name`
as key and a list of `Match` objects as value.
If `first_value` is `False` (the default) and distinct values are found
for the same name, the values are wrapped in a list. If `True`, only the
first value is kept, and the lists of values can be retrieved with
`values_list`, which is a dict with `Match.name` as key and a list of
`Match.value` as value.
If `enforce_list` is `True`, all values will be wrapped in a list,
even if a single value is found.
If `details` is `True`, `Match.value` objects are replaced with the
complete `Match` object.
- `markers`
A custom `Matches` sequence specialized for `markers` matches (see
below).
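To make the idea behind `holes` concrete, here is a minimal standalone sketch that computes the uncovered ranges between a set of spans. `find_holes` is a hypothetical helper working on plain `(start, end)` tuples. It is not the `Matches.holes` implementation and it ignores the `formatter`, `ignore`, `predicate` and `index` options.

```python
def find_holes(spans, start, end):
    """Return the (start, end) gaps in [start, end) not covered by any span."""
    gaps = []
    cursor = start
    for span_start, span_end in sorted(spans):
        if span_start > cursor:
            # Nothing covers [cursor, span_start): record it as a hole.
            gaps.append((cursor, min(span_start, end)))
        cursor = max(cursor, span_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


text = "The quick brown fox"
matched = [(4, 9), (10, 15)]  # positions of 'quick' and 'brown'
gaps = find_holes(matched, 0, len(text))
hole_values = [text[a:b] for a, b in gaps]
```

Each gap corresponds to a stretch of the input where no match is available, which is what a hole match represents.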
Markers
=======
If you have defined some patterns with the `marker` property, then
`Matches.markers` points to a special `Matches` sequence that contains
only marker matches. This sequence supports all methods from
`Matches`.
Marker matches are not intended to be used in the final result, but can be
used to implement a `Rule`.
Rules
=====
Rules are a convenient and readable way to implement advanced
conditional logic involving several `Match` objects. When a rule is
triggered, it can perform an action on `Matches` object, like filtering
out, adding additional tags or renaming.
Rules are implemented by extending the abstract `Rule` class. They are
registered using the `Rebulk.rules` method by giving either a `Rule`
instance, a `Rule` class or a module containing only `Rule` classes.
For a rule to be triggered, `Rule.when` method must return `True`, or a
non empty list of `Match` objects, or any other truthy object. When
triggered, `Rule.then` method is called to perform the action with
`when_response` parameter defined as the response of `Rule.when` call.
Instead of implementing the `Rule.then` method, you can define a `consequence`
class property with a Consequence class or instance, like
`RemoveMatch`, `RenameMatch` or `AppendMatch`. You can also use a list
of consequences when required: `when_response` must then be iterable,
and the elements of this iterable will be given to each consequence in the
same order.
When many rules are registered, it can be useful to set the `priority` class
variable to an integer that orders rule executions
(higher priorities are executed first). You can also define
`dependency` to declare another `Rule` class as a dependency of the current
rule, meaning that the dependency will be executed before it.
For all rules with the same `priority` value, `when` is called first,
and `then` is called after all of them.
```python
>>> from rebulk import Rule, RemoveMatch
>>> class FirstOnlyRule(Rule):
...     consequence = RemoveMatch
...
...     def when(self, matches, context):
...         grabbed = matches.named("grabbed", 0)
...         if grabbed and matches.previous(grabbed):
...             return grabbed
>>> rebulk = Rebulk()
>>> rebulk.regex("This match(.*?)grabbed", name="grabbed")
<...Rebulk object ...>
>>> rebulk.regex("if it's(.*?)first match", private=True)
<...Rebulk object at ...>
>>> rebulk.rules(FirstOnlyRule)
<...Rebulk object at ...>
>>> rebulk.matches("This match is grabbed only if it's the first match")
[<This match is grabbed:(0, 21)+name=grabbed>]
>>> rebulk.matches("if it's NOT the first match, This match is NOT grabbed")
[]
```
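The priority ordering described in this section can be pictured with a toy scheduler. This is only one reading of that description, not rebulk's rule engine: `LogRule` and `run` are hypothetical, dependencies are ignored, and the rules merely record the order in which they are called.

```python
from itertools import groupby


class LogRule:
    """Hypothetical stand-in for a rule; it only records the calls it receives."""

    def __init__(self, name, priority):
        self.name = name
        self.priority = priority

    def when(self, log):
        log.append('when ' + self.name)
        return True

    def then(self, log):
        log.append('then ' + self.name)


def run(rules, log):
    """Run rules by descending priority, making all `when` calls of a
    priority level before any of its `then` calls."""
    ordered = sorted(rules, key=lambda rule: rule.priority, reverse=True)
    for _, group in groupby(ordered, key=lambda rule: rule.priority):
        triggered = [rule for rule in group if rule.when(log)]
        for rule in triggered:
            rule.then(log)


log = []
run([LogRule('a', 0), LogRule('b', 10), LogRule('c', 0)], log)
```

In this sketch rule `b` runs completely first because of its higher priority, then `a` and `c` are both evaluated with `when` before either `then` is called.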