.. _filters:
.. All code examples here should have a unique URL that maps to
an entry in test/data/filter_documentation_testdata.yaml which
will be used to provide input/output data for the filter example
so that the examples can be verified to be correct automatically.
Filters
=======
.. only:: man
Synopsis
--------
urlwatch --edit
Description
-----------
Each job can have two filter stages configured, each with one or more
filters that are processed one after another:
* Applied to the downloaded page before diffing the changes (``filter``)
* Applied to the diff result before reporting the changes (``diff_filter``)
While creating your filter pipeline, you might want to preview what the
filtered output looks like. You can do so by first configuring your job
and then running urlwatch with the ``--test-filter`` command, passing in
the index (from ``--list``) or the URL/location of the job to be tested:
::
urlwatch --test-filter 1 # Test the first job in the list
urlwatch --test-filter https://example.net/ # Test the job with the given URL
The output of this command is the filtered plaintext of the job. In a
real urlwatch run, this output is the input to the diff algorithm.
The ``filter`` is only applied to new content; the old content was
already filtered when it was retrieved. This means that changes to
``filter`` are not visible when reporting unchanged contents
(see :ref:`configuration_display` for details), and the diff output
will be between (old content with filter at the time old content was
retrieved) and (new content with current filter).
Once urlwatch has collected at least 2 historic snapshots of a job
(two different states of a webpage) you can use the command-line
option ``--test-diff-filter`` to test your ``diff_filter`` settings;
this will use historic data cached locally.
Built-in filters
----------------
The list of built-in filters can be retrieved using::
urlwatch --features
At the moment, the following filters are built-in:
- **beautify**: Beautify HTML
- **css**: Filter XML/HTML using CSS selectors
- **csv2text**: Convert CSV to plaintext
- **element-by-class**: Get all HTML elements by class
- **element-by-id**: Get an HTML element by its ID
- **element-by-style**: Get all HTML elements by style
- **element-by-tag**: Get an HTML element by its tag
- **format-json**: Convert to formatted json
- **grep**: Filter only lines matching a regular expression
- **grepi**: Remove lines matching a regular expression
- **hexdump**: Convert binary data to hex dump format
- **html2text**: Convert HTML to plaintext
- **pdf2text**: Convert PDF to plaintext
- **pretty-xml**: Pretty-print XML
- **ical2text**: Convert `iCalendar`_ to plaintext
- **ocr**: Convert text in images to plaintext using Tesseract OCR
- **re.sub**: Replace text with regular expressions using Python's re.sub
- **re.findall**: Find all non-overlapping matches using Python's re.findall
- **reverse**: Reverse input items
- **sha1sum**: Calculate the SHA-1 checksum of the content
- **shellpipe**: Filter using a shell command
- **sort**: Sort input items
- **remove-duplicate-lines**: Remove duplicate lines (case sensitive)
- **strip**: Strip leading and trailing whitespace
- **striplines**: Strip leading and trailing whitespace in each line
- **xpath**: Filter XML/HTML using XPath expressions
- **jq**: Filter, transform and extract values from JSON
.. To convert the "urlwatch --features" output, use:
sed -e 's/^ \* \(.*\) - \(.*\)$/- **\1**: \2/'
.. _iCalendar: https://en.wikipedia.org/wiki/ICalendar
Picking out elements from a webpage
-----------------------------------
You can pick out a single HTML element with the built-in filters. For
example, to extract ``<div id="something">...</div>`` from a page, you
can use the following in your ``urls.yaml``:
.. code:: yaml
url: http://example.org/idtest.html
filter:
- element-by-id: something
Filters can also be chained, so you can run ``html2text`` on the result:
.. code:: yaml
url: http://example.net/id2text.html
filter:
- element-by-id: something
- html2text
Chaining multiple filters
-------------------------
The example ``urls.yaml`` file also demonstrates the use of built-in
filters. Here, three filters are chained (``html2text``, ``grep`` and
``strip``) to extract a single info field from a webpage:
.. code:: yaml
url: https://example.net/version.html
filter:
- html2text
- grep: "Current.*version"
- strip
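Conceptually, a filter chain is function composition: each filter receives
the output of the previous one. The following Python sketch mimics the
``grep`` and ``strip`` steps on text that has already been converted from
HTML. The function names and the sample text are hypothetical, and this is
not urlwatch's actual implementation:

```python
import re
from functools import reduce


def grep(pattern):
    # Build a filter that keeps only lines matching the regular expression
    def apply(text):
        return "\n".join(
            line for line in text.splitlines() if re.search(pattern, line)
        )
    return apply


def strip(text):
    # Remove leading and trailing whitespace from the whole text
    return text.strip()


def run_chain(text, filters):
    # Feed the output of each filter into the next one
    return reduce(lambda data, current: current(data), filters, text)


page_text = "  Welcome\n  Current stable version: 2.28  \n  Contact us\n"
result = run_chain(page_text, [grep(r"Current.*version"), strip])
```

The order matters: running ``strip`` before ``grep`` would only trim the
outer whitespace of the whole page, leaving the matched line padded.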
Extracting only the ``<body>`` tag of a page
--------------------------------------------
If you want to extract only the body tag you can use this filter:
.. code:: yaml
url: https://example.org/bodytag.html
filter:
- element-by-tag: body
Filtering based on an XPath expression
--------------------------------------
To filter based on an
`XPath <https://www.w3.org/TR/1999/REC-xpath-19991116/>`__ expression,
you can use the ``xpath`` filter like so:
.. code:: yaml
url: https://example.net/xpath.html
filter:
- xpath: /html/body/marquee
This filters only the ``<marquee>`` elements directly below the ``<body>``
element, which in turn must be below the ``<html>`` element of the document,
stripping out everything else.
See Microsoft’s `XPath Examples <https://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx>`__ page for some other examples.
You can also find the XPath of an HTML node in the Chromium/Google Chrome developer tools by right-clicking the node and selecting ``Copy XPath``.
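To get a feeling for what a path like ``/html/body/marquee`` selects, here
is a minimal Python sketch. It assumes a well-formed sample document and
uses the limited XPath subset of the standard library's ``ElementTree``
instead of a full XPath engine, so it only illustrates the matching rule:

```python
import xml.etree.ElementTree as ET

doc = """<html>
  <body>
    <marquee>Breaking news</marquee>
    <div><marquee>Nested, not matched</marquee></div>
    <marquee>More news</marquee>
  </body>
</html>"""

root = ET.fromstring(doc)  # root is the <html> element
# Relative form of /html/body/marquee: only <marquee> elements that are
# direct children of <body>, which is a direct child of <html>
matches = [element.text for element in root.findall("./body/marquee")]
```

The ``<marquee>`` nested inside the ``<div>`` is not selected, because the
path only descends one level below ``<body>``.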
Filtering based on CSS selectors
--------------------------------
To filter based on a `CSS
selector <https://www.w3.org/TR/2011/REC-css3-selectors-20110929/>`__,
you can use the ``css`` filter like so:
.. code:: yaml
url: https://example.net/css.html
filter:
- css: ul#groceries > li.unchecked
This would filter only ``<li class="unchecked">`` tags directly
below ``<ul id="groceries">`` elements.
Some limitations and extensions exist as explained in `cssselect’s
documentation <https://cssselect.readthedocs.io/en/latest/#supported-selectors>`__.
Using XPath and CSS filters with XML and exclusions
---------------------------------------------------
By default, XPath and CSS filters are set up for HTML documents.
However, it is possible to use them for XML documents as well (these
examples parse an RSS feed and filter only the titles and publication
dates):
.. code:: yaml
url: https://example.com/blog/xpath-index.rss
filter:
- xpath:
path: '//item/title/text()|//item/pubDate/text()'
method: xml
.. code:: yaml
url: http://example.com/blog/css-index.rss
filter:
- css:
selector: 'item > title, item > pubDate'
method: xml
- html2text: re
To match an element in an `XML
namespace <https://www.w3.org/TR/xml-names/>`__, use a namespace prefix
before the tag name. Use a ``:`` to separate the namespace prefix and
the tag name in an XPath expression, and use a ``|`` in a CSS selector.
.. code:: yaml
url: https://example.net/feed/xpath-namespace.xml
filter:
- xpath:
path: '//item/media:keywords/text()'
method: xml
namespaces:
media: http://search.yahoo.com/mrss/
.. code:: yaml
url: http://example.org/feed/css-namespace.xml
filter:
- css:
selector: 'item > media|keywords'
method: xml
namespaces:
media: http://search.yahoo.com/mrss/
- html2text
Alternatively, use the XPath expression ``//*[name()='<tag_name>']`` to
bypass the namespace entirely.
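The role of the ``namespaces`` mapping can be illustrated with a short
Python sketch. It uses the standard library's ``ElementTree`` and a made-up
feed, so treat it as an approximation of the idea rather than of urlwatch's
internals:

```python
import xml.etree.ElementTree as ET

feed = """<rss xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>First post</title>
      <media:keywords>alpha, beta</media:keywords>
    </item>
    <item>
      <title>Second post</title>
      <media:keywords>gamma</media:keywords>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(feed)
# Maps the prefix used in the path to the namespace URI
namespaces = {"media": "http://search.yahoo.com/mrss/"}
keywords = [
    element.text
    for element in root.findall(".//item/media:keywords", namespaces)
]
```

The prefix in the path is resolved through the mapping you supply, so it
does not have to be the same prefix that the document itself uses, as long
as the namespace URI is identical.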
Another useful option with XPath and CSS filters is ``exclude``.
Elements selected by this ``exclude`` expression are removed from the
final result. For example, the following job will not have any ``<a>``
tag in its results:
.. code:: yaml
url: https://example.org/css-exclude.html
filter:
- css:
selector: body
exclude: a
Limiting the returned items from a CSS Selector or XPath
--------------------------------------------------------
If you only want to return a subset of the items returned by a CSS
selector or XPath filter, you can use two additional subfilters:
* ``skip``: How many elements to skip from the beginning (default: 0)
* ``maxitems``: How many elements to return at most (default: no limit)
For example, if the page has multiple elements, but you only want
to select the second and third matching element (skip the first, and
return at most two elements), you can use this filter:
.. code:: yaml
url: https://example.net/css-skip-maxitems.html
filter:
- css:
selector: div.cpu
skip: 1
maxitems: 2
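The combined effect of ``skip`` and ``maxitems`` corresponds to a list
slice. A minimal Python sketch, with a hypothetical helper name and sample
data:

```python
def limit_items(items, skip=0, maxitems=None):
    # Drop the first `skip` matches, then keep at most `maxitems` of the rest
    remaining = items[skip:]
    return remaining if maxitems is None else remaining[:maxitems]


matches = ["cpu-1", "cpu-2", "cpu-3", "cpu-4"]
selected = limit_items(matches, skip=1, maxitems=2)
first_only = limit_items(matches, maxitems=1)
```

Here ``selected`` holds the second and third match, and ``first_only``
holds just the first one.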
Dealing with duplicated results
*******************************
If you get multiple results on one page, but you only expected one
(e.g. because the page contains both a mobile and desktop version in
the same HTML document, and shows/hides one via CSS depending on the
viewport size), you can use ``maxitems: 1`` to only return the first
item.
Fixing list reorderings with CSS Selector or XPath filters
----------------------------------------------------------
In some cases, the ordering of items on a webpage might change regularly
without the actual content changing. By default, this would show up in
the diff output as an element being removed from one part of the page and
inserted in another part of the page.
In cases where the order of items doesn't matter, it's possible to sort
matched items lexicographically to avoid spurious reports when only the
ordering of items changes on the page.
The corresponding subfilter for ``css`` and ``xpath`` filters is ``sort``,
which can be ``true`` or ``false`` (the default):
.. code:: yaml
url: https://example.org/items-random-order.html
filter:
- css:
selector: span.item
sort: true
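Why sorting suppresses such reports can be seen in a few lines of Python.
The item lists are made-up sample data:

```python
old_items = ["banana", "apple", "cherry"]
new_items = ["cherry", "banana", "apple"]  # same items, different order

# Without sorting, the two snapshots compare as different
changed_unsorted = old_items != new_items
# After sorting, both snapshots are identical and nothing is reported
changed_sorted = sorted(old_items) != sorted(new_items)
```

The trade-off is that a genuine reordering is no longer detected, so only
enable sorting when the order carries no meaning.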
Filtering PDF documents
-----------------------
To monitor the text of a PDF file, use the ``pdf2text`` filter. It requires
the installation of the `pdftotext`_ library and any of its
`OS-specific dependencies`_.
.. _pdftotext: https://github.com/jalan/pdftotext/blob/master/README.md#pdftotext
.. _OS-specific dependencies: https://github.com/jalan/pdftotext/blob/master/README.md#os-dependencies
This filter *must* be the first filter in a chain of filters, since it
consumes binary data and outputs text data.
.. code-block:: Yaml
url: https://example.net/pdf-test.pdf
filter:
- pdf2text
- strip
If the PDF file is password protected, you can specify its password:
.. code-block:: Yaml
url: https://example.net/pdf-test-password.pdf
filter:
- pdf2text:
password: urlwatchsecret
- strip
Dealing with CSV input
----------------------
The ``csv2text`` filter can be used to turn CSV data into a prettier textual representation.
This is done by supplying a ``format_string``, which is a `python format string`_.
.. _`python format string`: https://docs.python.org/3/library/string.html#format-string-syntax
If the CSV has a header, the format string should use the header names lowercased.
For example, let's say we have a CSV file containing data like this::
Name;Company
Smith;Initech
Doe;Initech
A possible format string for the above CSV (note the lowercase keys)::
Mr {name} works at {company}
If there is no header row, you will need to use the numeric array notation::
Mr {0} works at {1}
You can force the use of numeric indices with the flag ``ignore_header``.
The key ``has_header`` can be used to force treating the first line as a
header (``true``) or as data (``false``); if it is not set, `csv.Sniffer`_ is
used to detect whether a header is present.
.. _`csv.Sniffer`: https://docs.python.org/3/library/csv.html#csv.Sniffer
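The two addressing styles can be tried out directly in Python. This sketch
uses the sample data from above, sets the ``;`` delimiter explicitly (an
assumption made to keep the example short) and is not the filter's actual
implementation:

```python
import csv
import io

data = "Name;Company\nSmith;Initech\nDoe;Initech\n"
rows = list(csv.reader(io.StringIO(data), delimiter=";"))

# Header names are lowercased before they are used as format keys
header = [name.lower() for name in rows[0]]
by_name = [
    "Mr {name} works at {company}".format(**dict(zip(header, row)))
    for row in rows[1:]
]

# Numeric indices address the columns by position instead
by_index = ["Mr {0} works at {1}".format(*row) for row in rows[1:]]
```

Both variants produce one output line per data row.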
Sorting of webpage content
--------------------------
Sometimes a web page can have the same data between comparisons but it
appears in random order. If that happens, you can choose to sort before
the comparison.
.. code:: yaml
url: https://example.net/sorting.txt
filter:
- sort
The sort filter takes an optional ``separator`` parameter that defines
the item separator (by default sorting is line-based), for example to
sort text paragraphs (text separated by an empty line):
.. code:: yaml
url: http://example.org/paragraphs.txt
filter:
- sort:
separator: "\n\n"
This can be combined with a boolean ``reverse`` option, which is useful
for sorting and reversing with the same separator (using ``%`` as
separator, this would turn ``3%2%4%1`` into ``4%3%2%1``):
.. code:: yaml
url: http://example.org/sort-reverse-percent.txt
filter:
- sort:
separator: '%'
reverse: true
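The behaviour described above amounts to split, sort and join. A rough
Python equivalent, with a hypothetical function name:

```python
def sort_items(text, separator="\n", reverse=False):
    # Split into items, sort them lexicographically, and join them again
    return separator.join(sorted(text.split(separator), reverse=reverse))


descending = sort_items("3%2%4%1", separator="%", reverse=True)
paragraphs = sort_items("beta\n\nalpha", separator="\n\n")
```

Note that the comparison is lexicographic, so an item such as ``10`` sorts
before ``9``.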
Reversing of lines or separated items
-------------------------------------
To reverse the order of items without sorting, the ``reverse`` filter
can be used. By default it reverses lines:
.. code:: yaml
url: http://example.com/reverse-lines.txt
filter:
- reverse
This behavior can be changed by using an optional separator string
argument (e.g. items separated by a pipe (``|``) symbol,
as in ``1|4|2|3``, which would be reversed to ``3|2|4|1``):
.. code:: yaml
url: http://example.net/reverse-separator.txt
filter:
- reverse: '|'
Alternatively, the filter can be specified more verbosely using a dict.
In this example ``"\n\n"`` is used to separate paragraphs (items that
are separated by an empty line):
.. code:: yaml
url: http://example.org/reverse-paragraphs.txt
filter:
- reverse:
separator: "\n\n"
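Reversing works like sorting, except that the items keep their relative
order and are only flipped. A minimal Python sketch with a hypothetical
function name:

```python
def reverse_items(text, separator="\n"):
    # Split into items and join them again in the opposite order
    return separator.join(reversed(text.split(separator)))


piped = reverse_items("1|4|2|3", separator="|")
lines = reverse_items("first\nsecond\nthird")
```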
Watching GitHub releases and GitLab tags
----------------------------------------
This is an example of how to watch the GitHub “releases” page of a given
project for the latest release version, to be notified of new releases:
.. code:: yaml
url: https://github.com/tulir/gomuks/releases
filter:
- xpath:
path: //*[@class="Link--primary Link"]
maxitems: 1
- html2text:
This is the corresponding version for GitHub tags:
.. code:: yaml
url: https://github.com/thp/urlwatch/tags
filter:
- xpath:
path: //*[@class="Link--primary Link"]
maxitems: 1
- html2text:
and for GitLab tags:
.. code:: yaml
url: https://gitlab.com/chinstrap/gammastep/-/tags
filter:
- xpath: (//a[contains(@class,"item-title ref-name")])[1]
- html2text
Alternatively, ``jq`` can be used for filtering:
.. code:: Yaml
url: https://api.github.com/repos/voxpupuli/puppet-rundeck/tags
filter:
- jq: '.[0].name'
Find, remove or replace text using regular expressions
------------------------------------------------------
You can use ``re.sub`` and ``re.findall`` to apply regular expressions.
``re.sub`` can be used to remove or replace all non-overlapping instances
of matched text. The following example applies the filter 3 times:
1. Just specifying a string as the value will replace the matches with
the empty string.
2. Simple patterns can be replaced with another string using ``pattern``
as the expression and ``repl`` as the replacement.
3. You can use groups (``()``) and back-reference them with ``\1``
(etc.) to put groups into the replacement string.
``repl`` defaults to the empty string, which will remove matched strings.
.. code:: yaml
url: https://example.com/regex-substitute.html
filter:
- re.sub: '\s*href="[^"]*"'
- re.sub:
pattern: '<h1>'
repl: 'HEADING 1: '
- re.sub:
pattern: '</([^>]*)>'
repl: '<END OF TAG \1>'
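Since the filter builds on Python's ``re.sub``, the three steps can be
replayed in an interactive session. The input string below is made-up
sample data:

```python
import re

html = '<h1><a href="/start">Welcome</a></h1>'

# 1. No replacement given: matches are replaced with the empty string
step1 = re.sub(r'\s*href="[^"]*"', '', html)
# 2. Plain pattern replaced by a fixed string
step2 = re.sub(r'<h1>', 'HEADING 1: ', step1)
# 3. Group 1 (the tag name) is inserted into the replacement via \1
step3 = re.sub(r'</([^>]*)>', r'<END OF TAG \1>', step2)
```

After the last step, ``step3`` is
``HEADING 1: <a>Welcome<END OF TAG a><END OF TAG h1>``.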
``re.findall`` can be used to find all non-overlapping matches of a
regular expression. Each match is output on its own line. The following
example applies the filter twice:
1. It uses a group (``()``) and back-reference (``\1``) to extract a
date from the input string.
2. It breaks the numbers in the date out into separate lines.
If ``repl`` is not specified, the full match will be included in the output.
.. code:: yaml
url: https://example.com/regex-findall.html
filter:
- re.findall:
pattern: 'The next draw is on (\d{4}-\d{2}-\d{2}).'
repl: '\1'
- re.findall: '\d+'
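The following Python sketch approximates both steps. Expanding each match
through the ``repl`` template with ``Match.expand`` is an assumption about
how the option behaves, and the input text is made-up sample data:

```python
import re

text = "The next draw is on 2024-05-17. Good luck!"

# Step 1: with repl, each match is expanded through the template
pattern = r"The next draw is on (\d{4}-\d{2}-\d{2})."
step1 = "\n".join(
    match.expand(r"\1") for match in re.finditer(pattern, text)
)

# Step 2: without repl, each full match is placed on its own line
step2 = "\n".join(re.findall(r"\d+", step1))
```

``step1`` contains only the date, and ``step2`` splits it into the three
numbers, one per line.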
Note: When using HTML or XML, it is usually better to use CSS selectors or
XPath expressions. HTML and XML `cannot be parsed`_ properly using regular
expressions. If a CSS selector or XPath expression cannot provide the
targeted selection required, applying an ``html2text`` filter first and then
using ``re.findall`` can be a good pattern.
.. _`cannot be parsed`: https://stackoverflow.com/a/1732454/1047040
If you want to enable flags (e.g. ``re.MULTILINE``) in ``re.sub``
or ``re.findall`` filters, use an "inline flag" at the start of the
pattern (see the `full re syntax`_). Here are some examples:
* ``re.MULTILINE``: ``(?m)`` (Makes ``^`` match start-of-line and ``$`` match end-of-line)
* ``re.DOTALL``: ``(?s)`` (Makes ``.`` also match a newline)
* ``re.IGNORECASE``: ``(?i)`` (Perform case-insensitive matching)
.. _full re syntax: https://docs.python.org/3/library/re.html#regular-expression-syntax
This allows you, for example, to remove all leading spaces (only
space character and tab):
.. code:: yaml
url: http://example.com/leading-spaces.txt
filter:
- re.sub: '(?m)^[ \t]*'
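The effect of the ``(?m)`` inline flag is easy to verify in plain Python.
The sample text is made up:

```python
import re

text = "  first\n\t\tsecond\nthird"

# With (?m), ^ matches at the start of every line
with_flag = re.sub(r"(?m)^[ \t]*", "", text)
# Without it, ^ only matches at the very start of the text
without_flag = re.sub(r"^[ \t]*", "", text)
```

Only the variant with the flag removes the tabs in front of ``second``.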
Using a shell script as a filter
--------------------------------
While the built-in filters are powerful for processing markup such as
HTML and XML, in some cases you might already know how you would filter
your content using a shell command or shell script. The ``shellpipe``
filter allows you to start a shell and run custom commands to filter
the content.
The text data to be filtered will be written to the standard input
(``stdin``) of the shell process and the filter output will be taken
from the shell's standard output (``stdout``).
For example, if you want to use the ``grep`` tool with the case-insensitive
matching option (``-i``) and print only the matching part of
the line (``-o``), you can specify this as a ``shellpipe`` filter:
.. code:: yaml
url: https://example.net/shellpipe-grep.txt
filter:
- shellpipe: "grep -i -o 'price: <span>.*</span>'"
This feature also allows you to use :manpage:`sed(1)`, :manpage:`awk(1)` and :manpage:`perl(1)`
one-liners for text processing (of course, any text tool that
works in a shell can be used). For example, this :manpage:`awk(1)` one-liner
prepends the line number to each line:
.. code:: yaml
url: https://example.net/shellpipe-awk-oneliner.txt
filter:
- shellpipe: awk '{ print FNR " " $0 }'
You can also use a multi-line command for a more sophisticated
shell script (``|`` in YAML denotes the start of a text block):
.. code:: yaml
url: https://example.org/shellpipe-multiline.txt
filter:
- shellpipe: |
FILENAME=$(mktemp)
# Copy the input to a temporary file, then pipe through awk
tee "$FILENAME" | awk '/The numbers for (.*) are:/,/The next draw is on (.*)./'
# Analyze the input file in some other way
echo "Input lines: $(wc -l "$FILENAME" | awk '{ print $1 }')"
rm -f "$FILENAME"
Within the ``shellpipe`` script, two environment variables will
be set for further customization (this can be useful if you have
an external shell script file that is used as filter for multiple
jobs, but needs to treat each job in a slightly different way):
+----------------------------+------------------------------------------------------+
| Environment variable | Contents |
+============================+======================================================+
| ``$URLWATCH_JOB_NAME`` | The name of the job (``name`` key in jobs YAML) |
+----------------------------+------------------------------------------------------+
| ``$URLWATCH_JOB_LOCATION`` | The URL of the job, or command line (for shell jobs) |
+----------------------------+------------------------------------------------------+
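The data flow of a ``shellpipe`` filter (text in via ``stdin``, result out
via ``stdout``, job details in the environment) can be sketched with
Python's ``subprocess`` module. The helper name, the sample command and the
job values are hypothetical, and a POSIX shell is assumed:

```python
import os
import subprocess


def shellpipe(command, text, job_name, job_location):
    # Expose the job details to the script through environment variables
    env = dict(
        os.environ,
        URLWATCH_JOB_NAME=job_name,
        URLWATCH_JOB_LOCATION=job_location,
    )
    # The text goes to the shell's stdin; the filter result is its stdout
    completed = subprocess.run(
        command,
        shell=True,
        input=text,
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


output = shellpipe(
    'cat; echo "job: $URLWATCH_JOB_NAME"',
    "line one\n",
    job_name="Example",
    job_location="https://example.net/",
)
```

The sample command copies its input unchanged and appends one line that
contains the job name read from the environment.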
Converting text in images to plaintext
--------------------------------------
The ``ocr`` filter uses the `Tesseract OCR engine`_ to convert text in images
to plain text. It requires two Python modules to be installed:
`pytesseract`_ and `Pillow`_. Any file formats supported by Pillow (PIL) are
supported.
.. _Tesseract OCR engine: https://github.com/tesseract-ocr
.. _pytesseract: https://github.com/madmaze/pytesseract
.. _Pillow: https://python-pillow.org
This filter *must* be the first filter in a chain of filters, since it
consumes binary data and outputs text data.
.. code-block:: Yaml
url: https://example.net/ocr-test.png
filter:
- ocr:
timeout: 5
language: eng
- strip
The subfilters ``timeout`` and ``language`` are optional:
* ``timeout``: Timeout for the recognition, in seconds (default: 10 seconds)
* ``language``: Text language (e.g. ``fra`` or ``eng+fra``, default: ``eng``)
Filtering JSON response data using ``jq`` selectors
---------------------------------------------------
The ``jq`` filter uses the Python bindings for `jq`_, a lightweight JSON processor.
Use of this filter requires the optional `jq Python module`_ to be installed.
.. _jq: https://stedolan.github.io/jq/
.. _jq Python module: https://github.com/mwilliamson/jq.py
.. code-block:: Yaml
url: https://example.net/jobs.json
filter:
- jq:
query: '.[].title'
The ``query`` key can be omitted by passing the query string directly as
the filter value (as in ``jq: '.[0].name'``):
* ``query``: A valid ``jq`` filter string.
``jq`` supports aggregations, selections, and built-in operators like ``length``. For
more information on the operations permitted, see the `jq Manual`_.
.. _jq Manual: https://stedolan.github.io/jq/manual/
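If you are new to ``jq`` syntax, it can help to compare simple queries with
their plain Python counterparts. The JSON document below is made-up sample
data, and the comparison only covers these simple cases:

```python
import json

response = '[{"title": "Engineer", "id": 1}, {"title": "Designer", "id": 2}]'
data = json.loads(response)

# Rough equivalent of the jq query '.[].title': one title per array element
titles = [entry["title"] for entry in data]
# Rough equivalent of '.[0].title': the title of the first element only
first_title = data[0]["title"]
# Rough equivalent of 'length': the number of array elements
count = len(data)
```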
.. only:: man
Files
-----
``$XDG_CONFIG_HOME/urlwatch/urls.yaml``
See also
--------
:manpage:`urlwatch(1)`,
:manpage:`urlwatch-intro(5)`,
:manpage:`urlwatch-jobs(5)`