File: advanced.rst

package info (click to toggle)
urlwatch 2.29-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 880 kB
  • sloc: python: 4,003; sh: 53; makefile: 19
file content (484 lines) | stat: -rw-r--r-- 14,709 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
.. _advanced_topics:

Advanced Topics
===============


Adding URLs from the command line
---------------------------------

Quickly adding new URLs to the job list from the command line::

    urlwatch --add url=http://example.org,name=Example


Using word-based differences
----------------------------

You can also specify an external ``diff``-style tool (a tool that takes
two filenames (old, new) as parameter and returns on its standard output
the difference of the files), for example to use :manpage:`wdiff(1)` to get
word-based differences instead of line-based difference, or `pandiff <https://github.com/davidar/pandiff>`_ to get markdown differences:

.. code-block:: yaml

   url: https://example.com/
   diff_tool: wdiff

Note that ``diff_tool`` specifies an external command-line tool, so that
tool must be installed separately (e.g. ``apt install wdiff`` on Debian
or ``brew install wdiff`` on macOS). Syntax highlighting is supported for
``wdiff``-style output, but potentially not for other diff tools.


Ignoring whitespace changes
---------------------------

If you would like to ignore whitespace changes so that you don't receive
notifications for trivial differences, you can use ``diff_tool`` for this.
For example:

.. code-block:: yaml

   diff_tool: "diff --ignore-all-space --unified"

When using another external ``diff``-like tool, make sure it returns unified
output format to retain syntax highlighting.


Only show added or removed lines
--------------------------------

The ``diff_filter`` feature can be used to filter the diff output text
with the same tools (see :doc:`filters`) used for filtering web pages.

In order to show only diff lines with added lines, use:

.. code-block:: yaml

   url: http://example.com/things-get-added.html
   diff_filter:
     - grep: '^[@+]'

This will only keep diff lines starting with ``@`` or ``+``. Similarly,
to only keep removed lines:

.. code-block:: yaml

   url: http://example.com/things-get-removed.html
   diff_filter:
     - grep: '^[@-]'

More sophisticated diff filtering is possibly by combining existing
filters, writing a new filter or using ``shellpipe`` to delegate the
filtering/processing of the diff output to an external tool.

Read the next section if you want to disable empty notifications.


Disable empty notifications
---------------------------

As an extension to the previous example, let's say you want to only
get notified with all lines added, but receive no notifications at all
if lines are removed.

A diff usually looks like this:

.. code-block:: diff

    --- @	Fri, 04 Mar 2022 19:58:14 +0100
    +++ @	Fri, 04 Mar 2022 19:58:22 +0100
    @@ -1,3 +1,3 @@
     someline
    -someotherlines
    +someotherline
     anotherline

We want to filter all lines starting with "+" only, but because of
the headers we also want to filter lines that start with "+++",
which can be accomplished like so:

.. code-block:: yaml

    url: http://example.com/only-added.html
    diff_filter:
      - grep: '^[+]'      # Include all lines starting with "+"
      - grepi: '^[+]{3}'  # Exclude the line starting with "+++"

This deals with all diff lines now, but since urlwatch reports
"changed" pages even when the ``diff_filter`` returns an empty string
(which might be useful in some cases), you have to explicitly opt out
by using ``urlwatch --edit-config`` and setting the ``empty-diff``
option to ``false`` in the ``display`` category:

.. code-block:: yaml

    display:
      empty-diff: false


Pass diff output to a custom script
-----------------------------------

In some situations, it might be useful to run a script with the diff as input
when changes were detected (e.g. to start an update or process something). This
can be done by combining ``diff_filter`` with the ``shellpipe`` filter, which
can be any custom script.

The output of the custom script will then be the diff result as reported by
urlwatch, so if it outputs any status, the ``CHANGED`` notification that
urlwatch does will contain the output of the custom script, not the original
diff. This can even have a "normal" filter attached to only watch links
(the ``css: a`` part of the filter definitions):

.. code-block:: yaml

   url: http://example.org/downloadlist.html
   filter:
     - css: a
   diff_filter:
     - shellpipe: /usr/local/bin/process_new_links.sh


Comparing web pages visually
----------------------------

To compare the visual contents of web pages, Nicolai has written
`pyvisualcompare <https://github.com/nspo/pyvisualcompare>`__ as
a frontend (with GUI) to ``urlwatch``. The tool can be used to
select a region of a web page. It then generates a configuration
for ``urlwatch`` to run ``pyvisualcompare`` and generate a hash
for the screen contents.


Ignoring connection errors
--------------------------

In some cases, it might be useful to ignore (temporary) network errors
to avoid notifications being sent. While there is a ``display.error``
config option (defaulting to ``true``) to control reporting of errors
globally, to ignore network errors for specific jobs only, you can use
the ``ignore_connection_errors`` key in the job list configuration file:

.. code-block:: yaml

   url: https://example.com/
   ignore_connection_errors: true

Similarly, you might want to ignore some (temporary) HTTP errors on the
server side:

.. code-block:: yaml

   url: https://example.com/
   ignore_http_error_codes: 408, 429, 500, 502, 503, 504

or ignore all HTTP errors if you like:

.. code-block:: yaml

   url: https://example.com/
   ignore_http_error_codes: 4xx, 5xx

You can also ignore incomplete reads:

.. code-block:: yaml

   url: "https://example.com/"
   ignore_incomplete_reads: true

Overriding the content encoding
-------------------------------

For web pages with misconfigured HTTP headers or rare encodings, it may
be useful to explicitly specify an encoding from Python’s `Standard
Encodings <https://docs.python.org/3/library/codecs.html#standard-encodings>`__.

.. code-block:: yaml

   url: https://example.com/
   encoding: utf-8


Changing the default timeout
----------------------------

By default, url jobs timeout after 60 seconds. If you want a different
timeout period, use the ``timeout`` key to specify it in number of
seconds, or set it to 0 to never timeout.

.. code-block:: yaml

   url: https://example.com/
   timeout: 300


Supplying cookie data
---------------------

It is possible to add cookies to HTTP requests for pages that need it,
the YAML syntax for this is:

.. code-block:: yaml

   url: http://example.com/
   cookies:
       Key: ValueForKey
       OtherKey: OtherValue


Comparing with several latest snapshots
---------------------------------------

If a webpage frequently changes between several known stable states, it
may be desirable to have changes reported only if the webpage changes
into a new unknown state. You can use ``compared_versions`` to do this.

.. code-block:: yaml

   url: https://example.com/
   compared_versions: 3

In this example, changes are only reported if the webpage becomes
different from the latest three distinct states. The differences are
shown relative to the closest match.


Receiving a report every time urlwatch runs
-------------------------------------------

If you are watching pages that change seldomly, but you still want to
be notified daily if ``urlwatch`` still works, you can watch the output
of the ``date`` command, for example:

.. code-block:: yaml

   name: "urlwatch watchdog"
   command: "date"

Since the output of ``date`` changes every second, this job should produce a
report every time urlwatch is run.


Using Redis as a cache backend
------------------------------------------
If you want to use Redis as a cache backend over the default SQLite3 file::

    urlwatch --cache=redis://localhost:6379/

There is no migration path from the SQLite3 format, the cache will be empty
the first time Redis is used.


Watching changes on .onion (Tor) pages
--------------------------------------

Since pages on the `Tor Network`_ are not accessible via public DNS and TCP,
you need to either configure a Tor client as HTTP/HTTPS proxy or use the
:manpage:`torify(1)` tool from the ``tor`` package (``apt install tor`` on Debian,
``brew install tor`` on macOS). Setting up Tor is out of scope for this
document. On a properly set up Tor installation, one can just prefix the
``urlwatch`` command with the ``torify`` wrapper to access .onion pages:

.. code-block:: bash

   torify urlwatch

.. _Tor Network: https://www.torproject.org


Watching Facebook Page Events
-----------------------------

If you want to be notified of new events on a public Facebook page, you
can use the following job pattern, replace ``PAGE`` with the name of the
page (can be found by navigating to the events page on your browser):

.. code-block:: yaml

   url: http://m.facebook.com/PAGE/pages/permalink/?view_type=tab_events
   filter:
     - css:
         selector: div#objects_container
         exclude: 'div.x, #m_more_friends_who_like_this, img'
     - re.sub:
         pattern: '(/events/\d*)[^"]*'
         repl: '\1'
     - html2text: pyhtml2text


Setting the content width for ``html2text`` (``lynx`` method)
-------------------------------------------------------------

When using the ``lynx`` method in the ``html2text`` filter, it uses a default
width that will cause additional line breaks to be inserted.

To set the ``lynx`` output width to 400 characters, use this filter setup:

.. code-block:: yaml

   url: http://example.com/longlines.html
   filter:
     - html2text:
         method: lynx
         width: 400


Configuring how long browser jobs wait for pages to load
--------------------------------------------------------

For browser jobs, you can configure how long the headless browser will wait
before a page is considered loaded by using the ``wait_until`` option.

It can take one of four values (see `wait_until docs`_ of Playwright):

   - ``load`` - consider operation to be finished when the load event is fired
   - ``domcontentloaded`` - consider operation to be finished when the
     DOMContentLoaded event is fired
   - ``networkidle`` - **discouraged** consider operation to be finished when there
     are no network connections for at least 500 ms. Don't use this method for
     testing, rely on web assertions to assess readiness instead
   - ``commit`` - consider operation to be finished when network response is
     received and the document started loading

.. _`wait_until docs`: https://playwright.dev/python/docs/api/class-page#page-goto-option-wait-until


Treating ``NEW`` jobs as ``CHANGED``
------------------------------------

In some cases (e.g. when the ``diff_tool`` or ``diff_filter`` executes some
external command as a side effect that should also run for the initial page
state), you can set the ``treat_new_as_changed`` to ``true``, which will make
the job report as ``CHANGED`` instead of ``NEW`` the first time it is retrieved
(and the diff will be reported, too).

.. code-block:: yaml

   url: http://example.com/initialpage.html
   treat_new_as_changed: true

This option will also change the behavior of ``--test-diff-filter``, and allow
testing the diff filter if only a single version of the page has been
retrieved.


Monitoring the same URL in multiple jobs
----------------------------------------

Because urlwatch uses the ``url``/``navigate`` (for URL/Browser jobs) and/or
the ``command`` (for Shell jobs) key as unique identifier, each URL can only
appear in a single job. If you want to monitor the same URL multiple times,
you can append ``#1``, ``#2``, ... (or anything that makes them unique) to
the URLs, like this:

.. code-block:: yaml

    name: "Looking for Thing A"
    url: http://example.com/#1
    filter:
      - grep: "Thing A"
    ---
    name: "Looking for Thing B"
    url: http://example.com/#2
    filter:
      - grep: "Thing B"


Updating a URL and keeping past history
---------------------------------------

Job history is stored based on the value of the ``url`` parameter, so updating
a job's URL in the configuration file ``urls.yaml`` will create a new job with 
no history.  Retain history by using ``--change-location``::

    urlwatch --change-location http://example.org#old http://example.org#new

The command also works with Browser and Shell jobs, changing ``navigate`` and 
``command`` respectively.


Running a subset of jobs
------------------------

To run one or more specific jobs instead of all known jobs, provide
the job index numbers to the urlwatch command. For example, to run
jobs with index 2, 4, and 7:

.. code-block:: bash

   urlwatch 2 4 7


Sending HTML form data using POST
---------------------------------

To simulate submitting a HTML form using the POST method, you can pass
the form fields in the ``data`` field of the job description:

.. code-block:: yaml

    name: "My POST Job"
    url: http://example.com/foo
    data:
      username: "foo"
      password: "bar"
      submit: "Send query"

By default, the request will use the HTTP ``POST`` method, and the
``Content-type`` will be set to ``application/x-www-form-urlencoded``.


Sending arbitrary data using HTTP PUT
-------------------------------------

It is possible to customize the HTTP method and ``Content-type`` header,
allowing you to send arbitrary requests to the server:

.. code-block:: yaml

    name: "My PUT Request"
    url: http://example.com/item/new
    method: PUT
    headers:
      Content-type: application/json
    data: '{"foo": true}'

.. only:: man

    See also
    --------

    :manpage:`urlwatch(1)`,
    :manpage:`urlwatch-intro(7)`,
    :manpage:`urlwatch-jobs(5)`,
    :manpage:`urlwatch-filters(5)`,
    :manpage:`urlwatch-config(5)`,
    :manpage:`urlwatch-reporters(5)`


UTF-8 support on Windows
------------------------

On Windows, the default file encoding might be locale-specific and not work
correctly if files are saved using the (recommended) UTF-8 encoding.

If you are having problems loading UTF-8-encoded files on Windows, you might
see an issue like the following when ``urlwatch`` parses your config files:

.. code-block:: text

    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 214: character maps to <undefined>

To work around this issue, Python 3.7 and newer have a new
`UTF-8 Mode`_ that can be enabled by setting the environment
variable ``PYTHONUTF8`` to ``1``::

    set PYTHONUTF8=1
    urlwatch

You can also add this environment variable to your user environment or system
environment to apply the UTF-8 Mode to all Python programs on your machine.

.. _UTF-8 Mode: https://peps.python.org/pep-0540/