File: introduction.rst

.. _introduction:

Introduction
============


Quick Start
-----------

1. Run ``urlwatch`` once to migrate your old data or start fresh
2. Use ``urlwatch --edit`` to customize jobs and filters (``urls.yaml``)
3. Use ``urlwatch --edit-config`` to customize settings and reporters (``urlwatch.yaml``)
4. Add ``urlwatch`` to your crontab (``crontab -e``) to monitor webpages periodically

The checking interval is defined by how often you run ``urlwatch``. You
can use e.g. `crontab.guru <https://crontab.guru>`__ to figure out the
schedule expression for the checking interval; we recommend running it
no more often than every 30 minutes (this would be ``*/30 * * * *``).
If you have never used cron before, check out the `crontab command
help <https://www.computerhope.com/unix/ucrontab.htm>`__.
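
For example, a crontab entry for a 30-minute interval could look like
this (the path to the binary is an assumption; check yours with
``which urlwatch``):

.. code::

    */30 * * * * /usr/bin/urlwatch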

On Windows, ``cron`` is not installed by default. Use the `Windows Task
Scheduler <https://en.wikipedia.org/wiki/Windows_Task_Scheduler>`__
instead, or see `this StackOverflow
question <https://stackoverflow.com/q/132971/1047040>`__ for
alternatives.


How it works
------------

Every time you run :manpage:`urlwatch(1)`, it:

- retrieves the output of each job and filters it
- compares it with the version retrieved the previous time ("diffing")
- if it finds any differences, it invokes enabled reporters (e.g.
  text reporter, e-mail reporter, ...) to notify you of the changes

Jobs and Filters
----------------

Each website or shell command to be monitored constitutes a "job".

The instructions for each such job are contained in a config file in the `YAML
format`_. If you have more than one job, you separate them with a line
containing only ``---``.
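
For example, a jobs file with two jobs, separated by ``---``, could look
like this (the names, URL, and command below are placeholders):

.. code-block:: yaml

    name: "Example website"
    url: "https://example.com/"
    ---
    name: "Example command"
    command: "uname -a"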

You can edit the job and filter configuration file using:

.. code::

    urlwatch --edit

If you get an error, set your ``$EDITOR`` (or ``$VISUAL``) environment
variable in your shell, for example:

.. code::

    export EDITOR=/bin/nano

While you can edit the YAML file manually, using ``--edit`` will
do sanity checks before activating the new configuration file.

.. _YAML format: https://yaml.org/spec/

Kinds of Jobs
~~~~~~~~~~~~~

Each job must have exactly one of the following keys, which also
defines the kind of job:

- ``url`` retrieves what is served by the web server (HTTP GET by default),
- ``navigate`` uses a headless browser to load web pages requiring JavaScript, and
- ``command`` runs a shell command.

Each job can have an optional ``name`` key to define a user-visible name for the job.

You can then use further optional keys to fine-tune each job's parameters.
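
For instance, a minimal ``navigate`` job with a user-visible name could
be declared like this (the URL is a placeholder):

.. code-block:: yaml

    name: "JavaScript-rendered page"
    navigate: "https://example.com/app"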

.. only:: man

    See :manpage:`urlwatch-jobs(5)` for detailed information on job configuration.

Filters
~~~~~~~

You may use the ``filter`` key to select one or more :doc:`filters` to apply to
the data after it is retrieved, for example to:

- select HTML: ``css``, ``xpath``, ``element-by-class``, ``element-by-id``, ``element-by-style``, ``element-by-tag``
- make HTML more readable: ``html2text``, ``beautify``
- make PDFs readable: ``pdf2text``
- make JSON more readable: ``format-json``
- make iCal more readable: ``ical2text``
- make binary data readable: ``hexdump``
- just detect changes: ``sha1sum``
- edit text: ``grep``, ``grepi``, ``strip``, ``sort``, ``striplines``

These filters can be chained. As an example, after retrieving an HTML
document by using the ``url`` key, you can extract a selection with the
``xpath`` filter, convert this to text with ``html2text``, use ``grep`` to
extract only lines matching a specific regular expression, and then ``sort``
them:

.. code-block:: yaml

    name: "Sample urlwatch job definition"
    url: "https://example.dummy/"
    https_proxy: "http://dummy.proxy/"
    max_tries: 2
    filter:
      - xpath: '//section[@role="main"]'
      - html2text:
          method: pyhtml2text
          unicode_snob: true
          body_width: 0
          inline_links: false
          ignore_links: true
          ignore_images: true
          pad_tables: false
          single_line_break: true
      - grep: "lines I care about"
      - sort:
    ---

.. only:: man

    See :manpage:`urlwatch-filters(5)` for detailed information on filter configuration.

Reporters
---------

``urlwatch`` can be configured to deliver its report through other
channels instead of (or in addition to) the default of displaying it on
the console.

:doc:`reporters` are configured in the global configuration file:

.. code::

    urlwatch --edit-config

Examples of reporters:

- ``email`` (using SMTP)
- email using ``mailgun``
- ``slack``
- ``discord``
- ``pushbullet``
- ``telegram``
- ``matrix``
- ``pushover``
- ``stdout``
- ``xmpp``
- ``shell``
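
Reporters are enabled and configured in the ``report`` section of the
configuration file. As a minimal sketch (the addresses below are
placeholders; see the reporters documentation for the full set of keys
each reporter accepts):

.. code-block:: yaml

    report:
      stdout:
        enabled: true
      email:
        enabled: true
        from: "urlwatch@example.com"
        to: "you@example.com"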

.. only:: man

    See :manpage:`urlwatch-reporters(5)` for reporter configuration options.

.. only:: man

    See Also
    --------

    :manpage:`urlwatch(1)`,
    :manpage:`urlwatch-jobs(5)`,
    :manpage:`urlwatch-filters(5)`,
    :manpage:`urlwatch-config(5)`,
    :manpage:`urlwatch-reporters(5)`,
    :manpage:`cron(8)`