.. _introduction:
Introduction
============
Quick Start
-----------
1. Run ``urlwatch`` once to migrate your old data or start fresh
2. Use ``urlwatch --edit`` to customize jobs and filters (``urls.yaml``)
3. Use ``urlwatch --edit-config`` to customize settings and reporters (``urlwatch.yaml``)
4. Add ``urlwatch`` to your crontab (``crontab -e``) to monitor webpages periodically
The checking interval is defined by how often you run ``urlwatch``. You
can use e.g. `crontab.guru <https://crontab.guru>`__ to figure out the
schedule expression for the checking interval. We recommend checking no
more often than every 30 minutes (this would be ``*/30 * * * *``). If you
have never used cron before, check out the `crontab command
help <https://www.computerhope.com/unix/ucrontab.htm>`__.
On Windows, ``cron`` is not installed by default. Use the `Windows Task
Scheduler <https://en.wikipedia.org/wiki/Windows_Task_Scheduler>`__
instead, or see `this StackOverflow
question <https://stackoverflow.com/q/132971/1047040>`__ for
alternatives.
How it works
------------
Every time you run :manpage:`urlwatch(1)`, it:
- retrieves the output of each job and filters it
- compares it with the version retrieved the previous time ("diffing")
- if it finds any differences, it invokes enabled reporters (e.g.
text reporter, e-mail reporter, ...) to notify you of the changes
Jobs and Filters
----------------
Each website or shell command to be monitored constitutes a "job".
The instructions for each such job are contained in a config file in the `YAML
format`_. If you have more than one job, you separate them with a line
containing only ``---``.
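
As a minimal sketch of that layout, the following job file holds two jobs
split by a ``---`` line. The names and URLs are placeholders chosen for
illustration:

.. code-block:: yaml

   # First job: a placeholder page to watch
   name: "Example homepage"
   url: "https://example.com/"
   ---
   # Second job: another placeholder page
   name: "Example news page"
   url: "https://example.org/news"
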
You can edit the job and filter configuration file using:

.. code::

   urlwatch --edit

If you get an error, set your ``$EDITOR`` (or ``$VISUAL``) environment
variable in your shell, for example:

.. code::

   export EDITOR=/bin/nano

While you can edit the YAML file manually, using ``--edit`` will
do sanity checks before activating the new configuration file.
.. _YAML format: https://yaml.org/spec/
Kinds of Jobs
~~~~~~~~~~~~~
Each job must have exactly one of the following keys, which also
defines the kind of job:
- ``url`` retrieves what is served by the web server (HTTP GET by default),
- ``navigate`` uses a headless browser to load web pages requiring JavaScript, and
- ``command`` runs a shell command.
Each job can have an optional ``name`` key to define a user-visible name for the job.
You can then use optional keys to fine-tune each job's parameters.
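
For illustration, one job of each kind might look like the sketch below.
The names, URLs, and shell command are placeholders; only the ``url``,
``navigate``, ``command``, and ``name`` keys come from the description above:

.. code-block:: yaml

   # "url" job: plain HTTP retrieval
   name: "Static page"
   url: "https://example.com/"
   ---
   # "navigate" job: page rendered in a headless browser
   name: "JavaScript-heavy page"
   navigate: "https://example.com/app"
   ---
   # "command" job: output of a shell command
   name: "Disk usage"
   command: "df -h"
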

.. only:: man

   See :manpage:`urlwatch-jobs(5)` for detailed information on job configuration.

Filters
~~~~~~~
You may use the ``filter`` key to select one or more :doc:`filters` to apply to
the data after it is retrieved, for example to:
- select HTML: ``css``, ``xpath``, ``element-by-class``, ``element-by-id``, ``element-by-style``, ``element-by-tag``
- make HTML more readable: ``html2text``, ``beautify``
- make PDFs readable: ``pdf2text``
- make JSON more readable: ``format-json``
- make iCal more readable: ``ical2text``
- make binary readable: ``hexdump``
- just detect changes: ``sha1sum``
- edit text: ``grep``, ``grepi``, ``strip``, ``sort``, ``striplines``
These filters can be chained. As an example, after retrieving an HTML
document by using the ``url`` key, you can extract a selection with the
``xpath`` filter, convert this to text with ``html2text``, use ``grep`` to
extract only lines matching a specific regular expression, and then ``sort``
them:

.. code-block:: yaml

   name: "Sample urlwatch job definition"
   url: "https://example.dummy/"
   https_proxy: "http://dummy.proxy/"
   max_tries: 2
   filter:
     - xpath: '//section[@role="main"]'
     - html2text:
         method: pyhtml2text
         unicode_snob: true
         body_width: 0
         inline_links: false
         ignore_links: true
         ignore_images: true
         pad_tables: false
         single_line_break: true
     - grep: "lines I care about"
     - sort:
   ---

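
A shorter chain is often enough. The sketch below uses a placeholder name,
URL, and search pattern, and assumes that ``html2text`` can be given without
options in the same way ``sort:`` is written above. It converts the page to
text and keeps only the lines matching the pattern:

.. code-block:: yaml

   # Placeholder job: keep only the lines that mention "Price"
   name: "Price lines only"
   url: "https://example.com/shop"
   filter:
     - html2text:        # assumed to work without options, like "sort:" above
     - grep: "Price"     # regular expression; only matching lines are kept
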

.. only:: man

   See :manpage:`urlwatch-filters(5)` for detailed information on filter configuration.

Reporters
---------
``urlwatch`` can be configured to do something with its report besides
(or in addition to) the default of displaying it on the console.
:doc:`reporters` are configured in the global configuration file:

.. code::

   urlwatch --edit-config

Examples of reporters:
- ``email`` (using SMTP)
- email using ``mailgun``
- ``slack``
- ``discord``
- ``pushbullet``
- ``telegram``
- ``matrix``
- ``pushover``
- ``stdout``
- ``xmpp``
- ``shell``
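
As a rough sketch, switching reporters on or off in ``urlwatch.yaml`` might
look like the following. The top-level ``report`` key and the ``enabled``
option are assumptions about the configuration layout and are not taken from
this page, so check the reporter documentation for the options that actually
apply:

.. code-block:: yaml

   # Assumed layout: one entry per reporter under a "report" key
   report:
     stdout:
       enabled: true    # keep printing the report to the console
     email:
       enabled: false   # SMTP reporter left disabled in this sketch
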

.. only:: man

   See :manpage:`urlwatch-reporters(5)` for reporter configuration options.


.. only:: man

   See Also
   --------

   :manpage:`urlwatch(1)`,
   :manpage:`urlwatch-jobs(5)`,
   :manpage:`urlwatch-filters(5)`,
   :manpage:`urlwatch-config(5)`,
   :manpage:`urlwatch-reporters(5)`,
   :manpage:`cron(8)`