Setup Prometheus monitoring
===========================

Prometheus_ is a popular tool for monitoring a wide variety of systems and alerting on their state.
A distributed cluster offers a number of Prometheus metrics if the prometheus_client_ package is installed.
The metrics are exposed in Prometheus' text-based format at the ``/metrics`` endpoint on both schedulers and workers.

.. _Prometheus: https://prometheus.io
.. _prometheus_client: https://github.com/prometheus/client_python
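
As a quick check that metrics are being served, the endpoint can be read with any HTTP client.
The sketch below uses only the Python standard library. The URL (a scheduler whose HTTP server listens on ``localhost:8787``) and the helper names ``fetch_metrics`` and ``parse_samples`` are illustrative assumptions, and the parser handles only simple ``name{labels} value`` lines rather than the full exposition format.

.. code-block:: python

    from urllib.request import urlopen


    def fetch_metrics(url="http://localhost:8787/metrics"):
        """Return the raw text served at a ``/metrics`` endpoint.

        The default URL is an assumption; adjust host and port to your cluster.
        """
        with urlopen(url, timeout=5) as response:
            return response.read().decode("utf-8")


    def parse_samples(text):
        """Map each sample (metric name plus labels) to its float value.

        Comment lines (``# HELP``, ``# TYPE``) and blank lines are skipped.
        Optional trailing timestamps are not handled.
        """
        samples = {}
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # The value is the last whitespace-separated field on the line.
            key, _, value = line.rpartition(" ")
            samples[key.strip()] = float(value)
        return samples

With a running scheduler, ``parse_samples(fetch_metrics())`` would return a dictionary keyed by sample name, for example ``dask_scheduler_clients``.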

Available metrics
-----------------

In addition to the default metrics exposed by ``prometheus_client``, schedulers and workers expose a number of Dask-specific metrics.


Scheduler metrics
^^^^^^^^^^^^^^^^^

The scheduler exposes the following metrics about itself:

+--------------------------------------------------+-------------------------------------------------------------------------+
|                   Metric name                    |                               Description                               |
+==================================================+=========================================================================+
| ``dask_scheduler_clients``                       | Number of clients connected                                             |
+--------------------------------------------------+-------------------------------------------------------------------------+
| ``dask_scheduler_desired_workers``               | Number of workers the scheduler needs for the task graph                |
+--------------------------------------------------+-------------------------------------------------------------------------+
| ``dask_scheduler_workers``                       | Number of workers known by scheduler                                    |
+--------------------------------------------------+-------------------------------------------------------------------------+
| ``dask_scheduler_tasks``                         | Number of tasks known by scheduler                                      |
+--------------------------------------------------+-------------------------------------------------------------------------+
| ``dask_scheduler_tasks_suspicious_total``        | Total number of times a task has been marked suspicious                 |
+--------------------------------------------------+-------------------------------------------------------------------------+
| ``dask_scheduler_tasks_forgotten_total``         | Total number of processed tasks no longer in memory and already         |
|                                                  | removed from the scheduler job queue                                    |
|                                                  |                                                                         |
|                                                  | **Note:** Task groups on the                                            |
|                                                  | scheduler which have all tasks in the forgotten state are not included. |
+--------------------------------------------------+-------------------------------------------------------------------------+
| ``dask_scheduler_prefix_state_totals_total``     | Accumulated count of task prefix in each state                          |
+--------------------------------------------------+-------------------------------------------------------------------------+
| ``dask_scheduler_tick_duration_maximum_seconds`` | Maximum tick duration observed since Prometheus last scraped metrics    |
+--------------------------------------------------+-------------------------------------------------------------------------+
| ``dask_scheduler_tick_count_total``              | Total number of ticks observed since the server started                 |
+--------------------------------------------------+-------------------------------------------------------------------------+
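
By Prometheus convention, metrics whose names end in ``_total`` are cumulative counters, so the change between two scrapes is usually more informative than a single reading.
In Prometheus itself this is normally computed with the ``rate()`` function. The sketch below shows the same arithmetic in Python; ``counter_rate`` is a hypothetical helper, and its reset handling is a simplification of what Prometheus does.

.. code-block:: python

    def counter_rate(previous, current, interval_seconds):
        """Per-second increase of a cumulative counter between two scrapes.

        A current value below the previous one is treated as a counter reset
        (for example after a scheduler restart), in which case the counter is
        assumed to have restarted from zero.
        """
        if interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        increase = current - previous if current >= previous else current
        return increase / interval_seconds

For instance, two readings of ``dask_scheduler_tick_count_total`` taken 15 seconds apart can be passed as ``previous`` and ``current`` to estimate the number of ticks per second.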


Semaphore metrics
^^^^^^^^^^^^^^^^^

The following metrics about semaphores are available on the scheduler:

+-------------------------------------------------+---------------------------------------------------------------------------------+
|                   Metric name                   |                                   Description                                   |
+=================================================+=================================================================================+
| ``dask_semaphore_max_leases``                   | Maximum leases allowed per semaphore                                            |
|                                                 |                                                                                 |
|                                                 | **Note:** This will be constant for each semaphore during its lifetime.         |
+-------------------------------------------------+---------------------------------------------------------------------------------+
| ``dask_semaphore_active_leases``                | Number of currently active leases per semaphore                                 |
+-------------------------------------------------+---------------------------------------------------------------------------------+
| ``dask_semaphore_pending_leases``               | Number of currently pending leases per semaphore                                |
+-------------------------------------------------+---------------------------------------------------------------------------------+
| ``dask_semaphore_acquire_total``                | Total number of leases acquired per semaphore                                   |
+-------------------------------------------------+---------------------------------------------------------------------------------+
| ``dask_semaphore_release_total``                | Total number of leases released per semaphore                                   |
|                                                 |                                                                                 |
|                                                 | **Note:** If a semaphore is closed while there are still leases active,         |
|                                                 | this count will not equal ``dask_semaphore_acquire_total`` after execution.     |
+-------------------------------------------------+---------------------------------------------------------------------------------+
| ``dask_semaphore_average_pending_lease_time_s`` | Exponential moving average of the time it took to acquire a lease per semaphore |
|                                                 |                                                                                 |
|                                                 | **Note:** This only includes time spent on scheduler side,                      |
|                                                 | it does not include time spent on communication.                                |
|                                                 |                                                                                 |
|                                                 | **Note:** This average is calculated based on order of leases                   |
|                                                 | instead of time of lease acquisition.                                           |
+-------------------------------------------------+---------------------------------------------------------------------------------+
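
The last note can be illustrated with a generic exponential moving average that is updated once per acquired lease, so every lease carries the same weight regardless of how much wall-clock time separates it from the previous one.
This is a sketch of the general technique, not the scheduler's actual code; the function name ``update_average`` and the smoothing factor ``alpha`` are assumptions.

.. code-block:: python

    def update_average(average, pending_time, alpha=0.2):
        """Fold one pending time (in seconds) into an exponential moving average.

        ``alpha`` is a hypothetical smoothing factor; larger values weight the
        newest lease more heavily.
        """
        return alpha * pending_time + (1 - alpha) * average


    average = 0.0
    # One update per lease, in acquisition order; timestamps play no role.
    for pending_time in [0.5, 0.5, 2.0]:
        average = update_average(average, pending_time)

Because the update runs per lease rather than per unit of time, a burst of quick acquisitions shifts the average as much as the same number of acquisitions spread over a long period.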


Work-stealing metrics
^^^^^^^^^^^^^^^^^^^^^

If ``work-stealing`` is enabled, the scheduler exposes these metrics:


+---------------------------------------+-----------------------------------+
|              Metric name              |            Description            |
+=======================================+===================================+
| ``dask_stealing_request_count_total`` | Total number of stealing requests |
+---------------------------------------+-----------------------------------+
| ``dask_stealing_request_cost_total``  | Total cost of stealing requests   |
+---------------------------------------+-----------------------------------+


Worker metrics
^^^^^^^^^^^^^^

The worker exposes these metrics about itself:

+-----------------------------------------------+--------------------------------------------------------------------------------+
|                  Metric name                  |                                  Description                                   |
+===============================================+================================================================================+
| ``dask_worker_tasks``                         | Number of tasks at worker                                                      |
+-----------------------------------------------+--------------------------------------------------------------------------------+
| ``dask_worker_threads``                       | Number of worker threads                                                       |
+-----------------------------------------------+--------------------------------------------------------------------------------+
| ``dask_worker_latency_seconds``               | Latency of worker connection                                                   |
+-----------------------------------------------+--------------------------------------------------------------------------------+
| ``dask_worker_memory_bytes``                  | Memory breakdown                                                               |
+-----------------------------------------------+--------------------------------------------------------------------------------+
| ``dask_worker_transfer_incoming_bytes``       | Total size of open data transfers from other workers                           |
+-----------------------------------------------+--------------------------------------------------------------------------------+
| ``dask_worker_transfer_incoming_count``       | Number of open data transfers from other workers                               |
+-----------------------------------------------+--------------------------------------------------------------------------------+
| ``dask_worker_transfer_incoming_count_total`` | Total number of data transfers from other workers since the worker was started |
+-----------------------------------------------+--------------------------------------------------------------------------------+
| ``dask_worker_transfer_outgoing_bytes``       | Total size of open data transfers to other workers                             |
+-----------------------------------------------+--------------------------------------------------------------------------------+
| ``dask_worker_transfer_outgoing_count``       | Number of open data transfers to other workers                                 |
+-----------------------------------------------+--------------------------------------------------------------------------------+
| ``dask_worker_transfer_outgoing_count_total`` | Total number of data transfers to other workers since the worker was started   |
+-----------------------------------------------+--------------------------------------------------------------------------------+
| ``dask_worker_concurrent_fetch_requests``     | **Deprecated:** Renamed to ``dask_worker_transfer_incoming_count``.            |
|                                               |                                                                                |
|                                               | Number of open fetch requests to other workers                                 |
+-----------------------------------------------+--------------------------------------------------------------------------------+
| ``dask_worker_tick_duration_maximum_seconds`` | Maximum tick duration observed since Prometheus last scraped metrics           |
+-----------------------------------------------+--------------------------------------------------------------------------------+
| ``dask_worker_tick_count_total``              | Total number of ticks observed since the server started                        |
+-----------------------------------------------+--------------------------------------------------------------------------------+

If the crick_ package is installed, the worker additionally exposes:

.. _crick: https://github.com/dask/crick

+-------------------------------------------------+----------------------------------+
|                   Metric name                   |           Description            |
+=================================================+==================================+
| ``dask_worker_tick_duration_median_seconds``    | Median tick duration at worker   |
+-------------------------------------------------+----------------------------------+
| ``dask_worker_task_duration_median_seconds``    | Median task runtime at worker    |
+-------------------------------------------------+----------------------------------+
| ``dask_worker_transfer_bandwidth_median_bytes`` | Bandwidth for transfer at worker |
+-------------------------------------------------+----------------------------------+