.. When modifying the contents of the first two sections of this page, please adjust the corresponding page in the Dask documentation accordingly.

Prometheus monitoring
=====================

Prometheus_ is a popular tool for monitoring and alerting on a wide variety of
systems. A Dask distributed cluster exposes a number of Prometheus metrics if the
prometheus_client_ package is installed. The metrics are exposed in Prometheus'
text-based format at the ``/metrics`` endpoint on both schedulers and workers.

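
The text-based exposition format is easy to inspect by hand. As a minimal sketch,
it can be parsed with the standard library alone; the payload below is an
illustrative sample, not actual Dask output:

```python
# Sketch: parse Prometheus' text-based exposition format, as served at the
# /metrics endpoint. The sample payload below is illustrative, not real output.
sample = """\
# HELP dask_scheduler_workers Number of workers known by scheduler
# TYPE dask_scheduler_workers gauge
dask_scheduler_workers 4.0
# HELP dask_scheduler_clients Number of clients connected
# TYPE dask_scheduler_clients gauge
dask_scheduler_clients 2.0
"""


def parse_metrics(text):
    """Return {metric_name: value}, skipping comments and stripping labels."""
    values = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        values[name_part.split("{", 1)[0]] = float(value)
    return values


print(parse_metrics(sample))
# {'dask_scheduler_workers': 4.0, 'dask_scheduler_clients': 2.0}
```

In production you would point a Prometheus server at the endpoint instead of
parsing it yourself; the sketch is only meant to show what a scrape returns.
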
Available metrics
-----------------

Apart from the metrics exposed by default by the prometheus_client_ package,
schedulers and workers expose a number of Dask-specific metrics.

Scheduler metrics
^^^^^^^^^^^^^^^^^

The scheduler exposes the following metrics about itself:

dask_scheduler_clients
    Number of clients connected

dask_scheduler_client_connections_added_total
    Total number of client connections added to the scheduler

    .. note::
        This metric does *not* count distinct clients. If a client disconnects
        and reconnects later on, it will be counted twice.

dask_scheduler_client_connections_removed_total
    Total number of client connections removed from the scheduler

    .. note::
        This metric does *not* count distinct clients. If a client disconnects,
        then reconnects and disconnects again, it will be counted twice.
dask_scheduler_desired_workers
    Number of workers the scheduler needs for the current task graph

dask_scheduler_gil_contention_seconds_total
    Cumulative total of *potential* GIL contention, measured as the cumulative
    number of seconds during which any thread held the GIL locked. Other threads
    may or may not have actually been trying to acquire the GIL in the meantime.

    .. note::
        Requires ``gilknocker`` to be installed and the
        ``distributed.admin.system-monitor.gil.enabled``
        configuration option to be set.
dask_scheduler_gc_collection_seconds_total
    Total time spent on garbage collection

    .. note::
        Due to measurement overhead, this metric only measures
        time spent on garbage collection for generation 2.
dask_scheduler_workers
    Number of workers known by the scheduler

dask_scheduler_workers_added_total
    Total number of workers added to the scheduler

dask_scheduler_workers_removed_total
    Total number of workers removed from the scheduler

dask_scheduler_last_time_total
    Cumulative SystemMonitor time
dask_scheduler_tasks
    Number of tasks known by the scheduler

dask_scheduler_tasks_suspicious_total
    Total number of times a task has been marked suspicious

dask_scheduler_tasks_forgotten_total
    Total number of processed tasks that are no longer in memory and have been
    removed from the scheduler job queue

    .. note::
        Task groups on the scheduler which have all tasks in the forgotten state
        are not included.
dask_scheduler_tasks_compute_seconds_total
    Total time (per prefix) spent computing tasks

dask_scheduler_tasks_transfer_seconds_total
    Total time (per prefix) spent transferring tasks

dask_scheduler_tasks_output_bytes
    Current size of in-memory tasks, broken down by task prefix, without
    duplicates. Note that when a task output is transferred between workers, you
    will typically end up with a duplicate, so this measure will be lower than
    the actual cluster-wide managed memory. See also
    ``dask_worker_memory_bytes``, which does count duplicates.
dask_scheduler_task_groups
    Number of task groups known by the scheduler

dask_scheduler_prefix_state_totals_total
    Accumulated count of tasks of each prefix in each state

dask_scheduler_tick_count_total
    Total number of ticks observed since the server started

dask_scheduler_tick_duration_maximum_seconds
    Maximum tick duration observed since Prometheus last scraped metrics.
    If this is significantly higher than what is configured in
    ``distributed.admin.tick.interval`` (default: 20 ms), it indicates a blocked
    event loop, which in turn hampers timely task execution and network
    communication.

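
The GIL contention metric has to be enabled explicitly. A minimal sketch,
assuming ``dask`` and the optional ``gilknocker`` package are installed:

```python
# Sketch: enable collection of the GIL contention metrics before starting a
# scheduler or worker in the same process. Assumes dask and the optional
# gilknocker package are installed; dotted keys are expanded by dask.config.
import dask

dask.config.set({"distributed.admin.system-monitor.gil.enabled": True})
```

The same option can be set via the ``DASK_DISTRIBUTED__ADMIN__SYSTEM_MONITOR__GIL__ENABLED``
environment variable or a Dask configuration file.
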
Semaphore metrics
^^^^^^^^^^^^^^^^^

The following metrics about :class:`~distributed.Semaphore` objects are available
on the scheduler:

dask_semaphore_max_leases
    Maximum number of leases allowed per semaphore

    .. note::
        This will be constant for each semaphore during its lifetime.

dask_semaphore_active_leases
    Number of currently active leases per semaphore

dask_semaphore_pending_leases
    Number of currently pending leases per semaphore

dask_semaphore_acquire_total
    Total number of leases acquired per semaphore

dask_semaphore_release_total
    Total number of leases released per semaphore

    .. note::
        If a semaphore is closed while there are still active leases, this count
        will not equal ``dask_semaphore_acquire_total`` after execution.
dask_semaphore_average_pending_lease_time_s
    Exponential moving average of the time it took to acquire a lease, per
    semaphore

    .. note::
        This only includes time spent on the scheduler side; it does not include
        time spent on communication.

    .. note::
        This average is calculated based on the order of leases rather than the
        time of lease acquisition.

Work-stealing metrics
^^^^^^^^^^^^^^^^^^^^^

If :doc:`work-stealing` is enabled, the scheduler exposes these metrics:

dask_stealing_request_count_total
    Total number of stealing requests

dask_stealing_request_cost_total
    Total cost of stealing requests

Worker metrics
^^^^^^^^^^^^^^

The worker exposes these metrics about itself:

dask_worker_tasks
    Number of tasks at the worker

dask_worker_threads
    Number of worker threads

dask_worker_gil_contention_seconds_total
    Cumulative total of *potential* GIL contention, measured as the cumulative
    number of seconds during which any thread held the GIL locked. Other threads
    may or may not have actually been trying to acquire the GIL in the meantime.

    .. note::
        Requires ``gilknocker`` to be installed and the
        ``distributed.admin.system-monitor.gil.enabled``
        configuration option to be set.
dask_worker_gc_collection_seconds_total
    Total time spent on garbage collection

    .. note::
        Due to measurement overhead, this metric only measures
        time spent on garbage collection for generation 2.
dask_worker_latency_seconds
    Latency of the worker connection

dask_worker_memory_bytes
    Memory breakdown of the worker

dask_worker_transfer_incoming_bytes
    Total size of open data transfers from other workers

dask_worker_transfer_incoming_count
    Number of open data transfers from other workers

dask_worker_transfer_incoming_count_total
    Total number of data transfers from other workers since the worker was started

dask_worker_transfer_outgoing_bytes
    Size of open data transfers to other workers

dask_worker_transfer_outgoing_bytes_total
    Total size of data transfers to other workers since the worker was started

dask_worker_transfer_outgoing_count
    Number of open data transfers to other workers

dask_worker_transfer_outgoing_count_total
    Total number of data transfers to other workers since the worker was started

dask_worker_concurrent_fetch_requests
    **Deprecated:** This metric has been renamed to
    ``dask_worker_transfer_incoming_count``.
dask_worker_tick_count_total
    Total number of ticks observed since the server started

dask_worker_tick_duration_maximum_seconds
    Maximum tick duration observed since Prometheus last scraped metrics.
    If this is significantly higher than what is configured in
    ``distributed.admin.tick.interval`` (default: 20 ms), it indicates a blocked
    event loop, which in turn hampers timely task execution and network
    communication.
dask_worker_spill_bytes_total
    Total size of spilled/unspilled data since the worker was started;
    in other words, cumulative disk I/O that is attributable to spill activity.
    This includes a ``memory_read`` measure, which allows deriving the cache hit
    ratio::

        cache hit ratio = memory_read / (memory_read + disk_read)

dask_worker_spill_count_total
    Total number of spilled/unspilled keys since the worker was started;
    in other words, cumulative disk accesses that are attributable to spill
    activity. This includes a ``memory_read`` measure, which allows deriving the
    cache hit ratio::

        cache hit ratio = memory_read / (memory_read + disk_read)

dask_worker_spill_time_seconds_total
    Total amount of time spent spilling/unspilling since the worker was
    started, broken down by activity: (de)serialize, (de)compress, (un)spill.
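
The cache hit ratio formula above can be applied directly to scraped counter
values; as a sketch, with illustrative numbers (the counter values below are
made up for the example, not real output):

```python
# Sketch: compute the spill cache hit ratio from the activity breakdown of
# dask_worker_spill_count_total. The counter values are illustrative samples
# of a single scrape, not real output.
spill_count = {"memory_read": 900, "disk_read": 100, "disk_write": 250}

# cache hit ratio = memory_read / (memory_read + disk_read)
hit_ratio = spill_count["memory_read"] / (
    spill_count["memory_read"] + spill_count["disk_read"]
)
print(hit_ratio)  # 0.9
```

A ratio close to 1.0 means most reads of previously spilled keys were served
from memory rather than from disk.
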
If the crick_ package is installed, the worker additionally exposes:

dask_worker_tick_duration_median_seconds
    Median tick duration at the worker

dask_worker_task_duration_median_seconds
    Median task runtime at the worker

dask_worker_transfer_bandwidth_median_bytes
    Median transfer bandwidth at the worker

.. _Prometheus: https://prometheus.io
.. _prometheus_client: https://github.com/prometheus/client_python
.. _crick: https://github.com/dask/crick