File: hang-debugging.rst

package info (click to toggle)
mesa 25.2.8-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 312,152 kB
  • sloc: ansic: 2,185,354; xml: 1,028,239; cpp: 512,236; python: 76,148; asm: 38,329; yacc: 12,198; lisp: 4,114; lex: 3,429; sh: 855; makefile: 237
file content (85 lines) | stat: -rw-r--r-- 3,564 bytes parent folder | download | duplicates (5)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
:orphan:

.. _radv-debug-hang:

Debugging GPU hangs with RADV
=============================

UMR (optional)
--------------

UMR is needed for dumping a lot of useful information. Clone, build and install
`UMR <https://gitlab.freedesktop.org/tomstdenis/umr>`__. Do not forget to run
``chmod +s $(which umr)`` so RADV can actually access UMR.

UMR needs to access some kernel debug interfaces:

.. code-block:: sh

   chmod 777 /sys/kernel/debug
   chmod -R 777 /sys/kernel/debug/dri

Secure boot has to be disabled as well.

Generating and analyzing hang reports
-------------------------------------

With UMR installed, you can now set ``RADV_DEBUG=hang`` which makes RADV insert
trace markers and synchronization and check for hangs. The hang report will be
saved to ``~/radv_dumps_<pid>_<time>``. Inside the directory of the hang report,
there are a couple of files:

* ``*.spv``: SPIR-V binaries of the pipeline that was bound when the hang
  occurred.
* ``addr_binding_report.log``: VK_EXT_address_binding_report logs.
* ``app_info.log``: ``VkApplicationInfo`` fields.
* ``bo_history.log``: A list of every GPU memory allocation and deallocation.
  If the GPU hang was caused by a page fault, you can use
  `radv_check_va.py <https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/amd/vulkan/radv_check_va.py>`__
  to figure out if address is invalid or used after the memory was deallocated.
* ``bo_ranges.log``: Address ranges that were valid at the time of submission.
* ``dmesg.log``: Output of ``dmesg``, if available.
* ``gpu_info.log``: Fields of ``radeon_info``.
* ``pipeline.log``: IR of the shaders that were bound during the hang as well as
  programm counters of waves executing said shaders and bound descriptors.
* ``registers.log``: Various GPU state registers.
* ``trace.log``: An annotated list of the command stream that caused the hang.
  the commands that hung come after
  ``!!!!! This is the last trace point that was reached by the CP !!!!!``.
* ``umr_ring.log``: Similar to ``trace.log``.
* ``umr_waves.log``: A list of waves that were active at the time of the hang,
  including register values.
* ``vm_fault.log``: The page fault address if a page fault occurred.

Note: By default, the backend IR (ACO or LLVM) and the disassembly should be
dumped to ``pipeline.log``. But due to shaders caching, you might need
``RADV_DEBUG=hang,nocache`` to get SPIR-V and NIR in the GPU hang report.

Debugging Steam games
---------------------

Steam games require a bit more work so RADV can access UMR: In your Steam library,
make sure **Tools** is checked and search for **Steam Linux Runtime**.
Under **Properties** -> **Installed Files**, click **Browse**, open
``_v2-entry-point`` and add

.. code-block:: sh

   shift 2
   exec "$@"

at the top of the file. Hang debugging can be enabled by selecting the faulting
game and adding ``RADV_DEBUG=hang %command%`` under **Properties** -> **General**
-> **LAUNCH OPTIONS**.

Debugging hangs without RADV_DEBUG=hang
---------------------------------------

In some situations, ``RADV_DEBUG=hang`` wouldn't be able to generate a GPU hang
report, like for synchronization issues (because it enables
``RADV_DEBUG=syncshaders`` behind the scene). An alternative solution is to
disable GPU recovery by adding ``amdgpu.gpu_recovery=0`` to your kernel command
line options. And then invoke UMR manually with
``umr --by-pci <pci_id> -O bits,halt_waves -go 0 -wa <ring> -go 1 2>&1`` for
dumping the waves and ``umr --by-pci <pci_id> -RS <ring> 2>&1`` for dumping the
rings once the GPU hang occurred.