File: system-debugging.md

---
myst:
  html_meta:
    "description": "Learn more about common system-level debugging measures for ROCm."
    "keywords": "env, var, sys, PCIe, troubleshooting, admin, error"
---

# System debugging

## ROCm language and system-level debug, flags, and environment variables

To keep the Ethernet port from being renamed every time you change graphics cards, add the kernel options `net.ifnames=0 biosdevname=0`.

## ROCr error codes

* 2 Invalid Dimension
* 4 Invalid Group Memory
* 8 Invalid (or Null) Code
* 32 Invalid Format
* 64 Group is too large
* 128 Out of VGPRs
* 0x80000000 Debug Options
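
The values above are all distinct powers of two, which suggests they can be combined into a single code. The following is a minimal sketch of a decoder under that assumption (the bit-flag interpretation and the helper name `decode_rocr_error` are illustrative and not taken from the ROCr sources):

```python
# Assumption: the listed ROCr error values are bit flags that may be OR-ed together.
ROCR_ERROR_FLAGS = {
    2: "Invalid Dimension",
    4: "Invalid Group Memory",
    8: "Invalid (or Null) Code",
    32: "Invalid Format",
    64: "Group is too large",
    128: "Out of VGPRs",
    0x80000000: "Debug Options",
}


def decode_rocr_error(code: int) -> list[str]:
    """Return the description of every listed flag that is set in `code`."""
    names = [text for bit, text in ROCR_ERROR_FLAGS.items() if code & bit]
    known_bits = sum(ROCR_ERROR_FLAGS)  # the flags are disjoint, so sum equals OR
    unknown = code & ~known_bits
    if unknown:
        # Report bits that are not in the list instead of silently dropping them.
        names.append(f"Unknown bits: {unknown:#x}")
    return names


print(decode_rocr_error(130))  # 130 = 128 + 2
```

With this sketch, a code of 130 decodes to "Invalid Dimension" and "Out of VGPRs".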

## Commands to dump the firmware version and get the Linux kernel version

`sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info`

`uname -a`
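
The DRI index in the path above (`1`) may differ between systems, for example when more than one graphics device is present. As a hedged sketch, the following script looks for every `amdgpu_firmware_info` file under the debugfs directory and prints it together with the kernel release. The helper names are hypothetical, and the script assumes debugfs is mounted at `/sys/kernel/debug` and that it runs as root:

```python
import glob
import platform


def find_firmware_info(debugfs_root: str = "/sys/kernel/debug/dri") -> list[str]:
    """List every amdgpu_firmware_info file below the DRI debugfs directory.

    Returns an empty list if the directory is missing or not readable
    (debugfs is normally accessible to root only).
    """
    return sorted(glob.glob(f"{debugfs_root}/*/amdgpu_firmware_info"))


def kernel_release() -> str:
    """Return the kernel release string, the same value `uname -r` prints."""
    return platform.release()


if __name__ == "__main__":
    print("Kernel release:", kernel_release())
    for path in find_firmware_info():
        print("==", path)
        with open(path) as handle:
            print(handle.read())
```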

## Debug flags

When developing or debugging the base ROCm driver, you can enable debug messages from `libhsakmt.so` by setting the `HSAKMT_DEBUG_LEVEL` environment variable. Available debug levels are 3-7. The higher the level, the more messages are printed.

* `export HSAKMT_DEBUG_LEVEL=3` : Only `pr_err()` prints.

* `export HSAKMT_DEBUG_LEVEL=4` : `pr_err()` and `pr_warn()` print.

* `export HSAKMT_DEBUG_LEVEL=5` : "notice" is not currently implemented. Setting the level to 5 is the same as setting it to 4.

* `export HSAKMT_DEBUG_LEVEL=6` : `pr_err()`, `pr_warn()`, and `pr_info()` print.

* `export HSAKMT_DEBUG_LEVEL=7` : Everything prints, including `pr_debug()`.
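
The level list above can be written as a small lookup table. This sketch picks the lowest level that prints a given kind of message and builds an environment for a child process. The names `HSAKMT_LEVELS`, `lowest_level_for`, and `hsakmt_env` are hypothetical helpers, not part of `libhsakmt.so`:

```python
import os

# Message kinds printed at each HSAKMT_DEBUG_LEVEL, as listed above.
HSAKMT_LEVELS = {
    3: ("pr_err",),
    4: ("pr_err", "pr_warn"),
    5: ("pr_err", "pr_warn"),  # "notice" is not implemented, so 5 behaves like 4
    6: ("pr_err", "pr_warn", "pr_info"),
    7: ("pr_err", "pr_warn", "pr_info", "pr_debug"),
}


def lowest_level_for(kind: str) -> int:
    """Return the lowest debug level at which messages of `kind` are printed."""
    for level in sorted(HSAKMT_LEVELS):
        if kind in HSAKMT_LEVELS[level]:
            return level
    raise ValueError(f"unknown message kind: {kind}")


def hsakmt_env(kind: str) -> dict[str, str]:
    """Copy the current environment and set HSAKMT_DEBUG_LEVEL for `kind`."""
    env = dict(os.environ)
    env["HSAKMT_DEBUG_LEVEL"] = str(lowest_level_for(kind))
    return env


print(hsakmt_env("pr_info")["HSAKMT_DEBUG_LEVEL"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` so that the setting applies to one process only.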

## ROCr level environment variables for debug

`HSA_ENABLE_SDMA=0`

`HSA_ENABLE_INTERRUPT=0`

`HSA_SVM_GUARD_PAGES=0`

`HSA_DISABLE_CACHE=1`
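
When narrowing down a problem, it can help to apply these variables to a single run instead of exporting them in the shell. The following is a minimal sketch, assuming the variables are read from the process environment at startup. The helper `run_with_rocr_debug` is hypothetical, and the command shown is a stand-in for a real ROCm application:

```python
import os
import subprocess
import sys

# The ROCr debug settings listed above.
ROCR_DEBUG_ENV = {
    "HSA_ENABLE_SDMA": "0",
    "HSA_ENABLE_INTERRUPT": "0",
    "HSA_SVM_GUARD_PAGES": "0",
    "HSA_DISABLE_CACHE": "1",
}


def run_with_rocr_debug(cmd, names=tuple(ROCR_DEBUG_ENV)):
    """Run `cmd` with the selected ROCr debug variables set for that process only."""
    env = dict(os.environ)
    env.update({name: ROCR_DEBUG_ENV[name] for name in names})
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Stand-in workload: a child Python process that echoes one variable back.
result = run_with_rocr_debug(
    [sys.executable, "-c", "import os; print(os.environ['HSA_ENABLE_SDMA'])"],
    names=["HSA_ENABLE_SDMA"],
)
print(result.stdout.strip())
```

Passing one name at a time through `names` makes it easier to see which setting changes the behavior under investigation.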

## Turn off page retry on GFX9/Vega devices

`sudo -s`

`echo 1 > /sys/module/amdkfd/parameters/noretry`

## HIP environment variables 3.x

### OpenCL debug flags

`AMD_OCL_WAIT_COMMAND=1` (0 = off, 1 = on)

## PCIe-debug

For information on how to debug and profile HIP applications, see {doc}`hip:how-to/debugging`.