# Ricks-Lab GPU Utilities - User Guide
A set of utilities for monitoring GPU performance and modifying control settings.
## Current rickslab-gpu-utils Version: 3.7.x
- [Installation](#installation)
- [Getting Started](#getting-started)
- [Using gpu-ls](#using-gpu-ls)
- [GPU Type Dependent Behavior](#gpu-type-dependent-behavior)
- [Using gpu-mon](#using-gpu-mon)
- [Using gpu-plot](#using-gpu-plot)
- [Using gpu-pac](#using-gpu-pac)
- [Updating the PCI ID decode file](#updating-the-pci-id-decode-file)
- [Optimizing Compute Performance-Power](#optimizing-compute-performance-power)
- [Running Startup PAC Bash Files](#running-startup-pac-bash-files)
## Installation
Four installation methods are available, summarized here:
* [Repository](#repository-installation) - This approach is recommended for those interested in contributing to the project or helping to troubleshoot an issue in real time with the developer. This type of installation can exist alongside any of the other installation types.
* [PyPI](#pypi-installation) - Meant for users wanting to run the very latest version. All **PATCH** level versions are released here first. This install method is also meant for users not on a Debian distribution.
* [Rickslab.com Debian](#rickslabcom-debian-installation) - Lags the PyPI release in order to assure robustness. May not include every **PATCH** version.
* [Official Debian](#official-debian-package-installation) - Only **MAJOR/MINOR** releases. This works for releases of Ubuntu 22.04 or Bullseye 11.3 or later.
### Repository Installation
For a developer/contributor to the project, it is expected that you duplicate the development environment
using a virtual environment. So far, my development activities for this project have used python3.6.
The following are details on setting up a virtual environment with python3.6:
```commandline
sudo apt install -y python3.6-venv
sudo apt install -y python3.6-dev
```
Clone the repository from GitHub with the following command:
```commandline
git clone https://github.com/Ricks-Lab/gpu-utils.git
cd gpu-utils
```
Initialize your *rickslab-gpu-utils-env* if this is your first time using it. From the project root directory, execute:
```commandline
python3.6 -m venv rickslab-gpu-utils-env
source rickslab-gpu-utils-env/bin/activate
pip install --upgrade pip
pip install --no-cache-dir -r requirements-venv.txt
```
If you get errors installing `vext`, you may need to use the `--use-pep517` option:
```commandline
pip install --no-cache-dir --use-pep517 -r requirements-venv.txt
```
On newer systems, I have found that I get a `ModuleNotFoundError: No module named 'numpy'`, even though `numpy` was
successfully installed in the newly created virtual environment. To resolve this, I deactivated the venv and installed
`numpy` for the system instance of Python. Back in the venv, the issue was resolved. I do not know why this happens.
You then run the desired commands by specifying the full path: `./gpu-ls`
### PyPI Installation
First, remove any previous Debian package and any ricks-amdgpu-utils PyPI installations:
```commandline
sudo apt purge rickslab-gpu-utils
sudo apt purge ricks-amdgpu-utils
sudo apt autoremove
pip3 uninstall ricks-amdgpu-utils
```
Install the latest package from [PyPI](https://pypi.org/project/rickslab-gpu-utils/) with the following
commands:
```commandline
pip3 install rickslab-gpu-utils
```
Or, use the pip upgrade option if you have already installed a previous version:
```commandline
pip3 install rickslab-gpu-utils -U
```
You may need to open a new terminal window in order for the path to the utilities to be set.
### Rickslab.com Debian Installation
First, remove any previous PyPI installation and exit that terminal. If you
also have a Debian installed version, the pip uninstall will likely fail,
unless you remove the Debian package first. You can skip this step if you
are certain no other install types are still installed:
```commandline
sudo apt purge rickslab-gpu-utils
sudo apt purge ricks-amdgpu-utils
sudo apt autoremove
pip uninstall rickslab-gpu-utils
exit
```
If you had previously (before 3.7.6) installed from rickslab.com, you should
delete the key from the apt keyring:
```commandline
sudo apt-key del C98B8839
```
Next, add the *rickslab-gpu-utils* repository:
```shell
wget -q -O - https://debian.rickslab.com/PUBLIC.KEY | sudo gpg --dearmour -o /usr/share/keyrings/rickslab-agent.gpg
echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/rickslab-agent.gpg] https://debian.rickslab.com/gpu-utils/ eddore main' | sudo tee /etc/apt/sources.list.d/rickslab-gpu-utils.list
sudo apt update
```
Then install the package with apt:
```commandline
sudo apt install rickslab-gpu-utils
```
If you decide to no longer use this type of install, you can remove
rickslab-gpu-utils from the system repository list by executing the following:
```shell
echo '' | sudo tee /etc/apt/sources.list.d/rickslab-gpu-utils.list
```
### Official Debian Package Installation
First you should verify the availability of the package by distribution with the following command:
```commandline
rmadison rickslab-gpu-utils
```
Current package availability is as follows:
```text
rickslab-gpu-utils | 3.6.0-2 | jammy/universe | source, all
rickslab-gpu-utils | 3.6.0-3 | kinetic/universe | source, all
rickslab-gpu-utils | 3.8.0-1 | lunar/universe | source, all
rickslab-gpu-utils | 3.8.0-1 | mantic/universe | source, all
```
Then remove any previous PyPI installation and exit that terminal. If you also
have a Debian installed version, the pip uninstall will likely fail, unless you
remove the Debian package first. You can skip this step if you are certain no
other install types have been installed:
```commandline
sudo apt purge ricks-amdgpu-utils
sudo apt purge rickslab-gpu-utils
sudo apt autoremove
pip uninstall rickslab-gpu-utils
exit
```
If you had previously added https://debian.rickslab.com/gpu-utils/ as a repository
source, then you will need to remove this in order to download from the official
debian repository. This can be accomplished with the following shell command:
```shell
echo '' | sudo tee /etc/apt/sources.list.d/rickslab-gpu-utils.list
```
Then update the package list and install the package with apt:
```commandline
sudo apt update
sudo apt install rickslab-gpu-utils
```
## Getting Started
First, this set of utilities is written and tested with Python3.6. If you are using an older
version, you will likely see syntax errors. If you are encountering problems, then execute:
```commandline
gpu-chk
```
This should display a message indicating any Python or kernel incompatibilities. To get the
maximum capability from these utilities, you should be running a kernel that supports the GPUs
you have installed. If using AMD GPUs, installing the latest **amdgpu** driver or **ROCm** release
may provide additional capabilities. If you have Nvidia GPUs installed, you should have
**nvidia-smi** installed so that the utilities can read the cards. Writing to GPUs is currently
only possible for AMD GPUs, and only with compatible cards. Modifying AMD GPU properties requires
that the AMD ppfeaturemask be set to 0xfffd7fff. This can be accomplished by adding
`amdgpu.ppfeaturemask=0xfffd7fff` to the `GRUB_CMDLINE_LINUX_DEFAULT` value in `/etc/default/grub`
and executing `sudo update-grub`, as detailed in the steps below.

I found a more specific way of determining the ppfeaturemask value that sets only the required
bits. I have not yet tested it on enough systems to know whether it is robust:
```shell
printf 'amdgpu.ppfeaturemask=0x%x\n' "$(($(cat /sys/module/amdgpu/parameters/ppfeaturemask) | 0x4000))"
```
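The same calculation can be sketched in Python, which may be easier to follow than the shell arithmetic. This is a minimal illustration and not part of the utilities: the function names are hypothetical, and it assumes the sysfs file holds a single integer in hex (`0x...`) or decimal form. The bit `0x4000` is the one OR-ed in by the shell command above; as far as I know, it corresponds to the amdgpu overdrive feature bit.

```python
from pathlib import Path

# Bit OR-ed into the current mask by the shell one-liner above.
OVERDRIVE_BIT = 0x4000

# Same sysfs file that the shell one-liner reads.
PPFEATUREMASK_PATH = Path("/sys/module/amdgpu/parameters/ppfeaturemask")


def grub_ppfeaturemask(current: str) -> str:
    """Return the kernel parameter for a mask string with 0x4000 set."""
    # Base 0 lets int() accept either '0x...' hex or plain decimal text.
    value = int(current.strip(), 0)
    return f"amdgpu.ppfeaturemask=0x{value | OVERDRIVE_BIT:x}"


def read_current_mask(path: Path = PPFEATUREMASK_PATH) -> str:
    """Read the raw mask text; only works where the amdgpu module is loaded."""
    return path.read_text()


# Example with a literal value rather than a live sysfs read.
print(grub_ppfeaturemask("0xfffd3fff"))  # prints amdgpu.ppfeaturemask=0xfffd7fff
```

On a system with the amdgpu module loaded, `grub_ppfeaturemask(read_current_mask())` should give the same string as the shell command.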
```commandline
cd /etc/default
sudo vi grub
```
Modify to include the featuremask as follows:
```shell
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.ppfeaturemask=0xfffd7fff"
```
After saving, update grub:
```commandline
sudo update-grub
```
and then reboot.
If you have Nvidia GPUs installed, you will need to have nvidia-smi installed.
## Using gpu-ls
After getting your system set up to support **rickslab-gpu-utils**, it is best to verify functionality by
listing your GPU details with the *gpu-ls* command. The utility will use the system `lspci` command
to identify all installed GPUs. The utility will also verify system setup/configuration for read, write,
and compute capability. Additional performance/configuration details are read from the GPU for compatible
GPUs. An example of the output is as follows:
```
OS command [nvidia-smi] executable found: [/usr/bin/nvidia-smi]
Detected GPUs: INTEL: 1, NVIDIA: 1, AMD: 1
AMD: amdgpu version: 20.10-1048554
AMD: Wattman features enabled: 0xfffd7fff
3 total GPUs, 1 rw, 1 r-only, 0 w-only
Card Number: 0
Vendor: INTEL
Readable: False
Writable: False
Compute: False
Device ID: {'device': '0x3e91', 'subsystem_device': '0x8694', 'subsystem_vendor': '0x1043', 'vendor': '0x8086'}
Decoded Device ID: 8th Gen Core Processor Gaussian Mixture Model
Card Model: Intel Corporation 8th Gen Core Processor Gaussian Mixture Model
PCIe ID: 00:02.0
Driver: i915
GPU Type: Unsupported
HWmon: None
Card Path: /sys/class/drm/card0/device
System Card Path: /sys/devices/pci0000:00/0000:00:02.0
Card Number: 1
Vendor: NVIDIA
Readable: True
Writable: False
Compute: True
GPU UID: GPU-fcbaadc4-4040-c2e5-d5b6-52d1547bcc64
GPU S/N: [Not Supported]
Device ID: {'device': '0x1381', 'subsystem_device': '0x1073', 'subsystem_vendor': '0x10de', 'vendor': '0x10de'}
Decoded Device ID: GM107 [GeForce GTX 750]
Card Model: GeForce GTX 750
Display Card Model: GeForce GTX 750
Card Index: 0
PCIe ID: 01:00.0
Link Speed: GEN3
Link Width: 8
##################################################
Driver: 390.138
vBIOS Version: 82.07.32.00.32
Compute Platform: OpenCL 1.2 CUDA
Compute Mode: Default
GPU Type: Supported
HWmon: None
Card Path: /sys/class/drm/card1/device
System Card Path: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0
##################################################
Current Power (W): 15.910
Power Cap (W): 38.50
Power Cap Range (W): [30.0, 38.5]
Fan Target Speed (rpm): None
Current Fan PWM (%): 40.000
##################################################
Current GPU Loading (%): 100
Current Memory Loading (%): 36
Current VRAM Usage (%): 91.437
Current VRAM Used (GB): 0.876
Total VRAM (GB): 0.958
Current Temps (C): {'temperature.gpu': 40.0, 'temperature.memory': None}
Current Clk Frequencies (MHz): {'clocks.gr': 1163.0, 'clocks.mem': 2505.0, 'clocks.sm': 1163.0, 'clocks.video': 1046.0}
Maximum Clk Frequencies (MHz): {'clocks.max.gr': 1293.0, 'clocks.max.mem': 2505.0, 'clocks.max.sm': 1293.0}
Current SCLK P-State: [0, '']
Power Profile Mode: [Not Supported]
Card Number: 2
Vendor: AMD
Readable: True
Writable: True
Compute: True
GPU UID: None
Device ID: {'device': '0x731f', 'subsystem_device': '0xe411', 'subsystem_vendor': '0x1da2', 'vendor': '0x1002'}
Decoded Device ID: Radeon RX 5600 XT
Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] (rev ca)
Display Card Model: Radeon RX 5600 XT
PCIe ID: 04:00.0
Link Speed: 16 GT/s
Link Width: 16
##################################################
Driver: amdgpu
vBIOS Version: 113-5E4111U-X4G
Compute Platform: OpenCL 2.0 AMD-APP (3075.10)
GPU Type: CurvePts
HWmon: /sys/class/drm/card2/device/hwmon/hwmon3
Card Path: /sys/class/drm/card2/device
System Card Path: /sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/0000:03:00.0/0000:04:00.0
##################################################
Current Power (W): 99.000
Power Cap (W): 160.000
Power Cap Range (W): [0, 192]
Fan Enable: 0
Fan PWM Mode: [2, 'Dynamic']
Fan Target Speed (rpm): 1170
Current Fan Speed (rpm): 1170
Current Fan PWM (%): 35
Fan Speed Range (rpm): [0, 3200]
Fan PWM Range (%): [0, 100]
##################################################
Current GPU Loading (%): 50
Current Memory Loading (%): 49
Current GTT Memory Usage (%): 0.432
Current GTT Memory Used (GB): 0.026
Total GTT Memory (GB): 5.984
Current VRAM Usage (%): 11.969
Current VRAM Used (GB): 0.716
Total VRAM (GB): 5.984
Current Temps (C): {'edge': 54.0, 'junction': 61.0, 'mem': 68.0}
Critical Temps (C): {'edge': 118.0, 'junction': 99.0, 'mem': 99.0}
Current Voltages (V): {'vddgfx': 937}
Current Clk Frequencies (MHz): {'mclk': 875.0, 'sclk': 1780.0}
Current SCLK P-State: [2, '1780Mhz']
SCLK Range: ['800Mhz', '1820Mhz']
Current MCLK P-State: [3, '875Mhz']
MCLK Range: ['625Mhz', '930Mhz']
Power Profile Mode: 5-COMPUTE
Power DPM Force Performance Level: manual
```
If everything is working fine, you should see no warnings or errors. The listing utility
also has other command line options:
```
usage: gpu-ls [-h] [--about] [--long | --short | --table | --raw]
[--pstates | --ppm | --features | --clinfo] [--verbose]
[--force_all] [--no_markup] [--no_fan] [--debug]
optional arguments:
-h, --help show this help message and exit
--about README
--long Long listing of GPU details. Includes pstate, ppm, features,
and clinfo.
--short Short listing of basic GPU details
--table Current status of readable GPUs
--raw Show all raw GPU sensor data
--pstates Output pstate tables instead of GPU details
--ppm Output power/performance mode tables instead of GPU details
--features Output amdgpu Feature table instead of GPU details
--clinfo Include openCL with card details
--verbose Display informational message of GPU util progress
--force_all Force attempt to read all sensors
--no_markup Do not format ls output
--no_fan Do not include fan setting options
-d, --debug Debug logger output
```
The *--clinfo* option will make a call to clinfo, if it is installed, and list OpenCL parameters
along with the basic parameters. The benefit of running this in *gpu-ls* is that the tool
uses the PCIe slot id to associate clinfo results with the appropriate GPU in the listing.
If you have the clinfo package installed, then the command *gpu-ls --clinfo* should provide something
like this at the end of each card's listing (example shown for an AMD GPU):
```
Card Number: 1
Vendor: AMD
PP Features: 0x0000000019f0e3cf
Readable: True
Writable: True
Compute: True
Device ID: {'device': '0x66af', 'subsystem_device': '0x1000', 'subsystem_vendor': '0x1458', 'vendor': '0x1002'}
Decoded Device ID: Vega 20 [Radeon VII]
PCIe ID: 43:00.0
GPU Type: Modern
HWmon: /sys/class/drm/card1/device/hwmon/hwmon1
Card Path: /sys/class/drm/card1/device
System Card Path: /sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.0
### CLINFO Table Data ############################
Device OpenCL C Version: OpenCL C 2.0
Device Name: gfx906
Device Version: OpenCL 2.0 AMD-APP (3143.9)
Driver Version: 3143.9 (PAL,HSAIL)
Max Compute Units: 60
SIMD per CU: 4
SIMD Width: 16
SIMD Instruction Width: 1
CL Max Memory Allocation: 14588628172
Max Work Item Dimensions: 3
Max Work Item Sizes: 1024 1024 1024
Max Work Group Size: 1024
Preferred Work Group Size: 256
Preferred Work Group Multiple: 64
```
If not, then to see the *clinfo* data you may need to add yourself to the 'video' and 'render' groups by using
these commands:
```
sudo usermod -a -G video $LOGNAME
sudo usermod -a -G render $LOGNAME
```
The *--pstates* option will display the GPU P-State definition table and all other available P-State details.
```
Card Number: 1
Vendor: AMD
PP Features: 0x0000000019f0e3cf
Readable: True
Writable: True
Compute: True
Device ID: {'device': '0x66af', 'subsystem_device': '0x1000', 'subsystem_vendor': '0x1458', 'vendor': '0x1002'}
Decoded Device ID: Vega 20 [Radeon VII]
PCIe ID: 43:00.0
GPU Type: CurvePts
HWmon: /sys/class/drm/card1/device/hwmon/hwmon1
Card Path: /sys/class/drm/card1/device
System Card Path: /sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.0
### P-State Table Data ###########################
##################################################
DPM States:
SCLK: MCLK:
0: 701Mhz 0: 351Mhz
1: 809Mhz 1: 801Mhz
2: 1135Mhz 2: 1001Mhz
3: 1373Mhz
4: 1547Mhz
5: 1684Mhz
6: 1750Mhz
7: 1774Mhz
8: 1802Mhz
##################################################
PP OD States:
SCLK: MCLK:
0: 808Mhz -
1: 1801Mhz - 1: 1000Mhz -
##################################################
VDDC_CURVE:
0: ['808Mhz', '722mV']
1: ['1304Mhz', '820mV']
2: ['1801Mhz', '1122mV']
##################################################
All Pstates:
mclk:
0: *351Mhz, 1: 801Mhz, 2: 1001Mhz
dcefclk:
0: *358Mhz, 1: 454Mhz, 2: 567Mhz, 3: 680Mhz, 4: 756Mhz, 5: 850Mhz, 6: 972Mhz, 7: 1134Mhz
socclk:
0: 310Mhz, 1: 524Mhz, 2: 567Mhz, 3: 619Mhz, 4: 680Mhz, 5: 756Mhz, 6: 850Mhz, 7: *972Mhz
fclk:
0: 551Mhz, 1: 611Mhz, 2: 691Mhz, 3: 761Mhz, 4: 871Mhz, 5: 961Mhz, 6: 1081Mhz, 7: *1226Mhz
sclk:
0: 701Mhz, 1: *809Mhz, 2: 1135Mhz, 3: 1373Mhz, 4: 1547Mhz, 5: 1684Mhz, 6: 1750Mhz, 7: 1774Mhz, 8: 1802Mhz
```
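In the "All Pstates" lines above, an asterisk marks the state that is currently active, which is how the amdgpu driver flags the current level in its `pp_dpm_*` files. The sketch below shows one way such a line could be parsed. It assumes the exact text layout of the listing above and uses a hypothetical function name; it is not the parser used by *gpu-ls*.

```python
import re
from typing import List, Optional, Tuple

# Matches entries such as '0: *351Mhz' or '1: 801Mhz'.
_ENTRY = re.compile(r"(\d+):\s*(\*?)(\d+)Mhz")


def parse_pstate_line(line: str) -> Tuple[List[Tuple[int, int]], Optional[int]]:
    """Return ([(index, MHz), ...], index of the active state or None)."""
    states: List[Tuple[int, int]] = []
    current: Optional[int] = None
    for index, star, mhz in _ENTRY.findall(line):
        states.append((int(index), int(mhz)))
        if star:  # the asterisk flags the active state
            current = int(index)
    return states, current


states, current = parse_pstate_line("0: *351Mhz, 1: 801Mhz, 2: 1001Mhz")
print(states)   # [(0, 351), (1, 801), (2, 1001)]
print(current)  # 0
```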
Different generations of cards will provide different information with the *--ppm* option. Here is an
example for an AMD Vega 20 (Radeon VII) card:
```
Card Number: 1
Vendor: AMD
PP Features: 0x0000000019f0e3cf
Readable: True
Writable: True
Compute: True
Device ID: {'device': '0x66af', 'subsystem_device': '0x1000', 'subsystem_vendor': '0x1458', 'vendor': '0x1002'}
Decoded Device ID: Vega 20 [Radeon VII]
PCIe ID: 43:00.0
GPU Type: Modern
HWmon: /sys/class/drm/card1/device/hwmon/hwmon1
Card Path: /sys/class/drm/card1/device
System Card Path: /sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.0
### PPM Table Data ###############################
PROFILE_INDEX(NAME) CLOCK_TYPE(NAME) FPS UseRlcBusy MinActiveFreqType MinActiveFreq BoosterFreqType BoosterFreq PD_Data_limit_c PD_Data_error_coeff PD_Data_error_rate_coeff
0 BOOTUP_DEFAULT*:
0( GFXCLK) 0 0 1 0 4 800 4587520 -65536 0
1( SOCCLK) 0 0 1 0 4 800 327680 -6553 0
2( UCLK) 0 0 1 0 4 800 327680 -65536 0
3( FCLK) 0 0 0 0 4 800 327680 -6553 0
1 3D_FULL_SCREEN :
0( GFXCLK) 0 1 1 0 4 800 4587520 -65536 0
1( SOCCLK) 0 1 4 850 4 800 327680 -65536 0
2( UCLK) 0 1 4 850 4 800 327680 -65536 0
3( FCLK) 0 1 4 850 4 800 327680 -65536 0
2 POWER_SAVING :
0( GFXCLK) 0 0 1 0 3 0 5898240 -65536 0
1( SOCCLK) 0 0 1 0 3 0 1310720 -6553 0
2( UCLK) 0 0 1 0 3 0 1966080 -65536 0
3( FCLK) 0 0 0 0 3 800 1966080 -6553 0
3 VIDEO :
0( GFXCLK) 0 1 1 0 4 500 4587520 -6553 0
1( SOCCLK) 0 0 1 0 4 500 1310720 -6553 0
2( UCLK) 0 0 1 0 4 500 1966080 -65536 0
3( FCLK) 0 0 3 0 4 500 1966080 -6553 0
4 VR :
0( GFXCLK) 0 1 0 1540 4 800 5898240 -6553 65536
1( SOCCLK) 0 1 2 0 4 800 327680 -32768 -65536
2( UCLK) 0 1 2 0 4 800 327680 -32768 -65536
3( FCLK) 0 1 2 0 4 800 327680 -32768 -65536
5 COMPUTE :
0( GFXCLK) 0 1 0 1600 3 0 3932160 -65536 -65536
1( SOCCLK) 0 0 4 850 3 0 327680 -65536 -32768
2( UCLK) 0 0 4 850 3 0 327680 -65536 -32768
3( FCLK) 0 0 4 850 3 0 327680 -65536 -32768
6 CUSTOM :
0( GFXCLK) 0 0 1 0 4 800 4587520 -65536 0
1( SOCCLK) 0 0 1 0 4 800 327680 -6553 0
2( UCLK) 0 0 1 0 4 800 327680 -65536 0
3( FCLK) 0 0 0 0 4 800 327680 -6553 0
```
## GPU Type Dependent Behavior
GPU capability and compatibility varies over the various vendors and generations of hardware. In
order to manage this variability, **rickslab-gpu-utils** must classify each installed GPU by its vendor
and type. So far, valid types are as follows:
* **Undefined** - This is the default assigned type, before a valid type can be determined.
* **Unsupported** - This is the type assigned for cards which have no capability of reading beyond basic parameters typical of PCIe devices.
* **Supported** - This is the type assigned for basic readability, including *nvidia-smi* readable GPUs.
* **Legacy** - Applies to legacy AMD GPUs with very basic parameters available to read. (pre-HD7)
* **LegacyAPU** - Applies to older AMD integrated graphics with very few parameters available. (Ontario)
* **APU** - Applies to AMD integrated graphics with limited parameters available. (Carrizo - Renoir)
* **PStatesNE** - Applies to AMD GPUs with most parameters available, but Pstates not writeable. (HD7 series)
* **PStates** - Applies to modern AMD GPUs with writeable Pstates. (R9 series through RX-Vega)
* **CurvePts** - Applies to latest generation AMD GPUs that use AVFS curves instead of Pstates. (Vega20 and newer)
With the *gpu-ls* tool, you can determine the type of your installed GPUs. Here are examples of
relevant lines from the output for different types of GPUs:
```
Decoded Device ID: 8th Gen Core Processor Gaussian Mixture Model [Intel CPU with integrated graphics]
GPU Type: Unsupported
Decoded Device ID: GM107 [GeForce GTX 750]
GPU Type: Supported
Decoded Device ID: R9 290X DirectCU II
GPU Type: PStatesNE
Decoded Device ID: RX Vega64
GPU Type: PStates
Decoded Device ID: Radeon VII
GPU Type: CurvePts
Decoded Device ID: Radeon RX 5600 XT
GPU Type: CurvePts
```
Monitor and Control utilities will differ between these types:
* For **Undefined** and **Unsupported** types, only generic PCIe parameters are available. These types are
considered unreadable, unwritable, and as having no compute capability.
* **Supported** types have the most basic level of readability. This includes NV cards with nvidia-smi support.
* For **Legacy** and **APU**, only basic and limited parameters, respectively, are readable.
* For **PStates** and **PStatesNE** type GPUs, pstate details are readable, but for **PStatesNE** they are not
writable. For type **PStates**, pstate voltages/frequencies as well as pstate masking can be specified.
* The **CurvePts** type applies to modern (Vega20 and later) AMD GPUs that use AVFS instead of Pstates for
performance control. These have the highest degree of read/write capability. The SCLK and MCLK curve end points
can be controlled, which has the effect of over/under clocking/voltage. You are also able to modify the three
points that define the Vddc-SCLK curve. I have not attempted to OC the card yet, but I assume redefining the 3rd
point would be the best approach. For underclocking, lowering the SCLK end point is effective. I don't see a
curve defined for memory clock on the Radeon VII, so setting memory clock vs. voltage doesn't seem possible at
this time. There also appears to be an inconsistency in the defined voltage ranges for curve points and actual
default settings.
Below is a plot of what I extracted for the Frequency vs Voltage curves of the RX Vega64 and the Radeon VII.

## Using gpu-mon
By default, *gpu-mon* will display a text-based table in the current terminal window that updates
at the interval, in seconds, given by *--sleep N* (2 seconds by default). If you are using
water cooling, you can use the *--no_fans* option to remove fan monitoring functionality.
```
┌─────────────┬────────────────┬────────────────┐
│Card # │card1 │card2 │
├─────────────┼────────────────┼────────────────┤
│Model │GeForce GTX 750 │Radeon RX 5600 X│
│GPU Load % │100 │91 │
│Mem Load % │36 │68 │
│VRAM Usage % │89.297 │11.969 │
│GTT Usage % │None │0.432 │
│Power (W) │15.69 │92.0 │
│Power Cap (W)│38.50 │160.0 │
│Energy (kWh) │0.0 │0.002 │
│T (C) │48.0 │61.0 │
│VddGFX (mV) │nan │925 │
│Fan Spd (%) │40.0 │36 │
│Sclk (MHz) │1163 │1780 │
│Sclk Pstate │0 │2 │
│Mclk (MHz) │2505 │875 │
│Mclk Pstate │0 │3 │
│Perf Mode │[Not Supported] │5-COMPUTE │
└─────────────┴────────────────┴────────────────┘
```
The fields are the same as the GUI version of the display, available with the *--gui* option.

The first row gives the card number for each GPU. This number is the integer used by the driver for each GPU. Most
fields are self describing. The Power Cap field is especially useful in managing compute power efficiency, and
lowering the cap can result in more level loading and overall lower power usage for little compromise in performance.
The Energy field is a derived metric that accumulates GPU energy usage, in kWh, consumed since the monitor started.
Note that total card power usage may be more than reported GPU power usage. Energy is calculated as the product of
the latest power reading and the elapsed time since the last power reading.
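As a rough illustration of the Energy calculation described above, the following sketch accumulates energy from successive power readings. The class and method names are hypothetical and this is not the code used by *gpu-mon*. It only assumes what the text states: each increment is the latest power reading multiplied by the time elapsed since the previous reading, converted from watt-seconds to kWh.

```python
from typing import Optional


class EnergyAccumulator:
    """Accumulate energy (kWh) from successive power readings (W)."""

    def __init__(self) -> None:
        self.energy_kwh = 0.0
        self._last_time_s: Optional[float] = None

    def update(self, power_w: float, time_s: float) -> float:
        """Add latest power x elapsed time and return the running total."""
        if self._last_time_s is not None:
            elapsed_s = time_s - self._last_time_s
            # 1 kWh = 3,600,000 watt-seconds (joules).
            self.energy_kwh += power_w * elapsed_s / 3_600_000.0
        # The first reading only sets the time baseline.
        self._last_time_s = time_s
        return self.energy_kwh


acc = EnergyAccumulator()
acc.update(100.0, 0.0)             # first reading, nothing accumulated yet
total = acc.update(100.0, 3600.0)  # 100 W held for one hour
print(round(total, 3))             # 0.1
```

Because only the latest reading is used for each interval, a short sleep interval gives a closer estimate when power fluctuates.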
The P-states in the table for **CurvePts** type GPU are an indication of frequency vs. voltage curves.
Setting P-states to control the GPU is no longer relevant for this type, but these concepts are used in
reading current states.
The Perf Mode field gives the current power performance mode, which may be modified with *gpu-pac*. These
modes affect how frequency and voltage are managed versus loading. This is a very important parameter when
managing compute performance.
Executing *gpu-mon* with the *--plot* option will display a continuously updating plot of the critical
GPU parameters.

Having a *gpu-mon* Gtk window open at startup may be useful if you run GPU compute projects that autostart
and you need to quickly confirm that *gpu-pac* bash scripts ran as expected at startup
(see [Using gpu-pac](#using-gpu-pac)). You can have *gpu-mon --gui* automatically launch at startup
or upon reboot by using the startup utility for your distribution. In Ubuntu, for example, open the *Startup Applications
Preferences* app, then in the Preferences window select *Add* and use something like this in the command field:
```shell
/usr/bin/python3 /home/<user>/Desktop/rickslab-gpu-utils/gpu-mon --gui
```
where `/rickslab-gpu-utils` may be a soft link to your current distribution directory. This startup approach does not
work for the default Terminal text execution of *gpu-mon*.
## Using gpu-plot
In addition to being called from *gpu-mon* with the *--plot* option, *gpu-plot* may be run as a standalone
utility. Just execute *gpu-plot --sleep N* and the plot will update at the defined interval. It is not
recommended to run the monitor together with an independently executed plot, as this will result in twice as many reads
from the driver files. Once the plots are displayed, individual items on the plot can be toggled by selecting the
named button on the plot display.
The *--stdin* option is used by *gpu-mon --plot* in its execution of *gpu-plot*. This option along
with *--simlog* option can be used to simulate a plot output using a log file generated by *gpu-mon --log*.
I use this feature when troubleshooting problems from other users, but it may also be useful in benchmarking
performance. An example of the command line for this is as follows:
```shell
cat log_monitor_0421_081038.txt | gpu-plot --stdin --simlog
```
## Using gpu-pac
By default, *gpu-pac* will open a Gtk based GUI to allow the user to modify GPU performance parameters. I strongly
suggest that you completely understand the implications of changing any of the performance settings before you use
this utility. As per the terms of the GNU General Public License that covers this project, there is no warranty on
the usability of these tools. Any use of this tool is at your own risk.
To help you manage the risk in using this tool, two modes are provided to modify GPU parameters. By default, a bash
file is created that you can review and execute to implement the desired changes. Here is an example of that file:
```
#!/bin/sh
###########################################################################
## rickslab-gpu-pac generated script to modify GPU configuration/settings
###########################################################################
###########################################################################
## WARNING - Do not execute this script without completely
## understanding appropriate values to write to your specific GPUs
###########################################################################
#
# Copyright (C) 2019 RueiKe
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
###########################################################################
#
# Card1 Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev c1)
# /sys/class/drm/card1/device
#
set -x
# Power DPM Force Performance Level: [manual] change to [manual]
sudo sh -c "echo 'manual' > /sys/class/drm/card1/device/power_dpm_force_performance_level"
# Powercap Old: 150 New: 150 Min: 0 Max: 300
sudo sh -c "echo '150000000' > /sys/class/drm/card1/device/hwmon/hwmon2/power1_cap"
# Fan PWM Old: 0 New: 0 Min: 0 Max: 100
sudo sh -c "echo '1' > /sys/class/drm/card1/device/hwmon/hwmon2/pwm1_enable"
sudo sh -c "echo '0' > /sys/class/drm/card1/device/hwmon/hwmon2/pwm1"
# sclk curve end point: 0 : 808 MHz
sudo sh -c "echo 's 0 808' > /sys/class/drm/card1/device/pp_od_clk_voltage"
# sclk curve end point: 1 : 1650 MHz
sudo sh -c "echo 's 1 1650' > /sys/class/drm/card1/device/pp_od_clk_voltage"
# mclk curve end point: 1 : 1050 MHz
sudo sh -c "echo 'm 1 1050' > /sys/class/drm/card1/device/pp_od_clk_voltage"
# vddc curve point: 0 : 808 MHz, 724 mV
sudo sh -c "echo 'vc 0 808 724' > /sys/class/drm/card1/device/pp_od_clk_voltage"
# vddc curve point: 1 : 1304 MHz, 822 mV
sudo sh -c "echo 'vc 1 1304 822' > /sys/class/drm/card1/device/pp_od_clk_voltage"
# vddc curve point: 2 : 1801 MHz, 1124 mV
sudo sh -c "echo 'vc 2 1801 1124' > /sys/class/drm/card1/device/pp_od_clk_voltage"
# Selected: ID=5, name=COMPUTE
sudo sh -c "echo '5' > /sys/class/drm/card1/device/pp_power_profile_mode"
sudo sh -c "echo 'c' > /sys/class/drm/card1/device/pp_od_clk_voltage"
# Sclk P-State Mask Default: 0 1 2 3 4 5 6 7 8 New: 0 1 2 3 4 5 6 7 8
sudo sh -c "echo '0 1 2 3 4 5 6 7 8' > /sys/class/drm/card1/device/pp_dpm_sclk"
# Mclk P-State Mask Default: 0 1 2 New: 0 1 2
sudo sh -c "echo '0 1 2' > /sys/class/drm/card1/device/pp_dpm_mclk"
```
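
When reviewing a generated file like the one above, it can help to list only the values and the driver files they target. The following is a minimal sketch, not part of **rickslab-gpu-utils**: `list_pac_writes` is a hypothetical helper name, and it assumes every write in the file has the `sudo sh -c "echo '...' > ..."` form shown in the example.

```shell
#!/bin/sh
# Hypothetical review helper (not part of rickslab-gpu-utils).
# Prints "<driver file> <- <value>" for each write command in a PAC bash file,
# assuming the write format shown in the example file above.
list_pac_writes() {
    awk -F"'" '
        /^sudo sh -c "echo / {
            path = $3
            sub(/^ > /, "", path)   # drop the redirection operator
            sub(/"$/, "", path)     # drop the closing double quote
            print path " <- " $2
        }
    ' "$1"
}
```

Call it with the path of the PAC bash file you want to inspect. Comment lines and `set -x` are ignored, so the output shows only what would be written and where.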
When you execute *gpu-pac*, you will notice a message bar at the bottom of the interface that reports the mode you
are running in. The default operation mode is to create a bash file, but with the *--execute_pac* (or *--execute*)
command line option, the bash file will be automatically executed and then deleted. The message bar will indicate
this status. Because the driver files are writable only by root, the commands to write configuration settings are
executed with sudo. The message bar will display in red when credentials are pending. Once executed, a yellow
message will remind you to check the state of the GPU with *gpu-mon*. I suggest running *gpu-mon* while executing
*gpu-pac* to see and confirm the changes in real time.
The command line option *--force_write* causes all configuration parameters to be written to the bash file.
The default behavior since v2.4.0 is to write only changes. The *--force_write* option is useful for creating a bash
file that can be executed to set your cards to a known state. As an example, you could use such a file to configure
your GPUs on boot up (see [Running Startup PAC Bash Files](#running-startup-pac-bash-files)).
### The gpu-pac interface for Type PStates and Type CurvePts cards

In the interface, you will notice entry fields for indicating new values for specific parameters. In most cases, the
values in these fields will be the current values, but in the case of P-state masks, it will show the default value
instead of the current value. If you know how to obtain the current value, please let me know!
Note that when a PAC bash file is executed either manually or automatically, the resulting fan PWM (% speed) may
be slightly different from what you see in the Fan PWM entry field. The direction and magnitude of differences
between expected and realized fan speeds can depend on card model. You will need to experiment with different
settings to determine how it works with your card. I recommend running these experimental settings when the GPU
is not under load. If you know the cause of the differences between entered and final fan PWM values, let me know.
Changes made with *gpu-pac* do not persist through a system reboot. To reestablish desired GPU settings after a
reboot, either re-enter them using *gpu-pac* or *gpu-pac --execute*, or execute a previously saved bash file.
*gpu-pac* bash files must retain their originally assigned file name to run properly.
See [Running Startup PAC Bash Files](#running-startup-pac-bash-files) for how to run PAC bash
scripts automatically at system startup.
For Type **PStates** cards, changes to power caps and fan speeds can be made while the GPU is under load, but for
*gpu-pac* to work properly, other changes may require that the GPU not be under load, *i.e.*, that sclk
P-state and mclk P-state are 0. Possible consequences of making changes under load are that the GPU becomes
stuck in P-state 0 or that the entire system becomes slow to respond, in which case a reboot will be needed to
restore full GPU function. Note that when you change a P-state mask, default mask values will reappear in the field
after Save, but your specified changes will have been implemented on the card and show up in *gpu-mon*.
Some changes may not persist when a card has a connected display. When changing P-state MHz or mV, the desired
P-state mask, if different from default (no masking), will have to be re-entered for clock or voltage changes to
be applied. Again, save PAC changes to clocks, voltages, or masks only when the GPU is at resting state (state 0).
For Type **CurvePts** cards, although changes to P-state masks cannot be made through *gpu-pac*, changes to all
other fields can be made on-the-fly while the card is under load.
Some basic error checking is done before writing, but I suggest you be very certain of all entries before you save
changes to the GPU. You should always confirm your changes with *gpu-mon*.
## Updating the PCI ID decode file
In determining the GPU display name, **rickslab-gpu-utils** examines two sources. The output of
`lspci -k -s nn:nn.n` is used to generate a complete name, and an algorithm is used to generate a shortened
version. In addition, four driver files (vendor, device, subsystem_vendor, subsystem_device), which contain the
4 parts of the Device ID, are read and used to extract a GPU model name from the system pci.ids file. That file is
sourced from [https://pci-ids.ucw.cz/](https://pci-ids.ucw.cz/), where a comprehensive list is maintained. The
system file can be updated from the original source with the command:
```
sudo update-pciids
```
If your GPU is not listed in the extract, the pci.ids website has an interface that allows users to request an
addition to the master list.
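
The lookup described above can be illustrated with a short shell sketch. This is not the project's own code: `pci_lookup` is a hypothetical name, the default path `/usr/share/misc/pci.ids` is an assumption that varies by distribution, and only the vendor and device parts are matched (subsystem entries are ignored).

```shell
#!/bin/sh
# Hypothetical sketch: look up a device name in a pci.ids file.
# Usage: pci_lookup <vendor> <device> [pci.ids path]
# IDs may carry the 0x prefix used by the driver files.
pci_lookup() {
    vendor="${1#0x}"
    device="${2#0x}"
    ids_file="${3:-/usr/share/misc/pci.ids}"   # assumed default location
    awk -v v="$vendor" -v d="$device" '
        /^#/ || /^$/ { next }
        # Lines without a leading tab start a new vendor (or class) block.
        !/^\t/ { in_vendor = (substr($0, 1, 4) == v); next }
        # Device lines have exactly one leading tab followed by the device ID.
        in_vendor && /^\t[0-9a-f]/ && substr($0, 2, 4) == d {
            sub(/^\t[0-9a-f]+  /, "")
            print
            found = 1
            exit
        }
        END { exit !found }
    ' "$ids_file"
}
```

On a live system, the two IDs could be read from the driver files, for example `pci_lookup "$(cat /sys/class/drm/card1/device/vendor)" "$(cat /sys/class/drm/card1/device/device)"`. The function returns a non-zero status when no entry matches.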
## Optimizing Compute Performance-Power
The **rickslab-gpu-utils** tools can be used to optimize performance vs. power for compute workloads by leveraging
their ability to measure power and control relevant GPU settings. This flexibility allows one to run a
design of experiments (DOE) to measure the effect of GPU settings on performance for specific workloads. For
SETI@Home work, the Energy feature has also been built into [benchMT](https://github.com/Ricks-Lab/benchMT) to
benchmark power and execution times for various work units. This, combined with the log file produced with
*gpu-mon --gui --log*, may be useful in optimizing performance.

## Running Startup PAC Bash Files
If you set your system to run *gpu-pac* bash scripts automatically, as described in this section, note that
changes in your hardware or graphics drivers may cause potentially serious problems with GPU settings unless new
PAC bash files are generated following the changes. Review the [Using gpu-pac](#using-gpu-pac) section
before proceeding.
One approach is to execute PAC bash scripts as a systemd startup service. From *gpu-pac --force_write*, set your
optimal configurations for each GPU, then Save All. You may need to change the ownership of each card's bash
file to root: `sudo chown root pac_writer*.sh`
For each bash file, you could create a symlink (soft link) that corresponds to the card number referenced in each
linked bash file, using simple descriptive names, *e.g.*, pac_writer_card1, pac_writer_card2, *etc.* These links are
optional, but can make management of new or edited startup bash files easier. Links are used in the startup service
example below. Don't forget to re-create the link(s) each time a new PAC bash file is written for a card.
Next, create a .service file named something like gpu-pac-startup.service and give it the following content:
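
As a minimal sketch of the linking step, the helper below reads the card number from a PAC bash file and creates the matching link. The name `link_pac_file` is hypothetical and not part of **rickslab-gpu-utils**. It assumes the file contains a `/sys/class/drm/cardN/device` path, as in the example file shown earlier, and that `readlink -f` is available, which is an assumption about your system.

```shell
#!/bin/sh
# Hypothetical helper (not part of rickslab-gpu-utils).
# Reads the card number from the first /sys/class/drm/cardN/device path in a
# PAC bash file and creates (or replaces) a symlink named pac_writer_cardN.
# Usage: link_pac_file <PAC bash file> [directory for the link]
link_pac_file() {
    pac_file="$1"
    link_dir="${2:-.}"
    card=$(sed -n 's|.*/sys/class/drm/\(card[0-9][0-9]*\)/device.*|\1|p' "$pac_file" | head -n 1)
    [ -n "$card" ] || return 1
    # Link to the absolute path so the link resolves from any directory.
    ln -sfn "$(readlink -f "$pac_file")" "$link_dir/pac_writer_$card"
    echo "pac_writer_$card"
}
```

Because the link is replaced on each call, running the helper again after saving a new PAC bash file for the same card re-creates the link.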
```
[Unit]
Description=run at boot rickslab-gpu-utils PAC bash scripts
[Service]
Type=oneshot
ExecStart=/home/<user>/pac_writer_card0
ExecStart=/home/<user>/pac_writer_card1
ExecStart=/home/<user>/pac_writer_card2
[Install]
WantedBy=multi-user.target
```
The Type=oneshot service allows use of more than one ExecStart. In this example, three bash files are used for
two cards, where two alternative files are used for one card that the system may recognize as either card0 or
card1; see further below for an explanation.
Once your .service file is set up, execute the following commands:
```
sudo chown root:root gpu-pac-startup.service
sudo mv gpu-pac-startup.service /etc/systemd/system/
sudo chmod 664 /etc/systemd/system/gpu-pac-startup.service
sudo systemctl daemon-reload
sudo systemctl enable gpu-pac-startup.service
```
The last command should produce terminal output like this:
`Created symlink /etc/systemd/system/multi-user.target.wants/gpu-pac-startup.service → /etc/systemd/system/gpu-pac-startup.service.`
On the next reboot or restart, the GPU(s) will be set with the PAC run parameters. If you want to test the bash
script(s) before rebooting, run: `sudo systemctl start gpu-pac-startup.service`.
If you have a Type PStates card where some PAC parameters can't be changed when it is under load, you will want
to make sure that the PAC bash script executes before the card begins computing. If you have a *boinc-client* that
automatically runs on startup, for example, then consider delaying it, here by 30 seconds, using the cc_config.xml
option `<start_delay>30</start_delay>`.
One or more card numbers that are assigned by amdgpu drivers may change following a system or driver
update and restart. With subsequent updates or restarts, a card can switch back to its original number. When a
switch occurs, the bash file written for a previous card number will still be read at startup, but will have no
effect, causing the renumbered card to run at its default settings. To deal with this possibility, you can create
an alternative PAC bash file after a renumbering event and add these alternative files in your systemd service.
You will probably just need two alternative bash files for a card that is subject to reindexing. A card's
number is shown by *gpu-ls* and also appears in *gpu-mon* and *gpu-plot*. A card's PCI ID is also listed
by *gpu-ls*. If you know what causes GPU card index switching, let me know.
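
Another way to cope with renumbering is to look up the current card number from the PCI address, which stays fixed for a given slot. The following is only a sketch under stated assumptions: `find_card_by_pci` is a hypothetical helper, the address must be given in the full form used by the driver files (domain included, such as the placeholder `0000:03:00.0`), and `readlink -f` must be available.

```shell
#!/bin/sh
# Hypothetical helper (not part of rickslab-gpu-utils).
# Prints the cardN entry whose device link resolves to the given PCI address.
# Usage: find_card_by_pci <pci address> [drm directory]
# The drm directory is a parameter only so the function can be tried on a test tree.
find_card_by_pci() {
    pci_addr="$1"
    drm_root="${2:-/sys/class/drm}"
    for dev in "$drm_root"/card*/device; do
        [ -e "$dev" ] || continue
        card=$(basename "$(dirname "$dev")")
        # Skip connector entries such as card1-DP-1.
        case "$card" in *-*) continue ;; esac
        # The device link points at the PCI device directory, named by its address.
        if [ "$(basename "$(readlink -f "$dev")")" = "$pci_addr" ]; then
            echo "$card"
            return 0
        fi
    done
    return 1
}
```

A startup script could then use something like `CARD=$(find_card_by_pci 0000:03:00.0)` and build its paths from `/sys/class/drm/$CARD/device` instead of a hard-coded card number. This is untested against real renumbering events, so treat it as a starting point.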
You may find a card running at startup with default power limits and Fan PWM settings instead of what is prescribed
in its startup PAC bash file. If so, it may be that the card's hwmon# is different from what is hard coded in the
bash file, because the hwmon index for devices can also change upon reboot. To work around this, you can edit a
card's bash file to define hwmon# as a variable and modify the hwmon lines to use it. Here is an example for card1:
```
set -x
HWMON=$(ls /sys/class/drm/card1/device/hwmon/)
# Powercap Old: 120 New: 110 Min: 0 Max: 180
sudo sh -c "echo '110000000' > /sys/class/drm/card1/device/hwmon/$HWMON/power1_cap"
# Fan PWM Old: 44 New: 47 Min: 0 Max: 100
sudo sh -c "echo '1' > /sys/class/drm/card1/device/hwmon/$HWMON/pwm1_enable"
sudo sh -c "echo '119' > /sys/class/drm/card1/device/hwmon/$HWMON/pwm1"
```
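
When editing these lines by hand, keep in mind the units of the hwmon driver files: `power1_cap` takes microwatts and `pwm1` takes a value from 0 to 255. The sketch below shows the arithmetic with hypothetical helper names. The truncating integer division for PWM is an assumption; it is consistent with the 47% and 119 pair in the example above, but it is not a statement of how *gpu-pac* computes the value.

```shell
#!/bin/sh
# Illustrative unit conversions for hand-edited PAC bash files
# (hypothetical helper names, not part of rickslab-gpu-utils).

# Watts to the microwatt value written to power1_cap, e.g. 150 -> 150000000.
watts_to_microwatts() {
    echo $(( $1 * 1000000 ))
}

# Fan speed percent to the 0-255 value written to pwm1, e.g. 47 -> 119.
# Integer division truncates; this is an assumption about rounding.
percent_to_pwm() {
    echo $(( $1 * 255 / 100 ))
}
```

Checking a power cap value against the Min and Max shown in the comment line above it is a quick way to catch an extra or missing zero.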