# Ricks-Lab GPU Utilities - User Guide

A set of utilities for monitoring GPU performance and modifying control settings.

## Current rickslab-gpu-utils Version: 3.7.x

 - [Installation](#installation)
 - [Getting Started](#getting-started)
 - [Using gpu-ls](#using-gpu-ls)
 - [GPU Type Dependent Behavior](#gpu-type-dependent-behavior)
 - [Using gpu-mon](#using-gpu-mon)
 - [Using gpu-plot](#using-gpu-plot)
 - [Using gpu-pac](#using-gpu-pac)
 - [Updating the PCI ID decode file](#updating-the-pci-id-decode-file)
 - [Optimizing Compute Performance-Power](#optimizing-compute-performance-power)
 - [Running Startup PAC Bash Files](#running-startup-pac-bash-files)


## Installation

Four installation methods are available, summarized here:
* [Repository](#repository-installation) - This approach is recommended for those interested in contributing to the project or helping to troubleshoot an issue in real time with the developer. This type of installation can coexist with any of the other installation types.
* [PyPI](#pypi-installation) - Meant for users wanting to run the very latest version.  All **PATCH** level versions are released here first.  This install method is also meant for users not on a Debian distribution.
* [Rickslab.com Debian](#rickslabcom-debian-installation) - Lags the PyPI release in order to assure robustness. May not include every **PATCH** version.
* [Official Debian](#official-debian-package-installation) - Only **MAJOR/MINOR** releases.  This works for releases of Ubuntu 22.04 or Bullseye 11.3 or later.

### Repository Installation

Developers and contributors to the project are expected to duplicate the development environment
using a virtual environment. So far, my development activities for this project have used python3.6.
The following are details on setting up a virtual environment with python3.6:

```commandline
sudo apt install -y python3.6-venv
sudo apt install -y python3.6-dev
```

Clone the repository from GitHub with the following command:

```commandline
git clone https://github.com/Ricks-Lab/gpu-utils.git
cd gpu-utils
```

Initialize your *rickslab-gpu-utils-env* if this is your first time using it. From the project root directory, execute:

```commandline
python3.6 -m venv rickslab-gpu-utils-env
source rickslab-gpu-utils-env/bin/activate
pip install --upgrade pip
pip install --no-cache-dir -r requirements-venv.txt
```

If you get errors installing `vext`, you may need to use the `--use-pep517` option:
```commandline
pip install --no-cache-dir --use-pep517 -r requirements-venv.txt
```

On newer systems, I have found that I get a `ModuleNotFoundError: No module named 'numpy'`, even though `numpy` was
successfully installed in the newly created virtual environment.  To resolve this, I deactivated the venv and installed
`numpy` for the system instance of python.  Once back in the venv, the issue was resolved.  I have not determined why this happens.

You can then run the desired commands from the project root directory by specifying the path, for example: `./gpu-ls`

### PyPI Installation

First, remove any previous Debian package and any ricks-amdgpu-utils PyPI installations:

```commandline
sudo apt purge rickslab-gpu-utils
sudo apt purge ricks-amdgpu-utils
sudo apt autoremove
pip3 uninstall ricks-amdgpu-utils
```

Install the latest package from [PyPI](https://pypi.org/project/rickslab-gpu-utils/) with the following
commands:

```commandline
pip3 install rickslab-gpu-utils
```

Or, use the pip upgrade option if you have already installed a previous version:

```commandline
pip3 install rickslab-gpu-utils -U
```

You may need to open a new terminal window in order for the path to the utilities to be set.

### Rickslab.com Debian Installation

First, remove any previous PyPI installation and exit that terminal.  If you
also have a Debian installed version, the pip uninstall will likely fail,
unless you remove the Debian package first.  You can skip this step if you
are certain no other install types are still installed:

```commandline
sudo apt purge rickslab-gpu-utils
sudo apt purge ricks-amdgpu-utils
sudo apt autoremove
pip uninstall rickslab-gpu-utils
exit
```

If you had previously (before 3.7.6) installed from rickslab.com, you should
delete the key from the apt keyring:

```commandline
sudo apt-key del C98B8839
```

Next, add the *rickslab-gpu-utils* repository:

```shell
wget -q -O - https://debian.rickslab.com/PUBLIC.KEY | sudo gpg --dearmour -o /usr/share/keyrings/rickslab-agent.gpg

echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/rickslab-agent.gpg] https://debian.rickslab.com/gpu-utils/ eddore main' | sudo tee /etc/apt/sources.list.d/rickslab-gpu-utils.list

sudo apt update
```

Then install the package with apt:

```commandline
sudo apt install rickslab-gpu-utils
```

If you decide to no longer use this type of install, you can remove
rickslab-gpu-utils from the system repository list by executing the following:

```shell
echo '' | sudo tee /etc/apt/sources.list.d/rickslab-gpu-utils.list
```

### Official Debian Package Installation

First you should verify the availability of the package by distribution with the following command:
```commandline
rmadison rickslab-gpu-utils
```

Current package availability is as follows:
```text
 rickslab-gpu-utils | 3.6.0-2 | jammy/universe   | source, all
 rickslab-gpu-utils | 3.6.0-3 | kinetic/universe | source, all
 rickslab-gpu-utils | 3.8.0-1 | lunar/universe   | source, all
 rickslab-gpu-utils | 3.8.0-1 | mantic/universe  | source, all
```

Then remove any previous PyPI installation and exit that terminal.  If you also
have a Debian installed version, the pip uninstall will likely fail, unless you
remove the Debian package first. You can skip this step if you are certain no
other install types have been installed:

```commandline
sudo apt purge ricks-amdgpu-utils
sudo apt purge rickslab-gpu-utils
sudo apt autoremove
pip uninstall rickslab-gpu-utils
exit
```


If you had previously added https://debian.rickslab.com/gpu-utils/ as a repository
source, then you will need to remove it in order to download from the official
Debian repository.  This can be accomplished with the following shell command:

```shell
echo '' | sudo tee /etc/apt/sources.list.d/rickslab-gpu-utils.list
```

Then update the package list and install the package with apt:

```commandline
sudo apt update
sudo apt install rickslab-gpu-utils
```


## Getting Started

This set of utilities is written and tested with Python 3.6.  If you are using an older
version, you will likely see syntax errors.  If you encounter problems, execute:

```commandline
gpu-chk
```

This should display a message indicating any Python or kernel incompatibilities. To get the
maximum capability from these utilities, you should be running a kernel that supports the
GPUs you have installed.  If using AMD GPUs, installing the latest **amdgpu** driver or
**ROCm** release may provide additional capabilities. If you have Nvidia GPUs installed,
**nvidia-smi** must be installed for the utilities to be able to read the cards.  Writing to
GPUs is currently only possible for AMD GPUs, and only with compatible cards.  Modifying AMD
GPU properties requires that the AMD ppfeaturemask be set to 0xfffd7fff. This can be accomplished
by adding `amdgpu.ppfeaturemask=0xfffd7fff` to the `GRUB_CMDLINE_LINUX_DEFAULT` value in
`/etc/default/grub` and executing `sudo update-grub`.

I found a more specific way of determining the ppfeaturemask value that sets only the required
bits.  I have not yet tested it on enough systems to know it is robust:

```shell
printf 'amdgpu.ppfeaturemask=0x%x\n' "$(($(cat /sys/module/amdgpu/parameters/ppfeaturemask) | 0x4000))"
```
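
The same arithmetic can be sketched in Python. This is only an illustration and not part of the utilities: the function name `overdrive_mask_arg` is hypothetical, and the sketch assumes the sysfs file holds the mask as either a decimal or a `0x`-prefixed hexadecimal string.

```python
def overdrive_mask_arg(current_mask_text: str) -> str:
    """Return a kernel argument with bit 0x4000 added to the current mask.

    Hypothetical helper mirroring the shell one-liner above; the input is the
    text read from /sys/module/amdgpu/parameters/ppfeaturemask.
    """
    # int(..., 0) accepts both decimal and 0x-prefixed hexadecimal text.
    current = int(current_mask_text.strip(), 0)
    new_mask = current | 0x4000
    return f"amdgpu.ppfeaturemask=0x{new_mask:x}"


if __name__ == "__main__":
    try:
        with open("/sys/module/amdgpu/parameters/ppfeaturemask") as mask_file:
            print(overdrive_mask_arg(mask_file.read()))
    except (OSError, ValueError):
        print("amdgpu ppfeaturemask could not be read on this system")
```

Because the operation is a bitwise OR, applying it to a mask that already has the bit set leaves the value unchanged.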

To set the mask, open the grub defaults file in an editor:

```commandline
cd /etc/default
sudo vi grub
```

Modify to include the featuremask as follows:

```shell
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.ppfeaturemask=0xfffd7fff"
```

After saving, update grub:

```commandline
sudo update-grub
```

and then reboot.

If you have Nvidia GPUs installed, you will need to have nvidia-smi installed.

## Using gpu-ls

After getting your system set up to support **rickslab-gpu-utils**, it is best to verify functionality by
listing your GPU details with the *gpu-ls* command.  The utility will use the system `lspci` command
to identify all installed GPUs.  The utility will also verify system setup/configuration for read, write,
and compute capability.  Additional performance/configuration details are read from the GPU for compatible
GPUs.  An example of the output is as follows:

```
OS command [nvidia-smi] executable found: [/usr/bin/nvidia-smi]
Detected GPUs: INTEL: 1, NVIDIA: 1, AMD: 1
AMD: amdgpu version: 20.10-1048554
AMD: Wattman features enabled: 0xfffd7fff
3 total GPUs, 1 rw, 1 r-only, 0 w-only

Card Number: 0
   Vendor: INTEL
   Readable: False
   Writable: False
   Compute: False
   Device ID: {'device': '0x3e91', 'subsystem_device': '0x8694', 'subsystem_vendor': '0x1043', 'vendor': '0x8086'}
   Decoded Device ID: 8th Gen Core Processor Gaussian Mixture Model
   Card Model: Intel Corporation 8th Gen Core Processor Gaussian Mixture Model
   PCIe ID: 00:02.0
   Driver: i915
   GPU Type: Unsupported
   HWmon: None
   Card Path: /sys/class/drm/card0/device
   System Card Path: /sys/devices/pci0000:00/0000:00:02.0

Card Number: 1
   Vendor: NVIDIA
   Readable: True
   Writable: False
   Compute: True
   GPU UID: GPU-fcbaadc4-4040-c2e5-d5b6-52d1547bcc64
   GPU S/N: [Not Supported]
   Device ID: {'device': '0x1381', 'subsystem_device': '0x1073', 'subsystem_vendor': '0x10de', 'vendor': '0x10de'}
   Decoded Device ID: GM107 [GeForce GTX 750]
   Card Model: GeForce GTX 750
   Display Card Model: GeForce GTX 750
   Card Index: 0
   PCIe ID: 01:00.0
      Link Speed: GEN3
      Link Width: 8
   ##################################################
   Driver: 390.138
   vBIOS Version: 82.07.32.00.32
   Compute Platform: OpenCL 1.2 CUDA
   Compute Mode: Default
   GPU Type: Supported
   HWmon: None
   Card Path: /sys/class/drm/card1/device
   System Card Path: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0
   ##################################################
   Current Power (W): 15.910
   Power Cap (W): 38.50
      Power Cap Range (W): [30.0, 38.5]
   Fan Target Speed (rpm): None
   Current Fan PWM (%): 40.000
   ##################################################
   Current GPU Loading (%): 100
   Current Memory Loading (%): 36
   Current VRAM Usage (%): 91.437
      Current VRAM Used (GB): 0.876
      Total VRAM (GB): 0.958
   Current  Temps (C): {'temperature.gpu': 40.0, 'temperature.memory': None}
   Current Clk Frequencies (MHz): {'clocks.gr': 1163.0, 'clocks.mem': 2505.0, 'clocks.sm': 1163.0, 'clocks.video': 1046.0}
   Maximum Clk Frequencies (MHz): {'clocks.max.gr': 1293.0, 'clocks.max.mem': 2505.0, 'clocks.max.sm': 1293.0}
   Current SCLK P-State: [0, '']
   Power Profile Mode: [Not Supported]

Card Number: 2
   Vendor: AMD
   Readable: True
   Writable: True
   Compute: True
   GPU UID: None
   Device ID: {'device': '0x731f', 'subsystem_device': '0xe411', 'subsystem_vendor': '0x1da2', 'vendor': '0x1002'}
   Decoded Device ID: Radeon RX 5600 XT
   Card Model: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] (rev ca)
   Display Card Model: Radeon RX 5600 XT
   PCIe ID: 04:00.0
      Link Speed: 16 GT/s
      Link Width: 16
   ##################################################
   Driver: amdgpu
   vBIOS Version: 113-5E4111U-X4G
   Compute Platform: OpenCL 2.0 AMD-APP (3075.10)
   GPU Type: CurvePts
   HWmon: /sys/class/drm/card2/device/hwmon/hwmon3
   Card Path: /sys/class/drm/card2/device
   System Card Path: /sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/0000:03:00.0/0000:04:00.0
   ##################################################
   Current Power (W): 99.000
   Power Cap (W): 160.000
      Power Cap Range (W): [0, 192]
   Fan Enable: 0
   Fan PWM Mode: [2, 'Dynamic']
   Fan Target Speed (rpm): 1170
   Current Fan Speed (rpm): 1170
   Current Fan PWM (%): 35
      Fan Speed Range (rpm): [0, 3200]
      Fan PWM Range (%): [0, 100]
   ##################################################
   Current GPU Loading (%): 50
   Current Memory Loading (%): 49
   Current GTT Memory Usage (%): 0.432
      Current GTT Memory Used (GB): 0.026
      Total GTT Memory (GB): 5.984
   Current VRAM Usage (%): 11.969
      Current VRAM Used (GB): 0.716
      Total VRAM (GB): 5.984
   Current  Temps (C): {'edge': 54.0, 'junction': 61.0, 'mem': 68.0}
   Critical Temps (C): {'edge': 118.0, 'junction': 99.0, 'mem': 99.0}
   Current Voltages (V): {'vddgfx': 937}
   Current Clk Frequencies (MHz): {'mclk': 875.0, 'sclk': 1780.0}
   Current SCLK P-State: [2, '1780Mhz']
      SCLK Range: ['800Mhz', '1820Mhz']
   Current MCLK P-State: [3, '875Mhz']
      MCLK Range: ['625Mhz', '930Mhz']
   Power Profile Mode: 5-COMPUTE
   Power DPM Force Performance Level: manual
```

If everything is working fine, you should see no warnings or errors.  The listing utility
also has other command line options:

```
usage: gpu-ls [-h] [--about] [--long | --short | --table | --raw]
              [--pstates | --ppm | --features | --clinfo] [--verbose]
              [--force_all] [--no_markup] [--no_fan] [--debug]

optional arguments:
       -h, --help   show this help message and exit
       --about      README
       --long       Long listing of GPU details. Includes pstate, ppm, features,
                    and clinfo.
       --short      Short listing of basic GPU details
       --table      Current status of readable GPUs
       --raw        Show all raw GPU sensor data
       --pstates    Output pstate tables instead of GPU details
       --ppm        Output power/performance mode tables instead of GPU details
       --features   Output amdgpu Feature table instead of GPU details
       --clinfo     Include openCL with card details
       --verbose    Display informational message of GPU util progress
       --force_all  Force attempt to read all sensors
       --no_markup  Do not format ls output
       --no_fan     Do not include fan setting options
       -d, --debug  Debug logger output
```

The *--clinfo* option will make a call to clinfo, if it is installed, and list OpenCL parameters
along with the basic parameters.  The benefit of running this in *gpu-ls* is that the tool
uses the PCIe slot ID to associate clinfo results with the appropriate GPU in the listing.

If you have the clinfo package installed, then the command *gpu-ls --clinfo* should provide something
like this at the end of each card's listing (example shown for an AMD GPU):
```
Card Number: 1
   Vendor:  AMD 
   PP Features: 0x0000000019f0e3cf
   Readable: True
   Writable: True
   Compute: True
   Device ID: {'device': '0x66af', 'subsystem_device': '0x1000', 'subsystem_vendor': '0x1458', 'vendor': '0x1002'}
   Decoded Device ID: Vega 20 [Radeon VII]
   PCIe ID: 43:00.0
   GPU Type: Modern
   HWmon: /sys/class/drm/card1/device/hwmon/hwmon1
   Card Path: /sys/class/drm/card1/device
   System Card Path: /sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.0
   ### CLINFO Table Data ############################
   Device OpenCL C Version: OpenCL C 2.0
   Device Name: gfx906
   Device Version: OpenCL 2.0 AMD-APP (3143.9)
   Driver Version: 3143.9 (PAL,HSAIL)
   Max Compute Units: 60
   SIMD per CU: 4
   SIMD Width: 16
   SIMD Instruction Width: 1
   CL Max Memory Allocation: 14588628172
   Max Work Item Dimensions: 3
   Max Work Item Sizes: 1024 1024 1024
   Max Work Group Size: 1024
   Preferred Work Group Size: 256
   Preferred Work Group Multiple: 64

```
If not, then to see the *clinfo* data you may need to add yourself to the 'video' and 'render' groups by using
these commands:
```
sudo usermod -a -G video $LOGNAME
sudo usermod -a -G render $LOGNAME
```
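
Before changing group membership, it can help to check which of the two groups are actually missing. The following is a minimal sketch, not part of the utilities; the function names `missing_groups` and `current_group_names` are hypothetical, and it uses only the Python standard library on a Unix-like system.

```python
import grp
import os


def missing_groups(user_groups, required=("video", "render")):
    """Return the required group names that are absent from user_groups."""
    present = set(user_groups)
    return [name for name in required if name not in present]


def current_group_names():
    """Return the names of the groups the current process belongs to."""
    names = []
    for gid in os.getgroups():
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # A group ID with no entry in the group database has no name.
            continue
    return names


if __name__ == "__main__":
    missing = missing_groups(current_group_names())
    print("Missing groups:", ", ".join(missing) if missing else "none")
```

Group changes made with `usermod` generally apply only to new login sessions, so the check may still report a group as missing until you log out and back in.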

The *--pstates* option will display the GPU P-State definition table and all other available P-State details.

```
Card Number: 1
   Vendor:  AMD 
   PP Features: 0x0000000019f0e3cf
   Readable: True
   Writable: True
   Compute: True
   Device ID: {'device': '0x66af', 'subsystem_device': '0x1000', 'subsystem_vendor': '0x1458', 'vendor': '0x1002'}
   Decoded Device ID: Vega 20 [Radeon VII]
   PCIe ID: 43:00.0
   GPU Type: CurvePts
   HWmon: /sys/class/drm/card1/device/hwmon/hwmon1
   Card Path: /sys/class/drm/card1/device
   System Card Path: /sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.0
   ### P-State Table Data ###########################
   ##################################################
   DPM States:
   SCLK:                   MCLK:
    0:  701Mhz              0:  351Mhz  
    1:  809Mhz              1:  801Mhz  
    2:  1135Mhz             2:  1001Mhz 
    3:  1373Mhz             
    4:  1547Mhz             
    5:  1684Mhz             
    6:  1750Mhz             
    7:  1774Mhz             
    8:  1802Mhz             
   ##################################################
   PP OD States:
   SCLK:                   MCLK:
    0:  808Mhz    -         
    1:  1801Mhz   -         1:  1000Mhz   -       
   ##################################################
   VDDC_CURVE:
    0: ['808Mhz', '722mV']
    1: ['1304Mhz', '820mV']
    2: ['1801Mhz', '1122mV']
   ##################################################
   All Pstates:
   mclk:
      0: *351Mhz, 1: 801Mhz, 2: 1001Mhz
   dcefclk:
      0: *358Mhz, 1: 454Mhz, 2: 567Mhz, 3: 680Mhz, 4: 756Mhz, 5: 850Mhz, 6: 972Mhz, 7: 1134Mhz
   socclk:
      0: 310Mhz, 1: 524Mhz, 2: 567Mhz, 3: 619Mhz, 4: 680Mhz, 5: 756Mhz, 6: 850Mhz, 7: *972Mhz
   fclk:
      0: 551Mhz, 1: 611Mhz, 2: 691Mhz, 3: 761Mhz, 4: 871Mhz, 5: 961Mhz, 6: 1081Mhz, 7: *1226Mhz
   sclk:
      0: 701Mhz, 1: *809Mhz, 2: 1135Mhz, 3: 1373Mhz, 4: 1547Mhz, 5: 1684Mhz, 6: 1750Mhz, 7: 1774Mhz, 8: 1802Mhz
```

Different generations of cards will provide different information with the *--ppm* option. Here is an
example from the AMD Vega 20 (Radeon VII) card identified in the listing:

```
Card Number: 1
   Vendor:  AMD 
   PP Features: 0x0000000019f0e3cf
   Readable: True
   Writable: True
   Compute: True
   Device ID: {'device': '0x66af', 'subsystem_device': '0x1000', 'subsystem_vendor': '0x1458', 'vendor': '0x1002'}
   Decoded Device ID: Vega 20 [Radeon VII]
   PCIe ID: 43:00.0
   GPU Type: Modern
   HWmon: /sys/class/drm/card1/device/hwmon/hwmon1
   Card Path: /sys/class/drm/card1/device
   System Card Path: /sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0/0000:42:00.0/0000:43:00.0
   ### PPM Table Data ###############################
   PROFILE_INDEX(NAME) CLOCK_TYPE(NAME) FPS UseRlcBusy MinActiveFreqType MinActiveFreq BoosterFreqType BoosterFreq PD_Data_limit_c PD_Data_error_coeff PD_Data_error_rate_coeff
    0 BOOTUP_DEFAULT*:
                       0(       GFXCLK)       0       0       1       0       4     800 4587520  -65536       0
                       1(       SOCCLK)       0       0       1       0       4     800  327680   -6553       0
                       2(         UCLK)       0       0       1       0       4     800  327680  -65536       0
                       3(         FCLK)       0       0       0       0       4     800  327680   -6553       0
    1 3D_FULL_SCREEN :
                       0(       GFXCLK)       0       1       1       0       4     800 4587520  -65536       0
                       1(       SOCCLK)       0       1       4     850       4     800  327680  -65536       0
                       2(         UCLK)       0       1       4     850       4     800  327680  -65536       0
                       3(         FCLK)       0       1       4     850       4     800  327680  -65536       0
    2   POWER_SAVING :
                       0(       GFXCLK)       0       0       1       0       3       0 5898240  -65536       0
                       1(       SOCCLK)       0       0       1       0       3       0 1310720   -6553       0
                       2(         UCLK)       0       0       1       0       3       0 1966080  -65536       0
                       3(         FCLK)       0       0       0       0       3     800 1966080   -6553       0
    3          VIDEO :
                       0(       GFXCLK)       0       1       1       0       4     500 4587520   -6553       0
                       1(       SOCCLK)       0       0       1       0       4     500 1310720   -6553       0
                       2(         UCLK)       0       0       1       0       4     500 1966080  -65536       0
                       3(         FCLK)       0       0       3       0       4     500 1966080   -6553       0
    4             VR :
                       0(       GFXCLK)       0       1       0    1540       4     800 5898240   -6553   65536
                       1(       SOCCLK)       0       1       2       0       4     800  327680  -32768  -65536
                       2(         UCLK)       0       1       2       0       4     800  327680  -32768  -65536
                       3(         FCLK)       0       1       2       0       4     800  327680  -32768  -65536
    5        COMPUTE :
                       0(       GFXCLK)       0       1       0    1600       3       0 3932160  -65536  -65536
                       1(       SOCCLK)       0       0       4     850       3       0  327680  -65536  -32768
                       2(         UCLK)       0       0       4     850       3       0  327680  -65536  -32768
                       3(         FCLK)       0       0       4     850       3       0  327680  -65536  -32768
    6         CUSTOM :
                       0(       GFXCLK)       0       0       1       0       4     800 4587520  -65536       0
                       1(       SOCCLK)       0       0       1       0       4     800  327680   -6553       0
                       2(         UCLK)       0       0       1       0       4     800  327680  -65536       0
                       3(         FCLK)       0       0       0       0       4     800  327680   -6553       0
```

## GPU Type Dependent Behavior

GPU capability and compatibility vary across vendors and hardware generations.  To manage
this variability, **rickslab-gpu-utils** must classify each installed GPU by its vendor
and type.  So far, valid types are as follows:

* **Undefined** - This is the default assigned type, before a valid type can be determined.
* **Unsupported** - This is the type assigned for cards which have no capability of reading beyond basic parameters typical of PCIe devices.
* **Supported** - This is the type assigned for basic readability, including *nvidia-smi* readable GPUs.
* **Legacy** - Applies to legacy AMD GPUs with very basic parameters available to read. (pre-HD7)
* **LegacyAPU** - Applies to older AMD integrated graphics with very few parameters available. (Ontario)
* **APU** - Applies to AMD integrated graphics with limited parameters available. (Carrizo - Renoir)
* **PStatesNE** - Applies to AMD GPUs with most parameters available, but Pstates not writeable. (HD7 series)
* **PStates** - Applies to modern AMD GPUs with writeable Pstates. (R9 series through RX-Vega)
* **CurvePts** - Applies to latest generation AMD GPUs that use AVFS curves instead of Pstates. (Vega20 and newer)

With the *gpu-ls* tool, you can determine the type of your installed GPUs. Here are examples of
relevant lines from the output for different types of GPUs:

```
Decoded Device ID: 8th Gen Core Processor Gaussian Mixture Model [Intel CPU with integrated graphics]
GPU Type: Unsupported

Decoded Device ID: GM107 [GeForce GTX 750]
GPU Type: Supported

Decoded Device ID: R9 290X DirectCU II
GPU Type: PStatesNE

Decoded Device ID: RX Vega64
GPU Type: PStates

Decoded Device ID: Radeon VII
GPU Type: CurvePts

Decoded Device ID: Radeon RX 5600 XT
GPU Type: CurvePts
```
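
If you want to pick out these lines programmatically, the listing can be scanned for the `Card Number:` and `GPU Type:` labels shown in the examples. The following is a minimal sketch, not part of the utilities; the function name `gpu_types` is hypothetical, and it assumes the listing text has already been captured as a string in the format shown in this guide.

```python
import re


def gpu_types(ls_output: str) -> dict:
    """Map each card number to its GPU Type from gpu-ls style listing text."""
    types = {}
    card = None
    for line in ls_output.splitlines():
        card_match = re.match(r"\s*Card Number:\s*(\d+)", line)
        if card_match:
            # Remember which card the following detail lines belong to.
            card = int(card_match.group(1))
            continue
        type_match = re.match(r"\s*GPU Type:\s*(\S+)", line)
        if type_match and card is not None:
            types[card] = type_match.group(1)
    return types


sample = """Card Number: 0
   Vendor: INTEL
   GPU Type: Unsupported

Card Number: 2
   Vendor: AMD
   GPU Type: CurvePts
"""
print(gpu_types(sample))
```

Keying the result by card number keeps the type associated with the same card numbering that the listing itself uses.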

Monitor and Control utilities will differ between these types:

* For **Undefined** and **Unsupported** types, only generic PCIe parameters are available.  These types are
considered unreadable, unwritable, and as having no compute capability.
* **Supported** types have the most basic level of readability.  This includes Nvidia cards with nvidia-smi support.
* For **Legacy** and **APU** types, only basic and limited parameters, respectively, are readable.
* For **PStates** and **PStatesNE** type GPUs, pstate details are readable, but for **PStatesNE** they are not
writable. For type **PStates**, pstate voltages/frequencies as well as pstate masking can be specified.
* The **CurvePts** type applies to modern (Vega20 and later) AMD GPUs that use AVFS instead of Pstates for
performance control.  These have the highest degree of read/write capability. The SCLK and MCLK curve end points
can be controlled, which has the effect of over/under clocking/voltage.  You are also able to modify the three
points that define the Vddc-SCLK curve. I have not attempted to OC the card yet, but I assume redefining the 3rd
point would be the best approach.  For underclocking, lowering the SCLK end point is effective.  I don't see a
curve defined for memory clock on the Radeon VII, so setting memory clock vs. voltage doesn't seem possible at
this time.  There also appears to be an inconsistency between the defined voltage ranges for curve points and the
actual default settings.

Below is a plot of what I extracted for the Frequency vs Voltage curves of the RX Vega64 and the Radeon VII.

![](Type1vsType2.png)

## Using gpu-mon

By default, *gpu-mon* will display a text based table in the current terminal window that updates
at the interval, in seconds, set by *--sleep N* (2 seconds by default). If you are using
water cooling, you can use the *--no_fans* option to remove fan monitoring functionality.

```
┌─────────────┬────────────────┬────────────────┐
│Card #       │card1           │card2           │
├─────────────┼────────────────┼────────────────┤
│Model        │GeForce GTX 750 │Radeon RX 5600 X│
│GPU Load %   │100             │91              │
│Mem Load %   │36              │68              │
│VRAM Usage % │89.297          │11.969          │
│GTT Usage %  │None            │0.432           │
│Power (W)    │15.69           │92.0            │
│Power Cap (W)│38.50           │160.0           │
│Energy (kWh) │0.0             │0.002           │
│T (C)        │48.0            │61.0            │
│VddGFX (mV)  │nan             │925             │
│Fan Spd (%)  │40.0            │36              │
│Sclk (MHz)   │1163            │1780            │
│Sclk Pstate  │0               │2               │
│Mclk (MHz)   │2505            │875             │
│Mclk Pstate  │0               │3               │
│Perf Mode    │[Not Supported] │5-COMPUTE       │
└─────────────┴────────────────┴────────────────┘
```

The fields are the same as the GUI version of the display, available with the *--gui* option.

![](gpu-monitor-gui_scrshot.png)

The first row gives the card number for each GPU.  This number is the integer used by the driver for each GPU.  Most
fields are self-explanatory.  The Power Cap field is especially useful in managing compute power efficiency, and
lowering the cap can result in more level loading and overall lower power usage for little compromise in performance.
The Energy field is a derived metric that accumulates the GPU energy, in kWh, consumed since the monitor started.
Note that total card power usage may be more than reported GPU power usage.  Energy is calculated as the product of
the latest power reading and the elapsed time since the previous power reading.
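
The energy calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used by *gpu-mon*: the class name `EnergyAccumulator` and its method are hypothetical, power is assumed to be in watts, and timestamps are assumed to be in seconds.

```python
class EnergyAccumulator:
    """Accumulate energy in kWh from periodic power readings in watts."""

    JOULES_PER_KWH = 3.6e6  # 1 kWh = 1000 W * 3600 s

    def __init__(self):
        self.energy_kwh = 0.0
        self._last_time_s = None

    def update(self, power_w: float, time_s: float) -> float:
        """Add the latest power reading times the elapsed time since the previous one."""
        if self._last_time_s is not None:
            elapsed_s = time_s - self._last_time_s
            self.energy_kwh += power_w * elapsed_s / self.JOULES_PER_KWH
        # The first reading only establishes the starting time.
        self._last_time_s = time_s
        return self.energy_kwh


acc = EnergyAccumulator()
acc.update(100.0, 0.0)
total_kwh = acc.update(100.0, 3600.0)  # 100 W held for one hour
print(round(total_kwh, 3))
```

Since each interval is weighted by the latest reading only, a shorter sleep interval should track rapidly changing power more closely.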

The P-states in the table for **CurvePts** type GPUs are an indication of frequency vs. voltage curves.
Setting P-states to control the GPU is no longer relevant for this type, but these concepts are used in
reading current states.

The Perf Mode field gives the current power performance mode, which may be modified with *gpu-pac*.  These
modes affect how frequency and voltage are managed versus loading.  This is a very important parameter when
managing compute performance.

Executing *gpu-mon* with the *--plot* option will display a continuously updating plot of the critical
GPU parameters.
![](gpu-plot_scrshot.png)

Having a *gpu-mon* Gtk window open at startup may be useful if you run GPU compute projects that autostart
and you need to quickly confirm that *gpu-pac* bash scripts ran as expected at startup
(see [Using gpu-pac](#using-gpu-pac)). You can have *gpu-mon --gui* automatically launch at startup
or upon reboot by using the startup utility for your distribution. In Ubuntu, for example, open the *Startup Applications
Preferences* app, then in the Preferences window select *Add* and use something like this in the command field:

```shell
/usr/bin/python3 /home/<user>/Desktop/rickslab-gpu-utils/gpu-mon --gui
```

where `/rickslab-gpu-utils` may be a soft link to your current distribution directory. This startup approach does not
work for the default Terminal text execution of *gpu-mon*. 

## Using gpu-plot

In addition to being called from *gpu-mon* with the *--plot* option, *gpu-plot* may be run as a standalone
utility.  Just execute *gpu-plot --sleep N* and the plot will update at the defined interval.  It is not
recommended to run the monitor together with an independently executed plot, as it will result in twice as many reads
from the driver files.  Once the plots are displayed, individual items on the plot can be toggled by selecting the
named button on the plot display.

The *--stdin* option is used by *gpu-mon --plot* in its execution of *gpu-plot*.  This option, along
with the *--simlog* option, can be used to simulate a plot output using a log file generated by *gpu-mon --log*.
I use this feature when troubleshooting problems from other users, but it may also be useful in benchmarking
performance.  An example of the command line for this is as follows:

```shell
cat log_monitor_0421_081038.txt | gpu-plot --stdin --simlog
```

## Using gpu-pac

By default, *gpu-pac* will open a Gtk based GUI to allow the user to modify GPU performance parameters.  I strongly
suggest that you completely understand the implications of changing any of the performance settings before you use
this utility.  As per the terms of the GNU General Public License that covers this project, there is no warranty on
the usability of these tools.  Any use of this tool is at your own risk.

To help you manage the risk in using this tool, two modes are provided to modify GPU parameters.  By default, a bash
file is created that you can review and execute to implement the desired changes.  Here is an example of that file:

```
#!/bin/sh
###########################################################################
## rickslab-gpu-pac generated script to modify GPU configuration/settings
###########################################################################

###########################################################################
## WARNING - Do not execute this script without completely
## understanding appropriate values to write to your specific GPUs
###########################################################################
#
#    Copyright (C) 2019  RueiKe
#
#    This program is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program.  If not, see <https://www.gnu.org/licenses/>.
###########################################################################
# 
# Card1  Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 (rev c1)
# /sys/class/drm/card1/device
# 
set -x
# Power DPM Force Performance Level: [manual] change to [manual]
sudo sh -c "echo 'manual' >  /sys/class/drm/card1/device/power_dpm_force_performance_level"
# Powercap Old: 150 New: 150 Min: 0 Max: 300
sudo sh -c "echo '150000000' >  /sys/class/drm/card1/device/hwmon/hwmon2/power1_cap"
# Fan PWM Old: 0 New: 0 Min: 0 Max: 100
sudo sh -c "echo '1' >  /sys/class/drm/card1/device/hwmon/hwmon2/pwm1_enable"
sudo sh -c "echo '0' >  /sys/class/drm/card1/device/hwmon/hwmon2/pwm1"
# sclk curve end point: 0 : 808 MHz
sudo sh -c "echo 's 0 808' >  /sys/class/drm/card1/device/pp_od_clk_voltage"
# sclk curve end point: 1 : 1650 MHz
sudo sh -c "echo 's 1 1650' >  /sys/class/drm/card1/device/pp_od_clk_voltage"
# mclk curve end point: 1 : 1050 MHz
sudo sh -c "echo 'm 1 1050' >  /sys/class/drm/card1/device/pp_od_clk_voltage"
# vddc curve point: 0 : 808 MHz, 724 mV
sudo sh -c "echo 'vc 0 808 724' >  /sys/class/drm/card1/device/pp_od_clk_voltage"
# vddc curve point: 1 : 1304 MHz, 822 mV
sudo sh -c "echo 'vc 1 1304 822' >  /sys/class/drm/card1/device/pp_od_clk_voltage"
# vddc curve point: 2 : 1801 MHz, 1124 mV
sudo sh -c "echo 'vc 2 1801 1124' >  /sys/class/drm/card1/device/pp_od_clk_voltage"
# Selected: ID=5, name=COMPUTE
sudo sh -c "echo '5' >  /sys/class/drm/card1/device/pp_power_profile_mode"
sudo sh -c "echo 'c' >  /sys/class/drm/card1/device/pp_od_clk_voltage"
# Sclk P-State Mask Default: 0 1 2 3 4 5 6 7 8 New: 0 1 2 3 4 5 6 7 8
sudo sh -c "echo '0 1 2 3 4 5 6 7 8' >  /sys/class/drm/card1/device/pp_dpm_sclk"
# Mclk P-State Mask Default: 0 1 2 New: 0 1 2
sudo sh -c "echo '0 1 2' >  /sys/class/drm/card1/device/pp_dpm_mclk"
```
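
The values in the example script are written in driver units rather than in the units shown in the interface.  The
Python sketch below shows the conversions as I read them from the examples in this guide: the power cap is written
in microwatts (150 W becomes `150000000`) and fan speed is written as a PWM value.  The 0-255 PWM range and the
truncating conversion are assumptions that happen to match the examples here; the function names are hypothetical
and this is not the code *gpu-pac* uses.

```python
def watts_to_microwatts(watts: int) -> str:
    """Power cap in W -> string for power1_cap (microwatts)."""
    return str(int(watts) * 1_000_000)


def percent_to_pwm(percent: float, pwm_max: int = 255) -> str:
    """Fan speed in % -> string for pwm1.  Assumes a 0..pwm_max range and truncation."""
    return str(int(percent * pwm_max / 100))


def sysfs_write_command(value: str, path: str) -> str:
    """Format one line in the style of the generated script."""
    return f"sudo sh -c \"echo '{value}' >  {path}\""


hwmon = "/sys/class/drm/card1/device/hwmon/hwmon2"   # path taken from the example above
print(sysfs_write_command(watts_to_microwatts(150), f"{hwmon}/power1_cap"))
print(sysfs_write_command(percent_to_pwm(0), f"{hwmon}/pwm1"))
```

The sketch only prints the command lines; nothing is written to the driver files.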

When you execute *gpu-pac*, you will notice a message bar at the bottom of the interface that informs
you of the mode you are running in.  The default mode creates a bash file, but with the
*--execute_pac* (or *--execute*) command line option, the bash file will be automatically executed and then deleted.
The message bar will indicate this status.  Because the driver files are writable only by root, the commands to
write configuration settings are executed with sudo.  The message bar will display in red when credentials are
pending.  Once executed, a yellow message will remind you to check the state of the GPU with *gpu-mon*.  I
suggest running the monitor while executing pac to see and confirm the changes in real time.

The command line option *--force_write* causes all configuration parameters to be written to the bash file.
The default behavior since v2.4.0 is to write only changes.  The *--force_write* option is useful for creating a bash file
that can be executed to set your cards to a known state. As an example, you could use such a file to configure your
GPUs on boot up (see [Running Startup PAC Bash Files](#running-startup-pac-bash-files)).

### The gpu-pac interface for Type PStates and Type CurvePts cards

![](gpu-pac_scrshot.png)

In the interface, you will notice entry fields for indicating new values for specific parameters.  In most cases, the
values in these fields will be the current values, but in the case of P-state masks, it will show the default value
instead of the current value.  If you know how to obtain the current value, please let me know!

Note that when a PAC bash file is executed either manually or automatically, the resulting fan PWM (% speed) may
be slightly different from what you see in the Fan PWM entry field.  The direction and magnitude of differences
between expected and realized fan speeds can depend on card model.  You will need to experiment with different
settings to determine how it works with your card.  I recommend running these experimental settings when the GPU
is not under load.  If you know the cause of the differences between entered and final fan PWM values, let me know. 

Changes made with *gpu-pac* do not persist through a system reboot. To reestablish desired GPU settings after a
reboot, either re-enter them using *gpu-pac* or *gpu-pac --execute*, or execute a previously saved bash file.
*gpu-pac* bash files must retain their originally assigned file name to run properly.
See [Running Startup PAC Bash Files](#running-startup-pac-bash-files) for how to run PAC bash
scripts automatically at system startup.

For Type **PStates** cards, changes to power caps and fan speeds can be made while the GPU is under load, but for
*gpu-pac* to work properly, other changes may require that the GPU not be under load, *i.e.*, that sclk
P-state and mclk P-state are 0. Possible consequences of making changes under load are that the GPU becomes
stuck in a 0 P-state or that the entire system becomes slow to respond, in which case a reboot will be needed to restore
full GPU function. Note that when you change a P-state mask, default mask values will reappear in the field
after Save, but your specified changes will have been implemented on the card and will show up in *gpu-mon*.
Some changes may not persist when a card has a connected display. When changing P-state MHz or mV, the desired
P-state mask, if different from default (no masking), will have to be re-entered for clock or voltage changes to
be applied. Again, save PAC changes to clocks, voltages, or masks only when the GPU is at its resting state (state 0).
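
If you want to check the resting state from a script rather than by eye in *gpu-mon*, the following Python sketch
may help.  It assumes the amdgpu convention that the `pp_dpm_sclk` and `pp_dpm_mclk` driver files list one P-state
per line and mark the active one with a trailing `*`.  The function names and the sample text are hypothetical.

```python
from typing import Optional


def active_pstate(dpm_text: str) -> Optional[int]:
    """Return the index of the line marked with '*', or None if no line is marked."""
    for line in dpm_text.splitlines():
        line = line.strip()
        if line.endswith("*"):
            return int(line.split(":", 1)[0])
    return None


def gpu_at_rest(sclk_text: str, mclk_text: str) -> bool:
    """True only when both sclk and mclk report P-state 0 as active."""
    return active_pstate(sclk_text) == 0 and active_pstate(mclk_text) == 0


# Illustrative contents of pp_dpm_sclk and pp_dpm_mclk (not from a real card)
SCLK_SAMPLE = "0: 300Mhz\n1: 608Mhz\n2: 910Mhz *\n"
MCLK_SAMPLE = "0: 300Mhz *\n1: 1000Mhz\n"

print(active_pstate(SCLK_SAMPLE), gpu_at_rest(SCLK_SAMPLE, MCLK_SAMPLE))
```

On a real system the text would come from reading, for example, `/sys/class/drm/card1/device/pp_dpm_sclk`.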

For Type **CurvePts** cards, although changes to P-state masks cannot be made through *gpu-pac*, changes to all
other fields can be made on-the-fly while the card is under load.

Some basic error checking is done before writing, but I suggest you be very certain of all entries before you save
changes to the GPU.  You should always confirm your changes with *gpu-mon*.

## Updating the PCI ID decode file 

In determining the GPU display name, **rickslab-gpu-utils** will examine two sources.  The output of
`lspci -k -s nn:nn.n` is used to generate a complete name, and an algorithm is used to generate a shortened
version.  In addition, the four parts of the Device ID are read from a set of driver files (vendor, device,
subsystem_vendor, subsystem_device) and used to extract a GPU model name from the system pci.ids file, which is
sourced from [https://pci-ids.ucw.cz/](https://pci-ids.ucw.cz/), where a comprehensive list is maintained.  The
system file can be updated from the original source with the command:

```
sudo update-pciids
```

If your GPU is not listed in the extract, the pci.ids website has an interface to allow the user to request an
addition to the master list.
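
To illustrate how a four-part Device ID can be matched against a pci.ids file, here is a minimal Python sketch.  It
relies on the pci.ids layout (vendor lines at column 0, device lines indented by one tab, subsystem lines indented
by two tabs) and is not the lookup code used by **rickslab-gpu-utils**.  The function names are hypothetical and
the sample entries are illustrative only.

```python
from typing import Optional


def normalize_id(raw: str) -> str:
    """Turn driver file content such as '0x1002\\n' into '1002'."""
    text = raw.strip().lower()
    return text[2:] if text.startswith("0x") else text


def lookup_model(pci_ids_text: str, vendor: str, device: str,
                 sub_vendor: str, sub_device: str) -> Optional[str]:
    """Return the subsystem name if listed, else the device name, else None."""
    vendor, device = normalize_id(vendor), normalize_id(device)
    subsystem = f"{normalize_id(sub_vendor)} {normalize_id(sub_device)}"
    in_vendor = False
    device_name = None
    for line in pci_ids_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank lines and comments
        if line.startswith("C "):
            break                         # device classes follow the vendor list
        if line.startswith("\t\t"):       # subsystem: "subvendor subdevice  name"
            if device_name is not None:
                ids, _, name = line.strip().partition("  ")
                if ids == subsystem:
                    return name.strip()
            continue
        if device_name is not None:
            break                         # left the matching device's block
        if line.startswith("\t"):         # device: "device  name"
            if in_vendor:
                dev_id, _, name = line.strip().partition("  ")
                if dev_id == device:
                    device_name = name.strip()
        else:                             # vendor: "vendor  name"
            in_vendor = line.split(None, 1)[0].lower() == vendor
    return device_name


# Illustrative excerpt in the pci.ids layout
SAMPLE = (
    "# illustrative excerpt\n"
    "1002  Advanced Micro Devices, Inc. [AMD/ATI]\n"
    "\t66af  Vega 20 [Radeon VII]\n"
    "\t\t1002 081e  Radeon VII\n"
    "\t67df  Ellesmere [Radeon RX 470/480]\n"
    "10de  NVIDIA Corporation\n"
)
print(lookup_model(SAMPLE, "0x1002", "0x66af", "0x1002", "0x081e"))
```

On many distributions the system file is found at `/usr/share/misc/pci.ids`, but the location can vary, so treat
that path as an assumption.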

## Optimizing Compute Performance-Power

The **rickslab-gpu-utils** tools can be used to optimize performance vs. power for compute workloads by leveraging
their ability to measure power and control relevant GPU settings.  This flexibility allows one to execute a
design of experiments (DOE) to measure the effect of GPU settings on performance when executing specific workloads.
For SETI@Home performance, the Energy feature has also been built into [benchMT](https://github.com/Ricks-Lab/benchMT) to
benchmark power and execution times for various work units.  This, combined with the log file produced with
*gpu-mon --gui --log*, may be useful in optimizing performance.

![](https://i.imgur.com/YPuDez2l.png)

## Running Startup PAC Bash Files

If you set your system to run *gpu-pac* bash scripts automatically, as described in this section, note that
changes in your hardware or graphics drivers may cause potentially serious problems with GPU settings unless new
PAC bash files are generated following the changes. Review the [Using gpu-pac](#using-gpu-pac) section
before proceeding.

One approach is to execute PAC bash scripts as a systemd startup service. From *gpu-pac --force_write*, set your
optimal configurations for each GPU, then Save All. You may need to change ownership to root of each card's bash
file: `sudo chown root pac_writer*.sh`

For each bash file, you could create a symlink (soft link) that corresponds to the card number referenced in each
linked bash file, using simple descriptive names, *e.g.*, pac_writer_card1, pac_writer_card2, *etc.* These links are
optional, but can make management of new or edited startup bash files easier. Links are used in the startup service
example, below. Don't forget to re-create the link(s) each time a new PAC bash file is written for a card.
 
Next, create a .service file named something like gpu-pac-startup.service, and give it the following content:

```
[Unit]
Description=run at boot rickslab-gpu-utils PAC bash scripts

[Service]
Type=oneshot

ExecStart=/home/<user>/pac_writer_card0
ExecStart=/home/<user>/pac_writer_card1
ExecStart=/home/<user>/pac_writer_card2

[Install]
WantedBy=multi-user.target
```

The Type=oneshot service allows use of more than one ExecStart.  In this example, three bash files are used for
two cards, where two alternative files are used for one card that the system may recognize as either card0 or
card1; see further below for an explanation. 

Once your .service file is set up, execute the following commands:

```
sudo chown root:root gpu-pac-startup.service 
sudo mv gpu-pac-startup.service /etc/systemd/system/
sudo chmod 664 /etc/systemd/system/gpu-pac-startup.service
sudo systemctl daemon-reload
sudo systemctl enable gpu-pac-startup.service
```

The last command should produce a terminal stdout like this:
`Created symlink /etc/systemd/system/multi-user.target.wants/gpu-pac-startup.service → /etc/systemd/system/gpu-pac-startup.service.`

On the next reboot or restart, the GPU(s) will be set with the PAC run parameters. If you want to test the bash
script(s) before rebooting, run: `sudo systemctl start gpu-pac-startup.service`.

If you have a Type PStates card where some PAC parameters can't be changed when it is under load, you will want
to make sure that the PAC bash script executes before the card begins computing. If you have a *boinc-client* that
automatically runs on startup, for example, then consider delaying it for 30 seconds using the cc_config.xml
option `<start_delay>30</start_delay>`.

One or more card numbers that are assigned by amdgpu drivers may change following a system or driver
update and restart. With subsequent updates or restarts, a card can switch back to its original number. When a
switch occurs, the bash file written for a previous card number will still be read at startup, but will have no
effect, causing the renumbered card to run at its default settings. To deal with this possibility, you can create
an alternative PAC bash file after a renumbering event and add these alternative files in your systemd service.
You will probably just need two alternative bash files for a card that is subject to reindexing. A card's
number is shown by *gpu-ls* and also appears in *gpu-mon* and *gpu-plot*. A card's PCI ID is listed
by *gpu-ls*. If you know what causes GPU card index switching, let me know.
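
As a possible aid in spotting a renumbering event, the Python sketch below finds the current card number for a
given PCI bus address.  It assumes that `/sys/class/drm/cardN/device` is a symlink to the card's PCI device
directory, whose name is the bus address.  The function name and the address in the usage note are hypothetical,
and this is not part of **rickslab-gpu-utils**.

```python
import os
from pathlib import Path
from typing import Optional


def find_card(pci_address: str, drm_root: str = "/sys/class/drm") -> Optional[str]:
    """Return the cardN name whose device link resolves to pci_address, or None."""
    for entry in sorted(Path(drm_root).glob("card[0-9]*")):
        if "-" in entry.name:
            continue                      # skip connector entries such as card0-DP-1
        resolved = os.path.realpath(entry / "device")
        if os.path.basename(resolved) == pci_address:
            return entry.name
    return None
```

For example, `find_card("0000:03:00.0")` would return something like `card1` if a card is present at that address,
and `None` otherwise.  Comparing the result with the card number hard coded in a PAC bash file shows whether the
file still targets the intended card.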

You may find a card running at startup with default power limits and Fan PWM settings instead of what is prescribed
in its startup PAC bash file. If so, it may be that the card's hwmon# is different from what is hard coded in the
bash file, because the hwmon index for devices can also change upon reboot. To work around this, you can edit a
card's bash file to define hwmon# as a variable and modify the hwmon lines to use it. Here is an example for card1:

```
set -x
HWMON=$(ls /sys/class/drm/card1/device/hwmon/)
# Powercap Old: 120 New: 110 Min: 0 Max: 180
sudo sh -c "echo '110000000' >  /sys/class/drm/card1/device/hwmon/$HWMON/power1_cap"
# Fan PWM Old:  44 New:  47 Min:  0 Max:  100
sudo sh -c "echo '1' >  /sys/class/drm/card1/device/hwmon/$HWMON/pwm1_enable"
sudo sh -c "echo '119' >  /sys/class/drm/card1/device/hwmon/$HWMON/pwm1"
```
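
The same lookup can be sketched in Python with an added guard, for cases where you would rather fail loudly than
write to an unexpected path.  This assumes that a card's `hwmon` directory holds exactly one `hwmonN` entry; the
function name is hypothetical.

```python
from pathlib import Path


def hwmon_dir(card: str, drm_root: str = "/sys/class/drm") -> Path:
    """Return the single hwmonN directory for a card, or raise if it is ambiguous."""
    candidates = sorted(Path(drm_root, card, "device", "hwmon").glob("hwmon[0-9]*"))
    if len(candidates) != 1:
        raise RuntimeError(f"expected one hwmon entry for {card}, found {len(candidates)}")
    return candidates[0]
```

On a system with the card present, `hwmon_dir("card1") / "power1_cap"` gives the full path of the power cap file
regardless of the current hwmon index.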