1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355
|
Workload descriptor format
==========================
Lines starting with '#' are treated as comments and will not create a work step.
ctx.engine.duration_us.dependency.wait,...
<uint>.<str>.<uint>[-<uint>]|*.<int <= 0>[/<int <= 0>][...].<0|1>,...
B.<uint>
M.<uint>.<str>[|<str>]...
P|S|X.<uint>.<int>
d|p|s|t|q|a|T.<int>,...
b.<uint>.<str>[|<str>].<str>
w|W.<uint>.<str>[/<str>]...
f
For duration a range can be given from which a random value will be picked
before every submit. Since this and seqno management requires CPU access to
objects, care needs to be taken in order to ensure the submit queue is deep
enough these operations do not affect the execution speed unless that is
desired.
Additional workload steps are also supported:
'd' - Adds a delay (in microseconds).
'p' - Adds a delay relative to the start of previous loop so that the each loop
starts execution with a given period.
's' - Synchronises the pipeline to a batch relative to the step.
't' - Throttle every n batches.
'q' - Throttle to n max queue depth.
'f' - Create a sync fence.
'a' - Advance the previously created sync fence.
'B' - Turn on context load balancing.
'b' - Set up engine bonds.
'M' - Set up engine map.
'P' - Context priority.
'S' - Context SSEU configuration.
'T' - Terminate an infinite batch.
'w' - Working set. (See Working sets section.)
'W' - Shared working set.
'X' - Context preemption control.
Engine ids: DEFAULT, RCS, BCS, VCS, VCS1, VCS2, VECS
Example (leading spaces must not be present in the actual file):
----------------------------------------------------------------
1.VCS1.3000.0.1
1.RCS.500-1000.-1.0
1.RCS.3700.0.0
1.RCS.1000.-2.0
1.VCS2.2300.-2.0
1.RCS.4700.-1.0
1.VCS2.600.-1.1
p.16000
The above workload described in human language works like this:
1. A batch is sent to the VCS1 engine which will be executing for 3ms on the
GPU and userspace will wait until it is finished before proceeding.
2-4. Now three batches are sent to RCS with durations of 0.5-1.5ms (random
duration range), 3.7ms and 1ms respectively. The first batch has a data
dependency on the preceding VCS1 batch, and the last of the group depends
on the first from the group.
5. Now a 2.3ms batch is sent to VCS2, with a data dependency on the 3.7ms
RCS batch.
6. This is followed by a 4.7ms RCS batch with a data dependency on the 2.3ms
VCS2 batch.
7. Then a 0.6ms VCS2 batch is sent depending on the previous RCS one. In the
same step the tool is told to wait for the batch completes before
proceeding.
8. Finally the tool is told to wait long enough to ensure the next iteration
starts 16ms after the previous one has started.
When workload descriptors are provided on the command line, commas must be used
instead of new lines.
Multiple dependencies can be given separated by forward slashes.
Example:
1.VCS1.3000.0.1
1.RCS.3700.0.0
1.VCS2.2300.-1/-2.0
I this case the last step has a data dependency on both first and second steps.
Batch durations can also be specified as infinite by using the '*' in the
duration field. Such batches must be ended by the terminate command ('T')
otherwise they will cause a GPU hang to be reported.
Xe and i915 differences
------------------------
There are differences between Xe and i915, like not allowing a BO list to
be passed to an exec (and create implicit syncs). For more details see:
https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_exec.c
Currently following batch steps are equal on Xe:
1.1000.-2.0 <==> 1.1000.f-2.0
and will create explicit sync fence dependency (via syncobjects).
The data dependency need to wait for working sets implementation.
Sync (fd) fences
----------------
Sync fences are also supported as dependencies.
To use them put a "f<N>" token in the step dependecy list. N is this case the
same relative step offset to the dependee batch, but instead of the data
dependency an output fence will be emitted at the dependee step, and passed in
as a dependency in the current step.
Example:
1.VCS1.3000.0.0
1.RCS.500-1000.-1/f-1.0
In this case the second step will have both a data dependency and a sync fence
dependency on the previous step.
Example:
1.RCS.500-1000.0.0
1.VCS1.3000.f-1.0
1.VCS2.3000.f-2.0
VCS1 and VCS2 batches will have a sync fence dependency on the RCS batch.
Example:
1.RCS.500-1000.0.0
f
2.VCS1.3000.f-1.0
2.VCS2.3000.f-2.0
1.RCS.500-1000.0.1
a.-4
s.-4
s.-4
VCS1 and VCS2 batches have an input sync fence dependecy on the standalone fence
created at the second step. They are submitted ahead of time while still not
runnable. When the second RCS batch completes the standalone fence is signaled
which allows the two VCS batches to be executed. Finally we wait until the both
VCS batches have completed before starting the (optional) next iteration.
Submit fences (i915 only)
-------------
Submit fences are a type of input fence which are signalled when the originating
batch buffer is submitted to the GPU. (In contrary to normal sync fences, which
are signalled when completed.)
Submit fences have the identical syntax as the sync fences with the lower-case
's' being used to select them. Eg:
1.RCS.500-1000.0.0
1.VCS1.3000.s-1.0
1.VCS2.3000.s-2.0
Here VCS1 and VCS2 batches will only be submitted for executing once the RCS
batch enters the GPU.
Context priority (i915 only)
----------------
P.1.-1
1.RCS.1000.0.0
P.2.1
2.BCS.1000.-2.0
Context 1 is marked as low priority (-1) and then a batch buffer is submitted
against it. Context 2 is marked as high priority (1) and then a batch buffer
is submitted against it which depends on the batch from context 1.
Context priority command is executed at workload runtime and is valid until
overriden by another (optional) same context priority change. Actual driver
ioctls are executed only if the priority level has changed for the context.
Context preemption control
--------------------------
X.1.0
1.RCS.1000.0.0
X.1.500
1.RCS.1000.0.0
Context 1 is marked as non-preemptable batches and a batch is sent against 1.
The same context is then marked to have batches which can be preempted every
500us and another batch is submitted.
Same as with context priority, context preemption commands are valid until
optionally overriden by another preemption control change on the same context.
Engine maps
-----------
Engine maps are a per context feature which changes the way engine selection is
done in the driver.
Example:
M.1.VCS1|VCS2
This sets up context 1 with an engine map containing VCS1 and VCS2 engine.
Submission to this context can now only reference these two engines.
Engine maps can also be defined based on class like VCS.
Example:
M.1.VCS
This sets up the engine map to all available VCS class engines.
Context load balancing
----------------------
Context load balancing (aka Virtual Engine) is an i915 feature where the driver
will pick the best engine (most idle) to submit to given previously configured
engine map.
Example:
B.1
This enables load balancing for context number one.
Engine bonds (i915 only)
------------
Engine bonds are extensions on load balanced contexts. They allow expressing
rules of engine selection between two co-operating contexts tied with submit
fences. In other words, the rule expression is telling the driver: "If you pick
this engine for context one, then you have to pick that engine for context two".
Syntax is:
b.<context>.<engine_list>.<master_engine>
Engine list is a list of one or more sibling engines separated by a pipe
character (eg. "VCS1|VCS2").
There can be multiple bonds tied to the same context.
Example:
M.1.RCS|VECS
B.1
M.2.VCS1|VCS2
B.2
b.2.VCS1.RCS
b.2.VCS2.VECS
This tells the driver that if it picked RCS for context one, it has to pick VCS1
for context two. And if it picked VECS for context one, it has to pick VCS1 for
context two.
If we extend the above example with more workload directives:
1.DEFAULT.1000.0.0
2.DEFAULT.1000.s-1.0
We get to a fully functional example where two batch buffers are submitted in a
load balanced fashion, telling the driver they should run simultaneously and
that valid engine pairs are either RCS + VCS1 (for two contexts respectively),
or VECS + VCS2.
This can also be extended using sync fences to improve chances of the first
submission not getting on the hardware after the second one. Second block would
then look like:
f
1.DEFAULT.1000.f-1.0
2.DEFAULT.1000.s-1.0
a.-3
Context SSEU configuration (i915 only)
--------------------------
S.1.1
1.RCS.1000.0.0
S.2.-1
2.RCS.1000.0.0
Context 1 is configured to run with one enabled slice (slice mask 1) and a batch
is sumitted against it. Context 2 is configured to run with all slices (this is
the default so the command could also be omitted) and a batch submitted against
it.
This shows the dynamic SSEU reconfiguration cost beween two contexts competing
for the render engine.
Slice mask of -1 has a special meaning of "all slices". Otherwise any integer
can be specifying as the slice mask, but beware any apart from 1 and -1 can make
the workload not portable between different GPUs.
Working sets (i915 only)
------------
When used plainly workload steps can create implicit data dependencies by
relatively referencing another workload steps of a batch buffer type. Fourth
field contains the relative data dependncy. For example:
1.RCS.1000.0.0
1.BCS.1000.-1.0
This means the second batch buffer will be marked as having a read data
dependency on the first one. (The shared buffer is always marked as written to
by the dependency target buffer.) This will cause a serialization between the
two batch buffers.
Working sets are used where more complex data dependencies are required. Each
working set has an id, a list of buffers, and can either be local to the
workload or shared within the cloned workloads (-c command line option).
Lower-case 'w' command defines a local working set while upper-case 'W' defines
a shared version. Syntax is as follows:
w.<id>.<size>[/<size>]...
For size a byte size can be given, or suffix 'k', 'm' or 'g' can be used (case
insensitive). Prefix in the format of "<int>n<size>" can also be given to create
multiple objects of the same size.
Ranges can also be specified using the <min>-<max> syntax.
Examples:
w.1.4k - Working set 1 with a single 4KiB object in it.
W.2.2M/32768 - Working set 2 with one 2MiB and one 32768 byte object.
w.3.10n4k/2n20000 - Working set 3 with ten 4KiB and two 20000 byte objects.
w.4.4n4k-1m - Working set 4 with four objects of random size between 4KiB and
1MiB.
Working set objects can be referenced as data dependency targets using the new
'r'/'w' syntax. Simple example:
w.1.4k
W.2.1m
1.RCS.1000.r1-0/w2-0.0
1.BCS.1000.r2-0.0
In this example the RCS batch is reading from working set 1 object 0 and writing
to working set 2 object 0. BCS batch is reading from working set 2 object 0.
Because working set 2 is of a shared type, should two instances of the same
workload be executed (-c 2) then the 1MiB buffer would be shared and written
and read by both clients creating a serialization point.
Apart from single objects, ranges can also be given as depenencies:
w.1.10n4k
1.RCS.1000.r1-0-9.0
Here the RCS batch has a read dependency on working set 1 objects 0 to 9.
|