/* SPDX-License-Identifier: MIT */
/*
* Copyright © 2022 Intel Corporation
*/
#ifndef _XE_VM_DOC_H_
#define _XE_VM_DOC_H_
/**
* DOC: XE VM (user address space)
*
* VM creation
* ===========
*
* Allocate a physical page for the root of the page table structure, create a
* default bind engine, and return a handle to the user.
*
* Scratch page
* ------------
*
* If the VM is created with the flag DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE, the
* entire page table structure defaults to pointing at a blank page allocated by
* the VM. Invalid memory accesses then read from / write to this page rather
* than faulting.
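*
* A minimal userspace sketch of VM creation with a scratch page, assuming the
* drm_xe uAPI in include/uapi/drm/xe_drm.h (fd is an already-open device file
* descriptor; error handling is elided and field names may differ between uAPI
* revisions):
*
* .. code-block::
*
*	#include <sys/ioctl.h>
*	#include <drm/xe_drm.h>
*
*	struct drm_xe_vm_create create = {
*		.flags = DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE,
*	};
*	__u32 vm_id = 0;
*
*	if (!ioctl(fd, DRM_IOCTL_XE_VM_CREATE, &create))
*		vm_id = create.vm_id;	// VM handle returned to the user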
*
* VM bind (create GPU mapping for a BO or userptr)
* ================================================
*
* Creates GPU mappings for a BO or userptr within a VM. VM binds use the same
* in / out fence interface (struct drm_xe_sync) as execs, which allows users to
* think of binds and execs as more or less the same operation.
*
* Operations
* ----------
*
* DRM_XE_VM_BIND_OP_MAP - Create mapping for a BO
* DRM_XE_VM_BIND_OP_UNMAP - Destroy mapping for a BO / userptr
* DRM_XE_VM_BIND_OP_MAP_USERPTR - Create mapping for userptr
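*
* As an illustration, a single DRM_XE_VM_BIND_OP_MAP of a BO with one in fence
* and one out fence might look as below. This is a hedged sketch against the
* drm_xe uAPI; vm_id, bo_handle, in_syncobj, and out_syncobj are placeholders
* and field names may differ between uAPI revisions.
*
* .. code-block::
*
*	struct drm_xe_sync syncs[2] = {
*		// in fence: wait on this syncobj before the bind executes
*		{ .type = DRM_XE_SYNC_TYPE_SYNCOBJ, .handle = in_syncobj },
*		// out fence: signaled once the bind completes
*		{ .type = DRM_XE_SYNC_TYPE_SYNCOBJ, .handle = out_syncobj,
*		  .flags = DRM_XE_SYNC_FLAG_SIGNAL },
*	};
*	struct drm_xe_vm_bind bind = {
*		.vm_id = vm_id,
*		.num_binds = 1,
*		.bind = {
*			.obj = bo_handle,
*			.obj_offset = 0,
*			.addr = 0x100000,
*			.range = 0x10000,
*			.op = DRM_XE_VM_BIND_OP_MAP,
*		},
*		.num_syncs = 2,
*		.syncs = (uintptr_t)syncs,
*	};
*
*	ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);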
*
* Implementation details
* ~~~~~~~~~~~~~~~~~~~~~~
*
* All bind operations are implemented via a hybrid approach of using the CPU
* and GPU to modify page tables. If a new physical page is allocated in the
* page table structure, we populate that page via the CPU and insert that new
* page into the existing page table structure via a GPU job. Any existing pages
* in the page table structure that need to be modified are also updated via the
* GPU job. As the root physical page is preallocated on VM creation, the GPU job
* will always have at least one update. The in / out fences are passed to this
* job, so again this is conceptually the same as an exec.
*
* A very simple example of a few binds on an empty VM with 48 bits of address
* space and the resulting operations:
*
* .. code-block::
*
* bind BO0 0x0-0x1000
* alloc page level 3a, program PTE[0] to BO0 phys address (CPU)
* alloc page level 2, program PDE[0] page level 3a phys address (CPU)
* alloc page level 1, program PDE[0] page level 2 phys address (CPU)
* update root PDE[0] to page level 1 phys address (GPU)
*
* bind BO1 0x201000-0x202000
* alloc page level 3b, program PTE[1] to BO1 phys address (CPU)
* update page level 2 PDE[1] to page level 3b phys address (GPU)
*
* bind BO2 0x1ff000-0x201000
* update page level 3a PTE[511] to BO2 phys address (GPU)
* update page level 3b PTE[0] to BO2 phys address + 0x1000 (GPU)
*
* GPU bypass
* ~~~~~~~~~~
*
* In the above example the steps using the GPU can be done on the CPU instead if
* the bind can be done immediately (all in-fences satisfied, VM dma-resv kernel
* slot is idle).
*
* Address space
* -------------
*
* Depending on platform either 48 or 57 bits of address space is supported.
*
* Page sizes
* ----------
*
* The minimum page size is either 4k or 64k depending on platform and memory
* placement (sysmem vs. VRAM). We enforce that binds must be aligned to the
* minimum page size.
*
* Larger pages (2M or 1GB) can be used for BOs in VRAM if both the BO's physical
* address and the VA are aligned to the larger page size. Larger pages for
* userptrs / BOs in sysmem should be possible but are not yet implemented.
*
* Sync error handling mode
* ------------------------
*
* In both modes, the user input is validated during the bind IOCTL. In sync
* error handling mode the newly bound BO is validated (potentially moved back
* to a region of memory where it can be used), page tables are updated by the
* CPU, and the job to do the GPU binds is created in the IOCTL itself. This step
* can fail due to memory pressure. The user can recover by freeing memory and
* trying this operation again.
*
* Async error handling mode
* -------------------------
*
* In async error handling the step of validating the BO, updating page tables,
* and generating a job is deferred to an async worker. As this step can now
* fail after the IOCTL has reported success, we need an error handling flow from
* which the user can recover.
*
* The solution is for the user to register a user address with the VM to which
* the VM reports errors. The ufence wait interface can be used to wait on a VM
* going into an error state. Once an error is reported, the VM's async worker is
* paused. While the VM's async worker is paused, synchronous
* DRM_XE_VM_BIND_OP_UNMAP operations are allowed (these can free memory). Once the
* user believes the error state is fixed, the async worker can be resumed via the
* XE_VM_BIND_OP_RESTART operation. When VM async bind work is restarted, the
* first operation processed is the operation that caused the original error.
*
* Bind queues / engines
* ---------------------
*
* Think of the case where we have two bind operations, A and B, which are
* submitted in that order. A has in fences while B has none. If using a single
* bind queue, B is now blocked on A's in fences even though it is ready to run.
* This example is a real use case for VK sparse binding. We work around this
* limitation by implementing bind engines.
*
* In the bind IOCTL the user can optionally pass in an engine ID which must map
* to an engine which is of the special class DRM_XE_ENGINE_CLASS_VM_BIND.
* Underneath this is really a virtual engine that can run on any of the copy
* hardware engines. The job(s) created by each IOCTL are inserted into this
* engine's ring. In the example above, if A and B have different bind engines, B
* is free to pass A. If the engine ID field is omitted, the default bind queue
* for the VM is used.
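*
* A hedged sketch of creating such a bind engine, using the exec queue naming of
* the current drm_xe uAPI (which differs from the engine naming used in this
* document); bind_queue_id and vm_id are placeholders and field names may differ
* between uAPI revisions:
*
* .. code-block::
*
*	struct drm_xe_engine_class_instance instance = {
*		.engine_class = DRM_XE_ENGINE_CLASS_VM_BIND,
*	};
*	struct drm_xe_exec_queue_create create = {
*		.vm_id = vm_id,
*		.width = 1,
*		.num_placements = 1,
*		.instances = (uintptr_t)&instance,
*	};
*
*	if (!ioctl(fd, DRM_IOCTL_XE_EXEC_QUEUE_CREATE, &create))
*		bind_queue_id = create.exec_queue_id;
*	// bind_queue_id is then passed as the engine / exec queue ID of bind IOCTLs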
*
* TODO: Explain race in issue 41 and how we solve it
*
* Array of bind operations
* ------------------------
*
* The uAPI allows multiple bind operations to be passed in via a user array,
* of struct drm_xe_vm_bind_op, in a single VM bind IOCTL. This interface
* matches the VK sparse binding API. The implementation is rather simple, parse
* the array into a list of operations, pass the in fences to the first operation,
* and pass the out fences to the last operation. The ordered nature of a bind
* engine makes this possible.
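*
* A hedged sketch of passing two map operations in a single IOCTL, assuming the
* drm_xe uAPI's num_binds / vector_of_binds fields (bo0, bo1, syncs, and
* num_syncs are placeholders; field names may differ between uAPI revisions):
*
* .. code-block::
*
*	struct drm_xe_vm_bind_op ops[2] = {
*		{ .obj = bo0, .addr = 0x000000, .range = 0x10000,
*		  .op = DRM_XE_VM_BIND_OP_MAP },
*		{ .obj = bo1, .addr = 0x100000, .range = 0x10000,
*		  .op = DRM_XE_VM_BIND_OP_MAP },
*	};
*	struct drm_xe_vm_bind bind = {
*		.vm_id = vm_id,
*		.num_binds = 2,
*		.vector_of_binds = (uintptr_t)ops,	// used when num_binds > 1
*		// in fences gate the first operation, out fences signal after the last
*		.num_syncs = num_syncs,
*		.syncs = (uintptr_t)syncs,
*	};
*
*	ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);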
*
* Munmap semantics for unbinds
* ----------------------------
*
* Munmap allows things like:
*
* .. code-block::
*
* 0x0000-0x2000 and 0x3000-0x5000 have mappings
* Munmap 0x1000-0x4000, results in mappings 0x0000-0x1000 and 0x4000-0x5000
*
* To support this semantic, we decompose the above example into 4 operations:
*
* .. code-block::
*
* unbind 0x0000-0x2000
* unbind 0x3000-0x5000
* rebind 0x0000-0x1000
* rebind 0x4000-0x5000
*
* Why not just do a partial unbind of 0x1000-0x2000 and 0x3000-0x4000? This
* falls apart when using large pages at the edges and the unbind forces us to
* use a smaller page size. For simplicity we always issue a set of unbinds
* unmapping anything in the range and at most 2 rebinds on the edges.
*
* Similar to an array of binds, in fences are passed to the first operation and
* out fences are signaled on the last operation.
*
* In this example there is a window of time where 0x0000-0x1000 and
* 0x4000-0x5000 are invalid but the user didn't ask for these addresses to be
* removed from the mapping. To work around this we treat any munmap style
* unbinds which require a rebind as kernel operations (BO eviction or userptr
* invalidation). The first operation waits on the VM's
* DMA_RESV_USAGE_PREEMPT_FENCE slots (waits for all pending jobs on VM to
* complete / triggers preempt fences) and the last operation is installed in
* the VM's DMA_RESV_USAGE_KERNEL slot (blocks future jobs / resume compute mode
* VM). The caveat is that all dma-resv slots must be updated atomically with
* respect to execs and the compute mode rebind worker. To accomplish this, hold
* the vm->lock in write mode from the first operation until the last.
*
* Deferred binds in fault mode
* ----------------------------
*
* If a VM is in fault mode (TODO: link to fault mode), new bind operations that
* create mappings are by default deferred to the page fault handler (first
* use). This behavior can be overridden by setting the flag
* DRM_XE_VM_BIND_FLAG_IMMEDIATE, which indicates the mapping should be created
* immediately.
*
* User pointer
* ============
*
* User pointers are user allocated memory (malloc'd, mmap'd, etc..) for which the
* user wants to create a GPU mapping. Typically, other DRM drivers create a dummy
* BO and then create a binding for it. We bypass creating a dummy BO in XE and
* simply create a binding directly from the userptr.
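*
* A hedged sketch of binding a malloc'd buffer directly, again assuming the
* drm_xe uAPI bind op layout (the userptr field aliases obj_offset in a union;
* vm_id is a placeholder and field names may differ between uAPI revisions):
*
* .. code-block::
*
*	void *ptr = aligned_alloc(0x10000, 0x10000);	// user allocated memory
*
*	struct drm_xe_vm_bind bind = {
*		.vm_id = vm_id,
*		.num_binds = 1,
*		.bind = {
*			.userptr = (uintptr_t)ptr,	// no dummy BO needed
*			.addr = 0x200000,
*			.range = 0x10000,
*			.op = DRM_XE_VM_BIND_OP_MAP_USERPTR,
*		},
*	};
*
*	ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);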
*
* Invalidation
* ------------
*
* Since this is core kernel managed memory, the kernel can move this memory
* whenever it wants. We register an invalidation MMU notifier to alert XE when
* a user pointer is about to move. The invalidation notifier needs to block
* until all pending users (jobs or compute mode engines) of the userptr are
* idle to ensure no faults. This is done by waiting on all of the VM's dma-resv
* slots.
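*
* A hedged, kernel-side sketch of the shape of such a notifier. vm_from_notifier()
* is an illustrative helper, the exact dma-resv usage that covers "all slots" is
* simplified to DMA_RESV_USAGE_BOOKKEEP, and sequence / retry handling is elided:
*
* .. code-block::
*
*	static bool userptr_invalidate(struct mmu_interval_notifier *mni,
*				       const struct mmu_notifier_range *range,
*				       unsigned long cur_seq)
*	{
*		struct xe_vm *vm = vm_from_notifier(mni);	// illustrative helper
*
*		mmu_interval_set_seq(mni, cur_seq);
*
*		// Block until all pending users (jobs / compute engines) are
*		// idle, i.e. wait on every slot of the VM's dma-resv.
*		dma_resv_wait_timeout(vm->gpuvm.r_obj->resv,
*				      DMA_RESV_USAGE_BOOKKEEP, false,
*				      MAX_SCHEDULE_TIMEOUT);
*
*		// Kick the rebind worker if the VM is in compute mode (not shown).
*		return true;
*	}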
*
* Rebinds
* -------
*
* Either the next exec (non-compute) or rebind worker (compute mode) will
* rebind the userptr. The invalidation MMU notifier kicks the rebind worker
* after the VM dma-resv wait if the VM is in compute mode.
*
* Compute mode
* ============
*
* A VM in compute mode enables long running workloads and ultra low latency
* submission (ULLS). ULLS is implemented via a continuously running batch +
* semaphores. This enables the user to insert jump to new batch commands
* into the continuously running batch. In both cases these batches exceed the
* time a dma fence is allowed to exist for before signaling; as such, dma fences
* are not used when a VM is in compute mode. User fences (TODO: link user fence
* doc) are used instead to signal an operation's completion.
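*
* A hedged sketch of a user fence used in place of a dma fence, assuming the
* drm_xe uAPI sync types; the kernel writes timeline_value to the user address
* when the operation completes (field names may differ between uAPI revisions):
*
* .. code-block::
*
*	volatile __u64 fence_value = 0;	// user memory the kernel writes to
*
*	struct drm_xe_sync ufence = {
*		.type = DRM_XE_SYNC_TYPE_USER_FENCE,
*		.flags = DRM_XE_SYNC_FLAG_SIGNAL,
*		.addr = (uintptr_t)&fence_value,
*		.timeline_value = 1,
*	};
*
*	// Pass as an out sync of an exec or bind, then poll or use the user
*	// fence wait IOCTL until fence_value becomes 1.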
*
* Preempt fences
* --------------
*
* If the kernel decides to move memory around (either userptr invalidation, BO
* eviction, or munmap style unbind which results in a rebind) and a batch is
* running on an engine, that batch can fault or cause memory corruption as the
* page tables for the moved memory are no longer valid. To work around this we
* introduce the concept of preempt fences. When sw signaling is enabled on a
* preempt fence it tells the submission backend to kick that engine off the
* hardware and the preempt fence signals when the engine is off the hardware.
* Once all preempt fences are signaled for a VM the kernel can safely move the
* memory and kick the rebind worker which resumes execution of all the engines.
*
* A preempt fence, for every engine using the VM, is installed into the VM's
* dma-resv DMA_RESV_USAGE_PREEMPT_FENCE slot. The same preempt fence, for every
* engine using the VM, is also installed into the same dma-resv slot of every
* external BO mapped in the VM.
*
* Rebind worker
* -------------
*
* The rebind worker is very similar to an exec. It is responsible for rebinding
* evicted BOs or userptrs, waiting on those operations, installing new preempt
* fences, and finally resuming execution of the engines in the VM.
*
* Flow
* ~~~~
*
* .. code-block::
*
* <----------------------------------------------------------------------|
* Check if VM is closed, if so bail out |
* Lock VM global lock in read mode |
* Pin userptrs (also finds userptr invalidated since last rebind worker) |
* Lock VM dma-resv and external BOs dma-resv |
* Validate BOs that have been evicted |
* Wait on and allocate new preempt fences for every engine using the VM |
* Rebind invalidated userptrs + evicted BOs |
* Wait on last rebind fence |
* Wait VM's DMA_RESV_USAGE_KERNEL dma-resv slot |
* Install preempt fences and issue resume for every engine using the VM |
* Check if any userptrs invalidated since pin |
* Squash resume for all engines |
* Unlock all |
* Wait all VM's dma-resv slots |
* Retry ----------------------------------------------------------
* Release all engines waiting to resume
* Unlock all
*
* Timeslicing
* -----------
*
* In order to prevent an engine from continuously being kicked off the hardware
* and making no forward progress, an engine has a period of time it is allowed to
* run after resume before it can be kicked off again. This effectively gives
* each engine a timeslice.
*
* Handling multiple GTs
* =====================
*
* If a GT has slower access to some regions and the page table structure is in
* the slow region, the performance of that GT could be adversely affected. To
* work around this we allow a VM's page tables to be shadowed in multiple GTs.
* When a VM is created, a default bind engine and page table structure are
* created on each GT.
*
* Binds can optionally pass in a mask of GTs where a mapping should be created;
* if this mask is zero then we default to all the GTs where the VM has page
* tables.
*
* The implementation for this breaks down into a bunch of for_each_gt loops in
* various places plus exporting a composite fence for multi-GT binds to the
* user.
*
* Fault mode (unified shared memory)
* ==================================
*
* A VM in fault mode can be enabled on devices that support page faults. If
* page faults are enabled, using dma fences can potentially induce a deadlock:
* A pending page fault can hold up the GPU work which holds up the dma fence
* signaling, and memory allocation is usually required to resolve a page
* fault, but memory allocation is not allowed to gate dma fence signaling. As
* such, dma fences are not allowed when VM is in fault mode. Because dma-fences
* are not allowed, only long running workloads and ULLS are enabled on a faulting
* VM.
*
* Deferred VM binds
* -----------------
*
* By default, on a faulting VM, binds just allocate the VMA and the actual
* updating of the page tables is deferred to the page fault handler. This
* behavior can be overridden by setting the flag DRM_XE_VM_BIND_FLAG_IMMEDIATE in
* the VM bind, which will then do the bind immediately.
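*
* For example (a hedged sketch against the drm_xe uAPI; bo_handle is a
* placeholder and flag placement may differ between uAPI revisions):
*
* .. code-block::
*
*	struct drm_xe_vm_bind_op op = {
*		.obj = bo_handle,
*		.addr = 0x300000,
*		.range = 0x10000,
*		.op = DRM_XE_VM_BIND_OP_MAP,
*		.flags = DRM_XE_VM_BIND_FLAG_IMMEDIATE,	// bind now, not on first fault
*	};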
*
* Page fault handler
* ------------------
*
* Page faults are received in the G2H worker under the CT lock which is in the
* path of dma fences (no memory allocations are allowed, while faults require
* memory allocations), thus we cannot process faults under the CT lock. Another
* issue is that faults issue TLB invalidations which require G2H credits, and we
* cannot allocate G2H credits in the G2H handlers without deadlocking. Lastly, we
* do not want the CT lock to be an outer lock of the VM global lock (the VM
* global lock is required for fault processing).
*
* To work around the above issue with processing faults in the G2H worker, we
* sink faults to a buffer which is large enough to sink all possible faults on
* the GT (1 per hardware engine) and kick a worker to process the faults. Since
* the page faults G2H are already received in a worker, kicking another worker
* adds more latency to a critical performance path. We add a fast path in the
* G2H irq handler which looks at the first G2H and, if it is a page fault, sinks
* the fault to the buffer and kicks the worker to process the fault. TLB
* invalidation responses are also in the critical path so these can also be
* processed in this fast path.
*
* Multiple buffers and workers are used and hashed over based on the ASID so
* faults from different VMs can be processed in parallel.
*
* The page fault handler itself is rather simple; the flow is below.
*
* .. code-block::
*
* Lookup VM from ASID in page fault G2H
* Lock VM global lock in read mode
* Lookup VMA from address in page fault G2H
* Check if VMA is valid, if not bail
* Check if VMA's BO has backing store, if not allocate
* <----------------------------------------------------------------------|
* If userptr, pin pages |
* Lock VM & BO dma-resv locks |
* If atomic fault, migrate to VRAM, else validate BO location |
* Issue rebind |
* Wait on rebind to complete |
* Check if userptr invalidated since pin |
* Drop VM & BO dma-resv locks |
* Retry ----------------------------------------------------------
* Unlock all
* Issue blocking TLB invalidation
* Send page fault response to GuC
*
* Access counters
* ---------------
*
* Access counters can be configured to trigger a G2H indicating the device is
* accessing VMAs in system memory frequently, as a hint to migrate those VMAs to
* VRAM.
*
* Same as the page fault handler, access counter G2H messages cannot be processed
* in the G2H worker under the CT lock. Again we use a buffer to sink access
* counter G2H. Unlike page faults there is no upper bound, so if the buffer is
* full we simply drop the G2H. Access counters are a best case optimization and
* it is safe to drop these, unlike page faults.
*
* The access counter handler itself is rather simple; the flow is below.
*
* .. code-block::
*
* Lookup VM from ASID in access counter G2H
* Lock VM global lock in read mode
* Lookup VMA from address in access counter G2H
* If userptr, bail nothing to do
* Lock VM & BO dma-resv locks
* Issue migration to VRAM
* Unlock all
*
* Notice that no rebind is issued in the access counter handler, as the rebind
* will be issued on the next page fault.
*
* Caveats with eviction / user pointer invalidation
* -------------------------------------------------
*
* In the case of eviction and user pointer invalidation on a faulting VM, there
* is no need to issue a rebind; rather, we just need to blow away the page tables
* for the VMAs and the page fault handler will rebind the VMAs when they fault.
* The caveat is that the VM global lock is needed to update / read the page table
* structure. In both the eviction and user pointer invalidation cases, locks are
* held which make acquiring the VM global lock impossible. To work around this,
* every VMA maintains a list of leaf page table entries which should be written
* to zero to blow away the VMA's page tables. After writing zero to these
* entries a blocking TLB invalidation is issued. At this point it is safe for the
* kernel to move the VMA's memory around. This is a necessarily lockless
* algorithm and is safe as leaves cannot be changed while either an eviction or
* userptr invalidation is occurring.
*
* Locking
* =======
*
* VM locking protects all of the core data paths (bind operations, execs,
* evictions, and compute mode rebind worker) in XE.
*
* Locks
* -----
*
* VM global lock (vm->lock) - rw semaphore lock. Outermost lock which protects
* the list of userptrs mapped in the VM, the list of engines using this VM, and
* the array of external BOs mapped in the VM. Any path adding or removing any of
* the aforementioned state from the VM must acquire this lock in write mode. The
* VM bind path also acquires this lock in write mode, while the exec / compute
* mode rebind worker acquires this lock in read mode.
*
* VM dma-resv lock (vm->gpuvm.r_obj->resv->lock) - WW lock. Protects VM dma-resv
* slots, which are shared with any private BO in the VM. Expected to be acquired
* during VM binds, execs, and by the compute mode rebind worker. This lock is
* also held when private BOs are being evicted.
*
* external BO dma-resv lock (bo->ttm.base.resv->lock) - WW lock. Protects
* external BO dma-resv slots. Expected to be acquired during VM binds (in
* addition to the VM dma-resv lock). All external BO dma-resv locks within a VM are
* expected to be acquired (in addition to the VM dma-resv lock) during execs
* and the compute mode rebind worker. This lock is also held when an external
* BO is being evicted.
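*
* A hedged, kernel-side sketch of the lock ordering described above, roughly as
* an exec-like path might take it (the WW acquire context for external BOs and
* all error handling are elided; this is an illustration, not the actual code
* path):
*
* .. code-block::
*
*	down_read(&vm->lock);		// VM global lock, read mode for execs
*
*	// VM dma-resv lock, shared with private BOs. Every external BO's
*	// dma-resv lock must be taken as well, under a common ww_acquire_ctx
*	// (not shown).
*	dma_resv_lock(vm->gpuvm.r_obj->resv, NULL);
*
*	// ... validate / rebind / install fences / submit ...
*
*	dma_resv_unlock(vm->gpuvm.r_obj->resv);
*	up_read(&vm->lock);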
*
* Putting it all together
* -----------------------
*
* 1. An exec and bind operation with the same VM can't be executing at the same
* time (vm->lock).
*
* 2. A compute mode rebind worker and bind operation with the same VM can't be
* executing at the same time (vm->lock).
*
* 3. We can't add / remove userptrs or external BOs to a VM while an exec with
* the same VM is executing (vm->lock).
*
* 4. We can't add / remove userptrs, external BOs, or engines to a VM while a
* compute mode rebind worker with the same VM is executing (vm->lock).
*
* 5. Evictions within a VM can't happen while an exec with the same VM is
* executing (dma-resv locks).
*
* 6. Evictions within a VM can't happen while a compute mode rebind worker
* with the same VM is executing (dma-resv locks).
*
* dma-resv usage
* ==============
*
* As previously stated, to enforce the ordering of kernel ops (eviction, userptr
* invalidation, munmap style unbinds which result in a rebind), rebinds during
* execs, execs, and resumes in the rebind worker, we use both the VM's and
* external BOs' dma-resv slots. Let's try to make this as clear as possible.
*
* Slot installation
* -----------------
*
* 1. Jobs from kernel ops install themselves into the DMA_RESV_USAGE_KERNEL
* slot of either an external BO or VM (depends on if kernel op is operating on
* an external or private BO)
*
* 2. In non-compute mode, jobs from execs install themselves into the
* DMA_RESV_USAGE_BOOKKEEP slot of the VM
*
* 3. In non-compute mode, jobs from execs install themselves into the
* DMA_RESV_USAGE_WRITE slot of all external BOs in the VM
*
* 4. Jobs from binds install themselves into the DMA_RESV_USAGE_BOOKKEEP slot
* of the VM
*
* 5. Jobs from binds install themselves into the DMA_RESV_USAGE_BOOKKEEP slot
* of the external BO (if the bind is to an external BO, this is in addition to #4)
*
* 6. Every engine using a compute mode VM has a preempt fence installed into
* the DMA_RESV_USAGE_PREEMPT_FENCE slot of the VM
*
* 7. Every engine using a compute mode VM has a preempt fence installed into
* the DMA_RESV_USAGE_PREEMPT_FENCE slot of all the external BOs in the VM
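*
* A hedged sketch of how the installation rules above map onto dma-resv calls;
* DMA_RESV_USAGE_PREEMPT_FENCE follows this document's naming and may map onto a
* different usage value in a given kernel, the fence variables are placeholders,
* and the resv locks are assumed to be held with fence slots reserved:
*
* .. code-block::
*
*	// #1: job from a kernel op on the VM (or a private BO)
*	dma_resv_add_fence(vm->gpuvm.r_obj->resv, kernel_job_fence,
*			   DMA_RESV_USAGE_KERNEL);
*
*	// #3: non-compute mode exec job against an external BO
*	dma_resv_add_fence(bo->ttm.base.resv, exec_job_fence,
*			   DMA_RESV_USAGE_WRITE);
*
*	// #6: preempt fence for an engine using a compute mode VM
*	dma_resv_add_fence(vm->gpuvm.r_obj->resv, preempt_fence,
*			   DMA_RESV_USAGE_PREEMPT_FENCE);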
*
* Slot waiting
* ------------
*
* 1. The execution of all jobs from kernel ops shall wait on all slots
* (DMA_RESV_USAGE_PREEMPT_FENCE) of either an external BO or VM (depends on if
* kernel op is operating on external or private BO)
*
* 2. In non-compute mode, the execution of all jobs from rebinds in execs shall
* wait on the DMA_RESV_USAGE_KERNEL slot of either an external BO or VM
* (depends on if the rebind is operating on an external or private BO)
*
* 3. In non-compute mode, the execution of all jobs from execs shall wait on the
* last rebind job
*
* 4. In compute mode, the execution of all jobs from rebinds in the rebind
* worker shall wait on the DMA_RESV_USAGE_KERNEL slot of either an external BO
* or VM (depends on if rebind is operating on external or private BO)
*
* 5. In compute mode, resumes in rebind worker shall wait on last rebind fence
*
* 6. In compute mode, resumes in rebind worker shall wait on the
* DMA_RESV_USAGE_KERNEL slot of the VM
*
* Putting it all together
* -----------------------
*
* 1. New jobs from kernel ops are blocked behind any existing jobs from
* non-compute mode execs
*
* 2. New jobs from non-compute mode execs are blocked behind any existing jobs
* from kernel ops and rebinds
*
* 3. New jobs from kernel ops are blocked behind all preempt fences signaling in
* compute mode
*
* 4. Compute mode engine resumes are blocked behind any existing jobs from
* kernel ops and rebinds
*
* Future work
* ===========
*
* Support large pages for sysmem and userptr.
*
* Update page faults to handle BOs at page level granularity (e.g. part of a BO
* could be in system memory while another part could be in VRAM).
*
* The page fault handler will likely be optimized a bit more (e.g. rebinds always
* wait on the dma-resv kernel slots of the VM or BO, while technically we only
* have to wait on the BO moving. If using a job to do the rebind, we could avoid
* blocking in the page fault handler and instead attach a callback to the rebind
* job's fence to signal page fault completion. Our handling of short circuiting
* for atomic faults on bound VMAs could be better. etc...). We can tune all of
* this once we have benchmarks / performance numbers from workloads up and
* running.
*/
#endif