File: README.md

package info (click to toggle)
criu 4.2-2
links: PTS, VCS
area: main
in suites: forky, sid
size: 11,584 kB
sloc: ansic: 139,280; python: 7,484; sh: 3,824; java: 2,799; makefile: 2,659; asm: 1,137; perl: 206; xml: 117; exp: 45
file content (59 lines) | stat: -rw-r--r-- 3,300 bytes
parent folder | download | duplicates (2)
Checkpoint and Restore for CUDA applications with CRIU
======================================================

# Requirements
The cuda-checkpoint utility should be placed somewhere in your $PATH and an r555
or higher GPU driver is required for CUDA CRIU integration support.

## cuda-checkpoint
The cuda-checkpoint utility can be found at:
https://github.com/NVIDIA/cuda-checkpoint

cuda-checkpoint is a binary utility used to issue checkpointing commands to CUDA
applications. Updating the cuda-checkpoint utility between driver releases
should not be necessary as the utility simply exposes some extra driver behavior
so driver updates are all that's needed to get access to newer features.

# Checkpointing Procedure
cuda-checkpoint exposes 4 actions used in the checkpointing process: lock,
checkpoint, restore, unlock.

* lock - Used with the PAUSE_DEVICES hook while a process is still running to
  quiesce the application into a state where it can be checkpointed
* checkpoint - Used with the CHECKPOINT_DEVICES hook once a process has been
  seized/frozen to perform the actual checkpointing operation
* restore/unlock - Used with the RESUME_DEVICES_LATE hook to restore the CUDA
  state and release the process back to it's running state

These actions are facilitated by a CUDA checkpoint+restore thread that the CUDA
plugin will re-wake when needed.

# Known Limitations
* Currently GPU memory contents are brought into main system memory and CRIU
  then checkpoints that as part of the normal procedure. On systems with many
  GPU's with high GPU memory usage this can cause memory thrashing. A future
  CUDA release will add support for dumping the memory contents to files to
  alleviate this as well as support in the CRIU plugin.
* There's currently a small race between when a PAUSE_DEVICES hook is called on
  a running process and a process calls cuInit() and finishes initializing CUDA
  after the PAUSE is issued but before the process is frozen to checkpoint. This
  will cause cuda-checkpoint to report that the process is in an illegal state
  for checkpointing and it's recommended to just attempt the CRIU procedure
  again, this should be very rare.
* Applications that use NVML will leave some leftover device references as NVML
  is not currently supported for checkpointing. There will be support for this
  in later drivers. A possible temporary workaround is to have the
  {DUMP,RESTORE}_EXT_FILE hook just ignore /dev/nvidiactl and /dev/nvidia{0..N}
  remaining references for these applications as in most cases NVML is used to
  get info such as gpu count and some capabilities and these values are never
  accessed again and unlikely to change.
* CUDA applications that fork() but don't call exec() but also don't issue any
  CUDA API calls will have some leftover references to /dev/nvidia* and fail to
  checkpoint as a result. This can be worked around in a similar fashion to the
  NVML case where the leftover references can be ignored as CUDA is not fork()
  safe anyway.
* Restore currently requires that you restore on a system with similar GPU's and
  same GPU count.
* NVIDIA UVM Managed Memory, MIG (Multi Instance GPU), and MPS (Multi-Process
  Service) are currently not supported for checkpointing. Future CUDA releases
  will add support for these.