Provided by: criu_4.0-4_amd64

NAME


       criu-amdgpu-plugin - A plugin extension to CRIU to support checkpoint/restore in userspace for AMD GPUs.


CURRENT SUPPORT


       •   Single and Multi GPU systems (Gfx9)
       •   Checkpoint / Restore on a different system
       •   Checkpoint / Restore inside a Docker container
       •   PyTorch
       •   TensorFlow
       •   Using CRIU Image Streamer

DESCRIPTION


       Although criu is a capable tool for checkpointing and restoring running applications, it has
       limitations: for example, it cannot handle applications that have device files open. Supporting
       ROCm-based workloads therefore requires augmenting criu’s core functionality through a plugin-based
       extension mechanism. criu-amdgpu-plugin provides the support criu needs to allow Checkpoint / Restore
       with ROCm.

DEPENDENCIES


       amdkfd support
           To snapshot the VRAM and other GPU device state, an updated version of the amdkfd (amdgpu)
           driver is required.

OPTIONS


       Optional parameters can be passed in as environment variables set before executing the criu command.
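
       As a minimal sketch of the two usual ways to pass such variables, the functions below either prefix
       a single criu invocation or export several variables inside a subshell. The function names, the
       CRIU_BIN variable, and the image directory argument are hypothetical placeholders chosen for
       illustration; only the KFD_* variable names come from this page, and restore, --images-dir and
       --shell-job are standard criu options.

```shell
#!/bin/sh
# Sketch only. CRIU_BIN is a placeholder so the binary can be overridden
# (for example, for a dry run); it is not a variable the plugin reads.
CRIU_BIN="${CRIU_BIN:-criu}"

# Option 1: prefix the command, so the variable applies to this one
# criu invocation only. $1 is the image directory.
restore_skip_fw_check() {
    KFD_FW_VER_CHECK=0 "$CRIU_BIN" restore --images-dir "$1" --shell-job
}

# Option 2: export several variables inside a subshell, so they reach
# criu without leaking into the calling shell. $1 is the image directory.
restore_skip_fw_checks() {
    (
        export KFD_FW_VER_CHECK=0
        export KFD_SDMA_FW_VER_CHECK=0
        "$CRIU_BIN" restore --images-dir "$1" --shell-job
    )
}
```

       For example, restore_skip_fw_check /tmp/ckpt would restore from the (hypothetical) directory
       /tmp/ckpt with only the firmware version check disabled.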

       KFD_FW_VER_CHECK
           Enable or disable the firmware version check. If enabled, the firmware version on the restored
           GPU must be greater than or equal to the firmware version on the checkpointed GPU.
           Default: Enabled

               E.g:
               KFD_FW_VER_CHECK=0

       KFD_SDMA_FW_VER_CHECK
           Enable or disable the SDMA firmware version check. If enabled, the SDMA firmware version on the
           restored GPU must be greater than or equal to the SDMA firmware version on the checkpointed GPU.
           Default: Enabled

               E.g:
               KFD_SDMA_FW_VER_CHECK=0

       KFD_CACHES_COUNT_CHECK
           Enable or disable the caches count check. If enabled, the caches count on the restored GPU must
           be greater than or equal to the caches count on the checkpointed GPU. Default: Enabled

               E.g:
               KFD_CACHES_COUNT_CHECK=0

       KFD_NUM_GWS_CHECK
           Enable or disable the num_gws check. If enabled, num_gws on the restored GPU must be greater
           than or equal to num_gws on the checkpointed GPU. Default: Enabled

               E.g:
               KFD_NUM_GWS_CHECK=0

       KFD_VRAM_SIZE_CHECK
           Enable or disable the VRAM size check. If enabled, the VRAM size on the restored GPU must be
           greater than or equal to the VRAM size on the checkpointed GPU. Default: Enabled

               E.g:
               KFD_VRAM_SIZE_CHECK=0
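
       The firmware, SDMA firmware, caches count, num_gws and VRAM size checks above all follow the same
       "greater than or equal" rule. The sketch below only illustrates that documented rule in shell; it
       is not the plugin's implementation, and the function name and the sample values are hypothetical.

```shell
#!/bin/sh
# Illustration of the documented rule only; this is not the plugin's code.
# Usage: passes_min_check ENABLED CHECKPOINTED RESTORED
# Succeeds (exit status 0) when the check is disabled (ENABLED is 0) or
# when the restored value is at least the checkpointed value.
passes_min_check() {
    [ "$1" = "0" ] || [ "$3" -ge "$2" ]
}

# Report the outcome of one check by name, using hypothetical values.
# Usage: report_check NAME ENABLED CHECKPOINTED RESTORED
report_check() {
    if passes_min_check "$2" "$3" "$4"; then
        echo "$1: ok"
    else
        echo "$1: restored value $4 is below checkpointed value $3"
    fi
}
```

       With the check enabled, report_check KFD_NUM_GWS_CHECK 1 64 32 reports a failure, while passing 0
       as the second argument (check disabled) reports success for the same values.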

       KFD_NUMA_CHECK
           Enable or disable the NUMA CPU region check. If enabled, the plugin will restore GPUs that
           belong to one CPU NUMA region to the same CPU NUMA region. Default: Enabled

               E.g:
               KFD_NUMA_CHECK=1

       KFD_CAPABILITY_CHECK
           Enable or disable the capability check. If enabled, the capability on the restored GPU must be
           equal to the capability on the checkpointed GPU. Default: Enabled

               E.g:
               KFD_CAPABILITY_CHECK=1

       KFD_MAX_BUFFER_SIZE
           On some systems, VRAM size may exceed RAM size, so the buffers used for dumping and restoring
           VRAM may not fit in memory. Set to a nonzero value (in bytes) to limit the plugin’s memory
           usage. Default: 0 (Disabled)

               E.g:
               KFD_MAX_BUFFER_SIZE="2G"
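
           Since the limit is described in bytes, one option is to compute an explicit byte count with
           shell arithmetic rather than rely on a size suffix. The sketch below assumes binary units
           (1 GiB = 1024^3 bytes); the function names, the CRIU_BIN variable, and the PID and directory
           arguments are hypothetical placeholders, while dump, --tree, --images-dir and --shell-job are
           standard criu options.

```shell
#!/bin/sh
# Sketch only. CRIU_BIN is a placeholder so the binary can be overridden;
# it is not a variable the plugin reads.
CRIU_BIN="${CRIU_BIN:-criu}"

# Convert a whole number of GiB to bytes (assumes 1 GiB = 1024^3 bytes).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Dump process tree $1 into image directory $2 with KFD_MAX_BUFFER_SIZE
# set to $3 GiB, expressed in bytes, for this invocation only.
dump_with_buffer_limit() {
    KFD_MAX_BUFFER_SIZE="$(gib_to_bytes "$3")" \
        "$CRIU_BIN" dump --tree "$1" --images-dir "$2" --shell-job
}
```

           For example, gib_to_bytes 2 prints 2147483648, so dump_with_buffer_limit 12345 /tmp/ckpt 2
           would run the dump with KFD_MAX_BUFFER_SIZE=2147483648.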

AUTHOR


       The AMDKFD team.

COPYRIGHT


       Copyright (C) 2020-2021, Advanced Micro Devices, Inc. (AMD)

                                                   03/10/2025                                    ROCM SUPPORT(1)