
autopkgtest-virt-podman+rocm: support minimal device passthrough

To restrict GPU access within podman containers, the ROCm documentation on restricting GPU access notes that you can expose /dev/dri/renderD<N> rather than all of /dev/dri. The render node for a given GPU can be looked up by PCIe ID using the symlinks in /dev/dri/by-path. For example,

$ readlink -f /dev/dri/by-path/pci-0000:04:00.0-render
/dev/dri/renderD128

This would require a change in autopkgtest-virt-podman+rocm. Perhaps if --gpu=<pcieid> is passed, autopkgtest-virt-podman+rocm could look up /dev/dri/by-path/pci-0000:<pcieid>-render and pass the resolved device instead of /dev/dri (or multiple devices if --gpu is passed multiple times).
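As a rough sketch of the lookup described above (not the actual autopkgtest-virt-podman+rocm code), the resolution could look something like this in Python. The function names and the by_path_dir parameter are hypothetical, the --gpu option does not exist yet, and the PCI domain is assumed to be 0000 as in the example. Passing /dev/kfd alongside the render node is also an assumption, based on what ROCm containers usually need.

```python
import os


def resolve_render_node(pcieid, by_path_dir="/dev/dri/by-path"):
    """Resolve a PCIe ID such as '04:00.0' to its render node path.

    Assumes PCI domain 0000, matching the by-path symlink in the example.
    """
    link = os.path.join(by_path_dir, f"pci-0000:{pcieid}-render")
    if not os.path.exists(link):
        raise FileNotFoundError(f"no render node symlink for GPU {pcieid}: {link}")
    # Same effect as `readlink -f`: follow the symlink to the real device node.
    return os.path.realpath(link)


def podman_device_args(pcieids, by_path_dir="/dev/dri/by-path"):
    """Build podman --device arguments for the selected GPUs only.

    /dev/kfd is always included (assumption: the ROCm compute interface
    is needed no matter which GPUs are exposed).
    """
    args = ["--device=/dev/kfd"]
    for pcieid in pcieids:
        args.append("--device=" + resolve_render_node(pcieid, by_path_dir))
    return args
```

With the example system above, podman_device_args(["04:00.0"]) would give ["--device=/dev/kfd", "--device=/dev/dri/renderD128"], and each additional --gpu value would append one more --device entry.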

This mechanism is not strictly required for running multiple GPU workers on a single node with podman. It might still be possible to use debci_autopkgtest_args="--env ROCR_VISIBLE_DEVICES=<N>" to restrict execution to a single GPU. Unfortunately, this use of ROCR_VISIBLE_DEVICES causes segfaults in hipsparse when I try it on Argo. This may be because the GPU is still partly visible via /dev/dri, or it could be related to NUMA. Or maybe it's related to the unusual design of the AMD FirePro S9300 x2, which has two GPUs per card.

Edited by Cordell Bloor