autopkgtest-virt-podman+rocm: support minimal device passthrough
To restrict GPU access within podman containers, the ROCm documentation on restricting GPU access notes that you can expose /dev/dri/renderD<N> rather than all of /dev/dri. The render nodes can be looked up by PCIe ID using the symlinks in /dev/dri/by-path. For example,
$ readlink -f /dev/dri/by-path/pci-0000:04:00.0-render
/dev/dri/renderD128
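The lookup above can be done programmatically. The following is a minimal Python sketch, not code from autopkgtest-virt-podman+rocm: the function name render_node_for_pci is hypothetical, and it assumes the udev-provided /dev/dri/by-path/pci-<domain:bus:dev.fn>-render naming shown in the readlink example. It is equivalent to `readlink -f` on that symlink.

```python
import os


def render_node_for_pci(pci_addr: str, by_path_dir: str = "/dev/dri/by-path") -> str:
    """Resolve the DRM render node for a full PCI address such as '0000:04:00.0'.

    Hypothetical helper: equivalent to
    `readlink -f /dev/dri/by-path/pci-<pci_addr>-render`.
    """
    link = os.path.join(by_path_dir, f"pci-{pci_addr}-render")
    # Fail loudly instead of silently returning a non-existent path.
    if not os.path.islink(link):
        raise FileNotFoundError(f"no render node symlink at {link}")
    # realpath follows the (usually relative, e.g. ../renderD128) symlink.
    return os.path.realpath(link)
```

On the machine from the example, `render_node_for_pci("0000:04:00.0")` would resolve to /dev/dri/renderD128.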
Supporting this would require changes to autopkgtest-virt-podman+rocm, which currently exposes all of /dev/dri. Perhaps, if --gpu=<pcieid> is passed, autopkgtest-virt-podman+rocm could look up /dev/dri/by-path/pci-0000:<pcieid>-render and pass the resolved device instead of /dev/dri (or multiple devices, if --gpu is passed multiple times).
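The proposed --gpu handling could look roughly like the Python sketch below. It is only an illustration under stated assumptions, not the actual implementation of autopkgtest-virt-podman+rocm: the helper names parse_args and device_args are hypothetical, the option is assumed to take the bus:device.function part of the address (e.g. 04:00.0) with the 0000 domain prepended as in the path above, and /dev/kfd is assumed to still be passed unconditionally because ROCm compute needs it.

```python
import argparse
import os

BY_PATH_DIR = "/dev/dri/by-path"


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # action="append" collects every --gpu occurrence into a list.
    parser.add_argument("--gpu", action="append", default=None, metavar="PCIEID",
                        help="expose only this GPU's render node (repeatable)")
    return parser.parse_args(argv)


def device_args(gpus, by_path_dir=BY_PATH_DIR):
    """Build podman --device arguments (hypothetical helper)."""
    # Assumption: /dev/kfd is always required for ROCm compute.
    args = ["--device=/dev/kfd"]
    if not gpus:
        # No --gpu given: fall back to exposing all of /dev/dri, as today.
        args.append("--device=/dev/dri")
        return args
    for pcieid in gpus:
        link = os.path.join(by_path_dir, f"pci-0000:{pcieid}-render")
        if not os.path.islink(link):
            raise SystemExit(f"--gpu={pcieid}: {link} not found")
        # Pass the resolved node (e.g. /dev/dri/renderD128), not the symlink.
        args.append(f"--device={os.path.realpath(link)}")
    return args
```

For example, `device_args(parse_args(["--gpu=04:00.0"]).gpu)` would produce `["--device=/dev/kfd", "--device=/dev/dri/renderD128"]` on the machine from the readlink example, and the list could then be spliced into the podman run command line.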
This mechanism is not strictly required for running multiple GPU workers on a single node with podman. It might still be possible to use debci_autopkgtest_args="--env ROCR_VISIBLE_DEVICES=<N>" to restrict execution to a single GPU. Unfortunately, this use of ROCR_VISIBLE_DEVICES causes segfaults in hipsparse when I try it on Argo. It may be because the GPU is still partly visible via /dev/dri, or it could be related to NUMA. Or it may be related to the unusual design of the AMD FirePro S9300 x2, which has two GPUs per card.