Periodically check worker health and stop inoperative workers
Cards tested in QEMU workers may experience issues that prevent them from being used in future tests. A common example is the card getting stuck in a lower-power mode from which the host cannot recover, as seen in this dmesg output:
[ 9600.991035] vfio-pci 0000:09:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 9600.992350] vfio-pci 0000:09:00.1: Unable to change power state from D3cold to D0, device inaccessible
The host needs to reboot to recover from this.
However, until that happens, the worker will continue to accept jobs, which will all abort with tmpfail. This can prevent other, still operative workers from accepting jobs.
A small helper is needed that periodically checks dmesg for known alerts and stops the worker if such an alert occurs.
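One possible shape for such a helper is sketched below. It is only a starting point under several assumptions: the single alert pattern is derived from the log lines quoted above, the worker is assumed to run as a systemd unit, and the unit name `qemu-worker.service`, the function names, and the 60-second interval are all hypothetical placeholders.

```python
import re
import subprocess
import time

# Known-bad kernel messages. This one pattern is derived from the log lines
# quoted in this issue; further patterns would be added as they are found.
KNOWN_ALERTS = [
    re.compile(
        r"vfio-pci \S+: Unable to change power state from D3cold to D0, "
        r"device inaccessible"
    ),
]

# Hypothetical unit name (assumption): replace with however the worker is run.
WORKER_UNIT = "qemu-worker.service"


def find_alerts(dmesg_text, patterns=KNOWN_ALERTS):
    """Return the dmesg lines that match any known alert pattern."""
    return [
        line
        for line in dmesg_text.splitlines()
        if any(p.search(line) for p in patterns)
    ]


def read_dmesg():
    """Read the kernel ring buffer (may require elevated privileges)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout


def stop_worker(unit=WORKER_UNIT):
    """Stop the worker; assumes it is managed as a systemd unit."""
    subprocess.run(["systemctl", "stop", unit], check=True)


def watch(interval_s=60):
    """Poll dmesg and stop the worker on the first known alert."""
    while True:
        hits = find_alerts(read_dmesg())
        if hits:
            print("Stopping worker due to:", *hits, sep="\n  ")
            stop_worker()
            return hits
        time.sleep(interval_s)
```

Calling `watch()` from a small service or timer on the host would run the check in a loop. Keeping the matching logic in `find_alerts` separate from the `dmesg` and `systemctl` calls makes it possible to test new patterns against saved log excerpts without touching a real worker.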