GPU reset failure on Radeon Pro W5700
While testing rocrand on gfx1010 with a Radeon Pro W5700, I saw several error messages in the log. The first set of tests passed despite those messages, but the run then failed when the device was reset for the hiprand tests.
$ sudo debci localtest rocrand
autopkgtest [13:49:15]: starting date and time: 2023-11-08 13:49:15-0700
autopkgtest [13:49:15]: version 5.30+rocm3
autopkgtest [13:49:15]: host cassiopeia; command line: /usr/bin/autopkgtest --no-built-binaries '--setup-commands=echo '"'"'rocrand unstable/amd64+gfx1010'"'"' > /var/tmp/debci.pkg 2>&1 || true' '--setup-commands=echo '"'"'Acquire::Retries "10";'"'"' > /etc/apt/apt.conf.d/75retry 2>&1 || true' --timeout-test 28800 --user debci --apt-upgrade --output-dir=/tmp/debci-localtest.19boZeS0Qv --shell-fail rocrand -- qemu+rocm /var/lib/debci/qemu+rocm/unstable-amd64.img --timeout-poweroff 30 --ram-size 32768 --cpus 8 --gpu 0000:83:00.0
Unsupported device class.
Unsupported device class.
qemu-system-x86_64: vfio: Cannot reset device 0000:83:00.1, depends on group 34 which is not owned.
autopkgtest [13:49:38]: testbed dpkg architecture: amd64
autopkgtest [13:49:38]: testbed apt version: 2.7.6
autopkgtest [13:49:38]: @@@@@@@@@@@@@@@@@@@@ test bed setup
<...>
[ RUN ] sobol_uniform_distribution_tests.half_test
[ OK ] sobol_uniform_distribution_tests.half_test (31 ms)
[----------] 4 tests from sobol_uniform_distribution_tests (58 ms total)
[----------] Global test environment tear-down
[==========] 16 tests from 4 test suites ran. (356 ms total)
[ PASSED ] 16 tests.
autopkgtest [13:54:15]: test command1: -----------------------]
autopkgtest [13:54:15]: test command1: - - - - - - - - - - results - - - - - - - - - -
command1 PASS
autopkgtest [13:54:16]: test command2: preparing testbed
qemu-system-x86_64: vfio: Cannot reset device 0000:83:00.1, depends on group 34 which is not owned.
qemu-system-x86_64: vfio: Error: Failed to setup MSI fds: Invalid argument
qemu-system-x86_64: vfio: Error: Failed to enable MSI
error: kvm run failed Bad address
RAX=ffffa5d3c06ac000 RBX=ffff8adfc3fbd028 RCX=0000000000000000 RDX=0000000000000000
RSI=0000000000000000 RDI=ffff8adfc3fbd4f8 RBP=ffff8adfc3fbd028 RSP=ffffa5d3c07839f8
R8 =0000000000000002 R9 =0000000000000081 R10=0000000000000001 R11=ffffffff99377190
R12=0000000000000000 R13=ffff8adfc12b51b4 R14=0000000000000001 R15=0000000000000000
RIP=ffffffffc087d6fe RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 00000000 00000000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0000 0000000000000000 00000000 00000000
FS =0000 00007fc0f7ebe8c0 00000000 00000000
GS =0000 ffff8ae71fdc0000 00000000 00000000
LDT=0000 fffffe0000000000 00000000 00000000
TR =0040 fffffe08c3b35000 00004087 00008b00 DPL=0 TSS64-busy
GDT= fffffe08c3b33000 0000007f
IDT= fffffe0000000000 00000fff
CR0=80050033 CR2=000055e64e7ce0b8 CR3=000000010a72a000 CR4=00350ee0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=b6 1e 00 f6 85 30 08 00 00 40 0f 85 8e 00 00 00 48 8b 45 20 <66> 44 8b 68 0e 48 89 ef e8 b5 fc ff ff be 01 00 00 00 48 89 ef e8 58 fa ff ff f6 85 3d 06
Unexpected error:
Traceback (most recent call last):
File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 328, in expect
block = sock.recv(4096)
^^^^^^^^^^^^^^^
File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 75, in alarm_handler
raise Timeout(to)
VirtSubproc.Timeout: 60
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/bin/autopkgtest-virt-qemu", line 591, in hook_open
wait_boot()
File "/usr/bin/autopkgtest-virt-qemu", line 123, in wait_boot
VirtSubproc.expect(term, b' login: ', args.timeout_reboot,
File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 325, in expect
with timeout(timeout_sec,
File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 276, in __exit__
bomb(self.exit_msg)
File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 95, in bomb
raise Quit(12, progname + ": failure: %s" % m)
VirtSubproc.Quit: (12, "<VirtSubproc>: failure: timed out waiting for 'login prompt on serial console'")
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 830, in mainloop
command()
File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 759, in command
r = f(c, ce)
^^^^^^^^
File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 392, in cmd_revert
caller.hook_revert()
File "/usr/bin/autopkgtest-virt-qemu", line 623, in hook_revert
hook_open()
File "/usr/bin/autopkgtest-virt-qemu", line 608, in hook_open
hook_cleanup()
File "/usr/bin/autopkgtest-virt-qemu", line 633, in hook_cleanup
VirtSubproc.check_exec(['poweroff'], downp=True, timeout=1)
File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 193, in check_exec
(status, out, err) = execute_timeout(None, timeout, real_argv,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 151, in execute_timeout
sp = subprocess.Popen(*popenargs,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/subprocess.py", line 1024, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.11/subprocess.py", line 1901, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/autopkgtest-qemu.8av79sh9/runcmd'
autopkgtest [13:55:19]: ERROR: testbed failure: unexpected eof from the testbed
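For anyone trying to reproduce this, the following is a minimal diagnostic sketch in Python, not a fix. It pulls the device and IOMMU group out of the "Cannot reset device" warnings in a log like the one above, then lists which PCI functions the host places in that group, using the standard Linux sysfs directory /sys/kernel/iommu_groups. The function names find_reset_dependencies and devices_in_group are hypothetical helpers introduced for illustration. The idea that the unowned group is related to the later failure is an assumption that this sketch only helps to check.

```python
import re
from pathlib import Path

# Matches QEMU's warning as it appears in the log above, e.g.
# "vfio: Cannot reset device 0000:83:00.1, depends on group 34 which is not owned."
RESET_RE = re.compile(
    r"vfio: Cannot reset device (?P<device>[0-9a-fA-F:.]+), "
    r"depends on group (?P<group>\d+) which is not owned"
)


def find_reset_dependencies(log_text):
    """Return sorted, de-duplicated (device, group) pairs from vfio reset warnings."""
    pairs = {(m["device"], int(m["group"])) for m in RESET_RE.finditer(log_text)}
    return sorted(pairs)


def devices_in_group(group, root="/sys/kernel/iommu_groups"):
    """List the PCI addresses in an IOMMU group on the host.

    Returns an empty list if the group directory does not exist
    (for example, when the IOMMU is disabled or the group number is wrong).
    """
    group_dir = Path(root) / str(group) / "devices"
    if not group_dir.is_dir():
        return []
    return sorted(p.name for p in group_dir.iterdir())


if __name__ == "__main__":
    sample = (
        "qemu-system-x86_64: vfio: Cannot reset device 0000:83:00.1, "
        "depends on group 34 which is not owned."
    )
    for device, group in find_reset_dependencies(sample):
        print(device, "depends on group", group, "containing", devices_in_group(group))
```

Run on the host rather than inside the testbed, this shows whether group 34 contains functions other than the one passed with --gpu 0000:83:00.0. That may help narrow down why the reset is refused before the second test.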
Edited by Cordell Bloor