Draft: Introduce timeout for lack of forward progress
In cases where --timeout-test is necessarily large (hours), it may be possible to detect a hung test and bail out earlier, rather than wait for the test timeout to expire, thus freeing resources.
Some hung tests can be detected by a lack of forward progress, meaning they stopped producing output.
This introduces a timeout, --timeout-test-noprogress, for this case: if nothing has been written to stdout or stderr for longer than this period, VirtSubproc.Timeout is raised, which then triggers the already existing test timeout functionality.
Generally, when kind == 'test', the output of the test process is not captured, but goes directly to stdout/stderr. When --timeout-test-noprogress > 0, we direct output to PIPE instead, where we wait for it, read it, and forward it to stdout/stderr.
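The wait/read/forward loop described above can be illustrated with a minimal, standalone sketch. This is not the code from this change: the function name, its parameters, and the NoProgressTimeout exception (a stand-in for VirtSubproc.Timeout) are hypothetical, and using the selectors module is just one possible way to wait on both pipes with a timeout.

```python
import os
import selectors
import subprocess
import sys


class NoProgressTimeout(Exception):
    """Hypothetical stand-in for VirtSubproc.Timeout."""


def run_with_noprogress_timeout(argv, noprogress_timeout, out=None, err=None):
    """Run argv, forwarding its output; raise if it stays silent too long.

    out/err are binary writable objects (assumed parameters, defaulting to
    this process's own stdout/stderr).
    """
    out = out if out is not None else sys.stdout.buffer
    err = err if err is not None else sys.stderr.buffer
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    sel = selectors.DefaultSelector()
    # Remember, per pipe, where its data should be forwarded to.
    sel.register(proc.stdout, selectors.EVENT_READ, out)
    sel.register(proc.stderr, selectors.EVENT_READ, err)
    try:
        # Loop until both pipes have reached EOF.
        while sel.get_map():
            events = sel.select(timeout=noprogress_timeout)
            if not events:
                # Neither pipe produced anything within the period.
                proc.kill()
                raise NoProgressTimeout(
                    "no output for %s seconds" % noprogress_timeout)
            for key, _mask in events:
                # os.read returns whatever is available without waiting
                # for a full buffer, so output is forwarded promptly.
                data = os.read(key.fileobj.fileno(), 65536)
                if data:
                    key.data.write(data)
                    key.data.flush()
                else:
                    sel.unregister(key.fileobj)  # EOF on this pipe
        return proc.wait()
    finally:
        sel.close()
        if proc.poll() is None:
            proc.kill()
            proc.wait()
        proc.stdout.close()
        proc.stderr.close()
```

Note that in this sketch the timer effectively restarts whenever either stream produces data, so a test that keeps writing to stderr only still counts as making progress.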
This isn't very efficient, but it is conceptually simple. Popen.communicate uses threads (and also waits for the process to terminate, which is why we can't use it even with its timeout parameter). We could also write to an in-memory file (os.memfd_create), but that would buffer output until the very end.
This isn't very pretty, but it's the least intrusive approach to changing execute().
I've successfully tested this locally, but this should probably get a unit test. Advice on where/how to best implement this would be appreciated.
Background: the Debian ROCm Team has tests running for hours, so our global timeout is pretty large. We also have flaky tests that run into problems. These tests then run until the global timeout is reached, which blocks scarce resources (GPUs).