Worker stuck in running state if work-request-completed can't be transmitted
Describe the bug
The Worker gets stuck in the "running" state on the Server if it fails to transmit a work-request-completed API request (e.g. due to a transient network failure) after executing a job.
The main connection (asynchronous websocket) is still active, but the synchronous API connection failed.
The Server won't send any further work requests until the Worker is manually restarted.
The Worker could retry the work-request-completed API call a few more times, and/or it could exit with an error and let systemd restart it (which would reset its state on the Server and re-run the work request).
(Note: re-running a work request that already succeeded will lead to duplicate artifacts in the work request, which we may or may not consider another bug.)
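A minimal sketch of the two suggestions combined (retry a few times, then exit with an error), not a proposed patch. The helper name report_completion_with_retry and its parameters are hypothetical and not part of debusine; send stands in for whatever callable performs the work-request-completed API call. It assumes the Worker's systemd unit is configured to restart the service on a non-zero exit (e.g. Restart=on-failure).

```python
import sys
import time


def report_completion_with_retry(send, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds, with exponential backoff between tries.

    `send` is a hypothetical zero-argument callable wrapping the
    work-request-completed API call. Returns the number of attempts used.
    If every attempt fails, exits with status 1 so that a service manager
    can restart the process.
    """
    for attempt in range(attempts):
        try:
            send()
            return attempt + 1
        except Exception as exc:
            # Broad catch for illustration only; real code would likely
            # restrict this to network/HTTP errors.
            print(
                f"completion update failed "
                f"(attempt {attempt + 1}/{attempts}): {exc}",
                file=sys.stderr,
            )
            if attempt + 1 < attempts:
                # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
                sleep(base_delay * 2 ** attempt)
    # Give up: a non-zero exit lets systemd restart the Worker.
    raise SystemExit(1)
```

With the defaults, a failure that clears within a few seconds is absorbed by the retries, and only a persistent failure causes the Worker to exit. The duplicate-artifact concern noted above still applies to the exit path.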
How to reproduce the bug
- Replace client.debusine.work_request_completed_update with a raise Exception("network error").
- Send a work request
- Send more work requests; the Worker will never be assigned any of them.
Runtime environment
Operating system
bookworm
Versions of debusine and its dependencies
debian/0.1.0-417-g71ccf914