Worker stuck in running state if work-request-completed can't be transmitted

Describe the bug

The Worker remains stuck in the "running" state on the Server if it fails to send the work-request-completed API request after executing a job (e.g. because of a transient network failure).

The main connection (the asynchronous websocket) is still active, but the separate synchronous API request has failed.

The Server won't send the Worker any further work requests until the Worker is manually restarted.

The Worker could retry the work-request-completed API call several times, and/or it could exit with an error and let systemd restart it (which would reset its state on the Server and re-run the work request).
(Note: re-running a work request that actually succeeded will lead to duplicate artifacts in the work request, which we may or may not consider a separate bug.)
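To illustrate the retry idea, here is a minimal sketch, not debusine's actual code. The helper name `call_with_retries` and its parameters (`attempts`, `base_delay`, `sleep`) are hypothetical; it simply wraps any callable, retries with exponential backoff, and re-raises the last error once the attempts are exhausted.

```python
import time


def call_with_retries(func, *args, attempts=5, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff.

    Hypothetical helper for illustration: the last exception is
    re-raised once `attempts` calls have failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == attempts:
                # Out of attempts: let the caller decide what to do,
                # e.g. log the error and exit with a non-zero status.
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

If the final attempt still fails, the Worker could let the exception propagate and exit with a non-zero status, so that systemd restarts it (assuming the unit is configured to restart on failure).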

How to reproduce the bug

  • Replace the body of client.debusine.work_request_completed_update with raise Exception("network error").
  • Send a work request.
  • Send more work requests; the Worker will never be assigned any of them.
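The first step above can also be simulated without editing the source, for example in a test. The sketch below uses `unittest.mock.patch.object` on a stand-in object; `StandInClient` and the arguments of its `work_request_completed_update` method are assumptions made for illustration and do not reflect the real debusine client signature.

```python
from unittest import mock


class StandInClient:
    """Hypothetical stand-in for the Worker's API client (not debusine's)."""

    def work_request_completed_update(self, work_request_id, result):
        # Assumed signature, for illustration only.
        return "ok"


def simulate_network_failure(client):
    """Return a context manager that makes the completion call raise."""
    return mock.patch.object(
        client,
        "work_request_completed_update",
        side_effect=Exception("network error"),
    )
```

Inside `with simulate_network_failure(client):`, every call to `client.work_request_completed_update(...)` raises `Exception("network error")`; the original method is restored when the block exits.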

Runtime environment

Operating system

bookworm

Versions of debusine and its dependencies

debian/0.1.0-417-g71ccf914

Edited by Sylvain Beucler