Worker is retrying but retries are still failing without a good reason
I'm still trying to understand why so many workers are failing to send their results on a very regular basis. I noticed the websocket connection seems to be dropped very regularly. When a worker is idle you can see this for example: <details> <summary>Many connected to messages at regular interval</summary> ``` 2026-03-18 22:53:09,784 Connected to https://debusine.debian.net/api 2026-03-18 22:54:04,032 Connected to https://debusine.debian.net/api 2026-03-18 22:56:23,417 Connected to https://debusine.debian.net/api 2026-03-18 23:09:00,644 Connected to https://debusine.debian.net/api 2026-03-18 23:10:28,682 Connected to https://debusine.debian.net/api 2026-03-18 23:14:20,556 Connected to https://debusine.debian.net/api 2026-03-18 23:30:59,643 Connected to https://debusine.debian.net/api 2026-03-19 00:27:59,967 Connected to https://debusine.debian.net/api 2026-03-19 00:56:21,631 Connected to https://debusine.debian.net/api 2026-03-19 01:29:19,369 Connected to https://debusine.debian.net/api 2026-03-19 02:03:16,360 Connected to https://debusine.debian.net/api 2026-03-19 02:15:22,194 Connected to https://debusine.debian.net/api 2026-03-19 02:29:50,818 Connected to https://debusine.debian.net/api 2026-03-19 02:29:56,211 Connected to https://debusine.debian.net/api 2026-03-19 02:51:40,422 Connected to https://debusine.debian.net/api 2026-03-19 02:55:53,834 Connected to https://debusine.debian.net/api 2026-03-19 03:47:48,704 Connected to https://debusine.debian.net/api 2026-03-19 05:08:14,251 Connected to https://debusine.debian.net/api 2026-03-19 06:19:28,734 Connected to https://debusine.debian.net/api 2026-03-19 06:53:01,040 Connected to https://debusine.debian.net/api 2026-03-19 07:13:42,103 Connected to https://debusine.debian.net/api 2026-03-19 07:33:55,523 Connected to https://debusine.debian.net/api 2026-03-19 08:20:57,471 Connected to https://debusine.debian.net/api ``` </details> Given the frequency this also happens when the worker is not idle, and it gives this: ``` 2026-03-18 21:00:17,544 Connected to https://debusine.debian.net/api 2026-03-18 21:01:18,764 Worker is busy and can't execute a new work request 2026-03-18 21:01:21,149 Connected to https://debusine.debian.net/api 2026-03-18 21:02:22,422 Worker is busy and can't execute a new work request 2026-03-18 21:02:29,126 Connected to https://debusine.debian.net/api 2026-03-18 21:03:30,374 Worker is busy and can't execute a new work request 2026-03-18 21:03:35,680 Connected to https://debusine.debian.net/api 2026-03-18 21:04:36,913 Worker is busy and can't execute a new work request 2026-03-18 21:04:38,476 Connected to https://debusine.debian.net/api 2026-03-18 21:05:39,763 Worker is busy and can't execute a new work request ``` But it also happens when the worker is trying to upload artifacts and send results: <details> <summary>Long log extract with retries</summary> ``` 2026-03-18 19:28:17,198 Work request 519552: Cleaning up 2026-03-18 19:28:17,568 Uploading /tmp/debusine-task-debug-log-files-1pmvghfs/cmd-output.log to https://debusine.debian.net/api/1.0/artifact/3449871/files/cmd-output.log/ 2026-03-18 19:28:38,271 Worker is busy and can't execute a new work request 2026-03-18 19:28:43,935 Connected to https://debusine.debian.net/api 2026-03-18 19:29:45,177 Worker is busy and can't execute a new work request 2026-03-18 19:29:49,363 Connected to https://debusine.debian.net/api 2026-03-18 19:30:50,533 Worker is busy and can't execute a new work request 2026-03-18 19:30:55,182 Connected to https://debusine.debian.net/api 2026-03-18 19:31:56,452 Worker is busy and can't execute a new work request 2026-03-18 19:32:00,747 Connected to https://debusine.debian.net/api 2026-03-18 19:33:02,098 Worker is busy and can't execute a new work request 2026-03-18 19:33:05,144 Connected to https://debusine.debian.net/api 2026-03-18 19:33:23,580 Cannot reach server to report work request completed. Exiting. urllib3.exceptions.ResponseError: too many 504 error responses The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/lib/python3/dist-packages/requests/adapters.py", line 667, in send resp = conn.urlopen( method=request.method, ...<9 lines>... chunked=chunked, ) File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 942, in urlopen return self.urlopen( ~~~~~~~~~~~~^ method, ^^^^^^^ ...<13 lines>... **response_kw, ^^^^^^^^^^^^^^ ) ^ File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 942, in urlopen return self.urlopen( ~~~~~~~~~~~~^ method, ^^^^^^^ ...<13 lines>... **response_kw, ^^^^^^^^^^^^^^ ) ^ File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 942, in urlopen return self.urlopen( ~~~~~~~~~~~~^ method, ^^^^^^^ ...<13 lines>... **response_kw, ^^^^^^^^^^^^^^ ) ^ [Previous line repeated 1 more time] File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 932, in urlopen retries = retries.increment(method, url, response=response, _pool=self) File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 519, in increment raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='debusine.debian.net', port=443): Max retries exceeded with url: /api/1.0/work-request/519552/completed/ (Caused by ResponseError('too many 504 error responses')) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib/python3/dist-packages/debusine/client/debusine_http_client.py", line 298, in _api_request response = self._method(method)( url, headers=headers, **optional_kwargs ) File "/usr/lib/python3/dist-packages/requests/sessions.py", line 649, in put return self.request("PUT", url, data=data, **kwargs) ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3/dist-packages/requests/sessions.py", line 589, in request resp = self.send(prep, **send_kwargs) File "/usr/lib/python3/dist-packages/requests/sessions.py", line 703, in send r = adapter.send(request, **kwargs) File "/usr/lib/python3/dist-packages/requests/adapters.py", line 691, in send raise RetryError(e, request=request) requests.exceptions.RetryError: HTTPSConnectionPool(host='debusine.debian.net', port=443): Max retries exceeded with url: /api/1.0/work-request/519552/completed/ (Caused by ResponseError('too many 504 error responses')) The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/lib/python3/dist-packages/debusine/worker/_worker.py", line 796, in _send_task_result await asyncio.to_thread( ...<4 lines>... ) File "/usr/lib/python3.13/asyncio/threads.py", line 25, in to_thread return await loop.run_in_executor(None, func_call) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.13/concurrent/futures/thread.py", line 59, in run result = self.fn(*self.args, **self.kwargs) File "/usr/lib/python3/dist-packages/debusine/client/debusine.py", line 279, in work_request_completed_update return self._debusine_http_client.put( ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ path=self.work_request_completed_path(work_request_id), ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ...<6 lines>... expected_class=None, ^^^^^^^^^^^^^^^^^^^^ ) ^ File "/usr/lib/python3/dist-packages/debusine/client/debusine_http_client.py", line 120, in put return self._api_request( ~~~~~~~~~~~~~~~~~^ "PUT", ^^^^^^ ...<3 lines>... expected_statuses=expected_statuses, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ) ^ File "/usr/lib/python3/dist-packages/debusine/client/debusine_http_client.py", line 302, in _api_request raise exceptions.ClientConnectionError( f'Cannot connect to {url}. Error: {str(exc)}' ) from exc debusine.client.exceptions.ClientConnectionError: Cannot connect to https://debusine.debian.net/api/1.0/work-request/519552/completed/. Error: HTTPSConnectionPool(host='debusine.debian.net', port=443): Max retries exceeded with url: /api/1.0/work-request/519552/completed/ (Caused by ResponseError('too many 504 error responses')) 2026-03-18 19:33:23,673 Unclosed client session client_session: <aiohttp.client.ClientSession object at 0xffffa9d46e40> 2026-03-18 19:33:23,674 Unclosed client session client_session: <aiohttp.client.ClientSession object at 0xffffa9d88410> 2026-03-18 19:33:30,350 Connected to https://debusine.debian.net/api ``` </details> Is it possible that the retries are doomed to fail when that happens? In theory no because sending the results uses plain HTTP and not websocket connections. But then I can't understand why it failed over 5 minutes. So I looked deeper in the load balancer logs. The first thing I noticed is that we seem to have two times the failing PUT request in the logs and it's effectively an HTTP 504 that is sent: ``` 2a01:4f8:2190:1869::1:2:33404 [18/Mar/2026:19:28:21.555] WILDCARD~ DEBIAN_NET_DEBUSINE/poseidon 0/0/49/97/+146 201 +351 - - --NN 116/112/30/29/0 0/0 {debusine.debian.net|python-requests/2.32.3|} {70} "POST /api/1.0/artifact-relation/ HTTP/1.1" 2a01:4f8:2190:1869::1:2:57156 [18/Mar/2026:19:28:43.807] WILDCARD~ DEBIAN_NET_DEBUSINE/poseidon 0/0/51/51/+102 101 +176 - - --NN 121/117/34/34/0 0/0 {debusine.debian.net|Python/3.13 aiohttp/3.11.16|} {} "GET /api/ws/1.0/worker/connect/ HTTP/1.1" 2a01:4f8:2190:1869::1:2:57166 [18/Mar/2026:19:28:44.987] WILDCARD~ DEBIAN_NET_DEBUSINE/poseidon 0/0/49/61/+110 204 +235 - - --NN 124/120/34/33/0 0/0 {debusine.debian.net|Python/3.13 aiohttp/3.11.16|} {0} "PUT /api/1.0/worker/dynamic-metadata/ HTTP/1.1" 2a01:4f8:2190:1869::1:2:33404 [18/Mar/2026:19:28:21.709] WILDCARD~ DEBIAN_NET_DEBUSINE/poseidon 0/0/49/60025/+60074 504 +290 - - --NN 115/111/36/35/0 0/0 {debusine.debian.net|python-requests/2.32.3|} {160} "PUT /api/1.0/work-request/519552/completed/ HTTP/1.1" 2a01:4f8:2190:1869::1:2:33404 [18/Mar/2026:19:28:21.709] WILDCARD~ DEBIAN_NET_DEBUSINE/poseidon 0/0/49/60025/+60074 504 +290 - - --NN 115/111/36/35/0 0/0 {debusine.debian.net|python-requests/2.32.3|} {160} "PUT /api/1.0/work-request/519552/completed/ HTTP/1.1" ``` That HTTP 504 is sent by the load balancer and at the nginx level I can see this: ``` 2026/03/18 19:29:21 [error] 960699#960699: *1996599 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 2a01:4f8:2190:1869::1:2, server: debusine.debian.net, request: "PUT /api/1.0/work-request/519552/completed/ HTTP/1.1", upstream: "http://unix:/var/lib/debusine/server/debusine.sock/api/1.0/work-request/519552/completed/", host: "debusine.debian.net" ``` But at the same time all the others workers are trying to report their results and are failing in the same way! But other URLs are working fine at that time, I see lots of activity in access.log. That pattern happened many times yesterday. So it looks like those requests are timing out likely due to database lock contention? At that time the Debusine server log is full of "[SCHEDULER] Failed to lock WorkRequest" every minute and nothing else happening in between. But what happened in that minute to create that situation? I couldn't identify anything obvious in the journal. FTR the worker analyzed is debusine-worker-arm64-demeter-02.
issue