Bad worker-server interaction when "work-request completed" API call fails
(Diagnosis links here point to a restricted workspace)
Right now https://debusine.freexian.com/freexian/elts-secure-boot/work-request/7036/ shows "lintian for i386" (WR 7050) as "Running". The task data shows that it has been in that situation for 14h (from yesterday ~18h to today 8h). Looking at the logs of the worker, we can see:
2025-03-25 18:10:52,465 Work request 7050: Fetching input
2025-03-25 18:10:52,560 Downloading artifact files into /tmp/debusine-fetch-exec-uplo
ad-9y3z3456
2025-03-25 18:10:56,527 Downloading artifact files into /tmp/debusine-fetch-exec-uplo
ad-9y3z3456
2025-03-25 18:11:02,321 Downloading artifact files into /tmp/debusine-fetch-exec-uplo
ad-9y3z3456
2025-03-25 18:12:28,073 Exception in Task lintian
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/debusine/tasks/_task.py", line 511, in execute_logging_exceptions
return self.execute()
^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/debusine/tasks/_task.py", line 524, in execute
result = self._execute()
^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/debusine/tasks/_task.py", line 837, in _execute
if not self.fetch_input(download_directory):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/debusine/tasks/lintian.py", line 486, in fetch_input
self.fetch_artifact(artifact_id, destination)
File "/usr/lib/python3/dist-packages/debusine/tasks/_task.py", line 748, in fetch_artifact
artifact_response = self.debusine.download_artifact(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/debusine/client/debusine.py", line 495, in download_artifact
self._streaming_download(file.url), path
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/debusine/client/debusine.py", line 463, in _streaming_download
response.raise_for_status()
File "/usr/lib/python3/dist-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://debusine.freexian.com/freexian/elts-secure-boot/artifact/39406/download/linux-image-5.10-686-dbg_5.10.234-1~deb10u1_i386.deb
2025-03-25 18:12:28,080 Task: lintian Error execute: 503 Server Error: Service Unavailable for url: https://debusine.freexian.com/freexian/elts-secure-boot/artifact/39406/download/linux-image-5.10-686-dbg_5.10.234-1~deb10u1_i386.deb
2025-03-25 18:12:29,556 Cannot reach server to report work request completed.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 489, in send
resp = conn.urlopen(
^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 884, in urlopen
return self.urlopen(
^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 884, in urlopen
return self.urlopen(
^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 884, in urlopen
return self.urlopen(
^^^^^^^^^^^^^
[Previous line repeated 1 more time]
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 874, in urlopen
retries = retries.increment(method, url, response=response, _pool=self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 594, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='debusine.freexian.com', port=443): Max retries exceeded with url: /api/1.0/work-request/7050/completed/ (Caused by ResponseError('too many 503 error responses'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/debusine/client/debusine_http_client.py", line 285, in _api_request
response = self._method(method)(
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 647, in put
return self.request("PUT", url, data=data, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 556, in send
raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='debusine.freexian.com', port=443): Max retries exceeded with url: /api/1.0/work-request/7050/completed/ (Caused by ResponseError('too many 503 error responses'))
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/debusine/worker/_worker.py", line 745, in _send_task_result
await asyncio.to_thread(
File "/usr/lib/python3.11/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/debusine/client/debusine.py", line 172, in work_request_completed_update
return self._debusine_http_client.put(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/debusine/client/debusine_http_client.py", line 123, in put
return self._api_request(
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/debusine/client/debusine_http_client.py", line 289, in _api_request
raise exceptions.ClientConnectionError(
debusine.client.exceptions.ClientConnectionError: Cannot connect to https://debusine.freexian.com/api/1.0/work-request/7050/completed/. Error: HTTPSConnectionPool(host='debusine.freexian.com', port=443): Max retries exceeded with url: /api/1.0/work-request/7050/completed/ (Caused by ResponseError('too many 503 error responses'))
The log ends here, nothing else happened (or has been logged) since that failure.
On the server side, the logs look like this:
2a01:4f8:2190:1c1f::1:7:58036 [25/Mar/2025:18:11:57.228] WILDCARD~ FREEXIAN_DEBUSINE/hestia 0/0/9/617/+626 200 +15200 - - --NN 60/59/17/17/0 0/0 {debusine.freexian.com|python-requests/2.28.1} {879364792} "GET /freexian/elts-secure-boot/artifact/39406/download/linux-image-5.10.0-0.deb10.34-rt-686-pae-dbg_5.10.234-1~deb10u1_i386.deb HTTP/1.1"
2a01:4f8:2190:1c1f::1:7:36558 [25/Mar/2025:18:12:16.055] WILDCARD~ FREEXIAN_DEBUSINE/hestia 0/0/6/10505/+10511 200 +15200 - - --NN 66/65/17/17/0 0/0 {debusine.freexian.com|python-requests/2.28.1} {48136880} "GET /freexian/elts-secure-boot/artifact/39406/download/linux-image-5.10.0-0.deb10.34-rt-686-pae-unsigned_5.10.234-1~deb10u1_i386.deb HTTP/1.1"
2a01:4f8:2190:1c1f::1:7:45014 [25/Mar/2025:18:12:27.601] WILDCARD~ FREEXIAN_DEBUSINE/<NOSRV> 0/-1/-1/-1/+0 503 +217 - - SCNN 66/65/15/0/0 0/0 {debusine.freexian.com|python-requests/2.28.1} {} "GET /freexian/elts-secure-boot/artifact/39406/download/linux-image-5.10-686-dbg_5.10.234-1~deb10u1_i386.deb HTTP/1.1"
2a01:4f8:2190:1c1f::1:7:45014 [25/Mar/2025:18:12:27.601] WILDCARD~ FREEXIAN_DEBUSINE/<NOSRV> 0/-1/-1/-1/+0 503 +217 - - SCNN 66/65/15/0/0 0/0 {debusine.freexian.com|python-requests/2.28.1} {} "GET /freexian/elts-secure-boot/artifact/39406/download/linux-image-5.10-686-dbg_5.10.234-1~deb10u1_i386.deb HTTP/1.1"
2a01:4f8:2190:1c1f::1:7:45016 [25/Mar/2025:18:12:28.143] WILDCARD~ FREEXIAN_DEBUSINE/<NOSRV> 0/-1/-1/-1/+0 503 +217 - - SCNN 66/65/14/0/0 0/0 {debusine.freexian.com|python-requests/2.28.1} {} "PUT /api/1.0/work-request/7050/completed/ HTTP/1.1"
2a01:4f8:2190:1c1f::1:7:45016 [25/Mar/2025:18:12:28.143] WILDCARD~ FREEXIAN_DEBUSINE/<NOSRV> 0/-1/-1/-1/+0 503 +217 - - SCNN 66/65/14/0/0 0/0 {debusine.freexian.com|python-requests/2.28.1} {} "PUT /api/1.0/work-request/7050/completed/ HTTP/1.1"
2a01:4f8:2190:1c1f::1:7:45016 [25/Mar/2025:18:12:28.146] WILDCARD~ FREEXIAN_DEBUSINE/<NOSRV> 0/-1/-1/-1/+0 503 +217 - - SCNN 66/65/14/0/0 0/0 {debusine.freexian.com|python-requests/2.28.1} {} "PUT /api/1.0/work-request/7050/completed/ HTTP/1.1"
2a01:4f8:2190:1c1f::1:7:45016 [25/Mar/2025:18:12:28.146] WILDCARD~ FREEXIAN_DEBUSINE/<NOSRV> 0/-1/-1/-1/+0 503 +217 - - SCNN 66/65/14/0/0 0/0 {debusine.freexian.com|python-requests/2.28.1} {} "PUT /api/1.0/work-request/7050/completed/ HTTP/1.1"
2a01:4f8:2190:1c1f::1:7:45016 [25/Mar/2025:18:12:28.350] WILDCARD~ FREEXIAN_DEBUSINE/<NOSRV> 0/-1/-1/-1/+0 503 +217 - - SCNN 65/64/14/0/0 0/0 {debusine.freexian.com|python-requests/2.28.1} {} "PUT /api/1.0/work-request/7050/completed/ HTTP/1.1"
2a01:4f8:2190:1c1f::1:7:45016 [25/Mar/2025:18:12:28.350] WILDCARD~ FREEXIAN_DEBUSINE/<NOSRV> 0/-1/-1/-1/+0 503 +217 - - SCNN 65/64/14/0/0 0/0 {debusine.freexian.com|python-requests/2.28.1} {} "PUT /api/1.0/work-request/7050/completed/ HTTP/1.1"
2a01:4f8:2190:1c1f::1:7:45016 [25/Mar/2025:18:12:28.755] WILDCARD~ FREEXIAN_DEBUSINE/<NOSRV> 0/-1/-1/-1/+0 503 +217 - - SCNN 65/64/14/0/0 0/0 {debusine.freexian.com|python-requests/2.28.1} {} "PUT /api/1.0/work-request/7050/completed/ HTTP/1.1"
2a01:4f8:2190:1c1f::1:7:45016 [25/Mar/2025:18:12:28.755] WILDCARD~ FREEXIAN_DEBUSINE/<NOSRV> 0/-1/-1/-1/+0 503 +217 - - SCNN 65/64/14/0/0 0/0 {debusine.freexian.com|python-requests/2.28.1} {} "PUT /api/1.0/work-request/7050/completed/ HTTP/1.1"
2a01:4f8:2190:1c1f::1:7:45016 [25/Mar/2025:18:12:29.560] WILDCARD~ FREEXIAN_DEBUSINE/<NOSRV> 0/-1/-1/-1/+0 503 +217 - - SCNN 66/65/14/0/0 0/0 {debusine.freexian.com|python-requests/2.28.1} {} "PUT /api/1.0/work-request/7050/completed/ HTTP/1.1"
2a01:4f8:2190:1c1f::1:7:45016 [25/Mar/2025:18:12:29.560] WILDCARD~ FREEXIAN_DEBUSINE/<NOSRV> 0/-1/-1/-1/+0 503 +217 - - SCNN 66/65/14/0/0 0/0 {debusine.freexian.com|python-requests/2.28.1} {} "PUT /api/1.0/work-request/7050/completed/ HTTP/1.1"
2a01:4f8:2190:1c1f::1:7:37478 [25/Mar/2025:18:24:37.494] WILDCARD~ FREEXIAN_DEBUSINE/hestia 0/0/8/12/+20 204 +235 - - --NN 59/58/15/14/0 0/0 {debusine.freexian.com|Python/3.11 aiohttp/3.8.4} {0} "PUT /api/1.0/worker/dynamic-metadata/ HTTP/1.1"
2a01:4f8:2190:1c1f::1:7:41784 [25/Mar/2025:19:24:37.147] WILDCARD~ FREEXIAN_DEBUSINE/hestia 0/0/7/12/+19 204 +235 - - --NN 62/61/15/14/0 0/0 {debusine.freexian.com|Python/3.11 aiohttp/3.8.4} {0} "PUT /api/1.0/worker/dynamic-metadata/ HTTP/1.1"
I don't have any explanation why within 10 seconds we turned from HTTP 200 to HTTP 403 but what's worth pointing out is that I don't see any request to "/api/1.0/work-request/get-next-for-worker/" after the failure to report the result, and as such the worker is still believed to be busy on the server side. Anyway that extra HTTP call would have likely failed as well, and as such it would make sense to have some infinite retry (with some backoff) of that call until it succeeds.
Alternate idea: we could also hook something into the regular "dynamic-metadata" report to indicate whether the worker is busy or not, and have the server react on that if there's a mismatch.