Scalability issues with large workflows
When creating a workflow with many work requests (e.g. reverse dependency autopkgtests of glibc), a celery worker creates all autopkgtest work requests upfront. That scheduling can take significant time (several hours), but the web interface remains accessible during that time as no big transaction lock is taken.
Once one of the builds succeeds, a web request eventually causes daphne to mark the build as completed, and the same daphne worker then proceeds to mark all autopkgtests as pending. This takes about 5 seconds per work request, so once there are more than, say, 60 work requests, daphne is stuck for at least 5 minutes.
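To put numbers on this, here is a rough back-of-the-envelope sketch. It assumes the cost stays at roughly 5 seconds per work request and grows linearly, which is only the approximate figure observed above; the function name and the sample counts are made up for illustration and are not part of Debusine.

```python
# Assumption: ~5 seconds per work request, the approximate figure quoted in
# this report. Real timings will vary with database load.
SECONDS_PER_WORK_REQUEST = 5.0


def estimated_stall_seconds(
    work_requests: int, seconds_each: float = SECONDS_PER_WORK_REQUEST
) -> float:
    """Estimate how long one worker is blocked when it transitions
    work requests one after another (hypothetical helper)."""
    return work_requests * seconds_each


# Illustrative workflow sizes, not measured values.
for count in (60, 500, 2000):
    minutes = estimated_stall_seconds(count) / 60
    print(f"{count} work requests: about {minutes:.0f} minutes blocked")
```

Under this assumption, 60 work requests already block the worker for 300 seconds, and a workflow with thousands of work requests would block it for hours.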
If you are unfortunate enough to restart Debusine at this point, it aborts its transaction and, as a result, marks the workflow as aborted. It then proceeds to transition all of those work requests to the aborted state (one every five seconds) and is stuck again. Attempting to restart debusine-server.service makes it abort all those work requests again.