Sometimes jobs are processed and get a final status in the database, but never leave the rabbitmq queue. For example, we just checked: the database says s390x has 14010 pending jobs, while munin says the queue has ~14540 items, so there is a discrepancy of ~500 there.

Looking at the munin history, we can see that when the last peak of jobs was fully consumed, all architectures went down to 0, but s390x never went below those ~500.

This could be related to unreliable connectivity between the workers and the server. Maybe making the amqp-consume processes live for less time (i.e. exit after processing 100 jobs) could alleviate this, but this needs to be investigated.
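For reference, one way to compare the two numbers side by side. This is only a sketch: the rabbitmqctl call is standard, but the queue name filter, the database/user names and the jobs/status schema in the SQL are assumptions that would need adjusting to the actual debci setup.

```sh
# messages currently sitting in the rabbitmq queues (filtering for s390x)
sudo rabbitmqctl list_queues name messages | grep s390x

# jobs the database still considers pending -- hypothetical query; the
# "jobs" table, "status" column and database/user names are assumptions
sudo -u debci psql debci -c \
  "SELECT count(*) FROM jobs WHERE arch = 's390x' AND status IS NULL;"
```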
@elbrus I was researching this a little bit. Our debci-worker@.service unit file already has Restart=on-failure and RestartSec=5, so it should already restart on failures. However, there is also this in systemd.unit(5):
> StartLimitIntervalSec=interval, StartLimitBurst=burst
>
> Configure unit start rate limiting. Units which are started more than burst times within an interval time interval are not permitted to start any more. Use StartLimitIntervalSec= to configure the checking interval (defaults to DefaultStartLimitIntervalSec= in manager configuration file, set it to 0 to disable any kind of rate limiting). Use StartLimitBurst= to configure how many starts per interval are allowed (defaults to DefaultStartLimitBurst= in manager configuration file). These configuration options are particularly useful in conjunction with the service setting Restart= (see systemd.service(5)); however, they apply to all kinds of starts (including manual), not just those triggered by the Restart= logic. Note that units which are configured for Restart= and which reach the start limit are not attempted to be restarted anymore; however, they may still be restarted manually at a later point, after the interval has passed. From this point on, the restart logic is activated again.
>
> Note that systemctl reset-failed will cause the restart rate counter for a service to be flushed, which is useful if the administrator wants to manually start a unit and the start limit interferes with that. Note that this rate-limiting is enforced after any unit condition checks are executed, and hence unit activations with failing conditions do not count towards this rate limit.
>
> This setting does not apply to slice, target, device, and scope units, since they are unit types whose activation may either never fail, or may succeed only a single time. When a unit is unloaded due to the garbage collection logic (see above) its rate limit counters are flushed out too. This means that configuring start rate limiting for a unit that is not referenced continuously has no effect.
So maybe we can tweak those values to ensure that, if a network disconnection makes the worker fail repeatedly, we don't hit the start rate limit and end up with the service in a state that needs a manual restart.
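For illustration, a minimal sketch of what such a tweak could look like as a drop-in (e.g. created with `systemctl edit debci-worker@.service`); the numbers are placeholders, not tested values:

```ini
# /etc/systemd/system/debci-worker@.service.d/override.conf
# sketch only: the numbers below are illustrative placeholders
[Unit]
# allow up to 20 (re)starts within a 10 minute window before systemd
# stops trying and the unit would need a manual reset-failed/start
StartLimitIntervalSec=10min
StartLimitBurst=20
```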
@terceiro I wonder... I fear that most of the workers that I restarted manually were actually still running, but just not doing anything useful. I think it's the underlying code that isn't properly detecting the network problems.
I'm seeing quite a few "waiting for header frame: a SSL error occurred" errors (and debci-worker@* restarts) on unmatched031 and on ci-worker-s390x-01.
I wonder if raising the timeouts in rabbitmq would fix that and this issue.
I guess we could increase ssl_handshake_timeout and handshake_timeout.
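If we go that route, a sketch of what those settings could look like in the new-style rabbitmq.conf; values are in milliseconds and purely illustrative:

```ini
# /etc/rabbitmq/rabbitmq.conf -- sketch only, values are illustrative
# AMQP connection handshake timeout (default 10000 ms)
handshake_timeout = 60000
# TLS handshake timeout (default 5000 ms)
ssl_handshake_timeout = 60000
```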
I'm not sure that would help, as the ssl handshake happens at the beginning of a connection, and it seems our problem happens after a while in long-running connections?
I don't know the internals of amqp, but I read that there's a heartbeat thingy, and also that debci-worker will restart internally after finishing a job. I can't really say at which of these stages a connection gets started:
```sh
# if the user calls this, we run forever with consuming messages;
# amqp-consume calls ourselves with the (hidden) --do-request option
amqp_queue="${debci_amqp_queue}${tags}"
log "I: Connecting to AMQP queue $amqp_queue on ${debci_amqp_server_display}"
debci amqp declare-queue
exec amqp-consume \
  --url ${debci_amqp_server} \
  $debci_amqp_tools_options \
  --queue=$amqp_queue \
  --prefetch-count 1 \
  -- \
  $0 --do-request
```
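Just to illustrate the two ideas mentioned above (a client-side heartbeat, and exiting after N jobs as suggested in the issue description), a sketch of how the amqp-consume call could be changed — assuming the amqp-tools version on the workers actually supports the --heartbeat and --count options (worth checking with `amqp-consume --help` first):

```sh
# sketch only: availability of --heartbeat and --count depends on the
# installed amqp-tools version; the numbers are placeholders
exec amqp-consume \
  --url ${debci_amqp_server} \
  $debci_amqp_tools_options \
  --queue=$amqp_queue \
  --prefetch-count 1 \
  --heartbeat=60 \
  --count=100 \
  -- \
  $0 --do-request
```

Note that if amqp-consume exits cleanly after --count messages, Restart=on-failure would not bring the service back up, so the unit would probably also need Restart=always for this to work.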
I tried the longer timeout (60 seconds); unfortunately it doesn't solve the issue:
```
unmatched031: Active: active (running) since Tue 2022-06-28 19:58:18 UTC; 28min ago
unmatched031: Active: active (running) since Tue 2022-06-28 19:58:18 UTC; 28min ago
unmatched031: Active: active (running) since Tue 2022-06-28 20:24:36 UTC; 2min 9s ago
unmatched032: Active: active (running) since Tue 2022-06-28 19:55:38 UTC; 28min ago
unmatched032: Active: active (running) since Tue 2022-06-28 19:55:38 UTC; 28min ago
unmatched032: Active: active (running) since Tue 2022-06-28 19:55:38 UTC; 28min ago
unmatched034: Active: active (running) since Tue 2022-06-28 20:23:36 UTC; 30s ago
unmatched034: Active: active (running) since Tue 2022-06-28 20:22:27 UTC; 1min 39s ago
unmatched034: Active: active (running) since Tue 2022-06-28 19:55:38 UTC; 28min ago
unmatched033: Active: active (running) since Tue 2022-06-28 20:23:12 UTC; 54s ago
unmatched033: Active: active (running) since Tue 2022-06-28 19:55:38 UTC; 28min ago
unmatched033: Active: active (running) since Tue 2022-06-28 19:55:38 UTC; 28min ago
```
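In case it helps anyone reproduce this kind of overview, a hypothetical way to collect those lines (the host list and the unit glob are assumptions):

```sh
# sketch: print the Active: line of every debci-worker instance per host
for h in unmatched031 unmatched032 unmatched033 unmatched034; do
  ssh "$h" "systemctl status 'debci-worker@*'" | grep -o 'Active:.*' \
    | sed "s/^/$h: /"
done
```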
Do we know whether adjusting/removing the server confirmations limit in the debci-worker makes a difference? (working theory: if the ppc and s390x workers are on relatively-high-latency (and/or reduced bandwidth) network connections, then any degradations to throughput from that setting might be amplified)
(another thought: since both ppc and s390x queues currently exhibit the problem, perhaps applying the change only to one of them would be a way to get some feedback on whether a modification has a beneficial effect?)
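(If the "confirmations limit" above refers to the --prefetch-count 1 in the worker script quoted earlier — an assumption on my part — the latency theory is easy to quantify: with a prefetch of 1 the broker keeps at most one unacknowledged message in flight, so delivery is capped at roughly one message per round trip; on a ~200 ms RTT link that is at most ~5 messages per second, regardless of bandwidth.)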
I haven't seen this in a while. I suspect that now having debci-publish also helps; maybe it's even the solution. Therefore I'm closing this issue, although I'm not 100% sure what fixed it.