Sometimes jobs are processed and get a final status in the database, but never leave the rabbitmq queue. For example, we just checked: the database says s390x has 14010 pending jobs, while munin says the queue has ~14540 items, so there is a discrepancy of ~500 there.

Looking at the munin history, we can see that when the last peak of jobs was fully consumed, all architectures went down to 0, but s390x never went below those ~500.

This could be related to unreliable connectivity between the workers and the server. Maybe making the amqp-consume processes live for less time (i.e. exit after processing 100 jobs) could alleviate this, but this needs to be investigated.
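For reference, one way to compare the two numbers side by side. This is only a sketch: the rabbitmqctl call is standard, but the queue name filter, the database/user names and the jobs/status schema in the SQL are assumptions that would need adjusting to the actual debci setup.

```sh
# messages currently sitting in the rabbitmq queues (filtering for s390x)
sudo rabbitmqctl list_queues name messages | grep s390x

# jobs the database still considers pending -- hypothetical query; the
# "jobs" table, "status" column and database/user names are assumptions
sudo -u debci psql debci -c \
  "SELECT count(*) FROM jobs WHERE arch = 's390x' AND status IS NULL;"
```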
@elbrus I was researching this a little bit. Our debci-worker@.service unit file already has Restart=on-failure and RestartSec=5, so it should already restart on failures. However, there is also this in systemd.unit(5):
> StartLimitIntervalSec=interval, StartLimitBurst=burst
>
> Configure unit start rate limiting. Units which are started more than burst times within an interval time interval are not permitted to start any more. Use StartLimitIntervalSec= to configure the checking interval (defaults to DefaultStartLimitIntervalSec= in manager configuration file, set it to 0 to disable any kind of rate limiting). Use StartLimitBurst= to configure how many starts per interval are allowed (defaults to DefaultStartLimitBurst= in manager configuration file). These configuration options are particularly useful in conjunction with the service setting Restart= (see systemd.service(5)); however, they apply to all kinds of starts (including manual), not just those triggered by the Restart= logic. Note that units which are configured for Restart= and which reach the start limit are not attempted to be restarted anymore; however, they may still be restarted manually at a later point, after the interval has passed. From this point on, the restart logic is activated again.
>
> Note that systemctl reset-failed will cause the restart rate counter for a service to be flushed, which is useful if the administrator wants to manually start a unit and the start limit interferes with that. Note that this rate-limiting is enforced after any unit condition checks are executed, and hence unit activations with failing conditions do not count towards this rate limit.
>
> This setting does not apply to slice, target, device, and scope units, since they are unit types whose activation may either never fail, or may succeed only a single time. When a unit is unloaded due to the garbage collection logic (see above) its rate limit counters are flushed out too. This means that configuring start rate limiting for a unit that is not referenced continuously has no effect.
So maybe we can tweak those values to ensure that, if a network disconnection makes the worker fail repeatedly, we don't hit the start rate limit and end up with the service in a state that needs a manual restart.
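For illustration, a minimal sketch of what such a tweak could look like as a drop-in (e.g. created with `systemctl edit debci-worker@.service`); the numbers are placeholders, not tested values:

```ini
# /etc/systemd/system/debci-worker@.service.d/override.conf
# sketch only: the numbers below are illustrative placeholders
[Unit]
# allow up to 20 (re)starts within a 10 minute window before systemd
# stops trying and the unit would need a manual reset-failed/start
StartLimitIntervalSec=10min
StartLimitBurst=20
```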
@terceiro I wonder... I fear that most of the workers that I restarted manually were actually still running, but just not doing anything useful. I think it's the underlying code that isn't properly detecting the network problems.
I'm seeing quite a few "waiting for header frame: a SSL error occurred" errors (and debci-worker@* restarts) on unmatched031 and on ci-worker-s390x-01.
I wonder if raising the timeouts in rabbitmq would fix that and this issue.
I guess we could increase ssl_handshake_timeout and handshake_timeout.
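If we go that route, a sketch of what those settings could look like in the new-style rabbitmq.conf; values are in milliseconds and purely illustrative:

```ini
# /etc/rabbitmq/rabbitmq.conf -- sketch only, values are illustrative
# AMQP connection handshake timeout (default 10000 ms)
handshake_timeout = 60000
# TLS handshake timeout (default 5000 ms)
ssl_handshake_timeout = 60000
```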
I'm not sure that would help, as the ssl handshake happens at the beginning of a connection, and it seems our problem happens after a while in long-running connections?
I don't know the internals of amqp, but I read that there's a heartbeat thingy, and also that debci-worker will restart internally after finishing a job. I can't really say at which of these stages a connection gets started:
```sh
# if the user calls this, we run forever with consuming messages;
# amqp-consume calls ourselves with the (hidden) --do-request option
amqp_queue="${debci_amqp_queue}${tags}"
log "I: Connecting to AMQP queue $amqp_queue on ${debci_amqp_server_display}"
debci amqp declare-queue
exec amqp-consume \
  --url ${debci_amqp_server} \
  $debci_amqp_tools_options \
  --queue=$amqp_queue \
  --prefetch-count 1 \
  -- \
  $0 --do-request
```
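Just to illustrate the two ideas mentioned above (a client-side heartbeat, and exiting after N jobs as suggested in the issue description), a sketch of how the amqp-consume call could be changed — assuming the amqp-tools version on the workers actually supports the --heartbeat and --count options (worth checking with `amqp-consume --help` first):

```sh
# sketch only: availability of --heartbeat and --count depends on the
# installed amqp-tools version; the numbers are placeholders
exec amqp-consume \
  --url ${debci_amqp_server} \
  $debci_amqp_tools_options \
  --queue=$amqp_queue \
  --prefetch-count 1 \
  --heartbeat=60 \
  --count=100 \
  -- \
  $0 --do-request
```

Note that if amqp-consume exits cleanly after --count messages, Restart=on-failure would not bring the service back up, so the unit would probably also need Restart=always for this to work.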
I tried the longer timeout (60 seconds); unfortunately it doesn't solve the issue:
```
unmatched031: Active: active (running) since Tue 2022-06-28 19:58:18 UTC; 28min ago
unmatched031: Active: active (running) since Tue 2022-06-28 19:58:18 UTC; 28min ago
unmatched031: Active: active (running) since Tue 2022-06-28 20:24:36 UTC; 2min 9s ago
unmatched032: Active: active (running) since Tue 2022-06-28 19:55:38 UTC; 28min ago
unmatched032: Active: active (running) since Tue 2022-06-28 19:55:38 UTC; 28min ago
unmatched032: Active: active (running) since Tue 2022-06-28 19:55:38 UTC; 28min ago
unmatched034: Active: active (running) since Tue 2022-06-28 20:23:36 UTC; 30s ago
unmatched034: Active: active (running) since Tue 2022-06-28 20:22:27 UTC; 1min 39s ago
unmatched034: Active: active (running) since Tue 2022-06-28 19:55:38 UTC; 28min ago
unmatched033: Active: active (running) since Tue 2022-06-28 20:23:12 UTC; 54s ago
unmatched033: Active: active (running) since Tue 2022-06-28 19:55:38 UTC; 28min ago
unmatched033: Active: active (running) since Tue 2022-06-28 19:55:38 UTC; 28min ago
```
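In case it helps anyone reproduce this kind of overview, a hypothetical way to collect those lines (the host list and the unit glob are assumptions):

```sh
# sketch: print the Active: line of every debci-worker instance per host
for h in unmatched031 unmatched032 unmatched033 unmatched034; do
  ssh "$h" "systemctl status 'debci-worker@*'" | grep -o 'Active:.*' \
    | sed "s/^/$h: /"
done
```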
Do we know whether adjusting/removing the server confirmations limit in the debci-worker makes a difference? (working theory: if the ppc and s390x workers are on relatively-high-latency (and/or reduced bandwidth) network connections, then any degradations to throughput from that setting might be amplified)
(another thought: since both ppc and s390x queues currently exhibit the problem, perhaps applying the change only to one of them would be a way to get some feedback on whether a modification has a beneficial effect?)
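(If the "confirmations limit" above refers to the --prefetch-count 1 in the worker script quoted earlier — an assumption on my part — the latency theory is easy to quantify: with a prefetch of 1 the broker keeps at most one unacknowledged message in flight, so delivery is capped at roughly one message per round trip; on a ~200 ms RTT link that is at most ~5 messages per second, regardless of bandwidth.)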
I haven't seen this in a while. I suspect that now having debci-publish also helps; maybe it's even the solution. Therefore I'm closing this issue, although I'm not 100% sure what fixed it.