do not restart on syntax errors

In commit 1a61bbb1 ("Change systemd service Restart directive from always to on-failure"), we correctly fixed the systemd unit to not restart when Prometheus shuts down cleanly (for example) through the API and so on.

What our current configuration does not correctly handle, however, is configuration errors. If, by some bad luck or bad will, you end up with a broken Prometheus configuration file (or alerting rules files, for that matter), systemd will repeatedly (and infinitely) try to restart Prometheus.

That is wrong; we should fail the unit and stop retrying, when a non-recoverable error condition happens.

This is particularly nasty when there's an error in alerting rules as those will load the WAL and modify it, which means it will keep growing with each restart. That (separate) problem was reported upstream as:

https://github.com/prometheus/prometheus/issues/11486

... and might therefore eventually be fixed. But it seems better to have defense in depth here and correctly handle such failures mode.

I have also considered using StartLimitBurst and StartLimitIntervalSec but those are highly context-dependent. One production server might do its startup routine in a few miliseconds, while another might take a minute to parse its entire WAL (which was my case). How do you make those thresholds work correctly to cover all those possible use cases? I would argue it's basically impossible to tweak those settings to handle the most use cases correctly.

Also note that there's a (apparently undocumented?) NRestarts settings that could help with this as well: it would keep such a runaway spiral from consuming all disk space and eventually stop trying to restart the service, at least. But then we get to a similar dilemma than the StartLimit* settings: how often is enough or too much?

Really, Restart=on-abornmal is what we want here. We do not want to systematically restart Prometheus automatically on a "normal failure", but yes, we do want to recover from crashes automatically.

This is what this patch does.

Closes: #1022724

Edited by Antoine Beaupré

Merge request reports

Loading