only restart on abnormal failure conditions
In my lab, I hit some error condition where the server would crash on startup with:
    jan 30 10:09:50 marcos java[1748495]: Exception in thread "main" java.lang.ClassCastException: class schema.core.FnSchema cannot be cast to class schema.core.FnSchema (schema.core.FnSchema is in unnamed module of loader clojure.lang.DynamicClassLoader @6af78a48; schema.core.FnSchema is in unnamed module of loader 'app')
    jan 30 10:09:50 marcos java[1748495]: at plumbing.fnk.pfnkinput.invokeStatic(pfnk.cljc:19)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.fnk.pfnkinput.invoke(pfnk.cljc:18)
    jan 30 10:09:50 marcos java[1748495]: at clojure.corejuxtfn__5891.invoke(core.clj:2611)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.fnk.pfnkfn__4109.invokeStatic(pfnk.cljc:35)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.fnk.pfnkfn__4109.invoke(pfnk.cljc:30)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.fnk.pfnkfn__4090G__4085__4095.invoke(pfnk.cljc:13)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.fnk.pfnkinput_schema.invokeStatic(pfnk.cljc:38)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.fnk.pfnkinput_schema.invoke(pfnk.cljc:37)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.fnk.pfnkinput_schema_keys.invokeStatic(pfnk.cljc:44)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.fnk.pfnkinput_schema_keys.invoke(pfnk.cljc:43)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.coremap_valsfn__4262.invoke(core.cljc:50)
    jan 30 10:09:50 marcos java[1748495]: at clojure.lang.PersistentHashMapNodeSeq.kvreduce(PersistentHashMap.java:1307)
    jan 30 10:09:50 marcos java[1748495]: at clojure.lang.PersistentHashMapBitmapIndexedNode.kvreduce(PersistentHashMap.java:802)
    jan 30 10:09:50 marcos java[1748495]: at clojure.lang.PersistentHashMapArrayNode.kvreduce(PersistentHashMap.java:466)
    jan 30 10:09:50 marcos java[1748495]: at clojure.lang.PersistentHashMap.kvreduce(PersistentHashMap.java:236)
    jan 30 10:09:50 marcos java[1748495]: at clojure.corefn__8525.invokeStatic(core.clj:6908)
    jan 30 10:09:50 marcos java[1748495]: at clojure.corefn__8525.invoke(core.clj:6888)
    jan 30 10:09:50 marcos java[1748495]: at clojure.core.protocolsfn__8257G__8252__8266.invoke(protocols.clj:175)
    jan 30 10:09:50 marcos java[1748495]: at clojure.corereduce_kv.invokeStatic(core.clj:6919)
    jan 30 10:09:50 marcos java[1748495]: at clojure.corereduce_kv.invoke(core.clj:6910)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.coremap_vals.invokeStatic(core.cljc:50)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.coremap_vals.invoke(core.cljc:43)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.graph__GT_graph.invokeStatic(graph.cljc:64)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.graph__GT_graph.invoke(graph.cljc:47)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.grapheager_compileeager_compile__5949.invoke(graph.cljc:148)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.grapheager_compile.invokeStatic(graph.cljc:155)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.grapheager_compile.invoke(graph.cljc:130)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.grapheager_compile.invokeStatic(graph.cljc:141)
    jan 30 10:09:50 marcos java[1748495]: at plumbing.grapheager_compile.invoke(graph.cljc:130)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.internalcompile_graphfn__14443.invoke(internal.clj:143)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.internalcompile_graph.invokeStatic(internal.clj:142)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.internalcompile_graph.invoke(internal.clj:136)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.internalfn__15135build_app_STAR___15144fn__15145.invoke(internal.clj:590)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.internalfn__15135build_app_STAR___15144.invoke(internal.clj:563)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.internalfn__15241boot_services_STAR___15250fn__15251fn__15252.invoke(internal.clj:671)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.internalfn__15241boot_services_STAR___15250fn__15251.invoke(internal.clj:670)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.internalfn__15241boot_services_STAR___15250.invoke(internal.clj:665)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.corefn__16017boot_with_cli_data__16024fn__16025.invoke(core.clj:132)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.corefn__16017boot_with_cli_data__16024.invoke(core.clj:97)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.corefn__16046run__16051fn__16052.invoke(core.clj:155)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.corefn__16046run__16051.invoke(core.clj:149)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.coremain.invokeStatic(core.clj:224)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.coremain.doInvoke(core.clj:210)
    jan 30 10:09:50 marcos java[1748495]: at clojure.lang.RestFn.applyTo(RestFn.java:137)
    jan 30 10:09:50 marcos java[1748495]: at clojure.lang.Var.applyTo(Var.java:705)
    jan 30 10:09:50 marcos java[1748495]: at clojure.coreapply.invokeStatic(core.clj:667)
    jan 30 10:09:50 marcos java[1748495]: at clojure.coreapply.invoke(core.clj:662)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.main_main.invokeStatic(main.clj:7)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.main$_main.doInvoke(main.clj:4)
    jan 30 10:09:50 marcos java[1748495]: at clojure.lang.RestFn.applyTo(RestFn.java:137)
    jan 30 10:09:50 marcos java[1748495]: at puppetlabs.trapperkeeper.main.main(Unknown Source)
    jan 30 10:09:50 marcos systemd[1]: puppetserver.service: Main process exited, code=exited, status=1/FAILURE
Eventually, systemd kills the process because it fails to start, in which case it enters this state:
    jan 30 10:09:45 marcos systemd[1]: puppetserver.service: start-post operation timed out. Terminating.
    jan 30 10:09:45 marcos systemd[1]: puppetserver.service: Control process exited, code=killed, status=15/TERM
    jan 30 10:09:45 marcos systemd[1]: puppetserver.service: Failed with result 'exit-code'.
    jan 30 10:09:45 marcos systemd[1]: Failed to start Puppet Server.
    jan 30 10:09:45 marcos systemd[1]: puppetserver.service: Consumed 16.434s CPU time, no IP traffic.
... and then it starts again, and again, and again...
    jan 30 10:09:46 marcos systemd[1]: puppetserver.service: Scheduled restart job, restart counter is at 326.
    jan 30 10:09:46 marcos systemd[1]: Stopped Puppet Server.
    jan 30 10:09:46 marcos systemd[1]: puppetserver.service: Consumed 16.434s CPU time, no IP traffic.
    jan 30 10:09:46 marcos systemd[1]: Starting Puppet Server...
Thankfully there's a 5-minute timeout there, but if that messy startup script ever gets fixed, the restart loop will spin much faster and this will get much worse.
Cut through all that crap and don't try to restart on bad exit values, only on real crashes.
For details on this policy, look at systemd.service(5) and grep for 'on-abnormal': there's a nice table there explaining what each Restart= setting does.
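For reference, the policy boils down to a single setting in the [Service] section. This is only a sketch: the drop-in path in the comment is a hypothetical example, not necessarily where this change lives, and it assumes the unit is still named puppetserver.service as in the logs above.

```ini
# Hypothetical drop-in location, for illustration only:
#   /etc/systemd/system/puppetserver.service.d/restart.conf
[Service]
# Per the table in systemd.service(5): restart on an unclean signal,
# an operation timeout or a watchdog timeout, but not on an unclean
# exit code such as status=1/FAILURE.
Restart=on-abnormal
```

After changing a unit file or drop-in, `systemctl daemon-reload` is needed for systemd to pick it up.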