Bug #11221
closed
[systemd] Always restart exited services, even after 3 startup failures
Description
Currently, if postgresql is unavailable for ~300ms while arvados-ws is trying to start (which is normal for a booting machine), systemd with its default settings tries 3x with 100ms between attempts, then gives up:
...
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Starting Arvados websocket server...
Feb 24 21:12:10 4xphq.arvadosapi.com arvados-ws[1641]: {"level":"info","msg":"started","time":"2017-02-24T21:12:10.388627496Z"}
Feb 24 21:12:10 4xphq.arvadosapi.com arvados-ws[1641]: {"Listen":":8100","level":"info","msg":"listening","time":"2017-02-24T21:12:10.388767216Z"}
Feb 24 21:12:10 4xphq.arvadosapi.com arvados-ws[1641]: {"error":"dial tcp [::1]:5432: getsockopt: connection refused","level":"fatal","msg":"db.Ping failed","time":"2017-02-24T21:12:10.389441626Z"}
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Started Arvados websocket server.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: arvados-ws.service: main process exited, code=exited, status=1/FAILURE
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Unit arvados-ws.service entered failed state.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: arvados-ws.service holdoff time over, scheduling restart.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Stopping Arvados websocket server...
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Starting Arvados websocket server...
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: arvados-ws.service start request repeated too quickly, refusing to start.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Failed to start Arvados websocket server.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Unit arvados-ws.service entered failed state.
After systemd gives up, arvados-ws is down for hours/days until someone intervenes manually.
It would make more sense for all enabled services to keep trying until startup succeeds, like runit.
However, some installations might be in an incomplete transition from runit to systemd, with both trying to run the same service and sysops relying on the "give up after 3 attempts" behavior.
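For illustration, the "keep trying until startup succeeds" behavior could be sketched as a systemd drop-in like the one below. This is a minimal sketch under stated assumptions, not necessarily what the eventual fix contains: the drop-in file name is hypothetical, the 1-second delay is an arbitrary choice, and which rate-limit setting takes effect depends on the systemd version installed.

```ini
# Hypothetical drop-in path (assumption, not from this ticket):
#   /etc/systemd/system/arvados-ws.service.d/restart.conf

[Unit]
# Newer systemd releases (230 and later) read the start rate limit here.
# An interval of 0 disables start rate limiting.
StartLimitIntervalSec=0

[Service]
# Restart the service whenever the main process exits, whatever the exit status.
Restart=always
# Wait 1 second between attempts (illustrative value) instead of the 100ms default.
RestartSec=1
# Older systemd releases read the rate limit from [Service] under this name.
# A release that does not recognize a setting in a given section is expected
# to log a warning and ignore it.
StartLimitInterval=0
```

After adding a drop-in like this, `systemctl daemon-reload` makes systemd re-read the unit. A unit already in the failed state can be cleared with `systemctl reset-failed arvados-ws` and then started again.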
Updated by Tom Clegg almost 8 years ago
- Description updated (diff)
- Category set to Deployment
- Assigned To set to Tom Clegg
11221-always-restart-services @ 273a233818ae39e843fab0276f9e381da6645d28
Updated by Nico César almost 8 years ago
review at 273a233818ae39e843fab0276f9e381da6645d28
Ready to merge
Updated by Tom Clegg almost 8 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:a8378b8deaa2bbf9d2c154d9d9bb072538c288cc.