Bug #11221

closed

[systemd] Always restart exited services, even after 3 startup failures

Added by Tom Clegg almost 8 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Deployment
Target version:
Story points: 0.5

Description

Currently, when postgresql is unavailable for ~300ms while arvados-ws is trying to start (which is normal for a booting machine), the default systemd settings try 3x with 100ms between attempts, then give up:

...
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Starting Arvados websocket server...
Feb 24 21:12:10 4xphq.arvadosapi.com arvados-ws[1641]: {"level":"info","msg":"started","time":"2017-02-24T21:12:10.388627496Z"}
Feb 24 21:12:10 4xphq.arvadosapi.com arvados-ws[1641]: {"Listen":":8100","level":"info","msg":"listening","time":"2017-02-24T21:12:10.388767216Z"}
Feb 24 21:12:10 4xphq.arvadosapi.com arvados-ws[1641]: {"error":"dial tcp [::1]:5432: getsockopt: connection refused","level":"fatal","msg":"db.Ping failed","time":"2017-02-24T21:12:10.389441626Z"}
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Started Arvados websocket server.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: arvados-ws.service: main process exited, code=exited, status=1/FAILURE
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Unit arvados-ws.service entered failed state.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: arvados-ws.service holdoff time over, scheduling restart.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Stopping Arvados websocket server...
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Starting Arvados websocket server...
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: arvados-ws.service start request repeated too quickly, refusing to start.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Failed to start Arvados websocket server.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Unit arvados-ws.service entered failed state.

After systemd gives up, arvados-ws is down for hours/days until someone intervenes manually.

It would make more sense for all enabled services to keep trying until startup succeeds, like runit.

One caveat: there might be some installations where the transition from runit to systemd is incomplete, both are trying to run the same service, and sysops are relying on the "give up after 3 attempts" behavior.
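
A minimal sketch of the "keep trying until startup succeeds" behavior described above, written as a systemd drop-in. The drop-in path and the 1-second retry delay are assumptions for illustration, not settings taken from this ticket. Which start-limit directive applies depends on the systemd version: older releases read StartLimitInterval= in [Service], while newer ones read StartLimitIntervalSec= in [Unit], so the sketch sets both.

```ini
# Hypothetical drop-in: /etc/systemd/system/arvados-ws.service.d/override.conf

[Unit]
# Newer systemd: an interval of 0 disables start rate limiting,
# so the unit is never refused for restarting "too quickly".
StartLimitIntervalSec=0

[Service]
# Restart the service whenever the main process exits, whatever the exit status.
Restart=always
# Assumed delay between attempts (systemd's default is 100ms).
RestartSec=1
# Older systemd: same rate-limit setting, read from the [Service] section.
StartLimitInterval=0
```

After adding or changing a drop-in like this, `systemctl daemon-reload` makes systemd pick up the new settings. Note that this removes the "give up" behavior entirely, which is exactly what the caveat above is concerned about on hosts where runit still manages the same service.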


Subtasks 1 (0 open, 1 closed)

Task #11311: Review 11221-always-restart-services (Resolved, Nico César, 03/23/2017)

Related issues 1 (0 open, 1 closed)

Related to Arvados - Feature #10766: [Docs] [arvados-ws] make the arvados-ws documentation official, remove all mentions of the old puma websockets setup (Resolved, Tom Clegg, 03/23/2017)