Bug #11221

[systemd] Always restart exited services, even after 3 startup failures

Added by Tom Clegg 9 months ago. Updated 8 months ago.

Status: Resolved
Start date: 03/23/2017
Priority: Normal
Due date:
Assignee: Tom Clegg
% Done: 100%
Category: Deployment
Target version: 2017-03-29 sprint
Story points: 0.5
Remaining (hours): 0.00 hour
Velocity based estimate: 0 days

Description

Currently, when postgresql is unavailable for ~300ms while arvados-ws is trying to start (which is normal for a booting machine), the default systemd settings try 3x with 100ms between attempts, then give up:

...
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Starting Arvados websocket server...
Feb 24 21:12:10 4xphq.arvadosapi.com arvados-ws[1641]: {"level":"info","msg":"started","time":"2017-02-24T21:12:10.388627496Z"}
Feb 24 21:12:10 4xphq.arvadosapi.com arvados-ws[1641]: {"Listen":":8100","level":"info","msg":"listening","time":"2017-02-24T21:12:10.388767216Z"}
Feb 24 21:12:10 4xphq.arvadosapi.com arvados-ws[1641]: {"error":"dial tcp [::1]:5432: getsockopt: connection refused","level":"fatal","msg":"db.Ping failed","time":"2017-02-24T21:12:10.389441626Z"}
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Started Arvados websocket server.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: arvados-ws.service: main process exited, code=exited, status=1/FAILURE
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Unit arvados-ws.service entered failed state.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: arvados-ws.service holdoff time over, scheduling restart.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Stopping Arvados websocket server...
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Starting Arvados websocket server...
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: arvados-ws.service start request repeated too quickly, refusing to start.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Failed to start Arvados websocket server.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Unit arvados-ws.service entered failed state.

After systemd gives up, arvados-ws is down for hours/days until someone intervenes manually.

It would make more sense for all enabled services to keep trying until startup succeeds, like runit.

Except there might be some installations where the transition from runit to systemd is incomplete, both are trying to run the same service, and sysops are relying on the "give up after 3 attempts" behavior.
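
For illustration, the "keep trying until startup succeeds" behavior could be sketched as a systemd drop-in like the one below. This is a minimal sketch and not necessarily what the 11221-always-restart-services branch does: the drop-in file name and the 1-second delay are assumptions, and the directive that controls start rate limiting has changed name and section between systemd releases, so check systemd.unit(5) and systemd.service(5) for the version in use.

```ini
# Hypothetical drop-in: /etc/systemd/system/arvados-ws.service.d/override.conf
# (the file name is an assumption; any *.conf in the .d directory is read)

[Unit]
# Disable start rate limiting so systemd never reports
# "start request repeated too quickly, refusing to start".
# Assumption: a systemd version that reads StartLimitIntervalSec= from [Unit];
# older releases used StartLimitInterval= in [Service] instead.
StartLimitIntervalSec=0

[Service]
# Restart the service whenever its main process exits, whatever the exit status.
Restart=always
# Wait 1 second between attempts instead of the 100ms default
# (assumed value, chosen to avoid a tight restart loop).
RestartSec=1
```

After adding or editing a drop-in, `systemctl daemon-reload` makes systemd reread the unit. On installations where runit still supervises the same service, a setting like this would make the two supervisors compete indefinitely, which is the caveat raised above.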


Subtasks

Task #11311: Review 11221-always-restart-services (Resolved, Nico César)


Related issues

Related to Arvados - Feature #10766: [Docs] [arvados-ws] make the arvados-ws documentation off... Resolved 03/23/2017

Associated revisions

Revision a8378b8d
Added by Tom Clegg 8 months ago

Merge branch '11221-always-restart-services'

closes #11221

History

#1 Updated by Tom Clegg 9 months ago

  • Description updated
  • Category set to Deployment
  • Assignee set to Tom Clegg

11221-always-restart-services @ 273a233818ae39e843fab0276f9e381da6645d28

#2 Updated by Tom Clegg 8 months ago

  • Target version set to 2017-03-29 sprint

#3 Updated by Nico César 8 months ago

#4 Updated by Tom Clegg 8 months ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:a8378b8deaa2bbf9d2c154d9d9bb072538c288cc.
