Bug #11221

closed

[systemd] Always restart exited services, even after 3 startup failures

Added by Tom Clegg about 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Deployment
Target version:
Story points: 0.5

Description

Currently, when postgresql is unavailable for ~300ms while arvados-ws is trying to start (which is normal for a booting machine), systemd with its default settings tries 3x with 100ms between attempts, then gives up:

...
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Starting Arvados websocket server...
Feb 24 21:12:10 4xphq.arvadosapi.com arvados-ws[1641]: {"level":"info","msg":"started","time":"2017-02-24T21:12:10.388627496Z"}
Feb 24 21:12:10 4xphq.arvadosapi.com arvados-ws[1641]: {"Listen":":8100","level":"info","msg":"listening","time":"2017-02-24T21:12:10.388767216Z"}
Feb 24 21:12:10 4xphq.arvadosapi.com arvados-ws[1641]: {"error":"dial tcp [::1]:5432: getsockopt: connection refused","level":"fatal","msg":"db.Ping failed","time":"2017-02-24T21:12:10.389441626Z"}
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Started Arvados websocket server.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: arvados-ws.service: main process exited, code=exited, status=1/FAILURE
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Unit arvados-ws.service entered failed state.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: arvados-ws.service holdoff time over, scheduling restart.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Stopping Arvados websocket server...
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Starting Arvados websocket server...
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: arvados-ws.service start request repeated too quickly, refusing to start.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Failed to start Arvados websocket server.
Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Unit arvados-ws.service entered failed state.

After systemd gives up, arvados-ws is down for hours/days until someone intervenes manually.

It would make more sense for all enabled services to keep trying until startup succeeds, like runit.

Except there might be some installations where the transition from runit to systemd is incomplete, both are trying to run the same service, and sysops are relying on the "give up after 3 attempts" behavior.
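
For reference, one way to get the "keep trying until startup succeeds" behavior from systemd is to always restart the service and disable start rate limiting. The fragment below is a minimal sketch, not necessarily what the changeset for this issue applies: the drop-in path and the 1-second delay are assumptions, and which start-limit directive takes effect depends on the systemd version.

```ini
# Hypothetical drop-in: /etc/systemd/system/arvados-ws.service.d/restart.conf
[Unit]
# Newer systemd (v230+) reads the start limit from [Unit]; 0 disables rate limiting.
StartLimitIntervalSec=0

[Service]
# Restart whenever the main process exits, regardless of exit status.
Restart=always
# Assumed delay between attempts, chosen to avoid a tight restart loop.
RestartSec=1
# Older systemd reads the start limit from [Service]; 0 disables rate limiting.
StartLimitInterval=0
```

After adding a drop-in like this, `systemctl daemon-reload` followed by `systemctl restart arvados-ws` would pick it up. On installations where runit also supervises the same service, both supervisors would then keep restarting it indefinitely, which is the risk noted above.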


Subtasks 1 (0 open, 1 closed)

Task #11311: Review 11221-always-restart-services (Resolved, Nico César, 03/23/2017)

Related issues

Related to Arvados - Feature #10766: [Docs] [arvados-ws] make the arvados-ws documentation official, remove all mentions of the old puma websockets setup (Resolved, Tom Clegg, 03/23/2017)
#1

Updated by Tom Clegg about 7 years ago

  • Description updated (diff)
  • Category set to Deployment
  • Assigned To set to Tom Clegg

11221-always-restart-services @ 273a233818ae39e843fab0276f9e381da6645d28

#2

Updated by Tom Clegg about 7 years ago

  • Target version set to 2017-03-29 sprint
#3

Updated by Nico César about 7 years ago

#4

Updated by Tom Clegg about 7 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:a8378b8deaa2bbf9d2c154d9d9bb072538c288cc.
