Bug #11221

Updated by Tom Clegg about 7 years ago

Currently, when postgresql is unavailable for ~300ms while arvados-ws is trying to start (which is normal for a booting machine), the default systemd settings try 3x with 100ms between attempts, then give up: 

 <pre> 
 ... 
 Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Starting Arvados websocket server... 
 Feb 24 21:12:10 4xphq.arvadosapi.com arvados-ws[1641]: {"level":"info","msg":"started","time":"2017-02-24T21:12:10.388627496Z"} 
 Feb 24 21:12:10 4xphq.arvadosapi.com arvados-ws[1641]: {"Listen":":8100","level":"info","msg":"listening","time":"2017-02-24T21:12:10.388767216Z"} 
 Feb 24 21:12:10 4xphq.arvadosapi.com arvados-ws[1641]: {"error":"dial tcp [::1]:5432: getsockopt: connection refused","level":"fatal","msg":"db.Ping failed","time":"2017-02-24T21:12:10.389441626Z"} 
 Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Started Arvados websocket server. 
 Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: arvados-ws.service: main process exited, code=exited, status=1/FAILURE 
 Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Unit arvados-ws.service entered failed state. 
 Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: arvados-ws.service holdoff time over, scheduling restart. 
 Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Stopping Arvados websocket server... 
 Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Starting Arvados websocket server... 
 Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: arvados-ws.service start request repeated too quickly, refusing to start. 
 Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Failed to start Arvados websocket server. 
 Feb 24 21:12:10 4xphq.arvadosapi.com systemd[1]: Unit arvados-ws.service entered failed state. 
 </pre> 

 After systemd gives up, arvados-ws is down for hours/days until someone intervenes manually. 

 It would make more sense for all enabled services to keep retrying until startup succeeds, as runit does. 

 *Except* there might be some installations where the transition from runit to systemd is incomplete, both are trying to run the same service, and sysops are relying on the "give up after 3 attempts" behavior.
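
 For illustration, a "keep trying" policy could look something like the drop-in below. This is only a sketch, not a tested fix: the drop-in path and the 1-second delay are assumptions, and the option names depend on the systemd version (@StartLimitInterval=@ in @[Service]@ on older releases, @StartLimitIntervalSec=@ in @[Unit]@ on systemd 230 and later).

 <pre> 
 # /etc/systemd/system/arvados-ws.service.d/override.conf  (assumed drop-in location) 
 [Service] 
 # Restart whenever the process exits, e.g. after "db.Ping failed". 
 Restart=always 
 # Wait a little longer between attempts than the 100ms default (value is an assumption). 
 RestartSec=1 
 # 0 disables start rate limiting, so systemd never reaches "start request repeated too quickly". 
 StartLimitInterval=0 
 </pre> 

 A drop-in like this would need @systemctl daemon-reload@ before it takes effect, and it would remove exactly the "give up" behavior that the mixed runit/systemd installations mentioned above might be relying on.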