Bug #14693
[arvbox] runsv fatal: unable to lock supervise/lock
Description
- runit: $Id: 25da3b86f7bed4038b8a039d2f8e8c9bbcf0822b $: booting.
- runit: warning: unable to open /dev/console: file does not exist
- runit: enter stage: /etc/runit/1
- runit: leave stage: /etc/runit/1
- runit: enter stage: /etc/runit/2
Arvados-in-a-box starting
runsv workbench: fatal: unable to lock supervise/lock: temporary failure
runsv crunch-dispatch1: fatal: unable to lock supervise/lock: temporary failure
runsv crunch-dispatch-local: fatal: unable to lock supervise/lock: temporary failure
runsv sdk: fatal: unable to lock supervise/lock: temporary failure
runsv sso: fatal: unable to lock supervise/lock: temporary failure
runsv keepstore0: fatal: unable to lock supervise/lock: temporary failure
runsv arv-git-httpd: fatal: unable to lock supervise/lock: temporary failure
runsv keepstore1: fatal: unable to lock supervise/lock: temporary failure
runsv keep-web: fatal: unable to lock supervise/lock: temporary failure
runsv keep-web: fatal: unable to lock supervise/lock: temporary failure
runsv sso: fatal: unable to lock supervise/lock: temporary failure
runsv crunch-dispatch-local: fatal: unable to lock supervise/lock: temporary failure
runsv crunch-dispatch1: fatal: unable to lock supervise/lock: temporary failure
runsv arv-git-httpd: fatal: unable to lock supervise/lock: temporary failure
runsv sdk: fatal: unable to lock supervise/lock: temporary failure
runsv keepstore0: fatal: unable to lock supervise/lock: temporary failure
runsv keepstore1: fatal: unable to lock supervise/lock: temporary failure
runsv workbench: fatal: unable to lock supervise/lock: temporary failure
runsv crunch-dispatch-local: fatal: unable to lock supervise/lock: temporary failure
Updated by Peter Amstutz almost 6 years ago
- Subject changed from Running Arvbox on VM to [arvbox] runsv fatal: unable to lock supervise/lock
This is a very weird error.
Looking at the process tree inside the container shows (a) defunct runsv processes and (b) daemon processes that should be under a runsvdir->runsv instance but are owned by the pid 1 runit process instead. This suggests that runsv is crashing or exiting abnormally, leaving its daemons orphaned.
The locking error presumably happens because runsv is clever and shares the lock file descriptor with the child daemon process when it spawns it. As long as that daemon continues to run, the lock stays held, so a newly started runsv for the same service cannot get the lock and will not run another instance of the service. That part is working as intended, but each failed attempt reports the lock error.
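The lock inheritance described above can be reproduced outside of runit. The following is a minimal sketch, assuming the lock is a BSD-style flock() taken on the lock file (which is what I believe runsv does on Linux, but I have not confirmed it against this build). The names try_lock and demo_inherited_lock are made up for illustration and are not part of runit.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Open path and try a non-blocking exclusive flock.

    Returns the locked fd, or None if another open file description
    already holds the lock (EWOULDBLOCK, reported as BlockingIOError).
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def demo_inherited_lock():
    """Return (blocked_while_child_alive, acquired_after_child_exit)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "lock")
        fd = try_lock(path)          # stand-in for the original supervisor
        assert fd is not None

        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Stand-in for the daemon: it inherits the locked descriptor
            # and simply waits until the parent closes the pipe.
            os.close(w)
            os.read(r, 1)
            os._exit(0)

        os.close(r)
        os.close(fd)                 # the supervisor "dies"; the child still holds the lock

        second = try_lock(path)      # stand-in for a replacement supervisor
        blocked = second is None
        if second is not None:
            os.close(second)

        os.close(w)                  # let the stand-in daemon exit
        os.waitpid(pid, 0)

        third = try_lock(path)
        acquired = third is not None
        if third is not None:
            os.close(third)
        return blocked, acquired


if __name__ == "__main__":
    print(demo_inherited_lock())
```

If the assumption holds, the second attempt fails for as long as the orphaned child is alive, which would match a replacement runsv failing to lock supervise/lock while the old daemon keeps running under pid 1.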
Updated by Peter Amstutz almost 6 years ago
- Status changed from New to In Progress
Differences between my workstation and the systems showing the problem:
My system:
Debian 9
Docker 18.09.0
Kernel 4.9.0-8-amd64
VM:
Ubuntu 18.04
Docker 17.05.0-ce
Kernel 4.15.0-1036-azure
runsvdir uses the inode and device number to decide whether a service directory matches one it has seen previously. I had a theory that overlayfs could be reporting a different inode or device number between scans, which would cause runsvdir to start a new runsv instance for the same service, but I haven't been able to confirm that this is what is happening.
For some reason, everything except ssh eventually settles down and stops getting restarted. This suggests that the runsv lock error is correlated with service restarts. That is odd, because service restarts are supposed to be handled by runsv, so runsvdir should only spin up a new runsv instance if runsv itself terminated. However, the runsv processes all have low pids, suggesting they haven't been restarted.
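To make the theory concrete, here is a rough sketch of that kind of bookkeeping. It is not runsvdir's actual code; the function names identity and scan and the dictionary layout are made up for illustration. The only assumption carried over from runsvdir is that a directory counts as "already seen" when its (device, inode) pair matches a stored one.

```python
import os


def identity(path):
    """Return the (device, inode) pair used to recognise a directory."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def scan(service_root, known):
    """Return names of service directories not seen in a previous scan.

    known maps (device, inode) -> service name and is updated in place.
    A directory whose reported device or inode changed between scans
    would show up here as new, even though its name is unchanged.
    """
    new = []
    for name in sorted(os.listdir(service_root)):
        full = os.path.join(service_root, name)
        if name.startswith(".") or not os.path.isdir(full):
            continue
        key = identity(full)
        if key not in known:
            known[key] = name
            new.append(name)
    return new
```

Calling scan twice on an unchanged directory should return every service the first time and an empty list the second time. If the filesystem reported a different device or inode on a later scan, the same service name would be returned again, which is the situation where a second supervisor would get started for it.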