Bug #4967
closed
[Crunch] Doesn't cope well with FUSE mounts left hanging around after killing tasks with SIGKILL
Added by Bryan Cosca almost 10 years ago.
Updated almost 10 years ago.
Description
Taken from qr1hi-8i9sb-ve5v94njtcw66yw. I re-ran the job and it seemed to work fine; I just wanted to bring this to attention.
1/12/2015 2:16:09 PM compute16 1 task-print 0 fuse: failed to open mountpoint for reading: Transport endpoint is not connected
1/12/2015 2:16:09 PM compute16 1 task-dispatch 0 srun: error: compute16: task 0: Exited with exit code 1
1/12/2015 2:16:09 PM compute16 1 task-print 0 Traceback (most recent call last):
1/12/2015 2:16:09 PM compute16 1 task-print 0 File "/usr/local/bin/arv-mount", line 149, in <module>
1/12/2015 2:16:09 PM compute16 1 task-print 0 llfuse.init(operations, args.mountpoint, opts)
1/12/2015 2:16:09 PM compute16 1 task-print 0 File "fuse_api.pxi", line 153, in llfuse.init (src/llfuse.c:17409)
1/12/2015 2:16:09 PM compute16 1 task-print 0 RuntimeError: fuse_mount failed
1/12/2015 2:16:09 PM compute16 1 task-dispatch 0 child 22088 on compute16.1 exit 1 success=
1/12/2015 2:16:09 PM compute16 1 task-dispatch 0 failure (#1, permanent) after 1 seconds
- Target version set to Bug Triage
- Subject changed from Fuse mount fails to [Crunch] Doesn't cope well with FUSE mounts left hanging around after killing tasks with SIGKILL
- Category set to Crunch
This was happening because compute16 had a bunch of stale mount points for FUSE. They were listed by mount, but there was no corresponding arv-mount process.
qr1hi-8i9sb-u7i5v3t19dmegf5 started a batch of tasks, but a couple early ones failed. Because of that, Crunch started to kill the ones that did launch. However, they did not respond to nice signals, so eventually Crunch escalated to SIGKILL. That ended the processes, but it gave arv-mount no opportunity to clean up, so the mount points remained.
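The escalation described above can be sketched in shell. This is only an illustration under stated assumptions, not Crunch's actual code: the function name stop_and_unmount, the grace period, and the optional mount point argument are all hypothetical. It sends SIGTERM, waits, escalates to SIGKILL if needed, and then lazily unmounts, because a process that received SIGKILL never got the chance to unmount itself.

```shell
#!/bin/sh
# Hypothetical helper, not taken from Crunch.
# Usage: stop_and_unmount PID GRACE_SECONDS [MOUNTPOINT]
stop_and_unmount() {
    pid=$1
    grace=$2
    mountpoint=${3:-}

    # Ask politely first so the process can unmount on its own.
    kill -TERM "$pid" 2>/dev/null

    # Poll once per second until the process is gone or the grace period ends.
    waited=0
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
        sleep 1
        waited=$((waited + 1))
    done

    # Escalate only if the process is still there after the grace period.
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null
    fi

    # SIGKILL cannot be caught, so the mount may still be registered.
    # A lazy unmount (-z) detaches it even though the daemon is gone.
    # Errors are ignored here because the mount may already be gone.
    if [ -n "$mountpoint" ]; then
        fusermount -u -z "$mountpoint" 2>/dev/null || true
    fi
    return 0
}
```

Unmounting unconditionally after the kill keeps the helper simple: if the process exited cleanly and already unmounted, the fusermount call just fails quietly.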
Crunch has its own code to unmount Keep before running jobs: search for fusermount in the source. Apparently, this is not working as intended. It should be fixed.
Instead of
if mount|grep -q $JOB_WORK/; then ....
we probably need something like
mount -l -t fuse,fuse.keep | cut -d' ' -f3 | xargs --no-run-if-empty -n 1 fusermount -z -u
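A variant of the pipeline above, sketched as two small functions. The names list_fuse_mounts and unmount_all are hypothetical, and reading /proc/mounts instead of the output of mount is an assumption made here so that the mount point is always the second field. The file system types are the two named in the suggestion (fuse and fuse.keep).

```shell
#!/bin/sh
# Print the mount point of every FUSE file system found in
# /proc/mounts-style input (fields: device mountpoint fstype options ...).
# An optional prefix restricts output to mount points that start with it.
list_fuse_mounts() {
    awk -v prefix="${1:-}" '
        ($3 == "fuse" || $3 == "fuse.keep") &&
        (prefix == "" || index($2, prefix) == 1) { print $2 }
    '
}

# Lazily unmount every mount point read from stdin, one per line.
unmount_all() {
    while IFS= read -r mountpoint; do
        fusermount -u -z "$mountpoint" ||
            echo "could not unmount $mountpoint" >&2
    done
}

# Possible use before starting a job (left commented out on purpose):
# list_fuse_mounts "$JOB_WORK/" < /proc/mounts | unmount_all
```

Note that /proc/mounts escapes spaces in paths as \040, so mount points containing spaces would need decoding before being passed to fusermount; this sketch does not handle that case.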
- Status changed from New to In Progress
- Assigned To set to Brett Smith
- Target version changed from Bug Triage to 2015-01-28 Sprint
- Story points set to 0.5
I can locally reproduce #4967 and #4970 by sending the right signal to arv-mount:
brinstar % arv-mount --foreground /tmp/keep &
[1] 22405
brinstar % kill -KILL 22405
[1] + killed arv-mount --foreground /tmp/keep
brinstar % mkdir -p /tmp/keep
mkdir: cannot create directory `/tmp/keep': File exists
brinstar % arv-mount --foreground /tmp/keep
fuse: failed to access mountpoint /tmp/keep: Transport endpoint is not connected
2015-01-21 10:59:26 arvados.arv-mount[22437] ERROR: arv-mount: exception during mount
Traceback (most recent call last):
File "/home/brett/.local/bin/arv-mount", line 186, in <module>
llfuse.init(operations, args.mountpoint, opts)
File "fuse_api.pxi", line 248, in llfuse.capi.init (src/llfuse/capi_linux.c:20443)
RuntimeError: fuse_mount failed
Unmounting fixes both issues:
brinstar % fusermount -u /tmp/keep
brinstar % mkdir -p /tmp/keep
brinstar % arv-mount --foreground /tmp/keep &
[1] 22478
brinstar % b /tmp/keep
total 2.5K
dr-xr-xr-x 1 brett brett 0 Jan 21 11:01 by_id
dr-xr-xr-x 1 brett brett 0 Jan 21 11:01 by_tag
dr-xr-xr-x 1 brett brett 0 Jan 21 11:01 home
-r--r--r-- 1 brett brett 509 Jan 21 11:01 README
dr-xr-xr-x 1 brett brett 0 Jan 21 11:01 shared
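The manual recovery shown in this transcript could be folded into a guard that runs before mounting. The following is a minimal sketch, assuming Linux, where a mount point whose daemon has died reports "Transport endpoint is not connected"; the function name prepare_mountpoint is hypothetical.

```shell
#!/bin/sh
# Hypothetical guard: make sure a directory is usable as a mount point.
# If it is a stale FUSE mount, unmount it first, then create it if needed.
prepare_mountpoint() {
    mountpoint=$1

    # LC_ALL=C keeps the error text in English so it can be matched.
    if ! err=$(LC_ALL=C ls -d -- "$mountpoint" 2>&1); then
        case $err in
            *"Transport endpoint is not connected"*)
                # Stale mount: the FUSE daemon is gone but the mount remains.
                fusermount -u "$mountpoint" || return 1
                ;;
        esac
    fi

    # Succeeds whether or not the directory already exists.
    mkdir -p -- "$mountpoint"
}

# Possible use (left commented out on purpose):
# prepare_mountpoint /tmp/keep && arv-mount --foreground /tmp/keep
```

Matching on the error text is fragile across platforms and locales, which is why the sketch pins the locale; a more careful version would test for the ENOTCONN error code directly.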
Therefore I'm marking #4970 as a duplicate and implementing a solution along the lines of Tom's suggestion.
Reviewing 5754435: LGTM. I tested the modified code on a compute node in this state, and it did the right thing.
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:ef969ca8dabe571a9866a7b3b7c39098785022fa.