Bug #4967
Closed
[Crunch] Doesn't cope well with FUSE mounts left hanging around after killing tasks with SIGKILL
Description
Taken from qr1hi-8i9sb-ve5v94njtcw66yw. I re-ran the job and it seemed to work fine; I just wanted to bring it to attention.
1/12/2015 2:16:09 PM compute16 1 task-print 0 fuse: failed to open mountpoint for reading: Transport endpoint is not connected
1/12/2015 2:16:09 PM compute16 1 task-dispatch 0 srun: error: compute16: task 0: Exited with exit code 1
1/12/2015 2:16:09 PM compute16 1 task-print 0 Traceback (most recent call last):
1/12/2015 2:16:09 PM compute16 1 task-print 0 File "/usr/local/bin/arv-mount", line 149, in <module>
1/12/2015 2:16:09 PM compute16 1 task-print 0 llfuse.init(operations, args.mountpoint, opts)
1/12/2015 2:16:09 PM compute16 1 task-print 0 File "fuse_api.pxi", line 153, in llfuse.init (src/llfuse.c:17409)
1/12/2015 2:16:09 PM compute16 1 task-print 0 RuntimeError: fuse_mount failed
1/12/2015 2:16:09 PM compute16 1 task-dispatch 0 child 22088 on compute16.1 exit 1 success=
1/12/2015 2:16:09 PM compute16 1 task-dispatch 0 failure (#1, permanent) after 1 seconds
Updated by Bryan Cosca almost 10 years ago
more data points: qr1hi-8i9sb-14gstze8dil8c3e
Updated by Bryan Cosca almost 10 years ago
Different day, still compute16: qr1hi-8i9sb-v78fsmqn3o24lua
Updated by Brett Smith almost 10 years ago
- Subject changed from Fuse mount fails to [Crunch] Doesn't cope well with FUSE mounts left hanging around after killing tasks with SIGKILL
- Category set to Crunch
This was happening because compute16 had a bunch of stale mount points for FUSE. They were listed by mount, but there was no corresponding arv-mount process.
qr1hi-8i9sb-u7i5v3t19dmegf5 started a batch of tasks, but a couple early ones failed. Because of that, Crunch started to kill the ones that did launch. However, they did not respond to nice signals, so eventually Crunch escalated to SIGKILL. That ended the processes, but it gave arv-mount no opportunity to clean up, so the mount points remained.
Crunch has its own code to unmount Keep before running jobs: search for fusermount in the source. Apparently, this is not working as intended. It should be fixed.
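The stale state described above can also be detected directly: a stat on a FUSE mount point whose daemon is gone fails with ENOTCONN, which is the errno behind the "Transport endpoint is not connected" message in the log. A minimal Python sketch of that check follows. The names is_stale_fuse_mount and its stat parameter are hypothetical, chosen for illustration; this is not code from Crunch or arv-mount.

```python
import errno
import os


def is_stale_fuse_mount(path, stat=os.stat):
    """Return True if stat(path) fails with ENOTCONN.

    ENOTCONN is the errno reported as "Transport endpoint is not connected".
    The stat parameter is injectable (an assumption of this sketch) only so
    the check can be exercised without a real broken mount.
    """
    try:
        stat(path)
    except OSError as err:
        return err.errno == errno.ENOTCONN
    return False
```

A check like this only says that a mount point is broken; it still has to be unmounted (for example with fusermount -u) before a new arv-mount can use the directory.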
Updated by Tim Pierce almost 10 years ago
Recent example:
I started qr1hi-8i9sb-2dszutc1qfgz5lf, which began running on compute48, and then quickly cancelled it.
A few seconds later I started qr1hi-8i9sb-h42r804cfjorr4a (same job, new script_version), which also started on compute48 and immediately failed due to this problem.
This describes the same or a very similar problem: http://stackoverflow.com/questions/27825678/mounted-filesystem-transport-endpoint-is-not-connected
Possibility that this is a bug in FUSE that is fixed in 2.9.2: https://bugs.launchpad.net/ubuntu/+source/fuse/+bug/1072270 (Are we using a very old FUSE driver?)
Updated by Tom Clegg almost 10 years ago
Instead of if mount|grep -q $JOB_WORK/; then ...., we probably need something like
mount -l -t fuse,fuse.keep | cut -d' ' -f3 | xargs --no-run-if-empty -n 1 fusermount -z -u
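To make the steps of that pipeline explicit, here is a rough Python sketch of the same logic: take the third space-separated field of each line of mount output (what cut -d' ' -f3 does) and build one lazy unmount command (fusermount -z -u) per mount point. The function names, the sample text, and the paths in it are made up for illustration; the sketch only builds the commands and does not run them.

```python
def fuse_mountpoints(mount_output):
    """Return the third space-separated field of each line, like cut -d' ' -f3."""
    points = []
    for line in mount_output.splitlines():
        fields = line.split(" ")
        # Skip blank or short lines instead of emitting empty mount points.
        if len(fields) >= 3 and fields[2]:
            points.append(fields[2])
    return points


def unmount_commands(mount_output):
    """Build one lazy-unmount argv per mount point; an empty list if none."""
    return [["fusermount", "-z", "-u", p] for p in fuse_mountpoints(mount_output)]


# Made-up sample in the usual "<source> on <dir> type <fstype> (<opts>)" layout.
SAMPLE = (
    "arv-mount on /tmp/crunch-job/task/keep type fuse (rw,nosuid,nodev)\n"
    "arv-mount on /tmp/keep type fuse (rw,nosuid,nodev)\n"
)

for argv in unmount_commands(SAMPLE):
    print(" ".join(argv))
```

Like the shell version, this takes only the text up to the first space in the third field, so a mount point containing spaces would be truncated. Returning an empty list when nothing is mounted plays the role of xargs --no-run-if-empty.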
Updated by Brett Smith almost 10 years ago
- Status changed from New to In Progress
- Assigned To set to Brett Smith
- Target version changed from Bug Triage to 2015-01-28 Sprint
- Story points set to 0.5
I can locally reproduce #4967 and #4970 by sending the right signal to arv-mount:
brinstar % arv-mount --foreground /tmp/keep &
[1] 22405
brinstar % kill -KILL 22405
[1] + killed arv-mount --foreground /tmp/keep
brinstar % mkdir -p /tmp/keep
mkdir: cannot create directory `/tmp/keep': File exists
brinstar % arv-mount --foreground /tmp/keep
fuse: failed to access mountpoint /tmp/keep: Transport endpoint is not connected
2015-01-21 10:59:26 arvados.arv-mount[22437] ERROR: arv-mount: exception during mount
Traceback (most recent call last):
File "/home/brett/.local/bin/arv-mount", line 186, in <module>
llfuse.init(operations, args.mountpoint, opts)
File "fuse_api.pxi", line 248, in llfuse.capi.init (src/llfuse/capi_linux.c:20443)
RuntimeError: fuse_mount failed
Unmounting fixes both issues:
brinstar % fusermount -u /tmp/keep
brinstar % mkdir -p /tmp/keep
brinstar % arv-mount --foreground /tmp/keep &
[1] 22478
brinstar % b /tmp/keep
total 2.5K
dr-xr-xr-x 1 brett brett 0 Jan 21 11:01 by_id
dr-xr-xr-x 1 brett brett 0 Jan 21 11:01 by_tag
dr-xr-xr-x 1 brett brett 0 Jan 21 11:01 home
-r--r--r-- 1 brett brett 509 Jan 21 11:01 README
dr-xr-xr-x 1 brett brett 0 Jan 21 11:01 shared
Therefore I'm marking #4970 as a duplicate and implementing a solution along the lines of Tom's suggestion.
Updated by Ward Vandewege almost 10 years ago
reviewing 5754435 LGTM (tested the modified code on a compute node in this state, and it DTRT).
Updated by Brett Smith almost 10 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:ef969ca8dabe571a9866a7b3b7c39098785022fa.