Bug #4967

[Crunch] Doesn't cope well with FUSE mounts left hanging around after killing tasks with SIGKILL

Added by Bryan Cosca about 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assigned To: Brett Smith
Category: Crunch
Target version:
Start date: 01/21/2015
Due date:
% Done: 100%
Estimated time: (Total: 1.00 h)
Story points: 0.5

Description

Taken from qr1hi-8i9sb-ve5v94njtcw66yw. I re-ran the job and it seemed to work fine; I just wanted to bring this to attention.

1/12/2015 2:16:09 PM compute16 1 task-print 0 fuse: failed to open mountpoint for reading: Transport endpoint is not connected
1/12/2015 2:16:09 PM compute16 1 task-dispatch 0 srun: error: compute16: task 0: Exited with exit code 1
1/12/2015 2:16:09 PM compute16 1 task-print 0 Traceback (most recent call last):
1/12/2015 2:16:09 PM compute16 1 task-print 0 File "/usr/local/bin/arv-mount", line 149, in <module>
1/12/2015 2:16:09 PM compute16 1 task-print 0 llfuse.init(operations, args.mountpoint, opts)
1/12/2015 2:16:09 PM compute16 1 task-print 0 File "fuse_api.pxi", line 153, in llfuse.init (src/llfuse.c:17409)
1/12/2015 2:16:09 PM compute16 1 task-print 0 RuntimeError: fuse_mount failed
1/12/2015 2:16:09 PM compute16 1 task-dispatch 0 child 22088 on compute16.1 exit 1 success=
1/12/2015 2:16:09 PM compute16 1 task-dispatch 0 failure (#1, permanent) after 1 seconds


Subtasks

Task #5039: Review 4967-crunch-mount-cleanup-wip (Resolved, Ward Vandewege)


Related issues

Related to Arvados - Feature #5036: [arv-mount] Change default mount type from "fuse" to "fuse.arvados" (New, 01/20/2015)

Has duplicate Arvados - Bug #4970: [Crunch] Cannot create directory `/tmp/crunch-job/task/compute14.1.keep': File exists (Resolved, 01/12/2015)

Has duplicate Arvados - Bug #5046: Jobs failing to start. Logs show "rm:cannot remove `/tmp/crunch-job/task/compute19.1.keep': Is a directory" (Closed, 01/21/2015)

Associated revisions

Revision ef969ca8
Added by Brett Smith about 5 years ago

Merge branch '4967-crunch-mount-cleanup-wip'

Closes #4967, #4970, #5039.

Revision 7c34347e (diff)
Added by Brett Smith about 5 years ago

4967: API server bundle uses bugfixed crunch-job.

Refs #4967, #4970.

Revision 26119240 (diff)
Added by Brett Smith about 5 years ago

4967: Fix API server Gemfile.

I'm kind of at a loss to explain what happened here. Refs #4967.

History

#1 Updated by Bryan Cosca about 5 years ago

more data points: qr1hi-8i9sb-14gstze8dil8c3e

#2 Updated by Tim Pierce about 5 years ago

  • Target version set to Bug Triage

#3 Updated by Bryan Cosca about 5 years ago

Different day, still compute16: qr1hi-8i9sb-v78fsmqn3o24lua

#4 Updated by Brett Smith about 5 years ago

  • Subject changed from Fuse mount fails to [Crunch] Doesn't cope well with FUSE mounts left hanging around after killing tasks with SIGKILL
  • Category set to Crunch

This was happening because compute16 had a bunch of stale mount points for FUSE. They were listed by mount, but there was no corresponding arv-mount process.

qr1hi-8i9sb-u7i5v3t19dmegf5 started a batch of tasks, but a couple early ones failed. Because of that, Crunch started to kill the ones that did launch. However, they did not respond to nice signals, so eventually Crunch escalated to SIGKILL. That ended the processes, but it gave arv-mount no opportunity to clean up, so the mount points remained.
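
The difference between the "nice" signals and SIGKILL described above can be illustrated with a small Python sketch. It is not Crunch or arv-mount code: the child program, the marker file, and the function name cleanup_ran_after are all hypothetical stand-ins. The child installs a SIGTERM handler that records its cleanup, much as a well-behaved mount process would unmount on exit. SIGKILL cannot be caught, so the same child gets no chance to clean up.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Hypothetical child program standing in for a process that tidies up on exit.
# On SIGTERM it records that cleanup ran, then exits normally.
CHILD = r"""
import signal, sys, time

marker = sys.argv[1]

def cleanup(signum, frame):
    with open(marker, "w") as f:
        f.write("cleaned up")
    sys.exit(0)

signal.signal(signal.SIGTERM, cleanup)
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def cleanup_ran_after(signum):
    """Start the child, send it signum, and report whether its cleanup ran."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "marker")
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, marker],
            stdout=subprocess.PIPE,
            text=True,
        )
        proc.stdout.readline()  # block until the child has installed its handler
        proc.send_signal(signum)
        proc.wait()
        proc.stdout.close()
        return os.path.exists(marker)


if __name__ == "__main__":
    print("SIGTERM cleanup ran:", cleanup_ran_after(signal.SIGTERM))
    print("SIGKILL cleanup ran:", cleanup_ran_after(signal.SIGKILL))
```

This is why a mount point can outlive its process: whatever teardown the process would have done on a catchable signal never happens after SIGKILL, so something else has to unmount afterwards.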

Crunch has its own code to unmount Keep before running jobs: search for fusermount in the source. Apparently, this is not working as intended. It should be fixed.

#5 Updated by Tim Pierce about 5 years ago

Recent example:

I started qr1hi-8i9sb-2dszutc1qfgz5lf, which began running on compute48, and then quickly cancelled it.

A few seconds later I started qr1hi-8i9sb-h42r804cfjorr4a (same job, new script_version), which also started on compute48 and immediately failed due to this problem.

This describes the same or a very similar problem: http://stackoverflow.com/questions/27825678/mounted-filesystem-transport-endpoint-is-not-connected

This may be a bug in FUSE that is fixed in 2.9.2: https://bugs.launchpad.net/ubuntu/+source/fuse/+bug/1072270 (Are we using a very old FUSE driver?)

#6 Updated by Tom Clegg about 5 years ago

Instead of "if mount|grep -q $JOB_WORK/; then ....", we probably need something like:

mount -l -t fuse,fuse.keep | cut -d' ' -f3 | xargs --no-run-if-empty -n 1 fusermount -z -u
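
For readers who want to see what that pipeline does step by step, here is a rough Python equivalent. It is only a sketch, not the change that was merged into crunch-job: the function names are hypothetical, and it filters by filesystem type in Python rather than with mount's -t option. Like cut -d' ' -f3, it assumes mount points contain no spaces.

```python
import subprocess


def fuse_mount_points(mount_output, fs_types=("fuse", "fuse.keep")):
    """Return mount points from `mount`-style output whose type is in fs_types.

    Each line is expected to look like:
        DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)
    The mount point is the third space-separated field, as in `cut -d' ' -f3`.
    """
    points = []
    for line in mount_output.splitlines():
        fields = line.split(" ")
        if len(fields) >= 5 and fields[1] == "on" and fields[3] == "type":
            if fields[4] in fs_types:
                points.append(fields[2])
    return points


def unmount_commands(mount_points):
    """Build one lazy-unmount command (fusermount -z -u) per mount point."""
    return [["fusermount", "-z", "-u", point] for point in mount_points]


def lazy_unmount_all():
    """Run the whole pipeline against the live mount table."""
    result = subprocess.run(
        ["mount", "-l"], capture_output=True, text=True, check=True
    )
    for command in unmount_commands(fuse_mount_points(result.stdout)):
        # Keep going if one unmount fails, as xargs would.
        subprocess.run(command, check=False)


if __name__ == "__main__":
    # Illustrative mount table; the device names are made up.
    sample = (
        "/dev/sda1 on / type ext4 (rw,relatime)\n"
        "arv-mount on /tmp/crunch-job/task/compute14.1.keep type fuse (rw,nosuid,nodev)\n"
    )
    for command in unmount_commands(fuse_mount_points(sample)):
        print(" ".join(command))
```

Separating the parsing from the execution makes it possible to check which mount points would be affected before anything is actually unmounted.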

#7 Updated by Brett Smith about 5 years ago

  • Status changed from New to In Progress
  • Assigned To set to Brett Smith
  • Target version changed from Bug Triage to 2015-01-28 Sprint
  • Story points set to 0.5

I can locally reproduce #4967 and #4970 by sending the right signal to arv-mount:

brinstar % arv-mount --foreground /tmp/keep &
[1] 22405
brinstar % kill -KILL 22405
[1]  + killed     arv-mount --foreground /tmp/keep
brinstar % mkdir -p /tmp/keep
mkdir: cannot create directory `/tmp/keep': File exists
brinstar % arv-mount --foreground /tmp/keep
fuse: failed to access mountpoint /tmp/keep: Transport endpoint is not connected
2015-01-21 10:59:26 arvados.arv-mount[22437] ERROR: arv-mount: exception during mount
Traceback (most recent call last):
  File "/home/brett/.local/bin/arv-mount", line 186, in <module>
    llfuse.init(operations, args.mountpoint, opts)
  File "fuse_api.pxi", line 248, in llfuse.capi.init (src/llfuse/capi_linux.c:20443)
RuntimeError: fuse_mount failed

Unmounting fixes both issues:

brinstar % fusermount -u /tmp/keep
brinstar % mkdir -p /tmp/keep
brinstar % arv-mount --foreground /tmp/keep &
[1] 22478
brinstar % b /tmp/keep
total 2.5K
dr-xr-xr-x 1 brett brett   0 Jan 21 11:01 by_id
dr-xr-xr-x 1 brett brett   0 Jan 21 11:01 by_tag
dr-xr-xr-x 1 brett brett   0 Jan 21 11:01 home
-r--r--r-- 1 brett brett 509 Jan 21 11:01 README
dr-xr-xr-x 1 brett brett   0 Jan 21 11:01 shared

Therefore I'm marking #4970 as a duplicate and implementing a solution along the lines of Tom's suggestion.
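
The reproduction above suggests a simple guard before mounting: if the mount point answers with "Transport endpoint is not connected" (errno ENOTCONN), unmount it first, then create the directory. The Python sketch below illustrates that idea only. It is not the code from the 4967-crunch-mount-cleanup-wip branch, and the names is_stale_mount and prepare_mountpoint, as well as the injectable stat and unmount parameters, are hypothetical.

```python
import errno
import os
import subprocess


def is_stale_mount(path, stat=os.stat):
    """True if stat fails with ENOTCONN ("Transport endpoint is not connected")."""
    try:
        stat(path)
    except OSError as error:
        return error.errno == errno.ENOTCONN
    return False


def prepare_mountpoint(path, stat=os.stat, unmount=None):
    """Make path usable as a mount point, clearing a stale FUSE mount first.

    stat and unmount are injectable so the logic can be exercised without
    a real FUSE mount; by default unmounting runs `fusermount -u`.
    """
    if unmount is None:
        def unmount(mount_point):
            subprocess.run(["fusermount", "-u", mount_point], check=True)
    if is_stale_mount(path, stat):
        unmount(path)
    # Same effect as `mkdir -p`: succeeds if the directory already exists.
    os.makedirs(path, exist_ok=True)


if __name__ == "__main__":
    prepare_mountpoint("/tmp/keep")
```

Checking for ENOTCONN specifically keeps the guard from unmounting a healthy mount or masking unrelated errors such as a missing parent directory.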

#8 Updated by Ward Vandewege about 5 years ago

Reviewed 5754435: LGTM (tested the modified code on a compute node in this state, and it DTRT).

#9 Updated by Brett Smith about 5 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:ef969ca8dabe571a9866a7b3b7c39098785022fa.
