Bug #4967

closed

[Crunch] Doesn't cope well with FUSE mounts left hanging around after killing tasks with SIGKILL

Added by Bryan Cosca almost 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: Normal
Assigned To: Brett Smith
Category: Crunch
Target version: 2015-01-28 Sprint
Story points: 0.5

Description

Taken from qr1hi-8i9sb-ve5v94njtcw66yw. I re-ran the job and it seemed to work fine; I just wanted to bring this to attention.

1/12/2015 2:16:09 PM compute16 1 task-print 0 fuse: failed to open mountpoint for reading: Transport endpoint is not connected
1/12/2015 2:16:09 PM compute16 1 task-dispatch 0 srun: error: compute16: task 0: Exited with exit code 1
1/12/2015 2:16:09 PM compute16 1 task-print 0 Traceback (most recent call last):
1/12/2015 2:16:09 PM compute16 1 task-print 0 File "/usr/local/bin/arv-mount", line 149, in <module>
1/12/2015 2:16:09 PM compute16 1 task-print 0 llfuse.init(operations, args.mountpoint, opts)
1/12/2015 2:16:09 PM compute16 1 task-print 0 File "fuse_api.pxi", line 153, in llfuse.init (src/llfuse.c:17409)
1/12/2015 2:16:09 PM compute16 1 task-print 0 RuntimeError: fuse_mount failed
1/12/2015 2:16:09 PM compute16 1 task-dispatch 0 child 22088 on compute16.1 exit 1 success=
1/12/2015 2:16:09 PM compute16 1 task-dispatch 0 failure (#1, permanent) after 1 seconds


Subtasks 1 (0 open, 1 closed)

Task #5039: Review 4967-crunch-mount-cleanup-wip (Resolved, Ward Vandewege, 01/21/2015)

Related issues 3 (0 open, 3 closed)

Related to Arvados - Feature #5036: [arv-mount] Change default mount type from "fuse" to "fuse.arvados" (Closed, 01/20/2015)
Has duplicate Arvados - Bug #4970: [Crunch] Cannot create directory `/tmp/crunch-job/task/compute14.1.keep': File exists (Resolved, 01/12/2015)
Has duplicate Arvados - Bug #5046: Jobs failing to start. Logs show "rm:cannot remove `/tmp/crunch-job/task/compute19.1.keep': Is a directory" (Closed, 01/21/2015)

Actions #1

Updated by Bryan Cosca almost 10 years ago

more data points: qr1hi-8i9sb-14gstze8dil8c3e

Actions #2

Updated by Tim Pierce almost 10 years ago

  • Target version set to Bug Triage
Actions #3

Updated by Bryan Cosca almost 10 years ago

Different day, still compute16: qr1hi-8i9sb-v78fsmqn3o24lua

Actions #4

Updated by Brett Smith almost 10 years ago

  • Subject changed from "Fuse mount fails" to "[Crunch] Doesn't cope well with FUSE mounts left hanging around after killing tasks with SIGKILL"
  • Category set to Crunch

This was happening because compute16 had a number of stale FUSE mount points. They were still listed in the output of mount, but there was no corresponding arv-mount process.

qr1hi-8i9sb-u7i5v3t19dmegf5 started a batch of tasks, but a couple early ones failed. Because of that, Crunch started to kill the ones that did launch. However, they did not respond to nice signals, so eventually Crunch escalated to SIGKILL. That ended the processes, but it gave arv-mount no opportunity to clean up, so the mount points remained.

Crunch has its own code to unmount Keep before running jobs: search for fusermount in the source. Apparently, this is not working as intended. It should be fixed.

Actions #5

Updated by Tim Pierce almost 10 years ago

Recent example:

I started qr1hi-8i9sb-2dszutc1qfgz5lf, which began running on compute48, and then quickly cancelled it.

A few seconds later I started qr1hi-8i9sb-h42r804cfjorr4a (same job, new script_version), which also started on compute48 and immediately failed due to this problem.

This describes the same or a very similar problem: http://stackoverflow.com/questions/27825678/mounted-filesystem-transport-endpoint-is-not-connected

This may be a bug in FUSE that is fixed in 2.9.2: https://bugs.launchpad.net/ubuntu/+source/fuse/+bug/1072270 (Are we using a very old FUSE driver?)

Actions #6

Updated by Tom Clegg almost 10 years ago

Instead of if mount|grep -q $JOB_WORK/; then ...., we probably need something like

mount -l -t fuse,fuse.keep | cut -d' ' -f3 | xargs --no-run-if-empty -n 1 fusermount -z -u
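
A slightly more defensive variant of the pipeline above could read the mount table in /proc/mounts format and only touch mount points under the job's work directory. This is a sketch, not the change that was eventually committed: the function names fuse_mounts_under and unmount_all are made up for illustration, and it assumes stale Keep mounts appear with type fuse or fuse.keep, as in the one-liner.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not taken from Crunch.

# Print the mount points of fuse / fuse.keep filesystems whose path starts
# with the given prefix. Expects /proc/mounts-format lines on stdin:
#   device mountpoint fstype options dump pass
fuse_mounts_under() {
    prefix=$1
    awk -v prefix="$prefix" '
        # Field 2 is the mount point, field 3 the filesystem type.
        ($3 == "fuse" || $3 == "fuse.keep") && index($2, prefix) == 1 { print $2 }
    '
}

# Lazily unmount every mount point read from stdin, one per line.
unmount_all() {
    while IFS= read -r mountpoint; do
        fusermount -u -z "$mountpoint" ||
            echo "could not unmount $mountpoint" >&2
    done
}
```

On a compute node this might be run as fuse_mounts_under "$JOB_WORK/" < /proc/mounts | unmount_all, so that FUSE mounts outside the job's work directory are left alone.
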
Actions #7

Updated by Brett Smith almost 10 years ago

  • Status changed from New to In Progress
  • Assigned To set to Brett Smith
  • Target version changed from Bug Triage to 2015-01-28 Sprint
  • Story points set to 0.5

I can locally reproduce #4967 and #4970 by sending SIGKILL to arv-mount:

brinstar % arv-mount --foreground /tmp/keep &
[1] 22405
brinstar % kill -KILL 22405
[1]  + killed     arv-mount --foreground /tmp/keep
brinstar % mkdir -p /tmp/keep
mkdir: cannot create directory `/tmp/keep': File exists
brinstar % arv-mount --foreground /tmp/keep
fuse: failed to access mountpoint /tmp/keep: Transport endpoint is not connected
2015-01-21 10:59:26 arvados.arv-mount[22437] ERROR: arv-mount: exception during mount
Traceback (most recent call last):
  File "/home/brett/.local/bin/arv-mount", line 186, in <module>
    llfuse.init(operations, args.mountpoint, opts)
  File "fuse_api.pxi", line 248, in llfuse.capi.init (src/llfuse/capi_linux.c:20443)
RuntimeError: fuse_mount failed

Unmounting fixes both issues:

brinstar % fusermount -u /tmp/keep
brinstar % mkdir -p /tmp/keep
brinstar % arv-mount --foreground /tmp/keep &
[1] 22478
brinstar % b /tmp/keep
total 2.5K
dr-xr-xr-x 1 brett brett   0 Jan 21 11:01 by_id
dr-xr-xr-x 1 brett brett   0 Jan 21 11:01 by_tag
dr-xr-xr-x 1 brett brett   0 Jan 21 11:01 home
-r--r--r-- 1 brett brett 509 Jan 21 11:01 README
dr-xr-xr-x 1 brett brett   0 Jan 21 11:01 shared

Therefore I'm marking #4970 as a duplicate and implementing a solution along the lines of Tom's suggestion.
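
The reproduction above suggests a simple way to recognise the bad state before trying to mount: the directory is still listed in the mount table, yet it can no longer be accessed. A minimal sketch of such a check follows; the function name is_stale_mount and its optional second argument (an alternative mounts file, useful for testing) are assumptions for illustration, not part of Crunch or arv-mount.

```shell
#!/bin/sh
# Hypothetical helper; not taken from the Crunch source.
# Succeeds (exit 0) when the directory is listed as a mount point but the
# filesystem behind it no longer answers.
is_stale_mount() {
    dir=$1
    mounts=${2:-/proc/mounts}
    # Field 2 of each /proc/mounts line is the mount point.
    awk -v dir="$dir" '$2 == dir { found = 1 } END { exit !found }' "$mounts" ||
        return 1
    # A healthy mount point can be listed; a dead FUSE mount fails with
    # "Transport endpoint is not connected".
    ! ls -d -- "$dir" >/dev/null 2>&1
}
```

With that in place, something like is_stale_mount /tmp/keep && fusermount -u -z /tmp/keep would clean up only when the mount is actually dead.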

Actions #8

Updated by Ward Vandewege almost 10 years ago

Reviewed 5754435: LGTM. I tested the modified code on a compute node in this state, and it does the right thing.

Actions #9

Updated by Brett Smith almost 10 years ago

  • Status changed from In Progress to Resolved

Applied in changeset ef969ca8dabe571a9866a7b3b7c39098785022fa (arvados).
