Bug #12538

crunch-run failing to terminate after complete

Added by Peter Amstutz over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
11/06/2017
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

/tmp/slurm-43045.out

su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:25:19.562765000Z Executing container 'su92l-dz642-nwm0iyvzj9s41sb'
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:25:19.562833200Z Executing on host 'compute2.su92l.arvadosapi.com'
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:25:19.585576000Z Fetching Docker image from collection '816fafa3f6c232ba55dde9f73b2e8e0a+1435'
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:25:19.618615000Z Using Docker image id 'sha256:bf3a9bd0230a9f581d5a5c8f7816a529e9b84da644766b79bd926b7594fd52e4'
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:25:19.619444200Z Loading Docker image from keep
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:28:38.601230800Z Docker response: {"stream":"Loaded image ID: sha256:bf3a9bd0230a9f581d5a5c8f7816a529e9b84da644766b79bd926b7594fd52e4\n"}
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:28:53.393393600Z Creating Docker container
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:28:53.449673000Z Attaching container streams
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:28:53.727443500Z Starting Docker container id 'e20e8603bc7a916cb0169ba8f8440842bd01fe8c46aeed73678f314d39b01054'
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:28:54.182723200Z Waiting for container to finish
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:05:36.140991200Z Container exited with code: 0
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:05:43.607339600Z Uploading indexed/filtered_ADNI_021_S_4245.raw_variants.chr1.gvcf.gz (199723384 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:05:45.021751100Z Uploading indexed/filtered_ADNI_021_S_4245.raw_variants.chr1.gvcf.gz.tbi (93262 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:05:45.022187400Z Uploading indexed/filtered_ADNI_021_S_4245.raw_variants.chr10.gvcf.gz (110956864 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:05:46.166695200Z Uploading indexed/filtered_ADNI_021_S_4245.raw_variants.chr10.gvcf.gz.tbi (53918 bytes)
...
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:07:56.511282300Z Uploading stage/ADNI_021_S_4245/035d.fj.gz (305872 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:07:56.512564500Z Uploading stage/ADNI_021_S_4245/035d.fj.gz.gzi (15320 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:07:56.512861700Z Uploading stage/ADNI_021_S_4245/035d.pa (61200691 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:07:57.622218300Z Uploading stage/ADNI_021_S_4245/035e.fj.gz (8877 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:07:57.623806300Z Uploading stage/ADNI_021_S_4245/035e.fj.gz.gzi (8 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:07:57.624148100Z Uploading stage/ADNI_021_S_4245/035e.pa (33168 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:08:04.146936600Z Complete
2017/11/01 14:08:17 StopContainer failed: Error response from daemon: No such container: e20e8603bc7a916cb0169ba8f8440842bd01fe8c46aeed73678f314d39b01054

arv-mount is still running:

 ps ax |grep arv-mount
 4225 ?        Sl    21:52 /usr/bin/python2.7 /usr/bin/arv-mount --foreground --allow-other --read-write --file-cache 4294967296 --mount-by-pdh by_id /tmp/keep694723160

keep mount is working, I can browse it.

There are no Docker containers running.

slurm-43045.out (301 KB) Peter Amstutz, 11/01/2017 05:48 PM

Subtasks

Task #12554: Review 12538-crunchrun-kill-arvmount (Resolved, Tom Clegg)


Related issues

Related to Arvados - Bug #11994: [arv-mount] Do not crash if /sys/fs/fuse/connections is empty (Resolved, 2017-07-19)

Related to Arvados - Bug #12306: [arv-mount] --unmount should work on an unresponsive mount (Resolved, 2017-09-22)

Related to Arvados - Feature #12566: [arv-mount] --unmount should sync/wait and unmount gracefully unless otherwise specified (New, 2017-11-06)

Associated revisions

Revision 4b23309e
Added by Peter Amstutz over 1 year ago

Merge branch '12538-crunchrun-kill-arvmount' closes #12538

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)

#2 Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)

#3 Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)

#5 Updated by Peter Amstutz over 1 year ago

Tentatively, it looks like we are expecting arv-mount to terminate, but it never does:

goroutine 1 [chan receive, 217 minutes]:
main.(*ContainerRunner).CleanupDirs(0xc4204c6000)
    /tmp/tmp.VtuKmcBNQm/src/git.curoverse.com/arvados.git/services/crunch-run/crunchrun.go:1148 +0x328
main.(*ContainerRunner).Run(0xc4204c6000, 0x0, 0x0)
    /tmp/tmp.VtuKmcBNQm/src/git.curoverse.com/arvados.git/services/crunch-run/crunchrun.go:1388 +0x637
main.main()
    /tmp/tmp.VtuKmcBNQm/src/git.curoverse.com/arvados.git/services/crunch-run/crunchrun.go:1488 +0x841
goroutine 231 [syscall, 317 minutes]:
syscall.Syscall6(0xf7, 0x1, 0x1081, 0xc42028dda0, 0x1000004, 0x0, 0x0, 0xc42028df9c, 0xc42028df38, 0xc42001b300)
    /usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
os.(*Process).blockUntilWaitable(0xc4201a2c30, 0xa, 0xc42028df19, 0x4)
    /usr/local/go/src/os/wait_waitid.go:28 +0xa5
os.(*Process).wait(0xc4201a2c30, 0x480e72, 0x59f9bd43, 0x3574f024)
    /usr/local/go/src/os/exec_unix.go:22 +0x4d
os.(*Process).Wait(0xc4201a2c30, 0xc42045e070, 0xc420155b47, 0x5c)
    /usr/local/go/src/os/exec.go:115 +0x2b
os/exec.(*Cmd).Wait(0xc42049a6e0, 0xc42045e070, 0x0)
    /usr/local/go/src/os/exec/exec.go:435 +0x62
main.(*ContainerRunner).ArvMountCmd.func2(0xc4204c6000, 0xc420080090)
    /tmp/tmp.VtuKmcBNQm/src/git.curoverse.com/arvados.git/services/crunch-run/crunchrun.go:320 +0x2e
created by main.(*ContainerRunner).ArvMountCmd
    /tmp/tmp.VtuKmcBNQm/src/git.curoverse.com/arvados.git/services/crunch-run/crunchrun.go:322 +0x476

#6 Updated by Peter Amstutz over 1 year ago

So crunch-dispatch-slurm knows to scancel containers that are "Complete" but still in squeue, which compensates for the fact that crunch-run deadlocks.

#7 Updated by Peter Amstutz over 1 year ago

tom:

seems like crunch-run needs to be more aggressive about killing arv-mount (perhaps abandoning it after some timeout?)

Seems like it. Need something like:

  1. Try "fusermount -u -z" several times, with a timeout
  2. If arv-mount doesn't exit on its own, kill arv-mount and then run "arv-mount --unmount"

#8 Updated by Peter Amstutz over 1 year ago

  • Status changed from New to In Progress
  • Assigned To set to Peter Amstutz

#9 Updated by Peter Amstutz over 1 year ago

This is somewhat dependent on #11994 and #12306 because it now assumes that arv-mount --unmount will clean up whatever mess is left over.

#10 Updated by Tom Clegg over 1 year ago

Looking at 12538-crunchrun-kill-arvmount @ e16989ed7897833f14766341838abd4e382a32a1

Why call "fusermount -u path" first? If that is the first step in the best "unmount" recipe, shouldn't "arv-mount --unmount" be doing it instead of crunch-run? But we spent time on this in #11209. The fuse docs say /sys/fs/.../abort is the way that "always works". Do we need to revisit?

With that, tryUnmount can look more like just tryUnmount(path). Might also be neater if it logs directly to runner.CrunchLog, instead of returning an error for CleanupDirs() to log. The word "try" suggests it's not really an error when it doesn't succeed.

If umount.Start() fails, tryUnmount should return right away rather than wait for arv-mount to exit by coincidence.

Instead of "if tmpdir == runner.ArvMountPoint { continue }" could we just avoid adding ArvMountPoint to CleanupTempDir in the first place?

The deferred "crunch-run finished" message looks like it's too late: by then, we've already closed the re-opened runner.CrunchLog. Maybe this should be "defer runner.CrunchLog.Printf(...)" in UpdateContainerFinal() instead?

Nit: "<-time.After(9*time.Second)" would be a bit less verbose than NewTimer(). The 8 and 9 second figures should probably be consts like a, b=a+1.

Might be nice to log the unmount command being run, just before running it.

#11 Updated by Peter Amstutz over 1 year ago

Tom Clegg wrote:

Looking at 12538-crunchrun-kill-arvmount @ e16989ed7897833f14766341838abd4e382a32a1

Why call "fusermount -u path" first? If that is the first step in the best "unmount" recipe, shouldn't "arv-mount --unmount" be doing it instead of crunch-run? But we spent time on this in #11209. The fuse docs say /sys/fs/.../abort is the way that "always works". Do we need to revisit?

I prefer to try to unmount it "nicely" before hitting it with a hammer. As I understand it, there are several possible unmount behaviors:

  1. "fusermount -u" → only umount if not in use, fails if the mount is busy
  2. "fusermount -u -z" → lazy unmount, doesn't necessarily fail if busy, detaches from parent FS tree but open FDs continue to be readable (can cause strange behavior because paths into the mount are no longer resolvable), daemon is closed when there are no more users
  3. abort → hard unmount, daemon and clients closed immediately, anything accessing the mount gets an error

For compute jobs, we know we are done using the mount, so any of the above should work. However, for user mounts, something could still be happening in the background, so it seems better to ask the mount to shut down nicely. So if "arv-mount --unmount" is supposed to be an all-purpose tool, it should start by being nice before being mean. That said, we can simplify things by making crunch-run use "arv-mount --unmount" only.

#12 Updated by Tom Clegg over 1 year ago

Offering two variants of unmount (nice/mean, soft/hard, normal/force?) sounds like a good feature. Added #12566. Currently arv-mount only offers the mean kind, which is more appropriate for crunch-run's use case, so I think we agree crunch-run doesn't need to add any custom tricks of its own here.

#13 Updated by Peter Amstutz over 1 year ago

Tom Clegg wrote:

Looking at 12538-crunchrun-kill-arvmount @ e16989ed7897833f14766341838abd4e382a32a1

With that, tryUnmount can look more like just tryUnmount(path). Might also be neater if it logs directly to runner.CrunchLog, instead of returning an error for CleanupDirs() to log. The word "try" suggests it's not really an error when it doesn't succeed.

Moved the code back into CleanupDirs(), as it now only makes a single call to "arv-mount --unmount".

If umount.Start() fails, tryUnmount should return right away rather than wait for arv-mount to exit by coincidence.

Done. It also now sends a kill signal to the umount process if it takes too long (but still doesn't block on it).

Instead of "if tmpdir == runner.ArvMountPoint { continue }" could we just avoid adding ArvMountPoint to CleanupTempDir in the first place?

Done.

The deferred "crunch-run finished" message looks like it's too late: by then, we've already closed the re-opened runner.CrunchLog. Maybe this should be "defer runner.CrunchLog.Printf(...)" in UpdateContainerFinal() instead?

I refactored this; there doesn't appear to be any reason to have multiple defer blocks in that method, so it is now all done in the one exit handler.

Nit: "<-time.After(9*time.Second)" would be a bit less verbose than NewTimer(). The 8 and 9 second figures should probably be consts like a, b=a+1.

Done.

Might be nice to log the unmount command being run, just before running it.

It logs the unmount command on error.

12538-crunchrun-kill-arvmount @ f830a9450430422f6aaf6ec36bb760db530f4376

#14 Updated by Tom Clegg over 1 year ago

Peter Amstutz wrote:

Tom Clegg wrote:

The deferred "crunch-run finished" message looks like it's too late: by then, we've already closed the re-opened runner.CrunchLog. Maybe this should be "defer runner.CrunchLog.Printf(...)" in UpdateContainerFinal() instead?

I refactored this; there doesn't appear to be any reason to have multiple defer blocks in that method, so it is now all done in the one exit handler.

Hm. I'm not sure refactoring the shutdown code was a good idea. I'm not sure I'm seeing all of the consequences but it looks like we're now calling CommitLogs() even if final state is Queued. Won't that uselessly create a collection and then fail in UpdateContainerFinal() because you can't set a non-nil log when changing state to Queued?

Having a separate defer runner.CleanupDirs() made some sense in that it does local cleanup that we need regardless of what happens in the big "finalize container" defer.

For the sake of stability, could we leave this part as it was for now?

Might be nice to log the unmount command being run, just before running it.

It logs the unmount command on error.

I noticed that, but I was thinking sometimes it's also nice to see how long an operation like this took, what command claimed to succeed when investigating a case where it didn't truly succeed, etc. Meh.

#15 Updated by Peter Amstutz over 1 year ago

Tom Clegg wrote:

Hm. I'm not sure refactoring the shutdown code was a good idea. I'm not sure I'm seeing all of the consequences but it looks like we're now calling CommitLogs() even if final state is Queued. Won't that uselessly create a collection and then fail in UpdateContainerFinal() because you can't set a non-nil log when changing state to Queued?

Having a separate defer runner.CleanupDirs() made some sense in that it does local cleanup that we need regardless of what happens in the big "finalize container" defer.

For the sake of stability, could we leave this part as it was for now?

Alright. Partly reverted to old flow, but moved a little more cleanup behavior (stopping signals, flushing cleanup-related log messages) into the second "defer".

Note that this means cleanup failures won't appear in the permanent logs (that would require doing the cleanup before calling CommitLogs()).

I noticed that, but I was thinking sometimes it's also nice to see how long an operation like this took, what command claimed to succeed when investigating a case where it didn't truly succeed, etc. Meh.

Sure. Now it always logs the unmount.

Rebased on master.

It now sends a kill signal to the "arv-mount" and "arv-mount --unmount" processes on timeout, but doesn't wait for them to terminate.

12538-crunchrun-kill-arvmount @ 47508624a359de86a402030e67737e5e81e78947

#16 Updated by Tom Clegg over 1 year ago

It looks like you went to some trouble to sigkill arv-mount only if it hasn't exited 9 seconds after we call arv-mount --unmount. Is there a benefit to waiting longer, after the unmount command exits?

#17 Updated by Tom Clegg over 1 year ago

(still curious about the waiting thing but) LGTM, thanks

#18 Updated by Anonymous over 1 year ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:4b23309ecedb0010d907a6930844c6928b3755b6.

#19 Updated by Peter Amstutz over 1 year ago

Tom Clegg wrote:

It looks like you went to some trouble to sigkill arv-mount only if it hasn't exited 9 seconds after we call arv-mount --unmount. Is there a benefit to waiting longer, after the unmount command exits?

Well, the idea is to give the happy path a chance (unmount exits normally, arv-mount exits normally) before we start throwing around SIGKILLs. Especially with a lazy unmount, there can be a window between when the FUSE connection has been severed and when arv-mount notices and shuts down on its own. If we hit it with a SIGKILL immediately, that seems like it might produce spurious "arv-mount exited 137" (128 + SIGKILL) entries in the logs.
