Bug #12538 (Closed)
crunch-run failing to terminate after complete
Description
/tmp/slurm-43045.out
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:25:19.562765000Z Executing container 'su92l-dz642-nwm0iyvzj9s41sb'
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:25:19.562833200Z Executing on host 'compute2.su92l.arvadosapi.com'
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:25:19.585576000Z Fetching Docker image from collection '816fafa3f6c232ba55dde9f73b2e8e0a+1435'
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:25:19.618615000Z Using Docker image id 'sha256:bf3a9bd0230a9f581d5a5c8f7816a529e9b84da644766b79bd926b7594fd52e4'
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:25:19.619444200Z Loading Docker image from keep
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:28:38.601230800Z Docker response: {"stream":"Loaded image ID: sha256:bf3a9bd0230a9f581d5a5c8f7816a529e9b84da644766b79bd926b7594fd52e4\n"}
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:28:53.393393600Z Creating Docker container
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:28:53.449673000Z Attaching container streams
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:28:53.727443500Z Starting Docker container id 'e20e8603bc7a916cb0169ba8f8440842bd01fe8c46aeed73678f314d39b01054'
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T12:28:54.182723200Z Waiting for container to finish
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:05:36.140991200Z Container exited with code: 0
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:05:43.607339600Z Uploading indexed/filtered_ADNI_021_S_4245.raw_variants.chr1.gvcf.gz (199723384 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:05:45.021751100Z Uploading indexed/filtered_ADNI_021_S_4245.raw_variants.chr1.gvcf.gz.tbi (93262 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:05:45.022187400Z Uploading indexed/filtered_ADNI_021_S_4245.raw_variants.chr10.gvcf.gz (110956864 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:05:46.166695200Z Uploading indexed/filtered_ADNI_021_S_4245.raw_variants.chr10.gvcf.gz.tbi (53918 bytes)
...
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:07:56.511282300Z Uploading stage/ADNI_021_S_4245/035d.fj.gz (305872 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:07:56.512564500Z Uploading stage/ADNI_021_S_4245/035d.fj.gz.gzi (15320 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:07:56.512861700Z Uploading stage/ADNI_021_S_4245/035d.pa (61200691 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:07:57.622218300Z Uploading stage/ADNI_021_S_4245/035e.fj.gz (8877 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:07:57.623806300Z Uploading stage/ADNI_021_S_4245/035e.fj.gz.gzi (8 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:07:57.624148100Z Uploading stage/ADNI_021_S_4245/035e.pa (33168 bytes)
su92l-dz642-nwm0iyvzj9s41sb 2017-11-01T14:08:04.146936600Z Complete
2017/11/01 14:08:17 StopContainer failed: Error response from daemon: No such container: e20e8603bc7a916cb0169ba8f8440842bd01fe8c46aeed73678f314d39b01054
arv-mount is still running:
ps ax | grep arv-mount
 4225 ?  Sl  21:52 /usr/bin/python2.7 /usr/bin/arv-mount --foreground --allow-other --read-write --file-cache 4294967296 --mount-by-pdh by_id /tmp/keep694723160
keep mount is working, I can browse it.
There are no Docker containers running.
Files
Updated by Peter Amstutz about 7 years ago
- File slurm-43045.out added
Updated by Peter Amstutz about 7 years ago
Tentatively, it looks like we are expecting arv-mount to terminate, but it never does:
goroutine 1 [chan receive, 217 minutes]:
main.(*ContainerRunner).CleanupDirs(0xc4204c6000)
	/tmp/tmp.VtuKmcBNQm/src/git.curoverse.com/arvados.git/services/crunch-run/crunchrun.go:1148 +0x328
main.(*ContainerRunner).Run(0xc4204c6000, 0x0, 0x0)
	/tmp/tmp.VtuKmcBNQm/src/git.curoverse.com/arvados.git/services/crunch-run/crunchrun.go:1388 +0x637
main.main()
	/tmp/tmp.VtuKmcBNQm/src/git.curoverse.com/arvados.git/services/crunch-run/crunchrun.go:1488 +0x841
goroutine 231 [syscall, 317 minutes]:
syscall.Syscall6(0xf7, 0x1, 0x1081, 0xc42028dda0, 0x1000004, 0x0, 0x0, 0xc42028df9c, 0xc42028df38, 0xc42001b300)
	/usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
os.(*Process).blockUntilWaitable(0xc4201a2c30, 0xa, 0xc42028df19, 0x4)
	/usr/local/go/src/os/wait_waitid.go:28 +0xa5
os.(*Process).wait(0xc4201a2c30, 0x480e72, 0x59f9bd43, 0x3574f024)
	/usr/local/go/src/os/exec_unix.go:22 +0x4d
os.(*Process).Wait(0xc4201a2c30, 0xc42045e070, 0xc420155b47, 0x5c)
	/usr/local/go/src/os/exec.go:115 +0x2b
os/exec.(*Cmd).Wait(0xc42049a6e0, 0xc42045e070, 0x0)
	/usr/local/go/src/os/exec/exec.go:435 +0x62
main.(*ContainerRunner).ArvMountCmd.func2(0xc4204c6000, 0xc420080090)
	/tmp/tmp.VtuKmcBNQm/src/git.curoverse.com/arvados.git/services/crunch-run/crunchrun.go:320 +0x2e
created by main.(*ContainerRunner).ArvMountCmd
	/tmp/tmp.VtuKmcBNQm/src/git.curoverse.com/arvados.git/services/crunch-run/crunchrun.go:322 +0x476
Updated by Peter Amstutz about 7 years ago
So crunch-dispatch-slurm knows to scancel containers which are "Complete" but still in squeue, which has the effect of compensating for the fact that crunch-run deadlocks.
Updated by Peter Amstutz about 7 years ago
Tom:
seems like crunch-run needs to be more aggressive about killing arv-mount (perhaps abandoning it after some timeout?)
Seems like it. Need something like:
- Try "fusermount -u -z" several times, with a timeout
- If arv-mount doesn't exit on its own, kill arv-mount and then run "arv-mount --unmount"
Updated by Peter Amstutz about 7 years ago
- Status changed from New to In Progress
- Assigned To set to Peter Amstutz
Updated by Peter Amstutz about 7 years ago
Updated by Tom Clegg about 7 years ago
Looking at 12538-crunchrun-kill-arvmount @ e16989ed7897833f14766341838abd4e382a32a1
Why call "fusermount -u path" first? If that is the first step in the best "unmount" recipe, shouldn't "arv-mount --unmount" be doing it instead of crunch-run? But we spent time on this in #11209. The fuse docs say /sys/fs/.../abort is the way that "always works". Do we need to revisit?
With that, tryUnmount can look more like just tryUnmount(path). Might also be neater if it logs directly to runner.CrunchLog, instead of returning an error for CleanupDirs() to log. The word "try" suggests it's not really an error when it doesn't succeed.
If umount.Start() fails, tryUnmount should return right away rather than wait for arv-mount to exit by coincidence.
Instead of "if tmpdir == runner.ArvMountPoint { continue }" could we just avoid adding ArvMountPoint to CleanupTempDir in the first place?
The deferred "crunch-run finished" message looks like it's too late: by then, we've already closed the re-opened runner.CrunchLog. Maybe this should be "defer runner.CrunchLog.Printf(...)" in UpdateContainerFinal() instead?
Nit: "<-time.After(9*time.Second)" would be a bit less verbose than NewTimer(). The 8 and 9 second figures should probably be consts like a, b=a+1.
Might be nice to log the unmount command being run, just before running it.
Updated by Peter Amstutz about 7 years ago
Tom Clegg wrote:
Looking at 12538-crunchrun-kill-arvmount @ e16989ed7897833f14766341838abd4e382a32a1
Why call "fusermount -u path" first? If that is the first step in the best "unmount" recipe, shouldn't "arv-mount --unmount" be doing it instead of crunch-run? But we spent time on this in #11209. The fuse docs say /sys/fs/.../abort is the way that "always works". Do we need to revisit?
I prefer to try to unmount it "nicely" before hitting it with a hammer. As I understand it, there are several possible unmount behaviors:
- "fusermount -u" → only unmounts if not in use; fails if the mount is busy
- "fusermount -u -z" → lazy unmount; doesn't necessarily fail if busy. It detaches the mount from the parent FS tree, but open FDs remain readable (which can cause strange behavior, because paths into the mount are no longer resolvable); the daemon exits when there are no more users
- abort → hard unmount; the daemon and clients are closed immediately, and anything accessing the mount gets an error
For compute jobs, we know we are done using the mount, so any of the above should work. However, for user mounts, something could still be happening in the background, so it seems better to ask the mount to shut down nicely. So if "arv-mount --unmount" is supposed to be an all-purpose tool, it should start by being nice before being mean. That said, we can simplify things by making crunch-run use "arv-mount --unmount" only.
Updated by Tom Clegg about 7 years ago
Offering two variants of unmount (nice/mean, soft/hard, normal/force?) sounds like a good feature. Added #12566. Currently arv-mount only offers the mean kind, which is more appropriate for crunch-run's use case, so I think we agree crunch-run doesn't need to add any custom tricks of its own here.
Updated by Peter Amstutz about 7 years ago
Tom Clegg wrote:
Looking at 12538-crunchrun-kill-arvmount @ e16989ed7897833f14766341838abd4e382a32a1
With that, tryUnmount can look more like just tryUnmount(path). Might also be neater if it logs directly to runner.CrunchLog, instead of returning an error for CleanupDirs() to log. The word "try" suggests it's not really an error when it doesn't succeed.
Moved the code back into CleanupDirs(), as it now only makes a single call to "arv-mount --unmount".
If umount.Start() fails, tryUnmount should return right away rather than wait for arv-mount to exit by coincidence.
Done. Also now sends kill signal to the umount process if it takes too long (but still doesn't block on it.)
Instead of "if tmpdir == runner.ArvMountPoint { continue }" could we just avoid adding ArvMountPoint to CleanupTempDir in the first place?
Done.
The deferred "crunch-run finished" message looks like it's too late: by then, we've already closed the re-opened runner.CrunchLog. Maybe this should be "defer runner.CrunchLog.Printf(...)" in UpdateContainerFinal() instead?
I refactored this; there doesn't appear to be any reason to have multiple defer blocks in that method, so now it is all done in the one exit handler.
Nit: "<-time.After(9*time.Second)" would be a bit less verbose than NewTimer(). The 8 and 9 second figures should probably be consts like a, b=a+1.
Done.
Might be nice to log the unmount command being run, just before running it.
It logs the unmount command on error.
12538-crunchrun-kill-arvmount @ f830a9450430422f6aaf6ec36bb760db530f4376
Updated by Tom Clegg about 7 years ago
Peter Amstutz wrote:
Tom Clegg wrote:
The deferred "crunch-run finished" message looks like it's too late: by then, we've already closed the re-opened runner.CrunchLog. Maybe this should be "defer runner.CrunchLog.Printf(...)" in UpdateContainerFinal() instead?
I refactored this; there doesn't appear to be any reason to have multiple defer blocks in that method, so now it is all done in the one exit handler.
Hm. I'm not sure refactoring the shutdown code was a good idea. I'm not sure I'm seeing all of the consequences but it looks like we're now calling CommitLogs() even if final state is Queued. Won't that uselessly create a collection and then fail in UpdateContainerFinal() because you can't set a non-nil log when changing state to Queued?
Having a separate defer runner.CleanupDirs() made some sense in that it does local cleanup that we need regardless of what happens in the big "finalize container" defer.
For the sake of stability, could we leave this part as it was for now?
Might be nice to log the unmount command being run, just before running it.
It logs the unmount command on error.
I noticed that, but I was thinking sometimes it's also nice to see how long an operation like this took, what command claimed to succeed when investigating a case where it didn't truly succeed, etc. Meh.
Updated by Peter Amstutz about 7 years ago
Tom Clegg wrote:
Hm. I'm not sure refactoring the shutdown code was a good idea. I'm not sure I'm seeing all of the consequences but it looks like we're now calling CommitLogs() even if final state is Queued. Won't that uselessly create a collection and then fail in UpdateContainerFinal() because you can't set a non-nil log when changing state to Queued?
Having a separate defer runner.CleanupDirs() made some sense in that it does local cleanup that we need regardless of what happens in the big "finalize container" defer.
For the sake of stability, could we leave this part as it was for now?
Alright. Partly reverted to old flow, but moved a little more cleanup behavior (stopping signals, flushing cleanup-related log messages) into the second "defer".
Note that this means cleanup failures won't appear in the permanent logs (that would require doing the cleanup before calling CommitLogs()).
I noticed that, but I was thinking sometimes it's also nice to see how long an operation like this took, what command claimed to succeed when investigating a case where it didn't truly succeed, etc. Meh.
Sure. Now it always logs the unmount.
Rebased on master.
Now sends kill signal to the "arv-mount" and "arv-mount --unmount" processes on timeout, but doesn't wait for them to terminate.
12538-crunchrun-kill-arvmount @ 47508624a359de86a402030e67737e5e81e78947
Updated by Tom Clegg about 7 years ago
It looks like you went to some trouble to sigkill arv-mount only if it hasn't exited 9 seconds after we call "arv-mount --unmount". Is there a benefit to waiting longer, after the unmount command exits?
Updated by Tom Clegg about 7 years ago
(still curious about the waiting thing but) LGTM, thanks
Updated by Anonymous about 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:4b23309ecedb0010d907a6930844c6928b3755b6.
Updated by Peter Amstutz about 7 years ago
Tom Clegg wrote:
It looks like you went to some trouble to sigkill arv-mount only if it hasn't exited 9 seconds after we call "arv-mount --unmount". Is there a benefit to waiting longer, after the unmount command exits?
Well, the idea is to give the happy path a chance (unmount exits normally, arv-mount exits normally) before we start throwing around SIGKILLs. Especially with a lazy unmount, there can be a window between when the FUSE connection has been severed and when arv-mount notices and shuts down on its own. If we hit it with a SIGKILL immediately, that seems like it might produce a spurious "arv-mount exited 137" in the logs.