Bug #8288
Status: Closed
arv-mount / crunchstat in a crunch job fails to exit because reasons
Description
srun / arv-mount / crunchstat doesn't exit. Apparently, although the docker container has exited and `docker run` has returned, the docker daemon still holds an open file on the keep mount that `arv-mount` created.
# ps auxwww | awk '$1=="crunch"'
crunch 8848 0.0 0.0 608616 31096 ? Sl Jan22 0:01 /usr/bin/python2.7 /usr/local/bin/arv-mount --read-write --mount-by-pdh=by_pdh --mount-tmp=tmp --crunchstat-interval=10 --allow-other /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep --exec crunchstat -cgroup-root=/sys/fs/cgroup -cgroup-parent=docker -cgroup-cid=/data/crunch-tmp/crunch-job/z8ta6-ot0gb-v0p7tcarqimrnmj-0.cid -poll=10000 /usr/bin/docker run --name=z8ta6-ot0gb-v0p7tcarqimrnmj-0 --attach=stdout --attach=stderr --attach=stdin -i --cidfile=/data/crunch-tmp/crunch-job/z8ta6-ot0gb-v0p7tcarqimrnmj-0.cid --sig-proxy --memory=188032493k --memory-swap=190129641k --volume=/data/crunch-tmp/crunch-job/src:/data/crunch-tmp/crunch-job/src:ro --volume=/data/crunch-tmp/crunch-job/opt:/data/crunch-tmp/crunch-job/opt:ro --volume=/data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep/by_pdh:/keep:ro --volume=/data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep/tmp:/keep_tmp --volume=/tmp --env=TASK_SEQUENCE=0 --env=TASK_KEEPMOUNT=/keep --env=JOB_PARAMETER_INPUTS_COLLECTION=e8ccf6ac8348615a9873149f4ff07353+14726 --env=CRUNCH_SRC_COMMIT=15ee3e583ff9eeb273b52826c8944b43ae21e8bd --env=TASK_QSEQUENCE=0 --env=CRUNCH_INSTALL=/data/crunch-tmp/crunch-job/opt --env=CRUNCH_GIT_ARCHIVE_HASH=085e92f18ba93c6b8eb2507dbfbdf929 --env=CRUNCH_REFRESH_TRIGGER=/tmp/crunch_refresh_trigger --env=ARVADOS_API_TOKEN=2xb1o74nhxb7h2k2fculw4k6xahvss7v1ux8lhmd52conglznk --env=JOB_PARAMETER_SKIP_SQ_SN_REGEX=_decoy$ --env=CRUNCH_WORK=/data/crunch-tmp/crunch-job/work --env=CRUNCH_TMP=/data/crunch-tmp/crunch-job --env=TASK_TMPDIR=/tmp/crunch-job-task-work/humgen-03-02.1 --env=JOB_UUID=z8ta6-8i9sb-fsm22dece9hd043 --env=CRUNCH_JOB_UUID=z8ta6-8i9sb-fsm22dece9hd043 --env=TASK_SLOT_NUMBER=1 --env=CRUNCH_SRC_URL=/var/lib/arvados/internal.git --env=TASK_SLOT_NODE=humgen-03-02 --env=JOB_SCRIPT=bcftools-exp-gvcf-mpileup-cram-chunked.py --env=CRUNCH_NODE_SLOTS=1 --env=JOB_PARAMETER_REFERENCE_COLLECTION=a83bd4e5a26a64612322f21515d93bab+6190 --env=JOB_PARAMETER_GENOME_CHUNKS=400 --env=CRUNCH_JOB_DOCKER_BIN=/usr/bin/docker --env=TASK_WORK=/tmp/crunch-job-task-work/humgen-03-02.1 --env=TASK_KEEPMOUNT_TMP=/keep_tmp --env=ARVADOS_API_HOST=api.arvados.sanger.ac.uk --env=JOB_WORK=/tmp/crunch-job-work --env=TASK_UUID=z8ta6-ot0gb-v0p7tcarqimrnmj --env=CRUNCH_SRC=/data/crunch-tmp/crunch-job/src --env=HOME=/tmp/crunch-job-task-work/humgen-03-02.1 bfdf5bfbb2858302a6b82242018f6d39aade1707e25541710ff85b804c35056c /bin/sh -c python -c "from pkg_resources import get_distribution as get; print \"Using Arvados SDK version\", get(\"arvados-python-client\").version">&2 2>/dev/null; mkdir -p "/tmp/crunch-job-work" "/tmp/crunch-job-task-work/humgen-03-02.1" && if which stdbuf >/dev/null ; then exec stdbuf --output=0 --error=0 \/data\/crunch\-tmp\/crunch\-job\/src\/crunch_scripts\/bcftools\-exp\-gvcf\-mpileup\-cram\-chunked\.py ; else exec \/data\/crunch\-tmp\/crunch\-job\/src\/crunch_scripts\/bcftools\-exp\-gvcf\-mpileup\-cram\-chunked\.py ; fi

# pstree -Aap 8848
arv-mount,8848 /usr/local/bin/arv-mount --read-write --mount-by-pdh=by_pdh --mount-tmp=tmp --crunchstat-interval=10 --allow-other /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep--e
  |-{arv-mount},8860
  `-{arv-mount},8863

# fuser -v -m /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep
USER PID ACCESS COMMAND
/data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep:
root kernel mount /data
crunch 8848 ..c.. arv-mount
root 37715 F.... docker

# ps auxwww | grep 37715 | grep -v grep
root 37715 0.1 0.0 12555452 94712 ? Ssl 2015 166:27 /usr/bin/docker -d --graph=/data/docker

# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES

# docker version
Client version: 1.7.1
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 786b29d
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 786b29d
OS/Arch (server): linux/amd64

# lsof -p 37715
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
docker 37715 root cwd DIR 8,6 4096 2 /
docker 37715 root rtd DIR 8,6 4096 2 /
docker 37715 root txt REG 8,6 16296881 1977972 /usr/bin/docker
docker 37715 root 0u CHR 1,3 0t0 1029 /dev/null
docker 37715 root 1u CHR 136,2 0t0 5 /dev/pts/2
docker 37715 root 2u CHR 136,2 0t0 5 /dev/pts/2
docker 37715 root 3u unix 0xffff881881334280 0t0 48220470 /var/run/docker.sock
docker 37715 root 4u unix 0xffff882ff7a0f700 0t0 1501908 /var/run/docker.sock
docker 37715 root 5r CHR 1,9 0t0 1034 /dev/urandom
docker 37715 root 6u 0000 0,9 0 6847 anon_inode
docker 37715 root 9u unix 0xffff881ff0922300 0t0 48310485 /var/run/docker.sock
docker 37715 root 11u unix 0xffff881ff0927a80 0t0 48326780 /var/run/docker.sock
docker 37715 root 12u unix 0xffff88191467b800 0t0 48698892 /var/run/docker.sock
docker 37715 root 14u unix 0xffff880106038000 0t0 48603069 /var/run/docker.sock
docker 37715 root 16u unix 0xffff882005fa6c80 0t0 48334610 /var/run/docker.sock
docker 37715 root 17u REG 252,0 1996800 366018572 /data/docker/linkgraph.db
docker 37715 root 19u REG 252,0 1996800 366018572 /data/docker/linkgraph.db
docker 37715 root 23u unix 0xffff881ff0922d80 0t0 48247780 /var/run/docker.sock
docker 37715 root 24u unix 0xffff88191467e580 0t0 48663201 /var/run/docker.sock
docker 37715 root 25u unix 0xffff88010603c980 0t0 48683402 /var/run/docker.sock
docker 37715 root 26u unix 0xffff881ef218b100 0t0 48310488 /var/run/docker.sock
docker 37715 root 27u unix 0xffff882005fa6580 0t0 48309653 /var/run/docker.sock

# stat /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep
File: `/data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep'
Size: 4096 Blocks: 8 IO Block: 4096 directory
Device: fc00h/64512d Inode: 262078526 Links: 2
Access: (0755/drwxr-xr-x) Uid: (15324/ crunch) Gid: ( 1593/ arvados)
Access: 2016-01-22 21:49:38.729986521 +0000
Modify: 2016-01-22 21:49:38.441989497 +0000
Change: 2016-01-22 21:49:38.441989497 +0000
Birth: -

# stat -f /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep
File: "/data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep"
ID: 6fef924393f3e2f8 Namelen: 255 Type: ext2/ext3
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 17485339344 Free: 16349728046 Available: 16174171418
Inodes: Total: 1097203712 Free: 1096592099

# uname -a
Linux humgen-03-02 3.13.0-63-generic #104~precise1-Ubuntu SMP Tue Aug 18 17:03:03 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

# readlink /proc/37715/fd/6
anon_inode:[eventpoll]

# fusermount -u /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep
fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep: Invalid argument

# mount -t fuse

# umount /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep
umount: /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep: not mounted

# rm -rf /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep

# pstree -Aasp 8848
init,1
  `-slurmstepd,8845
      `-arv-mount,8848 /usr/local/bin/arv-mount --read-write --mount-by-pdh=by_pdh --mount-tmp=tmp --crunchstat-interval=10 --allow-other ...
          |-{arv-mount},8860
          `-{arv-mount},8863

# strace -f -p 8848
Process 8848 attached with 3 threads - interrupt to quit
[pid 8863] futex(0x7fcc728c7a84, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid 8860] select(0, NULL, NULL, NULL, {1, 754286} <unfinished ...>
[pid 8848] futex(0x32c6e80, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 8860] <... select resumed> ) = 0 (Timeout)
[pid 8860] write(2, "crunchstat: keepcalls 0 put 0 ge"..., 74) = 74
[pid 8860] write(2, "crunchstat: net:keep0 0 tx 0 rx "..., 70) = 70
[pid 8860] write(2, "crunchstat: keepcache 0 hit 0 mi"..., 76) = 76
[pid 8860] write(2, "crunchstat: fuseops 0 write 0 re"..., 78) = 78
[pid 8860] write(2, "crunchstat: blkio:0:0 0 write 0 "..., 80) = 80
[pid 8860] select(0, NULL, NULL, NULL, {10, 0}) = 0 (Timeout)
[pid 8860] write(2, "crunchstat: keepcalls 0 put 0 ge"..., 74) = 74
[pid 8860] write(2, "crunchstat: net:keep0 0 tx 0 rx "..., 70) = 70
[pid 8860] write(2, "crunchstat: keepcache 0 hit 0 mi"..., 76) = 76
[pid 8860] write(2, "crunchstat: fuseops 0 write 0 re"..., 78) = 78
[pid 8860] write(2, "crunchstat: blkio:0:0 0 write 0 "..., 80) = 80
[pid 8860] select(0, NULL, NULL, NULL, {10, 0}^C <unfinished ...>
Process 8848 detached
Process 8860 detached
Process 8863 detached

# kill 8860
Updated by Joshua Randall almost 9 years ago
Further testing shows that docker is not the culprit here. I instrumented a version of arvados_fuse that monitors all operations calls (tracking the pid of the calling process), and no operations were performed by docker. It turns out that I was misusing the `fuser -m` option: by that point the fuse filesystem had already been unmounted, so the mount point was just a directory on the scratch filesystem (on which the docker graph database also resides, hence the docker daemon having it open).
There is still an issue in that arv-mount fails to exit cleanly (entirely on its own, it would seem). I'm currently experimenting with adding some additional code to _run_exec to make sure that it exits once the exec'd command completes.
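For context, the kind of instrumentation described above can be approximated by wrapping every handler on the FUSE operations object so each call gets logged. This is only a hypothetical sketch (`trace_operations` is an illustrative name, not the actual arvados_fuse code), and whether the calling pid is available depends on what request context llfuse hands to each handler:

```python
import functools
import logging

log = logging.getLogger('fuse-op-trace')


def trace_operations(ops):
    """Wrap every public method of a FUSE operations object so each call is
    logged before being passed through.  Illustrative only."""
    for name in dir(ops):
        if name.startswith('_'):
            continue
        method = getattr(ops, name)
        if not callable(method):
            continue

        def make_wrapper(opname, inner):
            @functools.wraps(inner)
            def wrapper(*args, **kwargs):
                # The caller's pid would come from a request context, if the
                # llfuse handler in question receives one.
                log.debug('fuse op %s args=%r', opname, args)
                return inner(*args, **kwargs)
            return wrapper

        setattr(ops, name, make_wrapper(name, method))
    return ops
```

Wrapping the operations object this way before mounting makes it easy to see which processes, if any, are still issuing requests against the filesystem.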
Updated by Joshua Randall almost 9 years ago
- Subject changed from arv-mount / crunchstat in a crunch job fails to exit because docker daemon has file open for writing (according to fuser) to arv-mount / crunchstat in a crunch job fails to exit because reasons
Updated by Tom Clegg almost 9 years ago
arvados.events.PollClient (when it can't connect to websocket) contributes to this problem:
- I ran into both of these problems on a dev cluster last night.
- IIRC, Josh mentioned the frequency of "arv-mount gets stuck" seemed to be related to the load on the whole cluster, not just the number of tasks running on a single compute node ... and that high cluster load seemed to be causing websocket failures.
As Josh pointed out in IRC, it looks like we're using operations.destroy() wrong: the llfuse example suggests we should just call llfuse.close() after the filesystem is unmounted and llfuse.main() returns, and let llfuse call our operations.destroy() by itself. (This might explain why we've had to hack mutexes into our shutdown methods: perhaps we wouldn't have to accommodate multiple overlapping shutdown events if we stop going around llfuse's back and calling operations.destroy().)
We start llfuse.main() in a non-daemon thread, and then call operations.destroy() from a different thread. Calling llfuse.close() (not destroy()), and calling it from the same thread that called llfuse.main(), seems like good form if nothing else. And I'm not sure there's any reason that thread should not be a daemon thread: if we're in --exec mode and our main thread ends, that means our child process has exited and we have called fusermount -u, so the expected behavior is for all subsequent IO operations to fail anyway.
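For reference, the pattern the llfuse example follows looks roughly like this: a minimal sketch with a no-op Operations subclass and an illustrative mountpoint, where llfuse invokes operations.destroy() itself as part of teardown rather than us calling it:

```python
import llfuse


class Operations(llfuse.Operations):
    """Minimal no-op filesystem, just to show the init/main/close shape."""
    pass


def mount(mountpoint):
    operations = Operations()
    llfuse.init(operations, mountpoint, ['fsname=sketchfs'])
    try:
        # Blocks until the filesystem is unmounted (e.g. fusermount -u)
        # or the session is told to exit.
        llfuse.main()
    finally:
        # Called in the same thread, after main() returns; llfuse takes
        # care of calling operations.destroy() during its own cleanup.
        llfuse.close()


if __name__ == '__main__':
    mount('/tmp/sketchfs')
```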
Updated by Tom Clegg almost 9 years ago
- When someone (like arv-mount) closes its PollClient event listener, don't wait for the current set of "get logs" API calls to complete.
- In --exec mode, mark the llfuse thread as a daemon thread so Python doesn't stay alive waiting for it to finish after main has decided to exit.
- In --exec mode, if --unmount-timeout (default 2) seconds pass after "fusermount -u -z" and the llfuse thread is still running, log a warning, call operations.destroy(), and exit. (Otherwise, don't call operations.destroy() at all, just assume llfuse has done the necessary cleanup.)
Would be good to add an "exit within unmount_timeout+1 seconds even if various things are conspiring to keep threads alive" test case.
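Put together, the proposed --exec shutdown sequence would look roughly like the sketch below (illustrative names only, assuming llfuse.init() has already been called for the mountpoint; this is a sketch of the proposal, not the actual arv-mount implementation):

```python
import logging
import subprocess
import threading

import llfuse

logger = logging.getLogger('arv-mount-sketch')


def run_exec_and_unmount(mountpoint, argv, operations, unmount_timeout=2.0):
    """Run the --exec child, lazily unmount, and refuse to hang around."""
    # Daemon thread: Python can exit even if llfuse.main() never returns.
    llfuse_thread = threading.Thread(target=llfuse.main)
    llfuse_thread.daemon = True
    llfuse_thread.start()

    exit_code = subprocess.call(argv)                        # the --exec child process

    subprocess.call(['fusermount', '-u', '-z', mountpoint])  # lazy unmount
    llfuse_thread.join(timeout=unmount_timeout)
    if llfuse_thread.is_alive():
        logger.warning('llfuse thread still alive %gs after unmount; '
                       'cleaning up and exiting anyway', unmount_timeout)
        operations.destroy()                                 # last resort, per the note above
    return exit_code
```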
Updated by Brett Smith almost 9 years ago
- Target version set to 2016-02-17 Sprint
Updated by Brett Smith almost 9 years ago
- Category set to FUSE
- Assigned To set to Tom Clegg
Updated by Tom Clegg almost 9 years ago
Split branch into 8288-arv-mount-deadlock and 8288-poll-client-close-timeout
Updated by Peter Amstutz almost 9 years ago
Reviewing 8288-arv-mount-deadlock:
The strategy here seems to be (a) use lazy unmount and (b) make the llfuse main thread a daemon thread so it doesn't prevent the runtime from exiting.
Some notes:
- The llfuse main thread calls fuse_session_loop_mt() in libfuse which spins up more pthreads. I believe these will die quietly when the process as a whole exits.
- The llfuse main thread normally terminates in response to an exit message from the kernel. So if we don't get the exit message, the kernel doesn't think it is done with the mount point (but we knew that already).
- close() calls fuse_session_destroy() which calls op.destroy(), so I don't think we should ever be calling operations.destroy() explicitly.
- It seems very likely to be a race condition to call close() without first stopping the main loop, so we might be signing up for random segfaults.
- The correct way to terminate the main thread seems to be to call fuse_session_exit(), but llfuse doesn't expose that.
- At best, this seems very likely to result in stale mount points in a weird state.
Updated by Tom Clegg almost 9 years ago
Peter Amstutz wrote:
The strategy here seems to be (a) use lazy unmount and (b) make the llfuse main thread a daemon thread so it doesn't prevent the runtime from exiting.
Right. Also, stop going behind llfuse's back and calling operations.destroy() immediately after fusermount -u and possibly getting there first, which might have been causing subtle problems of its own.
(I suspect the most common reason the llfuse thread was staying open was that our destroy() method called PollClient.close(), which could take a long time -- addressed by the 8288-poll-client-close-timeout branch -- but surely there are also other ways to keep the llfuse thread alive after the --exec proc has exited, and we want to cut those off too.)
- The llfuse main thread calls fuse_session_loop_mt() in libfuse which spins up more pthreads. I believe these will die quietly when the process as a whole exits.
- The llfuse main thread normally terminates in response to an exit message from the kernel. So if we don't get the exit message, the kernel doesn't think it is done with the mount point (but we knew that already).
- close() calls fuse_session_destroy() which calls op.destroy(), so I don't think we should ever be calling operations.destroy() explicitly.
Right. I moved it to "only if the normal way doesn't work normally" because it's clear that's not the right way. But yes, terminating the thread by exiting main is probably better than unsafe-cleanup-then-exit. So, removed the call to operations.destroy() in da7eabf.
- It seems very likely to be a race condition to call close() without first stopping the main loop, so we might be signing up for random segfaults.
Yes, I assume this is why the old "close() from a different thread" code was commented out. The new code is straight from the llfuse example: close() is just what you're supposed to do after main() returns.
- The correct way to terminate the main thread seems to be to call fuse_session_exit(), but llfuse doesn't expose that.
I figure we have to make sure "fusermount -u" works properly (that's the correct way that's exposed to users) so I'm not inclined to look too hard for other ways.
- At best, this seems very likely to result in stale mount points in a weird state.
So far I have not seen a mount point still connected after "fusermount -u -z". But yes, getting rid of operations.destroy() should reduce the possibilities to "normal", "process exited, anything still open gets an IO error", "process won't die, llfuse is still doing its thing without interference"... no more "llfuse is still running after we poked its operations in the eye" option.
If there is still some case where the kernel (or python) keeps our process alive even after we try to exit the main thread, at least we'll have a log message as a clue, and at least we won't have poked llfuse in the eye.
Updated by Peter Amstutz almost 9 years ago
fuse_session_exit() would let us gracefully terminate the main libfuse threads instead of just abandoning them, but as noted, llfuse doesn't expose that. I'll see about filing a feature request.
8288-arv-mount-deadlock @ da7eabf looks good to me.
Updated by Joshua Randall almost 9 years ago
We could also tell the kernel to abort the connection (and thus terminate the fuse/llfuse process) by either `umount -f` on the mountpoint (which should work after `fusermount -u` but not after `fusermount -u -z`) or signalling the "abort" interface in the fusectl (/sys) filesystem.
Those are the two most effective deadlock-avoidance mechanisms suggested in the linux kernel documentation for fuse:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/fuse.txt#n204
I think the implication of this documentation is that there are situations arv-mount can get into in which this will be the only way to kill off the fuse filesystem.
Perhaps we could follow up on the current `fusermount -u` call with a .join() with timeout on the fuse main thread, and if it doesn't exit in the time allotted, hit it harder with a `umount -f` call.
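A sketch of that follow-up, assuming the llfuse main loop is running in a hypothetical `llfuse_thread` and that the process is allowed to run `umount -f` (which normally requires root):

```python
import subprocess


def unmount_or_force(mountpoint, llfuse_thread, timeout=2.0):
    """Try a normal fusermount -u first; if the llfuse main thread is still
    running after `timeout` seconds, escalate to umount -f."""
    subprocess.call(['fusermount', '-u', mountpoint])
    llfuse_thread.join(timeout=timeout)
    if llfuse_thread.is_alive():
        # Forcing the unmount makes the kernel abort the FUSE connection,
        # which should let the main loop return.
        subprocess.call(['umount', '-f', mountpoint])
        llfuse_thread.join(timeout=timeout)
```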
Updated by Peter Amstutz almost 9 years ago
Joshua Randall wrote:
We could also tell the kernel to abort the connection (and thus terminate the fuse/llfuse process) by either `umount -f` on the mountpoint (which should work after `fusermount -u` but not after `fusermount -u -z`) or signalling the "abort" interface in the fusectl (/sys) filesystem.
umount -f definitely works, but requires sudo.
I didn't know about "abort", though. It seems that the permissions are associated with the user that is running the fuse FS, so no sudo is needed. However, the mounts are identified by a numeric id, and I'm not sure where that id comes from.
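For what it's worth, a sketch of poking the abort interface from Python; the assumption here (which would need verifying, per the question above) is that the directory name under /sys/fs/fuse/connections/ is the minor device number of the FUSE mount:

```python
import os


def abort_fuse_connection(mountpoint):
    """Ask the kernel to abort the FUSE connection behind mountpoint.

    Assumes the connection id is the minor device number of the mount's
    st_dev; treat that as an assumption to verify, not established fact.
    """
    conn_id = os.minor(os.stat(mountpoint).st_dev)
    with open('/sys/fs/fuse/connections/%d/abort' % conn_id, 'w') as f:
        f.write('1')
```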
Updated by Peter Amstutz almost 9 years ago
Reviewing 8288-poll-client-close-timeout
In EventClient:
Calling super(EventClient, self).close() starts a graceful shutdown (WebSocketClient tells the server that we're signing off and waits for an acknowledgement; the acknowledgement is what signals closed()), but if we don't wait for it, it won't happen. Suggest the default timeout should be greater than zero, maybe 1s.
In PollClient:
Similar suggestion for default timeout to be greater than zero. Once _closing.set() is called, the thread should exit almost immediately, except with a timeout of zero the thread might not even get a chance to run to completion.
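The close-with-timeout shape being discussed is roughly the following (a hypothetical sketch, not the actual arvados.events.PollClient code):

```python
import threading


class Poller(threading.Thread):
    """Background thread that polls for new log records until closed."""

    def __init__(self, poll_time=15):
        super(Poller, self).__init__()
        self.daemon = True
        self.poll_time = poll_time
        self._closing = threading.Event()

    def run(self):
        while not self._closing.is_set():
            self._poll_once()                    # e.g. one "get logs" API call
            self._closing.wait(self.poll_time)

    def _poll_once(self):
        pass                                     # placeholder for the real API call

    def close(self, timeout=0):
        """Tell the thread to stop and wait at most `timeout` seconds for it."""
        self._closing.set()
        self.join(timeout=timeout)               # with timeout=0 this returns at once
```

With timeout=0 the caller returns immediately while the daemon thread finishes (or is abandoned at interpreter exit); with a positive timeout it gets a bounded wait for a clean stop.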
Updated by Tom Clegg almost 9 years ago
I started out having non-zero default timeouts, but I couldn't think of any advantage.
For example, with EventClient, why do we care whether the server acknowledged our "close"? Either way, the connection will close when the server closes it, it times out, or our process exits.
If there are bugs in ws4py that make it crash during exit hooks if the server hasn't acknowledged shutdown yet, it seems like larger timeouts here would make the resulting problems rarer and harder to reproduce, but wouldn't actually solve them.
Similarly, in PollClient, I don't see an advantage in waiting. OTOH if we close the connection (by exiting the process) there's some chance the server end can notice that we no longer care about the response (in practice I don't think this is a real advantage right now: the nginx proxy might notice, but # nginx connections doesn't tend to be a bottleneck).
The advantages of zero timeouts are:
- if there are unclean-shutdown bugs, we get to find them sooner/easier, and fix them (instead of "wait 2 seconds every time, and 99.9% of the time that's long enough")
- we just exit as soon as we're done, things run fast
Updated by Peter Amstutz almost 9 years ago
Tom Clegg wrote:
I started out having non-zero default timeouts, but I couldn't think of any advantage.
For example, with EventClient, why do we care whether the server acknowledged our "close"? Either way, the connection will close when the server closes it, it times out, or our process exits.
If there are bugs in ws4py that make it crash during exit hooks if the server hasn't acknowledged shutdown yet, it seems like larger timeouts here would make the resulting problems rarer and harder to reproduce, but wouldn't actually solve them.
Yes, I agree it should just close the socket and not wait for acknowledgement, but ws4py doesn't have an option to do that (I've looked over the code several times). So if you want to poke ws4py in the eye, that's fine, but it's likely to result in intermittent backtraces during shutdown. We should probably fix ws4py at some point.
Similarly, in PollClient, I don't see an advantage in waiting. OTOH if we close the connection (by exiting the process) there's some chance the server end can notice that we no longer care about the response (in practice I don't think this is a real advantage right now: the nginx proxy might notice, but # nginx connections doesn't tend to be a bottleneck).
I agree that for PollClient it shouldn't make much difference one way or the other.
The advantages of zero timeouts are
- if there are unclean-shutdown bugs, we get to find them sooner/easier, and fix them (instead of "wait 2 seconds every time, and 99.9% of the time that's long enough")
- we just exit as soon as we're done, things run fast
In the interests of time, go ahead and merge.
Updated by Tom Clegg almost 9 years ago
Merged, thanks.
Peter Amstutz wrote:
Yes, I agree it should just close the socket and not wait for acknowledgement, but ws4py doesn't have an option to do that (I've looked over the code several times). So if you want to poke ws4py in the eye, that's fine, but it's likely to result in intermittent backtraces during shutdown. We should probably fix ws4py at some point.
The new default behavior is exactly the same as the old behavior (i.e., don't wait for anything) so if we haven't been having this problem I suppose we will continue to not have it.
I don't think this qualifies as "poking ws4py in the eye" unless ws4py's API tells us we're not allowed to exit our program until we receive a "closed" event on every connection...?
... not wait for acknowledgement, but ws4py doesn't have an option to do that
This seems to be exactly what close() does -- we have to go to extra trouble to wait for acknowledgement via closed() callback.
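Roughly, that split looks like this in a ws4py threaded-client subclass (a sketch only, not the actual arvados.events.EventClient; the exact handshake behaviour should be checked against the ws4py version in use):

```python
import threading

from ws4py.client.threadedclient import WebSocketClient


class SketchEventClient(WebSocketClient):
    """close() only starts the closing handshake; waiting for the server's
    acknowledgement means watching for the closed() callback ourselves."""

    def __init__(self, url):
        super(SketchEventClient, self).__init__(url)
        self._closed = threading.Event()

    def closed(self, code, reason=None):
        # ws4py calls this once the connection has actually shut down.
        self._closed.set()

    def close(self, code=1000, reason='', timeout=0):
        super(SketchEventClient, self).close(code, reason)  # send close frame, no wait
        self._closed.wait(timeout)                          # the optional "extra trouble"
```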
Updated by Brett Smith almost 9 years ago
- Status changed from New to In Progress
Updated by Tom Clegg almost 9 years ago
- Status changed from In Progress to Resolved