Bug #8288

arv-mount / crunchstat in a crunch job fails to exit because reasons

Added by Joshua Randall over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
FUSE
Target version:
Start date:
01/23/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Description

srun / arv-mount / crunchstat doesn't exit because apparently although the docker container has exited and `docker run` has returned, the docker daemon still has an open file on the keep mount that `arv-mount` created.

# ps auxwww | awk '$1=="crunch"'
crunch    8848  0.0  0.0 608616 31096 ?        Sl   Jan22   0:01 /usr/bin/python2.7 /usr/local/bin/arv-mount --read-write --mount-by-pdh=by_pdh --mount-tmp=tmp --crunchstat-interval=10 --allow-other /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep --exec crunchstat -cgroup-root=/sys/fs/cgroup -cgroup-parent=docker -cgroup-cid=/data/crunch-tmp/crunch-job/z8ta6-ot0gb-v0p7tcarqimrnmj-0.cid -poll=10000 /usr/bin/docker run --name=z8ta6-ot0gb-v0p7tcarqimrnmj-0 --attach=stdout --attach=stderr --attach=stdin -i --cidfile=/data/crunch-tmp/crunch-job/z8ta6-ot0gb-v0p7tcarqimrnmj-0.cid --sig-proxy --memory=188032493k --memory-swap=190129641k --volume=/data/crunch-tmp/crunch-job/src:/data/crunch-tmp/crunch-job/src:ro --volume=/data/crunch-tmp/crunch-job/opt:/data/crunch-tmp/crunch-job/opt:ro --volume=/data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep/by_pdh:/keep:ro --volume=/data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep/tmp:/keep_tmp --volume=/tmp --env=TASK_SEQUENCE=0 --env=TASK_KEEPMOUNT=/keep --env=JOB_PARAMETER_INPUTS_COLLECTION=e8ccf6ac8348615a9873149f4ff07353+14726 --env=CRUNCH_SRC_COMMIT=15ee3e583ff9eeb273b52826c8944b43ae21e8bd --env=TASK_QSEQUENCE=0 --env=CRUNCH_INSTALL=/data/crunch-tmp/crunch-job/opt --env=CRUNCH_GIT_ARCHIVE_HASH=085e92f18ba93c6b8eb2507dbfbdf929 --env=CRUNCH_REFRESH_TRIGGER=/tmp/crunch_refresh_trigger --env=ARVADOS_API_TOKEN=2xb1o74nhxb7h2k2fculw4k6xahvss7v1ux8lhmd52conglznk --env=JOB_PARAMETER_SKIP_SQ_SN_REGEX=_decoy$ --env=CRUNCH_WORK=/data/crunch-tmp/crunch-job/work --env=CRUNCH_TMP=/data/crunch-tmp/crunch-job --env=TASK_TMPDIR=/tmp/crunch-job-task-work/humgen-03-02.1 --env=JOB_UUID=z8ta6-8i9sb-fsm22dece9hd043 --env=CRUNCH_JOB_UUID=z8ta6-8i9sb-fsm22dece9hd043 --env=TASK_SLOT_NUMBER=1 --env=CRUNCH_SRC_URL=/var/lib/arvados/internal.git --env=TASK_SLOT_NODE=humgen-03-02 --env=JOB_SCRIPT=bcftools-exp-gvcf-mpileup-cram-chunked.py --env=CRUNCH_NODE_SLOTS=1 --env=JOB_PARAMETER_REFERENCE_COLLECTION=a83bd4e5a26a64612322f21515d93bab+6190 --env=JOB_PARAMETER_GENOME_CHUNKS=400 --env=CRUNCH_JOB_DOCKER_BIN=/usr/bin/docker --env=TASK_WORK=/tmp/crunch-job-task-work/humgen-03-02.1 --env=TASK_KEEPMOUNT_TMP=/keep_tmp --env=ARVADOS_API_HOST=api.arvados.sanger.ac.uk --env=JOB_WORK=/tmp/crunch-job-work --env=TASK_UUID=z8ta6-ot0gb-v0p7tcarqimrnmj --env=CRUNCH_SRC=/data/crunch-tmp/crunch-job/src --env=HOME=/tmp/crunch-job-task-work/humgen-03-02.1 bfdf5bfbb2858302a6b82242018f6d39aade1707e25541710ff85b804c35056c /bin/sh -c python -c "from pkg_resources import get_distribution as get; print \"Using Arvados SDK version\", get(\"arvados-python-client\").version">&2 2>/dev/null; mkdir -p "/tmp/crunch-job-work" "/tmp/crunch-job-task-work/humgen-03-02.1" && if which stdbuf >/dev/null ; then   exec  stdbuf --output=0 --error=0  \/data\/crunch\-tmp\/crunch\-job\/src\/crunch_scripts\/bcftools\-exp\-gvcf\-mpileup\-cram\-chunked\.py ; else   exec \/data\/crunch\-tmp\/crunch\-job\/src\/crunch_scripts\/bcftools\-exp\-gvcf\-mpileup\-cram\-chunked\.py ; fi
# pstree -Aap 8848
arv-mount,8848 /usr/local/bin/arv-mount --read-write --mount-by-pdh=by_pdh --mount-tmp=tmp --crunchstat-interval=10 --allow-other /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep--e
  |-{arv-mount},8860
  `-{arv-mount},8863
# fuser -v -m /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep
                     USER PID ACCESS COMMAND
/data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep:
                     root     kernel mount /data
                     crunch     8848 ..c.. arv-mount
                     root      37715 F.... docker
# ps auxwww | grep 37715 | grep -v grep
root     37715  0.1  0.0 12555452 94712 ?      Ssl   2015 166:27 /usr/bin/docker -d --graph=/data/docker
# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
# docker version
Client version: 1.7.1
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 786b29d
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 786b29d
OS/Arch (server): linux/amd64
# lsof -p 37715
COMMAND   PID USER   FD   TYPE             DEVICE SIZE/OFF      NODE NAME
docker  37715 root  cwd    DIR                8,6     4096         2 /
docker  37715 root  rtd    DIR                8,6     4096         2 /
docker  37715 root  txt    REG                8,6 16296881   1977972 /usr/bin/docker
docker  37715 root    0u   CHR                1,3      0t0      1029 /dev/null
docker  37715 root    1u   CHR              136,2      0t0         5 /dev/pts/2
docker  37715 root    2u   CHR              136,2      0t0         5 /dev/pts/2
docker  37715 root    3u  unix 0xffff881881334280      0t0  48220470 /var/run/docker.sock
docker  37715 root    4u  unix 0xffff882ff7a0f700      0t0   1501908 /var/run/docker.sock
docker  37715 root    5r   CHR                1,9      0t0      1034 /dev/urandom
docker  37715 root    6u  0000                0,9        0      6847 anon_inode
docker  37715 root    9u  unix 0xffff881ff0922300      0t0  48310485 /var/run/docker.sock
docker  37715 root   11u  unix 0xffff881ff0927a80      0t0  48326780 /var/run/docker.sock
docker  37715 root   12u  unix 0xffff88191467b800      0t0  48698892 /var/run/docker.sock
docker  37715 root   14u  unix 0xffff880106038000      0t0  48603069 /var/run/docker.sock
docker  37715 root   16u  unix 0xffff882005fa6c80      0t0  48334610 /var/run/docker.sock
docker  37715 root   17u   REG              252,0  1996800 366018572 /data/docker/linkgraph.db
docker  37715 root   19u   REG              252,0  1996800 366018572 /data/docker/linkgraph.db
docker  37715 root   23u  unix 0xffff881ff0922d80      0t0  48247780 /var/run/docker.sock
docker  37715 root   24u  unix 0xffff88191467e580      0t0  48663201 /var/run/docker.sock
docker  37715 root   25u  unix 0xffff88010603c980      0t0  48683402 /var/run/docker.sock
docker  37715 root   26u  unix 0xffff881ef218b100      0t0  48310488 /var/run/docker.sock
docker  37715 root   27u  unix 0xffff882005fa6580      0t0  48309653 /var/run/docker.sock
# stat /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep
  File: `/data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep'
  Size: 4096          Blocks: 8          IO Block: 4096   directory
Device: fc00h/64512d    Inode: 262078526   Links: 2
Access: (0755/drwxr-xr-x)  Uid: (15324/  crunch)   Gid: ( 1593/ arvados)
Access: 2016-01-22 21:49:38.729986521 +0000
Modify: 2016-01-22 21:49:38.441989497 +0000
Change: 2016-01-22 21:49:38.441989497 +0000
 Birth: -
# stat -f /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep
  File: "/data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep" 
    ID: 6fef924393f3e2f8 Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 17485339344 Free: 16349728046 Available: 16174171418
Inodes: Total: 1097203712 Free: 1096592099
# uname -a
Linux humgen-03-02 3.13.0-63-generic #104~precise1-Ubuntu SMP Tue Aug 18 17:03:03 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
# readlink /proc/37715/fd/6
anon_inode:[eventpoll]
# fusermount -u /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep
fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep: Invalid argument
# mount -t fuse
# umount /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep
umount: /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep: not mounted
# rm -rf /data/crunch-tmp/crunch-job/task/humgen-03-02.1.keep
# pstree -Aasp 8848
init,1
  `-slurmstepd,8845
      `-arv-mount,8848 /usr/local/bin/arv-mount --read-write --mount-by-pdh=by_pdh --mount-tmp=tmp --crunchstat-interval=10 --allow-other ...
          |-{arv-mount},8860
          `-{arv-mount},8863
# strace -f -p 8848
Process 8848 attached with 3 threads - interrupt to quit
[pid  8863] futex(0x7fcc728c7a84, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid  8860] select(0, NULL, NULL, NULL, {1, 754286} <unfinished ...>
[pid  8848] futex(0x32c6e80, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  8860] <... select resumed> )      = 0 (Timeout)
[pid  8860] write(2, "crunchstat: keepcalls 0 put 0 ge"..., 74) = 74
[pid  8860] write(2, "crunchstat: net:keep0 0 tx 0 rx "..., 70) = 70
[pid  8860] write(2, "crunchstat: keepcache 0 hit 0 mi"..., 76) = 76
[pid  8860] write(2, "crunchstat: fuseops 0 write 0 re"..., 78) = 78
[pid  8860] write(2, "crunchstat: blkio:0:0 0 write 0 "..., 80) = 80
[pid  8860] select(0, NULL, NULL, NULL, {10, 0}) = 0 (Timeout)
[pid  8860] write(2, "crunchstat: keepcalls 0 put 0 ge"..., 74) = 74
[pid  8860] write(2, "crunchstat: net:keep0 0 tx 0 rx "..., 70) = 70
[pid  8860] write(2, "crunchstat: keepcache 0 hit 0 mi"..., 76) = 76
[pid  8860] write(2, "crunchstat: fuseops 0 write 0 re"..., 78) = 78
[pid  8860] write(2, "crunchstat: blkio:0:0 0 write 0 "..., 80) = 80
[pid  8860] select(0, NULL, NULL, NULL, {10, 0}^C <unfinished ...>
Process 8848 detached
Process 8860 detached
Process 8863 detached
# kill 8860

Subtasks

Task #8355: review 8288-poll-client-close-timeoutResolvedPeter Amstutz


Related issues

Related to Arvados - Feature #8163: [FUSE] arv-mount should detect and log any files/dirs that are still open after unmountingNew01/07/2016

Related to Arvados - Bug #8388: [SDKs] WebSocketClient Exception 'NoneType' object is not callableIn Progress02/17/2016

Associated revisions

Revision e0e0c312
Added by Tom Clegg over 5 years ago

Merge branch '8288-arv-mount-deadlock' refs #8288

Revision cda9a7d3
Added by Tom Clegg over 5 years ago

Merge branch '8288-poll-client-close-timeout' refs #8288

History

#1 Updated by Tom Clegg over 5 years ago

  • Description updated (diff)

#2 Updated by Joshua Randall over 5 years ago

Further testing shows that docker is not the culprit here. I instrumented a version of arvados_fuse that monitors all operations calls (tracking the pid of the calling process) and no operations were performed by docker. It turns out that I was misusing the `fuser -m` option because at that point the fuse filesystem has already been unmounted, so the mount point is just a directory on the scratch filesystem (on which the docker graph database also resides, hence docker daemon having it open).

There is still an issue in that arv-mount fails to cleanly exit (entirely on its own, it would seem). I'm currently experimenting with adding some additonal code to _run_exec to make sure that it exits once the exec'd command completes.

#3 Updated by Joshua Randall over 5 years ago

  • Subject changed from arv-mount / crunchstat in a crunch job fails to exit because docker daemon has file open for writing (according to fuser) to arv-mount / crunchstat in a crunch job fails to exit because reasons

#4 Updated by Tom Clegg over 5 years ago

I suspect arv-mount's use of arvados.events.PollClient (when it can't connect to websocket) contributes to this problem:
  1. I ran into both of these problems on a dev cluster last night.
  2. IIRC, Josh mentioned the frequency of "arv-mount gets stuck" seemed to be related to the load on the whole cluster, not just the number of tasks running on a single compute node ... and that high cluster load seemed to be causing websocket failures.

As Josh pointed out in IRC, it looks like we're using operations.destroy() wrong: the llfuse example suggests we should just call llfuse.close() after the filesystem is unmounted and llfuse.main() returns, and let llfuse call our operations.destroy() by itself. (This might explain why we've had to hack mutexes into our shutdown methods: perhaps we wouldn't have to accommodate multiple overlapping shutdown events if we stop going around llfuse's back and calling operations.destroy().)

We start llfuse.main() in a non-daemon thread, and then call operations.destroy() from a different thread. Calling llfuse.close() (not destroy()), and calling it from the same thread that called llfuse.main(), seems like good form if nothing else. And I'm not sure there's any reason that thread should not be a daemon thread: if we're in --exec mode and our main thread ends, that means our child process has exited and we have called fusermount -u, so the expected behavior is for all subsequent IO operations to fail anyway.

#5 Updated by Tom Clegg over 5 years ago

8288-arv-mount-deadlock @ bd49a6c contains a few changes that may help here. Notably:
  1. When someone (like arv-mount) closes its PollClient event listener, don't wait for the current set of "get logs" API calls to complete.
  2. In --exec mode, mark the llfuse thread as a daemon thread so Python doesn't stay alive waiting for it to finish after main has decided to exit.
  3. In --exec mode, if --unmount-timeout (default 2) seconds pass after "fusermount -u -z" and the llfuse thread is still running, log a warning, call operations.destroy(), and exit. (Otherwise, don't call operations.destroy() at all, just assume llfuse has done the necessary cleanup.)

Would be good to add an "exit within unmount_timeout+1 seconds even if various things are conspiring to keep threads alive" test case.

#6 Updated by Brett Smith over 5 years ago

  • Target version set to 2016-02-17 Sprint

#7 Updated by Brett Smith over 5 years ago

  • Category set to FUSE
  • Assigned To set to Tom Clegg

#8 Updated by Tom Clegg over 5 years ago

Split branch into 8288-arv-mount-deadlock and 8288-poll-client-close-timeout

#9 Updated by Tom Clegg over 5 years ago

  • Story points set to 0.5

#10 Updated by Peter Amstutz over 5 years ago

Reviewing 8288-arv-mount-deadlock:

The strategy here seems to be (a) use lazy unmount and (b) make the llfuse main thread a daemon thread so it doesn't prevent the runtime from exiting.

Some notes:

  • The llfuse main thread calls fuse_session_loop_mt() in libfuse which spins up more pthreads. I believe these will die quietly when the process as a whole exits.
  • The llfuse main thread normally terminates in response to an exit message from the kernel. So if we don't get the exit message, the kernel doesn't think it is done with the mount point (but we knew that already).
  • close() calls fuse_session_destroy() which calls op.destroy(), so I don't think we should ever be calling operations.destroy() explicitly.
  • It seems very likely to be a race condition to call close() without first stopping the main loop, so we might be signing up for random segfaults.
  • The correct way to terminate the main thread seems to be to call fuse_session_exit(), but llfuse doesn't expose that.
  • At best, this seems very likely to result in stale mount points in a weird state.

#11 Updated by Tom Clegg over 5 years ago

Peter Amstutz wrote:

The strategy here seems to be (a) use lazy unmount and (b) make the llfuse main thread a daemon thread so it doesn't prevent the runtime from exiting.

Right. Also, stop going behind llfuse's back and calling operations.destroy() immediately after fusermount -u and possibly getting there first, which might have been causing subtle problems of its own.

(I suspect the most common reason the llfuse thread was staying open was that our destroy() method called PollClient.close(), which could take a long time -- addressed by the 8288-poll-client-close-timeout branch -- but surely there are also other ways to keep the llfuse thread alive after the --exec proc has exited, and we want to cut those off too.)

  • The llfuse main thread calls fuse_session_loop_mt() in libfuse which spins up more pthreads. I believe these will die quietly when the process as a whole exits.
  • The llfuse main thread normally terminates in response to an exit message from the kernel. So if we don't get the exit message, the kernel doesn't think it is done with the mount point (but we knew that already).
  • close() calls fuse_session_destroy() which calls op.destroy(), so I don't think we should ever be calling operations.destroy() explicitly.

Right. I moved it to "only if the normal way doesn't work normally" because it's clear that's not the right way. But yes, terminating the thread by exiting main is probably better than unsafe-cleanup-then-exit. So, removed the call to operations.destroy() in da7eabf.

  • It seems very likely to be a race condition to call close() without first stopping the main loop, so we might be signing up for random segfaults.

Yes, I assume this is why the old "close() from a different thread" code was commented out. The new code is straight from the llfuse example: close() is just what you're supposed to do after main() returns.

  • The correct way to terminate the main thread seems to be to call fuse_session_exit(), but llfuse doesn't expose that.

I figure we have to make sure "fusermount -u" works properly (that's the correct way that's exposed to users) so I'm not inclined to look too hard for other ways.

  • At best, this seems very likely to result in stale mount points in a weird state.

So far I have not seen a mount point still connected after "fusermount -u -z". But yes, getting rid of operations.destroy() should reduce the possibilities to "normal", "process exited, anything still open gets an IO error", "process won't die, llfuse is still doing its thing without interference"... no more "llfuse is still running after we poked its operations in the eye" option.

If there is still some case where the kernel (or python) keeps our process alive even after we try to exit the main thread, at least we'll have a log message as a clue, and at least we won't have poked llfuse in the eye.

#12 Updated by Peter Amstutz over 5 years ago

fuse_session_exit() would let us gracefully terminate the main libfuse threads instead of just abandoning them, but as noted, llfuse doesn't expose that. I'll see about filing a feature request.

8288-arv-mount-deadlock @ da7eabf looks good to me.

#13 Updated by Joshua Randall over 5 years ago

We could also tell the kernel to abort the connection (and thus terminate the fuse/llfuse process) by either `umount -f` on the mountpoint (which should work after `fusermount -u` but not after `fusermount -u -z`) or signalling the "abort" interface in the fusectl (/sys) filesystem.

Those are the two most effective deadlock-avoidance mechanisms suggested in the linux kernel documentation for fuse:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/fuse.txt#n204

I think the implication of these documentation are that there are situations arv-mount can get into in which this will be the only way to kill off the fuse filesystem.

Perhaps we could follow up on the current `fusermount -u` call with a .join() with timeout on the fuse main thread, and if it doesn't exit in the time allotted, hit it harder with a `umount -f` call.

#14 Updated by Peter Amstutz over 5 years ago

Joshua Randall wrote:

We could also tell the kernel to abort the connection (and thus terminate the fuse/llfuse process) by either `umount -f` on the mountpoint (which should work after `fusermount -u` but not after `fusermount -u -z`) or signalling the "abort" interface in the fusectl (/sys) filesystem.

umount -f definitely works, but requires sudo.

I didn't know about "abort", though. It seems that the permissions are associated with the user that is running the fuse FS, so no sudo needed. However the mounts are identified by a numeric id, I'm not sure where that id comes from.

#15 Updated by Peter Amstutz over 5 years ago

Reviewing 8288-poll-client-close-timeout

In EventClient:

Calling super(EventClient, self).close() starts a graceful shutdown (WebSocketClient tells the server that we're signing off, waits for an acknowledgement, the acknowledgement is what signals closed()), but if we don't wait for it, won't happen. Suggest the default timeout should be greater than zero, maybe 1s.

In PollClient:

Similar suggestion for default timeout to be greater than zero. Once _closing.set() is called, the thread should exit almost immediately, except with a timeout of zero the thread might not even get a chance to run to completion.

#16 Updated by Tom Clegg over 5 years ago

I started out having non-zero default timeouts, but I couldn't think of any advantage.

For example, with EventClient, why do we care whether the server acknowledged our "close"? Either way, the connection will close when the server closes it, it times out, or our process exits.

If there are bugs in ws4py that make it crash during exit hooks if the server hasn't acknowledged shutdown yet, it seems like larger timeouts here would make the resulting problems rarer and harder to reproduce, but wouldn't actually solve them.

Similarly, in PollClient, I don't see an advantage in waiting. OTOH if we close the connection (by exiting the process) there's some chance the server end can notice that we no longer care about the response (in practice I don't think this is a real advantage right now: the nginx proxy might notice, but # nginx connections doesn't tend to be a bottleneck).

The advantages of zero timeouts are
  • if there are unclean-shutdown bugs, we get to find them sooner/easier, and fix them (instead of "wait 2 seconds every time, and 99.9% of the time that's long enough")
  • we just exit as soon as we're done, things run fast

#17 Updated by Peter Amstutz over 5 years ago

Tom Clegg wrote:

I started out having non-zero default timeouts, but I couldn't think of any advantage.

For example, with EventClient, why do we care whether the server acknowledged our "close"? Either way, the connection will close when the server closes it, it times out, or our process exits.

If there are bugs in ws4py that make it crash during exit hooks if the server hasn't acknowledged shutdown yet, it seems like larger timeouts here would make the resulting problems rarer and harder to reproduce, but wouldn't actually solve them.

Yes, I agree it should just close the socket and not wait for acknowledgement, but ws4py doesn't have an option to do that (I've looked over the code several times). So if you want to poke ws4py in the eye, that's fine, but it's likely to result in intermittent backtraces during shutdown. We should probably fix ws4py at some point.

Similarly, in PollClient, I don't see an advantage in waiting. OTOH if we close the connection (by exiting the process) there's some chance the server end can notice that we no longer care about the response (in practice I don't think this is a real advantage right now: the nginx proxy might notice, but # nginx connections doesn't tend to be a bottleneck).

I agree that for PollClient it shouldn't make much difference one way or the other.

The advantages of zero timeouts are
  • if there are unclean-shutdown bugs, we get to find them sooner/easier, and fix them (instead of "wait 2 seconds every time, and 99.9% of the time that's long enough")
  • we just exit as soon as we're done, things run fast

In the interests of time, go ahead and merge.

#18 Updated by Tom Clegg over 5 years ago

Merged, thanks.

Peter Amstutz wrote:

Yes, I agree it should just close the socket and not wait for acknowledgement, but ws4py doesn't have an option to do that (I've looked over the code several times). So if you want to poke ws4py in the eye, that's fine, but it's likely to result in intermittent backtraces during shutdown. We should probably fix ws4py at some point.

The new default behavior is exactly the same as the old behavior (i.e., don't wait for anything) so if we haven't been having this problem I suppose we will continue to not have it.

I don't think this qualifies as "poking ws4py in the eye" unless ws4py's API tells us we're not allowed to exit our program until we receive a "closed" event on every connection...?

... not wait for acknowledgement, but ws4py doesn't have an option to do that

This seems to be exactly what close() does -- we have to go to extra trouble to wait for acknowledgement via closed() callback.

#19 Updated by Brett Smith over 5 years ago

  • Status changed from New to In Progress

#20 Updated by Tom Clegg over 5 years ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF