Bug #8388

[SDKs] WebSocketClient Exception 'NoneType' object is not callable

Added by Nico C├ęsar over 3 years ago. Updated about 3 years ago.

Status:
In Progress
Priority:
Normal
Assigned To:
-
Category:
SDKs
Target version:
Start date:
02/17/2016
Due date:
% Done:

0%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Description

2016-02-05_16:06:55.98467 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
2016-02-05_16:06:55.98478 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr crunchstat: notice: stats not available: stat memory.stat, statgroup memory, cid 8632b141bad61731f821338bef7fae89a2bc279c3fcb0a6b2725171ab64221f0, parent docker, root /sys/fs/cgroup
2016-02-05_16:06:55.98486 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr crunchstat: notice: stats not available: stat cpuacct.stat, statgroup cpuacct, cid 8632b141bad61731f821338bef7fae89a2bc279c3fcb0a6b2725171ab64221f0, parent docker, root /sys/fs/cgroup
2016-02-05_16:06:55.98497 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr crunchstat: notice: stats not available: stat blkio.io_service_bytes, statgroup blkio, cid 8632b141bad61731f821338bef7fae89a2bc279c3fcb0a6b2725171ab64221f0, parent docker, root /sys/fs/cgroup
2016-02-05_16:06:55.98505 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr crunchstat: notice: stats not available: stat cgroup.procs, statgroup cpuacct, cid 8632b141bad61731f821338bef7fae89a2bc279c3fcb0a6b2725171ab64221f0, parent docker, root /sys/fs/cgroup
2016-02-05_16:06:55.98513 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr Using Arvados SDK version 0.1.20151210204624
2016-02-05_16:06:56.19051 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr run-command: md5sum /keep/e73f37e3a1b9153ac3421630bb2d641e+54/md5sum.txt > md5sum.txt
2016-02-05_16:06:56.80218 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr run-command: md5sum completed with exit code 0 (success)
2016-02-05_16:06:57.38913 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr run-command: the following output files will be saved to keep:
2016-02-05_16:06:57.38924 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr run-command:           87 ./md5sum.txt
2016-02-05_16:06:57.38932 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr run-command: start writing output to keep
2016-02-05_16:06:57.94527 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr Exception in thread WebSocketClient (most likely raised during interpreter shutdown):
2016-02-05_16:06:58.49804 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr Traceback (most recent call last):
2016-02-05_16:06:58.49807 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr   File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
2016-02-05_16:06:58.49808 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr   File "/usr/lib/python2.7/threading.py", line 763, in run
2016-02-05_16:06:58.49820 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr   File "/usr/lib/python2.7/dist-packages/ws4py/websocket.py", line 427, in run
2016-02-05_16:06:58.49821 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr   File "/usr/lib/python2.7/dist-packages/ws4py/websocket.py", line 305, in once
2016-02-05_16:06:58.49823 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr   File "/usr/lib/python2.7/dist-packages/ws4py/websocket.py", line 357, in process
2016-02-05_16:06:58.49824 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr   File "/usr/lib/python2.7/dist-packages/ws4py/streaming.py", line 185, in receiver
2016-02-05_16:06:58.49832 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr   File "/usr/lib/python2.7/dist-packages/ws4py/framing.py", line 138, in _parsing
2016-02-05_16:06:58.49833 c97qk-8i9sb-toh7dba29qpmstw 24092 0 stderr <type 'exceptions.TypeError'>: 'NoneType' object is not callable
2016-02-05_16:06:58.49840 c97qk-8i9sb-toh7dba29qpmstw 24092 0 child 24188 on compute4.1 exit 0 success=true
2016-02-05_16:06:58.49846 c97qk-8i9sb-toh7dba29qpmstw 24092 0 success in 4 seconds
2016-02-05_16:06:58.49847 c97qk-8i9sb-toh7dba29qpmstw 24092 0 task output (105 bytes): . fc9db26fae56d1e01d39fc19924fd5f7+87+A3cd799a779d9f99d272b6575e76507b08e585772@56c73da1 0:87:md5sum.txt
2016-02-05_16:06:58.49853 c97qk-8i9sb-toh7dba29qpmstw 24092  status: 1 done, 0 running, 0 todo
2016-02-05_16:06:58.57025 c97qk-8i9sb-toh7dba29qpmstw 24092  wait for last 0 children to finish
2016-02-05_16:06:58.58058 c97qk-8i9sb-toh7dba29qpmstw 24092  release job allocation
2016-02-05_16:06:58.61254 c97qk-8i9sb-toh7dba29qpmstw ! salloc: Job allocation 154 has been revoked.
2016-02-05_16:06:58.61303 c97qk-8i9sb-toh7dba29qpmstw 24092  Freeze not implemented
2016-02-05_16:06:58.62316 c97qk-8i9sb-toh7dba29qpmstw 24092  collate
2016-02-05_16:06:58.63606 c97qk-8i9sb-toh7dba29qpmstw 24092  collated output manifest text to send to API server is 105 bytes with access tokens
2016-02-05_16:06:58.94797 c97qk-8i9sb-toh7dba29qpmstw 24092  job output 8db0d9405b2963cf413cb73580889fe7+54
2016-02-05_16:06:59.53260 dispatch: update compute4 state to {:state=>"idle", :job=>nil}
2016-02-05_16:06:59.59897 c97qk-8i9sb-toh7dba29qpmstw 24092  finish
2016-02-05_16:06:59.93084 c97qk-8i9sb-toh7dba29qpmstw 24092  log collection is cd10b15edb98db7acd592dd8dad0f2c5+85
2016-02-05_16:07:00.53179 dispatch: child 24086 exit 0
2016-02-05_16:07:00.53188 dispatch: job c97qk-8i9sb-toh7dba29qpmstw end

Subtasks

Task #8462: Review 8388-events-terminateIn ProgressPeter Amstutz


Related issues

Related to Arvados - Bug #8288: arv-mount / crunchstat in a crunch job fails to exit because reasonsResolved01/23/2016

Related to Arvados - Bug #8060: [SDKs] Python tools occasionally write errors on exitNew12/18/2015

History

#2 Updated by Tom Clegg over 3 years ago

Any ideas about #8288#note-18, why t.join(timeout=0) would produce different results than not joining the thread at all?

Maybe we should be calling terminate() instead of close(). terminate() closes the socket, cleans up, and then returns -- which sounds more like what we want, rather than "send a message to the server asking it to close".

#3 Updated by Tom Clegg over 3 years ago

8388-events-terminate @ fe55632

#4 Updated by Brett Smith over 3 years ago

  • Status changed from New to In Progress
  • Assigned To set to Tom Clegg
  • Target version set to 2016-03-02 sprint

#5 Updated by Brett Smith over 3 years ago

  • Subject changed from WebSocketClient Exception 'NoneType' object is not callable to [SDKs] WebSocketClient Exception 'NoneType' object is not callable
  • Category set to SDKs

#6 Updated by Tom Clegg over 3 years ago

  • Story points set to 0.5
8388-events-terminate has two commits
  • 02dde15 8388: Add PollClient.terminate method to match EventClient.terminate.
  • fe55632 8388: Call terminate() instead of close() in hopes of avoiding stack traces crying wolf during shutdown.
This should be merged carefully to avoid fuse/sdk incompatibility:
  1. Merge "Add PollClient.terminate method"
  2. Push
  3. Amend "Call terminate()" so FUSE depends on the SDK version merged in step 2
  4. Merge
  5. Push

#7 Updated by Peter Amstutz over 3 years ago

"Those who do not read history are doomed to repeat it".

If you look at 51641ba5579cb9ebe14234e0888a162b46d1627d you'll see that we used to use terminate() and it used to crash.

The problem is that ws4py terminate() doesn't do any locking and is intended to be called by the websocket thread and not by other threads (it's really an internal method, just not marked as such). As a result calling terminate() directly will cause the websocket thread to crash by setting variables to None.

Also in that commit the following code was removed. Possibly we might just want to use this in our close() and ignore the graceful shutdown of WebSocketClient.close() entirely.

    def close_connection(self):
        try:
            self.sock.shutdown(socket.SHUT_RDWR)
            self.sock.close()
        except:
            pass

#8 Updated by Tom Clegg over 3 years ago

That commit message doesn't say anything about terminate() crashing, and neither do the notes on #6706, so I guess those who don't write their history are also doomed to repeat it.

Is there evidence somewhere that terminate() crashes?

http://ws4py.readthedocs.org/en/latest/sources/ws4py/#ws4py.websocket.WebSocket.terminate makes it look a lot like the right public API

In https://github.com/Lawouach/WebSocket-for-Python/issues/131 someome reported a crash in 2014, but the author seems to think terminate() is a working public API

I'm not following the logic of going back to the code that we've already tried but then had to remove in the bugfix commit you referenced.

At this point it actually seems like we're better off doing nothing at all, just letting nature take its course.

#9 Updated by Peter Amstutz over 3 years ago

https://github.com/Lawouach/WebSocket-for-Python/blob/master/ws4py/websocket.py#L393

Here's why terminate() shouldn't be called by the main thread:

  1. terminate() calls self.close_connection()
  2. close_connection() closes the socket (good!) and sets self.sock = None (uh oh!)
  3. the event handler thread, which is still running, calls self.sock.recv() or similar method
  4. the thread nondeterministically goes down in flame with a noisy "Can't call recv() on None" or similar

#10 Updated by Peter Amstutz over 3 years ago

The problem with not closing anything and just letting the program end is that then the thread nondeterministically crashes due to some kind of cleanup happening that cuts the legs out from underneath the thread that is still running.

Having spent far more time with the ws4py code that I even wanted to, I think the solution to avoid both hangs and crashes on exit is (a) close the socket directly (note #7) and (b) join on the event thread (so it has a chance to shut down gracefully, but isn't waiting on the server to do anything).

#11 Updated by Peter Amstutz over 3 years ago

  • Assigned To changed from Tom Clegg to Peter Amstutz

#12 Updated by Tom Clegg over 3 years ago

Before the next round of "I predict this will crash" can we make a program that connects/kills repeatedly using whatever method until the stack trace appears? Then we can [a] help fix ws4py upstream, [b] know whether ws4py is fixed, and [c] save time speculating.

#13 Updated by Peter Amstutz about 3 years ago

  • Assigned To changed from Peter Amstutz to Tom Clegg

#14 Updated by Brett Smith about 3 years ago

  • Assigned To deleted (Tom Clegg)
  • Target version changed from 2016-03-02 sprint to Arvados Future Sprints

#15 Updated by Brett Smith about 3 years ago

Note that this bug currently doesn't cause jobs to fail, or tools to exit nonzero (as far as we know). It's "just" a very scary warning as the stack comes down.

#16 Updated by Tom Clegg about 3 years ago

I started a branch attempting to reproduce the stack traces. From that I've tentatively concluded that you can't reproduce the problem using multiprocessing.pool because pool workers follow a different shutdown path that swallows the stack trace. Next move: see if subprocessing.check_output(['python', '-c', ...]) can reproduce it.

Also available in: Atom PDF