Bug #7654

[SDK] Python websockets client sometimes hangs on shutdown

Added by Peter Amstutz almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Target version:
Start date:
11/19/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Description

The EventClient() in the Python SDK has a race condition in its shutdown. When EventClient.close() is called, either:

  • It shuts down successfully
  • The event thread crashes and prints a stack trace (because this is in a daemon thread, it doesn't interfere with program shutdown)
    Exception in thread WebSocketClient:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
        self.run()
      File "/usr/lib/python2.7/threading.py", line 763, in run
        self.__target(*self.__args, **self.__kwargs)
      File "build/bdist.linux-x86_64/egg/ws4py/websocket.py", line 427, in run
        if not self.once():
      File "build/bdist.linux-x86_64/egg/ws4py/websocket.py", line 300, in once
        b = self.sock.recv(self.reading_buffer_size)
    AttributeError: 'NoneType' object has no attribute 'recv'
    
  • It hangs forever

The problem is:

  1. The parent class WebSocketClient.close() method is designed for orderly shutdown. It sends a close message to the server and prevents any more messages from being sent, but doesn't actually close the socket.
  2. The server is expected to respond with its own "closed" message.
  3. If the server is uncoorperative or stuck and doesn't respond with a "closed" message of its own, the client won't close the connection on its own. The server may even continue sending events, but if the application assumes that it doesn't receive any more events after returning from close(), it won't be prepared to handle them (because it is shutting down)
  4. To head this off, the current code calls close_connection() which explicitly closes the underlying socket. This also sets "WebSocketClient.sock" to "None". Unfortunately, as it turns out, the "threadedclient.WebSocketClient" is not threadsafe on this function. So this sometimes results in the above crash when sock is set to None at a bad time.

Proposed fix to EventClient.close():

  • Call close() to start orderly shutdown
  • Set a flag indicating that received_message() shouldn't forward any more messages to on_event(). Put a mutex in received_message() so that close() doesn't return until waits for any message handlers are completed.
  • return from close(). At this point we don't care what happens because either the orderly shutdown will complete, or thread will be quietly killed and socket will get closed during process termination.

Subtasks

Task #7687: Review 7654-ws4py-hangResolvedPeter Amstutz


Related issues

Related to Arvados - Bug #7466: [FUSE] arv-mount hang on exitClosed

Related to Arvados - Story #7593: [CWL] Modernize CWL SDKResolved10/16/2015

Related to Arvados - Bug #7721: [Tests] Build 2170 ran for 2.5 hours until I cancelled it - deadlock in websockets testsResolved

Associated revisions

Revision 73a127e5
Added by Peter Amstutz almost 4 years ago

Merge branch '7654-ws4py-hang' closes #7654

Revision 41694642 (diff)
Added by Peter Amstutz almost 4 years ago

Hotfix: use a recursive lock for closed_lock so that EventClient.close() can be
called from on_event(). refs #7654

Revision a85ea61e (diff)
Added by Tom Clegg over 3 years ago

7751: 7654: Rename bool to _closed to avoid conflicting with superclass closed() method.

refs #7654

History

#1 Updated by Peter Amstutz almost 4 years ago

  • Description updated (diff)

#2 Updated by Peter Amstutz almost 4 years ago

  • Description updated (diff)
  • Story points set to 1.0

#3 Updated by Peter Amstutz almost 4 years ago

  • Story points changed from 1.0 to 0.5

#4 Updated by Brett Smith almost 4 years ago

  • Category set to SDKs
  • Assigned To set to Sarah Guthrie
  • Target version set to 2015-11-11 sprint

#5 Updated by Brett Smith almost 4 years ago

  • Target version changed from 2015-11-11 sprint to 2015-12-02 sprint

#6 Updated by Brett Smith almost 4 years ago

  • Assigned To deleted (Sarah Guthrie)

#7 Updated by Brett Smith almost 4 years ago

  • Assigned To set to Peter Amstutz

#8 Updated by Peter Amstutz almost 4 years ago

Fixed as described. Being a race condition, this is slippery to reproduce and even harder to test, so hopefully you agree with my diagnosis & fix.

#9 Updated by Tom Clegg almost 4 years ago

dec876f... even if there are other problems still lurking, it certainly looks better/safer now than before. LGTM.

#10 Updated by Peter Amstutz almost 4 years ago

  • Status changed from New to Resolved

Applied in changeset arvados|commit:73a127e5492bc2711530b2f5a7c30a5021232d40.

Also available in: Atom PDF