Bug #9051

[SDKs] EventClient fails to reconnect after HandshakeError from last connection

Added by Jiayong Li over 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

snap_gatk_HG01953.log_02 (3.91 KB), Jiayong Li, 04/25/2016 08:45 PM
snap_gatk_HG01953.log_01 (633 KB), Jiayong Li, 04/25/2016 08:45 PM

Related issues

Related to Arvados - Bug #8931: [SDK] websocket event thread crash (Resolved, 04/25/2016)

History

#1 Updated by Jiayong Li over 3 years ago

I was running snap_gatk on the HG01953 exome with arvados-cwl-runner, with websocket enabled (using arvados branch 8931-event-thread-catch-exceptions).
log_01 shows

2016-04-21 01:51:21 arvados.events[40865] WARNING: Unexpected close. Reconnecting.
2016-04-21 01:51:22 arvados.events[40865] WARNING: Error 'Invalid response status: 502 Bad Gateway' during websocket reconnect. Will retry after 5s.

followed by these tracebacks:
Traceback (most recent call last):
  File "/home/jiayong/miniconda2/lib/python2.7/site-packages/arvados_python_client-0.1.20160412180035-py2.7.egg/arvados/events.py", line 119, in on_closed
    self.ec.connect()
  File "build/bdist.linux-x86_64/egg/ws4py/client/__init__.py", line 231, in connect
    self.process_response_line(response_line)
  File "build/bdist.linux-x86_64/egg/ws4py/client/__init__.py", line 284, in process_response_line
    raise HandshakeError("Invalid response status: %s %s" % (code, status))
HandshakeError: Invalid response status: 502 Bad Gateway
2016-04-21 01:51:27 arvados.events[40865] WARNING: Error ''NoneType' object has no attribute 'getsockopt'' during websocket reconnect. Will retry after 5s.
Traceback (most recent call last):
  File "/home/jiayong/miniconda2/lib/python2.7/site-packages/arvados_python_client-0.1.20160412180035-py2.7.egg/arvados/events.py", line 119, in on_closed
    self.ec.connect()
  File "build/bdist.linux-x86_64/egg/ws4py/client/__init__.py", line 207, in connect
    self.sock = ssl.wrap_socket(self.sock, **self.ssl_options)
  File "/home/jiayong/miniconda2/lib/python2.7/ssl.py", line 911, in wrap_socket
    ciphers=ciphers)
  File "/home/jiayong/miniconda2/lib/python2.7/ssl.py", line 535, in __init__
    if sock.getsockopt(SOL_SOCKET, SO_TYPE) != SOCK_STREAM:
AttributeError: 'NoneType' object has no attribute 'getsockopt'
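
The two tracebacks above suggest that the first reconnect failed during the handshake and left the client without a socket, so the next connect() call on the same object could not succeed. A minimal sketch of that failure mode, and of a retry loop that avoids it by building a new client for each attempt, is below. This is only an illustration under stated assumptions: FakeClient, HandshakeError, and reconnect are hypothetical stand-ins written for this example, not the ws4py or arvados.events code, and this is not necessarily how the issue was fixed in the SDK.

```python
import time


class HandshakeError(Exception):
    """Stand-in for a failed websocket handshake (hypothetical, not ws4py's class)."""


class FakeClient:
    """Hypothetical client that becomes unusable once connect() has failed."""

    def __init__(self, responses):
        self._responses = responses  # shared list of HTTP status codes to "receive"
        self.sock = object()         # placeholder for a real socket

    def connect(self):
        if self.sock is None:
            # Mirrors the second traceback: the socket is gone after the first failure.
            raise AttributeError("'NoneType' object has no attribute 'getsockopt'")
        status = self._responses.pop(0)
        if status != 101:
            self.sock = None  # assumed behavior: a failed handshake drops the socket
            raise HandshakeError("Invalid response status: %s" % status)


def reconnect(make_client, retries=5, delay=5.0, sleep=time.sleep):
    """Try to connect, building a fresh client for every attempt."""
    last_error = None
    for attempt in range(1, retries + 1):
        client = make_client()  # never reuse a client whose connect() failed
        try:
            client.connect()
            return client, attempt
        except Exception as exc:
            last_error = exc
            sleep(delay)
    raise last_error
```

Calling connect() twice on one FakeClient after a 502 reproduces the HandshakeError followed by the AttributeError, while reconnect(lambda: FakeClient(responses)) keeps retrying with a new object until the simulated server answers with status 101.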

log_02 shows a deadlock.

2016-04-21 05:42:29 arvados.cwl-runner[7143] ERROR: Workflow is deadlocked, no runnable jobs and not waiting on any pending jobs.
2016-04-21 05:42:29 arvados.cwl-runner[7143] ERROR: Caught unhandled exception, marking pipeline as failed.  Error was: <class 'cwltool.errors.WorkflowException'>

#2 Updated by Brett Smith over 3 years ago

  • Subject changed from [Websocket] Connection closed during long running jobs, and reconnecting error to [SDKs] EventClient fails to reconnect after HandshakeError from last connection
  • Status changed from New to Feedback

We believe this was fixed in c1276bd9f, done as part of #8931. It would be good to hear back if you continue to see reconnect attempts failing permanently like this.

#3 Updated by Jiayong Li over 3 years ago

That's fantastic. I'll keep this ticket in mind the next time I run pipelines.

#4 Updated by Tom Morris almost 3 years ago

  • Status changed from Feedback to Resolved
