Bug #9051

closed

[SDKs] EventClient fails to reconnect after HandshakeError from last connection

Added by Jiayong Li over 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Files

snap_gatk_HG01953.log_02 (3.91 KB), Jiayong Li, 04/25/2016 08:45 PM
snap_gatk_HG01953.log_01 (633 KB), Jiayong Li, 04/25/2016 08:45 PM

Related issues: 1 (0 open, 1 closed)

Related to Arvados - Bug #8931: [SDK] websocket event thread crash (Resolved, Peter Amstutz, 04/25/2016)

Updated by Jiayong Li over 8 years ago

I was running snap_gatk on the HG01953 exome with arvados-cwl-runner, with websockets enabled (using arvados branch 8931-event-thread-catch-exceptions).
log_01 shows:

2016-04-21 01:51:21 arvados.events[40865] WARNING: Unexpected close. Reconnecting.
2016-04-21 01:51:22 arvados.events[40865] WARNING: Error 'Invalid response status: 502 Bad Gateway' during websocket reconnect. Will retry after 5s.

followed by these tracebacks:
Traceback (most recent call last):
  File "/home/jiayong/miniconda2/lib/python2.7/site-packages/arvados_python_client-0.1.20160412180035-py2.7.egg/arvados/events.py", line 119, in on_closed
    self.ec.connect()
  File "build/bdist.linux-x86_64/egg/ws4py/client/__init__.py", line 231, in connect
    self.process_response_line(response_line)
  File "build/bdist.linux-x86_64/egg/ws4py/client/__init__.py", line 284, in process_response_line
    raise HandshakeError("Invalid response status: %s %s" % (code, status))
HandshakeError: Invalid response status: 502 Bad Gateway
2016-04-21 01:51:27 arvados.events[40865] WARNING: Error ''NoneType' object has no attribute 'getsockopt'' during websocket reconnect. Will retry after 5s.
Traceback (most recent call last):
  File "/home/jiayong/miniconda2/lib/python2.7/site-packages/arvados_python_client-0.1.20160412180035-py2.7.egg/arvados/events.py", line 119, in on_closed
    self.ec.connect()
  File "build/bdist.linux-x86_64/egg/ws4py/client/__init__.py", line 207, in connect
    self.sock = ssl.wrap_socket(self.sock, **self.ssl_options)
  File "/home/jiayong/miniconda2/lib/python2.7/ssl.py", line 911, in wrap_socket
    ciphers=ciphers)
  File "/home/jiayong/miniconda2/lib/python2.7/ssl.py", line 535, in __init__
    if sock.getsockopt(SOL_SOCKET, SO_TYPE) != SOCK_STREAM:
AttributeError: 'NoneType' object has no attribute 'getsockopt'
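
The second traceback suggests that the retry reused a client object whose socket had already been cleared by the failed handshake, so every later connect() call hit a None socket. The sketch below illustrates that failure mode and one possible way around it: building a fresh client object for each reconnect attempt. It is not the code from arvados/events.py or ws4py; FakeClient, FakeSocket, reconnect and the local HandshakeError class are hypothetical stand-ins written only for this illustration.

```python
import time


class HandshakeError(Exception):
    """Stand-in for the handshake failure seen in the first traceback."""


class FakeSocket(object):
    """Placeholder for a real socket object (hypothetical)."""


class FakeClient(object):
    """Hypothetical client that mimics the failure mode in the log.

    After a failed handshake the socket is set to None, so calling
    connect() again on the same object can never succeed.
    """

    # Number of upcoming handshakes that should fail, shared by all
    # instances. This plays the role of a proxy returning 502.
    failures_left = 1

    def __init__(self):
        self.sock = FakeSocket()

    def connect(self):
        if self.sock is None:
            # Same symptom as the second traceback.
            raise AttributeError(
                "'NoneType' object has no attribute 'getsockopt'")
        if FakeClient.failures_left > 0:
            FakeClient.failures_left -= 1
            self.sock = None  # the failed attempt leaves the object unusable
            raise HandshakeError("Invalid response status: 502 Bad Gateway")


def reconnect(make_client, retries=5, delay=0.0):
    """Try to connect, creating a new client object for every attempt.

    Returns (client, attempt_number) on success and raises RuntimeError
    once all attempts have failed.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        client = make_client()  # fresh object, fresh socket
        try:
            client.connect()
            return client, attempt
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
            time.sleep(delay)
    raise RuntimeError("giving up after %d attempts: %s" % (retries, last_error))


if __name__ == "__main__":
    FakeClient.failures_left = 1
    client, attempts = reconnect(FakeClient)
    print("connected after %d attempt(s)" % attempts)
```

With one simulated 502, reconnect() succeeds on the second attempt because the second attempt starts from a new object. Calling connect() twice on the same FakeClient instead reproduces the HandshakeError followed by the AttributeError, matching the order of errors in log_01.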

log_02 shows a deadlock.

2016-04-21 05:42:29 arvados.cwl-runner[7143] ERROR: Workflow is deadlocked, no runnable jobs and not waiting on any pending jobs.
2016-04-21 05:42:29 arvados.cwl-runner[7143] ERROR: Caught unhandled exception, marking pipeline as failed.  Error was: <class 'cwltool.errors.WorkflowException'>

Actions #2

Updated by Brett Smith over 8 years ago

  • Subject changed from [Websocket] Connection closed during long running jobs, and reconnecting error to [SDKs] EventClient fails to reconnect after HandshakeError from last connection
  • Status changed from New to Feedback

We believe this was fixed in c1276bd9f, done as part of #8931. It would be good to hear back if you continue to see reconnect attempts failing permanently like this.

Actions #3

Updated by Jiayong Li over 8 years ago

That's fantastic. I'll keep this ticket in mind the next time I run pipelines.

Actions #4

Updated by Tom Morris about 8 years ago

  • Status changed from Feedback to Resolved