Bug #7587

[SDKs] Python Google API client raises Broken Pipe socket.error after sitting idle for some time

Added by Brett Smith almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
SDKs
Target version:
Start date:
10/22/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

Occasionally, after the Python Google API client has sat idle for a while, making an API request raises this exception:

Traceback (most recent call last):
File "/tmp/crunch-job/src/crunch_scripts/run-command", line 444, in <module>
  'progress':1.0
File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/oauth2client/util.py", line 142, in positional_wrapper
  return wrapped(*args, **kwargs)
File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/googleapiclient/http.py", line 722, in execute
  body=self.body, headers=self.headers)
File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/arvados/api.py", line 32, in http_request
  return self.orig_http_request(uri, **kwargs)
File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1609, in request
  (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1351, in _request
  (response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1273, in _conn_request
  conn.request(method, request_uri, body, headers)
File "/usr/lib/python2.7/httplib.py", line 973, in request
  self._send_request(method, url, body, headers)
File "/usr/lib/python2.7/httplib.py", line 1007, in _send_request
  self.endheaders(body)
File "/usr/lib/python2.7/httplib.py", line 969, in endheaders
  self._send_output(message_body)
File "/usr/lib/python2.7/httplib.py", line 829, in _send_output
  self.send(msg)
File "/usr/lib/python2.7/httplib.py", line 805, in send
  self.sock.sendall(data)
File "/usr/lib/python2.7/ssl.py", line 329, in sendall
  v = self.send(data[count:])
File "/usr/lib/python2.7/ssl.py", line 298, in send
  v = self._sslobj.write(data)
socket.error: [Errno 32] Broken pipe

The first line is the API call, which varies by application, but the rest of the exception is remarkably consistent. We have seen this in a few different run-command jobs, in multiple clusters, when run-command updates the task after it finishes uploading output to Keep. We have also seen it happen after a long-running arv-put finishes writing to Keep and tries to make a collection.

7587-idle-api.py (1.47 KB) - Brett Smith, 10/16/2015 02:19 PM

Subtasks

Task #7640: Review 7587-httplib2-retries-wip (no test yet beyond the successful job) - Resolved - Nico César

Task #7647: Review 7587-pysdk-retry-test-wip - Resolved - Brett Smith


Related issues

Related to Arvados - Bug #7496: [SDKs] picard run-command fails with Broken Pipe socket.error on tb05z - Resolved - 10/09/2015

Related to Arvados - Bug #7572: [SDKs] arv-put crashes with Broken Pipe socket.error after uploading 60GB - Closed - 10/15/2015

Associated revisions

Revision 92768ce8
Added by Brett Smith almost 4 years ago

Merge branch '7587-httplib2-retries-wip'

Refs #7587. Closes #7640.

Revision 6e0145ef
Added by Brett Smith almost 4 years ago

Merge branch '7587-pysdk-retry-test-wip'

Closes #7587, #7647.

History

#1 Updated by Brett Smith almost 4 years ago

  • File 7587-idle-api.py 7587-idle-api.py added
  • Subject changed from [SDKs] Python Google API client raises socket.error after sitting idle for some time to [SDKs] Python Google API client raises Broken Pipe socket.error after sitting idle for some time

Summary of the investigation so far:

With our current logging setup, there is no evidence of the failed request in our nginx logs.

I have tried to reproduce the issue by running the attached script on a shell node on a cluster where we have seen the exception. The script tries to mimic run-command's access patterns. So far I have not been able to reproduce the issue with it, even when writing large numbers of files (256) and sleeping for long periods (over seven hours).

Every time we've seen this happen, it has been after the code did a bunch of Keep activity and then called an idle API client again. Note that it doesn't take much Keep activity: one consistently failing job uploads only about 45KiB through run-command before updating the task.

The exception makes it look like the client is successfully sending request headers, but then the socket is closed by the time it tries to write the request body. It's unclear whether the server is accepting the headers, and then closing the connection; or if the connection has been closed the whole time, and the client just can't notice until after it sends the headers.
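That "can't notice until it writes" behavior is easy to demonstrate with a plain socket pair (a standalone illustration, not Arvados code; the request line in the payload is just filler):

```python
import errno
import socket

a, b = socket.socketpair()  # a connected pair of stream sockets
b.close()                   # the peer goes away, like a server closing an idle connection

# From our side the socket object still looks fine; nothing signals
# that the peer is gone until we actually try to write to it.
try:
    a.sendall(b"POST /arvados/v1/... HTTP/1.1\r\n")
except socket.error as e:   # BrokenPipeError on Python 3
    print(e.errno == errno.EPIPE)  # True: [Errno 32] Broken pipe
a.close()
```

With a real TCP connection the failure can be even sneakier: the headers may be accepted into the send buffer before the write of the body trips over the dead connection, which matches the traceback above.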

Open theories:

  • nginx, or Passenger, or (less likely) Rails' HTTP stack is closing the connection after receiving the headers, for some reason.
  • Linux on the API server is closing the connection for some reason. This seems unlikely, since long waits alone haven't successfully reproduced it, but it hasn't been ruled out completely.
  • Linux on the client closes the connection, possibly because the process opens too many other connections to keep it open.
  • Or, through some other mechanism, substantial Keep activity in the client code causes the API server connection to be closed without the original API client knowing.

#2 Updated by Brett Smith almost 4 years ago

Idea: Should _intercept_http_request in api.py retry socket.error the same way it retries BadStatusLine? The comments for that logic suggest the rationale is nearly identical.
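A minimal sketch of that idea (illustrative names, not the actual api.py code): give socket.error the same single-retry treatment BadStatusLine already gets, on the theory that both mean the cached connection went stale.

```python
import socket
from http.client import BadStatusLine  # httplib.BadStatusLine on Python 2


def request_with_retry(do_request, *args, **kwargs):
    """Call do_request; on BadStatusLine or socket.error, retry once.

    The assumption is that the first failure means the idle connection
    was closed under us, so a single retry on a fresh connection should
    succeed, and a second failure is a real error worth propagating.
    """
    try:
        return do_request(*args, **kwargs)
    except (BadStatusLine, socket.error):
        return do_request(*args, **kwargs)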

#3 Updated by Brett Smith almost 4 years ago

  • Story points set to 1.0

Plan: Do the naive retry in _intercept_http_request, and try the failing job with it. If that fixes things, then write up a test and get the branch merged properly.

If you can figure out how to close the HTTP object's socket before doing the retry, that's even better.
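One plausible way to do that (a sketch based on reading httplib2's source, where `Http` caches its open `httplib.HTTPConnection` objects in a `connections` dict, not the code that was eventually merged):

```python
def close_stale_connections(http):
    """Close every connection an httplib2.Http-like object has cached,
    so a retried request is forced onto a fresh socket."""
    # httplib2 keys this dict by 'scheme:authority'; the values are
    # httplib connection objects, which all support close().
    for conn in http.connections.values():
        conn.close()
```

Calling this between the failed attempt and the retry would avoid retrying on the very socket that just raised EPIPE.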

#4 Updated by Brett Smith almost 4 years ago

https://github.com/jcgregorio/httplib2/issues/240 suggests there is no way to close the socket explicitly.
(That is, httplib2's public API offers no explicit "close this socket" call; any workaround has to reach into its internals.)

#5 Updated by Brett Smith almost 4 years ago

  • Status changed from New to In Progress
  • Assigned To set to Brett Smith

#6 Updated by Brett Smith almost 4 years ago

I have a branch up on qr1hi that implements the planned fix, and I'm running the known-failing job with it.

#7 Updated by Brett Smith almost 4 years ago

Update: I have had to update basically everything around the failing job—run-command, the Docker image—to get a valid test going. Still chipping away at it.

#8 Updated by Brett Smith almost 4 years ago

qr1hi-8i9sb-ar91eq68o7afgzx succeeded with the patch.

#9 Updated by Brett Smith almost 4 years ago

  • Target version changed from Arvados Future Sprints to 2015-10-28 sprint

The branch with the patch is up for review. I am happy to work on a test and make sure it gets done before this issue is closed. But I wonder if the successful job is test enough to justify merging now, given how many people have been affected by this.

#10 Updated by Brett Smith almost 4 years ago

For background, it might also help to look at httplib2's _conn_request method. You can compare how it handles and retries different exceptions, including socket.error, and what we do in _intercept_http_request.

Note that httplib2 itself only retries socket errors that are ENETUNREACH or EADDRNOTAVAIL. Turning on its retry support wouldn't help with the cases we've seen, which are EPIPE and ETIMEDOUT.
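In errno terms the mismatch looks like this (a small illustration using the stdlib errno module; the first set reflects the reading of httplib2's _conn_request above):

```python
import errno

# socket errors httplib2's own retry logic is willing to retry:
HTTPLIB2_RETRIES = {errno.ENETUNREACH, errno.EADDRNOTAVAIL}

# socket errors we have actually observed in the failing jobs:
OBSERVED = {errno.EPIPE, errno.ETIMEDOUT}

print(HTTPLIB2_RETRIES & OBSERVED)  # set(): no overlap, so httplib2's
                                    # built-in retries can't cover our case
```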

#11 Updated by Nico César almost 4 years ago

reviewing 6a5cfd533c78824f65a9cf91e722570278d955e7

1. For a better understanding of the nature of the exception, it would be good if the exception message got logged: _logger.warning(e.message) before returning.

2. As a (maybe later) debugging improvement, it would be good to log the list of connections with _logger.debug() before conn.close().

Eyeballing the code, I can't come up with any problems that might arise right away. Tests pass (even though this path isn't tested).

1. and 2. aren't required for the merge in my opinion, but they are nice features to have (especially for OPS). LGTM

#12 Updated by Brett Smith almost 4 years ago

Nico César wrote:

reviewing 6a5cfd533c78824f65a9cf91e722570278d955e7

1. For a better understanding of the nature of the exception, it would be good if the exception message got logged: _logger.warning(e.message) before returning.

I added this at the debug level, so you can see it if you run with ARVADOS_DEBUG=1. If this retry does end up being routine—which it's starting to look like it might be—I didn't think it would be helpful to clutter job logs with it by logging a warning.

2. As a (maybe later) debugging improvement, it would be good to log the list of connections with _logger.debug() before conn.close().

I think if we're worried about this we should investigate it interactively with pdb or something like that. To my mind, the only question is, am I right that the API server connection is the only one? If I am, that's all we need. If not, we need to consider different code entirely. I'll note some of the other things we're doing in that function, like adding the Authorization header, are also inappropriate if the client makes connections to other servers. So the code already makes the same assumption and needs more fixes if it's wrong.

Merged this. I'll work on testing it next. Thanks.

#13 Updated by Brett Smith almost 4 years ago

7587-pysdk-retry-test-wip is up for review. It has two commits. The first fixes a bug where the API client tests could talk to the API server configured in the environment, even a real one. See the commit log for more background and rationale. That commit only fixes that issue.

The second commit adds the actual test.

#14 Updated by Tom Clegg almost 4 years ago

7587-pysdk-retry-test-wip @ cc5c45d LGTM

#15 Updated by Brett Smith almost 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:6e0145efb9929b5a854fb3461161460b0ac75fba.
