Bug #7587
[SDKs] Python Google API client raises Broken Pipe socket.error after sitting idle for some time
Status: Closed
Description
Occasionally, after sitting idle for a while, when you make an API request using the Python Google API client, it raises this exception:
Traceback (most recent call last):
  File "/tmp/crunch-job/src/crunch_scripts/run-command", line 444, in <module>
    'progress':1.0
  File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/oauth2client/util.py", line 142, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/googleapiclient/http.py", line 722, in execute
    body=self.body, headers=self.headers)
  File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/arvados/api.py", line 32, in http_request
    return self.orig_http_request(uri, **kwargs)
  File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1609, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1351, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/tmp/crunch-job-work/.arvados.venv/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1273, in _conn_request
    conn.request(method, request_uri, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 973, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 1007, in _send_request
    self.endheaders(body)
  File "/usr/lib/python2.7/httplib.py", line 969, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 829, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 805, in send
    self.sock.sendall(data)
  File "/usr/lib/python2.7/ssl.py", line 329, in sendall
    v = self.send(data[count:])
  File "/usr/lib/python2.7/ssl.py", line 298, in send
    v = self._sslobj.write(data)
socket.error: [Errno 32] Broken pipe
The first line is the API call, which varies by application, but the rest of the exception is remarkably consistent. We have seen this in a few different run-command jobs, in multiple clusters, when run-command updates the task after it finishes uploading output to Keep. We have also seen it happen after a long-running arv-put finishes writing to Keep and tries to make a collection.
Files
Updated by Brett Smith over 9 years ago
- File 7587-idle-api.py added
- Subject changed from [SDKs] Python Google API client raises socket.error after sitting idle for some time to [SDKs] Python Google API client raises Broken Pipe socket.error after sitting idle for some time
Summary of the investigation so far:
With our current logging setup, there is no evidence of the failed request in our nginx logs.
I have tried to reproduce the issue by running the attached script on a shell node on a cluster where we have seen the exception. It tries to mimic run-command's access patterns. So far I have not been able to reproduce the issue this way, even when writing a large number of files (256) and sleeping for long periods (over seven hours).
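For reference, here is a minimal sketch of the idle-then-request pattern being tested. This is illustrative only, not the attached 7587-idle-api.py; the file count and sleep duration are just the figures mentioned above.

import time

import arvados

# Illustrative sketch only -- not the attached 7587-idle-api.py.
# Mimic run-command: do some Keep writes, sit idle, then reuse the
# same API client for another request.
api = arvados.api('v1')                      # one client for the whole run
keep = arvados.KeepClient(api_client=api)

for i in range(256):                         # a large number of small files
    keep.put('x' * 1024)

time.sleep(7 * 60 * 60)                      # sit idle for over seven hours

# This is the kind of call that raises Broken pipe in the failing jobs.
api.collections().list(limit=1).execute()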
Every time we have seen this happen, it has been after the code did a bunch of Keep activity and then called an idle API client again. Note that it doesn't take much Keep activity: one consistently failing job uploads only about 45KiB through run-command before updating the task.
The exception makes it look like the client is successfully sending request headers, but then the socket is closed by the time it tries to write the request body. It's unclear whether the server is accepting the headers, and then closing the connection; or if the connection has been closed the whole time, and the client just can't notice until after it sends the headers.
Open theories:
- nginx, or Passenger, or (less likely) Rails' HTTP stack is closing the connection after receiving the headers, for some reason.
- Linux on the API server is closing the connection for some reason. This seems unlikely, since long waits alone haven't successfully reproduced it, but it hasn't been ruled out completely.
- Linux on the client closes the connection, possibly because the process opens too many other connections to keep it open.
- Or, through some other mechanism, substantial Keep activity in the client code causes the API server connection to be closed without the original API client knowing.
Updated by Brett Smith over 9 years ago
Idea: Should _intercept_http_request in api.py retry socket.error the same way it retries BadStatusLine? The comments for that logic suggest the rationale is nearly identical.
Updated by Brett Smith over 9 years ago
- Story points set to 1.0
Plan: Do the naive retry in _intercept_http_request, and try the failing job with it. If that fixes things, then write up a test and get the branch merged properly.
If you can figure out how to close the HTTP object's socket before doing the retry, that's even better.
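A minimal sketch of what that naive retry could look like, assuming the wrapper keeps the shape implied by the traceback above (self is the httplib2.Http object, and orig_http_request is the original Http.request method); everything else here is illustrative:

import socket
import httplib

def _intercept_http_request(self, uri, **kwargs):
    # Sketch only: retry once on socket.error, mirroring the existing
    # BadStatusLine retry.
    try:
        return self.orig_http_request(uri, **kwargs)
    except (socket.error, httplib.BadStatusLine):
        # "Even better" step: httplib2 caches connections on the Http
        # object, so closing them forces the retry onto a fresh socket.
        for conn in self.connections.values():
            conn.close()
        return self.orig_http_request(uri, **kwargs)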
Updated by Brett Smith over 9 years ago
https://github.com/jcgregorio/httplib2/issues/240 suggests there is no way to close the socket explicitly.
Updated by Brett Smith over 9 years ago
- Status changed from New to In Progress
- Assigned To set to Brett Smith
Updated by Brett Smith over 9 years ago
I have a branch up on qr1hi that implements the planned fix, and I'm running the known-failing job with it.
Updated by Brett Smith over 9 years ago
I have had to update basically everything around the failing job (run-command, the Docker image) to get a valid test going. Still chipping away at it.
Updated by Brett Smith over 9 years ago
qr1hi-8i9sb-ar91eq68o7afgzx succeeded with the patch.
Updated by Brett Smith over 9 years ago
- Target version changed from Arvados Future Sprints to 2015-10-28 sprint
Branch with the patch is up for review. I am happy to work on a test and make sure it gets done before this issue is closed. But I wonder whether the job is enough of a test to go ahead and merge this now, given how many people have been affected by it.
Updated by Brett Smith over 9 years ago
For background, it may also help to look at httplib2's _conn_request method (__init__.py#L1265). Compare how it handles and retries different exceptions, including socket.error, with what we do in _intercept_http_request.
Note that httplib2 itself only retries socket errors that are ENETUNREACH or EADDRNOTAVAIL. Turning on its retry support wouldn't help with the cases we've seen, which are EPIPE and ETIMEDOUT.
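To make the distinction concrete, a small illustrative helper (the errno values are the ones named above; the function name is hypothetical):

import errno
import socket

# httplib2's built-in retry covers "address/network unreachable" errors;
# the failures in this ticket carry different errnos.
RETRIED_BY_HTTPLIB2 = (errno.ENETUNREACH, errno.EADDRNOTAVAIL)
SEEN_IN_THIS_TICKET = (errno.EPIPE, errno.ETIMEDOUT)

def needs_our_own_retry(exc):
    # True for socket errors httplib2 would not retry for us.
    return isinstance(exc, socket.error) and exc.errno in SEEN_IN_THIS_TICKET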
Updated by Nico César over 9 years ago
Reviewing 6a5cfd533c78824f65a9cf91e722570278d955e7.
1. To better understand the nature of the exception, it would be good to log exception.message with _logger.warning(e.message) before returning.
2. As a (maybe later) debugging improvement, it would be good to log the list of connections with _logger.debug() before conn.close().
Eyeballing the code, I can't come up with any problems that might arise right away. Tests pass (even though this change isn't tested).
1. and 2. aren't required for the merge in my opinion, but they are nice features to have (especially for ops). LGTM.
Updated by Brett Smith over 9 years ago
Nico César wrote:
Reviewing 6a5cfd533c78824f65a9cf91e722570278d955e7.
1. To better understand the nature of the exception, it would be good to log exception.message with _logger.warning(e.message) before returning.
I added this at the debug level, so you can see it if you run with ARVADOS_DEBUG=1. If this retry does end up being routine—which it's starting to look like it might be—I didn't think it would be helpful to clutter job logs with it by logging a warning.
2. As a (maybe later) debugging improvement, it would be good to log the list of connections with _logger.debug() before conn.close().
I think if we're worried about this we should investigate it interactively with pdb or something like that. To my mind, the only question is: am I right that the API server connection is the only one? If I am, that's all we need. If not, we need to consider different code entirely. I'll note that some of the other things we're doing in that function, like adding the Authorization header, are also inappropriate if the client makes connections to other servers. So the code already makes the same assumption and needs more fixes if it's wrong.
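For illustration, the debug logging discussed above might look roughly like this in the retry path; the helper name is hypothetical, and only _logger and conn.close() come from the discussion above.

import logging

_logger = logging.getLogger('arvados')

# Hypothetical sketch: log the error and the cached connections at debug
# level (visible with ARVADOS_DEBUG=1), then close the connections so the
# retry gets a fresh socket.
def _log_and_reset_connections(http, exc):
    _logger.debug("Retrying API request after error: %s", exc)
    _logger.debug("Cached httplib2 connections: %s", list(http.connections))
    for conn in http.connections.values():
        conn.close()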
Merged this. I'll work on testing it next. Thanks.
Updated by Brett Smith over 9 years ago
7587-pysdk-retry-test-wip is up for review. It has two commits. The first fixes a bug where the API client tests could talk to the API server configured in the environment, even a real one. See the commit log for more background and rationale. That commit only fixes that issue.
The second commit adds the actual test.
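For reference, the environment isolation in the first commit could be sketched like this; the class and variable names here are hypothetical, so see the commit itself for the real implementation.

import os
import unittest

class ApiClientTestCase(unittest.TestCase):
    # Hypothetical sketch: keep tests from accidentally talking to a real
    # API server configured in the caller's environment.
    ARVADOS_VARS = ('ARVADOS_API_HOST', 'ARVADOS_API_TOKEN',
                    'ARVADOS_API_HOST_INSECURE')

    def setUp(self):
        self._saved_env = {}
        for var in self.ARVADOS_VARS:
            self._saved_env[var] = os.environ.pop(var, None)

    def tearDown(self):
        for var, value in self._saved_env.items():
            if value is not None:
                os.environ[var] = value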
Updated by Brett Smith over 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:6e0145efb9929b5a854fb3461161460b0ac75fba.
Updated by Brett Smith almost 2 years ago
- Related to Bug #12684: Let user specify a retry strategy on the client object, used for all API calls added