Bug #5562 (closed)
[SDKs] Fix timeout problems by switching from urllib et al. to PycURL
Description
When there are lots of Keep clients on one node sending data to, or receiving data from, the same Keep server on another node, the network link can become saturated and some connections get starved.
The symptom is that some of the clients have their writes delayed, which results in connections timing out (an "Unexpected EOF" error on the server).
Setting a connection limit that rejects excess connections doesn't work: it's easy for a process to get four "Connection rejected" errors in a row and give up.
The underlying problem is that when urllib3 sets the "connection timeout", that timeout applies to every socket operation until the entire request has been sent; only then does urllib3 switch to the "read timeout". This means that if any single send() operation blocks for longer than the connection timeout, the entire request is aborted.
We could patch urllib3 to add a separate "request timeout", to be set in connection.HTTPConnection._new_conn() after the actual TCP socket is set up. Pushing a patch upstream and waiting for a new release (or maintaining a fork) could be awkward, though.
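For illustration, such a patch might look roughly like the following sketch. The "request_timeout" knob and the subclass are hypothetical, not an existing urllib3 option:

# Illustrative sketch of the proposed patch only; "request_timeout" is a
# hypothetical attribute, not part of urllib3.
from urllib3 import connection

class RequestTimeoutHTTPConnection(connection.HTTPConnection):
    request_timeout = 300  # seconds: generous per-operation limit for the whole request

    def _new_conn(self):
        # The parent method establishes the TCP connection, governed by
        # the normal connect timeout.
        sock = connection.HTTPConnection._new_conn(self)
        # Once the socket exists, relax the per-operation timeout so a
        # single slow send() no longer aborts the whole request.
        sock.settimeout(self.request_timeout)
        return sock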
Alternatively, we could switch to pycurl (the Python wrapper for libcurl).
Another related solution, for writes, is to use "Expect: 100-continue" and the "100 Continue" interim response to limit the number of simultaneous requests that are transmitting blocks.
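With pycurl the client side of that comes mostly for free. A minimal sketch (the URL and block contents below are placeholders): when UPLOAD is enabled and the body size is known, libcurl normally adds "Expect: 100-continue" itself and waits briefly for the server's "100 Continue" before transmitting the body, so a server that delays that interim response can throttle concurrent uploads.

import io
import pycurl

block = b'x' * (2 << 20)              # placeholder 2 MiB data block
body = io.BytesIO(block)
response = io.BytesIO()

curl = pycurl.Curl()
curl.setopt(pycurl.URL, 'http://keep.example:25107/locator')  # placeholder URL
curl.setopt(pycurl.UPLOAD, True)             # PUT with a request body
curl.setopt(pycurl.INFILESIZE, len(block))   # known size, so libcurl adds Expect: 100-continue
curl.setopt(pycurl.READFUNCTION, body.read)  # stream the block from memory
curl.setopt(pycurl.WRITEFUNCTION, response.write)
curl.perform()
curl.close()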
Files
Updated by Tom Clegg over 9 years ago
- Subject changed from [Keep] Better behavior when network link is saturated to [SDKs] Fix timeout problems by switching from urllib et al. to PycURL
- Category set to SDKs
- Target version changed from Bug Triage to 2015-04-29 sprint
- Story points set to 2.0
Updated by Tom Clegg over 9 years ago
- File arvados-python-client-0.1.20150429043929.tar.gz added
The test package is attached here, and available at /tmp/arvados-python-client-0.1.20150429043929.tar.gz on shell.q:
virtualenv ~/venv
. ~/venv/bin/activate
pip install /tmp/arvados-python-client-0.1.20150429043929.tar.gz
Updated by Peter Amstutz over 9 years ago
In KeepService.get() it looks like a 404 response will be treated as an error and cause the connection to be reset, but that doesn't seem desirable?
Updated by Peter Amstutz over 9 years ago
http://curl.haxx.se/docs/faq.html#I_want_a_different_time_out
We should set CURLOPT_LOW_SPEED_LIMIT (http://curl.haxx.se/libcurl/c/CURLOPT_LOW_SPEED_LIMIT.html) and CURLOPT_LOW_SPEED_TIME (http://curl.haxx.se/libcurl/c/CURLOPT_LOW_SPEED_TIME.html).
For example:
CURLOPT_LOW_SPEED_LIMIT 1 byte per second
CURLOPT_LOW_SPEED_TIME 20 seconds
This would abort the connection if there is no data transfer for 20 seconds.
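In pycurl terms that would be roughly the following (a sketch; the rest of the handle setup is assumed to happen elsewhere):

import pycurl

curl = pycurl.Curl()
# ... URL, WRITEFUNCTION, etc. are assumed to be set elsewhere ...
curl.setopt(pycurl.LOW_SPEED_LIMIT, 1)   # bytes per second
curl.setopt(pycurl.LOW_SPEED_TIME, 20)   # seconds
# libcurl aborts the transfer if throughput stays below the limit for
# LOW_SPEED_TIME consecutive seconds.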
Updated by Peter Amstutz over 9 years ago
(From IRC: agreed that when talking to a Keep node, the timeout should already be 20 seconds, so setting LOW_SPEED_TIME is unnecessary).
In addition to enabling SO_KEEPALIVE, we need to set TCP_KEEPIDLE and TCP_KEEPINTVL:
def _socket_open(self, family, socktype, protocol, address):
    """Because pycurl doesn't have CURLOPT_TCP_KEEPALIVE"""
    s = socket.socket(family, socktype, protocol)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 120)
    return s
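Presumably this socket factory gets hooked into each curl handle via pycurl's OPENSOCKETFUNCTION, roughly as follows (a sketch; the handle setup around it is assumed):

# Sketch: register the keepalive-aware socket factory on a curl handle
# (assumes this runs inside the same object that defines _socket_open).
curl = pycurl.Curl()
curl.setopt(pycurl.OPENSOCKETFUNCTION, self._socket_open)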
Updated by Tom Clegg over 9 years ago
- Status changed from New to Resolved
- % Done changed from 67 to 100
Applied in changeset arvados|commit:acc7d4a40ae78e866c983624ec36b7bae3b918a9.
Updated by Peter Amstutz over 9 years ago
- Status changed from Resolved to Feedback
Traceback (most recent call last):
  File "/usr/local/bin/arv-get", line 198, in <module>
    for data in f.readall():
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 94, in readall
    data = self.read(size, num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 45, in before_close_wrapper
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 154, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 201, in read
    num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 154, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/stream.py", line 85, in readfrom
    data.append(self._keepget(lr.locator, num_retries=num_retries)[lr.segment_offset:lr.segment_offset+lr.segment_size])
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 154, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/stream.py", line 74, in _keepget
    return self._keep.get(locator, num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 154, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 863, in get
    blob = keep_service.get(locator, timeout=self.current_timeout(num_retries-tries_left))
  File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 365, in get
    curl.setopt(pycurl.WRITEDATA, response_body)
TypeError: invalid arguments to setopt
2015-04-30_13:01:38 Exception in thread Thread-1:
2015-04-30_13:01:38 Traceback (most recent call last):
2015-04-30_13:01:38   File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
2015-04-30_13:01:38     self.run()
2015-04-30_13:01:38   File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 516, in run
2015-04-30_13:01:38     self.run_with_limiter(limiter)
2015-04-30_13:01:38   File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 529, in run_with_limiter
2015-04-30_13:01:38     timeout=self.args.get('timeout', None)))
2015-04-30_13:01:38   File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 430, in put
2015-04-30_13:01:38     curl.setopt(pycurl.WRITEDATA, response_body)
2015-04-30_13:01:38 TypeError: invalid arguments to setopt
Updated by Peter Amstutz over 9 years ago
On the compute node, pycurl is presumably using this package:
/usr/lib/pymodules/python2.7/pycurl-7.19.0.egg-info
Updated by Tom Clegg over 9 years ago
Peter Amstutz wrote:
On the compute node, pycurl is presumably using this package:
/usr/lib/pymodules/python2.7/pycurl-7.19.0.egg-info
Awesome. In that case I think this fix looks pretty compelling.
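For context, the "invalid arguments to setopt" error appears consistent with old pycurl (7.19.0) only accepting real file objects for WRITEDATA, so an in-memory buffer is rejected. A version-tolerant sketch (not necessarily the exact change that was merged) is to hand the buffer's bound write method to WRITEFUNCTION, which both old and new pycurl accept:

import io
import pycurl

response_body = io.BytesIO()
curl = pycurl.Curl()
# Old pycurl rejects non-file objects passed to WRITEDATA with
# "TypeError: invalid arguments to setopt"; WRITEFUNCTION takes any
# callable, so it works across pycurl versions.
curl.setopt(pycurl.WRITEFUNCTION, response_body.write)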
Updated by Tom Clegg over 9 years ago
- Target version changed from 2015-04-29 sprint to 2015-05-20 sprint
- Story points changed from 2.0 to 0.0
Updated by Tom Clegg over 9 years ago
On 5562-pycurl-upload at 589e0f3c (after merging master to deal with pycurl version problems) I get:
======================================================================
FAIL: test_timeout_slow_request (tests.test_keep_client.KeepClientTimeout)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 519, in test_timeout_slow_request
    self._test_200ms()
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 537, in _test_200ms
    loc = kc.put(self.DATA, copies=1, num_retries=0)
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 480, in __exit__
    self.assertLess(time.time() - self.t0, self.tmax)
AssertionError: 1.0026130676269531 not less than 0.3

======================================================================
FAIL: test_timeout_slow_response (tests.test_keep_client.KeepClientTimeout)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 523, in test_timeout_slow_response
    self._test_200ms()
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 537, in _test_200ms
    loc = kc.put(self.DATA, copies=1, num_retries=0)
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 480, in __exit__
    self.assertLess(time.time() - self.t0, self.tmax)
AssertionError: 1.002737045288086 not less than 0.3

======================================================================
FAIL: test_timeout_slow_response_body (tests.test_keep_client.KeepClientTimeout)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 527, in test_timeout_slow_response_body
    self._test_200ms()
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 537, in _test_200ms
    loc = kc.put(self.DATA, copies=1, num_retries=0)
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 480, in __exit__
    self.assertLess(time.time() - self.t0, self.tmax)
AssertionError: 1.002708911895752 not less than 0.3
I'm guessing this just means keepstub.py has to learn how to send "100 Continue" when it gets "Expect: 100-continue".
We should have at least one of:
- An explanation in the commit message of what was done (as a behavior change, not just a paraphrase of the lines of code)
- A reference to the issue# in the commit message
- A comment in the code saying that UPLOAD does that secret 100-continue thing
- A test, perhaps by augmenting keepstub.py to check (always, or when the test case wants it to) whether the Expect: header arrived; see the sketch below. (Perhaps even checking whether any data gets sent in the first N milliseconds in the absence of "100 Continue" -- but that might be more trouble than it's worth, depending on how/whether BaseHTTPServer handles the Expect header itself.)
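As an illustrative sketch of that last idea (handler and attribute names here are made up, not the real keepstub.py code, and it assumes BaseHTTPServer does not answer the Expect header on its own), a handler could record whether the Expect header arrived and send the interim response itself:

import BaseHTTPServer

class StubHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    protocol_version = 'HTTP/1.1'

    def do_PUT(self):
        # Record, for the test to inspect later, whether the client
        # announced the upload with Expect: 100-continue.
        expect = self.headers.get('Expect', '')
        self.server.saw_expect_continue = (expect.lower() == '100-continue')
        if self.server.saw_expect_continue:
            # Tell the client to go ahead with the request body.
            self.wfile.write('HTTP/1.1 100 Continue\r\n\r\n')
        self.rfile.read(int(self.headers.get('Content-Length', 0)))
        self.send_response(200)
        self.send_header('Content-Length', '0')
        self.end_headers()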
Updated by Peter Amstutz over 9 years ago
- Status changed from Feedback to Resolved
- % Done changed from 83 to 100
Applied in changeset arvados|commit:fc16c2ebbb92d9b0e52b6435b717f26b9680c7e6.