Bug #5562

[SDKs] Fix timeout problems by switching from urllib et al. to PycURL

Added by Peter Amstutz over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Target version:
Start date:
04/20/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.0

Description

When there are lots of keep clients on one node sending or receiving lots of data to the same keep server on another node, the network link can become saturated and some connections get starved.

The symptom is that some of the clients have their writes delayed, which results in connections timing out (an "Unexpected EOF" error on the server).

Setting a connection limit that rejects excess connections doesn't work: it's easy for a process to get 4 "Connection rejected" errors in a row and give up.

The underlying problem is that when urllib3 sets the "connection timeout", that timeout applies to all socket operations until the entire request has been sent; only then does it switch over to the "read timeout". This means that if any single send() operation blocks for longer than the connection timeout, the entire request is aborted.
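The mechanics are visible with a plain socket (an illustrative sketch, not urllib3 internals): settimeout() bounds each blocking call individually, and stays in force until it is changed.

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(3.5)  # urllib3's "connection timeout" ends up here
# That same 3.5s now bounds *each* subsequent send() while the request
# body goes out; a single send() that blocks longer raises socket.timeout
# and aborts the whole request, even if data was otherwise flowing.
assert s.gettimeout() == 3.5
```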

We could patch urllib3 to add a separate "request timeout" to be set in connection.HTTPConnection._new_conn() after the actual TCP socket is set up. Pushing an upstream patch and waiting for a new release (or maintaining a forked version) could be awkward, though.

Alternately, we could switch to pycurl (python wrapper for libcurl).

Another related solution for writes is to use Expect-continue and 100 Continue to limit the number of simultaneous requests that are transmitting blocks.


Subtasks

Task #5639: Update test utils to mock pycurl instead of requests (Resolved, Tom Clegg)

Task #5638: Update keep client library (Resolved, Tom Clegg)

Task #5637: Update package dependencies (Resolved, Tom Clegg)

Task #5789: Test somehow (Resolved, Tom Clegg)

Task #5847: Review 5562-pycurl (Resolved)

Task #5868: Review 5562-pycurl-upload (Resolved, Peter Amstutz)

Associated revisions

Revision acc7d4a4
Added by Tom Clegg over 6 years ago

Merge branch '5562-pycurl' closes #5562

Revision 6c9dde5e (diff)
Added by Tom Clegg over 6 years ago

5562: Use WRITEFUNCTION instead of WRITEDATA for compatibility with older pycurl. refs #5562

Revision f94d59e8
Added by Tom Clegg over 6 years ago

Merge branch '5562-pycurl' refs #5562

Revision fc16c2eb
Added by Peter Amstutz over 6 years ago

Merge branch '5562-pycurl-upload' closes #5562

History

#1 Updated by Peter Amstutz over 6 years ago

  • Description updated (diff)

#2 Updated by Tom Clegg over 6 years ago

  • Subject changed from [Keep] Better behavior when network link is saturated to [SDKs] Fix timeout problems by switching from urllib et al. to PycURL
  • Category set to SDKs
  • Target version changed from Bug Triage to 2015-04-29 sprint
  • Story points set to 2.0

#3 Updated by Tom Clegg over 6 years ago

  • Assigned To set to Tom Clegg

#4 Updated by Tom Clegg over 6 years ago

Test package is here, and /tmp/arvados-python-client-0.1.20150429043929.tar.gz on shell.q

virtualenv ~/venv
. ~/venv/bin/activate
pip install /tmp/arvados-python-client-0.1.20150429043929.tar.gz

#5 Updated by Peter Amstutz over 6 years ago

  • In KeepService.get() it looks like a 404 response will be treated as an error and cause the connection to be reset; that doesn't seem desirable.

#6 Updated by Peter Amstutz over 6 years ago

http://curl.haxx.se/docs/faq.html#I_want_a_different_time_out

We should set http://curl.haxx.se/libcurl/c/CURLOPT_LOW_SPEED_LIMIT.html and http://curl.haxx.se/libcurl/c/CURLOPT_LOW_SPEED_TIME.html

For example:

CURLOPT_LOW_SPEED_LIMIT 1 byte per second

CURLOPT_LOW_SPEED_TIME 20 seconds

This would abort the connection if there is no data transfer for 20 seconds.
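A minimal sketch of wiring these up (the helper name is ours; the numeric fallbacks are the CURLOPT_LOW_SPEED_* values from curl.h, used only so the sketch runs even without pycurl installed):

```python
try:
    import pycurl
    LOW_SPEED_LIMIT = pycurl.LOW_SPEED_LIMIT
    LOW_SPEED_TIME = pycurl.LOW_SPEED_TIME
except ImportError:
    # CURLOPT_LOW_SPEED_LIMIT / CURLOPT_LOW_SPEED_TIME from curl.h
    LOW_SPEED_LIMIT, LOW_SPEED_TIME = 19, 20

def set_stall_abort(curl, min_bytes_per_sec=1, stall_seconds=20):
    # Abort the transfer if throughput stays below min_bytes_per_sec for
    # stall_seconds consecutive seconds -- a stall detector, rather than
    # a hard deadline that can fire while data is still trickling through.
    curl.setopt(LOW_SPEED_LIMIT, min_bytes_per_sec)
    curl.setopt(LOW_SPEED_TIME, stall_seconds)
```

With a real pycurl.Curl handle this would be called before curl.perform(); the defaults match the 1 byte/sec, 20 second example above.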

#7 Updated by Peter Amstutz over 6 years ago

(From IRC: agreed that when talking to a Keep node, the timeout should already be 20 seconds, so setting LOW_SPEED_TIME is unnecessary).

In addition to enabling SO_KEEPALIVE, we need to set TCP_KEEPIDLE and TCP_KEEPINTVL:

        def _socket_open(self, family, socktype, protocol, address):
            """Because pycurl doesn't have CURLOPT_TCP_KEEPALIVE"""
            s = socket.socket(family, socktype, protocol)
            s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
            s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120)   # seconds idle before first probe
            s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 120)  # seconds between probes
            return s
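To sanity-check that these options actually stick, the values can be read back with getsockopt (TCP_KEEPIDLE/TCP_KEEPINTVL are the Linux names; this is a verification sketch, not part of the fix):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 120)

# Read the options back to confirm the kernel accepted them.
assert s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE) == 120
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL) == 120
```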

#8 Updated by Tom Clegg over 6 years ago

  • Status changed from New to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:acc7d4a40ae78e866c983624ec36b7bae3b918a9.

#9 Updated by Peter Amstutz over 6 years ago

  • Status changed from Resolved to Feedback
Traceback (most recent call last):
  File "/usr/local/bin/arv-get", line 198, in <module>
    for data in f.readall():
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 94, in readall
    data = self.read(size, num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 45, in before_close_wrapper
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 154, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 201, in read
    num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 154, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/stream.py", line 85, in readfrom
    data.append(self._keepget(lr.locator, num_retries=num_retries)[lr.segment_offset:lr.segment_offset+lr.segment_size])
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 154, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/stream.py", line 74, in _keepget
    return self._keep.get(locator, num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 154, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 863, in get
    blob = keep_service.get(locator, timeout=self.current_timeout(num_retries-tries_left))
  File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 365, in get
    curl.setopt(pycurl.WRITEDATA, response_body)
TypeError: invalid arguments to setopt
2015-04-30_13:01:38 Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 516, in run
    self.run_with_limiter(limiter)
  File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 529, in run_with_limiter
    timeout=self.args.get('timeout', None)))
  File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 430, in put
    curl.setopt(pycurl.WRITEDATA, response_body)
TypeError: invalid arguments to setopt

#10 Updated by Peter Amstutz over 6 years ago

On the compute node, pycurl is presumably using this package:

/usr/lib/pymodules/python2.7/pycurl-7.19.0.egg-info

#11 Updated by Tom Clegg over 6 years ago

Peter Amstutz wrote:

> On the compute node, pycurl is presumably using this package:
>
> /usr/lib/pymodules/python2.7/pycurl-7.19.0.egg-info

Awesome. In that case I think this fix looks pretty compelling.
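For the record, the compatibility fix in commit 6c9dde5e amounts to registering a write callback rather than handing WRITEDATA a file-like object, which pycurl 7.19.0 rejects with "TypeError: invalid arguments to setopt". A sketch (FakeCurl is a hypothetical stand-in recorder, in the spirit of the mock-pycurl test utils from task #5639; with real pycurl the option constant is pycurl.WRITEFUNCTION):

```python
import io

class FakeCurl:
    """Hypothetical stand-in for pycurl.Curl that records setopt calls."""
    def __init__(self):
        self.opts = {}
    def setopt(self, option, value):
        self.opts[option] = value

WRITEFUNCTION = "WRITEFUNCTION"  # pycurl.WRITEFUNCTION with real pycurl

curl = FakeCurl()
body = io.BytesIO()
# Portable across pycurl versions: libcurl delivers the response in
# chunks to the callback, which accumulates them in the buffer.
curl.setopt(WRITEFUNCTION, body.write)
# Simulate libcurl invoking the registered callback:
curl.opts[WRITEFUNCTION](b"response data")
```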

#12 Updated by Tom Clegg over 6 years ago

  • Target version changed from 2015-04-29 sprint to 2015-05-20 sprint
  • Story points changed from 2.0 to 0.0

#13 Updated by Tom Clegg over 6 years ago

On 5562-pycurl-upload at 589e0f3c (after merging master to deal with pycurl version problems) I get:

======================================================================
FAIL: test_timeout_slow_request (tests.test_keep_client.KeepClientTimeout)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 519, in test_timeout_slow_request
    self._test_200ms()
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 537, in _test_200ms
    loc = kc.put(self.DATA, copies=1, num_retries=0)
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 480, in __exit__
    self.assertLess(time.time() - self.t0, self.tmax)
AssertionError: 1.0026130676269531 not less than 0.3

======================================================================
FAIL: test_timeout_slow_response (tests.test_keep_client.KeepClientTimeout)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 523, in test_timeout_slow_response
    self._test_200ms()
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 537, in _test_200ms
    loc = kc.put(self.DATA, copies=1, num_retries=0)
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 480, in __exit__
    self.assertLess(time.time() - self.t0, self.tmax)
AssertionError: 1.002737045288086 not less than 0.3

======================================================================
FAIL: test_timeout_slow_response_body (tests.test_keep_client.KeepClientTimeout)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 527, in test_timeout_slow_response_body
    self._test_200ms()
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 537, in _test_200ms
    loc = kc.put(self.DATA, copies=1, num_retries=0)
  File "/home/tom/src/ar2/sdk/python/tests/test_keep_client.py", line 480, in __exit__
    self.assertLess(time.time() - self.t0, self.tmax)
AssertionError: 1.002708911895752 not less than 0.3

I'm guessing this just means keepstub.py has to learn how to send "100 Continue" when it gets "Expect: 100-continue".

We should have at least one of:
  • The issue# in the commit message
  • An explanation in the commit message of what was done (as a behavior change, not just a paraphrase of the lines of code)
  • A comment in the code noting that UPLOAD does that secret 100-continue thing
And:
  • A test, perhaps by augmenting keepstub.py to check (always, or when the test case wants it to) whether the Expect: header arrived. (Perhaps even checking whether any data gets sent in the first N milliseconds in the absence of 100 Continue -- but that might be more trouble than it's worth, depending on how/whether BaseHTTPServer handles the Expect header itself.)
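A sketch of what that keepstub change could look like, shown on modern Python's http.server for illustration (the original keepstub ran on Python 2's BaseHTTPServer; the handler and attribute names here are hypothetical). With protocol_version set to HTTP/1.1, BaseHTTPRequestHandler already answers "Expect: 100-continue" via its handle_expect_100() hook, which a test stub can override to record the header or suppress the interim response:

```python
import http.server

class StubPutHandler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # 100-continue requires HTTP/1.1
    saw_expect = False             # lets a test assert the header arrived

    def handle_expect_100(self):
        # Called automatically when the client sends "Expect: 100-continue".
        # Returning True emits "HTTP/1.1 100 Continue" and proceeds; a test
        # could return False here to verify the client withholds its body.
        type(self).saw_expect = True
        return super().handle_expect_100()

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep test output quiet
```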

#14 Updated by Tom Clegg over 6 years ago

bde53a1 LGTM, thanks

#15 Updated by Peter Amstutz over 6 years ago

  • Status changed from Feedback to Resolved
  • % Done changed from 83 to 100

Applied in changeset arvados|commit:fc16c2ebbb92d9b0e52b6435b717f26b9680c7e6.
