Bug #7235

[SDKs] Python Keep client whole-transfer timeout should be more lenient

Added by Brett Smith almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Sarah Guthrie
Category:
SDKs
Target version:
Start date:
09/08/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Description

Background

The Python Keep client sets a 300-second timeout to complete all requests. There are some real-world scenarios where this is too strict. For example, purely hypothetically, an Arvados developer might be working across an ocean, tethered through a cellular network. Everything will complete just fine, but whole 64 MiB blocks might not be able to finish transferring in five minutes.

The functional requirement is that a user with a slow but stable connection can successfully interact with a Keep proxy. (I am willing to let timeouts continue to serve as a performance sanity check for the not-proxy case, on the expectation that one admin has sufficient control over the entire stack there.)

Implementation

Refer to libcurl connection options for details.

When the Python Keep client connects to a proxy service, instead of setting TIMEOUT_MS, set LOW_SPEED_LIMIT and LOW_SPEED_TIME to ensure a minimum transfer rate of 2 MiB per 64-second interval: i.e., give up if the transfer speed drops below 32 KiB/s.
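As a quick sanity check on the numbers (plain arithmetic; the constant names here just mirror the curl option names and are not actual KeepClient code):

```python
# The floor described above: 2 MiB per 64-second interval.
LOW_SPEED_TIME = 64                                    # seconds
LOW_SPEED_LIMIT = (2 * 1024 * 1024) // LOW_SPEED_TIME  # bytes/sec
assert LOW_SPEED_LIMIT == 32 * 1024                    # i.e. 32 KiB/s

# At exactly the floor rate, a full 64 MiB block needs about 34 minutes,
# well past the old 300-second whole-transfer timeout.
seconds_per_block = (64 * 1024 * 1024) / LOW_SPEED_LIMIT
assert seconds_per_block == 2048.0
```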

Expected outcomes, assuming "mid-transfer outage" happens after 2 MiB has been transferred:

Available bandwidth | Server delay (req finish to resp start) | Mid-transfer outage | Outcome
33 KiB/s            | None                                    | None                | Success
32 KiB/s            | 1s                                      | None                | Fail
32 KiB/s            | None                                    | 1s                  | Fail
64 KiB/s            | 31s                                     | None                | Success
64 KiB/s            | 31s                                     | <32s                | Success
64 KiB/s            | <=31s                                   | >32s                | Fail
<2 MiB/s            | 63s                                     | None                | Fail
>2 MiB/s            | <63s                                    | None                | Success
>2 MiB/s            | <63s                                    | <63s                | Success ("normal")
>2 MiB/s            | >=64s                                   | None                | Fail ("classic timeout")
>2 MiB/s            | <63s                                    | >64s                | Fail ("classic timeout")
Of the last 100393 keepstore transactions on keep14.q:
  • 31 (0.03%) took longer than 30 seconds.
  • 8 (0.008%) took longer than 60 seconds.
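The outcomes table above can be checked against a simplified model of libcurl's low-speed abort (my own sketch, not Arvados code; it ignores the server-delay column and assumes one hard stall during an otherwise steady transfer):

```python
# Simplified model of libcurl's low-speed check: the transfer aborts
# once average throughput over a trailing LOW_SPEED_TIME window drops
# below LOW_SPEED_LIMIT bytes per second.
LOW_SPEED_LIMIT = 32 * 1024  # bytes/sec (32 KiB/s)
LOW_SPEED_TIME = 64          # seconds

def transfer_survives(bytes_per_second, stall_seconds):
    """True if a steady transfer with one mid-transfer stall of
    stall_seconds (during which no bytes move) avoids the abort."""
    if stall_seconds >= LOW_SPEED_TIME:
        return False  # a full window passes with zero throughput
    # Worst-case window: the whole stall plus the remaining active seconds.
    active = LOW_SPEED_TIME - stall_seconds
    worst_window_avg = bytes_per_second * active / LOW_SPEED_TIME
    return worst_window_avg >= LOW_SPEED_LIMIT

# Matches the table: 33 KiB/s with no stall succeeds, and 64 KiB/s
# survives a 31-second stall but not a 33-second one.
```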

Subtasks

Task #7684: Review 7235-python-keep-client-timeout (Resolved, Sarah Guthrie)

Task #7689: Write Tests (Resolved, Sarah Guthrie)


Related issues

Related to Arvados - Story #7477: [SDKs] Go Keep client enforces a minimum transfer rate for proxy connections (New, 10/07/2015)

Associated revisions

Revision e9c78ef7 (diff)
Added by Sarah Guthrie almost 4 years ago

Closes #7235. Instead of setting KeepService's pycurl.TIMEOUT_MS, set pycurl.LOW_SPEED_LIMIT and pycurl.LOW_SPEED_TIME.
Default LOW_SPEED_LIMIT is 32768 bytes per second. Default LOW_SPEED_TIME is 64 seconds.
If the user specifies a length-two tuple, the first item sets CONNECTTIMEOUT_MS, the second item sets LOW_SPEED_TIME,
and LOW_SPEED_LIMIT is set to 32768 bytes per second.

Added bandwidth simulator to keepstub, which uses millisecond precision (like curl) to measure timeouts.
Added tests to test_keep_client and modified existing tests to only use integers.
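The tuple handling described in the commit message might be sketched like this (hypothetical helper and dict keys for illustration; the real KeepClient sets the corresponding pycurl options directly on a curl handle):

```python
# Hypothetical sketch of the timeout mapping described above.
DEFAULT_LOW_SPEED_LIMIT = 32768  # bytes/sec (32 KiB/s)
DEFAULT_LOW_SPEED_TIME = 64      # seconds

def curl_timeout_opts(timeout):
    """Map a user timeout spec to curl option values.

    A 2-tuple is (connect_timeout_seconds, low_speed_time_seconds).
    LOW_SPEED_LIMIT always defaults to 32768, never 0, since 0 would
    disable curl's low-speed check entirely.
    """
    if isinstance(timeout, tuple):
        connect_timeout, low_speed_time = timeout
    else:
        connect_timeout, low_speed_time = timeout, DEFAULT_LOW_SPEED_TIME
    return {
        'CONNECTTIMEOUT_MS': int(connect_timeout * 1000),
        'LOW_SPEED_TIME': int(low_speed_time),
        'LOW_SPEED_LIMIT': DEFAULT_LOW_SPEED_LIMIT,
    }
```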

Revision 84260dab
Added by Sarah Guthrie almost 4 years ago

Closes #7235. Merge branch '7235-python-keep-client-timeout'

History

#1 Updated by Peter Amstutz almost 4 years ago

I've had the same problem trying to upload over my horrible 1.5 Mbit DSL, where the connection just gives up despite making slow but steady progress.

I agree that having an overall connection timeout is a bad idea. IIRC, curl provides options specifically for "minimum transfer speed" and "maximum time between reads", which are much better signals for determining whether the connection is stalled or just very slow.

#2 Updated by Brett Smith almost 4 years ago

  • Target version set to Arvados Future Sprints

#3 Updated by Brett Smith almost 4 years ago

General consensus is to use the LOW_SPEED options, assuming we can reasonably do something analogous in Go.

#4 Updated by Brett Smith almost 4 years ago

  • Subject changed from [SDKs] Keep client whole-transfer timeout should be more lenient to [SDKs] Python Keep client whole-transfer timeout should be more lenient
  • Description updated (diff)

Per yesterday's discussion. Go SDK changes will be split into a separate story.

#5 Updated by Brett Smith almost 4 years ago

  • Description updated (diff)

#6 Updated by Brett Smith almost 4 years ago

  • Story points set to 0.5

#7 Updated by Brett Smith almost 4 years ago

  • Target version changed from Arvados Future Sprints to 2015-10-28 sprint

#8 Updated by Brett Smith almost 4 years ago

  • Target version changed from 2015-10-28 sprint to Arvados Future Sprints

#9 Updated by Bryan Cosca almost 4 years ago

Since it seems like I have some free time, I'm going to look at this bug and see if I can make a fix. But I won't take it because I can't guarantee I'll figure it out in time for the next sprint.

#10 Updated by Brett Smith almost 4 years ago

  • Assigned To set to Sarah Guthrie
  • Target version changed from Arvados Future Sprints to 2015-11-11 sprint

#11 Updated by Sarah Guthrie almost 4 years ago

  • Status changed from New to In Progress

#12 Updated by Tom Clegg almost 4 years ago

  • Description updated (diff)

#13 Updated by Tom Clegg almost 4 years ago

  • Description updated (diff)

#14 Updated by Brett Smith almost 4 years ago

  • Target version changed from 2015-11-11 sprint to 2015-12-02 sprint

#15 Updated by Tom Clegg almost 4 years ago

If a caller provides a 2-tuple (presumably based on previous version) I suspect we should default to 32768, not 0. It looks like we're passing 0 if it's not provided, and I'm guessing that tells PyCURL to disable the "low speed" stuff entirely, leaving us without any timeout at all.

Do the old 200ms tests not work any more? I would expect them to, and without them it seems like we're not testing the connect timeout at all. (If we can reinstate them verbatim, that might be a nice way to confirm existing callers still get something reasonable when they pass a 2-tuple of timeouts.)

The rest is just comments/docs:

bytes vs. kibytes
  • update comments for :timeout: and :proxy_timeout: in KeepClient (32 → 32768)
  • change bandwidth_kib_per_s var to bandwidth_bps (or something else without k)

Instead of the "see http://docs.python-requests.org/en/latest/user/advanced/#timeouts" reference, how about something like: "a connection will be aborted if the average traffic rate falls below minimum_bandwidth bytes per second over an interval of read_timeout seconds."

This comment seems to apply to TIMEOUT_TIME and should move above it (and gain a space after "#"): "#Needs to be low to trigger bandwidth errors before we run out of data"

I think this comment would be better rephrased/moved to setbandwidth's docstring:

        #If self.bandwidth = None, function at maximum bandwidth
        #Otherwise, self.bandwidth is the maximum number of bytes per second to
        #   operate at.

#16 Updated by Sarah Guthrie almost 4 years ago

Tom Clegg wrote:

If a caller provides a 2-tuple (presumably based on previous version) I suspect we should default to 32768, not 0. It looks like we're passing 0 if it's not provided, and I'm guessing that tells PyCURL to disable the "low speed" stuff entirely, leaving us without any timeout at all.

Alright, fixed.

Do the old 200ms tests not work any more? I would expect them to, and without them it seems like we're not testing the connect timeout at all. (If we can reinstate them verbatim, that might be a nice way to confirm existing callers still get something reasonable when they pass a 2-tuple of timeouts.)

I put them back in, but the timeouts don't work anymore. Updates are in the appropriate branch.

#17 Updated by Tom Clegg almost 4 years ago

Pushed a couple of commits on top of this, most drastic part is slightly reducing the time wasted by those 2-tuple-timeout tests. LGTM if that LGTY.

#18 Updated by Sarah Guthrie almost 4 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:e9c78ef7855e7ae263fe461e069c89ff7fc0b798.
