Bug #7235
[SDKs] Python Keep client whole-transfer timeout should be more lenient
Status: Closed
Description
Background
The Python Keep client sets a 300-second timeout to complete all requests. There are real-world scenarios where this is too strict. For example, purely hypothetically, an Arvados developer might be working across an ocean, tethered through a cellular network. Everything completes just fine, but a whole 64 MiB block might not be able to finish transferring in five minutes.
The functional requirement is that a user with a slow but stable connection can successfully interact with a Keep proxy. (I am willing to let timeouts continue to serve as a performance sanity check for the non-proxy case, on the expectation that a single admin has sufficient control over the entire stack there.)
Implementation
Refer to libcurl connection options for details.
When the Python Keep client connects to a proxy service, instead of setting TIMEOUT_MS, set LOW_SPEED_LIMIT and LOW_SPEED_TIME to require a minimum transfer rate of 2 MiB per 64-second interval; i.e., give up if the transfer speed drops below 32 KiB/s.
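For illustration, a minimal PyCURL sketch of this approach (LOW_SPEED_LIMIT, LOW_SPEED_TIME, and CONNECTTIMEOUT_MS are real libcurl/PyCURL options; the helper name and connect-timeout value are placeholders, not the actual KeepClient code):

```python
import pycurl

def configure_proxy_timeouts(curl, connect_timeout_ms=20000,
                             low_speed_limit=32768, low_speed_time=64):
    """Abort only when a transfer is genuinely stalled, not merely slow."""
    # Still bound the connection phase with a conventional timeout.
    curl.setopt(pycurl.CONNECTTIMEOUT_MS, connect_timeout_ms)
    # 2 MiB per 64 s == 32768 bytes/s. libcurl aborts the transfer if the
    # average speed stays below LOW_SPEED_LIMIT bytes/s for LOW_SPEED_TIME
    # consecutive seconds -- no whole-transfer TIMEOUT_MS is set.
    curl.setopt(pycurl.LOW_SPEED_LIMIT, low_speed_limit)
    curl.setopt(pycurl.LOW_SPEED_TIME, low_speed_time)
```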
Expected outcomes, assuming "mid-transfer outage" happens after 2 MiB has been transferred:
| Available bandwidth | Server delay (req finish to resp start) | Mid-transfer outage | Outcome | Notes |
|---|---|---|---|---|
| 33 KiB/s | None | None | Success | |
| 32 KiB/s | 1s | None | Fail | |
| 32 KiB/s | None | 1s | Fail | |
| 64 KiB/s | 31s | None | Success | |
| 64 KiB/s | 31s | <32s | Success | |
| 64 KiB/s | <=31s | >32s | Fail | |
| <2 MiB/s | 63s | None | Fail | |
| >2 MiB/s | <63s | None | Success | |
| >2 MiB/s | <63s | <63s | Success | "normal" |
| >2 MiB/s | >=64s | None | Fail | "classic timeout" |
| >2 MiB/s | <63s | >64s | Fail | "classic timeout" |
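As a rough sanity check of these outcomes (my own arithmetic, not part of the issue): a transfer survives as long as every 64-second window averages at least 32 KiB/s.

```python
# Back-of-the-envelope model of LOW_SPEED_LIMIT / LOW_SPEED_TIME behaviour
# (an approximation, not libcurl's exact accounting).
LOW_SPEED_LIMIT = 32 * 1024   # bytes per second
LOW_SPEED_TIME = 64           # seconds

def worst_window_avg(bandwidth_bps, idle_seconds):
    """Average rate over the worst 64 s window: idle time, then steady transfer."""
    busy = max(LOW_SPEED_TIME - idle_seconds, 0)
    return bandwidth_bps * busy / LOW_SPEED_TIME

# 64 KiB/s with a 31 s server delay: 33 s of data in the window -> 33 KiB/s, success.
assert worst_window_avg(64 * 1024, 31) > LOW_SPEED_LIMIT
# Any bandwidth with a >=64 s gap: a full window with no data -> "classic timeout".
assert worst_window_avg(64 * 1024, 64) < LOW_SPEED_LIMIT
```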
- 31 (0.03%) took longer than 30 seconds.
- 8 (0.008%) took longer than 60 seconds.
Related issues
Updated by Peter Amstutz about 9 years ago
I've had the same problem trying to upload over my horrible 1.5 Mbit DSL, where the connection just gives up despite making slow but steady progress.
I agree that having an overall connection timeout is a bad idea. IIRC curl provides options specifically for "minimum transfer speed" and "maximum time between reads", which are much better signals for determining whether the connection is stalled or just very slow.
Updated by Brett Smith about 9 years ago
- Target version set to Arvados Future Sprints
Updated by Brett Smith about 9 years ago
General consensus is to use the LOW_SPEED_* options, assuming we can reasonably do something analogous in Go.
Updated by Brett Smith about 9 years ago
- Subject changed from [SDKs] Keep client whole-transfer timeout should be more lenient to [SDKs] Python Keep client whole-transfer timeout should be more lenient
- Description updated (diff)
Per yesterday's discussion. Go SDK changes will be split into a separate story.
Updated by Brett Smith about 9 years ago
- Target version changed from Arvados Future Sprints to 2015-10-28 sprint
Updated by Brett Smith about 9 years ago
- Target version changed from 2015-10-28 sprint to Arvados Future Sprints
Updated by Bryan Cosca about 9 years ago
Since it seems like I have some free time, I'm going to look at this bug and see if I can make a fix. But I won't take it because I can't guarantee I'll figure it out in time for the next sprint.
Updated by Brett Smith about 9 years ago
- Assigned To set to Sarah Guthrie
- Target version changed from Arvados Future Sprints to 2015-11-11 sprint
Updated by Sarah Guthrie about 9 years ago
- Status changed from New to In Progress
Updated by Brett Smith about 9 years ago
- Target version changed from 2015-11-11 sprint to 2015-12-02 sprint
Updated by Tom Clegg about 9 years ago
If a caller provides a 2-tuple (presumably based on the previous version), I suspect we should default to 32768, not 0. It looks like we're passing 0 if it's not provided, and I'm guessing that tells PyCURL to disable the "low speed" stuff entirely, leaving us without any timeout at all.
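As an illustration of this suggestion (a hypothetical helper, not the actual KeepClient code; the 3-tuple layout of connect timeout, read timeout, and minimum bandwidth is assumed from the discussion above):

```python
def normalize_timeout(timeout):
    """Return (connect_timeout, read_timeout, minimum_bandwidth).

    Legacy callers that pass a 2-tuple (connect, read) get a default
    minimum bandwidth of 32768 bytes/s rather than 0, because setting
    LOW_SPEED_LIMIT to 0 disables libcurl's stall detection entirely,
    leaving no timeout at all once the connection is established.
    """
    if isinstance(timeout, tuple):
        if len(timeout) == 2:
            connect, read = timeout
            return (connect, read, 32768)
        return timeout
    # A bare number applies to both the connect and read phases.
    return (timeout, timeout, 32768)
```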
Do the old 200ms tests not work any more? I would expect them to, and without them it seems like we're not testing the connect timeout at all. (If we can reinstate them verbatim, that might be a nice way to confirm existing callers still get something reasonable when they pass a 2-tuple of timeouts.)
The rest is just comments/docs:
- bytes vs. kibibytes: update comments for :timeout: and :proxy_timeout: in KeepClient (32 → 32768)
- change bandwidth_kib_per_s var to bandwidth_bps (or something else without k)
- Instead of the "see http://docs.python-requests.org/en/latest/user/advanced/#timeouts" reference, how about something like: "a connection will be aborted if the average traffic rate falls below minimum_bandwidth bytes per second over an interval of read_timeout seconds."
- This comment seems to apply to TIMEOUT_TIME and should move above it (and gain a space after "#"): "#Needs to be low to trigger bandwidth errors before we run out of data"
- I think this comment would be better rephrased/moved to setbandwidth's docstring:
#If self.bandwidth = None, function at maximum bandwidth
#Otherwise, self.bandwidth is the maximum number of bytes per second to
# operate at.
Updated by Sarah Guthrie about 9 years ago
Tom Clegg wrote:
If a caller provides a 2-tuple (presumably based on the previous version), I suspect we should default to 32768, not 0. It looks like we're passing 0 if it's not provided, and I'm guessing that tells PyCURL to disable the "low speed" stuff entirely, leaving us without any timeout at all.
Alright, fixed.
Do the old 200ms tests not work any more? I would expect them to, and without them it seems like we're not testing the connect timeout at all. (If we can reinstate them verbatim, that might be a nice way to confirm existing callers still get something reasonable when they pass a 2-tuple of timeouts.)
I put them back in, but the timeouts don't work anymore. Updates are in the appropriate branch.
Updated by Tom Clegg about 9 years ago
Pushed a couple of commits on top of this; the most drastic part slightly reduces the time wasted by those 2-tuple-timeout tests. LGTM if that LGTY.
Updated by Sarah Guthrie about 9 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:e9c78ef7855e7ae263fe461e069c89ff7fc0b798.