Bug #7235

closed

[SDKs] Python Keep client whole-transfer timeout should be more lenient

Added by Brett Smith over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Sarah Guthrie
Category:
SDKs
Target version:
Story points:
0.5

Description

Background

The Python Keep client sets a 300-second timeout for each request to complete. There are some real-world scenarios where this is too strict. For example, purely hypothetically, an Arvados developer might be working across an ocean, tethered through a cellular network. Transfers will complete just fine, but a whole 64 MiB block might not finish in five minutes.

The functional requirement is that a user with a slow but stable connection can successfully interact with a Keep proxy. (I am willing to let timeouts continue to serve as a performance sanity check for the not-proxy case, on the expectation that one admin has sufficient control over the entire stack there.)

Implementation

Refer to libcurl connection options for details.

When the Python Keep client connects to a proxy service, instead of setting TIMEOUT_MS, set LOW_SPEED_LIMIT and LOW_SPEED_TIME to ensure a minimum transfer rate of 2 MiB per 64-second interval; i.e., give up if the transfer speed drops below 32 KiB/s.
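A minimal sketch of this option swap, assuming a pycurl-style setopt interface. The helper name and the string-keyed recording handle are illustrative only; real pycurl's setopt takes integer constants such as pycurl.LOW_SPEED_LIMIT. LOW_SPEED_LIMIT and LOW_SPEED_TIME themselves are real libcurl options (bytes/second and seconds, respectively).

```python
# Sketch only: configure_proxy_timeouts and the recording handle are
# hypothetical; LOW_SPEED_LIMIT / LOW_SPEED_TIME are real libcurl options.
LOW_SPEED_LIMIT = 32 * 1024  # bytes/sec: the 32 KiB/s floor (2 MiB / 64 s)
LOW_SPEED_TIME = 64          # seconds below the floor before aborting

def configure_proxy_timeouts(curl):
    """Swap the whole-transfer timeout for a minimum-rate check."""
    # Deliberately do NOT set TIMEOUT_MS: a slow-but-steady transfer
    # through a proxy should be allowed to finish.
    curl.setopt('LOW_SPEED_LIMIT', LOW_SPEED_LIMIT)
    curl.setopt('LOW_SPEED_TIME', LOW_SPEED_TIME)

class _RecordingHandle:
    """Stand-in for a pycurl handle, for demonstration."""
    def __init__(self):
        self.opts = {}
    def setopt(self, option, value):
        self.opts[option] = value

handle = _RecordingHandle()
configure_proxy_timeouts(handle)
```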

Expected outcomes, assuming "mid-transfer outage" happens after 2 MiB has been transferred:

Available bandwidth | Server delay (req finish to resp start) | Mid-transfer outage | Outcome
33 KiB/s            | None                                    | None                | Success
32 KiB/s            | 1s                                      | None                | Fail
32 KiB/s            | None                                    | 1s                  | Fail
64 KiB/s            | 31s                                     | None                | Success
64 KiB/s            | 31s                                     | <32s                | Success
64 KiB/s            | <=31s                                   | >32s                | Fail
<2 MiB/s            | 63s                                     | None                | Fail
>2 MiB/s            | <63s                                    | None                | Success
>2 MiB/s            | <63s                                    | <63s                | Success ("normal")
>2 MiB/s            | >=64s                                   | None                | Fail ("classic timeout")
>2 MiB/s            | <63s                                    | >64s                | Fail ("classic timeout")
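The outcomes above can be sanity-checked with a toy model of the rule (my interpretation, not libcurl's exact accounting): a transfer fails if fewer than 2 MiB arrive in any rolling 64-second window. The function and scenario names below are illustrative.

```python
# Toy model of the proposed low-speed rule (an interpretation, not
# libcurl's exact bookkeeping): fail if any rolling 64-second window
# carries less than LOW_SPEED_LIMIT * LOW_SPEED_TIME = 2 MiB.
WINDOW = 64        # seconds (LOW_SPEED_TIME)
FLOOR = 32 * 1024  # bytes/sec (LOW_SPEED_LIMIT)

def transfer_succeeds(samples):
    """samples: bytes received in each 1-second slice of the transfer."""
    if len(samples) < WINDOW:
        return True  # transfer finished before a full window elapsed
    for start in range(len(samples) - WINDOW + 1):
        if sum(samples[start:start + WINDOW]) < FLOOR * WINDOW:
            return False
    return True

KiB = 1024
# 33 KiB/s steady for two minutes: success (first table row).
steady_33 = [33 * KiB] * 120
# 32 KiB/s after a 1-second server delay: fail (second row).
delayed_32 = [0] + [32 * KiB] * 120
# 64 KiB/s after a 31-second server delay: success (fourth row).
delay31_64 = [0] * 31 + [64 * KiB] * 64
```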
Of the last 100,393 keepstore transactions on keep14.q:
  • 31 (0.03%) took longer than 30 seconds.
  • 8 (0.008%) took longer than 60 seconds.

Subtasks (2: 0 open, 2 closed)

Task #7684: Review 7235-python-keep-client-timeout (Resolved, Sarah Guthrie, 09/08/2015)
Task #7689: Write Tests (Resolved, Sarah Guthrie, 09/08/2015)

Related issues

Related to Arvados - Idea #7477: [SDKs] Go Keep client enforces a minimum transfer rate for proxy connections (New)
#1

Updated by Peter Amstutz over 8 years ago

I've had the same problem trying to upload over my horrible 1.5 Mbit DSL, where the connection just gives up despite making slow but steady progress.

I agree that having an overall connection timeout is a bad idea. IIRC, curl provides options specifically for "minimum transfer speed" and "maximum time between reads", which are much better signals for determining whether the connection is stalled or just very slow.

#2

Updated by Brett Smith over 8 years ago

  • Target version set to Arvados Future Sprints
#3

Updated by Brett Smith over 8 years ago

General consensus is to use the LOW_SPEED options, assuming we can reasonably do something analogous in Go.

#4

Updated by Brett Smith over 8 years ago

  • Subject changed from [SDKs] Keep client whole-transfer timeout should be more lenient to [SDKs] Python Keep client whole-transfer timeout should be more lenient
  • Description updated (diff)

Per yesterday's discussion. Go SDK changes will be split into a separate story.

#5

Updated by Brett Smith over 8 years ago

  • Description updated (diff)
#6

Updated by Brett Smith over 8 years ago

  • Story points set to 0.5
#7

Updated by Brett Smith over 8 years ago

  • Target version changed from Arvados Future Sprints to 2015-10-28 sprint
#8

Updated by Brett Smith over 8 years ago

  • Target version changed from 2015-10-28 sprint to Arvados Future Sprints
#9

Updated by Bryan Cosca over 8 years ago

Since it seems like I have some free time, I'm going to look at this bug and see if I can make a fix. But I won't take it because I can't guarantee I'll figure it out in time for the next sprint.

#10

Updated by Brett Smith over 8 years ago

  • Assigned To set to Sarah Guthrie
  • Target version changed from Arvados Future Sprints to 2015-11-11 sprint
#11

Updated by Sarah Guthrie over 8 years ago

  • Status changed from New to In Progress
#12

Updated by Tom Clegg over 8 years ago

  • Description updated (diff)
#13

Updated by Tom Clegg over 8 years ago

  • Description updated (diff)
#14

Updated by Brett Smith over 8 years ago

  • Target version changed from 2015-11-11 sprint to 2015-12-02 sprint
#15

Updated by Tom Clegg over 8 years ago

If a caller provides a 2-tuple (presumably based on previous version) I suspect we should default to 32768, not 0. It looks like we're passing 0 if it's not provided, and I'm guessing that tells PyCURL to disable the "low speed" stuff entirely, leaving us without any timeout at all.
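The defaulting suggested here might look like the following sketch. The helper name and the exact tuple layout are my assumptions, not the actual branch code; the point is only that a legacy 2-tuple should get a 32768 B/s floor rather than 0 (which would disable the low-speed check entirely).

```python
# Hypothetical sketch of the 2-tuple defaulting under discussion:
# accept the old (connect_timeout, read_timeout) pair or a newer
# 3-tuple that also carries a minimum bandwidth, defaulting the
# missing value to 32768 B/s instead of 0.
DEFAULT_MIN_BANDWIDTH = 32768  # bytes/sec, never 0

def normalize_timeout(timeout):
    """Return (connect_timeout, read_timeout, minimum_bandwidth)."""
    if len(timeout) == 2:
        connect, read = timeout
        return (connect, read, DEFAULT_MIN_BANDWIDTH)
    return tuple(timeout)
```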

Do the old 200ms tests not work any more? I would expect them to, and without them it seems like we're not testing the connect timeout at all. (If we can reinstate them verbatim, that might be a nice way to confirm existing callers still get something reasonable when they pass a 2-tuple of timeouts.)

The rest is just comments/docs:

bytes vs. kibytes
  • update comments for :timeout: and :proxy_timeout: in KeepClient (32 → 32768)
  • change bandwidth_kib_per_s var to bandwidth_bps (or something else without k)

Instead of the "see http://docs.python-requests.org/en/latest/user/advanced/#timeouts" reference, how about something like: "a connection will be aborted if the average traffic rate falls below minimum_bandwidth bytes per second over an interval of read_timeout seconds."

This comment seems to apply to TIMEOUT_TIME and should move above it (and gain a space after "#"): "#Needs to be low to trigger bandwidth errors before we run out of data"

I think this comment would be better rephrased/moved to setbandwidth's docstring:

        #If self.bandwidth = None, function at maximum bandwidth
        #Otherwise, self.bandwidth is the maximum number of bytes per second to
        #   operate at.

#16

Updated by Sarah Guthrie over 8 years ago

Tom Clegg wrote:

> If a caller provides a 2-tuple (presumably based on previous version) I suspect we should default to 32768, not 0. It looks like we're passing 0 if it's not provided, and I'm guessing that tells PyCURL to disable the "low speed" stuff entirely, leaving us without any timeout at all.

Alright, fixed.

> Do the old 200ms tests not work any more? I would expect them to, and without them it seems like we're not testing the connect timeout at all. (If we can reinstate them verbatim, that might be a nice way to confirm existing callers still get something reasonable when they pass a 2-tuple of timeouts.)

I put them back in, but the timeouts don't work anymore. Updates are in the appropriate branch.

#17

Updated by Tom Clegg over 8 years ago

Pushed a couple of commits on top of this, most drastic part is slightly reducing the time wasted by those 2-tuple-timeout tests. LGTM if that LGTY.

#18

Updated by Sarah Guthrie over 8 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:e9c78ef7855e7ae263fe461e069c89ff7fc0b798.
