Project

General

Profile

Actions

Bug #3857

closed

[SDKs] Python KeepClient should use different timeouts for different operations

Added by Brett Smith over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Tim Pierce
Category:
SDKs
Target version:
Story points:
1.0

Description

Executive summary:

If an arvados.KeepClient is connecting to a Keep proxy (service_type == 'proxy'):
  • The TCP connection timeout should be 2 seconds.
  • Otherwise, connection timeout should be 20 seconds.

Leave all other timeouts as they are.

More in detail:

Right now the Python KeepClient takes a single timeout setting when it's instantiated, and uses that for all HTTP requests by passing it along when it instantiates its own httplib2.Http objects. It seems like httplib2.Http is using this same timeout for all aspects of the request. However, we'd like to be smarter than that:

  • The initial attempt to connect, to send the request, should use a short timeout.
  • After the request is sent, it's appropriate to use a longer timeout. (The worst-case scenario is that we're PUTting 64MiB to a proxy. Just sending the data will take a little while, and then we'll have to wait for it to be duplexed to several servers before we receive any response.)

With the single timeout we currently have, a small value causes premature failures using tools like arv-put (see #3846), but a long value can cause clients to spend needless time waiting to contact a Keep service that's down.

Investigate whether it's possible to set different timeouts for these different operations, and implement it if so. Without looking, this might be the difference between the socket module's timeout and httplib2.Http's, or httplib2.Http might accept different configuration arguments for these.


Subtasks 4 (0 open4 closed)

Task #3916: TestingResolvedTim Pierce09/11/2014Actions
Task #3915: Implement timeout configuration settingsResolvedTim Pierce09/11/2014Actions
Task #4444: Review 3857-python-sdk-timeoutsResolvedTim Pierce09/11/2014Actions
Task #3914: Study httplib2 and other docsResolvedTim Pierce09/11/2014Actions
Actions #1

Updated by Ward Vandewege over 9 years ago

  • Target version set to Arvados Future Sprints
Actions #2

Updated by Tom Clegg over 9 years ago

  • Story points set to 2.0
Actions #3

Updated by Tom Clegg over 9 years ago

  • Target version changed from Arvados Future Sprints to 2014-10-08 sprint
Actions #4

Updated by Tim Pierce over 9 years ago

  • Assigned To set to Tim Pierce
Actions #5

Updated by Ward Vandewege over 9 years ago

  • Story points changed from 2.0 to 1.0
Actions #6

Updated by Ward Vandewege over 9 years ago

  • Description updated (diff)
Actions #7

Updated by Tim Pierce over 9 years ago

  • Description updated (diff)
Actions #8

Updated by Tim Pierce over 9 years ago

  • Target version changed from 2014-10-08 sprint to Bug Triage
Actions #9

Updated by Tom Clegg over 9 years ago

  • Target version changed from Bug Triage to Arvados Future Sprints
Actions #10

Updated by Tom Clegg over 9 years ago

  • Subject changed from [SDK] Python KeepClient should use different timeouts for different operations to [SDKs] Python KeepClient should use different timeouts for different operations
Actions #11

Updated by Ward Vandewege over 9 years ago

  • Target version changed from Arvados Future Sprints to 2014-10-29 sprint
Actions #12

Updated by Ward Vandewege over 9 years ago

  • Target version changed from 2014-10-29 sprint to 2014-11-19 sprint
Actions #13

Updated by Tom Clegg over 9 years ago

  • Target version changed from 2014-11-19 sprint to Arvados Future Sprints
Actions #14

Updated by Tom Clegg over 9 years ago

  • Target version changed from Arvados Future Sprints to 2014-11-19 sprint
Actions #15

Updated by Tim Pierce over 9 years ago

There doesn't appear to be a trivial fix to this one. httplib2 is built on Python's httplib, which takes a single timeout setting when a connection is created and gives no opportunity to change the timeout afterward or on individual requests. Nor is the httplib2 code organized in such a way as to offer a good place for us to inject this. The httplib2.Http method is implemented as a monolithic class method, so even if we subclass it, there's no obvious place where we can change the socket timeout after creating a connection and before issuing a request.

That said, making this change in the source doesn't look terribly difficult. This is currently marked as an open issue for httplib2. If we were willing to patch the library, we could submit a pull request and hope it makes it into a new httplib2 release. httplib2 appears to have been mostly in maintenance mode, but the author has accepted a few pull requests this year, so it's not impossible.

Actions #16

Updated by Tim Pierce over 9 years ago

  • Status changed from New to In Progress
Actions #17

Updated by Tim Pierce over 9 years ago

Alternatively, Brett points out that the Python requests module might be better suited to what we want to do here.

At a glance, it also only supports a single "timeout" parameter that is applied the same way to all connection objects. But it might be possible to use the Session class to set up a connection in one call and then issue a request on it with a different timeout setting.

Actions #18

Updated by Brett Smith over 9 years ago

Tim Pierce wrote:

Alternatively, Brett points out that the Python requests module might be better suited to what we want to do here.

To be clear, that was very speculative. Someone mentioned downgrading to httplib, so I wondered if we might do just as well to "upgrade" to requests, which has sometimes been called httplib3. But I've never used it and don't know what it's capable of.

Actions #19

Updated by Tim Pierce over 9 years ago

Tim Pierce wrote:

Alternatively, Brett points out that the Python requests module might be better suited to what we want to do here.

At a glance, it also only supports a single "timeout" parameter that is applied the same way to all connection objects. But it might be possible to use the Session class to set up a connection in one call and then issue a request on it with a different timeout setting.

Yes, reading more closely, the timeout parameter can be given as a (connection_timeout, read_timeout) tuple:

If you specify a single value for the timeout, like this:

r = requests.get('https://github.com', timeout=5)

The timeout value will be applied to both the connect and the read timeouts. Specify a tuple if you would like to set the values separately:

r = requests.get('https://github.com', timeout=(3.05, 27))

http://requests.readthedocs.org/en/latest/user/advanced/#timeouts

Actions #20

Updated by Tom Clegg over 9 years ago

I can't say I've given this a thorough look but I've noticed a couple of things:

I think it would be valuable to have different timeouts (and different defaults) for proxies. Perhaps we could accept timeout=((a,b),(c,d)) to mean "(a,b) if non-proxy, (c,d) if proxy"? In that case I'd suggest ((3,30),(20,60)) as a default.

Is there some way we can test this? Perhaps timeout=(0,12) reliably causes a connection timeout, etc.? If testing the timeouts turns out to be tricky, we could test some basic stuff like passing timeout=123 and timeout=(123,456) and timeout=None doesn't crash.

The doc string bit about the timeout argument could refer to the Requests documentation for more detail about the timeouts.

On that note, I would like to know whether, in the case of a large PUT, the read_timeout timer starts when the last byte is sent by the client, or when the connection is made (~when the first byte is sent by the client). If the latter, I'd suggest something more like ((3,60),(20,300)) instead of ((3,30),(20,60)) above.

(Perhaps Ward has better ideas about defaults.)

Actions #21

Updated by Tim Pierce over 9 years ago

This certainly should be testable, and some clever use of mocks should allow us to test the timeouts without actually waiting for 30 seconds or whatever. I'll ask our mock expert for advice here.

Separate timeouts for proxies seems like a reasonable thing. Is it okay to push that into another story? I'm eager to wrap up this story so I can finish as much work done as possible on NodeManager before my science support gig starts on Monday.

Actions #22

Updated by Tim Pierce over 9 years ago

On the other hand, the more I think about it, the more it seems like the test we're talking about is just that requests.get does what it advertises it will. I'm not sure it's a good use of our time to add testing infrastructure for that. Certainly we should make sure that in practice it gives us the result we want, but testing internal behavior of a third-party library is the kind of test that can be difficult to write and prone to break. WDYT?

With respect to your last question, the package defines the read_timeout parameter to mean the maximum amount of time to wait between consecutive read operations for a response from the server. That seems to indicate that the read timer is reset when the client calls socket.recv, i.e. after the last byte of the request is sent.

Actions #23

Updated by Brett Smith over 9 years ago

Tim Pierce wrote:

This certainly should be testable, and some clever use of mocks should allow us to test the timeouts without actually waiting for 30 seconds or whatever. I'll ask our mock expert for advice here.

Are you talking about me? Oh god, you're talking about me. Horrifying.

Separate timeouts for proxies seems like a reasonable thing. Is it okay to push that into another story?

At the very least, I think this branch needs to use different defaults based on whether or not we're using a proxy. I don't know how else you meet Ward's requirements in this story without regressing #3846. I feel less strongly about whether or not we expose these as separate options to SDK users (as opposed to making them check themselves and pass in one timeout tuple appropriately).

On the other hand, the more I think about it, the more it seems like the test we're talking about is just that requests.get does what it advertises it will. I'm not sure it's a good use of our time to add testing infrastructure for that. Certainly we should make sure that in practice it gives us the result we want, but testing internal behavior of a third-party library is the kind of test that can be difficult to write and prone to break. WDYT?

My feeling is that if we're just passing values directly through to requests.get, that's probably not a big deal to test; but any logic that we have to use different timeout values based on the Keep service type does need to be tested. If there's not an independent unit to test there, then checking what's passed into the requests mock is probably the next best option.

Actions #24

Updated by Tim Pierce over 9 years ago

1fa1006 adds a proxy_timeout configuration setting. The default timeout value is now (3, 30) and the default proxy_timeout is (20, 60).

Added unit tests to check that KeepClient.get and KeepClient.put use the appropriate timeout setting when communicating directly with a Keep server and with a proxy, respectively.

Actions #25

Updated by Brett Smith over 9 years ago

Reviewing 1fa1006. I like this branch a lot. Thanks. This is all small potatoes stuff.

  • The story says to leave all non-connect timeouts as they are. We've been using 300 for read timeouts. We know for sure we need more than 60 for the proxy case, because that's what caused #3846. Let's go back to that for read timeouts.
  • The timeout argument to KeepService.__init__ seems to be a noop. If it is, let's remove it.
  • In KeepService.last_status, AttributeError is the only exception that still makes sense to catch.
  • In test_collections, _mock_api_call should not be changed to use a Requests error. The Google API client library uses httplib2, so ApiError (which is a replacement for apiclient's own HttpError) is expecting to get httplib2's 2-tuple.
  • KeepClientRetryGetTestMixin and KeepClientRetryPutTestMixin seem like unnecessary layering. KeepClientRetryTestMixin users already have to define a few class variables; why not just add mock_method to that list (and document it in the comments)?
    Personally, I'd prefer to have them go further and have mixin users define the patch method: TEST_PATCHER = tutil.mock_get_responses. Then all the test methods call with self.TEST_PATCHER(args, …). This gives the users more flexibility and eliminates the need for mock_requestslib_responses entirely.
  • I feel like test_proxy_timeout better fits under KeepClientServiceTestCase than KeepClientRetryTestMixin. The timeout isn't related to retries, and there's not much point in retesting it for every single method that we're interested in the retry behavior of.
  • PEP 257 calls for no blank line before a docstring's closing triple quotes (KeepClient.__init__).
  • Please keep import blocks sorted where they already are (arvados_testutil).
Actions #26

Updated by Brett Smith over 9 years ago

Brett Smith wrote:

  • I feel like test_proxy_timeout better fits under KeepClientServiceTestCase than KeepClientRetryTestMixin. The timeout isn't related to retries, and there's not much point in retesting it for every single method that we're interested in the retry behavior of.

Actually, there's an even better reason to move this: the current implementation assumes we know at __init__ time whether or not we're using a proxy, but that's not always true. If no service roots are defined at init time, KeepClient will ask Arvados for available services, and we might learn then that we're using a proxy. Whether or not we're using a proxy might even change over the lifetime of the KeepClient, since it refreshes this list during retries. KeepClient needs to save both timeout values and pick the right one at request time based on the current value of self.using_proxy.

KeepClientServiceTestCase has all the infrastructure for mocking the API service call, so it can help you test this case of "no service roots at init time, learn we have a proxy later."

Actions #27

Updated by Tim Pierce over 9 years ago

All of these changes adopted at 6959469

Brett Smith wrote:

Reviewing 1fa1006. I like this branch a lot. Thanks. This is all small potatoes stuff.

  • The story says to leave all non-connect timeouts as they are. We've been using 300 for read timeouts. We know for sure we need more than 60 for the proxy case, because that's what caused #3846. Let's go back to that for read timeouts.
  • The timeout argument to KeepService.__init__ seems to be a noop. If it is, let's remove it.
  • In KeepService.last_status, AttributeError is the only exception that still makes sense to catch.
  • In test_collections, _mock_api_call should not be changed to use a Requests error. The Google API client library uses httplib2, so ApiError (which is a replacement for apiclient's own HttpError) is expecting to get httplib2's 2-tuple.
  • KeepClientRetryGetTestMixin and KeepClientRetryPutTestMixin seem like unnecessary layering. KeepClientRetryTestMixin users already have to define a few class variables; why not just add mock_method to that list (and document it in the comments)?
    Personally, I'd prefer to have them go further and have mixin users define the patch method: TEST_PATCHER = tutil.mock_get_responses. Then all the test methods call with self.TEST_PATCHER(args, …). This gives the users more flexibility and eliminates the need for mock_requestslib_responses entirely.
  • I feel like test_proxy_timeout better fits under KeepClientServiceTestCase than KeepClientRetryTestMixin. The timeout isn't related to retries, and there's not much point in retesting it for every single method that we're interested in the retry behavior of.
  • PEP 257 calls for no blank line before a docstring's closing triple quotes (KeepClient.__init__).

Well. Hmph. Somebody should tell Barry Warsaw so M-q does the right thing in python mode, then.

  • Please keep import blocks sorted where they already are (arvados_testutil).

I'm confused -- I didn't reorder the import modules.

(I'd like to adopt a convention of putting all imports in alphabetical order, but that's a discussion for another time.)

Actions #28

Updated by Tim Pierce over 9 years ago

Tim Pierce wrote:

I'm confused -- I didn't reorder the import modules.

(I'd like to adopt a convention of putting all imports in alphabetical order, but that's a discussion for another time.)

OOPS

fixed at a98086e.

Actions #29

Updated by Brett Smith over 9 years ago

Reviewing a98086e, and just one small technicality about the comments you added KeepClientRetryTestMixin. TEST_PATCHER doesn't need to be a static method: a subclass could def TEST_PATCHER(self, …) and it would work fine. Your specific subclasses need to use staticmethod because that's how you convert an independent function to a method, roughly speaking.

With a small tweak there, this is good to merge, thanks.

Actions #30

Updated by Tim Pierce over 9 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:ec75fda0fc2c86a77d831dcd7962ece7a2d6ae6d.

Actions

Also available in: Atom PDF