Bug #10468

[Keepstore] configurable timeout on blob storage requests

Added by Ward Vandewege almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 11/07/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Subtasks

Task #10474: Review 10468-blob-storage-timeouts (Resolved, Peter Amstutz)

Associated revisions

Revision c3cc1d58
Added by Tom Clegg almost 3 years ago

Merge branch '10468-blob-storage-timeouts' closes #10468

History

#1 Updated by Tom Clegg almost 3 years ago

  • Status changed from New to In Progress
  • Assigned To set to Tom Clegg

#2 Updated by Tom Clegg almost 3 years ago

10468-blob-storage-timeouts

test 39536d8dd7f0a6ab89e106cd065830f1cbb067b1

#3 Updated by Tom Clegg almost 3 years ago

  • Target version set to 2016-11-09 sprint

#4 Updated by Peter Amstutz almost 3 years ago

  • Default timeout of 10 minutes seems unreasonably long. I can't think of a situation where you would actually want that behavior. Should be more like 2 minutes or even shorter (20 seconds?)
  • Azure has const azureDefaultRequestTimeout but S3 hardcodes defaults in S3Volume.Start().

Rest LGTM.

#5 Updated by Tom Clegg almost 3 years ago

Peter Amstutz wrote:

> • Default timeout of 10 minutes seems unreasonably long. I can't think of a situation where you would actually want that behavior. Should be more like 2 minutes or even shorter (20 seconds?)

10 minutes might be unreasonable for the installation you're thinking of, but 20 seconds might be unreasonably short for someone else's site (e.g., S3 requests often take >30 seconds on our test cluster). Rather than try to guess a useful-but-not-too-aggressive timeout for all setups/endpoints, I figured we should start with a long timeout: a too-long timeout doesn't break anything.

I propose we revisit the defaults/examples/recommendations after we have some real-world experience.

Meanwhile, the rationale for having a default timeout is really just to avoid holding resources forever if the server somehow doesn't get notified that a request has failed.

> • Azure has const azureDefaultRequestTimeout but S3 hardcodes defaults in S3Volume.Start().

Fixed, thanks.
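
For readers following the review point about defaults, here is a minimal Go sketch of the pattern being asked for: defaults declared as named constants and applied in Start() only when a timeout is left unset. The type exampleVolume, the constant names, and the 1-minute connect default are assumptions made for illustration; only the 10-minute figure comes from this thread, and none of this is the actual S3Volume code.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical named defaults. Only the 10-minute value is taken from the
// discussion in this ticket; the connect default is an assumption.
const (
	exampleDefaultConnectTimeout = time.Minute
	exampleDefaultReadTimeout    = 10 * time.Minute
)

// exampleVolume is a stand-in for a volume's configuration.
// It is not the real S3Volume type.
type exampleVolume struct {
	ConnectTimeout time.Duration
	ReadTimeout    time.Duration
}

// Start fills in the named defaults for any timeout that was left at zero,
// and leaves explicitly configured values untouched.
func (v *exampleVolume) Start() error {
	if v.ConnectTimeout == 0 {
		v.ConnectTimeout = exampleDefaultConnectTimeout
	}
	if v.ReadTimeout == 0 {
		v.ReadTimeout = exampleDefaultReadTimeout
	}
	return nil
}

func main() {
	// Only ReadTimeout is configured; ConnectTimeout falls back to the default.
	v := &exampleVolume{ReadTimeout: 2 * time.Minute}
	if err := v.Start(); err != nil {
		fmt.Println("start failed:", err)
		return
	}
	fmt.Println("connect:", v.ConnectTimeout, "read:", v.ReadTimeout)
}
```

Keeping the defaults in one place makes them easy to find and change when they are revisited later.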

#6 Updated by Peter Amstutz almost 3 years ago

Tom Clegg wrote:

> Peter Amstutz wrote:
>
> > • Default timeout of 10 minutes seems unreasonably long. I can't think of a situation where you would actually want that behavior. Should be more like 2 minutes or even shorter (20 seconds?)
>
> 10 minutes might be unreasonable for the installation you're thinking of, but 20 seconds might be unreasonably short for someone else's site (e.g., S3 requests often take >30 seconds on our test cluster). Rather than try to guess a useful-but-not-too-aggressive timeout for all setups/endpoints, I figured we should start with a long timeout: a too-long timeout doesn't break anything.

Well, in the Python SDK, the default connection timeout is 2 seconds and the read timeout is 256 seconds. So having the default timeouts for keepstore talking to blob store be an order of magnitude longer than the client timeouts is counterproductive because the SDK will have long since hung up.

> I propose we revisit the defaults/examples/recommendations after we have some real-world experience.

I agree we should look at the logs and get some accurate numbers but it's not like we don't have lots of data already.

> Meanwhile, the rationale for having a default timeout is really just to avoid holding resources forever if the server somehow doesn't get notified that a request has failed.

By that rationale the default timeout could be 75 years, which is also less than forever.

However, please go ahead and merge; we can litigate the defaults later.

#7 Updated by Tom Clegg almost 3 years ago

Peter Amstutz wrote:

> Well, in the Python SDK, the default connection timeout is 2 seconds and the read timeout is 256 seconds. So having the default timeouts for keepstore talking to blob store be an order of magnitude longer than the client timeouts is counterproductive because the SDK will have long since hung up.

Guessing the client's timeout isn't the right way to address the problem of releasing server resources after the client hangs up (see #10467).

Timeouts are last resorts. If we find ourselves fine-tuning timeouts, that's probably a sign something else needs to be fixed...
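
As an illustration of the distinction drawn here, the following Go sketch shows one common way to release work when the client goes away, with a long timeout kept only as a backstop. It is not necessarily what #10467 implements. All function names are hypothetical, and slowBackendRead merely simulates a blob storage request; in an HTTP handler the client context would typically come from the request's Context() method.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowBackendRead simulates a blob storage request that takes d to finish,
// but gives up as soon as ctx is cancelled or its deadline passes.
func slowBackendRead(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// readWithLimits ties the backend request to the client's context, so a
// client hang-up cancels it, and adds lastResort as a backstop timeout.
func readWithLimits(clientCtx context.Context, lastResort, backendDelay time.Duration) error {
	ctx, cancel := context.WithTimeout(clientCtx, lastResort)
	defer cancel()
	return slowBackendRead(ctx, backendDelay)
}

// outcome turns the error into a short description of what ended the request.
func outcome(err error) string {
	switch {
	case err == nil:
		return "completed"
	case errors.Is(err, context.Canceled):
		return "client hung up"
	case errors.Is(err, context.DeadlineExceeded):
		return "timed out"
	default:
		return "error"
	}
}

// simulate runs one request in which the client hangs up after hangUpAfterMs,
// the backstop timeout is lastResortMs, and the backend needs backendDelayMs.
func simulate(hangUpAfterMs, lastResortMs, backendDelayMs int) string {
	ms := func(n int) time.Duration { return time.Duration(n) * time.Millisecond }
	clientCtx, hangUp := context.WithCancel(context.Background())
	defer hangUp()
	timer := time.AfterFunc(ms(hangUpAfterMs), hangUp)
	defer timer.Stop()
	return outcome(readWithLimits(clientCtx, ms(lastResortMs), ms(backendDelayMs)))
}

func main() {
	// The client gives up after 50 ms, long before the 10-minute backstop.
	fmt.Println(simulate(50, 600000, 5000))
}
```

In this arrangement the server never needs to know the client's timeout: cancellation propagates when the client disconnects, and the long timeout only matters if that notification never arrives.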

#8 Updated by Tom Clegg almost 3 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:c3cc1d58b64940a2bd79f27a9d0fdc50318dbb99.
