Bug #16480

[keep-balance] should not timeout/fail when keepstore index takes more than 5 minutes

Added by Tom Clegg over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Target version:
Start date:
06/16/2020
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

Currently sdk/go/arvados.Client uses a 5-minute request timeout when the caller (like keep-balance) doesn't provide a custom *http.Client. This isn't enough time to complete a keepstore index request on a cloud install with lots of data. When the timeout expires, keep-balance just gives up on the current iteration, goes to sleep, and waits for the next one, which is unlikely to fare any better.

It may be worth having an idle timer to detect hung connections, but aside from that keep-balance should wait as long as it takes for keepstore to return an index response.


Subtasks

Task #16486: Review 16480-keep-balance-index-timeoutResolvedWard Vandewege

Associated revisions

Revision 405b13d5
Added by Tom Clegg over 1 year ago

Merge branch '16480-keep-balance-index-timeout'

fixes #16480

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#2 Updated by Tom Clegg over 1 year ago

  • Target version set to 2020-06-17 Sprint
  • Assigned To set to Tom Clegg

#4 Updated by Tom Clegg over 1 year ago

  • Status changed from New to In Progress

#5 Updated by Tom Clegg over 1 year ago

16480-keep-balance-index-timeout @ 2bc1a7a89597ab02aaeef84b82fdc51f8e375b79 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1919/

This changes the timeout to 24h. I'm inclined to punt on the "idle timer to detect hung connections" part.

#6 Updated by Ward Vandewege over 1 year ago

Tom Clegg wrote:

16480-keep-balance-index-timeout @ 2bc1a7a89597ab02aaeef84b82fdc51f8e375b79 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1919/

This changes the timeout to 24h. I'm inclined to punt on the "idle timer to detect hung connections" part.

OK to punt on the "idle timer", that's a nice to have. But maybe shorten the timeout a bit, 24 hours seems a bit excessive?

Review comment:

Removing the 5 minute timeout from the definition of DefaultSecureClient and InsecureHTTPClient means that the setup() function in lib/controller/handler.go now creates a secureClient and insecureClient without timeout. Is that intentional? Since DefaultSecureClient and InsecureHTTPClient are exported, shouldn't there be a default timeout on them?

Otherwise, LGTM, thanks.

#7 Updated by Tom Clegg over 1 year ago

  • Target version changed from 2020-06-17 Sprint to 2020-07-01 Sprint

#8 Updated by Tom Clegg over 1 year ago

Ward Vandewege wrote:

OK to punt on the "idle timer", that's a nice to have. But maybe shorten the timeout a bit, 24 hours seems a bit excessive?

I suppose the only way to avoid choosing a bad timeout is to make it configurable so the admin can choose a bad timeout themselves, so I made a configurable timeout for the entire keep-balance operation, defaulting to 6h. Is that a good compromise between annoyingly-long-for-small-sites and annoyingly-short-for-big-sites?

Removing the 5 minute timeout from the definition of DefaultSecureClient and InsecureHTTPClient means that the setup() function in lib/controller/handler.go now creates a secureClient and insecureClient without timeout. Is that intentional? Since DefaultSecureClient and InsecureHTTPClient are exported, shouldn't there be a default timeout on them?

Controller already uses a context deadline on incoming requests to enforce API.RequestTimeout, and propagates that context to outgoing proxy requests. So we were effectively limiting outgoing reqs min(5m, RequestTimeout) -- using only RequestTimeout seems better.

Generally, I'm thinking it's OK to change the exported default clients' timeouts to zero -- that's how http.DefaultClient does it, probably a hint that we should let the caller provide a timeout/context when needed.

16480-keep-balance-index-timeout @ fd080b34a321cbd6593d69f427b9eaeab890712f -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1925/

#9 Updated by Ward Vandewege over 1 year ago

Tom Clegg wrote:

Ward Vandewege wrote:

OK to punt on the "idle timer", that's a nice to have. But maybe shorten the timeout a bit, 24 hours seems a bit excessive?

I suppose the only way to avoid choosing a bad timeout is to make it configurable so the admin can choose a bad timeout themselves, so I made a configurable timeout for the entire keep-balance operation, defaulting to 6h. Is that a good compromise between annoyingly-long-for-small-sites and annoyingly-short-for-big-sites?

Removing the 5 minute timeout from the definition of DefaultSecureClient and InsecureHTTPClient means that the setup() function in lib/controller/handler.go now creates a secureClient and insecureClient without timeout. Is that intentional? Since DefaultSecureClient and InsecureHTTPClient are exported, shouldn't there be a default timeout on them?

Controller already uses a context deadline on incoming requests to enforce API.RequestTimeout, and propagates that context to outgoing proxy requests. So we were effectively limiting outgoing reqs min(5m, RequestTimeout) -- using only RequestTimeout seems better.

Generally, I'm thinking it's OK to change the exported default clients' timeouts to zero -- that's how http.DefaultClient does it, probably a hint that we should let the caller provide a timeout/context when needed.

16480-keep-balance-index-timeout @ fd080b34a321cbd6593d69f427b9eaeab890712f -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1925/

Cool, this LGTM.

#10 Updated by Anonymous over 1 year ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved

#11 Updated by Peter Amstutz about 1 year ago

  • Release set to 25

Also available in: Atom PDF