Bug #16480

[keep-balance] should not timeout/fail when keepstore index takes more than 5 minutes

Added by Tom Clegg about 1 month ago. Updated 18 days ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Target version:
Start date:
06/16/2020
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

Currently sdk/go/arvados.Client uses a 5-minute request timeout when the caller (like keep-balance) doesn't provide a custom *http.Client. This isn't enough time to complete a keepstore index request on a cloud install with lots of data. When the timeout expires, keep-balance just gives up on the current iteration, goes to sleep, and waits for the next one, which is unlikely to fare any better.

It may be worth having an idle timer to detect hung connections, but aside from that keep-balance should wait as long as it takes for keepstore to return an index response.


Subtasks

Task #16486: Review 16480-keep-balance-index-timeoutResolvedWard Vandewege

Associated revisions

Revision 405b13d5
Added by Tom Clegg 19 days ago

Merge branch '16480-keep-balance-index-timeout'

fixes #16480

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#2 Updated by Tom Clegg about 1 month ago

  • Target version set to 2020-06-17 Sprint
  • Assigned To set to Tom Clegg

#4 Updated by Tom Clegg 26 days ago

  • Status changed from New to In Progress

#5 Updated by Tom Clegg 25 days ago

16480-keep-balance-index-timeout @ 2bc1a7a89597ab02aaeef84b82fdc51f8e375b79 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1919/

This changes the timeout to 24h. I'm inclined to punt on the "idle timer to detect hung connections" part.

#6 Updated by Ward Vandewege 25 days ago

Tom Clegg wrote:

16480-keep-balance-index-timeout @ 2bc1a7a89597ab02aaeef84b82fdc51f8e375b79 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1919/

This changes the timeout to 24h. I'm inclined to punt on the "idle timer to detect hung connections" part.

OK to punt on the "idle timer", that's a nice to have. But maybe shorten the timeout a bit, 24 hours seems a bit excessive?

Review comment:

Removing the 5 minute timeout from the definition of DefaultSecureClient and InsecureHTTPClient means that the setup() function in lib/controller/handler.go now creates a secureClient and insecureClient without timeout. Is that intentional? Since DefaultSecureClient and InsecureHTTPClient are exported, shouldn't there be a default timeout on them?

Otherwise, LGTM, thanks.

#7 Updated by Tom Clegg 24 days ago

  • Target version changed from 2020-06-17 Sprint to 2020-07-01 Sprint

#8 Updated by Tom Clegg 22 days ago

Ward Vandewege wrote:

OK to punt on the "idle timer", that's a nice to have. But maybe shorten the timeout a bit, 24 hours seems a bit excessive?

I suppose the only way to avoid choosing a bad timeout is to make it configurable so the admin can choose a bad timeout themselves, so I made a configurable timeout for the entire keep-balance operation, defaulting to 6h. Is that a good compromise between annoyingly-long-for-small-sites and annoyingly-short-for-big-sites?

Removing the 5 minute timeout from the definition of DefaultSecureClient and InsecureHTTPClient means that the setup() function in lib/controller/handler.go now creates a secureClient and insecureClient without timeout. Is that intentional? Since DefaultSecureClient and InsecureHTTPClient are exported, shouldn't there be a default timeout on them?

Controller already uses a context deadline on incoming requests to enforce API.RequestTimeout, and propagates that context to outgoing proxy requests. So we were effectively limiting outgoing reqs min(5m, RequestTimeout) -- using only RequestTimeout seems better.

Generally, I'm thinking it's OK to change the exported default clients' timeouts to zero -- that's how http.DefaultClient does it, probably a hint that we should let the caller provide a timeout/context when needed.

16480-keep-balance-index-timeout @ fd080b34a321cbd6593d69f427b9eaeab890712f -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1925/

#9 Updated by Ward Vandewege 22 days ago

Tom Clegg wrote:

Ward Vandewege wrote:

OK to punt on the "idle timer", that's a nice to have. But maybe shorten the timeout a bit, 24 hours seems a bit excessive?

I suppose the only way to avoid choosing a bad timeout is to make it configurable so the admin can choose a bad timeout themselves, so I made a configurable timeout for the entire keep-balance operation, defaulting to 6h. Is that a good compromise between annoyingly-long-for-small-sites and annoyingly-short-for-big-sites?

Removing the 5 minute timeout from the definition of DefaultSecureClient and InsecureHTTPClient means that the setup() function in lib/controller/handler.go now creates a secureClient and insecureClient without timeout. Is that intentional? Since DefaultSecureClient and InsecureHTTPClient are exported, shouldn't there be a default timeout on them?

Controller already uses a context deadline on incoming requests to enforce API.RequestTimeout, and propagates that context to outgoing proxy requests. So we were effectively limiting outgoing reqs min(5m, RequestTimeout) -- using only RequestTimeout seems better.

Generally, I'm thinking it's OK to change the exported default clients' timeouts to zero -- that's how http.DefaultClient does it, probably a hint that we should let the caller provide a timeout/context when needed.

16480-keep-balance-index-timeout @ fd080b34a321cbd6593d69f427b9eaeab890712f -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1925/

Cool, this LGTM.

#10 Updated by Anonymous 19 days ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved

Also available in: Atom PDF