Bug #20540
closedcrunch-run should sleep-and-retry after transient failures on API calls, especially when container is succeeding
Added by Tom Clegg over 1 year ago. Updated over 1 year ago.
Related issues
Updated by Tom Clegg over 1 year ago
- Related to Bug #20451: Stuck crunch-run issues added
Updated by Tom Clegg over 1 year ago
- Status changed from New to In Progress
20540-crunch-run-retry @ eca324141778eb082871f5f8bd6b398edc6ac92a -- developer-run-tests: #3671
wb1 developer-run-tests-apps-workbench-integration: #3964
Some of the crunch-run code ("copy output" and "extract git tree") was easy enough to port from sdk/go/arvadosclient to sdk/go/arvados, so I did that.
The main crunch-run code is also easy to port, but the tests use a mocked client object, and porting that quickly turned into a significant undertaking.
To avoid that I changed sdk/go/arvadosclient
behavior so it uses sdk/go/arvados
to do HTTP calls instead of implementing its own retry logic. The only awkward part about this is the way sdk/go/arvadosclient
callers specify retry behavior with a RetryDelay
global (default 2 seconds) and a Retries
field on the client struct. These can't map exactly to the sdk/go/arvados
way (retry with exponential backoff within specified Timeout
), except that Retries==0 is the same as Timeout==0 (no auto retry, no timeout).
This branch effectively interprets non-zero Retries to mean a timeout in minutes, retrying with exponential-backoff within that time. If the caller changes the RetryDelay global from 2 to 4, that effectively doubles the timeout for any given Retries value.
If this seems workable, I'll update the package documentation comments.
It might also be worth adding a Timeout field to sdk/go/arvadosclient
and ignore Retries if Timeout is set. That would give callers an easy way to make their code clearer and avoid the weird influence of the RetryDelay global.
Alternatively we could implement a retry-count limit in sdk/go/arvados
and propagate Retries to that. The main reason I'm not keen to do this is that the number of retries doesn't seem like the thing a caller really wants to control, once retry is enabled at all.
Updated by Tom Clegg over 1 year ago
Adding nearly-full jitter to the exponential backoff timing, and supporting Retry-After: {datetime}
response headers:
20540-crunch-run-retry @ 54e9c47a8340c5952e8e4c0e96e33eb47c3cb963 -- developer-run-tests: #3678
Updated by Peter Amstutz over 1 year ago
Tom Clegg wrote in #note-4:
Adding nearly-full jitter to the exponential backoff timing, and supporting
Retry-After: {datetime}
response headers:20540-crunch-run-retry @ 54e9c47a8340c5952e8e4c0e96e33eb47c3cb963 -- developer-run-tests: #3678
Could you confirm that the value of "min" provided to exponentialBackoff can never be 0?
Updated by Peter Amstutz over 1 year ago
- Related to Idea #20599: Scaling to 1000s of concurrent containers added
Updated by Tom Clegg over 1 year ago
Peter Amstutz wrote in #note-5:
Could you confirm that the value of "min" provided to exponentialBackoff can never be 0?
That's a good point, version in #note-3 didn't handle small/zero values of min very well. When min=0, it would always use delay=0 between attempts.
With this commit, min<1s means delay exactly min (even if zero) after the first attempt, but after subsequent attempts, behave as if min was 1s (i.e., delay between 1 and 2^n seconds).
20540-crunch-run-retry @ f4a37bb12147851402fd585e6e5987f227591028 -- developer-run-tests: #3680
Updated by Peter Amstutz over 1 year ago
This looks good, my only remaining comment is that either the default should be a bit longer (8 or 10 minutes instead of 5 minutes) or that system services like crunch-run specifically should use a higher limit.
This is based on the experience of seeing the API server get saturated for several minutes, I want crunch-run in particular to try very hard to make sure it doesn't abandon containers at the finalization stage (we talked about having the retry time even being proportional to the container runtime, it is rational to wait up to the length of time the container ran to register a successful result.)
Updated by Tom Clegg over 1 year ago
I was reluctant to change the default total-sleep-between-retries too drastically, 6s to 3m already seems like a lot.
Crunch-run's existing code had Retries=8, which used to mean 8x 2-second delays, but now means up to 8 minutes of exponential-backoff. Raised that to 10m:
20540-crunch-run-retry @ 0afb0b057d2572487cc00aa1dc2631cb99713ea7 -- developer-run-tests: #3685
Raising the limit further as the container runs sounds reasonable, although we might still want an upper limit. If a container ran for 4 days but then controller has since been unreachable for 3 straight days, is it really sensible to wait one more day?
Should I merge this and leave the issue open to do the "retry time proportional to runtime" thing? Or should we make that a separate issue and schedule it separately?
Updated by Peter Amstutz over 1 year ago
Tom Clegg wrote in #note-11:
I was reluctant to change the default total-sleep-between-retries too drastically, 6s to 3m already seems like a lot.
Crunch-run's existing code had Retries=8, which used to mean 8x 2-second delays, but now means up to 8 minutes of exponential-backoff. Raised that to 10m:
20540-crunch-run-retry @ 0afb0b057d2572487cc00aa1dc2631cb99713ea7 -- developer-run-tests: #3685
Ah, I overlooked that the original setting of Retries=8 translated into 8 minutes already, I was just looking at the new default of 5 minutes. 10 minutes is good.
Raising the limit further as the container runs sounds reasonable, although we might still want an upper limit. If a container ran for 4 days but then controller has since been unreachable for 3 straight days, is it really sensible to wait one more day?
Right, a reasonable upper limit is probably only an hour or two, which puts it more into the realm of "surviving sustained downtime" not "surviving busy spikes and trouble-free system reboots".
Should I merge this and leave the issue open to do the "retry time proportional to runtime" thing? Or should we make that a separate issue and schedule it separately?
Yes, please merge with the 10 minute timeout. Make a separate issue for the backlog, we can remind ourselves if it becomes an issue in the future.
Updated by Tom Clegg over 1 year ago
- Related to Feature #20604: crunch-run retry timeout should increase for long-running containers added
Updated by Tom Clegg over 1 year ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|bc25d5adc43cf10b6df7dc28556aba083f34444a.