Project

General

Profile

Actions

Bug #20540

closed

crunch-run should sleep-and-retry after transient failures on API calls, especially when container is succeeding

Added by Tom Clegg 11 months ago. Updated 11 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Target version:
Story points:
0.5
Release relationship:
Auto

Subtasks 1 (0 open1 closed)

Task #20554: Review 20540-crunch-run-retryResolvedPeter Amstutz05/30/2023Actions

Related issues

Related to Arvados - Bug #20451: Stuck crunch-run issuesResolvedTom CleggActions
Related to Arvados Epics - Idea #20599: Scaling to 1000s of concurrent containersResolved06/01/202303/31/2024Actions
Related to Arvados - Feature #20604: crunch-run retry timeout should increase for long-running containersNewActions
Actions #1

Updated by Tom Clegg 11 months ago

  • Related to Bug #20451: Stuck crunch-run issues added
Actions #2

Updated by Peter Amstutz 11 months ago

  • Category set to API
Actions #3

Updated by Tom Clegg 11 months ago

  • Status changed from New to In Progress

20540-crunch-run-retry @ eca324141778eb082871f5f8bd6b398edc6ac92a -- developer-run-tests: #3671

wb1 developer-run-tests-apps-workbench-integration: #3964

Some of the crunch-run code ("copy output" and "extract git tree") was easy enough to port from sdk/go/arvadosclient to sdk/go/arvados, so I did that.

The main crunch-run code is also easy to port, but the tests use a mocked client object, and porting that quickly turned into a significant undertaking.

To avoid that I changed sdk/go/arvadosclient behavior so it uses sdk/go/arvados to do HTTP calls instead of implementing its own retry logic. The only awkward part about this is the way sdk/go/arvadosclient callers specify retry behavior with a RetryDelay global (default 2 seconds) and a Retries field on the client struct. These can't map exactly to the sdk/go/arvados way (retry with exponential backoff within specified Timeout), except that Retries==0 is the same as Timeout==0 (no auto retry, no timeout).

This branch effectively interprets non-zero Retries to mean a timeout in minutes, retrying with exponential-backoff within that time. If the caller changes the RetryDelay global from 2 to 4, that effectively doubles the timeout for any given Retries value.

If this seems workable, I'll update the package documentation comments.

It might also be worth adding a Timeout field to sdk/go/arvadosclient and ignore Retries if Timeout is set. That would give callers an easy way to make their code clearer and avoid the weird influence of the RetryDelay global.

Alternatively we could implement a retry-count limit in sdk/go/arvados and propagate Retries to that. The main reason I'm not keen to do this is that the number of retries doesn't seem like the thing a caller really wants to control, once retry is enabled at all.

Actions #4

Updated by Tom Clegg 11 months ago

Adding nearly-full jitter to the exponential backoff timing, and supporting Retry-After: {datetime} response headers:

20540-crunch-run-retry @ 54e9c47a8340c5952e8e4c0e96e33eb47c3cb963 -- developer-run-tests: #3678

Actions #5

Updated by Peter Amstutz 11 months ago

Tom Clegg wrote in #note-4:

Adding nearly-full jitter to the exponential backoff timing, and supporting Retry-After: {datetime} response headers:

20540-crunch-run-retry @ 54e9c47a8340c5952e8e4c0e96e33eb47c3cb963 -- developer-run-tests: #3678

Could you confirm that the value of "min" provided to exponentialBackoff can never be 0?

Actions #6

Updated by Peter Amstutz 11 months ago

  • Related to Idea #20599: Scaling to 1000s of concurrent containers added
Actions #8

Updated by Tom Clegg 11 months ago

Peter Amstutz wrote in #note-5:

Could you confirm that the value of "min" provided to exponentialBackoff can never be 0?

That's a good point, version in #note-3 didn't handle small/zero values of min very well. When min=0, it would always use delay=0 between attempts.

With this commit, min<1s means delay exactly min (even if zero) after the first attempt, but after subsequent attempts, behave as if min was 1s (i.e., delay between 1 and 2^n seconds).

20540-crunch-run-retry @ f4a37bb12147851402fd585e6e5987f227591028 -- developer-run-tests: #3680

Actions #9

Updated by Peter Amstutz 11 months ago

  • Release set to 64
Actions #10

Updated by Peter Amstutz 11 months ago

This looks good, my only remaining comment is that either the default should be a bit longer (8 or 10 minutes instead of 5 minutes) or that system services like crunch-run specifically should use a higher limit.

This is based on the experience of seeing the API server get saturated for several minutes, I want crunch-run in particular to try very hard to make sure it doesn't abandon containers at the finalization stage (we talked about having the retry time even being proportional to the container runtime, it is rational to wait up to the length of time the container ran to register a successful result.)

Actions #11

Updated by Tom Clegg 11 months ago

I was reluctant to change the default total-sleep-between-retries too drastically, 6s to 3m already seems like a lot.

Crunch-run's existing code had Retries=8, which used to mean 8x 2-second delays, but now means up to 8 minutes of exponential-backoff. Raised that to 10m:

20540-crunch-run-retry @ 0afb0b057d2572487cc00aa1dc2631cb99713ea7 -- developer-run-tests: #3685

Raising the limit further as the container runs sounds reasonable, although we might still want an upper limit. If a container ran for 4 days but then controller has since been unreachable for 3 straight days, is it really sensible to wait one more day?

Should I merge this and leave the issue open to do the "retry time proportional to runtime" thing? Or should we make that a separate issue and schedule it separately?

Actions #12

Updated by Peter Amstutz 11 months ago

Tom Clegg wrote in #note-11:

I was reluctant to change the default total-sleep-between-retries too drastically, 6s to 3m already seems like a lot.

Crunch-run's existing code had Retries=8, which used to mean 8x 2-second delays, but now means up to 8 minutes of exponential-backoff. Raised that to 10m:

20540-crunch-run-retry @ 0afb0b057d2572487cc00aa1dc2631cb99713ea7 -- developer-run-tests: #3685

Ah, I overlooked that the original setting of Retries=8 translated into 8 minutes already, I was just looking at the new default of 5 minutes. 10 minutes is good.

Raising the limit further as the container runs sounds reasonable, although we might still want an upper limit. If a container ran for 4 days but then controller has since been unreachable for 3 straight days, is it really sensible to wait one more day?

Right, a reasonable upper limit is probably only an hour or two, which puts it more into the realm of "surviving sustained downtime" not "surviving busy spikes and trouble-free system reboots".

Should I merge this and leave the issue open to do the "retry time proportional to runtime" thing? Or should we make that a separate issue and schedule it separately?

Yes, please merge with the 10 minute timeout. Make a separate issue for the backlog, we can remind ourselves if it becomes an issue in the future.

Actions #13

Updated by Tom Clegg 11 months ago

  • Related to Feature #20604: crunch-run retry timeout should increase for long-running containers added
Actions #14

Updated by Tom Clegg 11 months ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF