Bug #4012

[Crunch] crunch-job bandaid: use eval/retry to wrap api calls that are likely to fail a long-running job due to a transient error condition

Added by Ward Vandewege almost 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assigned To: Brett Smith
Category: -
Target version:
Start date: 10/05/2014
Due date:
% Done: 100%
Estimated time: (Total: 1.00 h)
Story points: 1.0

Subtasks

Task #4109: Review 4012-crunch-job-api-retries-wip (Resolved, Peter Amstutz)

Associated revisions

Revision 344c6dcd
Added by Brett Smith almost 5 years ago

Merge branch '4012-crunch-job-api-retries-wip'

Closes #4012.

History

#1 Updated by Tom Clegg almost 5 years ago

  • Subject changed from [Crunch] crunch-job bandaid: wrap api calls that can fail in an eval and retry to [Crunch] crunch-job bandaid: use eval/retry to wrap api calls that are likely to fail a long-running job due to a transient error condition

#2 Updated by Ward Vandewege almost 5 years ago

  • Assigned To set to Brett Smith
  • Story points changed from 0.5 to 1.0

#3 Updated by Peter Amstutz almost 5 years ago

This looks pretty good, just one comment on the algorithm. If the requests are timing out instead of failing fast, you may end up waiting significantly longer than $timediff before giving up: a five-minute job (300 seconds) would do 8 retries, and with a 60-second timeout plus backoff it would spend well over 10 minutes retrying before giving up. I suggest tweaking retry_op() to try/retry (with backoff) until the wait time is exceeded, but at least three times. Something like:

my $wait = 1;
my $giveup_time = time + calculate_giveup_time();
while (time < $giveup_time) {
    # Back off before every attempt except the first.
    sleep($wait) if $wait > 1;
    $wait *= 2;
    my $result = eval { $operation->(@_); };
    if (!$@) {
        return $result;
    }
}
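
The snippet above stops at the deadline but does not yet enforce the "at least three times" part of the suggestion, and it falls through silently when every attempt fails. A minimal self-contained sketch of the full idea follows. The name retry_until and its parameters are hypothetical and chosen for illustration; this is not the code that was merged for this ticket.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not the actual crunch-job retry_op):
# call $operation with exponential backoff until $max_wait seconds have
# elapsed, but always make at least $min_tries attempts.
# The operation is called in scalar context.
sub retry_until {
    my ($operation, $max_wait, $min_tries, @args) = @_;
    my $giveup_time = time + $max_wait;
    my $wait = 1;
    my $tries = 0;
    my $last_error = '';
    while ($tries < $min_tries or time < $giveup_time) {
        # Back off before every attempt except the first.
        if ($tries > 0) {
            sleep($wait);
            $wait *= 2;
        }
        $tries++;
        my $result;
        # The trailing 1 makes eval return true only when the operation did
        # not die, so a false-but-successful result still counts as success.
        my $ok = eval { $result = $operation->(@args); 1 };
        return $result if $ok;
        $last_error = $@;
    }
    die "giving up after $tries tries: $last_error";
}

# Usage: an operation that fails once with a transient error, then succeeds.
my $calls = 0;
my $value = retry_until(sub { die "transient\n" if ++$calls < 2; "ok" }, 5, 3);
print "$value after $calls calls\n";
```

Because the deadline is only checked between attempts, a single slow attempt can still overrun it by up to one request timeout; capping each request's own timeout would be a separate change.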

#4 Updated by Brett Smith almost 5 years ago

Peter Amstutz wrote:

If the requests are timing out instead of failing fast, you may end up waiting significantly longer than $timediff before giving up

Yeah, that's an issue. Fixed in 7e35706 and ready for another look. Thanks.

#5 Updated by Brett Smith almost 5 years ago

  • Status changed from New to Resolved

Applied in changeset arvados|commit:344c6dcdbae76310879c85a736e4e6cce05d5645.
