Bug #3570

[Crunch] With multiple crunch-dispatch processes, race condition causes job record to have success=false while still running.

Added by Tom Clegg about 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
Start date:
08/11/2014
Due date:
% Done:

100%

Estimated time:
(Total: 1.00 h)
Story points:
0.5

Subtasks

Task #3571: Review 3570-crunch-raceResolvedTom Clegg

Associated revisions

Revision 6c91320c
Added by Tom Clegg about 5 years ago

Merge branch '3570-crunch-race' closes #3570

Revision ed1580a3
Added by Tom Clegg about 5 years ago

Merge branch '3570-crunch-race' closes #3570

Revision 936ee283 (diff)
Added by Tom Clegg about 5 years ago

Update Gemfile to use crunch-job from latest arvados-cli gem. refs #3570

History

#1 Updated by Brett Smith about 5 years ago

Reviewing 7ea5d61

I'm concerned we're semantically overloading the number 111. crunch-job both receives this exit code from tasks (where it means "temporary failure") and sends it up to its caller (where it means "I didn't start the Job"—sometimes for permanent reasons). I think we might save ourselves some grief if we use different status codes for these different cases.

It might help crunch-job's readability if you define the "didn't start" exit code as a constant, and use it where appropriate. Someone who jumps to one of later exits might not see your comment about it and miss the number's special meaning.

Thanks.

#2 Updated by Anonymous about 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:6c91320c3a53ddbe1b81bee8ed6322c20acbd047.

#3 Updated by Tom Clegg about 5 years ago

  • Status changed from Resolved to In Progress

(Peter raised the same "overload 111" question but suggested merging, so I merged. Then I saw this comment, so I'm reopening to respond to your points.)

Brett Smith wrote:

Reviewing 7ea5d61

I'm concerned we're semantically overloading the number 111. crunch-job both receives this exit code from tasks (where it means "temporary failure") and sends it up to its caller (where it means "I didn't start the Job"—sometimes for permanent reasons). I think we might save ourselves some grief if we use different status codes for these different cases.

Fair enough. I looked but didn't find a convention closer to "lost race" or "lock unavailable" but 75 is another convention for temporary failure.

It might help crunch-job's readability if you define the "didn't start" exit code as a constant, and use it where appropriate. Someone who jumps to one of later exits might not see your comment about it and miss the number's special meaning.

Changed to EX_TEMPFAIL constant.

Also
  • added some comments to crunch-job about another possible race.
  • exit non-zero if the job fails. This doesn't affect the race, but makes it possible to script local/dev jobs.

Now at 55db208 -- does this look better?

#4 Updated by Brett Smith about 5 years ago

Tom Clegg wrote:

Now at 55db208 -- does this look better?

Yes, please go ahead and merge this. Thanks for following up.

#5 Updated by Anonymous about 5 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:ed1580a38c1aae0910cef83605622a41d927fba3.

#6 Updated by Ward Vandewege about 5 years ago

  • Story points set to 0.5

Also available in: Atom PDF