Bug #3570
closed
[Crunch] With multiple crunch-dispatch processes, race condition causes job record to have success=false while still running.
Added by Tom Clegg over 10 years ago.
Updated over 10 years ago.
Reviewing 7ea5d61
I'm concerned we're semantically overloading the number 111. crunch-job both receives this exit code from tasks (where it means "temporary failure") and sends it up to its caller (where it means "I didn't start the Job"—sometimes for permanent reasons). I think we might save ourselves some grief if we use different status codes for these different cases.
It might help crunch-job's readability if you define the "didn't start" exit code as a constant, and use it where appropriate. Someone who jumps to one of later exits might not see your comment about it and miss the number's special meaning.
Thanks.
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:6c91320c3a53ddbe1b81bee8ed6322c20acbd047.
- Status changed from Resolved to In Progress
(Peter raised the same "overload 111" question but suggested merging, so I merged. Then I saw this comment, so I'm reopening to respond to your points.)
Brett Smith wrote:
Reviewing 7ea5d61
I'm concerned we're semantically overloading the number 111. crunch-job both receives this exit code from tasks (where it means "temporary failure") and sends it up to its caller (where it means "I didn't start the Job"—sometimes for permanent reasons). I think we might save ourselves some grief if we use different status codes for these different cases.
Fair enough. I looked but didn't find a convention closer to "lost race" or "lock unavailable" but 75 is another convention for temporary failure.
It might help crunch-job's readability if you define the "didn't start" exit code as a constant, and use it where appropriate. Someone who jumps to one of later exits might not see your comment about it and miss the number's special meaning.
Changed to EX_TEMPFAIL constant.
Also
- added some comments to crunch-job about another possible race.
- exit non-zero if the job fails. This doesn't affect the race, but makes it possible to script local/dev jobs.
Now at 55db208 -- does this look better?
Tom Clegg wrote:
Now at 55db208 -- does this look better?
Yes, please go ahead and merge this. Thanks for following up.
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:ed1580a38c1aae0910cef83605622a41d927fba3.
Also available in: Atom
PDF