Bug #3570
closed[Crunch] With multiple crunch-dispatch processes, race condition causes job record to have success=false while still running.
Updated by Brett Smith over 10 years ago
Reviewing 7ea5d61
I'm concerned we're semantically overloading the number 111. crunch-job both receives this exit code from tasks (where it means "temporary failure") and sends it up to its caller (where it means "I didn't start the Job"—sometimes for permanent reasons). I think we might save ourselves some grief if we use different status codes for these different cases.
It might help crunch-job's readability if you define the "didn't start" exit code as a constant, and use it where appropriate. Someone who jumps to one of later exits might not see your comment about it and miss the number's special meaning.
Thanks.
Updated by Anonymous over 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:6c91320c3a53ddbe1b81bee8ed6322c20acbd047.
Updated by Tom Clegg over 10 years ago
- Status changed from Resolved to In Progress
(Peter raised the same "overload 111" question but suggested merging, so I merged. Then I saw this comment, so I'm reopening to respond to your points.)
Brett Smith wrote:
Reviewing 7ea5d61
I'm concerned we're semantically overloading the number 111. crunch-job both receives this exit code from tasks (where it means "temporary failure") and sends it up to its caller (where it means "I didn't start the Job"—sometimes for permanent reasons). I think we might save ourselves some grief if we use different status codes for these different cases.
Fair enough. I looked but didn't find a convention closer to "lost race" or "lock unavailable" but 75 is another convention for temporary failure.
It might help crunch-job's readability if you define the "didn't start" exit code as a constant, and use it where appropriate. Someone who jumps to one of later exits might not see your comment about it and miss the number's special meaning.
Changed to EX_TEMPFAIL constant.
Also- added some comments to crunch-job about another possible race.
- exit non-zero if the job fails. This doesn't affect the race, but makes it possible to script local/dev jobs.
Now at 55db208 -- does this look better?
Updated by Brett Smith over 10 years ago
Tom Clegg wrote:
Now at 55db208 -- does this look better?
Yes, please go ahead and merge this. Thanks for following up.
Updated by Anonymous over 10 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:ed1580a38c1aae0910cef83605622a41d927fba3.