Project

General

Profile

Actions

Bug #3570

closed

[Crunch] With multiple crunch-dispatch processes, race condition causes job record to have success=false while still running.

Added by Tom Clegg over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
Story points:
0.5

Subtasks 1 (0 open1 closed)

Task #3571: Review 3570-crunch-raceResolvedTom Clegg08/11/2014Actions
Actions #1

Updated by Brett Smith over 9 years ago

Reviewing 7ea5d61

I'm concerned we're semantically overloading the number 111. crunch-job both receives this exit code from tasks (where it means "temporary failure") and sends it up to its caller (where it means "I didn't start the Job"—sometimes for permanent reasons). I think we might save ourselves some grief if we use different status codes for these different cases.

It might help crunch-job's readability if you define the "didn't start" exit code as a constant, and use it where appropriate. Someone who jumps to one of later exits might not see your comment about it and miss the number's special meaning.

Thanks.

Actions #2

Updated by Anonymous over 9 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:6c91320c3a53ddbe1b81bee8ed6322c20acbd047.

Actions #3

Updated by Tom Clegg over 9 years ago

  • Status changed from Resolved to In Progress

(Peter raised the same "overload 111" question but suggested merging, so I merged. Then I saw this comment, so I'm reopening to respond to your points.)

Brett Smith wrote:

Reviewing 7ea5d61

I'm concerned we're semantically overloading the number 111. crunch-job both receives this exit code from tasks (where it means "temporary failure") and sends it up to its caller (where it means "I didn't start the Job"—sometimes for permanent reasons). I think we might save ourselves some grief if we use different status codes for these different cases.

Fair enough. I looked but didn't find a convention closer to "lost race" or "lock unavailable" but 75 is another convention for temporary failure.

It might help crunch-job's readability if you define the "didn't start" exit code as a constant, and use it where appropriate. Someone who jumps to one of later exits might not see your comment about it and miss the number's special meaning.

Changed to EX_TEMPFAIL constant.

Also
  • added some comments to crunch-job about another possible race.
  • exit non-zero if the job fails. This doesn't affect the race, but makes it possible to script local/dev jobs.

Now at 55db208 -- does this look better?

Actions #4

Updated by Brett Smith over 9 years ago

Tom Clegg wrote:

Now at 55db208 -- does this look better?

Yes, please go ahead and merge this. Thanks for following up.

Actions #5

Updated by Anonymous over 9 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:ed1580a38c1aae0910cef83605622a41d927fba3.

Actions #6

Updated by Ward Vandewege over 9 years ago

  • Story points set to 0.5
Actions

Also available in: Atom PDF