Bug #4310
closed[Crunch] crunch-dispatch --jobs locking is broken
Description
[Crunch] qr1hi-d1hrv-xg61rrhrxctnjrb failed with 'state invalid change from Failed to Complete'
We have 2 crunch-dispatchers running in --jobs mode right now. If we have locking, it may not be working right.
Updated by Ward Vandewege about 10 years ago
- Target version changed from Bug Triage to 2014-11-19 sprint
Updated by Ward Vandewege about 10 years ago
- Subject changed from [Crunch] qr1hi-d1hrv-xg61rrhrxctnjrb failed with 'state invalid change from Failed to Complete' to [Crunch] crunch-dispatch --jobs locking is broken
- Description updated (diff)
- Story points deleted (
0.5)
Updated by Peter Amstutz about 10 years ago
- Category set to Crunch
- Status changed from New to In Progress
Updated by Peter Amstutz about 10 years ago
The bug is not in the actual locking code, the bug is in crunch-dispatch's handling of setting the git tags for locking. This used to work (in a slightly ugly way) but the 4297-dispatch-load branch converted various "bail out and try again" errors into fatal errors, which resulted in this regression.
Updated by Tom Clegg about 10 years ago
At 7d6454e
Now that it's normal (during a race) for the "git tag" command to fail, but "git rev-list" failure really means a bad thing, the 2>/dev/null
should be removed from "git rev-list" and could be added to "git tag".
Minor things:
These (existing) statements are either debug printfs (which should be removed) or real log messages (which should be prefixed with at least "dispatch: "
like [most of] crunch-dispatch's other log messsages):
+ $stderr.puts cmd
+ $stderr.puts `#{cmd}`
Could be more specific/helpful and say "any existing tag" instead of "the tag" here, since "found other tag" would have resulted in the other "existing tag points to blah" error.
+ fail_job job, "'git tag' for #{job.uuid} failed but did not find the tag using 'git rev-list'"
Updated by Peter Amstutz about 10 years ago
Tom Clegg wrote:
At 7d6454e
Now that it's normal (during a race) for the "git tag" command to fail, but "git rev-list" failure really means a bad thing, the
2>/dev/null
should be removed from "git rev-list" and could be added to "git tag".
Fixed.
Minor things:
These (existing) statements are either debug printfs (which should be removed) or real log messages (which should be prefixed with at least
"dispatch: "
like [most of] crunch-dispatch's other log messsages):
[...]
Fixed, although we could probably do with a real logger in crunch-dispatch.
Could be more specific/helpful and say "any existing tag" instead of "the tag" here, since "found other tag" would have resulted in the other "existing tag points to blah" error.
[...]
Fixed.
Updated by Anonymous about 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:a324727d478feb278ab35300c5b96e2349e23f3d.
Updated by Tom Clegg about 10 years ago
- Status changed from Resolved to In Progress
f3b3935 fixes a bug but it doesn't explain/address the problem at hand, i.e., "state invalid change from Failed to Complete".
We have a reasonable explanation and bugfix for why crunch-dispatch.rb decided not to run the job (which happened to be the right move in this case, since the job was already being started by a different dispatch process) but we still have the problem that two crunch-dispatches thought they owned the job.
I think the real bug here is that crunch-dispatch.rb does not call Job#lock before setting state='Failed' in fail_job().
If it had done so, dispatch "B" (the race loser) would have left the job alone and dispatch "A" would have succeeded.
Updated by Tom Clegg about 10 years ago
At 3e1ef3e
LGTM, except that it'll never work until you call it ArvadosModel::AlreadyLockedError
. :)
Updated by Anonymous about 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 67 to 100
Applied in changeset arvados|commit:04c1723357282d0ef2890e570cfab84d35b99da4.