Bug #4310
closed[Crunch] crunch-dispatch --jobs locking is broken
Description
[Crunch] qr1hi-d1hrv-xg61rrhrxctnjrb failed with 'state invalid change from Failed to Complete'
We have 2 crunch-dispatchers running in --jobs mode right now. If we have locking, it may not be working right.
Related issues
Updated by Ward Vandewege over 9 years ago
- Target version changed from Bug Triage to 2014-11-19 sprint
Updated by Ward Vandewege over 9 years ago
- Subject changed from [Crunch] qr1hi-d1hrv-xg61rrhrxctnjrb failed with 'state invalid change from Failed to Complete' to [Crunch] crunch-dispatch --jobs locking is broken
- Description updated (diff)
- Story points deleted (
0.5)
Updated by Peter Amstutz over 9 years ago
- Category set to Crunch
- Status changed from New to In Progress
Updated by Peter Amstutz over 9 years ago
The bug is not in the actual locking code, the bug is in crunch-dispatch's handling of setting the git tags for locking. This used to work (in a slightly ugly way) but the 4297-dispatch-load branch converted various "bail out and try again" errors into fatal errors, which resulted in this regression.
Updated by Tom Clegg over 9 years ago
At 7d6454e
Now that it's normal (during a race) for the "git tag" command to fail, but "git rev-list" failure really means a bad thing, the 2>/dev/null
should be removed from "git rev-list" and could be added to "git tag".
Minor things:
These (existing) statements are either debug printfs (which should be removed) or real log messages (which should be prefixed with at least "dispatch: "
like [most of] crunch-dispatch's other log messsages):
+ $stderr.puts cmd
+ $stderr.puts `#{cmd}`
Could be more specific/helpful and say "any existing tag" instead of "the tag" here, since "found other tag" would have resulted in the other "existing tag points to blah" error.
+ fail_job job, "'git tag' for #{job.uuid} failed but did not find the tag using 'git rev-list'"
Updated by Peter Amstutz over 9 years ago
Tom Clegg wrote:
At 7d6454e
Now that it's normal (during a race) for the "git tag" command to fail, but "git rev-list" failure really means a bad thing, the
2>/dev/null
should be removed from "git rev-list" and could be added to "git tag".
Fixed.
Minor things:
These (existing) statements are either debug printfs (which should be removed) or real log messages (which should be prefixed with at least
"dispatch: "
like [most of] crunch-dispatch's other log messsages):
[...]
Fixed, although we could probably do with a real logger in crunch-dispatch.
Could be more specific/helpful and say "any existing tag" instead of "the tag" here, since "found other tag" would have resulted in the other "existing tag points to blah" error.
[...]
Fixed.
Updated by Anonymous over 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:a324727d478feb278ab35300c5b96e2349e23f3d.
Updated by Tom Clegg over 9 years ago
- Status changed from Resolved to In Progress
f3b3935 fixes a bug but it doesn't explain/address the problem at hand, i.e., "state invalid change from Failed to Complete".
We have a reasonable explanation and bugfix for why crunch-dispatch.rb decided not to run the job (which happened to be the right move in this case, since the job was already being started by a different dispatch process) but we still have the problem that two crunch-dispatches thought they owned the job.
I think the real bug here is that crunch-dispatch.rb does not call Job#lock before setting state='Failed' in fail_job().
If it had done so, dispatch "B" (the race loser) would have left the job alone and dispatch "A" would have succeeded.
Updated by Tom Clegg over 9 years ago
At 3e1ef3e
LGTM, except that it'll never work until you call it ArvadosModel::AlreadyLockedError
. :)
Updated by Anonymous over 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 67 to 100
Applied in changeset arvados|commit:04c1723357282d0ef2890e570cfab84d35b99da4.