Project

General

Profile

Actions

Bug #4310

closed

[Crunch] crunch-dispatch --jobs locking is broken

Added by Ward Vandewege over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Story points:
0.5

Description

[Crunch] qr1hi-d1hrv-xg61rrhrxctnjrb failed with 'state invalid change from Failed to Complete'

We have 2 crunch-dispatchers running in --jobs mode right now. If we have locking, it may not be working right.


Subtasks 3 (0 open3 closed)

Task #4459: Review 4310-lock-before-failResolvedPeter Amstutz11/06/2014Actions
Task #4447: Review 4310-git-tag-raceResolvedPeter Amstutz11/06/2014Actions
Task #4374: Diagnose and fixResolvedPeter Amstutz11/06/2014Actions

Related issues

Related to Arvados - Bug #4314: [Crunch] Figure out why this job was marked Failed unexpectedlyResolvedPeter Amstutz10/24/2014Actions
Actions #1

Updated by Ward Vandewege over 9 years ago

  • Description updated (diff)
Actions #2

Updated by Ward Vandewege over 9 years ago

  • Target version changed from Bug Triage to 2014-11-19 sprint
Actions #3

Updated by Ward Vandewege over 9 years ago

  • Story points set to 0.5
Actions #4

Updated by Ward Vandewege over 9 years ago

  • Subject changed from [Crunch] qr1hi-d1hrv-xg61rrhrxctnjrb failed with 'state invalid change from Failed to Complete' to [Crunch] crunch-dispatch --jobs locking is broken
  • Description updated (diff)
  • Story points deleted (0.5)
Actions #5

Updated by Ward Vandewege over 9 years ago

  • Story points set to 0.5
Actions #6

Updated by Ward Vandewege over 9 years ago

  • Assigned To set to Peter Amstutz
Actions #7

Updated by Peter Amstutz over 9 years ago

  • Category set to Crunch
  • Status changed from New to In Progress
Actions #8

Updated by Peter Amstutz over 9 years ago

The bug is not in the actual locking code, the bug is in crunch-dispatch's handling of setting the git tags for locking. This used to work (in a slightly ugly way) but the 4297-dispatch-load branch converted various "bail out and try again" errors into fatal errors, which resulted in this regression.

Actions #9

Updated by Tom Clegg over 9 years ago

At 7d6454e

Now that it's normal (during a race) for the "git tag" command to fail, but "git rev-list" failure really means a bad thing, the 2>/dev/null should be removed from "git rev-list" and could be added to "git tag".

Minor things:

These (existing) statements are either debug printfs (which should be removed) or real log messages (which should be prefixed with at least "dispatch: " like [most of] crunch-dispatch's other log messsages):

+        $stderr.puts cmd
+        $stderr.puts `#{cmd}`

Could be more specific/helpful and say "any existing tag" instead of "the tag" here, since "found other tag" would have resulted in the other "existing tag points to blah" error.

+            fail_job job, "'git tag' for #{job.uuid} failed but did not find the tag using 'git rev-list'" 

Actions #10

Updated by Peter Amstutz over 9 years ago

Tom Clegg wrote:

At 7d6454e

Now that it's normal (during a race) for the "git tag" command to fail, but "git rev-list" failure really means a bad thing, the 2>/dev/null should be removed from "git rev-list" and could be added to "git tag".

Fixed.

Minor things:

These (existing) statements are either debug printfs (which should be removed) or real log messages (which should be prefixed with at least "dispatch: " like [most of] crunch-dispatch's other log messsages):
[...]

Fixed, although we could probably do with a real logger in crunch-dispatch.

Could be more specific/helpful and say "any existing tag" instead of "the tag" here, since "found other tag" would have resulted in the other "existing tag points to blah" error.
[...]

Fixed.

Actions #11

Updated by Tom Clegg over 9 years ago

LGTM @ f3b3935, thanks!

Actions #12

Updated by Anonymous over 9 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:a324727d478feb278ab35300c5b96e2349e23f3d.

Actions #13

Updated by Tom Clegg over 9 years ago

  • Status changed from Resolved to In Progress

f3b3935 fixes a bug but it doesn't explain/address the problem at hand, i.e., "state invalid change from Failed to Complete".

We have a reasonable explanation and bugfix for why crunch-dispatch.rb decided not to run the job (which happened to be the right move in this case, since the job was already being started by a different dispatch process) but we still have the problem that two crunch-dispatches thought they owned the job.

I think the real bug here is that crunch-dispatch.rb does not call Job#lock before setting state='Failed' in fail_job().

If it had done so, dispatch "B" (the race loser) would have left the job alone and dispatch "A" would have succeeded.

Actions #14

Updated by Tom Clegg over 9 years ago

At 3e1ef3e

LGTM, except that it'll never work until you call it ArvadosModel::AlreadyLockedError. :)

Actions #15

Updated by Anonymous over 9 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:04c1723357282d0ef2890e570cfab84d35b99da4.

Actions

Also available in: Atom PDF