Bug #6356

[Crunch] crunch-job retried a permanently failed task

Added by Bryan Cosca about 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
Crunch
Target version:
Start date:
11/09/2015
Due date:
% Done:

0%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Description

Pipeline instance in question: https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-fo8inemanlxlsow
Job in question: https://cloud.curoverse.com/jobs/qr1hi-8i9sb-jqqmdpircln9fhj

I ran into this error, and the task was retried afterward:

Traceback (most recent call last):
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr File "/tmp/crunch-job/src/crunch_scripts/bwa.py", line 67, in
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr chr_pipe = subprocess.Popen(chr_args,stdout=chr_out)
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr File "/usr/lib/python2.7/subprocess.py", line 672, in __init__
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr errread, errwrite) = self._get_handles(stdin, stdout, stderr)
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr File "/usr/lib/python2.7/subprocess.py", line 1063, in _get_handles
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr c2pwrite = stdout.fileno()
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr AttributeError: 'unicode' object has no attribute 'fileno'

chr_out should have been an open file handle, but it was a unicode string instead. This is a bug in the script, so the task should not be retried: a retry runs the same code and hits the same error every time.
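
For illustration, here is a minimal sketch of the failure and of the kind of fix the script would need. It is not the code from bwa.py: the function names, the child command, and the output file name are all made up for this example, and it assumes Python 3 (where the error message names 'str' rather than 'unicode').

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(args, out_path):
    """Run args with the child's stdout sent to out_path (hypothetical helper)."""
    # Popen needs an object with fileno(), so open the file first.
    with open(out_path, "wb") as out_file:
        return subprocess.call(args, stdout=out_file)


def demo_string_vs_handle():
    """Show that a path string fails as stdout, while an open handle works."""
    # Stand-in child command; the real script ran something else.
    args = [sys.executable, "-c", "print('chr1')"]
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, "chr.out")
        try:
            # Same mistake as in the traceback: a string, not a file handle.
            subprocess.Popen(args, stdout=out_path)
            raised = False
        except AttributeError:
            raised = True
        exit_code = run_to_file(args, out_path)
        with open(out_path) as result:
            content = result.read()
    return raised, exit_code, content
```

Because the exception comes from the script's own arguments and not from the node or the network, rerunning the task cannot change the outcome.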


Subtasks

Task #7738: Review 6356-crunch-permfail-task-retry-fix-wip (Status: Resolved, Assigned To: Brett Smith)

Associated revisions

Revision 97bc18ee
Added by Brett Smith almost 4 years ago

Merge branch '6356-crunch-permfail-task-retry-fix-wip'

Closes #6356, #7738.

History

#1 Updated by Bryan Cosca about 4 years ago

  • Description updated (diff)

#2 Updated by Bryan Cosca about 4 years ago

  • Description updated (diff)

#3 Updated by Brett Smith about 4 years ago

It looks like what happened is:

  1. The task failed.
  2. reapchildren noticed this, and did the usual end-of-task work. It always puts failed tasks back in the todo list, so it did that.
  3. With a free slot and a task to run, crunch-job left the while loop at the bottom of THISROUND, and started a new iteration of that for loop. The next task it would start would be the retry of the one that just failed.
  4. That eventually got back to the while loop at the bottom of THISROUND again, which immediately ran last THISROUND. That was too late, since we had already started the task we shouldn't have.

Possible fix(es):

  • Only put tempfailed tasks back on the todo list.
  • Run last THISROUND if $main::please_freeze || defined($main::success); after we call reapchildren(), since it's the main place where the value of $main::success can change.
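
The sequence above, and the effect of the possible fixes, can be sketched with a toy model. This is a simplified Python simulation, not crunch-job's Perl code: run_round, the outcome strings, and the fixed flag are all hypothetical, tasks run one at a time, and fixed=True applies both possible fixes together.

```python
from collections import Counter, deque

OK, TEMPFAIL, PERMFAIL = "ok", "tempfail", "permfail"


def run_round(tasks, outcomes, fixed):
    """Simulate one scheduling round; return (attempts per task, success).

    outcomes maps a task name to its results on successive attempts;
    the last result repeats if the task runs more often than listed.
    """
    todo = deque(tasks)
    attempts = Counter()
    success = None  # undecided, like an undefined $main::success
    while todo:
        name = todo.popleft()
        attempts[name] += 1  # the task is started here
        if success is not None:
            # Original check: only reached after the next task has started.
            break
        results = outcomes[name]
        result = results[min(attempts[name], len(results)) - 1]
        # "Reap" the finished task.
        if result == PERMFAIL:
            success = False
        if result == TEMPFAIL or (result == PERMFAIL and not fixed):
            todo.append(name)  # the unfixed model requeues every failure
        if fixed and success is not None:
            break  # fixed model: stop the round right after reaping
    if success is None:
        success = True
    return dict(attempts), success
```

In this model, a task that permanently fails is started twice with fixed=False and once with fixed=True, while a task that fails temporarily and then succeeds is still retried in the fixed version.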

#4 Updated by Brett Smith about 4 years ago

  • Subject changed from "Jobs retry when a file handle is not open" to "[Crunch] crunch-job retried a permanently failed task"
  • Category set to Crunch
  • Target version set to 2015-07-22 sprint

#5 Updated by Brett Smith about 4 years ago

  • Target version changed from 2015-07-22 sprint to Arvados Future Sprints

#6 Updated by Brett Smith almost 4 years ago

  • Status changed from New to In Progress
  • Assigned To set to Brett Smith
  • Target version changed from Arvados Future Sprints to 2015-11-11 sprint
  • Story points set to 0.5

Branch 6356-crunch-permfail-task-retry-fix-wip is up for review. Please see the commit message for a full rationale explaining why I chose this solution over the others.

#7 Updated by Brett Smith almost 4 years ago

  • Target version changed from 2015-11-11 sprint to 2015-12-02 sprint

#8 Updated by Peter Amstutz almost 4 years ago

This looks good to me. I think the prior attempt at a fix in b306eb48ab12676ffb365ede8197e4f2d7e92011 was just a mistake because crunch-job is confusing. (I see now the loop "while (@freeslot ...) { }" is actually the idle loop that waits for slots to become available, not the loop that actually schedules jobs.)

#9 Updated by Brett Smith almost 4 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:97bc18eee0a50a4bc0209932c26ab44e51b4836b.
