Bug #6356
closed
[Crunch] crunch-job retried a permanently failed task
Description
The pipeline instance in question: https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-fo8inemanlxlsow
The job in question: https://cloud.curoverse.com/jobs/qr1hi-8i9sb-jqqmdpircln9fhj
I ran into this error:
Traceback (most recent call last):
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr File "/tmp/crunch-job/src/crunch_scripts/bwa.py", line 67, in
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr chr_pipe = subprocess.Popen(chr_args,stdout=chr_out)
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr File "/usr/lib/python2.7/subprocess.py", line 672, in __init__
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr errread, errwrite) = self._get_handles(stdin, stdout, stderr)
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr File "/usr/lib/python2.7/subprocess.py", line 1063, in _get_handles
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr c2pwrite = stdout.fileno()
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr AttributeError: 'unicode' object has no attribute 'fileno'
chr_out should have been an open file handle, but it was a unicode string instead. crunch-job should not retry this task, because a retry will fail the same way every time.
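For anyone hitting the same AttributeError, here is a minimal sketch of the difference between passing a path string and passing an open file handle as stdout. It is not the actual bwa.py code: chr_args and chr_out reuse the names from the traceback, the command is a stand-in, and run_to_file, run_with_path_as_stdout, and demo are hypothetical helpers. It is written for Python 3, where the message names 'str' rather than 'unicode'.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in command (an assumption); the real chr_args in bwa.py is not shown in this ticket.
chr_args = [sys.executable, "-c", "print('chr1')"]


def run_to_file(args, path):
    """Run args with stdout redirected to an open file handle; return the exit code."""
    with open(path, "wb") as chr_out:
        proc = subprocess.Popen(args, stdout=chr_out)
        return proc.wait()


def run_with_path_as_stdout(args, path):
    """Reproduce the failure mode: pass the path string itself as stdout."""
    try:
        subprocess.Popen(args, stdout=path)
    except AttributeError as err:
        # Popen calls stdout.fileno(), which a string does not have.
        return str(err)
    return None


def demo():
    """Return (exit_code, captured_output, error_message) for both variants."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "chr.out")
        code = run_to_file(chr_args, path)
        with open(path, "rb") as f:
            output = f.read().strip()
        error = run_with_path_as_stdout(chr_args, path)
    return code, output, error
```

The string variant fails before any child process is started, so the result is the same on every attempt.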
Updated by Brett Smith over 9 years ago
It looks like what happened is:
- The task failed.
- reapchildren noticed this, and did the usual end-of-task work. It always puts failed tasks back in the todo list, so it did that.
- With a free slot and a task to run, crunch-job left the while loop at the bottom of THISROUND, and started a new iteration of that for loop. The next task it would start would be the retry of the one that just failed.
- That eventually got back to the while loop at the bottom of THISROUND again, which immediately ran
last THISROUND
That was too late, since we had already started the task we shouldn't have.
Possible fix(es):
- Only put tempfailed tasks back on the todo list.
- Run
last THISROUND if $main::please_freeze || defined($main::success);
after we call reapchildren(), since it's the main place where the value of $main::success can change.
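To make the ordering problem and the two proposed fixes easier to compare, here is a toy model in Python. It is only a sketch, not crunch-job itself: it assumes a single slot, assumes a permanent task failure is what makes $main::success defined, and uses hypothetical names throughout (tasks_started, requeue_only_tempfail, check_after_reap, job_failed).

```python
from collections import deque

OK, TEMPFAIL, PERMFAIL = "ok", "tempfail", "permfail"


def tasks_started(outcomes, requeue_only_tempfail=False, check_after_reap=False):
    """Return the order in which a toy one-slot scheduler starts tasks.

    outcomes maps a task name to the result of each of its attempts.
    Attempts beyond the listed results count as OK, so the loop terminates.
    """
    todo = deque(outcomes)
    attempt = {name: 0 for name in outcomes}
    started = []
    job_failed = False  # stands in for defined($main::success)

    while todo:
        # Top of the round loop: start the next task on the free slot.
        task = todo.popleft()
        started.append(task)

        # Bottom of the round loop: the existing stop check runs before reaping.
        if job_failed:
            break

        # Toy reapchildren(): collect the result and do end-of-task work.
        results = outcomes[task]
        result = results[attempt[task]] if attempt[task] < len(results) else OK
        attempt[task] += 1
        if result == PERMFAIL:
            job_failed = True
        if result == TEMPFAIL or (result == PERMFAIL and not requeue_only_tempfail):
            todo.append(task)

        # Second proposed fix: check again right after reaping.
        if check_after_reap and job_failed:
            break

    return started
```

In this model, tasks_started({"t0": [PERMFAIL]}) starts t0 twice, which is the behavior reported here. Setting either flag starts it only once, and a task that tempfails and then succeeds is still retried with both flags set.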
Updated by Brett Smith over 9 years ago
- Subject changed from Jobs retry when a file handle is not open to [Crunch] crunch-job retried a permanently failed task
- Category set to Crunch
- Target version set to 2015-07-22 sprint
Updated by Brett Smith over 9 years ago
- Target version changed from 2015-07-22 sprint to Arvados Future Sprints
Updated by Brett Smith about 9 years ago
- Status changed from New to In Progress
- Assigned To set to Brett Smith
- Target version changed from Arvados Future Sprints to 2015-11-11 sprint
- Story points set to 0.5
Branch 6356-crunch-permfail-task-retry-fix-wip is up for review. Please see the commit message for very full rationale explaining why I chose this solution over others.
Updated by Brett Smith about 9 years ago
- Target version changed from 2015-11-11 sprint to 2015-12-02 sprint
Updated by Peter Amstutz about 9 years ago
This looks good to me. I think the prior attempt at a fix in b306eb48ab12676ffb365ede8197e4f2d7e92011 was just a mistake because crunch-job is confusing. (I see now the loop "while (@freeslot ...) { }" is actually the idle loop that waits for slots to become available, not the loop that actually schedules jobs.)
Updated by Brett Smith about 9 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:97bc18eee0a50a4bc0209932c26ab44e51b4836b.