Bug #6356

closed

[Crunch] crunch-job retried a permanently failed task

Added by Bryan Cosca over 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: 0.5

Description

Pipeline instance in question: https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-fo8inemanlxlsow
Job in question: https://cloud.curoverse.com/jobs/qr1hi-8i9sb-jqqmdpircln9fhj

After I ran into this error:

Traceback (most recent call last):
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr File "/tmp/crunch-job/src/crunch_scripts/bwa.py", line 67, in <module>
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr chr_pipe = subprocess.Popen(chr_args,stdout=chr_out)
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr File "/usr/lib/python2.7/subprocess.py", line 672, in __init__
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr errread, errwrite) = self._get_handles(stdin, stdout, stderr)
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr File "/usr/lib/python2.7/subprocess.py", line 1063, in _get_handles
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr c2pwrite = stdout.fileno()
2015-06-18_18:02:38 qr1hi-8i9sb-jqqmdpircln9fhj 28591 2 stderr AttributeError: 'unicode' object has no attribute 'fileno'

chr_out should have been an open file handle, but it was a unicode string instead. The task should not have been retried: the failure is permanent, so a retry will hit the same error every time.
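The failure mode in the traceback can be reproduced in isolation. The following is a minimal sketch, not the code from bwa.py (which is not shown in this ticket): the function names are hypothetical, and it is written for Python 3, where the message names 'str' rather than 'unicode'. It shows that subprocess.Popen raises AttributeError when stdout is a path string, and that opening the file first avoids the error.

```python
import subprocess
import sys


def popen_with_path_as_stdout(args, out_path):
    """Pass a path string as stdout, as the traceback suggests happened.

    Returns the AttributeError message, or None if no error was raised.
    """
    try:
        subprocess.Popen(args, stdout=out_path)
    except AttributeError as error:
        return str(error)
    return None


def popen_with_open_file(args, out_path):
    """Open the output file first, then hand the file object to Popen.

    Returns the child's exit code.
    """
    with open(out_path, "wb") as out_file:
        proc = subprocess.Popen(args, stdout=out_file)
        return proc.wait()


if __name__ == "__main__":
    command = [sys.executable, "-c", "print('hello')"]
    print(popen_with_path_as_stdout(command, "out.txt"))
    print(popen_with_open_file(command, "out.txt"))
```

Because the error depends only on the type of the argument, rerunning the same task with the same inputs cannot succeed, which is why this counts as a permanent failure.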


Subtasks 1 (0 open, 1 closed)

Task #7738: Review 6356-crunch-permfail-task-retry-fix-wip | Resolved | Brett Smith | 11/09/2015
Actions #1

Updated by Bryan Cosca over 9 years ago

  • Description updated (diff)
Actions #2

Updated by Bryan Cosca over 9 years ago

  • Description updated (diff)
Actions #3

Updated by Brett Smith over 9 years ago

It looks like what happened is:

  1. The task failed.
  2. reapchildren noticed this, and did the usual end-of-task work. It always puts failed tasks back in the todo list, so it did that.
  3. With a free slot and a task to run, crunch-job left the while loop at the bottom of THISROUND, and started a new iteration of that for loop. The next task it would start would be the retry of the one that just failed.
  4. That eventually got back to the while loop at the bottom of THISROUND again, which immediately ran last THISROUND. By then it was too late, since we had already started the task we shouldn't have.

Possible fix(es):

  • Only put tempfailed task back on the todo list.
  • Run last THISROUND if $main::please_freeze || defined($main::success); after we call reapchildren(), since it's the main place where the value of $main::success can change.
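The sequence of events and the two possible fixes can be illustrated with a toy model. This is a simplified sketch in Python, not crunch-job's actual Perl code: the function schedule, its parameters, and the outcome strings are all hypothetical, and it assumes a single slot. The variable success stands in for $main::success, and the early break stands in for last THISROUND.

```python
from collections import deque


def schedule(tasks, outcome_of, requeue_permfail=True,
             check_after_reap=False, max_starts=10):
    """Toy single-slot scheduler; returns the list of task starts, in order.

    outcome_of(task) returns "ok", "tempfail", or "permfail".
    requeue_permfail=True models putting every failed task back on the todo list.
    check_after_reap=True models checking the stop condition right after reaping.
    """
    todo = deque(tasks)
    started = []
    success = None  # None while undecided, False after a permanent failure
    while todo and len(started) < max_starts:
        task = todo.popleft()
        started.append(task)  # the task has been started at this point
        if success is not None:
            break  # late stop check: the task above was already started
        outcome = outcome_of(task)  # reap the finished task
        if outcome == "permfail":
            success = False
        if outcome == "tempfail" or (outcome == "permfail" and requeue_permfail):
            todo.append(task)
        if check_after_reap and success is not None:
            break  # stop before another task can be started
    return started


def always_permfail(task):
    return "permfail"


if __name__ == "__main__":
    print(schedule(["bwa"], always_permfail))
    print(schedule(["bwa"], always_permfail, requeue_permfail=False))
    print(schedule(["bwa"], always_permfail, check_after_reap=True))
```

In this model the default settings start the permanently failed task a second time, while either fix on its own limits it to a single start. Temporary failures are still requeued in all three configurations.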
Actions #4

Updated by Brett Smith over 9 years ago

  • Subject changed from Jobs retry when a file handle is not open to [Crunch] crunch-job retried a permanently failed task
  • Category set to Crunch
  • Target version set to 2015-07-22 sprint
Actions #5

Updated by Brett Smith over 9 years ago

  • Target version changed from 2015-07-22 sprint to Arvados Future Sprints
Actions #6

Updated by Brett Smith about 9 years ago

  • Status changed from New to In Progress
  • Assigned To set to Brett Smith
  • Target version changed from Arvados Future Sprints to 2015-11-11 sprint
  • Story points set to 0.5

Branch 6356-crunch-permfail-task-retry-fix-wip is up for review. Please see the commit message for very full rationale explaining why I chose this solution over others.

Actions #7

Updated by Brett Smith about 9 years ago

  • Target version changed from 2015-11-11 sprint to 2015-12-02 sprint
Actions #8

Updated by Peter Amstutz about 9 years ago

This looks good to me. I think the prior attempt at a fix in b306eb48ab12676ffb365ede8197e4f2d7e92011 was just a mistake because crunch-job is confusing. (I see now the loop "while (@freeslot ...) { }" is actually the idle loop that waits for slots to become available, not the loop that actually schedules jobs.)

Actions #9

Updated by Brett Smith about 9 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:97bc18eee0a50a4bc0209932c26ab44e51b4836b.
