Bug #12551

crunch-job should check errors from open() calls

Added by Tom Clegg over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 11/03/2017
Due date:
% Done: 100%
Estimated time:
Story points: -

Description

Undetected open() errors could be causing the "missing stderr" bug mentioned in #12550.

Specifically, each task child process relies on this to capture stderr from srun:

    open(STDOUT,">&writer");
    open(STDERR,">&writer");
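
For illustration, here is a minimal sketch of the same redirection with every open() result checked. It is not the crunch-job code: the helper name `capture_child_output`, the lexical `$reader`/`$writer` handles (the snippet above uses a bareword `writer`), and the choice to exit the child with status 1 on failure are all assumptions made for the example.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper (not from crunch-job): run $work in a child process
# with STDOUT and STDERR redirected into a pipe, checking every open().
# Returns the captured output and the child's exit code.
sub capture_child_output {
    my ($work) = @_;
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        close($reader);
        # STDERR still points at the original stream here, so warn() is visible.
        open(STDOUT, ">&", $writer) or do {
            warn "cannot redirect STDOUT: $!\n";
            POSIX::_exit(1);
        };
        # STDOUT now points at the pipe, so report a STDERR failure there.
        open(STDERR, ">&", $writer) or do {
            syswrite(STDOUT, "cannot redirect STDERR: $!\n");
            POSIX::_exit(1);
        };
        $work->();
        POSIX::_exit(0);
    }

    # Parent: close our copy of the write end so the read sees EOF.
    close($writer);
    my $output = do { local $/; <$reader> };
    $output = "" unless defined $output;
    close($reader);
    waitpid($pid, 0);
    return ($output, $? >> 8);
}
```

A caller might use it as `my ($out, $code) = capture_child_output(sub { syswrite(STDERR, "oops\n") });`. If either redirect fails, the message includes `$!`, so the failure is reported along with its reason instead of silently losing the child's stderr.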

Associated revisions

Revision 98e9073e
Added by Tom Clegg over 1 year ago

Merge branch '12551-check-open-errors'

fixes #12551

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg over 1 year ago

#2 Updated by Tom Clegg over 1 year ago

  • Status changed from New to In Progress
  • Assigned To set to Tom Clegg

#3 Updated by Tom Clegg over 1 year ago

If this makes any difference at all, it will make the affected jobs fail. That is not as good as tempfail, but it is better than causing downstream failures that are really hard to recover from because of job reuse. Failing will also tell us what the open() errors are, so we can then figure out how to prevent them from happening at all, recover from them more gracefully, or both.

#4 Updated by Tom Clegg over 1 year ago

  • Target version set to 2017-11-08 Sprint

#5 Updated by Peter Amstutz over 1 year ago

Why not exit_retry_unlocked(); ?

#6 Updated by Tom Clegg over 1 year ago

Peter Amstutz wrote:

Why not exit_retry_unlocked(); ?

Less likely to introduce more complications. I stuck with what we do in other rare/unexpected error cases, like fcntl() failing.

#7 Updated by Peter Amstutz over 1 year ago

OK, LGTM. If any of these things are failing silently now, then at least they will fail loudly.

#8 Updated by Anonymous over 1 year ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:98e9073e7ca36edbe8dd0569d67405e2e030f8db.
