Bug #12551

closed

crunch-job should check errors from open() calls

Added by Tom Clegg about 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -

Description

Undetected open() errors could be causing the "missing stderr" bug mentioned in #12550.

Specifically, each task child process relies on this to capture stderr from srun:

    open(STDOUT,">&writer");
    open(STDERR,">&writer");
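
For illustration, here is a minimal sketch of what a checked version of those redirects could look like. It is not the change that was actually committed. The helper names `redirect_or_die` and `run_child_with_captured_output` are hypothetical, and the sketch uses lexical filehandles and three-argument open() instead of the bareword `writer` handle shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from crunch-job): dup $src onto $dst and die
# with the OS error in $! if open() fails, instead of ignoring it.
sub redirect_or_die {
    my ($dst, $src, $name) = @_;
    open($dst, ">&", $src)
        or die "cannot redirect $name to pipe: $!";
}

# Fork a child whose STDOUT and STDERR both go to one pipe, then
# return the child's wait status and every line it wrote.
sub run_child_with_captured_output {
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: keep only the write end and point both streams at it.
        close($reader);
        redirect_or_die(\*STDOUT, $writer, "STDOUT");
        redirect_or_die(\*STDERR, $writer, "STDERR");
        close($writer);
        print STDOUT "out\n";
        print STDERR "err\n";
        exit 0;
    }
    # Parent: close the write end so reading stops when the child exits.
    close($writer);
    my @lines = <$reader>;
    close($reader);
    waitpid($pid, 0);
    return ($?, @lines);
}

my ($status, @lines) = run_child_with_captured_output();
print "child status $status, captured: ", join("", sort @lines);
```

If either open() fails in the child, the child dies with a message containing `$!` and exits non-zero, so the parent sees a failed task rather than a task with silently missing output.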
Actions #1

Updated by Tom Clegg about 7 years ago

Actions #2

Updated by Tom Clegg about 7 years ago

  • Status changed from New to In Progress
  • Assigned To set to Tom Clegg
Actions #3

Updated by Tom Clegg about 7 years ago

If this makes any difference at all, it will make the affected jobs fail. That is not as good as tempfail, but it is better than causing downstream failures that are really hard to recover from because of job reuse. Failing will also tell us what the open() errors are, so we can figure out how to prevent them from happening at all, recover from them more gracefully, or both.

Actions #4

Updated by Tom Clegg about 7 years ago

  • Target version set to 2017-11-08 Sprint
Actions #5

Updated by Peter Amstutz about 7 years ago

Why not exit_retry_unlocked(); ?

Actions #6

Updated by Tom Clegg about 7 years ago

Peter Amstutz wrote:

Why not exit_retry_unlocked(); ?

Less likely to introduce more complications. I stuck with what we do in other rare/unexpected error cases, like fcntl() failing.

Actions #7

Updated by Peter Amstutz about 7 years ago

OK, LGTM. If any of these calls are failing silently now, at least they will fail loudly.

Actions #8

Updated by Anonymous about 7 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:98e9073e7ca36edbe8dd0569d67405e2e030f8db.
