Bug #12551
Closed
crunch-job should check errors from open() calls
Description
Undetected open() errors could be causing the "missing stderr" bug mentioned in #12550.
Specifically, each task child process relies on this to capture stderr from srun:
open(STDOUT,">&writer");
open(STDERR,">&writer");
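For illustration, here is a minimal sketch of what checking those calls could look like. It is not the change that was committed: the helper name dup_or_die is hypothetical, the lexical pipe handles and three-argument open() are assumptions made to keep the example self-contained, and die is only a stand-in for whatever failure handling crunch-job actually uses.

```perl
use strict;
use warnings;

# Hypothetical helper (not from crunch-job): duplicate $target onto $handle
# and die with the OS error message if the dup fails.
sub dup_or_die {
    my ($handle, $target, $name) = @_;
    open($handle, ">&", $target)
        or die "cannot redirect $name: $!\n";
    return 1;
}

pipe(my $reader, my $writer) or die "pipe() failed: $!\n";

my $pid = fork();
die "fork() failed: $!\n" unless defined $pid;

if ($pid == 0) {
    # Child: point both output streams at the pipe, checking each open().
    close($reader);
    dup_or_die(\*STDOUT, $writer, "STDOUT");
    dup_or_die(\*STDERR, $writer, "STDERR");
    close($writer);
    print STDERR "message on stderr\n";
    exit(0);
}

# Parent: read everything the child wrote, then reap the child.
close($writer);
my @lines = <$reader>;
close($reader);
waitpid($pid, 0);
print "captured: $_" for @lines;
```

If either open() fails in the child, the error message names the stream and includes $!, so the failure is reported instead of silently leaving stderr unredirected.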
Updated by Tom Clegg about 7 years ago
12551-check-open-errors @ ab82f5b47625e76b47893c992b31e9b2d2208d3f
Updated by Tom Clegg about 7 years ago
- Status changed from New to In Progress
- Assigned To set to Tom Clegg
Updated by Tom Clegg about 7 years ago
If this makes any difference at all, it'll make the affected jobs fail (not as good as tempfail, but better than causing downstream failures that are really hard to recover from because of job reuse). By doing so it will also tell us what the open() errors are. Then we can figure out how to prevent the errors from happening at all and/or recover from them more gracefully.
Updated by Tom Clegg about 7 years ago
Peter Amstutz wrote:
Why not
exit_retry_unlocked();
?
Less likely to introduce more complications. I stuck with what we do in other rare/unexpected error cases, like fcntl() failing.
Updated by Peter Amstutz about 7 years ago
OK, LGTM. If any of these calls are failing silently, at least they will now fail loudly.
Updated by Anonymous about 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:98e9073e7ca36edbe8dd0569d67405e2e030f8db.