Bug #12551
closed
crunch-job should check errors from open() calls
Added by Tom Clegg about 7 years ago.
Updated about 7 years ago.
Description
Undetected open() errors could be causing the "missing stderr" bug mentioned in #12550.
Specifically, each task child process relies on this to capture stderr from srun:
open(STDOUT,">&writer");
open(STDERR,">&writer");
- Status changed from New to In Progress
- Assigned To set to Tom Clegg
If this makes any difference at all, it will make the affected jobs fail. That is not as good as a tempfail, but it is better than causing downstream failures that are really hard to recover from because of job reuse. It will also tell us what the open() errors are, so we can figure out how to prevent them from happening at all and/or recover from them more gracefully.
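For illustration only, here is a minimal sketch of what checking those open() calls could look like. It is not the code from the actual changeset: the helper name redirect_or_die is hypothetical, and using die as the failure path is an assumption standing in for whatever crunch-job does in its other rare error cases.

```perl
use strict;
use warnings;

# Hypothetical helper (not from crunch-job): dup $src onto $dst and die with
# the OS error message if open() fails, instead of carrying on silently.
sub redirect_or_die {
    my ($dst, $src, $label) = @_;
    open($dst, ">&", $src)
        or die "open($label): $!\n";
}

pipe(my $reader, my $writer) or die "pipe: $!\n";

my $pid = fork();
die "fork: $!\n" unless defined $pid;

if ($pid == 0) {
    # Child: point both streams at the pipe, checking each open().
    close($reader);
    redirect_or_die(\*STDOUT, $writer, "STDOUT");
    redirect_or_die(\*STDERR, $writer, "STDERR");
    print STDERR "hello from child stderr\n";
    exit 0;
}

# Parent: read whatever the child wrote to the pipe.
close($writer);
my @lines = <$reader>;
waitpid($pid, 0);
print "captured: $_" for @lines;
```

With the check in place, a failed dup stops the child with a message containing $! rather than leaving stderr pointed somewhere unexpected.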
- Target version set to 2017-11-08 Sprint
Why not exit_retry_unlocked()?
Peter Amstutz wrote:
Why not exit_retry_unlocked()?
Less likely to introduce more complications. I stuck with what we do in other rare/unexpected error cases, like fcntl() failing.
OK, LGTM. If any of these things are failing silently, then at least they will now fail loudly.
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:98e9073e7ca36edbe8dd0569d67405e2e030f8db.