Bug #4368
closed[Crunch] Improve node failure detection and job retry logic
Description
One of Crunch's ideal responsibilities is that it's supposed to detect when a job fails because a node failed, and retry the job on a different node. Unfortunately, it's not coping so well when Node Manager shuts down nodes. There are a couple of ways this can go:
- In between the time Node Manager issues the shutdown command, and the time that SLURM notices that the node is down, Crunch may decide to try to dispatch work to the node. #4334 will shrink this window of time, but there's no way to close it completely. In this case, the initial node allocation will usually fail, and this is the most common failure mode we're seeing for jobs right now.
- There may be a rapid succession of events where Crunch assigns work to a node, the node enters a shutdown window, and Node Manager decides to shut it down before it sees the new assignment. In this case, the initial allocation succeeds, and the job may even officially start, but it will die unceremoniously before long.
Crunch should be able to detect both these cases. It should not mark the job failed, but instead retry it on a functional node.
Peter brought up this issue of Node Manager causing job failures during the code review, and the result was #4127, a proposed addition to the Node API to safely declare shutdowns. Reflecting on it some more, better detection and handling of node failures is something we've always wanted Crunch to do anyway; useful in contexts beyond Node Manager intentionally shutting down nodes; and would probably take similar or even less development time—there's no new API to document and test, no clients to update, just internal Crunch logic.