[Crunch] Improve node failure detection and job retry logic
Ideally, Crunch should detect when a job fails because its node failed, and retry the job on a different node. Unfortunately, it copes poorly when Node Manager shuts down nodes. There are a couple of ways this can go:
- In between the time Node Manager issues the shutdown command, and the time that SLURM notices that the node is down, Crunch may decide to try to dispatch work to the node. #4334 will shrink this window of time, but there's no way to close it completely. In this case, the initial node allocation will usually fail, and this is the most common failure mode we're seeing for jobs right now.
- There may be a rapid succession of events where Crunch assigns work to a node, the node enters a shutdown window, and Node Manager decides to shut it down before it sees the new assignment. In this case, the initial allocation succeeds, and the job may even officially start, but it will die unceremoniously before long.
Crunch should be able to detect both these cases. It should not mark the job failed, but instead retry it on a functional node.
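To make the desired behavior concrete, here is a minimal sketch in Python of the retry policy described above. It is not Crunch's actual code: `Outcome`, `JobResult`, `run_with_node_retry`, the `max_node_failures` cap, and the state strings are all hypothetical names chosen for illustration. It assumes the dispatcher can already tell a node failure (a failed allocation, or a node dying mid-run) apart from a failure of the job itself, which is the hard part this issue asks for.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Iterable, List, Optional, Tuple


class Outcome(Enum):
    """Hypothetical classification of a single attempt to run a job."""
    SUCCESS = "success"
    JOB_ERROR = "job_error"        # the job itself failed; retrying won't help
    NODE_FAILURE = "node_failure"  # allocation failed, or the node died mid-run


@dataclass
class JobResult:
    state: str                     # illustrative final state: "Complete" or "Failed"
    node: Optional[str]            # node of the deciding attempt, if any
    attempts: List[Tuple[str, Outcome]] = field(default_factory=list)


def run_with_node_retry(
    nodes: Iterable[str],
    run_on_node: Callable[[str], Outcome],
    max_node_failures: int = 3,    # assumed cap so a broken cluster can't loop forever
) -> JobResult:
    """Try the job on each candidate node, retrying only after node failures."""
    attempts: List[Tuple[str, Outcome]] = []
    node_failures = 0
    for node in nodes:
        outcome = run_on_node(node)
        attempts.append((node, outcome))
        if outcome is Outcome.SUCCESS:
            return JobResult("Complete", node, attempts)
        if outcome is Outcome.JOB_ERROR:
            # A genuine job failure is reported as-is, not retried.
            return JobResult("Failed", node, attempts)
        # Node failure: do not mark the job failed; move on to another node.
        node_failures += 1
        if node_failures >= max_node_failures:
            break
    # Ran out of nodes or hit the cap without ever reaching a working node.
    return JobResult("Failed", None, attempts)


if __name__ == "__main__":
    # Simulate Node Manager having shut down compute0 just before dispatch.
    def fake_run(node: str) -> Outcome:
        return Outcome.NODE_FAILURE if node == "compute0" else Outcome.SUCCESS

    result = run_with_node_retry(["compute0", "compute1"], fake_run)
    print(result.state, result.node)  # prints "Complete compute1"
```

Both failure modes in the list above map onto `NODE_FAILURE` in this sketch: the first surfaces when allocation fails, the second when a started job dies along with its node. The cap on node failures is a design assumption, not something the issue specifies.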
Peter brought up this issue of Node Manager causing job failures during the code review, and the result was #4127, a proposed addition to the Node API to safely declare shutdowns. Reflecting on it some more, better detection and handling of node failures is something we've always wanted Crunch to do anyway. It would be useful in contexts beyond Node Manager intentionally shutting down nodes, and it would probably take similar or even less development time: there's no new API to document and test, no clients to update, just internal Crunch logic.
#3 Updated by Brett Smith over 5 years ago
- Status changed from New to Closed
- Target version deleted (Arvados Future Sprints)
Now that #4380 is done and we've seen great results, I'm closing this issue. This issue's description is very focused on the interaction between Node Manager and Crunch, and the work done on #4380 has addressed both potential problems outlined here.
The work named in this issue's subject line remains to be done, but given this issue's description and background, I think it'd be better to let that work be taken over by stories that started out more generally, like #5064.