[Crunch] Terminate on node failure in setup steps
If an allocation includes multiple nodes, it's possible for a node to fail during one of the setup tasks executed by crunch-job (such as docker load). When a node fails, srun can hang. The current code checks squeue to see if the step has gone away, and kills srun. However, if one node fails but there are other nodes still in the allocation, the job will still be in squeue. As a result, the stuck srun process isn't terminated.
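
One way to catch this case is to ask sinfo about the individual nodes in the allocation rather than relying on squeue alone. The following is a minimal Python sketch of that idea, not the crunch-job implementation. The function names, the set of states treated as failed, and the exact sinfo invocation are assumptions made for illustration.

```python
import subprocess

# Compact sinfo state codes treated as "node is gone" in this sketch.
# The exact set is an assumption; adjust it for the cluster at hand.
FAILED_STATES = {"down", "drain", "drng", "fail", "failg"}


def failed_nodes(sinfo_output):
    """Return hostnames whose reported state suggests the node has failed.

    Expects one "hostname state" pair per line.
    """
    failed = []
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        host, state = fields
        # A trailing "*" marks a node that is not responding.
        if state.endswith("*") or state.rstrip("*") in FAILED_STATES:
            failed.append(host)
    return failed


def check_allocation(nodelist):
    """Hypothetical helper: run sinfo for the given nodes and report failures."""
    result = subprocess.run(
        ["sinfo", "--noheader", "--Node",
         "--nodes=" + nodelist, "--format=%n %t"],
        capture_output=True, text=True, check=True,
    )
    return failed_nodes(result.stdout)
```

In this sketch, a caller waiting on a setup step would invoke check_allocation periodically and kill the stuck srun process as soon as the returned list is non-empty, since the job is still listed in squeue and would not otherwise be cleaned up.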
#5 Updated by Tom Clegg over 3 years ago
10004-check-sinfo @ 19ad5c5
This should have a comment explaining why it's necessary, and why it only makes sense during srun_sync (i.e., at that point, the failure of even one node makes any further effort futile).
$last_sinfo_check variable is superfluous. Might as well just compare and update
Other than that, LGTM. Thanks.