Bug #10004
closed [Crunch] Terminate on node failure in setup steps
Description
If an allocation includes multiple nodes, it's possible for a node to fail during one of the setup tasks executed by crunch-job (such as docker load). When a node fails, srun can hang.
The current code checks squeue to see if the step has gone away, and kills srun if it has. However, if one node fails but there are other nodes still in the allocation, the job will still be listed in squeue. As a result, the stuck srun process isn't terminated.
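One way to notice a failed node while the job is still listed in squeue is to ask sinfo for per-node state instead. The following is a minimal sketch in Python, not the crunch-job code. The names failed_nodes, query_sinfo, and HEALTHY_STATES are hypothetical, and the set of states treated as healthy is an assumption that would need tuning for a real cluster.

```python
import subprocess

# Assumption: compact Slurm node states that count as "still usable" while a
# job is running. Anything else, including states with a "*" suffix
# (node not responding), is treated as a failure.
HEALTHY_STATES = frozenset({"alloc", "mix", "comp"})


def failed_nodes(sinfo_output, healthy=HEALTHY_STATES):
    """Return the names of nodes whose state is not in `healthy`.

    Expects one "<node> <state>" pair per line, as printed by
    `sinfo --noheader -N -o "%N %t"`.
    """
    bad = []
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        node, state = fields
        if state not in healthy:
            bad.append(node)
    return bad


def query_sinfo(nodelist):
    """Run sinfo for the nodes in the allocation (requires Slurm)."""
    result = subprocess.run(
        ["sinfo", "--noheader", "-N", "-n", nodelist, "-o", "%N %t"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A caller would pass the allocation's node list (for example the value of SLURM_NODELIST) to query_sinfo and kill the stuck srun process if failed_nodes returns anything.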
Updated by Peter Amstutz over 8 years ago
- Subject changed from [Crunch] Handle partial node failure to [Crunch] Handle partial node failure in setup steps
- Description updated (diff)
Updated by Peter Amstutz over 8 years ago
- Subject changed from [Crunch] Handle partial node failure in setup steps to [Crunch] Terminate on node failure in setup steps
- Assigned To set to Peter Amstutz
- Target version set to 2016-09-14 sprint
Updated by Tom Clegg over 8 years ago
10004-check-sinfo @ 19ad5c5
This should have a comment explaining why it's necessary, and why it only makes sense during srun_sync (i.e., at that point, the failure of even one node makes any further effort futile).
The $last_sinfo_check variable is superfluous. Might as well just compare and update $sinfo_checked directly.
Other than that, LGTM. Thanks.
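For illustration, the single-variable approach suggested in this review could look like the sketch below. It is written in Python rather than the Perl of the branch under review, the class name SinfoThrottle is hypothetical, and the 15-second interval is an assumed value.

```python
import time


class SinfoThrottle:
    """Rate-limit sinfo checks using a single timestamp."""

    def __init__(self, interval=15.0, clock=time.monotonic):
        self.interval = interval   # assumed minimum seconds between checks
        self.clock = clock         # injectable for testing
        self.sinfo_checked = None  # time of the last check, if any

    def due(self):
        """Return True if a check should run now, and record the time."""
        now = self.clock()
        if (self.sinfo_checked is not None
                and now - self.sinfo_checked < self.interval):
            return False
        # Compare and update the same variable; no second copy is needed.
        self.sinfo_checked = now
        return True
```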
Updated by Peter Amstutz over 8 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:b54478ea1b7c8aaeaf565d591f32769bcdc09b8f.