Bug #10004
Updated by Tom Clegg over 8 years ago
If an allocation includes multiple nodes, it's possible for a node to fail during one of the setup tasks executed by crunch-job (such as docker load). When a node fails, this can lead to a hang in @srun@. The current code checks @squeue@ to see if the job has gone away, and kills @srun@ if it has. However, if one node fails but there are other nodes still in the allocation, the job will still be in @squeue@. As a result, the stuck @srun@ process isn't terminated.
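
To make the gap concrete, here is a rough sketch of the decision logic, not the actual crunch-job code. It assumes the caller has already collected the job IDs listed by @squeue@ and a per-node state map (for example from @sinfo@). The function name, its parameters, and the set of "failed" states are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch; crunch-job does not necessarily work this way.
# Assumed subset of Slurm node states that we treat as "failed".
FAILED_NODE_STATES = {"down", "fail", "failing"}


def should_kill_srun(job_id, queued_job_ids, srun_nodes, node_states):
    """Return True if a possibly stuck srun process should be terminated.

    job_id         -- ID of the job the srun belongs to
    queued_job_ids -- set of job IDs currently reported by squeue
    srun_nodes     -- nodes the srun invocation is running on
    node_states    -- mapping of node name to its reported state
    """
    # The check described above: the job has gone away from squeue.
    if job_id not in queued_job_ids:
        return True
    # Extra check for the multi-node case: the job is still queued,
    # but one of the nodes this srun depends on has failed.
    for node in srun_nodes:
        # Slurm appends "*" to the state of a non-responding node.
        state = node_states.get(node, "unknown").lower().rstrip("*")
        if state in FAILED_NODE_STATES:
            return True
    return False
```

With only the first check, a job that is still in @squeue@ never triggers a kill, which is the behaviour reported here. The second check is one possible way to cover the case where a single node in the allocation has failed.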