Bug #10004

Updated by Tom Clegg over 6 years ago

If an allocation includes multiple nodes, it's possible for a node to fail during one of the setup tasks executed by crunch-job (such as docker load). When a node fails, this can lead to a hang in @srun@.

The current code checks @squeue@ to see if the job has gone away, and kills @srun@ if it has. However, if one node fails but there are other nodes still in the allocation, the job will still be in @squeue@. As a result, the stuck @srun@ process isn't terminated.
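
One possible direction, shown only as a rough sketch and not as crunch-job's actual code: in addition to the existing @squeue@ check, look at the state of the specific node each @srun@ was sent to. The function names, the @FAILED_STATES@ set, and the choice of Python are all assumptions made for illustration. The sketch relies on @sinfo --format=%t@ printing a compact node state, with a trailing @*@ when the node is not responding.

```python
import re
import subprocess

# Assumption: base node states (as printed by `sinfo --format=%t`)
# that this sketch treats as "the node has failed".
FAILED_STATES = {"down", "fail"}


def node_has_failed(sinfo_state: str) -> bool:
    """Interpret one compact state string printed by sinfo."""
    state = sinfo_state.strip().lower()
    if state.endswith("*"):
        # sinfo marks a node that is not responding with a trailing "*".
        return True
    # Keep only the leading letters, dropping any other flag suffix.
    match = re.match(r"[a-z]+", state)
    return bool(match) and match.group(0) in FAILED_STATES


def should_kill_srun(job_listed: bool, sinfo_state: str) -> bool:
    """Kill srun if the job is gone (the existing check) or its node failed."""
    return (not job_listed) or node_has_failed(sinfo_state)


def job_in_squeue(job_id: str) -> bool:
    """Return True if squeue still lists the job."""
    out = subprocess.run(
        ["squeue", "--noheader", "--jobs", job_id, "--format", "%i"],
        capture_output=True, text=True,
    ).stdout
    return out.strip() != ""


def node_state(node: str) -> str:
    """Return the compact state sinfo reports for one node, or "" if none."""
    out = subprocess.run(
        ["sinfo", "--noheader", "--nodes", node, "--format", "%t"],
        capture_output=True, text=True,
    ).stdout
    fields = out.split()
    return fields[0] if fields else ""
```

A caller would evaluate @should_kill_srun(job_in_squeue(job_id), node_state(node))@ for each running @srun@. An empty or unrecognized state is treated as "not failed" so that a transient @sinfo@ error does not kill a healthy task.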