Bug #10004
Updated by Peter Amstutz over 8 years ago
If an allocation includes multiple nodes, it's possible for a node to fail during one of the setup tasks executed by crunch-job (such as docker load). When a node fails, this can lead to a hang in @srun@.
The current code checks @squeue@ to see if the job has gone away and, if it has, kills @srun@. However, if one node fails while other nodes are still in the allocation, the job will still be listed in @squeue@, so that check never triggers and @srun@ stays hung.
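One possible way to catch this case is to look at the state of each node in the allocation, not only at whether the job is still listed. The following is a rough sketch in Python, not the crunch-job code: the function names, the @FAILED_STATES@ set, and the choice of which states count as failed are all assumptions made for illustration. It relies on @squeue -h -j JOB -o %N@ to get the job's node list and @sinfo -h -N -n NODES -o "%N %T"@ to get one "name state" line per node.

```python
import subprocess

# Assumption: these sinfo state names mean the node can no longer run tasks.
FAILED_STATES = {"down", "fail", "failing"}


def failed_nodes(sinfo_output):
    """Return node names that look failed, given 'NAME STATE' lines from sinfo."""
    failed = []
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, state = parts
        # sinfo appends '*' to the state of a node that is not responding.
        not_responding = state.endswith("*")
        base_state = state.rstrip("*~#$@+").lower()
        is_failed = not_responding or base_state in FAILED_STATES
        # With -N a node can appear once per partition, so avoid duplicates.
        if is_failed and name not in failed:
            failed.append(name)
    return failed


def failed_nodes_in_allocation(job_id):
    """Hypothetical helper: ask SLURM which nodes of a job's allocation failed."""
    nodelist = subprocess.check_output(
        ["squeue", "-h", "-j", str(job_id), "-o", "%N"], text=True
    ).strip()
    if not nodelist:
        # Job is no longer in squeue; the existing check already covers this.
        return []
    sinfo_output = subprocess.check_output(
        ["sinfo", "-h", "-N", "-n", nodelist, "-o", "%N %T"], text=True
    )
    return failed_nodes(sinfo_output)
```

If @failed_nodes_in_allocation@ returns a non-empty list while @srun@ is still running a setup task, the caller could treat that the same way as the job disappearing from @squeue@ and kill @srun@. Whether states such as draining should also count is a policy decision this sketch leaves open.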