Bug #10004

Updated by Peter Amstutz over 7 years ago

If an allocation includes multiple nodes, a node can fail during one of the setup tasks executed by crunch-job (such as @docker load@). When that happens, @srun@ can hang.

The current code checks @squeue@ to see whether the job has gone away and, if so, kills @srun@. However, if one node fails while other nodes remain in the allocation, the job still appears in @squeue@. As a result, the stuck @srun@ process is never terminated.
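One possible direction is to check the state of each node in the allocation, rather than only whether the job still exists. The sketch below is illustrative and is not taken from crunch-job: the function names, the set of states treated as failed, and the choice to treat a non-responding node (trailing @*@) as failed are all assumptions. It parses @sinfo@ output in the form "node state", one pair per line.

```python
import subprocess

# Assumption: these base states mean the node can no longer run tasks.
BAD_STATES = {"down", "drained", "draining", "fail", "failing"}


def query_node_states(nodelist):
    """Ask sinfo for one "<node> <state>" line per node (hypothetical helper)."""
    result = subprocess.run(
        ["sinfo", "--noheader", "--Node", "--nodes=" + nodelist,
         "--format=%N %T"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def failed_nodes(sinfo_output):
    """Return the names of nodes whose reported state looks failed."""
    failed = []
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        node, state = parts
        # Keep only the letters so suffix flags such as "*" are ignored.
        base = "".join(c for c in state if c.isalpha()).lower()
        # Assumption: a trailing "*" (node not responding) counts as failed.
        if state.endswith("*") or base in BAD_STATES:
            if node not in failed:  # a node may be listed once per partition
                failed.append(node)
    return failed
```

A caller could run @failed_nodes(query_node_states(nodelist))@ periodically while @srun@ is active and kill @srun@ if the result is non-empty, even though the job is still listed in @squeue@.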