Bug #10004

closed

[Crunch] Terminate on node failure in setup steps

Added by Peter Amstutz over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -

Description

If an allocation includes multiple nodes, it's possible for a node to fail during one of the setup tasks executed by crunch-job (such as docker load). When a node fails, srun can hang.

The current code checks squeue to see whether the job step has gone away and, if it has, kills srun. However, if one node fails while other nodes remain in the allocation, the job still appears in squeue, so the stuck srun process is never terminated.
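One possible way to close this gap, sketched here in Python purely as an illustration and not as the actual crunch-job change, is to ask sinfo for the state of each allocated node and treat a down, drained, failed or non-responding node as a reason to kill srun. The helper names (`failed_nodes`, `query_sinfo`), the set of states counted as failures, and the choice to strip markers such as `*` from the state code are all assumptions.

```python
import subprocess

# Assumption: these compact sinfo state codes mean the node is unusable.
FAILED_STATES = {"down", "drain", "fail"}


def failed_nodes(sinfo_output):
    """Return sorted names of nodes whose reported state is in FAILED_STATES.

    Expects one "<node> <state>" pair per line, as produced by
    `sinfo --noheader --Node --format="%N %t"`.
    """
    failed = set()
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, state = fields[0], fields[1]
        # sinfo may append a marker such as "*" (node not responding);
        # keep only the alphabetic part of the state code.
        base = "".join(ch for ch in state if ch.isalpha()).lower()
        if base in FAILED_STATES:
            failed.add(name)
    return sorted(failed)


def query_sinfo(nodelist):
    """Run sinfo for the given node list (hypothetical helper)."""
    result = subprocess.run(
        ["sinfo", "--noheader", "--Node",
         "--nodes=" + nodelist, "--format=%N %t"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def kill_srun_if_node_failed(srun_proc, nodelist):
    """Terminate a subprocess.Popen running srun if any node has failed."""
    bad = failed_nodes(query_sinfo(nodelist))
    if bad:
        srun_proc.terminate()
    return bad
```

Unlike the squeue check, this looks at individual node state, so it can fire even while the rest of the allocation is healthy and the job is still listed.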


Subtasks 1 (0 open, 1 closed)

Task #10010: Review 10004-check-sinfo | Resolved | Tom Clegg | 09/13/2016