Bug #10004

[Crunch] Terminate on node failure in setup steps

Added by Peter Amstutz over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 09/13/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Description

If an allocation includes multiple nodes, it's possible for a node to fail during one of the setup tasks executed by crunch-job (such as docker load). When a node fails, srun can hang.

The current code checks squeue to see if the step has gone away, and kills srun if it has. However, if one node fails while other nodes remain in the allocation, the job still appears in squeue, so the stuck srun process is never terminated.
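
For illustration, a per-node health check along the lines suggested by the branch name (10004-check-sinfo) might look like the sketch below. It is written in Python rather than in crunch-job's own language and is not the code that was merged. The function names and the rule that only the compact state "alloc" counts as healthy are assumptions made for the example. The sinfo options used (--Node, --noheader, --format with %n and %t, --nodes) are standard sinfo options.

```python
import subprocess

# Assumption: a node that is still healthy and part of the allocation
# reports the compact state "alloc"; any other state is treated as failed.
HEALTHY_STATES = {"alloc"}


def unhealthy_nodes(sinfo_output):
    """Return {hostname: state} for nodes whose state is not healthy.

    Expects one "hostname state" pair per line, as produced by
    `sinfo --Node --noheader --format="%n %t"`.
    """
    bad = {}
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        host, state = fields
        if state not in HEALTHY_STATES:
            bad[host] = state
    return bad


def check_allocation(nodelist):
    """Query sinfo for the given nodes and report the unhealthy ones.

    Hypothetical helper: requires a working SLURM installation.
    """
    out = subprocess.run(
        ["sinfo", "--Node", "--noheader", "--format=%n %t",
         "--nodes=" + nodelist],
        check=True, capture_output=True, text=True,
    ).stdout
    return unhealthy_nodes(out)
```

Inside an allocation, the node list would typically come from the SLURM_NODELIST environment variable. If check_allocation returns a non-empty result during a setup step, the caller could kill the stuck srun and fail the job instead of waiting for squeue to change.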


Subtasks

Task #10010: Review 10004-check-sinfo (Resolved, Tom Clegg)

Associated revisions

Revision b54478ea
Added by Peter Amstutz over 3 years ago

Merge branch '10004-check-sinfo' closes #10004

History

#1 Updated by Peter Amstutz over 3 years ago

  • Subject changed from [Crunch] Handle partial node failure to [Crunch] Handle partial node failure in setup steps
  • Description updated (diff)

#2 Updated by Peter Amstutz over 3 years ago

  • Description updated (diff)

#3 Updated by Tom Clegg over 3 years ago

  • Description updated (diff)

#4 Updated by Peter Amstutz over 3 years ago

  • Subject changed from [Crunch] Handle partial node failure in setup steps to [Crunch] Terminate on node failure in setup steps
  • Assigned To set to Peter Amstutz
  • Target version set to 2016-09-14 sprint

#5 Updated by Tom Clegg over 3 years ago

10004-check-sinfo @ 19ad5c5

This should have a comment explaining why it's necessary, and why it only makes sense during srun_sync (i.e., at that point, the failure of even one node makes any further effort futile).

The $last_sinfo_check variable is superfluous. Might as well just compare and update $sinfo_checked directly.

Other than that, LGTM. Thanks.
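
As a rough illustration of the single-timestamp suggestion in this review, the following Python sketch throttles a periodic check using one stored timestamp that is both compared and updated in place. The class name, the interval and the injectable clock are hypothetical and do not come from the branch under review.

```python
import time


class ThrottledCheck:
    """Run a check at most once every `interval` seconds."""

    def __init__(self, check, interval, clock=time.monotonic):
        self.check = check
        self.interval = interval
        self.clock = clock
        # Single timestamp: compared and updated directly, with no
        # second "last check" variable alongside it.
        self.last_checked = None

    def poll(self):
        """Run the check if it is due; return True if it ran."""
        now = self.clock()
        if (self.last_checked is not None
                and now - self.last_checked < self.interval):
            return False
        self.last_checked = now
        self.check()
        return True
```

A polling loop would call poll() on every iteration and let the object decide whether enough time has passed to query the scheduler again.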

#6 Updated by Peter Amstutz over 3 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:b54478ea1b7c8aaeaf565d591f32769bcdc09b8f.
