Bug #8811

[Crunch] `srun --nodes=1` reports "Unable to create job step: Required node not available (down or drained)" and exits 1

Added by Brett Smith over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assigned To: Brett Smith
Category: -
Target version:
Start date: 03/31/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 0.5

Description

Bug report

In a job that only had one node allocated to it:

2016-03-26_03:24:57 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686  load docker image: start
2016-03-26_03:24:57 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686  stderr starting: ['srun','--nodelist=compute3','/bin/bash','-o','pipefail','-ec',' if ! docker.io images -q --no-trunc --all | grep -qxF b85dffb1be2ca7bc757be6ff8ae4873a45214918282ef42cc2cbc2cead63356b; then     arv-get 877b0bb063029e309ec3d0624e75eeda\\+503\\/b85dffb1be2ca7bc757be6ff8ae4873a45214918282ef42cc2cbc2cead63356b\\.tar | docker.io load fi ']
2016-03-26_03:29:56 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686  load docker image: exit 0
2016-03-26_03:29:56 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686  check --memory-swap feature: start
2016-03-26_03:29:56 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686  stderr starting: ['srun','--nodes=1','docker.io','run','--help']
2016-03-26_03:29:56 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686  stderr srun: error: Unable to create job step: Required node not available (down or drained)
2016-03-26_03:29:56 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686  check --memory-swap feature: exit 1

This should be detected and treated the same way as any other node failure, by exiting so that crunch-dispatch retries the job.

Fix

Near the end of srun_sync, check whether $main::please_freeze or $Jobstep->{tempfail} (which may need to be referenced by a different name there) is true. If either is true and $exited is 0, coerce $exited to 255 so that the caller's normal error handling deals with the problem.
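
A minimal sketch of that check, written as a standalone function so it can run on its own. The name coerce_exit_status and its argument list are hypothetical; the real srun_sync in crunch-job may structure this differently and read the flags from its own variables.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the proposed check; not the actual
# crunch-job code. Arguments stand in for $exited, $main::please_freeze,
# and the job step's tempfail flag.
sub coerce_exit_status {
    my ($exited, $please_freeze, $tempfail) = @_;
    # If srun reported success but a freeze or temporary failure was
    # flagged, report 255 so the caller takes its normal error path.
    if ($exited == 0 && ($please_freeze || $tempfail)) {
        return 255;
    }
    # Otherwise pass the original exit status through unchanged.
    return $exited;
}

print coerce_exit_status(0, 0, 1), "\n";   # success + tempfail
print coerce_exit_status(1, 0, 1), "\n";   # already failing
print coerce_exit_status(0, 0, 0), "\n";   # clean success
```

A nonzero $exited is left alone, so the sketch only changes the case where srun would otherwise look successful.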


Subtasks

Task #8862: Review 8811-srun-sync-tempfail-wip (Resolved, Peter Amstutz)


Related issues

Related to Arvados - Bug #8810: [Crunch] `docker load` fails to connect to endpoint; srun exits 0 (Resolved, 04/05/2016)

Associated revisions

Revision 34d92b23
Added by Brett Smith over 5 years ago

Merge branch '8811-srun-sync-tempfail-wip'

Closes #8811, #8862.

History

#1 Updated by Brett Smith over 5 years ago

  • Description updated (diff)

#2 Updated by Brett Smith over 5 years ago

  • Target version set to Arvados Future Sprints

#3 Updated by Brett Smith over 5 years ago

  • Description updated (diff)

#4 Updated by Brett Smith over 5 years ago

  • Story points set to 0.5

#5 Updated by Brett Smith over 5 years ago

Related to #8810 because the proposed fix could potentially address that as well.

#6 Updated by Brett Smith over 5 years ago

  • Status changed from New to In Progress
  • Assigned To set to Brett Smith
  • Target version changed from Arvados Future Sprints to 2016-04-27 sprint

#7 Updated by Brett Smith over 5 years ago

  • Target version changed from 2016-04-27 sprint to 2016-04-13 sprint

#8 Updated by Peter Amstutz over 5 years ago

I'm a bit concerned about this section in preprocess_stderr:

    elsif (!exists $jobstep[$jobstepidx]->{slotindex}) {
      # Skip the following tempfail checks if this srun proc isn't
      # attached to a particular worker slot.
    }

srun_sync doesn't set "slotindex" on $jobstep, which suggests the tests for SLURM errors will get skipped. I think we want to adjust preprocess_stderr so that it still tests for SLURM errors and sets tempfail, and only skips the call to ban_node_by_slot.
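
Roughly, the adjusted behavior could look like the sketch below. Everything here is a simplified stand-in: check_stderr_line, the stub ban_node_by_slot, and the regular expression (based only on the error line in this ticket's log) are assumptions for illustration, not the actual preprocess_stderr code.

```perl
use strict;
use warnings;

# Hypothetical pattern, derived only from the srun error in this ticket.
my $SLURM_NODE_ERROR = qr/srun: error: .*Required node not available/;

# Stub standing in for the real ban_node_by_slot; it just records the slot.
my @banned_slots;
sub ban_node_by_slot { push @banned_slots, $_[0]; }

sub check_stderr_line {
    my ($jobstep, $line) = @_;
    return unless $line =~ $SLURM_NODE_ERROR;
    # Always record the temporary failure, even without a worker slot.
    $jobstep->{tempfail} = 1;
    # Only ban a node when this srun process is attached to a slot.
    ban_node_by_slot($jobstep->{slotindex}) if exists $jobstep->{slotindex};
}

my $line = "srun: error: Unable to create job step: "
         . "Required node not available (down or drained)";
my $sync_step = {};                   # like srun_sync: no slotindex
my $slot_step = { slotindex => 3 };   # attached to a worker slot

check_stderr_line($sync_step, $line);
check_stderr_line($slot_step, $line);
print "sync tempfail=$sync_step->{tempfail} banned=@banned_slots\n";
```

The point of the sketch is the ordering: the tempfail flag is set before the slotindex test, so a step without a slot still registers the failure.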

#9 Updated by Brett Smith over 5 years ago

Peter Amstutz wrote:

srun_sync doesn't set "slotindex" on $jobstep, which suggests the tests for SLURM errors will get skipped. I think we want to adjust preprocess_stderr so that it still tests for SLURM errors and sets tempfail, and only skips the call to ban_node_by_slot.

Thanks for catching that, that's absolutely right. Fixed in 5d981be.

#10 Updated by Brett Smith over 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:34d92b238ebc107bf28dac5c7e3ce138ac84b2c1.
