Bug #8811
closed
[Crunch] `srun --nodes=1` reports "Unable to create job step: Required node not available (down or drained)" and exits 1
Description
Bug report
In a job that only had one node allocated to it:
2016-03-26_03:24:57 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686 load docker image: start
2016-03-26_03:24:57 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686 stderr starting: ['srun','--nodelist=compute3','/bin/bash','-o','pipefail','-ec',' if ! docker.io images -q --no-trunc --all | grep -qxF b85dffb1be2ca7bc757be6ff8ae4873a45214918282ef42cc2cbc2cead63356b; then arv-get 877b0bb063029e309ec3d0624e75eeda\\+503\\/b85dffb1be2ca7bc757be6ff8ae4873a45214918282ef42cc2cbc2cead63356b\\.tar | docker.io load fi ']
2016-03-26_03:29:56 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686 load docker image: exit 0
2016-03-26_03:29:56 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686 check --memory-swap feature: start
2016-03-26_03:29:56 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686 stderr starting: ['srun','--nodes=1','docker.io','run','--help']
2016-03-26_03:29:56 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686 stderr srun: error: Unable to create job step: Required node not available (down or drained)
2016-03-26_03:29:56 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686 check --memory-swap feature: exit 1
This should be detected and treated the same way as any other node failure, by exiting so that crunch-dispatch retries the job.
Fix
Near the end of srun_sync, check whether $main::please_freeze or $Jobstep->{tempfail} (you might need to refer to it by a different name) is true. If either of them is true and $exited is 0, coerce $exited to 255, so that the caller uses its normal error handling code to deal with the problem.
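The coercion described above can be sketched as follows. This is only an illustration in Python of the proposed logic, not the actual crunch-job code (which is Perl); the function name `coerce_exit_code` and its parameters are hypothetical stand-ins for `$exited`, `$main::please_freeze`, and `$Jobstep->{tempfail}`.

```python
def coerce_exit_code(exited, please_freeze, tempfail):
    """Return the exit code that the caller of srun_sync should see.

    Hypothetical helper: if srun itself reported success (0) but a freeze
    was requested or a temporary failure was recorded, report 255 instead,
    so the caller's normal error handling runs.
    """
    if exited == 0 and (please_freeze or tempfail):
        return 255
    # Any other exit code is passed through unchanged.
    return exited
```

A nonzero exit code is left alone in this sketch, since the proposal only talks about the case where `$exited` is 0.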
Updated by Brett Smith over 8 years ago
- Target version set to Arvados Future Sprints
Updated by Brett Smith over 8 years ago
Related to #8810 because the proposed fix could potentially address that as well.
Updated by Brett Smith over 8 years ago
- Status changed from New to In Progress
- Assigned To set to Brett Smith
- Target version changed from Arvados Future Sprints to 2016-04-27 sprint
Updated by Brett Smith over 8 years ago
- Target version changed from 2016-04-27 sprint to 2016-04-13 sprint
Updated by Peter Amstutz over 8 years ago
I'm a bit concerned about this section in preprocess_stderr:
    elsif (!exists $jobstep[$jobstepidx]->{slotindex}) {
      # Skip the following tempfail checks if this srun proc isn't
      # attached to a particular worker slot.
    }
srun_sync doesn't set "slotindex" on $jobstep, which suggests the tests for SLURM errors will get skipped. I think we want to adjust the behavior of preprocess_stderr so that they still test for SLURM errors and set tempfail and just skip the call to ban_node_by_slot.
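A rough sketch of that adjustment, written in Python purely for illustration (crunch-job itself is Perl). Everything here is an assumption except the names `tempfail`, `slotindex`, and `ban_node_by_slot`, which come from the discussion: the function name `preprocess_stderr_line` is hypothetical, and the pattern matches only the one SLURM message quoted in this report, whereas the real checks may cover more.

```python
import re

# Assumption: only the message from this bug report is matched here.
SLURM_TEMPFAIL = re.compile(
    r"srun: error: Unable to create job step: Required node not available"
)


def preprocess_stderr_line(jobstep, line, ban_node_by_slot):
    """Flag SLURM node failures on a job step (hypothetical sketch).

    jobstep is a dict standing in for the Perl $jobstep hash;
    ban_node_by_slot is a callable taking a slot index.
    """
    if not SLURM_TEMPFAIL.search(line):
        return
    # Always record the temporary failure, even without a worker slot.
    jobstep["tempfail"] = True
    # Only ban a node when the step is attached to a particular slot.
    if "slotindex" in jobstep:
        ban_node_by_slot(jobstep["slotindex"])
```

The point of the design is that the slot check guards only the ban, so a step started without a slot (as described for srun_sync) still gets `tempfail` set.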
Updated by Brett Smith over 8 years ago
Peter Amstutz wrote:
srun_sync doesn't set "slotindex" on $jobstep, which suggests the tests for SLURM errors will get skipped. I think we want to adjust the behavior of preprocess_stderr so that they still test for SLURM errors and set tempfail and just skip the call to ban_node_by_slot.
Thanks for catching that, that's absolutely right. Fixed in 5d981be.
Updated by Brett Smith over 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:34d92b238ebc107bf28dac5c7e3ce138ac84b2c1.