Bug #8811

Updated by Brett Smith about 8 years ago

h2. Bug report 

 In a job that only had one node allocated to it: 

 <pre>2016-03-26_03:24:57 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686    load docker image: start 
 2016-03-26_03:24:57 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686    stderr starting: ['srun','--nodelist=compute3','/bin/bash','-o','pipefail','-ec',' if ! docker.io images -q --no-trunc --all | grep -qxF b85dffb1be2ca7bc757be6ff8ae4873a45214918282ef42cc2cbc2cead63356b; then       arv-get 877b0bb063029e309ec3d0624e75eeda\\+503\\/b85dffb1be2ca7bc757be6ff8ae4873a45214918282ef42cc2cbc2cead63356b\\.tar | docker.io load fi '] 
 2016-03-26_03:29:56 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686    load docker image: exit 0 
 2016-03-26_03:29:56 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686    check --memory-swap feature: start 
 2016-03-26_03:29:56 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686    stderr starting: ['srun','--nodes=1','docker.io','run','--help'] 
 2016-03-26_03:29:56 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686    stderr srun: error: Unable to create job step: Required node not available (down or drained) 
 2016-03-26_03:29:56 wx7k5-8i9sb-1wvsbbyc8vsteeb 24686    check --memory-swap feature: exit 1</pre> 

 The @srun@ failure above should be detected and treated the same way as any other node failure: exit so that crunch-dispatch retries the job. 

 h2. Fix 

 Near the end of @srun_sync@, check whether @$main::please_freeze@ or @$Jobstep->{tempfail}@ (you might need to refer to it by a different name) is true. If either is true and @$exited@ is 0, coerce @$exited@ to 255, so that the caller uses its normal error handling code to deal with the problem. 
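 The check described above could look roughly like the following. This is only a sketch under stated assumptions: @coerce_exit_code@ and its three arguments are hypothetical names standing in for @$exited@, @$main::please_freeze@ and @$Jobstep->{tempfail}@, and the real change would most likely be a few lines inline in @srun_sync@ rather than a separate helper.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the existing code: given the exit code
# of the srun command and the two failure flags, return the exit code that
# should be reported to the caller.
sub coerce_exit_code {
    my ($exited, $please_freeze, $tempfail) = @_;

    # An exit code of 0 cannot be trusted if a node failure was flagged
    # while the command ran, so report 255 instead.
    if ($exited == 0 && ($please_freeze || $tempfail)) {
        return 255;
    }

    # Otherwise pass the original exit code through unchanged.
    return $exited;
}

# Stand-in values for the flags; a real caller would read the actual
# variables near the end of srun_sync.
print coerce_exit_code(0, 0, 0), "\n";  # no flag set, success stays 0
print coerce_exit_code(0, 1, 0), "\n";  # freeze requested, coerced to 255
print coerce_exit_code(0, 0, 1), "\n";  # tempfail set, coerced to 255
print coerce_exit_code(1, 0, 1), "\n";  # already nonzero, left as 1
```

 A nonzero exit code is left alone because the caller already treats it as an error; only an apparent success is overridden.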
