[Crunch] Job was repeatedly retried on same bad compute node until abandoned
This is the last log, from the logs table:
2016-03-26_20:51:23 salloc: Granted job allocation 228
2016-03-26_20:51:23 13514 Sanity check is `docker.io ps -q`
2016-03-26_20:51:23 13514 sanity check: start
2016-03-26_20:51:23 13514 stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','docker.io','ps','-q']
2016-03-26_20:51:23 13514 stderr srun: error: Task launch for 228.0 failed on node compute15: No such file or directory
2016-03-26_20:51:23 13514 stderr srun: error: Application launch failed: No such file or directory
2016-03-26_20:51:23 13514 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-26_20:51:23 13514 stderr srun: error: Timed out waiting for job step to complete
2016-03-26_20:51:23 13514 sanity check: exit 2
2016-03-26_20:51:23 13514 Sanity check failed: 2
2016-03-26_20:51:23 salloc: Relinquishing job allocation 228
The job was marked failed immediately after this. That's a little surprising: why wasn't it retried as intended?
#3 Updated by Brett Smith almost 5 years ago
The system behaved as designed, but I wonder if we need a design improvement.
The original error was basically the same as #8810, except it failed early enough that arv-get also reported an error and exited 1:
2016-03-26_20:50:32 wx7k5-8i9sb-f0ygdqygwonamfr 12962 load docker image: start
2016-03-26_20:50:32 wx7k5-8i9sb-f0ygdqygwonamfr 12962 stderr starting: ['srun','--nodelist=compute15','/bin/bash','-o','pipefail','-ec',' if ! docker.io images -q --no-trunc --all | grep -qxF d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83; then arv-get [redacted hash]\\.tar | docker.io load fi ']
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962 stderr Post http:///var/run/docker.sock/v1.20/images/load: EOF.
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962 stderr * Are you trying to connect to a TLS-enabled daemon without TLS?
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962 stderr * Is your docker daemon up and running?
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962 stderr Traceback (most recent call last):
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962 stderr File "/usr/local/bin/arv-get", line 209, in <module>
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962 stderr outfile.write(data)
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962 stderr IOError: [Errno 32] Broken pipe
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962 stderr srun: error: compute15: task 0: Exited with exit code 1
2016-03-26_20:50:53 wx7k5-8i9sb-f0ygdqygwonamfr 12962 load docker image: exit 1
2016-03-26_20:50:53 salloc: Relinquishing job allocation 223
Here crunch-job exited with EX_RETRY_UNLOCKED. Fair enough.
crunch-dispatch tried to run the job three more times. Each time, it was allocated to compute15, and failed in the sanity check as in the description. So crunch-dispatch gave up:
2016-03-26_20:51:23.35132 dispatch: job wx7k5-8i9sb-f0ygdqygwonamfr exceeded node failure retry limit -- giving up
#5 Updated by Brett Smith almost 5 years ago
Idea: Give crunch-job a new exit code to specifically mark the sanity check failed. crunch-dispatch should recognize this exit code and blacklist the allocated compute node(s) for a little while (a configurable amount of time?).
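To make the idea concrete, here's a rough sketch of the bookkeeping the dispatcher would need. It's written in Python for illustration only and is not crunch-dispatch code. Everything named here is hypothetical: the EX_SANITY_CHECK_FAILED constant and its value, the NodeBlacklist class, handle_job_exit, and the 10-minute default.

```python
import time

# Hypothetical exit code for "sanity check failed"; crunch-job does not
# define this today, and the numeric value is a placeholder.
EX_SANITY_CHECK_FAILED = 94


class NodeBlacklist:
    """Remembers bad nodes for a configurable amount of time."""

    def __init__(self, ttl_seconds=600.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds  # how long a node stays blacklisted
        self._clock = clock             # injectable so tests can fake time
        self._expiry = {}               # node name -> time the entry expires

    def add(self, node):
        """Blacklist a node, or refresh its entry if already listed."""
        self._expiry[node] = self._clock() + self.ttl_seconds

    def active(self):
        """Return the currently blacklisted nodes, dropping expired ones."""
        now = self._clock()
        self._expiry = {n: t for n, t in self._expiry.items() if t > now}
        return sorted(self._expiry)

    def usable(self, nodes):
        """Filter a list of candidate nodes down to the non-blacklisted ones."""
        blocked = set(self.active())
        return [n for n in nodes if n not in blocked]


def handle_job_exit(exit_code, allocated_nodes, blacklist):
    """Blacklist the allocated nodes if the job failed its sanity check.

    Returns True when the nodes were blacklisted, False otherwise.
    """
    if exit_code == EX_SANITY_CHECK_FAILED:
        for node in allocated_nodes:
            blacklist.add(node)
        return True
    return False
```

With something like this in place, the dispatcher could pass blacklist.active() to SLURM when it retries (for example through salloc's --exclude option), so the retry doesn't land on compute15 again. Whether a retry that follows a sanity check failure should still count against the node failure retry limit is a separate decision.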
Right now this doesn't seem super high priority, since it's only happened literally once. We're going to wait a little bit and watch out for recurrences.