Bug #6481
closed
[Crunch] Job infinitely stuck in Queued state after Sanity check fails 1
Description
Looking at: https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-0ohrxciza6yival#Log
It seems that the sanity check is `/usr/bin/docker.io ps -q`:
2015-07-06_20:09:30 6213 starting: ['srun','--nodes=1','--ntasks-per-node=1','/usr/bin/docker.io','ps','-q']
2015-07-06_20:09:30 time="2015-07-06T20:09:30Z" level=fatal msg="Get http:///var/run/docker.sock/v1.18/containers/json: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?"
2015-07-06_20:09:30 srun: error: compute0: task 0: Exited with exit code 1
2015-07-06_20:09:30 6213 Sanity check failed: 1
Docker is down on the compute node, so the sanity check can never succeed, and this error message keeps getting repeated indefinitely while the job stays queued.
Two ideas:
The job should fail if the sanity check fails 1.
There should be a maximum number of sanity check retries.
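Both ideas could be combined roughly as in the sketch below. This is only an illustration in Python, not the actual crunch-job code: `MAX_SANITY_RETRIES`, `SanityCheckFailed`, `run_command` and `sanity_check` are hypothetical names, and the retry limit of 3 is an arbitrary assumption. The command is the one shown in the log above.

```python
import subprocess
import time

# Command taken from the log above.
SANITY_CMD = ["srun", "--nodes=1", "--ntasks-per-node=1",
              "/usr/bin/docker.io", "ps", "-q"]

# Hypothetical limit; the right value is a policy decision.
MAX_SANITY_RETRIES = 3


class SanityCheckFailed(Exception):
    """Raised once retries are exhausted, so the caller can mark the job failed."""


def run_command(cmd):
    """Run cmd and return its exit code."""
    return subprocess.call(cmd)


def sanity_check(cmd=SANITY_CMD, max_retries=MAX_SANITY_RETRIES,
                 delay=10.0, runner=run_command, sleep=time.sleep):
    """Run the sanity check, retrying at most max_retries times.

    Returns the number of attempts used on success.
    Raises SanityCheckFailed instead of looping forever.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    attempts = max_retries + 1  # one initial try plus the retries
    exit_code = None
    for attempt in range(1, attempts + 1):
        exit_code = runner(cmd)
        if exit_code == 0:
            return attempt
        print("Sanity check failed: %d (attempt %d of %d)"
              % (exit_code, attempt, attempts))
        if attempt < attempts:
            sleep(delay)  # wait before the next retry
    raise SanityCheckFailed(
        "sanity check failed %d times, last exit code %d"
        % (attempts, exit_code))
```

With `max_retries=0` this behaves like the first idea (fail the job on the first sanity check failure); a larger value gives the second. The caller would catch `SanityCheckFailed` and move the job to a failed state rather than leaving it in Queued.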