Bug #6481
Closed
[Crunch] Job infinitely stuck in Queued state after Sanity check fails 1
Description
Looking at: https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-0ohrxciza6yival#Log
It seems that the sanity check (`/usr/bin/docker.io ps -q`) is failing:
```
2015-07-06_20:09:30 6213 starting: ['srun','--nodes=1','--ntasks-per-node=1','/usr/bin/docker.io','ps','-q']
2015-07-06_20:09:30 time="2015-07-06T20:09:30Z" level=fatal msg="Get http:///var/run/docker.sock/v1.18/containers/json: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?"
2015-07-06_20:09:30 srun: error: compute0: task 0: Exited with exit code 1
2015-07-06_20:09:30 6213 Sanity check failed: 1
```
Docker is down on the compute node, and this error message keeps repeating ad infinitum.
Two ideas:
- The job should fail if the sanity check fails with exit code 1.
- There should be a maximum number of sanity check retries.
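The second idea, a bounded retry loop around the sanity check, could look roughly like the sketch below. This is purely illustrative and not crunch-job's actual code; the retry cap, delay, and function name are assumptions.

```python
import subprocess
import time

# Illustrative values -- the appropriate cap and delay are open questions.
MAX_SANITY_RETRIES = 4
SANITY_CHECK_CMD = ["srun", "--nodes=1", "--ntasks-per-node=1",
                    "/usr/bin/docker.io", "ps", "-q"]

def sanity_check_with_retries(cmd, max_retries, delay=5):
    """Run the sanity check up to max_retries times; return True on success."""
    for attempt in range(1, max_retries + 1):
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode == 0:
            return True
        print("Sanity check failed (attempt %d/%d): exit %d"
              % (attempt, max_retries, result.returncode))
        time.sleep(delay)
    # Give up instead of retrying forever, so the dispatcher can move on.
    return False
```

With a cap like this, a node whose Docker daemon is down would stop the retry loop after a few attempts rather than leaving the job queued indefinitely.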
Updated by Tom Clegg almost 9 years ago
- Subject changed from Job infinitely stuck in Queued state after Sanity check fails 1 to [Crunch] Job infinitely stuck in Queued state after Sanity check fails 1
- Category set to Crunch
The node (not the job) has a problem, so the sanity check is doing its job here. The job shouldn't fail: it hasn't even been attempted.
Suggested improvements:
- Detect a sanity check failure as distinct from a locking failure (different crunch-job exit status?)
- Mark the node as "down" after 4(?) consecutive failed sanity checks
- Avoid using a node (for 5 seconds?) if it has failed a sanity check
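The last two suggestions amount to per-node failure bookkeeping: count consecutive failures, mark the node down past a threshold, and impose a short cooldown after each failure. A minimal sketch of that logic, with all class names and thresholds assumed for illustration rather than taken from Arvados:

```python
import time

FAILURES_BEFORE_DOWN = 4   # "mark down after 4(?) consecutive failures"
COOLDOWN_SECONDS = 5       # "avoid using a node (for 5 seconds?)"

class NodeHealth:
    """Tracks consecutive sanity-check failures for one compute node."""

    def __init__(self):
        self.consecutive_failures = 0
        self.avoid_until = 0.0

    def record_result(self, ok, now=None):
        """Record one sanity-check result; return True if node is now down."""
        now = time.time() if now is None else now
        if ok:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            self.avoid_until = now + COOLDOWN_SECONDS
        return self.is_down()

    def is_down(self):
        return self.consecutive_failures >= FAILURES_BEFORE_DOWN

    def usable(self, now=None):
        """A node is schedulable if it is not down and past its cooldown."""
        now = time.time() if now is None else now
        return not self.is_down() and now >= self.avoid_until
```

One success resets the failure count, so only genuinely consecutive failures take a node down; a single transient failure just triggers the short cooldown.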