Bug #6481

closed

[Crunch] Job stuck indefinitely in Queued state after sanity check fails with exit code 1

Added by Bryan Cosca almost 9 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: Crunch
Target version: -
Story points: -

Description

Looking at: https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-0ohrxciza6yival#Log

It seems that the sanity check keeps failing:

Sanity check is `/usr/bin/docker.io ps -q`
2015-07-06_20:09:30 6213 starting: ['srun','--nodes=1','--ntasks-per-node=1','/usr/bin/docker.io','ps','-q']
2015-07-06_20:09:30 time="2015-07-06T20:09:30Z" level=fatal msg="Get http:///var/run/docker.sock/v1.18/containers/json: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?"
2015-07-06_20:09:30 srun: error: compute0: task 0: Exited with exit code 1
2015-07-06_20:09:30 6213 Sanity check failed: 1

Docker is down on the compute node, and this error message is repeated ad infinitum.

Two ideas:

The job should fail if the sanity check fails with exit code 1.
There should be a maximum number of sanity check retries.
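
Both ideas could be combined into a bounded retry loop. The following is only a minimal sketch in Python, not the actual Crunch dispatcher code: the names `sanity_check_with_retries`, `SanityCheckFailed`, and `MAX_SANITY_ATTEMPTS`, as well as the default limit and delay, are hypothetical and chosen for illustration.

```python
import subprocess
import time
from typing import Callable, Sequence

# Hypothetical default; the ticket does not specify a retry limit.
MAX_SANITY_ATTEMPTS = 3


class SanityCheckFailed(Exception):
    """Raised when the sanity check has failed on every allowed attempt."""


def run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(argv).returncode


def sanity_check_with_retries(
    check: Callable[[], int],
    max_attempts: int = MAX_SANITY_ATTEMPTS,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `check` until it returns 0, at most `max_attempts` times.

    Returns the number of attempts used. Raises SanityCheckFailed once
    the limit is reached, so the caller can fail the job instead of
    leaving it queued.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    exit_code = None
    for attempt in range(1, max_attempts + 1):
        exit_code = check()
        if exit_code == 0:
            return attempt
        print(f"Sanity check failed: {exit_code} (attempt {attempt}/{max_attempts})")
        if attempt < max_attempts:
            sleep(delay_seconds)  # wait before the next attempt
    raise SanityCheckFailed(
        f"sanity check failed {max_attempts} times, last exit code {exit_code}"
    )


# Example call, using the command from the log above (assumes srun and
# docker.io are installed on the host):
#
#   sanity_check_with_retries(lambda: run_command(
#       ["srun", "--nodes=1", "--ntasks-per-node=1",
#        "/usr/bin/docker.io", "ps", "-q"]))
```

Setting `max_attempts=1` corresponds to the first idea (fail the job on the first failed check); any larger value corresponds to the second.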
