Bug #6481

[Crunch] Job infinitely stuck in Queued state after Sanity check fails 1

Added by Bryan Cosca almost 9 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: Crunch
Target version: -
Story points: -

Description

Looking at: https://cloud.curoverse.com/pipeline_instances/qr1hi-d1hrv-0ohrxciza6yival#Log

It seems that the sanity check on the compute node keeps failing:

Sanity check is `/usr/bin/docker.io ps -q`
2015-07-06_20:09:30 6213 starting: ['srun','--nodes=1','--ntasks-per-node=1','/usr/bin/docker.io','ps','-q']
2015-07-06_20:09:30 time="2015-07-06T20:09:30Z" level=fatal msg="Get http:///var/run/docker.sock/v1.18/containers/json: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?"
2015-07-06_20:09:30 srun: error: compute0: task 0: Exited with exit code 1
2015-07-06_20:09:30 6213 Sanity check failed: 1

Docker is down on the compute node, and this error message is repeated ad infinitum.

Two ideas:

  • The job should fail if the sanity check fails with exit code 1.
  • There should be a maximum number of sanity check retries.
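
As a rough illustration of the second idea, here is a minimal Python sketch of a bounded retry loop around the sanity check. It is not crunch-job code: the function name, the `max_retries` and `delay` parameters, and the injectable `run` callable are all hypothetical and exist only for this example. The command shown in the usage note is the one from the log above.

```python
import subprocess
import time


def _exit_code(cmd):
    """Run cmd and return its exit code (127 if it cannot be started)."""
    try:
        return subprocess.run(cmd).returncode
    except OSError:
        return 127


def sanity_check_with_retries(cmd, max_retries=3, delay=0.0, run=_exit_code):
    """Return True as soon as cmd exits 0; give up after max_retries attempts.

    `run` is a hypothetical hook so the loop can be exercised without Docker.
    """
    for attempt in range(1, max_retries + 1):
        if run(cmd) == 0:
            return True
        if attempt < max_retries:
            time.sleep(delay)  # wait before the next attempt
    return False
```

A caller might invoke it as `sanity_check_with_retries(['srun', '--nodes=1', '--ntasks-per-node=1', '/usr/bin/docker.io', 'ps', '-q'], max_retries=4, delay=5)` and decide what to do with the job or the node once it returns False, instead of retrying forever.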

#1

Updated by Tom Clegg almost 9 years ago

  • Subject changed from Job infinitely stuck in Queued state after Sanity check fails 1 to [Crunch] Job infinitely stuck in Queued state after Sanity check fails 1
  • Category set to Crunch

The node (not the job) has a problem, so the sanity check is doing its job here. The job shouldn't fail: it hasn't even been attempted.

Suggested improvements:
  • Detect a sanity check failure as distinct from a locking failure (different crunch-job exit status?)
  • Mark the node as "down" after 4(?) consecutive failed sanity checks
  • Avoid using a node (for 5 seconds?) if it has failed a sanity check
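
The last two suggestions could be sketched roughly as follows, in Python. This is only an illustration of the bookkeeping, not how Crunch tracks nodes: the class name `NodeSanityTracker`, its methods, and the choice to keep a node "down" until someone clears it are assumptions made for the example. The thresholds (4 consecutive failures, 5 seconds) are the tentative values from the list above.

```python
import time


class NodeSanityTracker:
    """Hypothetical per-node record of sanity check results."""

    def __init__(self, down_after=4, cooldown=5.0, clock=time.monotonic):
        self.down_after = down_after  # consecutive failures before "down"
        self.cooldown = cooldown      # seconds to avoid a node after a failure
        self._clock = clock
        self._failures = {}           # node -> consecutive failure count
        self._last_failure = {}       # node -> time of most recent failure
        self._down = set()            # nodes marked "down"

    def record(self, node, passed):
        """Record one sanity check result for a node."""
        if passed:
            # A pass resets the streak; assumption: it does not undo "down".
            self._failures[node] = 0
            self._last_failure.pop(node, None)
            return
        count = self._failures.get(node, 0) + 1
        self._failures[node] = count
        self._last_failure[node] = self._clock()
        if count >= self.down_after:
            self._down.add(node)

    def is_down(self, node):
        return node in self._down

    def usable(self, node):
        """True if the node is not down and not inside its cooldown window."""
        if node in self._down:
            return False
        last = self._last_failure.get(node)
        return last is None or self._clock() - last >= self.cooldown
```

A dispatcher could call `record()` after each sanity check and skip any node for which `usable()` is False, so a job waits for a healthy node rather than looping on a broken one.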

#2

Updated by Peter Amstutz over 4 years ago

  • Status changed from New to Closed