Bug #9214

no node being spun up for pending job (after docker load failed)

Added by Ward Vandewege about 6 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
% Done: 0%
Story points: -

Description

In su92l-d1hrv-xswkd33we27fopw, the `docker load` command failed:

2016-05-12_19:19:29 salloc: Granted job allocation 14353
2016-05-12_19:19:29 48525  Sanity check is `docker.io ps -q`
2016-05-12_19:19:29 48525  sanity check: start
2016-05-12_19:19:29 48525  stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','docker.io','ps','-q']
2016-05-12_19:19:29 48525  sanity check: exit 0
2016-05-12_19:19:29 48525  Sanity check OK
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20151207150126, 0.1.20151023190001, 0.1.20150205181653
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  check slurm allocation
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  node compute0 - 1 slots
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  start
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  clean work dirs: start
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  stderr starting: ['srun','--nodelist=compute0','-D','/tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2016-05-12_19:19:31 su92l-8i9sb-3wmb3ogss5hvqwb 48525  clean work dirs: exit 0
2016-05-12_19:19:32 su92l-8i9sb-3wmb3ogss5hvqwb 48525  Install docker image 256f21bb3abfcd8e08a893886bf3e7c0+5082
2016-05-12_19:19:32 su92l-8i9sb-3wmb3ogss5hvqwb 48525  docker image hash is f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8
2016-05-12_19:19:32 su92l-8i9sb-3wmb3ogss5hvqwb 48525  load docker image: start
2016-05-12_19:19:32 su92l-8i9sb-3wmb3ogss5hvqwb 48525  stderr starting: ['srun','--nodelist=compute0','/bin/bash','-o','pipefail','-ec',' if docker.io images -q --no-trunc --all | grep -xF f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8 >/dev/null; then     exit 0 fi declare -a exit_codes=("${PIPESTATUS[@]}") if [ 0 != "${exit_codes[0]}" ]; then    exit "${exit_codes[0]}"  # `docker images` failed elif [ 1 != "${exit_codes[1]}" ]; then    exit "${exit_codes[1]}"  # `grep` encountered an error else    # Everything worked fine, but grep didn\'t find the image on this host.    arv-get 256f21bb3abfcd8e08a893886bf3e7c0\\+5082\\/f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8\\.tar | docker.io load fi ']
2016-05-12_19:37:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  stderr An error occurred trying to connect: Post http:///var/run/docker.sock/v1.21/images/load: EOF
2016-05-12_19:37:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  stderr srun: error: compute0: task 0: Exited with exit code 1
2016-05-12_19:37:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  load docker image: exit 1
2016-05-12_19:37:30 salloc: Relinquishing job allocation 14353
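The image-load logic embedded in the `srun` command above is flattened onto one line in the log. Its error-discrimination pattern is sketched below in a self-contained form, with a `printf | grep` stand-in for the real `docker.io images -q --no-trunc --all | grep -xF <image hash>` pipeline (the real script's `arv-get ... | docker.io load` fallback is reduced to an `echo`):

```bash
#!/bin/bash
# Same pattern as the crunch-job image-load script: tell apart
# "producer failed", "grep errored", and "image simply absent".
set -o pipefail

# Stand-in for: docker.io images -q --no-trunc --all | grep -xF <image hash>
if printf 'aaa\nbbb\n' | grep -xF 'ccc' >/dev/null; then
    exit 0  # image already present on this node
fi
# PIPESTATUS still reflects the condition pipeline at this point.
exit_codes=("${PIPESTATUS[@]}")

if [ 0 != "${exit_codes[0]}" ]; then
    echo "producer failed: ${exit_codes[0]}"        # `docker images` itself failed
elif [ 1 != "${exit_codes[1]}" ]; then
    echo "grep error: ${exit_codes[1]}"             # grep hit a real error (exit > 1)
else
    echo "image not present; would fetch and load"  # real script: arv-get ... | docker.io load
fi
```

In the failure logged above, this final fetch-and-load branch ran, and `docker.io load` died with an EOF on the Docker socket, so the whole step exited 1 and the slurm allocation was relinquished.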

After this, the job was back in 'pending' state, yet no new node was started up.

Workbench showed 0 queued jobs, 0 busy nodes, 0 idle nodes.

`sinfo` and `azure vm list` both confirmed that no nodes were up.

I have attached the Node Manager logs for this period to this ticket.

The job did not start running again until I queued another job, which caused a node to be booted and become available:

2016-05-12_20:05:56 salloc: Granted job allocation 14354
2016-05-12_20:05:57 7368  Sanity check is `docker.io ps -q`
2016-05-12_20:05:57 7368  sanity check: start
2016-05-12_20:05:57 7368  stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','docker.io','ps','-q']
2016-05-12_20:05:58 7368  sanity check: exit 0
2016-05-12_20:05:58 7368  Sanity check OK
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368  running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20151207150126, 0.1.20151023190001, 0.1.20150205181653
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368  check slurm allocation
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368  node compute1 - 1 slots
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368  start
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368  clean work dirs: start
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368  stderr starting: ['srun','--nodelist=compute1','-D','/tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2016-05-12_20:05:59 su92l-8i9sb-3wmb3ogss5hvqwb 7368  clean work dirs: exit 0
2016-05-12_20:05:59 su92l-8i9sb-3wmb3ogss5hvqwb 7368  Install docker image 256f21bb3abfcd8e08a893886bf3e7c0+5082
2016-05-12_20:05:59 su92l-8i9sb-3wmb3ogss5hvqwb 7368  docker image hash is f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8
2016-05-12_20:05:59 su92l-8i9sb-3wmb3ogss5hvqwb 7368  load docker image: start
2016-05-12_20:05:59 su92l-8i9sb-3wmb3ogss5hvqwb 7368  stderr starting: ['srun','--nodelist=compute1','/bin/bash','-o','pipefail','-ec',' if docker.io images -q --no-trunc --all | grep -xF f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8 >/dev/null; then     exit 0 fi declare -a exit_codes=("${PIPESTATUS[@]}") if [ 0 != "${exit_codes[0]}" ]; then    exit "${exit_codes[0]}"  # `docker images` failed elif [ 1 != "${exit_codes[1]}" ]; then    exit "${exit_codes[1]}"  # `grep` encountered an error else    # Everything worked fine, but grep didn\'t find the image on this host.    arv-get 256f21bb3abfcd8e08a893886bf3e7c0\\+5082\\/f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8\\.tar | docker.io load fi ']
...
Attachment: current (929 KB) - Node Manager log. Ward Vandewege, 05/12/2016 08:33 PM

History

#1 Updated by Ward Vandewege about 6 years ago

  • Description updated (diff)

#2 Updated by Ward Vandewege about 6 years ago

#3 Updated by Brett Smith about 6 years ago

Ward Vandewege wrote:

After this, the job was back in 'pending' state, yet no new node was started up.

This is the crux of the problem, and a limitation of Crunch v1. After crunch-job locks a job to run it, it is no longer in the Queued state (at the API level), and can't be returned there. Because of that, Node Manager has no way to see it, and can't boot a node to accommodate it.

#4 Updated by Brett Smith about 6 years ago

I guess Node Manager could extend its internal wishlist by listing jobs that are in the Running state but not assigned to any node(s) yet.
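That wishlist extension might look roughly like the sketch below. This is hypothetical: Node Manager's real job-query API and record schema differ, and the `state` and `nodes` field names here are illustrative assumptions.

```python
def extend_wishlist(jobs, wishlist):
    """Append jobs that are Running but not yet assigned to any node.

    `jobs` is a list of job records (dicts). The `state` and `nodes`
    fields are illustrative assumptions, not Node Manager's real schema.
    """
    for job in jobs:
        if job.get('state') == 'Running' and not job.get('nodes'):
            wishlist.append(job)
    return wishlist

# Example (dummy UUIDs): job ...001 was locked by crunch-job (so it left
# the Queued state), then lost its node when `docker load` failed.
jobs = [
    {'uuid': 'zzzzz-8i9sb-000000000000001', 'state': 'Running', 'nodes': []},
    {'uuid': 'zzzzz-8i9sb-000000000000002', 'state': 'Running', 'nodes': ['compute0']},
    {'uuid': 'zzzzz-8i9sb-000000000000003', 'state': 'Queued', 'nodes': []},
]
wishlist = [j for j in jobs if j['state'] == 'Queued']  # current behavior
extend_wishlist(jobs, wishlist)
# wishlist now also contains the orphaned Running job ...001.
```

The point of the sketch is only the extra predicate: a job that is Running but node-less would re-enter the wishlist, so Node Manager could boot a node for it even though it never returns to Queued.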

#5 Updated by Peter Amstutz over 2 years ago

  • Status changed from New to Closed
