Bug #9214

closed

no node being spun up for pending job (after docker load failed)

Added by Ward Vandewege almost 8 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

In su92l-d1hrv-xswkd33we27fopw, the docker load command failed:

2016-05-12_19:19:29 salloc: Granted job allocation 14353
2016-05-12_19:19:29 48525  Sanity check is `docker.io ps -q`
2016-05-12_19:19:29 48525  sanity check: start
2016-05-12_19:19:29 48525  stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','docker.io','ps','-q']
2016-05-12_19:19:29 48525  sanity check: exit 0
2016-05-12_19:19:29 48525  Sanity check OK
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20151207150126, 0.1.20151023190001, 0.1.20150205181653
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  check slurm allocation
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  node compute0 - 1 slots
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  start
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  clean work dirs: start
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  stderr starting: ['srun','--nodelist=compute0','-D','/tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2016-05-12_19:19:31 su92l-8i9sb-3wmb3ogss5hvqwb 48525  clean work dirs: exit 0
2016-05-12_19:19:32 su92l-8i9sb-3wmb3ogss5hvqwb 48525  Install docker image 256f21bb3abfcd8e08a893886bf3e7c0+5082
2016-05-12_19:19:32 su92l-8i9sb-3wmb3ogss5hvqwb 48525  docker image hash is f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8
2016-05-12_19:19:32 su92l-8i9sb-3wmb3ogss5hvqwb 48525  load docker image: start
2016-05-12_19:19:32 su92l-8i9sb-3wmb3ogss5hvqwb 48525  stderr starting: ['srun','--nodelist=compute0','/bin/bash','-o','pipefail','-ec',' if docker.io images -q --no-trunc --all | grep -xF f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8 >/dev/null; then     exit 0 fi declare -a exit_codes=("${PIPESTATUS[@]}") if [ 0 != "${exit_codes[0]}" ]; then    exit "${exit_codes[0]}"  # `docker images` failed elif [ 1 != "${exit_codes[1]}" ]; then    exit "${exit_codes[1]}"  # `grep` encountered an error else    # Everything worked fine, but grep didn\'t find the image on this host.    arv-get 256f21bb3abfcd8e08a893886bf3e7c0\\+5082\\/f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8\\.tar | docker.io load fi ']
2016-05-12_19:37:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  stderr An error occurred trying to connect: Post http:///var/run/docker.sock/v1.21/images/load: EOF
2016-05-12_19:37:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  stderr srun: error: compute0: task 0: Exited with exit code 1
2016-05-12_19:37:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525  load docker image: exit 1
2016-05-12_19:37:30 salloc: Relinquishing job allocation 14353
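
The quoted command in the "load docker image" step is hard to follow as a single log line. Below is a sketch of the same decision logic laid out as a bash function; it is not the crunch-job source itself. LIST_CMD and LOAD_CMD are hypothetical stand-ins (for `docker.io images -q --no-trunc --all` and for the `arv-get` piped into `docker.io load` step, respectively) so the logic can be exercised without Docker, and PIPESTATUS is captured in an else branch rather than after the `fi`. In the run above, the error message comes from the images/load request, so it appears to be the final load branch that failed with exit code 1.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the image check seen in the log line above.
# LIST_CMD and LOAD_CMD are hypothetical names introduced for illustration;
# each should name a function or command that takes no arguments.

# The body runs in a subshell, so `exit` and `pipefail` stay local to the call.
ensure_image() (
    set -o pipefail
    image_hash=$1

    # Is the image already present on this host?
    if $LIST_CMD | grep -xF "$image_hash" >/dev/null; then
        exit 0                        # image found, nothing to load
    else
        exit_codes=("${PIPESTATUS[@]}")
    fi

    if [ 0 != "${exit_codes[0]}" ]; then
        exit "${exit_codes[0]}"       # listing images failed
    elif [ 1 != "${exit_codes[1]}" ]; then
        exit "${exit_codes[1]}"       # grep encountered an error
    else
        # Listing worked, but the image is not on this host: fetch and load it.
        $LOAD_CMD
    fi
)
```

With LIST_CMD and LOAD_CMD set to wrapper functions or scripts, `ensure_image <hash>` returns 0 when the image is already present or loads successfully, and otherwise returns the exit code of the step that failed.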

After this, the job went back to the 'pending' state, yet no new node was started.

Workbench showed 0 queued jobs, 0 busy nodes, 0 idle nodes.

Both sinfo and azure vm list confirmed that no nodes were up.

I also attached the node manager logs for this period to this ticket.

The job did not start running again until I queued another job, at which point a node was available:

2016-05-12_20:05:56 salloc: Granted job allocation 14354
2016-05-12_20:05:57 7368  Sanity check is `docker.io ps -q`
2016-05-12_20:05:57 7368  sanity check: start
2016-05-12_20:05:57 7368  stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','docker.io','ps','-q']
2016-05-12_20:05:58 7368  sanity check: exit 0
2016-05-12_20:05:58 7368  Sanity check OK
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368  running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20151207150126, 0.1.20151023190001, 0.1.20150205181653
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368  check slurm allocation
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368  node compute1 - 1 slots
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368  start
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368  clean work dirs: start
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368  stderr starting: ['srun','--nodelist=compute1','-D','/tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2016-05-12_20:05:59 su92l-8i9sb-3wmb3ogss5hvqwb 7368  clean work dirs: exit 0
2016-05-12_20:05:59 su92l-8i9sb-3wmb3ogss5hvqwb 7368  Install docker image 256f21bb3abfcd8e08a893886bf3e7c0+5082
2016-05-12_20:05:59 su92l-8i9sb-3wmb3ogss5hvqwb 7368  docker image hash is f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8
2016-05-12_20:05:59 su92l-8i9sb-3wmb3ogss5hvqwb 7368  load docker image: start
2016-05-12_20:05:59 su92l-8i9sb-3wmb3ogss5hvqwb 7368  stderr starting: ['srun','--nodelist=compute1','/bin/bash','-o','pipefail','-ec',' if docker.io images -q --no-trunc --all | grep -xF f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8 >/dev/null; then     exit 0 fi declare -a exit_codes=("${PIPESTATUS[@]}") if [ 0 != "${exit_codes[0]}" ]; then    exit "${exit_codes[0]}"  # `docker images` failed elif [ 1 != "${exit_codes[1]}" ]; then    exit "${exit_codes[1]}"  # `grep` encountered an error else    # Everything worked fine, but grep didn\'t find the image on this host.    arv-get 256f21bb3abfcd8e08a893886bf3e7c0\\+5082\\/f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8\\.tar | docker.io load fi ']
...

Files

current (929 KB): Node manager log. Ward Vandewege, 05/12/2016 08:33 PM