Bug #9214
Closed
no node being spun up for pending job (after docker load failed)
Description
In su92l-d1hrv-xswkd33we27fopw, the docker load command failed:
2016-05-12_19:19:29 salloc: Granted job allocation 14353
2016-05-12_19:19:29 48525 Sanity check is `docker.io ps -q`
2016-05-12_19:19:29 48525 sanity check: start
2016-05-12_19:19:29 48525 stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','docker.io','ps','-q']
2016-05-12_19:19:29 48525 sanity check: exit 0
2016-05-12_19:19:29 48525 Sanity check OK
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525 running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20151207150126, 0.1.20151023190001, 0.1.20150205181653
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525 check slurm allocation
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525 node compute0 - 1 slots
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525 start
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525 clean work dirs: start
2016-05-12_19:19:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525 stderr starting: ['srun','--nodelist=compute0','-D','/tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2016-05-12_19:19:31 su92l-8i9sb-3wmb3ogss5hvqwb 48525 clean work dirs: exit 0
2016-05-12_19:19:32 su92l-8i9sb-3wmb3ogss5hvqwb 48525 Install docker image 256f21bb3abfcd8e08a893886bf3e7c0+5082
2016-05-12_19:19:32 su92l-8i9sb-3wmb3ogss5hvqwb 48525 docker image hash is f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8
2016-05-12_19:19:32 su92l-8i9sb-3wmb3ogss5hvqwb 48525 load docker image: start
2016-05-12_19:19:32 su92l-8i9sb-3wmb3ogss5hvqwb 48525 stderr starting: ['srun','--nodelist=compute0','/bin/bash','-o','pipefail','-ec',' if docker.io images -q --no-trunc --all | grep -xF f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8 >/dev/null; then exit 0 fi declare -a exit_codes=("${PIPESTATUS[@]}") if [ 0 != "${exit_codes[0]}" ]; then exit "${exit_codes[0]}" # `docker images` failed elif [ 1 != "${exit_codes[1]}" ]; then exit "${exit_codes[1]}" # `grep` encountered an error else # Everything worked fine, but grep didn\'t find the image on this host. arv-get 256f21bb3abfcd8e08a893886bf3e7c0\\+5082\\/f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8\\.tar | docker.io load fi ']
2016-05-12_19:37:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525 stderr An error occurred trying to connect: Post http:///var/run/docker.sock/v1.21/images/load: EOF
2016-05-12_19:37:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525 stderr srun: error: compute0: task 0: Exited with exit code 1
2016-05-12_19:37:30 su92l-8i9sb-3wmb3ogss5hvqwb 48525 load docker image: exit 1
2016-05-12_19:37:30 salloc: Relinquishing job allocation 14353
After this, the job was back in 'pending' state, yet no new node was started up.
Workbench showed 0 queued jobs, 0 busy nodes, 0 idle nodes.
The output of sinfo and azure vm list both confirmed that no nodes were up.
I also attached the node manager logs for this period to this ticket.
This job did not start running again until I queued another job, at which point a node became available:
2016-05-12_20:05:56 salloc: Granted job allocation 14354
2016-05-12_20:05:57 7368 Sanity check is `docker.io ps -q`
2016-05-12_20:05:57 7368 sanity check: start
2016-05-12_20:05:57 7368 stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','docker.io','ps','-q']
2016-05-12_20:05:58 7368 sanity check: exit 0
2016-05-12_20:05:58 7368 Sanity check OK
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368 running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20151207150126, 0.1.20151023190001, 0.1.20150205181653
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368 check slurm allocation
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368 node compute1 - 1 slots
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368 start
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368 clean work dirs: start
2016-05-12_20:05:58 su92l-8i9sb-3wmb3ogss5hvqwb 7368 stderr starting: ['srun','--nodelist=compute1','-D','/tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2016-05-12_20:05:59 su92l-8i9sb-3wmb3ogss5hvqwb 7368 clean work dirs: exit 0
2016-05-12_20:05:59 su92l-8i9sb-3wmb3ogss5hvqwb 7368 Install docker image 256f21bb3abfcd8e08a893886bf3e7c0+5082
2016-05-12_20:05:59 su92l-8i9sb-3wmb3ogss5hvqwb 7368 docker image hash is f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8
2016-05-12_20:05:59 su92l-8i9sb-3wmb3ogss5hvqwb 7368 load docker image: start
2016-05-12_20:05:59 su92l-8i9sb-3wmb3ogss5hvqwb 7368 stderr starting: ['srun','--nodelist=compute1','/bin/bash','-o','pipefail','-ec',' if docker.io images -q --no-trunc --all | grep -xF f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8 >/dev/null; then exit 0 fi declare -a exit_codes=("${PIPESTATUS[@]}") if [ 0 != "${exit_codes[0]}" ]; then exit "${exit_codes[0]}" # `docker images` failed elif [ 1 != "${exit_codes[1]}" ]; then exit "${exit_codes[1]}" # `grep` encountered an error else # Everything worked fine, but grep didn\'t find the image on this host. arv-get 256f21bb3abfcd8e08a893886bf3e7c0\\+5082\\/f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8\\.tar | docker.io load fi ']
...
Updated by Ward Vandewege almost 8 years ago
Updated by Brett Smith almost 8 years ago
Ward Vandewege wrote:
After this, the job was back in 'pending' state, yet no new node was started up.
This is the crux of the problem, and a limitation of Crunch v1. After crunch-job locks a job to run it, it is no longer in the Queued state (at the API level), and can't be returned there. Because of that, Node Manager has no way to see it, and can't boot a node to accommodate it.
Updated by Brett Smith almost 8 years ago
I guess Node Manager could extend its internal wishlist by listing jobs that are in the Running state but not assigned to any node(s) yet.
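To make that suggestion concrete, here is a minimal sketch of the selection logic, not Node Manager's actual code. It assumes job records are plain dicts with a `state` field, and it uses a hypothetical `node_uuids` field to stand in for "assigned to node(s)"; the function name `build_wishlist` is also made up for illustration.

```python
from typing import Dict, List


def build_wishlist(jobs: List[Dict]) -> List[Dict]:
    """Return the jobs that still need a node booted for them.

    Hypothetical record shape: {"uuid": str, "state": str, "node_uuids": list}.
    """
    wishlist = []
    for job in jobs:
        state = job.get("state")
        if state == "Queued":
            # The case that is already visible: the job is waiting in the queue.
            wishlist.append(job)
        elif state == "Running" and not job.get("node_uuids"):
            # The case from this ticket: the job was locked (so it is no longer
            # Queued) but has no node assigned to it.
            wishlist.append(job)
    return wishlist


# Illustrative data only; these are not real Arvados records.
jobs = [
    {"uuid": "job-a", "state": "Queued", "node_uuids": []},
    {"uuid": "job-b", "state": "Running", "node_uuids": []},
    {"uuid": "job-c", "state": "Running", "node_uuids": ["node-1"]},
    {"uuid": "job-d", "state": "Complete", "node_uuids": []},
]
wanted = [job["uuid"] for job in build_wishlist(jobs)]
print(wanted)
```

With the sample data, job-b (Running with no node) ends up on the wishlist next to the Queued job-a, which is the situation this ticket describes. How a real implementation would detect "not assigned to any node yet" depends on what the API exposes for a job.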