Bug #11138

Protection against running new style docker 1.10+ image on old docker host

Added by Joshua Randall about 1 month ago. Updated 17 days ago.

Status:ResolvedStart date:02/20/2017
Priority:NormalDue date:
Assignee:Tom Clegg% Done:

100%

Category:Crunch
Target version:2017-03-15 sprint
Story points1.0Remaining (hours)0.00 hour
Velocity based estimate-

Description

I upgraded docker to 1.13.1 over the weekend (or so I thought). Since then, I've been struggling to get the system working reliably.

It turns out the reason is that the docker upgrade did not take on 3 of the 31 nodes, such that those 3 nodes are still running 1.9.1.

There does not appear to be any check in crunch to ensure that the image being loaded from keep is able to be run in the docker version. I would suggest implementing a `docker version` check in the pre-flight sanity checks that happen just prior to job execution.

The underlying problem seems to be with docker - the 1.9.1 version of docker does not return a failure exit status when attempting to load a 1.10+ image, but the load does fail:

root@humgen-05-04:/tmp# arv-get e829be6274c110c4c16bf6381efae022+594/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac.tar | /usr/bin/docker load
753 MiB / 753 MiB 100.0%
root@humgen-05-04:/tmp# echo $?
0
root@humgen-05-04:/tmp# /usr/bin/docker images -q --no-trunc --all |grep a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac
root@humgen-05-04:/tmp#

The logic in crunch-job does not seem to handle this situation well - it assumes (reasonably, but incorrectly) that if `docker load` returns 0 then it is safe to try to run the container. Docker ends up attempting to pull the image from the `library/sha256` repo on docker hub, which obviously fails.

Full logs from a failing job:

2017-02-20_11:53:05 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  running from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job with arvados-cli Gem version(s) 0.1.20170217221854, 0.1.20161017193526, 0.1.20160503204200, 0.1.20151207150126, 0.1.20151023190001
2017-02-20_11:53:05 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check slurm allocation
2017-02-20_11:53:05 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  node humgen-05-04 - 10 slots
2017-02-20_11:53:06 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  start
2017-02-20_11:53:07 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  clean work dirs: start
2017-02-20_11:53:07 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr starting: ['srun','--nodelist=humgen-05-04','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2017-02-20_11:53:08 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  clean work dirs: exit 0
2017-02-20_11:53:08 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  Install docker image e829be6274c110c4c16bf6381efae022+594
2017-02-20_11:53:09 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  docker image hash is sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac
2017-02-20_11:53:09 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  load docker image: start
2017-02-20_11:53:09 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr starting: ['srun','--nodelist=humgen-05-04','/bin/bash','-o','pipefail','-ec',' if /usr/bin/docker images -q --no-trunc --all | grep -xF sha256\\:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac >/dev/null; then     exit 0 fi declare -a exit_codes=("${PIPESTATUS[@]}") if [ 0 != "${exit_codes[0]}" ]; then    exit "${exit_codes[0]}"  # `docker images` failed elif [ 1 != "${exit_codes[1]}" ]; then    exit "${exit_codes[1]}"  # `grep` encountered an error else    # Everything worked fine, but grep didn\'t find the image on this host.    arv-get e829be6274c110c4c16bf6381efae022\\+594\\/sha256\\:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac\\.tar | /usr/bin/docker load fi ']
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  load docker image: exit 0
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check --memory-swap feature: start
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr starting: ['srun','--nodes=1','/usr/bin/docker','run','--help']
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check --memory-swap feature: exit 0
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check whether default user is UID 0: start
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','/usr/bin/docker run  --add-host=api.arvados.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-01:172.17.180.10 --add-host=humgen-01-01.internal.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-02:172.17.180.11 --add-host=humgen-01-02.internal.sanger.ac.uk:172.17.180.11 --add-host=humgen-01-03:172.17.180.12 --add-host=humgen-01-03.internal.sanger.ac.uk:172.17.180.12 --add-host=humgen-02-01:172.17.180.13 --add-host=humgen-02-01.internal.sanger.ac.uk:172.17.180.13 --add-host=humgen-02-02:172.17.180.14 --add-host=humgen-02-02.internal.sanger.ac.uk:172.17.180.14 --add-host=humgen-02-03:172.17.180.15 --add-host=humgen-02-03.internal.sanger.ac.uk:172.17.180.15 --add-host=humgen-03-01:172.17.180.16 --add-host=humgen-03-01.internal.sanger.ac.uk:172.17.180.16 --add-host=humgen-03-02:172.17.180.17 --add-host=humgen-03-02.internal.sanger.ac.uk:172.17.180.17 --add-host=humgen-03-03:172.17.180.18 --add-host=humgen-03-03.internal.sanger.ac.uk:172.17.180.18 --add-host=humgen-04-01:172.17.180.19 --add-host=humgen-04-01.internal.sanger.ac.uk:172.17.180.19 --add-host=humgen-04-02:172.17.180.20 --add-host=humgen-04-02.internal.sanger.ac.uk:172.17.180.20 --add-host=humgen-04-03:172.17.180.21 --add-host=humgen-04-03.internal.sanger.ac.uk:172.17.180.21  sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac id --user']
2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Unable to find image 'sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac' locally
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Pulling repository docker.io/library/sha256
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Error: image library/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac not found
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr srun: error: humgen-05-04: task 0: Exited with exit code 1
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check whether default user is UID 0: exit 1
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check whether user 'crunch' is UID 0: start
2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','/usr/bin/docker run  --add-host=api.arvados.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-01:172.17.180.10 --add-host=humgen-01-01.internal.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-02:172.17.180.11 --add-host=humgen-01-02.internal.sanger.ac.uk:172.17.180.11 --add-host=humgen-01-03:172.17.180.12 --add-host=humgen-01-03.internal.sanger.ac.uk:172.17.180.12 --add-host=humgen-02-01:172.17.180.13 --add-host=humgen-02-01.internal.sanger.ac.uk:172.17.180.13 --add-host=humgen-02-02:172.17.180.14 --add-host=humgen-02-02.internal.sanger.ac.uk:172.17.180.14 --add-host=humgen-02-03:172.17.180.15 --add-host=humgen-02-03.internal.sanger.ac.uk:172.17.180.15 --add-host=humgen-03-01:172.17.180.16 --add-host=humgen-03-01.internal.sanger.ac.uk:172.17.180.16 --add-host=humgen-03-02:172.17.180.17 --add-host=humgen-03-02.internal.sanger.ac.uk:172.17.180.17 --add-host=humgen-03-03:172.17.180.18 --add-host=humgen-03-03.internal.sanger.ac.uk:172.17.180.18 --add-host=humgen-04-01:172.17.180.19 --add-host=humgen-04-01.internal.sanger.ac.uk:172.17.180.19 --add-host=humgen-04-02:172.17.180.20 --add-host=humgen-04-02.internal.sanger.ac.uk:172.17.180.20 --add-host=humgen-04-03:172.17.180.21 --add-host=humgen-04-03.internal.sanger.ac.uk:172.17.180.21 --user=crunch sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac id --user']
2017-02-20_11:53:20 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Unable to find image 'sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac' locally
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Pulling repository docker.io/library/sha256
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Error: image library/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac not found
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr srun: error: humgen-05-04: task 0: Exited with exit code 1
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check whether user 'crunch' is UID 0: exit 1
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check whether user 'nobody' is UID 0: start
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr starting: ['srun','--nodes=1','/bin/sh','-ec','/usr/bin/docker run  --add-host=api.arvados.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-01:172.17.180.10 --add-host=humgen-01-01.internal.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-02:172.17.180.11 --add-host=humgen-01-02.internal.sanger.ac.uk:172.17.180.11 --add-host=humgen-01-03:172.17.180.12 --add-host=humgen-01-03.internal.sanger.ac.uk:172.17.180.12 --add-host=humgen-02-01:172.17.180.13 --add-host=humgen-02-01.internal.sanger.ac.uk:172.17.180.13 --add-host=humgen-02-02:172.17.180.14 --add-host=humgen-02-02.internal.sanger.ac.uk:172.17.180.14 --add-host=humgen-02-03:172.17.180.15 --add-host=humgen-02-03.internal.sanger.ac.uk:172.17.180.15 --add-host=humgen-03-01:172.17.180.16 --add-host=humgen-03-01.internal.sanger.ac.uk:172.17.180.16 --add-host=humgen-03-02:172.17.180.17 --add-host=humgen-03-02.internal.sanger.ac.uk:172.17.180.17 --add-host=humgen-03-03:172.17.180.18 --add-host=humgen-03-03.internal.sanger.ac.uk:172.17.180.18 --add-host=humgen-04-01:172.17.180.19 --add-host=humgen-04-01.internal.sanger.ac.uk:172.17.180.19 --add-host=humgen-04-02:172.17.180.20 --add-host=humgen-04-02.internal.sanger.ac.uk:172.17.180.20 --add-host=humgen-04-03:172.17.180.21 --add-host=humgen-04-03.internal.sanger.ac.uk:172.17.180.21 --user=nobody sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac id --user']
2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Unable to find image 'sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac' locally
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Pulling repository docker.io/library/sha256
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr Error: image library/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac not found
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  stderr srun: error: humgen-05-04: task 0: Exited with exit code 1
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  check whether user 'nobody' is UID 0: exit 1
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  Could not find a user in container that is not UID 0 (tried default user,  crunch nobody) or there was a problem running 'id' in the container. at /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job line 484
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  release job allocation
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  Freeze not implemented
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  collate
2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520  collated output manifest text to send to API server is 0 bytes with access tokens


Subtasks

Task #11207: Make docker_install_script more robustResolvedTom Clegg

Task #11197: Review 11138-docker-load-failResolvedLucas Di Pentima

Associated revisions

Revision a4a3227c
Added by Tom Clegg 17 days ago

Merge branch '11138-docker-load-fail'

closes #11138

History

#1 Updated by Tom Morris about 1 month ago

  • Target version set to 2017-03-15 sprint

#2 Updated by Tom Clegg 22 days ago

  • Assignee set to Tom Clegg

#3 Updated by Tom Clegg 22 days ago

  • Story points set to 1.0

#4 Updated by Tom Clegg 21 days ago

  • Status changed from New to In Progress

#5 Updated by Tom Clegg 21 days ago

11138-docker-load-fail uses docker inspect (before and after docker load) to assure that docker run aaa123 will actually run an image whose ID is aaa123. This way we error out if docker load fails but exits 0.

Testing dd08615e98e656ae07c6ca26eaa635f6ce8f01b7 on 4xphq:

2017-03-03_05:58:23 4xphq-8i9sb-hz036kwpijqh8d7 22432  load docker image: start
2017-03-03_05:58:23 4xphq-8i9sb-hz036kwpijqh8d7 22432  stderr starting: ['srun','--nodelist=compute1','/bin/bash','-o','pipefail','-ec',' loaded() {   [[ $(/usr/bin/docker.io inspect --format="{{.ID}}" b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee) = b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee ]] } if loaded 2>/dev/null; then   echo >&2 "image is already present"   exit 0 fi echo >&2 "docker image is not present; loading" arv-get 76aca9f8f73603945fc5b7b89a84f89d\\+419\\/b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee\\.tar | /usr/bin/docker.io load if ! loaded; then   echo >&2 "`docker load` exited 0, but image is not found (!)"   exit 1 fi echo >&2 "image loaded successfully" ']
2017-03-03_05:58:23 4xphq-8i9sb-hz036kwpijqh8d7 22432  stderr docker image is not present; loading
2017-03-03_05:59:02 4xphq-8i9sb-hz036kwpijqh8d7 22432  stderr image loaded successfully
2017-03-03_05:59:03 4xphq-8i9sb-hz036kwpijqh8d7 22432  load docker image: exit 0
2017-03-03_05:59:23 4xphq-8i9sb-88v4figwr7qsbnh 24304  load docker image: start
2017-03-03_05:59:23 4xphq-8i9sb-88v4figwr7qsbnh 24304  stderr starting: ['srun','--nodelist=compute1','/bin/bash','-o','pipefail','-ec',' loaded() {   [[ $(/usr/bin/docker.io inspect --format="{{.ID}}" b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee) = b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee ]] } if loaded 2>/dev/null; then   echo >&2 "image is already present"   exit 0 fi echo >&2 "docker image is not present; loading" arv-get 76aca9f8f73603945fc5b7b89a84f89d\\+419\\/b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee\\.tar | /usr/bin/docker.io load if ! loaded; then   echo >&2 "`docker load` exited 0, but image is not found (!)"   exit 1 fi echo >&2 "image loaded successfully" ']
2017-03-03_05:59:23 4xphq-8i9sb-88v4figwr7qsbnh 24304  stderr image is already present
2017-03-03_05:59:23 4xphq-8i9sb-88v4figwr7qsbnh 24304  load docker image: exit 0

#6 Updated by Lucas Di Pentima 20 days ago

On line 428: Could it be that "docker load" fails for some reason, and the printed message still say that it exited with 0? If I'm following correctly, we should check for its error level?
On line 430: Maybe it would be useful to tell what we're expecting versus getting, for future debugging purposes.

#7 Updated by Tom Clegg 20 days ago

Lucas Di Pentima wrote:

On line 428: Could it be that "docker load" fails for some reason, and the printed message still say that it exited with 0? If I'm following correctly, we should check for its error level?

We're using "bash -o pipefail -e" so bash should abort for us immediately if arv-get or docker load exits non-zero.

On line 430: Maybe it would be useful to tell what we're expecting versus getting, for future debugging purposes.

SGTM. At f7d32ebcf2e0641cf92d079387dc325d65e35879

2017-03-03_21:41:27 4xphq-8i9sb-951ervh8xt0o8e0 24826  load docker image: start
2017-03-03_21:41:27 4xphq-8i9sb-951ervh8xt0o8e0 24826  stderr starting: ['srun','--nodelist=compute1','/bin/bash','-o','pipefail','-ec',' loaded() {   id=$(/usr/bin/docker.io inspect --format="{{.ID}}" b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee) || return 1   echo "image ID is $id"   [[ ${id} = b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee ]] } if loaded >&2 2>/dev/null; then   echo >&2 "image is already present"   exit 0 fi echo >&2 "docker image is not present; loading" arv-get 76aca9f8f73603945fc5b7b89a84f89d\\+419\\/b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee\\.tar | /usr/bin/docker.io load if ! loaded >&2; then   echo >&2 "`docker load` exited 0, but image is not found (!)"   exit 1 fi echo >&2 "image loaded successfully" ']
2017-03-03_21:41:27 4xphq-8i9sb-951ervh8xt0o8e0 24826  stderr image ID is b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee
2017-03-03_21:41:27 4xphq-8i9sb-951ervh8xt0o8e0 24826  stderr image is already present
2017-03-03_21:41:27 4xphq-8i9sb-951ervh8xt0o8e0 24826  load docker image: exit 0

#8 Updated by Lucas Di Pentima 17 days ago

Thanks! LGTM.

#9 Updated by Tom Clegg 17 days ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:a4a3227c634d8cfd54a23205be727e910032cbe9.

Also available in: Atom PDF