Bug #11138
closedProtection against running new style docker 1.10+ image on old docker host
Description
I upgraded docker to 1.13.1 over the weekend (or so I thought). Since then, I've been struggling to get the system working reliably.
It turns out the reason is that the docker upgrade did not take on 3 of the 31 nodes, such that those 3 nodes are still running 1.9.1.
There does not appear to be any check in crunch to ensure that the image being loaded from keep is able to be run in the docker version. I would suggest implementing a `docker version` check in the pre-flight sanity checks that happen just prior to job execution.
The underlying problem seems to be with docker - the 1.9.1 version of docker does not return a failure exit status when attempting to load a 1.10+ image, but the load does fail:
root@humgen-05-04:/tmp# arv-get e829be6274c110c4c16bf6381efae022+594/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac.tar | /usr/bin/docker load 753 MiB / 753 MiB 100.0% root@humgen-05-04:/tmp# echo $? 0 root@humgen-05-04:/tmp# /usr/bin/docker images -q --no-trunc --all |grep a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac root@humgen-05-04:/tmp#
The logic in crunch-job does not seem to handle this situation well - it assumes (reasonably, but incorrectly) that if `docker load` returns 0 then it is safe to try to run the container. Docker ends up attempting to pull the image from the `library/sha256` repo on docker hub, which obviously fails.
Full logs from a failing job:
2017-02-20_11:53:05 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 running from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job with arvados-cli Gem version(s) 0.1.20170217221854, 0.1.20161017193526, 0.1.20160503204200, 0.1.20151207150126, 0.1.20151023190001 2017-02-20_11:53:05 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check slurm allocation 2017-02-20_11:53:05 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 node humgen-05-04 - 10 slots 2017-02-20_11:53:06 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 start 2017-02-20_11:53:07 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 clean work dirs: start 2017-02-20_11:53:07 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr starting: ['srun','--nodelist=humgen-05-04','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid'] 2017-02-20_11:53:08 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 clean work dirs: exit 0 2017-02-20_11:53:08 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 Install docker image e829be6274c110c4c16bf6381efae022+594 2017-02-20_11:53:09 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 docker image hash is sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac 2017-02-20_11:53:09 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 load docker image: start 2017-02-20_11:53:09 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr starting: ['srun','--nodelist=humgen-05-04','/bin/bash','-o','pipefail','-ec',' if /usr/bin/docker images -q --no-trunc --all | grep -xF sha256\\:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac >/dev/null; then exit 0 fi declare -a exit_codes=("${PIPESTATUS[@]}") if [ 0 != "${exit_codes[0]}" ]; then exit "${exit_codes[0]}" # `docker images` failed elif [ 1 != "${exit_codes[1]}" ]; then exit "${exit_codes[1]}" # `grep` encountered an error else # Everything worked fine, but grep didn\'t find the image on this host. arv-get e829be6274c110c4c16bf6381efae022\\+594\\/sha256\\:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac\\.tar | /usr/bin/docker load fi '] 2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 load docker image: exit 0 2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check --memory-swap feature: start 2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr starting: ['srun','--nodes=1','/usr/bin/docker','run','--help'] 2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check --memory-swap feature: exit 0 2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check whether default user is UID 0: start 2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','/usr/bin/docker run --add-host=api.arvados.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-01:172.17.180.10 --add-host=humgen-01-01.internal.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-02:172.17.180.11 --add-host=humgen-01-02.internal.sanger.ac.uk:172.17.180.11 --add-host=humgen-01-03:172.17.180.12 --add-host=humgen-01-03.internal.sanger.ac.uk:172.17.180.12 --add-host=humgen-02-01:172.17.180.13 --add-host=humgen-02-01.internal.sanger.ac.uk:172.17.180.13 --add-host=humgen-02-02:172.17.180.14 --add-host=humgen-02-02.internal.sanger.ac.uk:172.17.180.14 --add-host=humgen-02-03:172.17.180.15 --add-host=humgen-02-03.internal.sanger.ac.uk:172.17.180.15 --add-host=humgen-03-01:172.17.180.16 --add-host=humgen-03-01.internal.sanger.ac.uk:172.17.180.16 --add-host=humgen-03-02:172.17.180.17 --add-host=humgen-03-02.internal.sanger.ac.uk:172.17.180.17 --add-host=humgen-03-03:172.17.180.18 --add-host=humgen-03-03.internal.sanger.ac.uk:172.17.180.18 --add-host=humgen-04-01:172.17.180.19 --add-host=humgen-04-01.internal.sanger.ac.uk:172.17.180.19 --add-host=humgen-04-02:172.17.180.20 --add-host=humgen-04-02.internal.sanger.ac.uk:172.17.180.20 --add-host=humgen-04-03:172.17.180.21 --add-host=humgen-04-03.internal.sanger.ac.uk:172.17.180.21 sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac id --user'] 2017-02-20_11:53:18 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Unable to find image 'sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac' locally 2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Pulling repository docker.io/library/sha256 2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Error: image library/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac not found 2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr srun: error: humgen-05-04: task 0: Exited with exit code 1 2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check whether default user is UID 0: exit 1 2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check whether user 'crunch' is UID 0: start 2017-02-20_11:53:19 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','/usr/bin/docker run --add-host=api.arvados.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-01:172.17.180.10 --add-host=humgen-01-01.internal.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-02:172.17.180.11 --add-host=humgen-01-02.internal.sanger.ac.uk:172.17.180.11 --add-host=humgen-01-03:172.17.180.12 --add-host=humgen-01-03.internal.sanger.ac.uk:172.17.180.12 --add-host=humgen-02-01:172.17.180.13 --add-host=humgen-02-01.internal.sanger.ac.uk:172.17.180.13 --add-host=humgen-02-02:172.17.180.14 --add-host=humgen-02-02.internal.sanger.ac.uk:172.17.180.14 --add-host=humgen-02-03:172.17.180.15 --add-host=humgen-02-03.internal.sanger.ac.uk:172.17.180.15 --add-host=humgen-03-01:172.17.180.16 --add-host=humgen-03-01.internal.sanger.ac.uk:172.17.180.16 --add-host=humgen-03-02:172.17.180.17 --add-host=humgen-03-02.internal.sanger.ac.uk:172.17.180.17 --add-host=humgen-03-03:172.17.180.18 --add-host=humgen-03-03.internal.sanger.ac.uk:172.17.180.18 --add-host=humgen-04-01:172.17.180.19 --add-host=humgen-04-01.internal.sanger.ac.uk:172.17.180.19 --add-host=humgen-04-02:172.17.180.20 --add-host=humgen-04-02.internal.sanger.ac.uk:172.17.180.20 --add-host=humgen-04-03:172.17.180.21 --add-host=humgen-04-03.internal.sanger.ac.uk:172.17.180.21 --user=crunch sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac id --user'] 2017-02-20_11:53:20 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Unable to find image 'sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac' locally 2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Pulling repository docker.io/library/sha256 2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Error: image library/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac not found 2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr srun: error: humgen-05-04: task 0: Exited with exit code 1 2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check whether user 'crunch' is UID 0: exit 1 2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check whether user 'nobody' is UID 0: start 2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr starting: ['srun','--nodes=1','/bin/sh','-ec','/usr/bin/docker run --add-host=api.arvados.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-01:172.17.180.10 --add-host=humgen-01-01.internal.sanger.ac.uk:172.17.180.10 --add-host=humgen-01-02:172.17.180.11 --add-host=humgen-01-02.internal.sanger.ac.uk:172.17.180.11 --add-host=humgen-01-03:172.17.180.12 --add-host=humgen-01-03.internal.sanger.ac.uk:172.17.180.12 --add-host=humgen-02-01:172.17.180.13 --add-host=humgen-02-01.internal.sanger.ac.uk:172.17.180.13 --add-host=humgen-02-02:172.17.180.14 --add-host=humgen-02-02.internal.sanger.ac.uk:172.17.180.14 --add-host=humgen-02-03:172.17.180.15 --add-host=humgen-02-03.internal.sanger.ac.uk:172.17.180.15 --add-host=humgen-03-01:172.17.180.16 --add-host=humgen-03-01.internal.sanger.ac.uk:172.17.180.16 --add-host=humgen-03-02:172.17.180.17 --add-host=humgen-03-02.internal.sanger.ac.uk:172.17.180.17 --add-host=humgen-03-03:172.17.180.18 --add-host=humgen-03-03.internal.sanger.ac.uk:172.17.180.18 --add-host=humgen-04-01:172.17.180.19 --add-host=humgen-04-01.internal.sanger.ac.uk:172.17.180.19 --add-host=humgen-04-02:172.17.180.20 --add-host=humgen-04-02.internal.sanger.ac.uk:172.17.180.20 --add-host=humgen-04-03:172.17.180.21 --add-host=humgen-04-03.internal.sanger.ac.uk:172.17.180.21 --user=nobody sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac id --user'] 2017-02-20_11:53:21 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Unable to find image 'sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac' locally 2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Pulling repository docker.io/library/sha256 2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr Error: image library/sha256:a4fa354645c849421c8bfc8da71c5b8ade1df1fe25792d196c59f88c11f5ceac not found 2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 stderr srun: error: humgen-05-04: task 0: Exited with exit code 1 2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 check whether user 'nobody' is UID 0: exit 1 2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 Could not find a user in container that is not UID 0 (tried default user, crunch nobody) or there was a problem running 'id' in the container. at /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job line 484 2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 release job allocation 2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 Freeze not implemented 2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 collate 2017-02-20_11:53:22 z8ta6-8i9sb-4lmsmu1z9a9ur9v 39520 collated output manifest text to send to API server is 0 bytes with access tokens
Updated by Tom Clegg over 7 years ago
11138-docker-load-fail uses docker inspect
(before and after docker load
) to assure that docker run aaa123
will actually run an image whose ID is aaa123
. This way we error out if docker load
fails but exits 0.
Testing dd08615e98e656ae07c6ca26eaa635f6ce8f01b7 on 4xphq:
2017-03-03_05:58:23 4xphq-8i9sb-hz036kwpijqh8d7 22432 load docker image: start 2017-03-03_05:58:23 4xphq-8i9sb-hz036kwpijqh8d7 22432 stderr starting: ['srun','--nodelist=compute1','/bin/bash','-o','pipefail','-ec',' loaded() { [[ $(/usr/bin/docker.io inspect --format="{{.ID}}" b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee) = b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee ]] } if loaded 2>/dev/null; then echo >&2 "image is already present" exit 0 fi echo >&2 "docker image is not present; loading" arv-get 76aca9f8f73603945fc5b7b89a84f89d\\+419\\/b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee\\.tar | /usr/bin/docker.io load if ! loaded; then echo >&2 "`docker load` exited 0, but image is not found (!)" exit 1 fi echo >&2 "image loaded successfully" '] 2017-03-03_05:58:23 4xphq-8i9sb-hz036kwpijqh8d7 22432 stderr docker image is not present; loading 2017-03-03_05:59:02 4xphq-8i9sb-hz036kwpijqh8d7 22432 stderr image loaded successfully 2017-03-03_05:59:03 4xphq-8i9sb-hz036kwpijqh8d7 22432 load docker image: exit 0
2017-03-03_05:59:23 4xphq-8i9sb-88v4figwr7qsbnh 24304 load docker image: start 2017-03-03_05:59:23 4xphq-8i9sb-88v4figwr7qsbnh 24304 stderr starting: ['srun','--nodelist=compute1','/bin/bash','-o','pipefail','-ec',' loaded() { [[ $(/usr/bin/docker.io inspect --format="{{.ID}}" b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee) = b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee ]] } if loaded 2>/dev/null; then echo >&2 "image is already present" exit 0 fi echo >&2 "docker image is not present; loading" arv-get 76aca9f8f73603945fc5b7b89a84f89d\\+419\\/b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee\\.tar | /usr/bin/docker.io load if ! loaded; then echo >&2 "`docker load` exited 0, but image is not found (!)" exit 1 fi echo >&2 "image loaded successfully" '] 2017-03-03_05:59:23 4xphq-8i9sb-88v4figwr7qsbnh 24304 stderr image is already present 2017-03-03_05:59:23 4xphq-8i9sb-88v4figwr7qsbnh 24304 load docker image: exit 0
Updated by Lucas Di Pentima over 7 years ago
On line 428: Could it be that "docker load" fails for some reason, and the printed message still say that it exited with 0? If I'm following correctly, we should check for its error level?
On line 430: Maybe it would be useful to tell what we're expecting versus getting, for future debugging purposes.
Updated by Tom Clegg over 7 years ago
Lucas Di Pentima wrote:
On line 428: Could it be that "docker load" fails for some reason, and the printed message still say that it exited with 0? If I'm following correctly, we should check for its error level?
We're using "bash -o pipefail -e" so bash should abort for us immediately if arv-get or docker load exits non-zero.
On line 430: Maybe it would be useful to tell what we're expecting versus getting, for future debugging purposes.
SGTM. At f7d32ebcf2e0641cf92d079387dc325d65e35879
2017-03-03_21:41:27 4xphq-8i9sb-951ervh8xt0o8e0 24826 load docker image: start 2017-03-03_21:41:27 4xphq-8i9sb-951ervh8xt0o8e0 24826 stderr starting: ['srun','--nodelist=compute1','/bin/bash','-o','pipefail','-ec',' loaded() { id=$(/usr/bin/docker.io inspect --format="{{.ID}}" b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee) || return 1 echo "image ID is $id" [[ ${id} = b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee ]] } if loaded >&2 2>/dev/null; then echo >&2 "image is already present" exit 0 fi echo >&2 "docker image is not present; loading" arv-get 76aca9f8f73603945fc5b7b89a84f89d\\+419\\/b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee\\.tar | /usr/bin/docker.io load if ! loaded >&2; then echo >&2 "`docker load` exited 0, but image is not found (!)" exit 1 fi echo >&2 "image loaded successfully" '] 2017-03-03_21:41:27 4xphq-8i9sb-951ervh8xt0o8e0 24826 stderr image ID is b2a72ceb95b9184194ab7a2bb7f0903a4d77a41867e20bcf8e656529078ae2ee 2017-03-03_21:41:27 4xphq-8i9sb-951ervh8xt0o8e0 24826 stderr image is already present 2017-03-03_21:41:27 4xphq-8i9sb-951ervh8xt0o8e0 24826 load docker image: exit 0
Updated by Tom Clegg over 7 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:a4a3227c634d8cfd54a23205be727e910032cbe9.