Bug #8810
Closed
[Crunch] `docker load` fails to connect to endpoint; srun exits 0
Description
2016-03-22_16:33:38 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr starting: ['srun','--nodelist=compute11','/bin/bash','-o','pipefail','-ec',' if ! docker.io images -q --no-trunc --all | grep -qxF d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83; then arv-get 17b65db74aae73465b5e286d1cdb0e23\\+798\\/d33416e64af4370471ed15d19211e84991a8e158626199f4e4747e4310144b83\\.tar | docker.io load fi ']
2016-03-22_16:33:40 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr Post http:///var/run/docker.sock/v1.20/images/load: EOF.
2016-03-22_16:33:40 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr * Are you trying to connect to a TLS-enabled daemon without TLS?
2016-03-22_16:33:40 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr * Is your docker daemon up and running?
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr srun: error: Node failure on compute11
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-03-22_16:41:14 wx7k5-8i9sb-ose8gk9vuxqe9gd 48074 load docker image: exit 0
From here, the job continued running and generating errors until the UID 0 check failed. Instead, crunch-job should detect this error and exit so that crunch-dispatch retries the job.
Updated by Brett Smith over 8 years ago
- Target version set to Arvados Future Sprints
Updated by Brett Smith over 8 years ago
Note the large time gap between the Docker error and the node failure. That seems suspicious.
Updated by Brett Smith over 8 years ago
Need to check the Docker API docs: is the failing command the one in the `if` condition? If so, that would explain why it exits 0. We could add `else; exit ${PIPESTATUS[0]}` to return the exit code of `docker images`.
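To illustrate the `PIPESTATUS` idea, here is a minimal bash sketch. `list_ok` and `list_broken` are hypothetical stand-ins for `docker images -q` (they never contact a Docker daemon), and `lookup_statuses` is an illustrative helper, not part of crunch-job. It shows that the `grep` status alone cannot tell "image absent" from "listing failed", while the first `PIPESTATUS` entry can.

```shell
#!/bin/bash
# Hypothetical stand-ins for `docker images -q`; neither talks to a real daemon.
list_ok()     { printf '%s\n' aaa bbb; }   # daemon reachable, prints image IDs
list_broken() { return 7; }                # daemon unreachable, prints nothing

# Print "<lister status> <grep status>" for a lookup of image ID $2 via lister $1.
lookup_statuses() {
    "$1" | grep -qxF -- "$2"
    # Read PIPESTATUS right away: the next command that runs overwrites it.
    echo "${PIPESTATUS[0]} ${PIPESTATUS[1]}"
}

lookup_statuses list_ok aaa       # image present:  prints "0 0"
lookup_statuses list_ok zzz       # image absent:   prints "0 1"
lookup_statuses list_broken aaa   # listing failed: prints "7 1"
```

In the last two cases `grep` exits 1 either way, so a check that only looks at the pipeline result would go on to load the image in both.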
Updated by Brett Smith over 8 years ago
- Status changed from Feedback to In Progress
- Assigned To set to Brett Smith
- Target version changed from Arvados Future Sprints to 2016-04-13 sprint
Updated by Tom Clegg over 8 years ago
We have this:

    if ! docker images | grep
    then
        arv-get | docker load
    fi

If `docker images` fails, then we end up calling `arv-get`. I suspect this is incorrect. We should probably abort if `docker images` fails: we're probably just wasting our time running `docker load`.
We probably want something more like this:

    if docker images | grep; then
        true
    elif [[ ${PIPESTATUS[0]} == 0 ]]; then
        arv-get | docker load
    else
        exit ${PIPESTATUS[0]}
    fi

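A runnable sketch of that three-way branch, under some assumptions: `ensure_image` and the four stub functions are hypothetical names invented for this example, the stubs replace the real `docker images` and `arv-get | docker load` commands, and this is not the code from the branch under review. The sketch copies the pipeline statuses into an array before testing them, which sidesteps any question of whether `${PIPESTATUS[0]}` still refers to the listing pipeline after the `[[ ... ]]` test has run.

```shell
#!/bin/bash
# ensure_image LIST_CMD LOAD_CMD IMAGE_ID
# LIST_CMD and LOAD_CMD are placeholders for the real listing and loading commands.
ensure_image() {
    local list_cmd=$1 load_cmd=$2 image_id=$3
    local -a statuses

    "$list_cmd" | grep -qxF -- "$image_id"
    # Save both exit statuses before any other command can overwrite PIPESTATUS.
    statuses=("${PIPESTATUS[@]}")

    if (( statuses[0] != 0 )); then
        return "${statuses[0]}"   # listing failed: abort and keep its exit status
    elif (( statuses[1] == 0 )); then
        return 0                  # listing worked and the image is already present
    fi
    "$load_cmd"                   # listing worked, image absent: load it
}

# Hypothetical stubs so the sketch runs without Docker or Keep.
list_has_image() { echo img1; }
list_no_image()  { echo other; }
list_fails()     { return 5; }
fake_load()      { echo "loading"; }

for lister in list_has_image list_no_image list_fails; do
    rc=0
    ensure_image "$lister" fake_load img1 || rc=$?
    echo "$lister -> exit $rc"
done
```

Only the `list_no_image` case prints `loading`; the `list_fails` case returns 5 without attempting a load.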
However, I don't think this is the bug we're seeing here.
The "node failure" and "job step aborted" messages suggest slurm (not our shell script) is behaving badly by exiting 0.
But it's also possible that `docker load` really did succeed:
- `docker images` failed, and exited non-zero
- ...so `docker images | grep` failed
- ...so we tried `docker load`
- `docker load` succeeded, and exited zero
- Node failed during srun teardown
- Slurm printed error messages about the node failure
- Slurm propagated the `docker load` exit status
- crunch-job asked slurm to do other things with the node
- of course the other things failed because the node had failed
In this case, the bug amounts to "if a node fails between times A and B, we get "node failure" error messages before the node failure impacts our ability to run programs on that node". In that case, it seems much less worrisome than "exited 0 on failure".
Updated by Brett Smith over 8 years ago
Tom Clegg wrote:
We probably want something more like this:
[snip]
The branch I have up for review implements that.
In this case, the bug amounts to "if a node fails between times A and B, we get "node failure" error messages before the node failure impacts our ability to run programs on that node". In that case, it seems much less worrisome than "exited 0 on failure".
I believe we changed our behavior here in 86f7740. Now crunch-job will "see" the node failure message, and treat the setup as failed as a result. This seems like an improvement to me. If a node failure happened during a setup task, things are unlikely to go well when we start dispatching actual tasks to it. Better to fail as soon as we see trouble, to save time and avoid making logs more confusing with additional error messages.
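As a rough bash illustration of that rule (not the actual change in 86f7740, which is not shown in this ticket): `setup_ok` is a hypothetical helper that treats a setup step as failed when it exits non-zero or when its captured stderr contains the node failure line seen in the log above. The match string is copied from that log and is an assumption about how srun words the message.

```shell
#!/bin/bash
# setup_ok EXIT_STATUS STDERR_FILE
# Succeeds only if the step exited 0 AND its stderr has no srun node-failure line.
setup_ok() {
    local exit_status=$1 stderr_file=$2
    if (( exit_status != 0 )); then
        return 1
    fi
    # Fixed-string match on the message from this ticket's log (assumed wording).
    if grep -qF -e 'srun: error: Node failure on ' "$stderr_file"; then
        return 1
    fi
    return 0
}

# Demo: exit status 0, but stderr reports a node failure, as in this ticket.
log=$(mktemp)
printf '%s\n' \
    'srun: error: Node failure on compute11' \
    'srun: Job step aborted: Waiting up to 2 seconds for job step to finish.' > "$log"
if setup_ok 0 "$log"; then echo "setup ok"; else echo "setup failed"; fi
rm -f "$log"
```

With the two log lines from this ticket, the demo prints `setup failed` even though the reported exit status is 0.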
Updated by Tom Clegg over 8 years ago
8810-crunch-improve-docker-loading-wip @ df1447d LGTM. Thanks.
Indeed, that seems to have already fixed the "confusing sequence of error messages" aspect. If I'm following the timing properly, #8810 and #8811 are different problems arising from the same underlying bug, "failed to notice early signs that the node is broken, and failed the job as a result", and both are fixed by the 8811 branch. And 8810 happened to help us see & fix another bug.
Updated by Brett Smith over 8 years ago
Tom Clegg wrote:
Indeed, that seems to have already fixed the "confusing sequence of error messages" aspect. If I'm following the timing properly, #8810 and #8811 are different problems arising from the same underlying bug, "failed to notice early signs that the node is broken, and failed the job as a result", and both are fixed by the 8811 branch. And 8810 happened to help us see & fix another bug.
Yeah, Peter made a similar comment at last week's backlog grooming, so I think we're all agreed on that now. Thanks.
Updated by Brett Smith over 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:2d18e31468b269e526cb6e4918912b60a48540b2.