Bug #20595

"error inspecting container" causing containers to be abandoned

Added by Peter Amstutz 10 months ago. Updated 10 months ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: Docker
Target version: -
Story points: -
Release relationship: Auto

Description

In #19437 we made crunch-run wait for 3 consecutive inspect failures before abandoning a container.

It appears that 3 times (or 3 minutes, at any rate) is not enough.

build-and-publish-rc-packages: #191
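
For illustration, here is a minimal sketch of the kind of watchdog loop being discussed: probe the container periodically and only give up after N consecutive failures. The function, interval, and threshold here are placeholders, not the actual crunch-run implementation.

```go
// Hypothetical sketch of the watchdog behavior discussed in this ticket:
// probe the container once per interval and abandon it only after
// maxFailures consecutive probe errors. The inspect callback, interval,
// and threshold are placeholders, not the real crunch-run code.
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

func watchContainer(ctx context.Context, interval time.Duration, maxFailures int, inspect func(context.Context) error) error {
	consecutive := 0
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			probeCtx, cancel := context.WithTimeout(ctx, interval)
			err := inspect(probeCtx)
			cancel()
			if err == nil {
				consecutive = 0 // any success resets the streak
				continue
			}
			consecutive++
			log.Printf("error inspecting container (%d consecutive): %v", consecutive, err)
			if consecutive >= maxFailures {
				return err // give up and abandon the container
			}
		}
	}
}

func main() {
	// Demo with a probe that always fails; with maxFailures=3 this
	// returns after three consecutive errors (the #19437 behavior).
	failing := func(context.Context) error { return errors.New("context deadline exceeded") }
	err := watchContainer(context.Background(), time.Second, 3, failing)
	log.Printf("watchdog gave up: %v", err)
}
```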


Related issues

Related to Arvados - Bug #19437: [crunch-run] Require >1 watchdog errors before giving up and killing docker container (Resolved, Tom Clegg, 09/02/2022)
#1

Updated by Peter Amstutz 10 months ago

  • Status changed from New to In Progress
#2

Updated by Peter Amstutz 10 months ago

  • Related to Bug #19437: [crunch-run] Require >1 watchdog errors before giving up and killing docker container added
#3

Updated by Peter Amstutz 10 months ago

  • Description updated (diff)
#4

Updated by Tom Clegg 10 months ago

Is it possible to post some logs here? In particular I'd like to know what the docker inspect error was ("context deadline exceeded" seems to be the usual error, which is a 1m timeout) and whether the errors are related to any particular part of the container life cycle, e.g., after the container process exits but before docker wait returns.

In principle, we could compare the number of 1-, 2-, and 3-consecutive-failure events. If 2x were much rarer than 1x and 3x, that would suggest the 3x cases were indeed futile and that 3 is a reasonable threshold. But I don't think we have an easy way to report the number of occurrences on a given cluster.
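
For what it's worth, a hypothetical way to gather that data would be to record the length of each failure streak before a success, e.g. a simple histogram kept by the watchdog loop; the counter below is a sketch, not something crunch-run currently does.

```go
// Hypothetical instrumentation: record how many consecutive inspect
// failures occur before a success, so 1x / 2x / 3x streaks can be
// compared. Not present in crunch-run; sketch only.
package main

import "log"

type streakHistogram map[int]int

// observe is called with true on a failed probe and false on a success;
// it records the length of each completed failure streak.
func (h streakHistogram) observe(current *int, failed bool) {
	if failed {
		*current++
		return
	}
	if *current > 0 {
		h[*current]++
		log.Printf("inspect recovered after %d consecutive failures", *current)
		*current = 0
	}
}

func main() {
	h := streakHistogram{}
	streak := 0
	// Simulated probe results: two short streaks (lengths 1 and 2).
	for _, failed := range []bool{true, false, true, true, false} {
		h.observe(&streak, failed)
	}
	log.Printf("streak counts: %v", h) // map[1:1 2:1]
}
```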

In any case, true positives (docker really gets into a state where docker wait never returns) seem to be quite rare, so a high threshold (10 x 1m intervals?) would probably be a reasonable choice.

#5

Updated by Peter Amstutz 10 months ago

  • Assigned To set to Peter Amstutz
#6

Updated by Peter Amstutz 10 months ago

I'm having trouble finding the recent example of where this happened.

The gist was that I was seeing the same behavior described in #19437, where the task would run to apparent completion but then time out with "context deadline exceeded" while waiting for the exit code. The only difference is that we changed it to require 3 consecutive timeouts, and that doesn't seem to be long enough. I don't know what Docker is doing during this time, but it seems to be tearing down the container.

I changed it to a 2-minute watchdog interval and 5 consecutive failures in 97aca635adc2fc448e47984fe1c8974b44f7a656, and (on the release candidate) that resolved the user's issue.

I think we definitely need the longer wait period; the only question is whether we should put a little extra effort into making these values configurable or just merge this to main and move on.
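
If we did make them configurable, the simplest shape would probably be a pair of options for the probe interval and the failure threshold, along the lines of the sketch below (defaults matching the 2m/5 values from 97aca635); the flag names are hypothetical, not existing crunch-run options.

```go
// Hypothetical sketch of exposing the watchdog tuning as flags.
// These flag names do not exist in crunch-run; illustration only.
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	interval := flag.Duration("container-inspect-interval", 2*time.Minute,
		"how often to probe the container with docker inspect")
	maxFailures := flag.Int("container-inspect-max-failures", 5,
		"consecutive probe failures tolerated before abandoning the container")
	flag.Parse()
	fmt.Printf("watchdog: probe every %s, give up after %d consecutive failures\n",
		*interval, *maxFailures)
}
```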

#7

Updated by Peter Amstutz 10 months ago

  • Release set to 64
#8

Updated by Tom Clegg 10 months ago

I'm suspicious that docker never really recovers from this state, and we're essentially setting a 10 minute time limit on some stuff docker does after the container process exits but before our wait() call returns. Cleaning up the temp filesystem, or something.

Maybe there's another thing we can do ("docker ps"??) that can reassure us docker is still alive even if "inspect" hangs during that interval?
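As a rough sketch of that kind of side-channel check (using the Docker Go SDK's ping endpoint rather than a literal "docker ps", and with hypothetical names), assuming the goal is just to confirm the daemon answers API calls that don't involve our container:

```go
// Hypothetical liveness probe: ask the Docker daemon something that does
// not involve our container, so a hung "inspect" can be distinguished
// from a dead daemon. Sketch only, not crunch-run code.
package main

import (
	"context"
	"log"
	"time"

	"github.com/docker/docker/client"
)

func dockerAlive(ctx context.Context, cli *client.Client) bool {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	_, err := cli.Ping(ctx) // GET /_ping, independent of any container
	if err != nil {
		log.Printf("docker daemon not responding: %v", err)
		return false
	}
	return true
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	log.Printf("daemon alive: %v", dockerAlive(context.Background(), cli))
}
```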

To me that seems more interesting than making the timeout configurable.

Either way, merge to main, yes.

#9

Updated by Peter Amstutz 10 months ago

Tom Clegg wrote in #note-8:

> I'm suspicious that docker never really recovers from this state, and we're essentially setting a 10 minute time limit on some stuff docker does after the container process exits but before our wait() call returns. Cleaning up the temp filesystem, or something.

I don't know what you mean by "never really recovers from this state". It does eventually return an exit code from wait() and everything proceeds as normal; there's just this several-minutes-long period where trying to get container status blocks.

> Maybe there's another thing we can do ("docker ps"??) that can reassure us docker is still alive even if "inspect" hangs during that interval?

I don't know; what we want to know is whether Docker is responding to API calls, and it could be deadlocked and still show up in the process listing. Perhaps we could ping docker with other API calls, just not ones directly involving our container.

> To me that seems more interesting than making the timeout configurable.
>
> Either way, merge to main, yes.

Will do.

#10

Updated by Peter Amstutz 10 months ago

  • Status changed from In Progress to Resolved
#11

Updated by Tom Clegg 10 months ago

Peter Amstutz wrote in #note-9:

> I don't know what you mean by "never really recovers from this state". It does eventually return an exit code from wait() and everything proceeds as normal; there's just this several-minutes-long period where trying to get container status blocks.

That's all I mean by "never really recovers": possibly, after a certain point, "docker inspect" just hangs until the container exits, making it utterly useless as a watchdog probe from that point on.

