Bug #20595
closed
"error inspecting container" causing containers to be abandoned
Added by Peter Amstutz over 1 year ago. Updated over 1 year ago.
Description
In #19437 we made it wait for 3 consecutive failures before abandoning a container.
It appears that 3 times (or 3 minutes, at any rate) is not enough.
Related issues
Updated by Peter Amstutz over 1 year ago
- Related to Bug #19437: [crunch-run] Require >1 watchdog errors before giving up and killing docker container added
Updated by Tom Clegg over 1 year ago
Is it possible to post some logs here? In particular I'd like to know what the docker inspect error was ("context deadline exceeded" seems to be the usual error, which is a 1m timeout) and whether the errors are related to any particular part of the container life cycle, e.g., after the container process exits but before docker wait returns.
In principle, we could compare the number of 1-, 2-, and 3-consecutive-failure events. If 2x was much more rare than 1x and 3x, that would suggest that the 3x cases were indeed futile and 3 is a reasonable threshold. But I don't think we have an easy way to report #occurrences on a given cluster.
In any case, true positives (docker really gets into a state where docker wait never returns) seem to be quite rare, so a high threshold (10 x 1m intervals?) would probably be a reasonable choice.
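The comparison of 1-, 2-, and 3-consecutive-failure events suggested above could be sketched roughly as follows, assuming a per-container history of probe results were available (the thread notes there is currently no easy way to collect this). The `runLengths` function and the sample history are hypothetical and are not part of crunch-run.

```go
package main

import "fmt"

// runLengths tallies maximal runs of consecutive failed probes.
// probes[i] is true when the i-th watchdog probe succeeded.
// The result maps run length to the number of runs of that length.
func runLengths(probes []bool) map[int]int {
	counts := make(map[int]int)
	run := 0
	for _, ok := range probes {
		if ok {
			if run > 0 {
				counts[run]++
			}
			run = 0
		} else {
			run++
		}
	}
	if run > 0 {
		counts[run]++ // trailing run that had not recovered when the history ended
	}
	return counts
}

func main() {
	// Hypothetical history: true = inspect succeeded, false = inspect timed out.
	history := []bool{true, false, true, false, false, false, true, false, true}
	fmt.Println(runLengths(history)) // prints map[1:2 3:1]
}
```

If runs of length 2 turned out to be much rarer than runs of length 1 and 3, that would support the reasoning above that 3 is a reasonable threshold.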
Updated by Peter Amstutz over 1 year ago
I'm having trouble finding the recent example of where this happened.
The gist was that I was seeing the same behavior described in #19437, where the task would run to apparent completion but then time out with "context deadline exceeded" while waiting for the exit code. The only difference is that we changed it to require 3 consecutive timeouts and that doesn't seem to be long enough. I don't know what Docker is doing during this time but it seems to be tearing down the container.
I changed it to a 2 minute watchdog and 5 failures in 97aca635adc2fc448e47984fe1c8974b44f7a656 and (on the release candidate) that resolved the user's issue.
I think we definitely need the longer wait period. The only question is whether we should put a little extra effort into making these values configurable, or just merge this to main and move on.
Updated by Tom Clegg over 1 year ago
I'm suspicious that docker never really recovers from this state, and we're essentially setting a 10 minute time limit on some stuff docker does after the container process exits but before our wait() call returns. Cleaning up the temp filesystem, or something.
Maybe there's another thing we can do ("docker ps"??) that can reassure us docker is still alive even if "inspect" hangs during that interval?
To me that seems more interesting than making the timeout configurable.
Either way, merge to main, yes.
Updated by Peter Amstutz over 1 year ago
Tom Clegg wrote in #note-8:
I'm suspicious that docker never really recovers from this state, and we're essentially setting a 10 minute time limit on some stuff docker does after the container process exits but before our wait() call returns. Cleaning up the temp filesystem, or something.
I don't know what you mean by "never really recovers from this state"? It does eventually return an exit code from wait() and everything proceeds as normal, there's just this several-minutes long period where trying to get container status blocks.
Maybe there's another thing we can do ("docker ps"??) that can reassure us docker is still alive even if "inspect" hangs during that interval?
I don't know. What we want to know is whether Docker is responding to API calls; it could be deadlocked and still show up in the process listing. Perhaps we could ping docker with other API calls, just not ones directly involving our container.
To me that seems more interesting than making the timeout configurable.
Either way, merge to main, yes.
Will do.
Updated by Peter Amstutz over 1 year ago
- Status changed from In Progress to Resolved
Updated by Tom Clegg over 1 year ago
Peter Amstutz wrote in #note-9:
I don't know what you mean by "never really recovers from this state"? It does eventually return an exit code from wait() and everything proceeds as normal, there's just this several-minutes long period where trying to get container status blocks.
That's all I mean by "never really recovers": possibly, after a certain point, "docker inspect" just hangs until the container exits, making it utterly useless as a watchdog probe from that point on.