Bug #19437

closed

[crunch-run] Require >1 watchdog errors before giving up and killing docker container

Added by Peter Amstutz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: -
Release relationship: Auto

Description

Observed on a customer cluster. This seems to have failed multiple times but eventually succeeded (it seems to have run to completion and was only canceled at the very end).

2022-08-31T00:00:01.820945772Z Creating Docker container
2022-08-31T00:00:09.932234553Z Starting container
2022-08-31T00:00:10.896745626Z Waiting for container to finish
2022-08-31T02:25:10.898243240Z Error inspecting container: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.21/containers/230188325e24f42d3ad8dfd8ceef5c7069733bacdaafe7adaf5bf5a3c4c644f5/json": context deadline exceeded
2022-08-31T02:25:10.898483541Z error in Run: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.21/containers/230188325e24f42d3ad8dfd8ceef5c7069733bacdaafe7adaf5bf5a3c4c644f5/json": context deadline exceeded
2022-08-31T02:38:12.612609772Z copying "/temp.txt" (0 bytes)
2022-08-31T02:38:13.468649279Z Cancelled

Subtasks 1 (0 open, 1 closed)

Task #19443: Review 19437-docker-watchdog | Resolved | Peter Amstutz | 09/02/2022

Related issues

Related to Arvados - Bug #20595: "error inspecting container" causing containers to be abandoned | Resolved | Peter Amstutz
Actions #1

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz over 1 year ago

  • Subject changed from Error inspecting container to Error inspecting container: context deadline exceeded
Actions #3

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-08-31 sprint to 2022-09-14 sprint
Actions #5

Updated by Tom Clegg over 1 year ago

This means ContainerInspect took >1 minute, and (according to dockerclient.ContainerWait) the container hasn't finished, which we take to mean that docker has died / become unresponsive.

Whether or not the docker daemon is in fact dead/unresponsive in this case, it would be more convincing (and no less robust wrt avoiding the waiting-forever problem the watchdog solves) if we just log a warning on a single ContainerInspect failure/timeout, and error out only after two consecutive failures.

Actions #6

Updated by Peter Amstutz over 1 year ago

Tom Clegg wrote in #note-5:

This means ContainerInspect took >1 minute, and (according to dockerclient.ContainerWait) the container hasn't finished, which we take to mean that docker has died / become unresponsive.

Whether or not the docker daemon is in fact dead/unresponsive in this case, it would be more convincing (and no less robust wrt avoiding the waiting-forever problem the watchdog solves) if we just log a warning on a single ContainerInspect failure/timeout, and error out only after two consecutive failures.

It seems likely that the Docker daemon is in the throes of tearing down the container and either there's an edge case it can fall into where the Inspect request gets dropped, or it really just takes 1+ minute to shut down some containers.

I think it would be a good idea to count 2 or 3 consecutive failures before giving up.

Actions #7

Updated by Peter Amstutz over 1 year ago

  • Category set to Crunch
Actions #8

Updated by Tom Clegg over 1 year ago

  • Assigned To set to Tom Clegg
  • Subject changed from Error inspecting container: context deadline exceeded to [crunch-run] Require >1 watchdog errors before giving up and killing docker container
Actions #9

Updated by Tom Clegg over 1 year ago

  • Status changed from New to In Progress
Actions #10

Updated by Tom Clegg over 1 year ago

  • Target version changed from 2022-09-14 sprint to 2022-09-28 sprint
Actions #11

Updated by Peter Amstutz over 1 year ago

This LGTM

Actions #12

Updated by Tom Clegg over 1 year ago

cherry-picked to 2.4-staging as 8ad66154df528ad2020e80bc255896537f1c712a

Actions #13

Updated by Tom Clegg over 1 year ago

  • Status changed from In Progress to Resolved
Actions #14

Updated by Peter Amstutz over 1 year ago

  • Release set to 53
Actions #15

Updated by Peter Amstutz 11 months ago

  • Related to Bug #20595: "error inspecting container" causing containers to be abandoned added