Bug #19437

closed

[crunch-run] Require >1 watchdog errors before giving up and killing docker container

Added by Peter Amstutz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: -
Release relationship: Auto

Description

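Observed on a customer cluster. The container below hit an "Error inspecting container" timeout and was cancelled, even though it appears to have run to completion and was only cancelled at the very end. This seems to have failed multiple times before eventually succeeding.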

2022-08-31T00:00:01.820945772Z Creating Docker container
2022-08-31T00:00:09.932234553Z Starting container
2022-08-31T00:00:10.896745626Z Waiting for container to finish
2022-08-31T02:25:10.898243240Z Error inspecting container: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.21/containers/230188325e24f42d3ad8dfd8ceef5c7069733bacdaafe7adaf5bf5a3c4c644f5/json": context deadline exceeded
2022-08-31T02:25:10.898483541Z error in Run: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.21/containers/230188325e24f42d3ad8dfd8ceef5c7069733bacdaafe7adaf5bf5a3c4c644f5/json": context deadline exceeded
2022-08-31T02:38:12.612609772Z copying "/temp.txt" (0 bytes)
2022-08-31T02:38:13.468649279Z Cancelled

Subtasks 1 (0 open, 1 closed)

Task #19443: Review 19437-docker-watchdog (Resolved, Peter Amstutz, 09/02/2022)

Related issues 1 (0 open, 1 closed)

Related to Arvados - Bug #20595: "error inspecting container" causing containers to be abandoned (Resolved, Peter Amstutz)
Actions #1

Updated by Peter Amstutz over 2 years ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz over 2 years ago

  • Subject changed from Error inspecting container to Error inspecting container: context deadline exceeded
Actions #3

Updated by Peter Amstutz over 2 years ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz over 2 years ago

  • Target version changed from 2022-08-31 sprint to 2022-09-14 sprint
Actions #5

Updated by Tom Clegg over 2 years ago

This means ContainerInspect took >1 minute, and (according to dockerclient.ContainerWait) the container hasn't finished, which we take to mean that docker has died / become unresponsive.

Whether or not the docker daemon is in fact dead/unresponsive in this case, it would be more convincing (and no less robust wrt avoiding the waiting-forever problem the watchdog solves) if we just log a warning on a single ContainerInspect failure/timeout, and error out only after two consecutive failures.

Actions #6

Updated by Peter Amstutz over 2 years ago

Tom Clegg wrote in #note-5:

This means ContainerInspect took >1 minute, and (according to dockerclient.ContainerWait) the container hasn't finished, which we take to mean that docker has died / become unresponsive.

Whether or not the docker daemon is in fact dead/unresponsive in this case, it would be more convincing (and no less robust wrt avoiding the waiting-forever problem the watchdog solves) if we just log a warning on a single ContainerInspect failure/timeout, and error out only after two consecutive failures.

It seems likely that the Docker daemon is in the throes of tearing down the container and either there's an edge case it can fall into where the Inspect request gets dropped, or it really just takes 1+ minute to shut down some containers.

I think it would be a good idea to count 2 or 3 consecutive failures before giving up.

Actions #7

Updated by Peter Amstutz over 2 years ago

  • Category set to Crunch
Actions #8

Updated by Tom Clegg over 2 years ago

  • Assigned To set to Tom Clegg
  • Subject changed from Error inspecting container: context deadline exceeded to [crunch-run] Require >1 watchdog errors before giving up and killing docker container
Actions #9

Updated by Tom Clegg over 2 years ago

  • Status changed from New to In Progress
Actions #10

Updated by Tom Clegg over 2 years ago

  • Target version changed from 2022-09-14 sprint to 2022-09-28 sprint
Actions #11

Updated by Peter Amstutz over 2 years ago

This LGTM

Actions #12

Updated by Tom Clegg over 2 years ago

cherry-picked to 2.4-staging as 8ad66154df528ad2020e80bc255896537f1c712a

Actions #13

Updated by Tom Clegg over 2 years ago

  • Status changed from In Progress to Resolved
Actions #14

Updated by Peter Amstutz over 2 years ago

  • Release set to 53
Actions #15

Updated by Peter Amstutz over 1 year ago

  • Related to Bug #20595: "error inspecting container" causing containers to be abandoned added