Bug #14705

open

Weird container rerun on fail?

Added by Bryan Cosca almost 6 years ago. Updated 10 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
Story points:
-
Release:
Release relationship:
Auto

Description

top level cr: https://workbench.e51c5.arvadosapi.com/container_requests/e51c5-xvhdp-qw8upy1814gij1q#Status

child cr: https://workbench.e51c5.arvadosapi.com/container_requests/e51c5-xvhdp-zvpx6tnr9w44ikc

This container started at 2:50 PM on 1/7/2019, but the page says it started at 4:32 AM on 1/7/2019. I definitely saw it running at 3 or 4 PM on 1/7. There's this line: "It has runtime of 4h39m (13h42m queued) and used 4h39m of node allocation time (1.0× scaling)", but I know it wasn't queued for 13h.

My theory is that some failure or restart caused it to start over at 4:32 AM, but I don't see any sign of this in the logs.

This job should also finish in about 20 minutes, so I'm confused about what it was doing for that long (there's not much in the logs). CPU usage averaged around 730% according to the HTML crunchstat summary. Memory usage was low.


Actions #1

Updated by Bryan Cosca almost 6 years ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz almost 6 years ago

The container request record has "container_count": 2, which means it was on its second try.

We don't track every container associated with a container request. This is a limitation. Filed #14706.
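For anyone hitting the same confusion, here is a minimal sketch of checking a container request record for retries. It assumes the record has already been fetched as a plain dict (for example through an Arvados API client) and relies only on the "container_count" field quoted above plus "container_uuid", which is assumed here to hold the current container. The function name and the sample UUID are hypothetical, for illustration only.

```python
from typing import Any, Mapping


def summarize_attempts(cr: Mapping[str, Any]) -> str:
    """Describe how many container attempts a container request record reports.

    `cr` is assumed to be a container request record as a dict. Only two
    fields are read: "container_count" and "container_uuid".
    """
    count = cr.get("container_count") or 0
    current = cr.get("container_uuid")
    if count == 0:
        return "no container has been assigned yet"
    if count == 1:
        return f"first attempt, container {current}"
    # Only the current container is visible on the record; earlier
    # attempts are not linked from it (the limitation noted above).
    return (
        f"attempt {count}, current container {current}; "
        f"{count - 1} earlier attempt(s) are not linked from this record"
    )


# Hypothetical record trimmed to the two fields used here.
example = {"container_count": 2, "container_uuid": "zzzzz-dz642-xxxxxxxxxxxxxxx"}
print(summarize_attempts(example))
```

A count above 1 would explain a start time that doesn't match when the run was first seen: the times shown belong to the latest attempt, not the first one.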

Actions #3

Updated by Peter Amstutz almost 6 years ago

No indication that it is thrashing the Keep cache, which makes me suspect the bug is the actual script getting stuck in a loop and not Arvados (although I wish we knew why the first container attempt eventually failed).

Actions #4

Updated by Peter Amstutz almost 2 years ago

  • Release set to 60
Actions #5

Updated by Peter Amstutz 10 months ago

  • Target version set to Future