Bug #14705
Weird container rerun on fail? (open)
Description
top level cr: https://workbench.e51c5.arvadosapi.com/container_requests/e51c5-xvhdp-qw8upy1814gij1q#Status
child cr: https://workbench.e51c5.arvadosapi.com/container_requests/e51c5-xvhdp-zvpx6tnr9w44ikc
This container started at 2:50 PM 1/7/2019, but the page says it started at 4:32 AM 1/7/2019, and I definitely saw it running at 3/4 PM on 1/7. There's this line: "It has runtime of 4h39m (13h42m queued) and used 4h39m of node allocation time (1.0× scaling)", but I know it wasn't queued for 13h.
My theory is that some failure or restart caused it to restart at 4:32 AM, but I don't see this in the logs.
This job should also finish in 20 mins, so I'm confused about what it was doing for that long (there's not much in the logs). CPU usage averaged around 730% according to the HTML crunchstat summary. Memory usage was low.
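As a sanity check on the numbers quoted above, here is a minimal sketch that works backwards from the reported start time and queue time. It assumes both figures are in the same local time zone and that the "queued" figure is measured back from the reported start; the helper name `implied_created_at` is made up for illustration and is not part of Arvados.

```python
from datetime import datetime, timedelta

# Values copied from the report; time zone is assumed to be the same for both.
reported_start = datetime(2019, 1, 7, 4, 32)   # "started at 4:32 AM 1/7/2019"
queued = timedelta(hours=13, minutes=42)       # "13h42m queued"


def implied_created_at(start: datetime, queued_for: timedelta) -> datetime:
    """Hypothetical helper: the moment the queue clock would have started."""
    return start - queued_for


print(implied_created_at(reported_start, queued))  # 2019-01-06 14:50:00
```

Under those assumptions the queue clock would have started at 2:50 PM, the same time of day as in the report, but on 1/6 rather than 1/7.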
Updated by Peter Amstutz almost 6 years ago
The container request record has "container_count": 2, which means it was on its second try.
We don't track every container associated with a container request. This is a limitation. Filed #14706
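A minimal sketch of how the check described above could be scripted, assuming the container request record is already available as a plain dict (for example, fetched through the Arvados API). The function name `describe_attempts` is hypothetical; only the `container_count` and `container_uuid` fields are read, and the UUIDs in the usage lines are placeholders.

```python
def describe_attempts(container_request: dict) -> str:
    """Summarize how many containers a request has gone through.

    Hypothetical helper: reads only 'container_count' and 'container_uuid'
    from a container request record passed in as a dict.
    """
    count = container_request.get("container_count", 0)
    current = container_request.get("container_uuid")
    if count <= 0:
        return "no container has been assigned yet"
    if count == 1:
        return f"first attempt (container {current})"
    # Only the most recent container is referenced by the request record,
    # so earlier attempts cannot be looked up from it.
    return f"attempt {count} (container {current}); {count - 1} earlier attempt(s) not recorded"


# Placeholder record mirroring the situation in this ticket.
record = {"container_count": 2, "container_uuid": "zzzzz-dz642-placeholder0001"}
print(describe_attempts(record))
```

A count above 1 only says that a retry happened. It does not say why the earlier attempt ended, which is the limitation noted above.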
Updated by Peter Amstutz almost 6 years ago
No indication that it is thrashing the Keep cache, which makes me suspect the bug is the actual script getting stuck in a loop and not Arvados (although I wish we knew why the first container attempt eventually failed).