Bug #20451

closed

Stuck crunch-run issues

Added by Peter Amstutz 11 months ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version: -
Story points: -

Description

One process reported a 503 error from UpdateContainerFinal. crunch-run reported that it had exited, but the dispatcher still thinks it is running and won't shut down the node.

Another process reported "error updating exit_code: 503 error" and later "error in CaptureOutput: error retrieving collection record: 503 error".

Also seen: error saving log collection: error recording logs: %!q(<nil>), "503 error"
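The `%!q(<nil>)` fragment in the message above is what Go's `fmt` package prints when a `%q` verb receives a nil value. That suggests, but does not confirm, that the log call formatted two values with `%q` and the first one was nil. The sketch below reproduces the shape of the message; `formatLogError` and its format string are hypothetical and are not taken from the crunch-run source.

```go
package main

import "fmt"

// formatLogError is a hypothetical stand-in for a log call that formats
// two values with %q. It is not the actual crunch-run code.
func formatLogError(first, second interface{}) string {
	return fmt.Sprintf("error recording logs: %q, %q", first, second)
}

func main() {
	// Passing nil for the first value makes fmt emit its "bad verb" marker.
	fmt.Println(formatLogError(nil, "503 error"))
	// prints: error recording logs: %!q(<nil>), "503 error"
}
```

If this reading is right, only the second value carried a real error, and the first placeholder was filled from something that was unset at the time.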

Have also seen containers that were reported as "Complete" in the live log but whose log collection does not contain the complete log.

Maybe the OOM killer is getting to crunch-run sometimes? Sometimes it just stops logging entirely, and there's nothing useful in the logs stored in Keep, either. Or maybe failing to commit the logs to Keep causes the logging system to seize up?

I did see "error updating container log: 503" on another process.

Got "docker watchdog: error inspecting container: context deadline exceeded" and "container exited with status code 0" but it is still shown as running.

Some of the steps get through most (all?) of the "Copying" lines and then seize up.

For some reason, many (although not all) of the steps that froze up were running "STAR".

An arv-mount exception during setup can result in a stuck mount; maybe this is not handled properly?


Related issues

Related to Arvados - Bug #20540: crunch-run should sleep-and-retry after transient failures on API calls, especially when container is succeeding (Resolved, Tom Clegg, 05/30/2023)