Bug #20451
closed
Stuck crunch-run issues
Description
A process reported a 503 error from UpdateContainerFinal. crunch-run reported that it exited, but the dispatcher still thinks it is running and won't shut down the node.
Another process reported "error updating exit_code: 503 error" and later "error in CaptureOutput: error retrieving collection record: 503 error"
Also "error saving log collection: error recording logs: %!q(<nil>), "503 error"".
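The %!q(<nil>) fragment in that message looks like the marker Go's fmt package prints when a nil value is formatted with the %q verb. If so, one of the two errors being reported was probably nil and only the second carried the 503. A minimal sketch that reproduces the shape of the message; formatLogErrors and apiError are hypothetical names, and this is not taken from the crunch-run source:

```go
package main

import "fmt"

// apiError is a hypothetical stand-in for an error returned by an API call.
type apiError string

func (e apiError) Error() string { return string(e) }

// formatLogErrors (hypothetical name) formats two error values with %q.
// A nil error has no string form for %q, so fmt emits %!q(<nil>) instead.
func formatLogErrors(first, second error) string {
	return fmt.Sprintf("error recording logs: %q, %q", first, second)
}

func main() {
	// First error is nil, second is the 503 from the API.
	fmt.Println(formatLogErrors(nil, apiError("503 error")))
	// prints: error recording logs: %!q(<nil>), "503 error"
}
```

Formatting only the non-nil errors (or using %v) would make the message easier to read when triaging.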
Have also seen containers that were reported as "Complete" in the live log but don't have the complete log in the collection.
Maybe the OOM killer is getting to crunch-run sometimes? Sometimes it just stops logging entirely, and there's nothing useful in the logs stored in Keep, either. Or maybe failing to commit the logs to Keep causes the logging system to seize up?
I did see "error updating container log: 503" on another process.
Got "docker watchdog: error inspecting container: context deadline exceeded" and "container exited with status code 0" but it is still shown as running.
Some of the steps get through most (all?) of the "Copying" lines and then seize up.
For some reason, many (although not all) of the steps that froze up are running "STAR".
An arv-mount exception during setup can result in a stuck mount; maybe this is not handled properly?
Updated by Peter Amstutz over 1 year ago
- Subject changed from Stuck processes to Stuck crunch-run issues
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-05-10 sprint to Development 2023-05-24 sprint
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-05-24 sprint to Development 2023-06-07
Updated by Tom Clegg over 1 year ago
- Related to Bug #20540: crunch-run should sleep-and-retry after transient failures on API calls, especially when container is succeeding added
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-06-07 to Future