Bug #20451

closed

Stuck crunch-run issues

Added by Peter Amstutz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version: -
Story points: -

Description

A process reported a 503 error from UpdateContainerFinal; crunch-run reported exited, but the dispatcher still thinks it is running and won't shut down the node.

Another process reported "error updating exit_code: 503 error" and later "error in CaptureOutput: error retrieving collection record: 503 error".

Also "error saving log collection: error recording logs: %!q(<nil>), "503 error".

Have also seen containers that were reported as "Complete" in the live log but don't have the complete log in the collection.

Maybe the OOM killer is getting to crunch-run sometimes? Sometimes it just stops logging entirely, and there's nothing useful in the logs stored in keep, either. Or maybe failing to commit the logs to keep causes the logging system to seize up?

I did see "error updating container log: 503" on another process.

Got "docker watchdog: error inspecting container: context deadline exceeded" and "container exited with status code 0" but it is still shown as running.

Some of the steps get through most (all?) of the "Copying" lines and then seize up.

For some reason many (although not all) of the steps that froze up are running "STAR".

An arv-mount exception during setup can result in a stuck mount; maybe this is not handled properly?


Related issues

Related to Arvados - Bug #20540: crunch-run should sleep-and-retry after transient failures on API calls, especially when container is succeeding (Resolved, Tom Clegg, 05/30/2023)
#1

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
#2

Updated by Peter Amstutz over 1 year ago

  • Subject changed from Stuck processes to Stuck crunch-run issues
#3

Updated by Peter Amstutz over 1 year ago

  • Release set to 63
#4

Updated by Peter Amstutz over 1 year ago

  • Assigned To set to Tom Clegg
#5

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2023-05-10 sprint to Development 2023-05-24 sprint
#6

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2023-05-24 sprint to Development 2023-06-07
#7

Updated by Peter Amstutz over 1 year ago

  • Release deleted (63)
#9

Updated by Tom Clegg over 1 year ago

  • Related to Bug #20540: crunch-run should sleep-and-retry after transient failures on API calls, especially when container is succeeding added
#10

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2023-06-07 to Future
#11

Updated by Peter Amstutz over 1 year ago

  • Status changed from New to Resolved
#12

Updated by Peter Amstutz over 1 year ago

  • Target version deleted (Future)