Bug #20614
closed"Warning: Process retried 1 time due to failure." with no additional information
Description
If container_count > 1 then Workbench 2 renders a message like "Warning: Process retried 1 time due to failure."
Example:
https://workbench2.scale.arvadosapi.com/processes/scale-xvhdp-e8mrzfk1fx5xl2j
The problem is that the log collection doesn't seem to have any record of the 1st attempt.
We need to figure out why it is not including the 1st failure in the log collection (and then maybe fix what's actually failing).
Update:
Actually, the first failure was recorded (though it is unclear why it failed). However, the new container was cancelled before it was ever started, and thus never created any logs to be recorded, resulting in a confusing log collection that only shows one container (the old one).
We need a way to communicate this situation better.
Update:
Need to figure out why these processes are actually getting killed. The API server was under high load; maybe this is causing things to time out and conclude that the containers have somehow been abandoned?
Update:
The process wasn't killed, it was cancelled (on purpose). But several weird things happened.
- The log collection scale-4zz18-z33wfnfdga5wroy has incomplete log files at the root, but complete log files in the subdirectory "log for container scale-dz642-c9o553ezb91wbfl/"
- These are log files for the same container as evidenced by them having the same timestamps
- So there was only one (real) container, which was running normally, until it was cancelled
- At that point something weird happened
- It picked up the partial collection as the logs from the "last" container
- It incremented the container count
- It created a new container scale-dz642-5x9d8kfjpcdo3xx which ends up at priority 0 and state Queued, and has no log
My suspicion is that there is a race happening in container.rb#handle_completed
- When the state goes to cancelled, it looks for container requests with priority > 0
- Container requests under the root container request have their own priority, but this is normally ignored unless priority = 0 (which cancels a subtree).
- update_priorities only updates containers, not container requests
- So it seems like there must be a window where the container is cancelled, but the request for it is still 'Committed' (it hasn't yet moved to 'Final'), so it gets treated as a retry (because 'Cancelled' means 'did not finish' for any reason).
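To make the suspected window concrete, the retry decision presumably boils down to a query like the following (a simplified, hypothetical sketch of the logic, not the actual container.rb code):

    # Simplified sketch of the suspected retry check (hypothetical, not the
    # real handle_completed code). When a container reaches Cancelled, it
    # looks for requests that still appear "live":
    retryable = ContainerRequest.where(
      container_uuid: container.uuid,
      state: 'Committed'
    ).where('priority > 0')

    # The window: the tree has just been cancelled, so the container's own
    # priority is already 0, but its sub-request has not yet moved from
    # 'Committed' to 'Final'. It still matches the query above, so the
    # cancellation is treated as "did not finish" -> retry: container_count
    # is bumped and a new container is created, which then sits at
    # priority 0 / state Queued with no logs.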
I think this means, when checking for retryable requests, if a request has a requesting_container, we need to also check that the requesting container has priority > 0.
Or maybe it is simpler: we could just check for Cancelled && priority > 0 on the container directly. If the container priority is 0, we can assume there are no live container requests for it?
I think that would work for the case where there are no live container requests; however, in the case where there is one live and one cancelled container request, it could match both, resulting in the retry behavior being applied to the cancelled one.
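A sketch of the first option (also checking the requesting container), assuming the Container / ContainerRequest ActiveRecord models and the requesting_container_uuid, priority, and state fields referenced above; the method name is hypothetical:

    # Hypothetical sketch: only treat a request as retryable if, in addition
    # to being Committed with priority > 0, its requesting container (when
    # it has one) is itself still live.
    def retryable_requests(container)
      ContainerRequest.where(
        container_uuid: container.uuid,
        state: 'Committed'
      ).where('priority > 0').select do |cr|
        if cr.requesting_container_uuid.nil?
          # Top-level request: its own priority is authoritative.
          true
        else
          rc = Container.find_by(uuid: cr.requesting_container_uuid)
          # If the parent tree was cancelled, the requesting container's
          # priority has already been driven to 0, so do not retry.
          rc && rc.priority > 0 && rc.state == 'Running'
        end
      end
    end

In a real implementation this check would presumably be pushed down into the SQL query (a join against containers) rather than a Ruby-side select, which matters at the scale discussed in #20599.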
Related issues
Updated by Peter Amstutz over 1 year ago
- Related to Idea #20599: Scaling to 1000s of concurrent containers added
Updated by Peter Amstutz over 1 year ago
20614-no-retry-cancel @ 431bd5023a19d58369dc18e582b1fc2a3d20a321
- Improve the query for live container requests when deciding whether to retry or not, to take into account the state and priority of the request's requesting container
- Add a few new tests ("Do not retry sub-request when process tree is cancelled" is the one that tests for the bug)
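The scenario that test needs to cover looks roughly like this (a minitest-style sketch with made-up fixture names and a simplified cancellation step, not the actual test from the branch):

    # Hypothetical sketch of the "Do not retry sub-request when process tree
    # is cancelled" scenario. Fixture names and the direct priority/state
    # updates are made up for illustration.
    class CancelledTreeRetryTest < ActiveSupport::TestCase
      test "do not retry sub-request when process tree is cancelled" do
        child_cr = container_requests(:committed_child)             # hypothetical fixture
        child = Container.find_by(uuid: child_cr.container_uuid)
        parent = Container.find_by(uuid: child_cr.requesting_container_uuid)

        # Cancel the whole tree from the top; update_priorities should drive
        # the child container's priority to 0 as well.
        parent.update!(priority: 0)

        # When the child container is subsequently cancelled, the sub-request
        # must be finalized, not retried with a fresh container.
        child.reload.update!(state: 'Cancelled')
        child_cr.reload
        assert_equal 'Final', child_cr.state
        assert_equal child.uuid, child_cr.container_uuid   # no replacement container
        assert_equal 1, child_cr.container_count
      end
    end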
Updated by Peter Amstutz over 1 year ago
- Status changed from New to In Progress
Updated by Peter Amstutz over 1 year ago
- Status changed from In Progress to Resolved
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-06-21 sprint to Development 2023-07-05 sprint
- Status changed from Resolved to Feedback
It seems like something like this can still happen: scale-xvhdp-zwqc62kvex6i6m9
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-07-05 sprint to Development 2023-07-19 sprint
Updated by Peter Amstutz over 1 year ago
Relocated comment over to https://dev.arvados.org/issues/20457#note-40
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2023-07-19 sprint to Development 2023-08-02 sprint
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2023-08-02 sprint to Development 2023-08-16
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2023-08-16 to Development 2023-08-30
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2023-08-30 to Development 2023-09-13 sprint
Updated by Peter Amstutz about 1 year ago
- Status changed from Feedback to Closed
Don't have anything to follow up with, so I'm going to close this.
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2023-09-13 sprint to Development 2023-08-30