Bug #20614

closed

"Warning: Process retried 1 time due to failure." with no additional information

Added by Peter Amstutz 12 months ago. Updated 9 months ago.

Status:
Closed
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Story points:
-
Release relationship:
Auto

Description

If container_count > 1 then Workbench 2 renders a message like "Warning: Process retried 1 time due to failure."

Example:

https://workbench2.scale.arvadosapi.com/processes/scale-xvhdp-e8mrzfk1fx5xl2j

The problem is that the log collection doesn't seem to have any record of the 1st attempt.

We need to figure out why it is not including the 1st failure in the log collection (and then maybe fix what's actually failing).

Update:

Actually, the first failure was recorded (although it is unclear why it failed). The new container, however, was cancelled before it started, so it never produced any logs to record. The result is a confusing log collection that shows only one container (the old one).

We need a way to communicate this situation better.

Update:

Need to figure out why these processes are actually getting killed. The API server was under high load; maybe this is causing things to time out, so that the containers are somehow considered abandoned?

Update:

The process wasn't killed, it was cancelled (on purpose). But several weird things happened.

  1. The log collection scale-4zz18-z33wfnfdga5wroy has incomplete log files at the root, but complete log files in the subdirectory "log for container scale-dz642-c9o553ezb91wbfl/"
  2. These are log files for the same container as evidenced by them having the same timestamps
  3. So there was only one (real) container, which was running normally, until it was cancelled
  4. At that point something weird happened
  5. It picked up the partial collection as the logs from the "last" container
  6. It incremented the container count
  7. It created a new container scale-dz642-5x9d8kfjpcdo3xx which ends up at priority 0 and state Queued, and has no log

My suspicion is that there is a race happening in container.rb#handle_completed

  • When the state goes to cancelled, it looks for container requests with priority > 0
  • Container requests under the root container request have their own priority, but this is normally ignored unless priority = 0 (which cancels a subtree).
  • update_priorities only updates containers, not container requests
  • So it seems like there must be a window where the container is cancelled, but the request for it is still 'Committed' (it hasn't yet moved to 'Final') so it gets treated as a retry (because "cancelled" means "did not finish" for any reason).
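
The suspected window can be illustrated with a small plain-Ruby model. This is only a sketch, not the actual code in container.rb: the RaceSketch module, its Struct definitions, and the naive_retryable helper are hypothetical stand-ins for the real models and for whatever query handle_completed runs, reduced to the fields mentioned above.

```ruby
module RaceSketch
  # Hypothetical stand-ins for the real models; only the fields relevant here.
  Container = Struct.new(:uuid, :state, :priority, keyword_init: true)
  ContainerRequest = Struct.new(:uuid, :state, :priority, :container_uuid,
                                :requesting_container_uuid, keyword_init: true)

  # Assumed shape of the check: any Committed request with priority > 0 that
  # points at the finished container is treated as live, hence retryable.
  def self.naive_retryable(container, requests)
    requests.select do |cr|
      cr.container_uuid == container.uuid &&
        cr.state == 'Committed' &&
        cr.priority > 0
    end
  end
end

# The process tree was cancelled: both containers are Cancelled at priority 0.
root = RaceSketch::Container.new(uuid: 'root', state: 'Cancelled', priority: 0)
child = RaceSketch::Container.new(uuid: 'child', state: 'Cancelled', priority: 0)

# The child's request is still Committed, and its own priority was never
# lowered because only container priorities were updated.
child_cr = RaceSketch::ContainerRequest.new(
  uuid: 'cr1', state: 'Committed', priority: 500,
  container_uuid: 'child', requesting_container_uuid: root.uuid
)

retry_candidates = RaceSketch.naive_retryable(child, [child_cr])
puts retry_candidates.map(&:uuid).inspect # prints ["cr1"]
```

In this model the request is picked up as a retry candidate even though the whole tree was cancelled on purpose, which matches the extra Queued container at priority 0 described in the list above.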

I think this means, when checking for retryable requests, if a request has a requesting_container, we need to also check that the requesting container has priority > 0.

Or maybe it is simpler: we should just check for Cancelled && priority > 0 on the container directly. If the container priority is 0, we can assume there are no live container requests for it?

I think that would work for the case where there are no live container requests. However, in the case where there is one live and one cancelled container request, it could match both, resulting in the retry behavior being applied to the cancelled one.
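
The difference between the two candidate checks can be sketched in plain Ruby. Everything here is hypothetical (the RetryCheckSketch module, the Ctr and Req structs, and both helper methods); it models the reasoning above and is not the query that ended up in the API server. It assumes a container shared by two requests, one under a live parent and one under a cancelled parent.

```ruby
module RetryCheckSketch
  # Hypothetical, simplified models of a container and a container request.
  Ctr = Struct.new(:uuid, :state, :priority, keyword_init: true)
  Req = Struct.new(:uuid, :state, :priority, :container_uuid,
                   :requesting_container_uuid, keyword_init: true)

  def self.committed_for(container, requests)
    requests.select do |r|
      r.container_uuid == container.uuid && r.state == 'Committed' && r.priority > 0
    end
  end

  # Option A: decide from the finished container's own state and priority.
  def self.by_container_priority(container, requests)
    return [] unless container.state == 'Cancelled' && container.priority > 0
    committed_for(container, requests)
  end

  # Option B: per request, also require the requesting container (if any)
  # to have priority > 0. Requests with no requesting container pass.
  def self.by_requesting_container(container, requests, containers_by_uuid)
    committed_for(container, requests).select do |r|
      next true if r.requesting_container_uuid.nil?
      parent = containers_by_uuid[r.requesting_container_uuid]
      !parent.nil? && parent.priority > 0
    end
  end
end

S = RetryCheckSketch
shared = S::Ctr.new(uuid: 'shared', state: 'Cancelled', priority: 10)
parents = {
  'p_live' => S::Ctr.new(uuid: 'p_live', state: 'Running', priority: 10),
  'p_dead' => S::Ctr.new(uuid: 'p_dead', state: 'Cancelled', priority: 0)
}
requests = [
  S::Req.new(uuid: 'cr_live', state: 'Committed', priority: 500,
             container_uuid: 'shared', requesting_container_uuid: 'p_live'),
  S::Req.new(uuid: 'cr_dead', state: 'Committed', priority: 500,
             container_uuid: 'shared', requesting_container_uuid: 'p_dead')
]

option_a = S.by_container_priority(shared, requests).map(&:uuid)
option_b = S.by_requesting_container(shared, requests, parents).map(&:uuid)
puts option_a.inspect # prints ["cr_live", "cr_dead"]
puts option_b.inspect # prints ["cr_live"]
```

Under these assumptions the container-level check retries both requests, while the per-request check leaves out the one whose requesting container is already at priority 0.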


Subtasks 1 (0 open, 1 closed)

Task #20627: Review 20614-no-retry-cancel | Resolved | Peter Amstutz | 06/08/2023

Related issues

Related to Arvados Epics - Idea #20599: Scaling to 1000s of concurrent containers | Resolved | 06/01/2023 | 03/31/2024
Actions #1

Updated by Peter Amstutz 12 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 12 months ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz 12 months ago

  • Related to Idea #20599: Scaling to 1000s of concurrent containers added
Actions #4

Updated by Peter Amstutz 12 months ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz 12 months ago

  • Description updated (diff)
Actions #6

Updated by Peter Amstutz 12 months ago

  • Assigned To set to Peter Amstutz
Actions #7

Updated by Peter Amstutz 12 months ago

  • Description updated (diff)
Actions #8

Updated by Peter Amstutz 12 months ago

  • Description updated (diff)
Actions #9

Updated by Peter Amstutz 12 months ago

  • Description updated (diff)
Actions #10

Updated by Peter Amstutz 12 months ago

20614-no-retry-cancel @ 431bd5023a19d58369dc18e582b1fc2a3d20a321

  • Improve the query for live container requests when deciding whether to retry or not, to take into account the state and priority of the request's requesting container
  • Add a few new tests ("Do not retry sub-request when process tree is cancelled" is the one that tests for the bug)

developer-run-tests: #3693

Actions #11

Updated by Peter Amstutz 12 months ago

  • Status changed from New to In Progress
Actions #12

Updated by Tom Clegg 12 months ago

LGTM, thanks

Actions #13

Updated by Peter Amstutz 12 months ago

  • Status changed from In Progress to Resolved
Actions #14

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2023-06-21 sprint to Development 2023-07-05 sprint
  • Status changed from Resolved to Feedback

It seems like something like this can still happen: scale-xvhdp-zwqc62kvex6i6m9

Actions #15

Updated by Peter Amstutz 11 months ago

  • Release set to 66
Actions #16

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2023-07-05 sprint to Development 2023-07-19 sprint
Actions #17

Updated by Peter Amstutz 10 months ago

Actions #18

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2023-07-19 sprint to Development 2023-08-02 sprint
Actions #19

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2023-08-02 sprint to Development 2023-08-16
Actions #20

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2023-08-16 to Development 2023-08-30
Actions #21

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2023-08-30 to Development 2023-09-13 sprint
Actions #22

Updated by Peter Amstutz 9 months ago

  • Release deleted (66)
Actions #23

Updated by Peter Amstutz 9 months ago

  • Status changed from Feedback to Closed

Don't have anything to follow up with, going to close this.

Actions #24

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2023-09-13 sprint to Development 2023-08-30
Actions #25

Updated by Peter Amstutz 9 months ago

  • Release set to 66