Bug #9688

closed

[Crunch2] Limit number of dispatch attempts per container

Added by Tom Clegg over 7 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

Problem

There are circumstances where crunch-dispatch-* tries to run a container, but something fails before the container gets to Running state, so the container goes back to Queued state. See #9679.

If the same problem keeps happening, the container just flaps between Queued and Locked.

After a certain amount of time, or number of retries, we should really just give up and cancel the container.

Proposed solutions

Pick one of:
  • (crunch-dispatch-*) If a single container gets dispatched more than N times (over a period of at least M seconds) by a single crunch-dispatch-* process, but still won't run, give up and change state to Cancelled.
  • (API server) If a container has been Locked and returned to Queued state, and is more than M seconds old, cancel it.
Also (optional), to mitigate starvation risk in a multiple-dispatch setup:
  • Introduce a delay between "return container X to queue" and "re-attempt X".
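
To make the first option and the optional delay concrete, here is a minimal sketch of the per-process bookkeeping a dispatcher could keep. It is only an illustration under stated assumptions: `DispatchLimiter`, its method names, and the "dispatch"/"wait"/"cancel" return values are hypothetical and do not correspond to any existing crunch-dispatch-* code. N is `max_attempts`, M is `min_period`, and `retry_delay` is the pause between returning a container to the queue and re-attempting it.

```python
import time
from dataclasses import dataclass


@dataclass
class _Attempts:
    first_dispatch: float                 # time of the first dispatch attempt
    count: int = 0                        # dispatch attempts made so far
    last_requeue: float = float("-inf")   # time of the latest return to Queued


class DispatchLimiter:
    """Hypothetical in-memory tracker, one instance per dispatcher process."""

    def __init__(self, max_attempts, min_period, retry_delay=0.0,
                 clock=time.monotonic):
        self.max_attempts = max_attempts  # N in the proposal
        self.min_period = min_period      # M seconds in the proposal
        self.retry_delay = retry_delay    # pause before re-attempting
        self.clock = clock                # injectable for testing
        self._seen = {}                   # container UUID -> _Attempts

    def decide(self, uuid):
        """Return "dispatch", "wait", or "cancel" for a Queued container."""
        now = self.clock()
        rec = self._seen.get(uuid)
        if rec is None:
            rec = self._seen[uuid] = _Attempts(first_dispatch=now)
        # Give up only when BOTH limits are exceeded: more than N attempts,
        # spread over at least M seconds.
        if (rec.count > self.max_attempts
                and now - rec.first_dispatch >= self.min_period):
            del self._seen[uuid]
            return "cancel"
        # Leave the container in the queue for a while so that another
        # dispatcher (or another container) gets a chance first.
        if now - rec.last_requeue < self.retry_delay:
            return "wait"
        rec.count += 1
        return "dispatch"

    def requeued(self, uuid):
        """Record that the container went back to Queued without running."""
        rec = self._seen.get(uuid)
        if rec is not None:
            rec.last_requeue = self.clock()

    def started(self, uuid):
        """Forget the container once it reaches Running."""
        self._seen.pop(uuid, None)
```

In this sketch the caller would invoke `decide()` each time it sees the container in the queue, `requeued()` when a dispatch attempt fails and the container returns to Queued, and `started()` once it reaches Running. Because the counters live in one process, they are lost on restart and are not shared between dispatchers, which is one reason the API server option may be preferable.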

Related issues

Related to Arvados - Bug #9679: [Crunch2] Provide feedback when a container is submitted to slurm but does not run (Resolved, Tom Clegg, 07/29/2016)
Has duplicate Arvados - Bug #14540: [API] Limit number of container lock/unlock cycles (Duplicate)
Has duplicate Arvados - Bug #11561: [API] Limit number of lock/unlock cycles for a given container (Resolved, Peter Amstutz, 04/26/2017)
#1

Updated by Peter Amstutz about 5 years ago

  • Related to Bug #11561: [API] Limit number of lock/unlock cycles for a given container added
#2

Updated by Tom Morris about 5 years ago

  • Priority changed from Normal to High
  • Target version set to To Be Groomed
#3

Updated by Tom Morris about 5 years ago

  • Related to Bug #14540: [API] Limit number of container lock/unlock cycles added
#4

Updated by Peter Amstutz about 5 years ago

  • Related to deleted (Bug #14540: [API] Limit number of container lock/unlock cycles)
#5

Updated by Peter Amstutz about 5 years ago

  • Has duplicate Bug #14540: [API] Limit number of container lock/unlock cycles added
#6

Updated by Peter Amstutz about 5 years ago

  • Related to deleted (Bug #11561: [API] Limit number of lock/unlock cycles for a given container)
#7

Updated by Peter Amstutz about 5 years ago

  • Has duplicate Bug #11561: [API] Limit number of lock/unlock cycles for a given container added
#8

Updated by Peter Amstutz about 5 years ago

  • Status changed from New to Duplicate
#9

Updated by Tom Morris about 5 years ago

  • Target version deleted (To Be Groomed)
#10

Updated by Tom Clegg over 2 years ago

  • Priority changed from High to Normal
  • Status changed from Duplicate to Resolved