Bug #14977: [arvados-dispatch-cloud] kill crunch-run procs for containers that are deleted or have state=Cancelled when dispatcher starts up - Arvados


Bug #14977

closed

[arvados-dispatch-cloud] kill crunch-run procs for containers that are deleted or have state=Cancelled when dispatcher starts up

Added by Tom Clegg about 6 years ago. Updated almost 6 years ago.

Status:

Resolved

Priority:

Normal

Assigned To:

Tom Clegg

Category:

Crunch

Target version:

2019-03-27 Sprint

Story points:

Release:

Arvados v1.4 - Q1/Q2 2019

Release relationship:

Auto

Description

Currently, a container that has state==Cancelled when arvados-dispatch-cloud starts up will never be added to the container queue, even if its UUID appears on an instance's probe result. Also, a container that has been deleted from the database will never have an entry added/updated in the dispatcher's container queue.

The scheduler's sync() func is responsible for killing unneeded crunch-run processes, but it only looks at the container queue, so these crunch-run processes are allowed to run forever.

Proposed solution:

In (*scheduler.Scheduler)sync(), kill anything returned by sch.pool.Running() that isn't returned by sch.queue.Entries(). This should be safe from "kill crunch-run before seeing its UUID in the queue" races, for two reasons:

At least one "get entire queue from controller/database" has succeeded before the first call to sync().
UUIDs are added to Running() only during (*Scheduler)runQueue(), which does not run concurrently with (*Scheduler)sync().
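
The core of this proposal is a set difference between running containers and queued containers. The following is a minimal sketch only, not the code on the branch: uuidsToKill is a hypothetical helper, and the two map parameters are simplified stand-ins for whatever sch.pool.Running() and sch.queue.Entries() actually return.

```go
package main

import "sort"

// uuidsToKill returns every UUID present in running but absent from queued.
// Both parameter types are simplified assumptions for illustration; the real
// return types of sch.pool.Running() and sch.queue.Entries() may differ.
func uuidsToKill(running, queued map[string]struct{}) []string {
	var kill []string
	for uuid := range running {
		if _, ok := queued[uuid]; !ok {
			// Not in the queue: cancelled before startup, deleted, or
			// otherwise unknown to the dispatcher.
			kill = append(kill, uuid)
		}
	}
	// Map iteration order is random, so sort for a deterministic result.
	sort.Strings(kill)
	return kill
}
```

A caller in sync() would then ask the pool to kill the crunch-run process for each returned UUID. This is only safe under the two conditions listed above, because a UUID that has not yet been loaded into the queue would otherwise look identical to one that was deleted.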

In (*container.Queue)poll(), if a container's UUID is in the local queue but is not returned by the API calls that request that specific UUID, delete it from the local queue. The "get missing containers" loop will need to be more careful to avoid accidentally deleting containers when the API server chooses to return fewer than a full page of results.
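
One way to handle short pages is to treat a UUID as deleted only when an explicit request for it returns nothing at all, and to re-request anything that was merely left out of a partial response. The sketch below illustrates that idea under stated assumptions: fetchFunc, findDeleted, and batchSize are hypothetical names, and this is not necessarily how the branch implements the loop.

```go
package main

// fetchFunc is a hypothetical stand-in for an API call that requests specific
// container UUIDs. It may return fewer results than it was asked for.
type fetchFunc func(uuids []string) []string

// findDeleted returns the UUIDs in missing that the API never returns, even
// when they are requested explicitly.
func findDeleted(missing []string, batchSize int, fetch fetchFunc) []string {
	if batchSize < 1 {
		batchSize = 1
	}
	var deleted []string
	for len(missing) > 0 {
		batch := missing
		if len(batch) > batchSize {
			batch = batch[:batchSize]
		}
		returned := map[string]bool{}
		for _, uuid := range fetch(batch) {
			returned[uuid] = true
		}
		// Keep everything that did not come back, in the original order.
		var next []string
		for _, uuid := range missing {
			if !returned[uuid] {
				next = append(next, uuid)
			}
		}
		if len(next) == len(missing) {
			// Nothing from this batch came back, so the whole batch is
			// treated as deleted. A partial response never reaches here.
			deleted = append(deleted, batch...)
			next = next[len(batch):]
		}
		missing = next
	}
	return deleted
}
```

Each pass either receives at least one requested container or retires a whole batch, so the loop terminates even if the server returns one item at a time.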

Files

14977.png (17.7 KB)

Tom Clegg, 03/15/2019 08:46 PM

Subtasks 1 (0 open — 1 closed)

Related issues 1 (0 open — 1 closed)


Updated by Tom Clegg about 6 years ago

Status changed from New to In Progress


Updated by Tom Clegg about 6 years ago

14977-kill-if-not-in-queue @ 2a748e79c3a72454d70e40f39fcad9dabf4943cc

This also fixes a different startup bug: "fix stale locks" was not waiting for the pool to load its initial instance list, so it would always return immediately, and the scheduler would create too many new instances at startup ("container is locked but no instances are available to run it").
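
The startup bug described here comes down to acting before an initial load has finished. As a generic illustration, and not the actual change on the branch, a step like "fix stale locks" could block on a readiness signal with a timeout. The function name waitForInitialLoad and the loaded channel are assumptions made for this sketch.

```go
package main

import "time"

// waitForInitialLoad blocks until loaded is closed (a hypothetical signal that
// the pool has its first instance list) or until timeout elapses. It reports
// whether the load completed in time.
func waitForInitialLoad(loaded <-chan struct{}, timeout time.Duration) bool {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-loaded:
		return true
	case <-timer.C:
		return false
	}
}
```

With a gate like this, the stale-lock check would only inspect instances after the pool knows which ones exist, instead of seeing an empty pool and returning immediately.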

Successfully tested on c97qk.


Updated by Tom Clegg about 6 years ago

File 14977.png added


Updated by Tom Clegg about 6 years ago

Blocks Idea #14807: [arvados-dispatch-cloud] Features/fixes needed before first production deploy added


Updated by Tom Clegg about 6 years ago

14977-kill-if-not-in-queue @ 6582c5afa53258fcb36d682fb690203930b7b2f6

(reorder statements slightly for clarity)


Updated by Ward Vandewege about 6 years ago

Tom Clegg wrote:

14977-kill-if-not-in-queue @ 6582c5afa53258fcb36d682fb690203930b7b2f6

(reorder statements slightly for clarity)

LGTM, thanks!


Updated by Tom Clegg about 6 years ago

More explanation:

If a container is already cancelled when the dispatcher starts up, it never gets added/updated in the dispatcher's queue. Therefore, the scheduler never finds out it has state=Cancelled. So the fix is to kill the crunch-run process for any container that is not in the queue, whether because it is Cancelled, because it was deleted, or for any other reason.
