Bug #9898

closed

[Crunch2] [API] Add explicit container lock/unlock APIs to prevent dispatch races

Added by Tom Clegg over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Normal
Assigned To: Radhika Chippada
Category: Crunch
Target version:
Story points: 2.0
Release:
Release relationship: Auto

Description

Background

Currently, two dispatch processes can fight over a container. If they use the same token (which ideally wouldn't happen, but certainly will!) and both submit "update state to Locked" requests at around the same time, both requests succeed: from the API server's perspective, the second one looks like a no-op. Both dispatchers then try to run the container.
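The race can be illustrated with a toy in-memory model. This is only a sketch: `ToyContainer`, `update_state`, and the token value are hypothetical names made up for illustration, not the API server's code. It assumes, as described above, that an update to Locked on a container already locked by the same token is treated as a successful no-op.

```python
class ToyContainer:
    """Hypothetical stand-in for a container record on the API server."""

    def __init__(self):
        self.state = "Queued"
        self.locked_by_uuid = None

    def update_state(self, new_state, token_uuid):
        """Model of the generic "update" call; returns True on success."""
        if new_state != "Locked":
            return False  # other transitions are out of scope for this sketch
        if self.state == "Queued":
            self.state = "Locked"
            self.locked_by_uuid = token_uuid
            return True
        # Already Locked by the same token: looks like a no-op, so it succeeds.
        return self.state == "Locked" and self.locked_by_uuid == token_uuid


container = ToyContainer()
shared_token = "dispatch-token"  # made-up token UUID shared by both dispatchers

dispatch_a_ok = container.update_state("Locked", shared_token)
dispatch_b_ok = container.update_state("Locked", shared_token)

# Both dispatchers see success, so both would try to run the container.
print(dispatch_a_ok, dispatch_b_ok)
```

Neither caller can tell from the response whether it was the one that actually changed the state.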

Resolution

Explicit "lock" and "unlock" APIs will make clients' intentions clear. Even when multiple dispatch processes share the same auth token, they will never unknowingly lose a locking race.

New API endpoint /arvados/v1/containers/lock
  • permitted only if state=Queued
  • changes state to Locked
  • changes locked_by_uuid to current token UUID (same as existing behavior when changing state to Locked)
  • ensures that, when multiple "lock" calls are processed concurrently, at most one of them succeeds.
New API endpoint /arvados/v1/containers/unlock
  • permitted only if state=Locked
  • changes state to Queued
  • clears locked_by_uuid, revokes and clears auth_uuid (same as existing behavior when changing state to Queued)
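The semantics listed above can be sketched as a small in-memory model. Everything here is an assumption for illustration: `ToyContainer`, `LockError`, and the mutex (which stands in for whatever serialization the API server uses) are hypothetical, and revoking the auth token is reduced to clearing a field.

```python
import threading


class LockError(Exception):
    """Raised when lock/unlock is not permitted in the current state."""


class ToyContainer:
    """Hypothetical in-memory model of a container record."""

    def __init__(self):
        self.state = "Queued"
        self.locked_by_uuid = None
        self.auth_uuid = None
        self._mutex = threading.Lock()  # stands in for server-side serialization

    def lock(self, token_uuid):
        with self._mutex:
            if self.state != "Queued":
                raise LockError("lock is permitted only if state=Queued")
            self.state = "Locked"
            self.locked_by_uuid = token_uuid

    def unlock(self):
        with self._mutex:
            if self.state != "Locked":
                raise LockError("unlock is permitted only if state=Locked")
            self.state = "Queued"
            self.locked_by_uuid = None
            self.auth_uuid = None  # a real server would also revoke the token


container = ToyContainer()
shared_token = "dispatch-token"  # made-up token UUID shared by both dispatchers
results = []


def try_lock(name):
    """Attempt to lock and record whether this dispatcher won or lost."""
    try:
        container.lock(shared_token)
        results.append((name, "locked"))
    except LockError:
        results.append((name, "failed"))


threads = [threading.Thread(target=try_lock, args=(n,)) for n in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()

outcomes = sorted(outcome for _, outcome in results)
print(outcomes)
```

Unlike a generic state update, the losing caller gets an explicit error, so it knows not to run the container even though it presented the same token as the winner.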

Update crunch-dispatch-* to use the new locking APIs instead of "update" when changing state to Locked or Queued.

Even with this change, it will still be possible for confused dispatchers to unlock one another's containers (assuming they have the same token), but this is acceptable: an unlocked container will be re-attempted anyway. The behavior is also desirable, because it allows a dispatch process to unlock its locked containers after a crash/restart, when all it knows is its own dispatch token.
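The crash/restart case might look roughly like the sketch below. It works on plain dictionaries rather than real API calls; `unlock_own_containers`, the record layout, and all UUID values are hypothetical, chosen only to show that the dispatch token alone is enough to find and release stale locks.

```python
def unlock_own_containers(containers, my_token_uuid):
    """Unlock every container locked by my_token_uuid; return their UUIDs."""
    unlocked = []
    for c in containers:
        if c["state"] == "Locked" and c["locked_by_uuid"] == my_token_uuid:
            # Same effect as the "unlock" call described above.
            c["state"] = "Queued"
            c["locked_by_uuid"] = None
            c["auth_uuid"] = None
            unlocked.append(c["uuid"])
    return unlocked


# Made-up records as a restarted dispatcher might see them.
containers = [
    {"uuid": "c1", "state": "Locked", "locked_by_uuid": "dispatch-token", "auth_uuid": "auth-1"},
    {"uuid": "c2", "state": "Locked", "locked_by_uuid": "other-token", "auth_uuid": "auth-2"},
    {"uuid": "c3", "state": "Queued", "locked_by_uuid": None, "auth_uuid": None},
]

unlocked = unlock_own_containers(containers, "dispatch-token")
print(unlocked)
```

Containers locked under a different token are left alone; only the restarted dispatcher's own locks are released so the containers can be re-attempted.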

Even with this change, the following race will still be possible, and will be addressed separately with a "run_auth" token:
  1. dispatch A locks container
  2. dispatch B (with the same token) unlocks the container
  3. dispatch B locks the container
  4. dispatch A retrieves the user auth token
  5. dispatch B retrieves the user auth token
  6. Now both dispatch processes think they have the lock, and both have a valid user auth token. (If dispatch A retrieves the user auth token before the container gets unlocked by dispatch B, the outcome is different but still undesirable.)
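The sequence above can be replayed against a toy model to show why lock/unlock alone does not close this gap. The class and the `user_auth_token` method are hypothetical; in particular, the sketch assumes the user auth token is handed to any caller whose token matches `locked_by_uuid`, which is all the server can check when both dispatchers share a token.

```python
class ToyContainer:
    """Hypothetical in-memory model of a container record."""

    def __init__(self):
        self.state = "Queued"
        self.locked_by_uuid = None

    def lock(self, token_uuid):
        if self.state != "Queued":
            raise RuntimeError("lock is permitted only if state=Queued")
        self.state, self.locked_by_uuid = "Locked", token_uuid

    def unlock(self):
        if self.state != "Locked":
            raise RuntimeError("unlock is permitted only if state=Locked")
        self.state, self.locked_by_uuid = "Queued", None

    def user_auth_token(self, token_uuid):
        """Assumed rule: the lock holder's token may fetch the user auth token."""
        if self.state == "Locked" and self.locked_by_uuid == token_uuid:
            return "user-auth-token"  # placeholder value
        raise RuntimeError("caller does not hold the lock")


shared_token = "dispatch-token"  # made-up token UUID shared by A and B
container = ToyContainer()

container.lock(shared_token)                       # 1. dispatch A locks
container.unlock()                                 # 2. dispatch B unlocks
container.lock(shared_token)                       # 3. dispatch B locks
token_a = container.user_auth_token(shared_token)  # 4. dispatch A retrieves token
token_b = container.user_auth_token(shared_token)  # 5. dispatch B retrieves token

# 6. Both calls succeeded, so both dispatchers believe they hold the lock.
print(token_a == token_b)
```

Every individual call here succeeds, and nothing tells dispatch A that its lock was replaced in between.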

Subtasks 1 (0 open, 1 closed)

Task #9909: Review branch 9898-container-lock-api (Resolved, Tom Clegg, 08/31/2016)

Related issues 2 (0 open, 2 closed)

Related to Arvados - Bug #9900: [Crunch2] [API] Add ephemeral "run token" for running containers (Resolved, 08/31/2016)
Copied from Arvados - Idea #9328: [Crunch2] Prevent dispatch races (Closed)