Idea #9328

[Crunch2] Prevent dispatch races

Added by Peter Amstutz almost 8 years ago. Updated over 7 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

See Container dispatch for background.

A container can go from the "Locked" state back to "Queued" if something fails before the container actually starts running, or if the dispatcher loses track of the container and believes it is no longer running.

A Queued container can then be re-locked by the same dispatcher, which starts a second crunch-run process. The second process can race the first and produce confusing results, because both crunch-run processes are able to update the container record.

The proposed solution is to introduce an additional API token, issued when the container is Locked, which the crunch-run process will use to update the record. If the container is unlocked for any reason, the API token is revoked; the old crunch-run process can then no longer modify the container record and fails, and the new crunch-run process can safely take ownership of the container record with a new API token.

Proposed implementation:

  1. Add a run_auth_uuid field to the "containers" table on the API server
  2. When the container is Locked, set run_auth_uuid to a newly issued token belonging to a system user, scoped to read/write access on just the container record
  3. When the dispatcher queues or executes the container using crunch-run, set ARVADOS_API_TOKEN to the run_auth_uuid token
  4. If the container returns to "Queued", the run_auth_uuid token is revoked/deleted and the field is cleared.

Rationale:

If crunch-run is started multiple times, the old crunch-run will be unable to update the container record because its token is revoked. Only the new crunch-run with the new ARVADOS_API_TOKEN will be able to update the container record.
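
The revocation behaviour described above can be sketched with a toy in-memory model. This is only an illustration of the idea, not the Arvados API: the Container class, its lock/unlock/update methods, the TokenRevoked exception, and the demo function are all hypothetical names, and a random hex string stands in for the token behind run_auth_uuid.

```python
import secrets


class TokenRevoked(Exception):
    """Raised when an update is attempted with a token that is no longer valid."""


class Container:
    """Toy stand-in for a container record (hypothetical, not the Arvados API)."""

    def __init__(self):
        self.state = "Queued"
        self.run_auth_token = None  # plays the role of the token behind run_auth_uuid
        self.log = []

    def lock(self):
        # Step 2: issue a fresh token each time the container becomes Locked.
        if self.state != "Queued":
            raise RuntimeError("only a Queued container can be locked")
        self.state = "Locked"
        self.run_auth_token = secrets.token_hex(16)
        return self.run_auth_token

    def unlock(self):
        # Step 4: returning to Queued revokes the token and clears the field.
        if self.state != "Locked":
            raise RuntimeError("only a Locked container can be unlocked")
        self.state = "Queued"
        self.run_auth_token = None

    def update(self, token, message):
        # Only the holder of the current token may modify the record.
        if self.run_auth_token is None or token != self.run_auth_token:
            raise TokenRevoked("token is not valid for this container")
        self.log.append(message)


def demo():
    """Simulate a stale crunch-run racing a new one after unlock/relock."""
    container = Container()
    old_token = container.lock()
    container.update(old_token, "first crunch-run started")

    # The dispatcher loses track of the container and re-locks it.
    container.unlock()
    new_token = container.lock()

    try:
        container.update(old_token, "stale write from first crunch-run")
        stale_rejected = False
    except TokenRevoked:
        stale_rejected = True

    container.update(new_token, "second crunch-run started")
    return stale_rejected, container.log
```

In this model the first crunch-run's write after the relock is rejected, so only the process holding the newest token can change the record.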


Related issues

Copied to Arvados - Bug #9898: [Crunch2] [API] Add explicit container lock/unlock APIs to prevent dispatch races (Resolved, Radhika Chippada, 08/31/2016)
Copied to Arvados - Bug #9900: [Crunch2] [API] Add ephemeral "run token" for running containers (Resolved, 08/31/2016)
Actions #1

Updated by Peter Amstutz almost 8 years ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz almost 8 years ago

  • Description updated (diff)
Actions #3

Updated by Tom Clegg almost 8 years ago

  • Subject changed from [Crunchv2] Issue separate revocable token for individual container runs to [Crunch2] Prevent dispatch races using a per-container dispatch token
Actions #4

Updated by Peter Amstutz almost 8 years ago

  • Description updated (diff)
Actions #5

Updated by Tom Clegg almost 8 years ago

These races can be prevented without API changes, using the existing (since #8128) auth_uuid field.

Implementation would look like:
  • when updating state to Locked, crunch-dispatch notes the auth_uuid field in the response.
  • when updating the container later, crunch-dispatch includes the known auth_uuid in the update fields. If the container has been unlocked and locked by a different dispatcher, even with the same dispatch token, auth_uuid will have changed, and the update will fail.
  • when invoking crunch-run, crunch-dispatch passes the known auth_uuid as a command line flag or environment variable.
  • crunch-run includes the known auth_uuid in every arvados.v1.containers.update call. If the container has been unlocked/relocked, updates will fail.
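
The bullets above amount to a compare-and-set on auth_uuid. Here is a minimal sketch of that idea, assuming the server rejects any update whose supplied auth_uuid differs from the current one. How the real API server would enforce this is not stated here; ContainerRecord, Conflict, and stale_writer_demo are hypothetical names used only for illustration.

```python
import uuid


class Conflict(Exception):
    """Raised when the caller's known auth_uuid no longer matches the record."""


class ContainerRecord:
    """Toy in-memory container record (hypothetical, not the Arvados API)."""

    def __init__(self):
        self.state = "Queued"
        self.auth_uuid = None
        self.attrs = {}

    def lock(self):
        # Each transition to Locked is assumed to assign a new auth_uuid.
        if self.state != "Queued":
            raise RuntimeError("only a Queued container can be locked")
        self.state = "Locked"
        self.auth_uuid = str(uuid.uuid4())
        return self.auth_uuid

    def unlock(self):
        if self.state != "Locked":
            raise RuntimeError("only a Locked container can be unlocked")
        self.state = "Queued"
        self.auth_uuid = None

    def update(self, known_auth_uuid, **fields):
        # Compare-and-set: apply the update only if the caller's auth_uuid
        # is still the current one.
        if self.auth_uuid is None or known_auth_uuid != self.auth_uuid:
            raise Conflict("auth_uuid has changed; container was relocked")
        self.attrs.update(fields)


def stale_writer_demo():
    """Show that a writer holding an old auth_uuid cannot update the record."""
    record = ContainerRecord()
    first = record.lock()    # noted by the first dispatcher / crunch-run
    record.unlock()
    second = record.lock()   # container relocked; auth_uuid changes

    try:
        record.update(first, exit_code=1)
        stale_failed = False
    except Conflict:
        stale_failed = True

    record.update(second, exit_code=0)
    return stale_failed, record.attrs
```

Unlike the revocable-token proposal, this relies on every writer passing the auth_uuid it learned at lock time, so a caller that omits the check would not be caught.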
Actions #6

Updated by Tom Clegg almost 8 years ago

  • Subject changed from [Crunch2] Prevent dispatch races using a per-container dispatch token to [Crunch2] Prevent dispatch races
Actions #7

Updated by Peter Amstutz almost 8 years ago

  • Assigned To set to Tom Clegg
Actions #8

Updated by Tom Clegg almost 8 years ago

  • Description updated (diff)
Actions #9

Updated by Tom Clegg over 7 years ago

  • Status changed from New to Closed
  • Assigned To deleted (Tom Clegg)
Actions #10

Updated by Tom Morris over 7 years ago

Split into two new stories: #9898 and #9900.

Actions #11

Updated by Tom Morris over 7 years ago

  • Release deleted (11)