Bug #9898

[Crunch2] [API] Add explicit container lock/unlock APIs to prevent dispatch races

Added by Tom Clegg almost 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Radhika Chippada
Category:
Crunch
Target version:
Start date:
08/31/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
2.0
Release:
Release relationship:
Auto

Description

Background

Currently it is possible for two dispatch processes to fight over a container: if they use the same token (which ideally wouldn't happen, but certainly will!), and they both submit "update state to Locked" requests at around the same time, both will succeed (the second one will look like a no-op from the API server's perspective) and both dispatchers will try to run the container.

Resolution

Explicit "lock" and "unlock" APIs will make clients' intentions clear. Even when multiple dispatch processes share the same auth token, they will never unknowingly lose a locking race.

New API endpoint /arvados/v1/containers/lock
  • permitted only if state=Queued
  • changes state to Locked
  • changes locked_by_uuid to current token UUID (same as existing behavior when changing state to Locked)
  • ensures that, when multiple "lock" calls are processed concurrently, one of them fails.
New API endpoint /arvados/v1/containers/unlock
  • permitted only if state=Locked
  • changes state to Queued
  • clears locked_by_uuid, revokes and clears auth_uuid (same as existing behavior when changing state to Queued)

Update crunch-dispatch-* to use the new locking APIs instead of "update" when changing state to Locked or Queued.

Even with this change, it will still be possible for confused dispatchers to unlock one another's containers (assuming they have the same token), but this is acceptable: an unlocked container will be re-attempted anyway. This is desirable in that it allows a dispatch process to unlock its locked containers after a crash/restart when all it knows is its own dispatch token.

Even with this change, the following race will still be possible, and will be addressed separately with a "run_auth" token:
  1. dispatch A locks container
  2. dispatch B (with the same token) unlocks the container
  3. dispatch B locks the container
  4. dispatch A retrieves the user auth token
  5. dispatch B retrieves the user auth token
  6. Now both dispatch processes think they have the lock, and both have a valid user auth token. (If dispatch A retrieves the user auth token before the container gets unlocked by dispatch B, the outcome is different but still undesirable.)

Subtasks

Task #9909: Review branch 9898-container-lock-apiResolvedTom Clegg


Related issues

Related to Arvados - Bug #9900: [Crunch2] [API] Add ephemeral "run token" for running containersResolved08/31/2016

Copied from Arvados - Story #9328: [Crunch2] Prevent dispatch racesClosed

Associated revisions

Revision 7c335bf3
Added by Tom Clegg over 3 years ago

Merge branch '9898-container-lock-api'

closes #9898

History

#1 Updated by Tom Clegg almost 4 years ago

  • Description updated (diff)

#2 Updated by Radhika Chippada almost 4 years ago

  • Assigned To set to Radhika Chippada
  • Target version set to 2016-08-31 sprint
  • Story points set to 2.0

#3 Updated by Radhika Chippada almost 4 years ago

  • Target version changed from 2016-08-31 sprint to 2016-09-14 sprint

#4 Updated by Radhika Chippada almost 4 years ago

  • Status changed from New to In Progress

#5 Updated by Tom Clegg over 3 years ago

In sdk/go/dispatch, How about adding Lock and Unlock methods, so callers can say dispatcher.Lock(container.UUID) instead of dispatcher.UpdateState(container.UUID, Locked)?

When lock fails, Container#lock either raises RecordNotSaved if something unexpected happens, or returns false in the "expected failure" case where the record is already locked. By the time this gets back to the client, we seem to have
  • "Error locking container" -- means the container was already locked
  • "RecordNotSaved" -- means something unexpected happened
Suggest:
  • If the record can't be locked because it is not currently Queued, raise AlreadyLockedError instead of returning false
  • No need for "or raise ..." in the controller

The unlock method should also use a model method protected with with_lock, and should return an error if the container already has state=Queued. (Currently the code/tests contradict the "permitted only if state=Locked" part of the spec.)

(and update container unit test in services/api/test/unit/container_test.rb to use unlock instead of update_attributes(state: Container::Queued))

The error "Multiple actions in functional test" is trying to prevent you from doing this. Resetting the counter makes the error go away, but you've still got a square peg in a round hole. Please refactor the tests instead so each test calls one action. E.g., locking the "queued" fixture should succeed; locking the "locked" fixture should fail.

Suggest something like:

[[:queued, :lock, :success, 'Locked'],
 [:queued, :unlock, 422, 'Queued'],
 [:locked, :lock, 422, 'Locked'],
 ...
].each do |fixture, action, response, state|
  uuid = containers(fixture).uuid
  post action, {id: uuid}
  assert_response response
  assert_equal state, Container.where(uuid: uuid).first.state
end

#6 Updated by Radhika Chippada over 3 years ago

In sdk/go/dispatch, How about adding Lock and Unlock methods, so callers can say dispatcher.Lock(container.UUID) instead of dispatcher.UpdateState(container.UUID, Locked)?

Added

When lock fails ... Suggest:

If the record can't be locked because it is not currently Queued, raise AlreadyLockedError instead of returning false
No need for "or raise ..." in the controller

Done

The unlock method should also use a model method protected with with_lock, and should return an error if the container already has state=Queued. (Currently the code/tests contradict the "permitted only if state=Locked" part of the spec.)

Done

(and update container unit test in services/api/test/unit/container_test.rb to use unlock instead of update_attributes(state: Container::Queued))

Done

The error "Multiple actions in functional test" is trying to prevent you from doing this. Resetting the counter makes the error go away, but you've still got a square peg in a round hole. Please refactor the tests instead so each test calls one action. E.g., locking the "queued" fixture should succeed; locking the "locked" fixture should fail.

Done

#7 Updated by Tom Clegg over 3 years ago

Pushed a few updates to 9898-container-lock-api @ 8eaed3b 706abfc, test job queued at https://ci.curoverse.com/job/developer-test-job/188/

#8 Updated by Tom Clegg over 3 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:7c335bf3dc935f6ac0f1845f65063d5b40ddb5ed.

Also available in: Atom PDF