Bug #9898
closed[Crunch2] [API] Add explicit container lock/unlock APIs to prevent dispatch races
Description
Background¶
Currently it is possible for two dispatch processes to fight over a container: if they use the same token (which ideally wouldn't happen, but certainly will!), and they both submit "update state to Locked" requests at around the same time, both will succeed (the second one will look like a no-op from the API server's perspective) and both dispatchers will try to run the container.
Resolution¶
Explicit "lock" and "unlock" APIs will make clients' intentions clear. Even when multiple dispatch processes share the same auth token, they will never unknowingly lose a locking race.
New API endpoint /arvados/v1/containers/lock- permitted only if state=Queued
- changes state to Locked
- changes locked_by_uuid to current token UUID (same as existing behavior when changing state to Locked)
- ensures that, when multiple "lock" calls are processed concurrently, one of them fails.
- permitted only if state=Locked
- changes state to Queued
- clears locked_by_uuid, revokes and clears auth_uuid (same as existing behavior when changing state to Queued)
Update crunch-dispatch-* to use the new locking APIs instead of "update" when changing state to Locked or Queued.
Even with this change, it will still be possible for confused dispatchers to unlock one another's containers (assuming they have the same token), but this is acceptable: an unlocked container will be re-attempted anyway. This is desirable in that it allows a dispatch process to unlock its locked containers after a crash/restart when all it knows is its own dispatch token.
Even with this change, the following race will still be possible, and will be addressed separately with a "run_auth" token:- dispatch A locks container
- dispatch B (with the same token) unlocks the container
- dispatch B locks the container
- dispatch A retrieves the user auth token
- dispatch B retrieves the user auth token
- Now both dispatch processes think they have the lock, and both have a valid user auth token. (If dispatch A retrieves the user auth token before the container gets unlocked by dispatch B, the outcome is different but still undesirable.)
Updated by Radhika Chippada over 8 years ago
- Assigned To set to Radhika Chippada
- Target version set to 2016-08-31 sprint
- Story points set to 2.0
Updated by Radhika Chippada over 8 years ago
- Target version changed from 2016-08-31 sprint to 2016-09-14 sprint
Updated by Radhika Chippada over 8 years ago
- Status changed from New to In Progress
Updated by Tom Clegg over 8 years ago
In sdk/go/dispatch, How about adding Lock and Unlock methods, so callers can say dispatcher.Lock(container.UUID) instead of dispatcher.UpdateState(container.UUID, Locked)?
Whenlock
fails, Container#lock either raises RecordNotSaved if something unexpected happens, or returns false in the "expected failure" case where the record is already locked. By the time this gets back to the client, we seem to have
- "Error locking container" -- means the container was already locked
- "RecordNotSaved" -- means something unexpected happened
- If the record can't be locked because it is not currently Queued, raise AlreadyLockedError instead of returning false
- No need for "or raise ..." in the controller
The unlock method should also use a model method protected with with_lock
, and should return an error if the container already has state=Queued. (Currently the code/tests contradict the "permitted only if state=Locked" part of the spec.)
(and update container unit test in services/api/test/unit/container_test.rb to use unlock
instead of update_attributes(state: Container::Queued)
)
The error "Multiple actions in functional test" is trying to prevent you from doing this. Resetting the counter makes the error go away, but you've still got a square peg in a round hole. Please refactor the tests instead so each test calls one action. E.g., locking the "queued" fixture should succeed; locking the "locked" fixture should fail.
Suggest something like:
[[:queued, :lock, :success, 'Locked'],
[:queued, :unlock, 422, 'Queued'],
[:locked, :lock, 422, 'Locked'],
...
].each do |fixture, action, response, state|
uuid = containers(fixture).uuid
post action, {id: uuid}
assert_response response
assert_equal state, Container.where(uuid: uuid).first.state
end
Updated by Radhika Chippada over 8 years ago
In sdk/go/dispatch, How about adding Lock and Unlock methods, so callers can say dispatcher.Lock(container.UUID) instead of dispatcher.UpdateState(container.UUID, Locked)?
Added
When lock fails ... Suggest:
If the record can't be locked because it is not currently Queued, raise AlreadyLockedError instead of returning false
No need for "or raise ..." in the controller
Done
The unlock method should also use a model method protected with with_lock, and should return an error if the container already has state=Queued. (Currently the code/tests contradict the "permitted only if state=Locked" part of the spec.)
Done
(and update container unit test in services/api/test/unit/container_test.rb to use unlock instead of update_attributes(state: Container::Queued))
Done
The error "Multiple actions in functional test" is trying to prevent you from doing this. Resetting the counter makes the error go away, but you've still got a square peg in a round hole. Please refactor the tests instead so each test calls one action. E.g., locking the "queued" fixture should succeed; locking the "locked" fixture should fail.
Done
Updated by Tom Clegg over 8 years ago
Pushed a few updates to 9898-container-lock-api @ 8eaed3b 706abfc, test job queued at https://ci.curoverse.com/job/developer-test-job/188/
Updated by Tom Clegg over 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:7c335bf3dc935f6ac0f1845f65063d5b40ddb5ed.