Bug #9898
closed
[Crunch2] [API] Add explicit container lock/unlock APIs to prevent dispatch races
Description
Background
Currently it is possible for two dispatch processes to fight over a container: if they use the same token (which ideally wouldn't happen, but certainly will!), and they both submit "update state to Locked" requests at around the same time, both will succeed (the second one will look like a no-op from the API server's perspective) and both dispatchers will try to run the container.
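The race described above can be illustrated with a toy in-memory model. This is only a sketch: `Container`, `update_state`, and the token UUID string are hypothetical names for illustration, not the API server's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Container:
    state: str = "Queued"
    locked_by_uuid: Optional[str] = None


def update_state(container: Container, new_state: str, token_uuid: str) -> bool:
    """Toy model of the generic "update" call.

    Setting state=Locked on a container already locked by the same token
    is accepted as a no-op, so the caller cannot tell it lost the race.
    """
    if new_state != "Locked":
        return False  # other transitions are out of scope for this sketch
    if container.state == "Locked" and container.locked_by_uuid != token_uuid:
        return False  # locked by a different token
    container.state = "Locked"
    container.locked_by_uuid = token_uuid
    return True


shared_token = "token-uuid-shared"  # hypothetical: both dispatchers use it
c = Container()
a_ok = update_state(c, "Locked", shared_token)  # dispatcher A
b_ok = update_state(c, "Locked", shared_token)  # dispatcher B, looks like a no-op
print(a_ok, b_ok)  # True True
```

Both calls report success, so neither dispatcher learns that the other also believes it holds the container.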
Resolution
Explicit "lock" and "unlock" APIs will make clients' intentions clear. Even when multiple dispatch processes share the same auth token, they will never unknowingly lose a locking race.
New API endpoint /arvados/v1/containers/lock
- permitted only if state=Queued
- changes state to Locked
- changes locked_by_uuid to current token UUID (same as existing behavior when changing state to Locked)
- ensures that, when multiple "lock" calls are processed concurrently, one of them fails.
Corresponding "unlock" API endpoint
- permitted only if state=Locked
- changes state to Queued
- clears locked_by_uuid, revokes and clears auth_uuid (same as existing behavior when changing state to Queued)
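The lock/unlock rules listed above could be modeled roughly as follows. This is a minimal sketch under stated assumptions: a `threading.Lock` stands in for whatever serialization the API server actually uses (for example a database transaction), and `Container`, `InvalidStateError`, and `try_lock` are hypothetical names.

```python
import threading
from dataclasses import dataclass, field
from typing import List, Optional


class InvalidStateError(Exception):
    """Raised when lock/unlock is called in a state that does not permit it."""


@dataclass
class Container:
    state: str = "Queued"
    locked_by_uuid: Optional[str] = None
    auth_uuid: Optional[str] = None
    # Assumption: a mutex stands in for the server's real serialization.
    _mutex: threading.Lock = field(
        default_factory=threading.Lock, repr=False, compare=False
    )

    def lock(self, token_uuid: str) -> None:
        with self._mutex:
            if self.state != "Queued":
                raise InvalidStateError("lock is permitted only if state=Queued")
            self.state = "Locked"
            self.locked_by_uuid = token_uuid

    def unlock(self) -> None:
        with self._mutex:
            if self.state != "Locked":
                raise InvalidStateError("unlock is permitted only if state=Locked")
            self.state = "Queued"
            self.locked_by_uuid = None
            self.auth_uuid = None  # stands in for revoking the auth token


def try_lock(container: Container, token_uuid: str, results: List[str]) -> None:
    try:
        container.lock(token_uuid)
        results.append("locked")
    except InvalidStateError:
        results.append("failed")


container = Container()
results: List[str] = []
threads = [
    threading.Thread(target=try_lock, args=(container, "token-uuid-shared", results))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # ['failed', 'locked']
```

Because the state check and the state change happen under the same mutex, the second of two concurrent "lock" calls sees state=Locked and fails explicitly, even though both callers present the same token.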
Update crunch-dispatch-* to use the new locking APIs instead of "update" when changing state to Locked or Queued.
Even with this change, it will still be possible for confused dispatchers to unlock one another's containers (assuming they have the same token), but this is acceptable: an unlocked container will be re-attempted anyway. This is desirable in that it allows a dispatch process to unlock its locked containers after a crash/restart when all it knows is its own dispatch token.
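As a rough illustration of the crash/restart case, a dispatcher that knows only its own token UUID could return its leftover containers to the queue as sketched below. The dictionary layout, the `unlock_stale` helper, and the sample UUID strings are assumptions for illustration; a real dispatcher would call the unlock API instead of mutating records directly.

```python
from typing import Dict, List, Optional


def unlock_stale(
    containers: List[Dict[str, Optional[str]]], my_token_uuid: str
) -> List[Optional[str]]:
    """Unlock every container currently locked by this dispatch token."""
    unlocked = []
    for c in containers:
        if c["state"] == "Locked" and c["locked_by_uuid"] == my_token_uuid:
            # Same effect as the unlock rules: back to Queued, lock cleared.
            c["state"] = "Queued"
            c["locked_by_uuid"] = None
            c["auth_uuid"] = None
            unlocked.append(c["uuid"])
    return unlocked


# Hypothetical records; the UUID strings are placeholders.
containers = [
    {"uuid": "c1", "state": "Locked", "locked_by_uuid": "tok-A", "auth_uuid": "auth-1"},
    {"uuid": "c2", "state": "Locked", "locked_by_uuid": "tok-B", "auth_uuid": "auth-2"},
    {"uuid": "c3", "state": "Queued", "locked_by_uuid": None, "auth_uuid": None},
]
recovered = unlock_stale(containers, "tok-A")
print(recovered)  # ['c1']
```

Note that the filter is by token UUID only, which is exactly why two dispatchers sharing a token can unlock one another's containers.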
Even with this change, the following race will still be possible, and will be addressed separately with a "run_auth" token:
- dispatch A locks container
- dispatch B (with the same token) unlocks the container
- dispatch B locks the container
- dispatch A retrieves the user auth token
- dispatch B retrieves the user auth token
- Now both dispatch processes think they have the lock, and both have a valid user auth token. (If dispatch A retrieves the user auth token before the container gets unlocked by dispatch B, the outcome is different but still undesirable.)
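The sequence above can be replayed against a toy model to show why explicit lock/unlock alone does not close this race. This is a sketch under assumptions: the `get_auth` check, and the idea that a fresh auth record is created on each lock, are illustrative guesses rather than confirmed API server behavior, and all names are hypothetical.

```python
import itertools


class Container:
    """Toy model; auth handling here is an assumption, not the real API."""

    _auth_ids = itertools.count(1)

    def __init__(self):
        self.state = "Queued"
        self.locked_by_uuid = None
        self.auth_uuid = None

    def lock(self, token_uuid):
        if self.state != "Queued":
            raise RuntimeError("lock is permitted only if state=Queued")
        self.state = "Locked"
        self.locked_by_uuid = token_uuid
        self.auth_uuid = "auth-%d" % next(self._auth_ids)

    def unlock(self):
        if self.state != "Locked":
            raise RuntimeError("unlock is permitted only if state=Locked")
        self.state = "Queued"
        self.locked_by_uuid = None
        self.auth_uuid = None  # stands in for revoking the auth token

    def get_auth(self, token_uuid):
        # The server can only compare token UUIDs, so it cannot tell
        # dispatcher A from dispatcher B when they share a token.
        if self.state != "Locked" or self.locked_by_uuid != token_uuid:
            raise PermissionError("caller does not hold the lock")
        return self.auth_uuid


shared = "token-uuid-shared"  # hypothetical token shared by A and B
c = Container()
c.lock(shared)               # dispatch A locks container
c.unlock()                   # dispatch B unlocks the container
c.lock(shared)               # dispatch B locks the container
a_auth = c.get_auth(shared)  # dispatch A retrieves the user auth token
b_auth = c.get_auth(shared)  # dispatch B retrieves the user auth token
print(a_auth == b_auth)      # True
```

Every call succeeds, and both dispatchers end up holding the same valid auth token, which is the situation the separate "run_auth" token is meant to address.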