Bug #10705
open[Crunch2] [API] return a more specific 422 error message when a client calls containers#unlock without having the lock
Description
Currently we get "arvados API server error: #<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError> (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com"
This should be something more suggestive of locking, perhaps something along the lines of YouDoNotHaveTheLockError.
Updated by Tom Morris about 8 years ago
- Target version set to Arvados Future Sprints
Updated by Peter Amstutz about 8 years ago
- Target version changed from Arvados Future Sprints to 2017-01-04 sprint
Updated by Peter Amstutz almost 8 years ago
- Target version changed from 2017-01-04 sprint to 2017-01-18 sprint
Updated by Radhika Chippada almost 8 years ago
- Assigned To changed from Peter Amstutz to Radhika Chippada
Updated by Radhika Chippada almost 8 years ago
@ 703b7800
Make container_request -> unlock method a no-op so that crunch-dispatcher doesn't keep retrying when an unlock request errors out.
Updated by Tom Clegg almost 8 years ago
I don't think this is solving the right problem.
Even if unlocking an already-unlocked container is a no-op, crunch-dispatch-slurm still needs to correctly handle the case where unlock fails -- e.g., a different dispatcher has locked the container, or the container has been cancelled.
Once crunch-dispatch-slurm is fixed, "can't unlock an already-unlocked container" should be rare and harmless when things are working well, and helpful for turning making subtle bugs (like this one) easier to find.
So, instead of changing the error to a no-op, can we fix whatever prevented crunch-dispatch-slurm from noticing that the container has been taken away from it, and it should stop trying to mess with it?
Updated by Peter Amstutz almost 8 years ago
Tom Clegg wrote:
I don't think this is solving the right problem.
Even if unlocking an already-unlocked container is a no-op, crunch-dispatch-slurm still needs to correctly handle the case where unlock fails -- e.g., a different dispatcher has locked the container, or the container has been cancelled.
Once crunch-dispatch-slurm is fixed, "can't unlock an already-unlocked container" should be rare and harmless when things are working well, and helpful for turning making subtle bugs (like this one) easier to find.
So, instead of changing the error to a no-op, can we fix whatever prevented crunch-dispatch-slurm from noticing that the container has been taken away from it, and it should stop trying to mess with it?
There is a whole series of bugs linked from #10700. I agree that the actual fix is to ensure that crunch-dispatch-slurm doesn't misbehave when the unlock operation fails. On the other hand, if the API server didn't give a misleading response, maybe the existing code would be sufficient to recover. So to reduce the chance of future problems we should fix it on both sides.
Updated by Tom Clegg almost 8 years ago
Peter Amstutz wrote:
if the API server didn't give a misleading response, maybe the existing code would be sufficient to recover. So to reduce the chance of future problems we should fix it on both sides.
Hm. This reasoning sounds like "if the API call had succeeded, maybe we wouldn't need to handle the error case". But we do need to handle the error case to handle the slightly-different scenarios, like state==Cancelled and locked-by-someone-else -- and when we handle the error case, there is no need to change the API...?
I'm much more comfortable keeping the behavior as "either unlock, or return an error". If a client calls "unlock" when the container is not even locked, that's an opportunity for the client to notice that something is not working as expected. If double-unlock returns OK there is no way for the client to know anything is amiss. In other words, it's already easy for a client to ignore a harmless error, but it would be impossible for a client to detect a suppressed error.
Updated by Radhika Chippada almost 8 years ago
- Target version changed from 2017-01-18 sprint to 2017-02-01 sprint
Updated by Tom Clegg almost 8 years ago
- Subject changed from [Crunch2] API server returns 422 when providing container state field with same value as current state to [Crunch2] [API] return a more specific 422 error message when a client calls containers#unlock without having the lock
- Description updated (diff)
Updated by Radhika Chippada almost 8 years ago
- Target version changed from 2017-02-01 sprint to Arvados Future Sprints
Updated by Tom Morris over 7 years ago
- Status changed from In Progress to New
- Assigned To deleted (
Radhika Chippada)
Updated by Ward Vandewege over 3 years ago
- Target version deleted (
Arvados Future Sprints)