Bug #10705

[Crunch2] [API] return a more specific 422 error message when a client calls containers#unlock without having the lock

Added by Peter Amstutz about 2 years ago. Updated 8 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
Start date:
01/11/2017
Due date:
% Done:

0%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

Currently we get "arvados API server error: #<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError> (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com"

This should be something more suggestive of locking, perhaps something along the lines of YouDoNotHaveTheLockError.


Subtasks

Task #10740: Review 10705-state-transition-errorIn ProgressRadhika Chippada


Related issues

Related to Arvados - Bug #10700: [Crunch2] crunch-dispatch-slurm pileupResolved2017-01-27

Associated revisions

Revision f927cd24
Added by Tom Clegg almost 2 years ago

Merge branch '10705-dedup-log-message'

refs #10705

History

#1 Updated by Tom Morris about 2 years ago

  • Target version set to Arvados Future Sprints

#2 Updated by Peter Amstutz about 2 years ago

  • Target version changed from Arvados Future Sprints to 2017-01-04 sprint

#3 Updated by Peter Amstutz about 2 years ago

  • Assigned To set to Peter Amstutz

#4 Updated by Peter Amstutz almost 2 years ago

  • Target version changed from 2017-01-04 sprint to 2017-01-18 sprint

#5 Updated by Radhika Chippada almost 2 years ago

  • Assigned To changed from Peter Amstutz to Radhika Chippada

#6 Updated by Radhika Chippada almost 2 years ago

@ 703b7800

Make container_request -> unlock method a no-op so that crunch-dispatcher doesn't keep retrying when an unlock request errors out.

#7 Updated by Tom Clegg almost 2 years ago

I don't think this is solving the right problem.

Even if unlocking an already-unlocked container is a no-op, crunch-dispatch-slurm still needs to correctly handle the case where unlock fails -- e.g., a different dispatcher has locked the container, or the container has been cancelled.

Once crunch-dispatch-slurm is fixed, "can't unlock an already-unlocked container" should be rare and harmless when things are working well, and helpful for turning making subtle bugs (like this one) easier to find.

So, instead of changing the error to a no-op, can we fix whatever prevented crunch-dispatch-slurm from noticing that the container has been taken away from it, and it should stop trying to mess with it?

#8 Updated by Tom Clegg almost 2 years ago

  • Status changed from New to In Progress

#9 Updated by Peter Amstutz almost 2 years ago

Tom Clegg wrote:

I don't think this is solving the right problem.

Even if unlocking an already-unlocked container is a no-op, crunch-dispatch-slurm still needs to correctly handle the case where unlock fails -- e.g., a different dispatcher has locked the container, or the container has been cancelled.

Once crunch-dispatch-slurm is fixed, "can't unlock an already-unlocked container" should be rare and harmless when things are working well, and helpful for turning making subtle bugs (like this one) easier to find.

So, instead of changing the error to a no-op, can we fix whatever prevented crunch-dispatch-slurm from noticing that the container has been taken away from it, and it should stop trying to mess with it?

There is a whole series of bugs linked from #10700. I agree that the actual fix is to ensure that crunch-dispatch-slurm doesn't misbehave when the unlock operation fails. On the other hand, if the API server didn't give a misleading response, maybe the existing code would be sufficient to recover. So to reduce the chance of future problems we should fix it on both sides.

#10 Updated by Tom Clegg almost 2 years ago

Peter Amstutz wrote:

if the API server didn't give a misleading response, maybe the existing code would be sufficient to recover. So to reduce the chance of future problems we should fix it on both sides.

Hm. This reasoning sounds like "if the API call had succeeded, maybe we wouldn't need to handle the error case". But we do need to handle the error case to handle the slightly-different scenarios, like state==Cancelled and locked-by-someone-else -- and when we handle the error case, there is no need to change the API...?

I'm much more comfortable keeping the behavior as "either unlock, or return an error". If a client calls "unlock" when the container is not even locked, that's an opportunity for the client to notice that something is not working as expected. If double-unlock returns OK there is no way for the client to know anything is amiss. In other words, it's already easy for a client to ignore a harmless error, but it would be impossible for a client to detect a suppressed error.

#11 Updated by Radhika Chippada almost 2 years ago

  • Target version changed from 2017-01-18 sprint to 2017-02-01 sprint

#12 Updated by Tom Clegg almost 2 years ago

  • Subject changed from [Crunch2] API server returns 422 when providing container state field with same value as current state to [Crunch2] [API] return a more specific 422 error message when a client calls containers#unlock without having the lock
  • Description updated (diff)

#13 Updated by Radhika Chippada almost 2 years ago

  • Target version changed from 2017-02-01 sprint to Arvados Future Sprints

#14 Updated by Tom Morris over 1 year ago

  • Status changed from In Progress to New
  • Assigned To deleted (Radhika Chippada)

#15 Updated by Tom Morris 8 months ago

  • Release deleted (11)

Also available in: Atom PDF