Bug #22160
openIdempotent container request 'Committed' state
Description
An operation to update a container request from "Uncommitted" to "Committed" failed with the error "cannot go to Committed from Final state". This doesn't make sense because it shouldn't have gotten into "Committed" state.
Here's my best guess as to what happened.
- The container request was created in "Uncommitted" state like normal.
- The container request was updated to "Committed" in the database
- The request reused a container, so it flips over to "Final"
- The response never makes it back to the client
- The client eventually retries the request, and gets back an a 422 error because updating the state from "Final" to "Committed" is not allowed
- The workflow fails
This seems like a pretty low-frequency error (I don't know if we've ever seen it before) but we could tweak the API server behavior so that an update containing the state of "Committed" that doesn't result in changes to any other fields is accepted and it returns the record back in "Final" state. From the client's perspective this looks exactly like what happens when you update to "Committed" and immediately get back a record in "Final" because of container reuse.
Updated by Tom Clegg 13 days ago
"200 OK, request failed, but I think you should continue as if it had succeeded" doesn't sit well with me.
Other ideas:
The client could handle the 422 -- retrieve the current version and check whether the state is still Uncommitted before giving up.
We could add a separate "commit" API that changes Uncommitted to Committed and is a no-op in other states, similar to our lock/unlock APIs.
Updated by Peter Amstutz 12 days ago
Tom Clegg wrote in #note-2:
"200 OK, request failed, but I think you should continue as if it had succeeded" doesn't sit well with me.
Ok, but sending an update with {"state": "Committed"}
and getting back a record that has {"state": "Final"}
is already a thing under job reuse. We get to set the rules. Also, this has the advantage of working with older versions of arvados-cwl-runner
.
Other ideas:
The client could handle the 422 -- retrieve the current version and check whether the state is still Uncommitted before giving up.
So, arvados-cwl-runner could do this, but it's going to be hard for anyone else who might be writing similar code to know that "if there's a network hiccup and we automatically retry, this API call that we're assuming is idempotent actually isn't". But it's not that hard to add a little extra code that does "catch error, try to re-fetch container, if it turns out to be in Committed or Final state, then proceed anyway".
We could add a separate "commit" API that changes Uncommitted to Committed and is a no-op in other states, similar to our lock/unlock APIs.
This is just a different way of spelling the same semantic change that I suggested. The best time to do that would have been when we designed the crunch v2 API however many years ago. Now it means we have we will have two ways to do something to maintain backwards compatibility, then have to check the API revision to decide which one to use. To be honest, I'd rather change nothing and handle it on the client than introduce a new dedicated API for this corner case.