Project

General

Custom queries

Profile

Actions

Bug #22160

closed

Idempotent container request 'Committed' state

Added by Peter Amstutz 5 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Target version:
Story points:
-
Release relationship:
Auto

Description

An operation to update a container request from "Uncommitted" to "Committed" failed with the error "cannot go to Committed from Final state". This doesn't make sense because it shouldn't have gotten into "Committed" state.

Here's my best guess as to what happened.

  1. The container request was created in "Uncommitted" state like normal.
  2. The container request was updated to "Committed" in the database
  3. The request reused a container, so it flips over to "Final"
  4. The response never makes it back to the client
  5. The client eventually retries the request, and gets back an a 422 error because updating the state from "Final" to "Committed" is not allowed
  6. The workflow fails

This seems like a pretty low-frequency error (I don't know if we've ever seen it before) but we could tweak the API server behavior so that an update containing the state of "Committed" that doesn't result in changes to any other fields is accepted and it returns the record back in "Final" state. From the client's perspective this looks exactly like what happens when you update to "Committed" and immediately get back a record in "Final" because of container reuse.


Subtasks 1 (0 open1 closed)

Task #22480: Review 22160-retry-container-reqResolvedPeter Amstutz01/29/2025Actions
Actions #1

Updated by Peter Amstutz 5 months ago

  • Description updated (diff)
Actions #2

Updated by Tom Clegg 5 months ago

"200 OK, request failed, but I think you should continue as if it had succeeded" doesn't sit well with me.

Other ideas:

The client could handle the 422 -- retrieve the current version and check whether the state is still Uncommitted before giving up.

We could add a separate "commit" API that changes Uncommitted to Committed and is a no-op in other states, similar to our lock/unlock APIs.

Actions #3

Updated by Peter Amstutz 5 months ago

Tom Clegg wrote in #note-2:

"200 OK, request failed, but I think you should continue as if it had succeeded" doesn't sit well with me.

Ok, but sending an update with {"state": "Committed"} and getting back a record that has {"state": "Final"} is already a thing under job reuse. We get to set the rules. Also, this has the advantage of working with older versions of arvados-cwl-runner.

Other ideas:

The client could handle the 422 -- retrieve the current version and check whether the state is still Uncommitted before giving up.

So, arvados-cwl-runner could do this, but it's going to be hard for anyone else who might be writing similar code to know that "if there's a network hiccup and we automatically retry, this API call that we're assuming is idempotent actually isn't". But it's not that hard to add a little extra code that does "catch error, try to re-fetch container, if it turns out to be in Committed or Final state, then proceed anyway".

We could add a separate "commit" API that changes Uncommitted to Committed and is a no-op in other states, similar to our lock/unlock APIs.

This is just a different way of spelling the same semantic change that I suggested. The best time to do that would have been when we designed the crunch v2 API however many years ago. Now it means we have we will have two ways to do something to maintain backwards compatibility, then have to check the API revision to decide which one to use. To be honest, I'd rather change nothing and handle it on the client than introduce a new dedicated API for this corner case.

Actions #4

Updated by Peter Amstutz about 1 month ago

  • Target version changed from Future to Development 2025-01-29
  • Assigned To set to Peter Amstutz
Actions #5

Updated by Peter Amstutz about 1 month ago

  • Status changed from New to In Progress
Actions #6

Updated by Peter Amstutz about 1 month ago

  • Subtask #22480 added
Actions #7

Updated by Peter Amstutz about 1 month ago

22160-retry-container-req @ f58030d74160771bebb87dc9407614762b086445

developer-run-tests: #4628

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • yes. Attempts to recovers from errors on the client side, as agreed.
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • n/a
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • did not add additional tests, existing tests pass, but unrelated tests have failed a couple of times
  • New or changed UX/UX and has gotten feedback from stakeholders.
    • n/a
  • Documentation has been updated.
    • n/a
  • Behaves appropriately at the intended scale (describe intended scale).
    • Improves scale by making this edge case more robust.
  • Considered backwards and forwards compatibility issues between client and server.
    • no issues
  • Follows our coding standards and GUI style guidelines.
    • yes
Actions #8

Updated by Peter Amstutz about 1 month ago

  • Release set to 75
Actions #9

Updated by Brett Smith about 1 month ago

Peter Amstutz wrote in #note-7:

22160-retry-container-req @ f58030d74160771bebb87dc9407614762b086445

What's here is fine. Given how much time we spent discussing this and convincing ourselves that this was best done in the client, it seems to me like it'd be really good to have a test specifically for it. But I won't insist.

Actions #10

Updated by Peter Amstutz about 1 month ago

  • Target version changed from Development 2025-01-29 to Development 2025-02-12
Actions #11

Updated by Peter Amstutz about 1 month ago

Brett Smith wrote in #note-9:

Peter Amstutz wrote in #note-7:

22160-retry-container-req @ f58030d74160771bebb87dc9407614762b086445

What's here is fine. Given how much time we spent discussing this and convincing ourselves that this was best done in the client, it seems to me like it'd be really good to have a test specifically for it. But I won't insist.

I went ahead and added a test. Since it didn't turn up any unexpected behavior and the code you already signed off on didn't change, I went ahead and merged. Thanks.

Actions #12

Updated by Peter Amstutz about 1 month ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF