[API] Admin can prevent reuse by cancelling a completed container
Background: Sometimes it is desirable to avoid reusing a specific completed container, without disabling reuse for the corresponding step in a workflow. Example: a container exited 0 with bogus output, but it has never™ done that before and will probably never™ do it again, so it's not worth updating the workflow to use a new image/version.
Relax the "frozen in final state" constraints slightly, so an admin can use the API to update a container's state from Complete to Cancelled.
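For illustration, a minimal sketch of the kind of request this implies. It assumes the usual REST update route (PUT to /arvados/v1/containers/{uuid}) and a Bearer token header. The host name, token, and helper name build_cancel_request are placeholders, not values from this ticket. The request is only built here, not sent.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical helper: builds (but does not send) an update request that
# sets a container's state to Cancelled. Host, token and UUID are placeholders.
def build_cancel_request(api_host, token, container_uuid)
  uri = URI("https://#{api_host}/arvados/v1/containers/#{container_uuid}")
  req = Net::HTTP::Put.new(uri)
  req["Authorization"] = "Bearer #{token}"   # assumed auth scheme
  req["Content-Type"] = "application/json"
  req.body = JSON.generate(container: { state: "Cancelled" })
  req
end

req = build_cancel_request("arvados.example", "ADMIN_TOKEN", "xxxxx-xxxxx-xxxxxxxxxxxxxxx")

# To actually send it (requires a reachable API server and an admin token):
# Net::HTTP.start(req.uri.host, req.uri.port, use_ssl: true) { |http| http.request(req) }
```

With the change proposed here, the server would be expected to accept this for an admin token when the container is Complete, and reject it otherwise.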
#6 Updated by Eric Biagiotti 5 months ago
Latest at: e1b8a1b0bdb46d0e162ef7794c06b08d0a5fffa5
- Created a Complete => [Cancelled] transition that gets merged into the State_transition hash if the user is an admin.
- Added a test for setting a completed container's state to cancelled by admin and non-admin
- Created a new admin page and added detail to the containers API page.
Note: In the container model, we check that the value for container state has actually changed before validating, which means that if a user tries to update to the current state (via arv container update or whatever), it passes validation. Even though nothing changed, this may be confusing to the user as it implies that this type of command is allowed. Let me know if you think this should be changed.
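To make the two points above concrete, here is a minimal standalone sketch, not the actual model code. The constant and method names (STATE_TRANSITIONS, ADMIN_STATE_TRANSITIONS, transition_valid?) and the baseline transition table are assumptions for illustration. It shows an admin-only transition merged into the base table, and how skipping validation when the state is unchanged lets a no-op update pass.

```ruby
# Hypothetical baseline table; the real model may differ.
STATE_TRANSITIONS = {
  nil         => ["Queued"],
  "Queued"    => ["Locked", "Cancelled"],
  "Locked"    => ["Queued", "Running", "Cancelled"],
  "Running"   => ["Complete", "Cancelled"],
  "Complete"  => [],
  "Cancelled" => [],
}.freeze

# Extra transitions granted only to admins (illustrative name).
ADMIN_STATE_TRANSITIONS = { "Complete" => ["Cancelled"] }.freeze

# Returns the transition table that applies to the current user.
def allowed_transitions(admin)
  return STATE_TRANSITIONS unless admin
  # For states present in both tables, take the union of allowed targets.
  STATE_TRANSITIONS.merge(ADMIN_STATE_TRANSITIONS) { |_state, base, extra| base | extra }
end

def transition_valid?(from, to, admin:)
  # Mirrors the behaviour described in the note: if the state did not
  # actually change, validation is skipped and the update passes.
  return true if from == to
  allowed_transitions(admin).fetch(from, []).include?(to)
end

puts transition_valid?("Complete", "Cancelled", admin: true)
puts transition_valid?("Complete", "Cancelled", admin: false)
puts transition_valid?("Complete", "Complete", admin: false)
```

In this sketch the third call succeeds even for a non-admin, which is the potentially confusing case raised above: the update is accepted only because nothing changed.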
I see how the story description and API docs imply otherwise, but non-admin users are already prohibited from updating containers at all (except the container process itself can update its own progress fields while state==Running). So I don't think we need a special "admin state transitions" here.
For example, if a container exited successfully but produced bad output, it may not be worth updating the workflow to use a new image/version. Instead, changing the state of a container to Cancelled will disable reuse of the specific container.
I think this should be presented as an expedient thing to do in addition to fixing the bug, rather than as an alternative. We can't force people to fix the bugs, but I don't think we should nudge them in the wrong direction.
Maybe: "... it may not be feasible to update the workflow immediately ... Meanwhile, changing the state ..."
...will change it's state to...
UUID of the container:
xxxxx-xxxxx-xxxxxxxxxxxxxxx is the UUID of the container
all containers that have been cancelled after completion
Either this should say "exited 0 and were then cancelled" or the condition should be
\ No newline at end of file
(manage-containers.html.textile.liquid) should have one.
Cancelled (admin only)
All state changes are admin only.
The admin page title should probably be something like "Controlling container reuse" -- hint at the user's question, which they know, rather than the answer, which they don't. Same goes for the leading sentence. And I think it needs to be added to _config.yml so it shows up in the left nav.
15002-container-permissions @ 5e95c9b723e36cf80e0b9c1bf02206520503d4f1 https://ci.curoverse.com/view/Developer/job/developer-run-tests/1191/ ✔
#9 Updated by Eric Biagiotti 5 months ago
Latest at 7e0a38f70392822f362bd94f9d9093554c8f351a
- Addressed the documentation issues from Note 7. I already added manage-containers.html.textile.liquid to _config.yml, so it should be showing up in the left navigation panel. Let me know if this still doesn't work for you.
- Merged in your changes and simplified Complete => Cancelled transition in the container model.
- Updated the TestGetLockUnlockCancel test to accept the new state transition.
- this branch is a good candidate for squashing to reduce noise in the git history (code that never made it in)
- the "Controlling container reuse" page should probably be called controlling-reuse.html instead of manage-containers.html
- link title in api/methods/containers.html should be updated too
- maybe the doc page should allude to the fact that only an admin can do this -- or is it enough that it's in the "admin" section? I'm not sure whether we should rely on people to notice the nav clues when they arrive at this page via web search or the link on the API page.
- maybe "in your workflow" → "in all affected workflows"? (It's probably not "yours" and there can easily be more than one)
- maybe "disable reuse as the workflow continues to run" → "prevent it from being reused in subsequent workflows"?
It looks like this used to test propagating errors from the API.
err = cq.Cancel(arvadostest.CompletedContainerUUID)
c.Check(err, check.ErrorMatches, `.*State cannot change from Complete to Cancelled.*`)
Now that that's not an error, this part of the test seems superfluous, given the successful Cancel() calls above. Might as well remove it. Maybe check the returned error message in one of the non-nil error cases above instead?
#11 Updated by Eric Biagiotti 5 months ago
Addressed the comments from note 10, rebased on master, and squashed into one commit at 30fc42873b2c82e72a95393eb053abf1f7052618