Feature #18600
closedCollection update API for manipulating files
Added by Peter Amstutz about 3 years ago. Updated almost 3 years ago.
Description
New endpoint "arvados/v1/collections/uuid/modify"
Map in the form:
{"destination": "source"}
destination: in the form "destination/path"
source: either "PDH/source/path" (can be a directory, including root) or empty (destination should be deleted)
Gets the file content from the path at the portable data hash. Replaces the contents of the destination.
The original portable data hash of the collection (before edits are applied) is valid.
Cannot have a destination that's within another destination, e.g.
{
"/sys": "abc",
"/sys/foobar": "xyz"
}
Is an error because the specification of "/sys" overlaps.
It is the job of the client to determine that there are no overlaps, when merging collections, the client must resolve conflicts (e.g. same named files).
Files
18600-doc-preview.png (150 KB) 18600-doc-preview.png | Tom Clegg, 03/07/2022 07:35 PM |
Updated by Lucas Di Pentima almost 3 years ago
- Blocks Bug #18587: "Copy selected into collection" incorrect behavior added
Updated by Peter Amstutz almost 3 years ago
- Release set to 46
- Target version set to 2022-02-02 sprint
Updated by Peter Amstutz almost 3 years ago
- Status changed from New to In Progress
Updated by Tom Clegg almost 3 years ago
- Target version changed from 2022-02-02 sprint to 2022-02-16 sprint
Updated by Tom Clegg almost 3 years ago
- Status changed from In Progress to New
18600-copy-by-ref @ 668e4f412e70817059fa093b4739c06c50705e79 -- developer-run-tests: #2909
Adds Snapshot and Splice functions to sdk/go/arvados. The idea is that the "modify collection" API described in this ticket will be implemented by calling SiteFileSystem to create a fuse-like filesystem, creating an empty collectionfs, doing a sequence of Snapshot(sitefs, "by_id/"+pdh)
, target.RemoveAll(...)
, and Splice(target, dstpath)
operations to build a new collection, then calling MarshalManifest and saving that to the target collection.
Another way we could export this functionality is func Copy(dstfs FileSystem, dstpath string, srcfs FileSystem, srcpath string) error
... I'm not really sure which is more convenient/obvious. I think the Snapshot/Splice approach might be a bit easier to write/understand when there's a series of operations where the src and dst filesystem are the same, but I'm pretty sure any given sequence can be done either way.
Updated by Lucas Di Pentima almost 3 years ago
Just a couple of cosmetic comments:
- File
sdk/go/arvados/fs_base.go
- Lines 84, 85: "It is an error to replace a directory with a file" -- may require correction.
- Line 174: Repeated 'the' word on comment
Other than that, it LGTM.
Updated by Tom Clegg almost 3 years ago
- Target version changed from 2022-02-16 sprint to 2022-03-02 sprint
Updated by Tom Clegg almost 3 years ago
18600-copy-by-ref @ ca0b6ff3d75f68b7a2f1821d605fadd49481038e
(just updates comments)
Updated by Peter Amstutz almost 3 years ago
- Target version changed from 2022-03-02 sprint to 2022-03-16 sprint
Updated by Tom Clegg almost 3 years ago
retry workbench1: developer-run-tests-apps-workbench-integration: #3146
retry workbench1: developer-run-tests-apps-workbench-integration: #3148
- Rather than a separate "modify" API, I added a virtual attribute "splices" that can be passed in a create or update request in place of manifest_text. This makes it possible to atomically create a new collection using data from existing collections.
- Deleting a subtree (even the root) and then inserting new content below that point is supported, as an exception to the "cannot have a destination that's within another destination" rule proposed above. This makes it possible to atomically clear and rebuild an existing collection from scratch, rather than requiring the client to list every existing file/subdir to delete, which would be inconvenient and prone to unexpected behavior in races.
The attribute name "splices" alludes to the js/ruby/perl array splice operation ("delete some items and insert some new ones in their place") but perhaps there is a better name for it?
Updated by Lucas Di Pentima almost 3 years ago
Some comments:
- Regarding the name "splices", I find it confusing because the param is about a series of operations that should be done to the data the collection holds (as opposed to the rest of the params, which usually are about data or metadata).
- As commented on chat: one idea would be to name it something like "manifest_ops" making it explicit that its contents are a list of operations on the collection's data.
- Another idea would be converting the noun-based name "splices" to "splicings" making it more evident that its contents are actions, not data. Even then, it doesn't have the "data context" in its name, but I think it may be better than using a noun.
- Do we need to update the discovery document to list this new param?
- It seems that removing non-existent targets is currently a no-op; not sure if that's on purpose.
- I think I've found one case that is not a correct behavior:
{"/": "SOME-VALID-PDH", "/": ""}
→ produces an empty collection{"/foo": "SOME-VALID-PDH", "/": ""}
→ produces a collection with/foo
containingSOME-VALID-PDH
's contents.
Updated by Lucas Di Pentima almost 3 years ago
Lucas Di Pentima wrote:
- I think I've found one case that is not a correct behavior:
{"/": "SOME-VALID-PDH", "/": ""}
→ produces an empty collection{"/foo": "SOME-VALID-PDH", "/": ""}
→ produces a collection with/foo
containingSOME-VALID-PDH
's contents.
I see now that in the 1st case, I used the same target with different sources. Perhaps we should avoid accepting the same target multiple times or document that the last operation is the one that would be the final result.
If accepting the same target multiple times is OK, it would be useful to avoid doing extra work for the [0..N-1] cases of the same target.
Updated by Lucas Di Pentima almost 3 years ago
Agh, please forget about my last comment. My tests were being done with the PySDK, so obviously this happens:
>>> {"hello":"world", "hello":"people"}
{'hello': 'people'}
Sorry about that!
Updated by Tom Clegg almost 3 years ago
replace_files
instead of splices
?
- "replace" is a verb, and not a noun
- hints that existing content at specified paths will be deleted (as opposed to "add" or "update" which might imply some kind of merge)
"collection": { "replace_files": { "/foo.txt": "", "/bar.txt": "fa7aeb5140e2848d39b416daeef4ffc5+45/foo.txt", "/subdir_to_delete": "" } }
Updated by Tom Clegg almost 3 years ago
Re discovery doc, I think this would have to be added as a "Collection" property, which incorrectly implies that it's part of a collection response too. This might be a hint that we should expose it in the API as an update option beside the "new collection properties" object instead of inside it, more like this
PATCH /arvados/v1/collections/{uuid} { "collection": { "name": "new_name" }, "replace_files": { "/foo.txt": "", "/bar.txt": "fa7aeb5140e2848d39b416daeef4ffc5+45/foo.txt", "/subdir_to_delete": "" } }
Thoughts?
Updated by Peter Amstutz almost 3 years ago
Tom Clegg wrote:
Re discovery doc, I think this would have to be added as a "Collection" property, which incorrectly implies that it's part of a collection response too. This might be a hint that we should expose it in the API as an update option beside the "new collection properties" object instead of inside it, more like this
[...]
Thoughts?
I agree with this, it is an update parameter but not part of the collection object.
Updated by Lucas Di Pentima almost 3 years ago
I think replace_files
is a lot better than splices
!
Re: note-23, I see what you mean, is it possible to do that with the current Google API Client support? (not remembering of any existing similar case) If so, I think it's a good idea to split request data that's about the object from actions being done to it.
Updated by Tom Clegg almost 3 years ago
- instead of a write-only collection attribute called "splices", we now have a create/update parameter called "replace_files"
- add replace_files param to discovery doc
- test replace_files from sdk/cli tests (this relies on the param being in the discovery doc)
Accepting "delete nonexistent path" as a no-op is intentional. Analogous to "rm -rf", it avoids forcing the client to spend extra effort on "check whether X exists before deleting it", and makes it possible to do things like "replace entire contents of dir/ with dir/file.txt" atomically. I think a "can't delete, doesn't exist" error would essentially be a race-detection feature for a certain type of race, which will be better addressed (along with a much wider range of races) by implementing If-Matches support.
Updated by Tom Clegg almost 3 years ago
merge main → 18600-update-files @ f7bf9d69603db2d500563648460e2a96524de266 --
Updated by Tom Clegg almost 3 years ago
18600-update-files @ b8fb328cc2f084d155add478264ffaf384eda190 -- developer-run-tests: #2965
Updated by Tom Clegg almost 3 years ago
- Status changed from In Progress to Resolved
Updated by Tom Clegg over 2 years ago
- Related to Feature #18871: WebDAV uses replace_files API added