Project

General

Profile

Actions

Feature #18600

closed

Collection update API for manipulating files

Added by Peter Amstutz over 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
-
Release relationship:
Auto

Description

New endpoint "arvados/v1/collections/uuid/modify"

Map in the form:

{"destination": "source"}

destination: in the form "destination/path"

source: either "PDH/source/path" (can be a directory, including root) or empty (destination should be deleted)

Gets the file content from the path at the portable data hash. Replaces the contents of the destination.

The original portable data hash of the collection (before edits are applied) is valid.

Cannot have a destination that's within another destination, e.g.

{
"/sys": "abc",
"/sys/foobar": "xyz"
}

Is an error because the specification of "/sys" overlaps.

It is the job of the client to determine that there are no overlaps, when merging collections, the client must resolve conflicts (e.g. same named files).


Files

18600-doc-preview.png (150 KB) 18600-doc-preview.png Tom Clegg, 03/07/2022 07:35 PM

Subtasks 2 (0 open2 closed)

Task #18647: Review 18600-copy-by-refResolvedTom Clegg02/11/2022Actions
Task #18827: Review 18600-update-filesResolvedTom Clegg02/11/2022Actions

Related issues

Related to Arvados - Feature #18871: WebDAV uses replace_files APINewActions
Blocks Arvados Workbench 2 - Bug #18587: "Copy selected into collection" incorrect behaviorNewActions
Actions #1

Updated by Peter Amstutz over 2 years ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz over 2 years ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz over 2 years ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz over 2 years ago

  • Description updated (diff)
Actions #5

Updated by Lucas Di Pentima over 2 years ago

  • Blocks Bug #18587: "Copy selected into collection" incorrect behavior added
Actions #6

Updated by Peter Amstutz over 2 years ago

  • Release set to 46
  • Target version set to 2022-02-02 sprint
Actions #7

Updated by Tom Clegg over 2 years ago

  • Assigned To set to Tom Clegg
Actions #8

Updated by Peter Amstutz about 2 years ago

  • Status changed from New to In Progress
Actions #9

Updated by Tom Clegg about 2 years ago

  • Target version changed from 2022-02-02 sprint to 2022-02-16 sprint
Actions #10

Updated by Tom Clegg about 2 years ago

  • Status changed from In Progress to New

18600-copy-by-ref @ 668e4f412e70817059fa093b4739c06c50705e79 -- developer-run-tests: #2909

Adds Snapshot and Splice functions to sdk/go/arvados. The idea is that the "modify collection" API described in this ticket will be implemented by calling SiteFileSystem to create a fuse-like filesystem, creating an empty collectionfs, doing a sequence of Snapshot(sitefs, "by_id/"+pdh), target.RemoveAll(...), and Splice(target, dstpath) operations to build a new collection, then calling MarshalManifest and saving that to the target collection.

Another way we could export this functionality is func Copy(dstfs FileSystem, dstpath string, srcfs FileSystem, srcpath string) error ... I'm not really sure which is more convenient/obvious. I think the Snapshot/Splice approach might be a bit easier to write/understand when there's a series of operations where the src and dst filesystem are the same, but I'm pretty sure any given sequence can be done either way.

Actions #11

Updated by Tom Clegg about 2 years ago

  • Status changed from New to In Progress
Actions #12

Updated by Lucas Di Pentima about 2 years ago

Just a couple of cosmetic comments:

  • File sdk/go/arvados/fs_base.go
    • Lines 84, 85: "It is an error to replace a directory with a file" -- may require correction.
    • Line 174: Repeated 'the' word on comment

Other than that, it LGTM.

Actions #13

Updated by Tom Clegg about 2 years ago

  • Target version changed from 2022-02-16 sprint to 2022-03-02 sprint
Actions #14

Updated by Tom Clegg about 2 years ago

18600-copy-by-ref @ ca0b6ff3d75f68b7a2f1821d605fadd49481038e

(just updates comments)

Actions #15

Updated by Lucas Di Pentima about 2 years ago

LGTM, thanks!

Actions #16

Updated by Peter Amstutz about 2 years ago

  • Target version changed from 2022-03-02 sprint to 2022-03-16 sprint
Actions #18

Updated by Tom Clegg about 2 years ago

18600-update-files @ f00d666fdf3f4322a6e153a9dc3a25069859e52d
retry workbench1: developer-run-tests-apps-workbench-integration: #3146
retry workbench1: developer-run-tests-apps-workbench-integration: #3148
  • Rather than a separate "modify" API, I added a virtual attribute "splices" that can be passed in a create or update request in place of manifest_text. This makes it possible to atomically create a new collection using data from existing collections.
  • Deleting a subtree (even the root) and then inserting new content below that point is supported, as an exception to the "cannot have a destination that's within another destination" rule proposed above. This makes it possible to atomically clear and rebuild an existing collection from scratch, rather than requiring the client to list every existing file/subdir to delete, which would be inconvenient and prone to unexpected behavior in races.

The attribute name "splices" alludes to the js/ruby/perl array splice operation ("delete some items and insert some new ones in their place") but perhaps there is a better name for it?

Actions #19

Updated by Lucas Di Pentima about 2 years ago

Some comments:

  • Regarding the name "splices", I find it confusing because the param is about a series of operations that should be done to the data the collection holds (as opposed to the rest of the params, which usually are about data or metadata).
    • As commented on chat: one idea would be to name it something like "manifest_ops" making it explicit that its contents are a list of operations on the collection's data.
    • Another idea would be converting the noun-based name "splices" to "splicings" making it more evident that its contents are actions, not data. Even then, it doesn't have the "data context" in its name, but I think it may be better than using a noun.
  • Do we need to update the discovery document to list this new param?
  • It seems that removing non-existent targets is currently a no-op; not sure if that's on purpose.
  • I think I've found one case that is not a correct behavior:
    • {"/": "SOME-VALID-PDH", "/": ""} → produces an empty collection
    • {"/foo": "SOME-VALID-PDH", "/": ""} → produces a collection with /foo containing SOME-VALID-PDH's contents.
Actions #20

Updated by Lucas Di Pentima about 2 years ago

Lucas Di Pentima wrote:

  • I think I've found one case that is not a correct behavior:
    • {"/": "SOME-VALID-PDH", "/": ""} → produces an empty collection
    • {"/foo": "SOME-VALID-PDH", "/": ""} → produces a collection with /foo containing SOME-VALID-PDH's contents.

I see now that in the 1st case, I used the same target with different sources. Perhaps we should avoid accepting the same target multiple times or document that the last operation is the one that would be the final result.
If accepting the same target multiple times is OK, it would be useful to avoid doing extra work for the [0..N-1] cases of the same target.

Actions #21

Updated by Lucas Di Pentima about 2 years ago

Agh, please forget about my last comment. My tests were being done with the PySDK, so obviously this happens:

>>> {"hello":"world", "hello":"people"}
{'hello': 'people'}

Sorry about that!

Actions #22

Updated by Tom Clegg about 2 years ago

How about replace_files instead of splices?
  • "replace" is a verb, and not a noun
  • hints that existing content at specified paths will be deleted (as opposed to "add" or "update" which might imply some kind of merge)
"collection": {
  "replace_files": {
    "/foo.txt": "",
    "/bar.txt": "fa7aeb5140e2848d39b416daeef4ffc5+45/foo.txt",
    "/subdir_to_delete": "" 
  }
}
Actions #23

Updated by Tom Clegg about 2 years ago

Re discovery doc, I think this would have to be added as a "Collection" property, which incorrectly implies that it's part of a collection response too. This might be a hint that we should expose it in the API as an update option beside the "new collection properties" object instead of inside it, more like this

PATCH /arvados/v1/collections/{uuid}

{
  "collection": {
    "name": "new_name" 
  },
  "replace_files": {
    "/foo.txt": "",
    "/bar.txt": "fa7aeb5140e2848d39b416daeef4ffc5+45/foo.txt",
    "/subdir_to_delete": "" 
  }
}

Thoughts?

Actions #24

Updated by Peter Amstutz about 2 years ago

Tom Clegg wrote:

Re discovery doc, I think this would have to be added as a "Collection" property, which incorrectly implies that it's part of a collection response too. This might be a hint that we should expose it in the API as an update option beside the "new collection properties" object instead of inside it, more like this

[...]

Thoughts?

I agree with this, it is an update parameter but not part of the collection object.

Actions #25

Updated by Lucas Di Pentima about 2 years ago

I think replace_files is a lot better than splices!
Re: note-23, I see what you mean, is it possible to do that with the current Google API Client support? (not remembering of any existing similar case) If so, I think it's a good idea to split request data that's about the object from actions being done to it.

Actions #26

Updated by Tom Clegg about 2 years ago

18600-update-files @ 5cb2775f3ee3d6fbae5deafc354bfb772d864a57 developer-run-tests: #2957
  • instead of a write-only collection attribute called "splices", we now have a create/update parameter called "replace_files"
  • add replace_files param to discovery doc
  • test replace_files from sdk/cli tests (this relies on the param being in the discovery doc)

Accepting "delete nonexistent path" as a no-op is intentional. Analogous to "rm -rf", it avoids forcing the client to spend extra effort on "check whether X exists before deleting it", and makes it possible to do things like "replace entire contents of dir/ with dir/file.txt" atomically. I think a "can't delete, doesn't exist" error would essentially be a race-detection feature for a certain type of race, which will be better addressed (along with a much wider range of races) by implementing If-Matches support.

Actions #27

Updated by Lucas Di Pentima about 2 years ago

This LGTM, thanks!

Actions #28

Updated by Tom Clegg about 2 years ago

merge main → 18600-update-files @ f7bf9d69603db2d500563648460e2a96524de266 --

Actions #30

Updated by Tom Clegg about 2 years ago

  • Status changed from In Progress to Resolved
Actions #31

Updated by Tom Clegg almost 2 years ago

Actions

Also available in: Atom PDF