Feature #5104

[SDKs] Add Ruby SDK methods for manipulating collection contents (rename, delete, copy file from another collection)

Added by Tom Clegg almost 6 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
SDKs
Target version:
Start date:
03/08/2015
Due date:
% Done:

100%

Estimated time:
(Total: 4.00 h)
Story points:
2.0

Subtasks

Task #5108: Implement internal representation for collections in Ruby SDK (analogous to Python SDK)Resolved

Task #5109: Add "import files from another collection" API to Ruby SDKResolved

Task #5107: Add collections#rename and collections#delete methods to Ruby SDKResolved

Task #5423: Review 5104-ruby-sdk-collections-wipResolvedBrett Smith


Related issues

Related to Arvados - Bug #4943: [Workbench] [Performance] Combining big collections should start returning a response faster (currently you can get a 502 proxy error even if the collection still combines)New01/08/2015

Related to Arvados - Bug #5096: [Workbench] Avoid passing manifest_text around the network too much when updating and merging large collections.Resolved02/14/2015

Blocks Arvados - Story #3821: [Workbench] Delete and rename files in collectionsResolved12/10/2014

Blocks Arvados - Bug #5179: [Workbench] [SDKs] Incorrect collection display when stream has entries with subdirectoriesResolved04/19/2015

Associated revisions

Revision 36f62415
Added by Brett Smith over 5 years ago

Merge branch '5104-ruby-sdk-collections-wip'

Closes #5104, #5423.

History

#1 Updated by Tom Clegg almost 6 years ago

  • Target version changed from Arvados Future Sprints to 2015-02-18 sprint

#2 Updated by Radhika Chippada almost 6 years ago

  • Assigned To set to Radhika Chippada

#3 Updated by Radhika Chippada almost 6 years ago

  • Assigned To deleted (Radhika Chippada)

#4 Updated by Tom Clegg almost 6 years ago

  • Subject changed from [API] Add actions for manipulating collection contents (rename, delete, copy file from another collection) to [SDKs] Add Ruby SDK methods for manipulating collection contents (rename, delete, copy file from another collection)

#5 Updated by Tom Clegg almost 6 years ago

  • Category set to SDKs
  • Assigned To set to Tom Clegg

#6 Updated by Tom Clegg almost 6 years ago

  • Assigned To changed from Tom Clegg to Brett Smith

#7 Updated by Tom Clegg almost 6 years ago

  • Target version changed from 2015-02-18 sprint to 2015-03-11 sprint

#8 Updated by Brett Smith almost 6 years ago

  • Assigned To deleted (Brett Smith)

#9 Updated by Brett Smith almost 6 years ago

  • Assigned To set to Brett Smith

#10 Updated by Brett Smith over 5 years ago

  • Status changed from New to In Progress

#11 Updated by Peter Amstutz over 5 years ago

Comments

For consistency with the Python SDK terminology, consider using the term "segments" instead of "ranges" to refer the segments of the stream tha makes up a file ("range" is more generic and used for a few different but related purposes.)

Add a comment that for StreamManifest to correctly generate a normalized stream, the files must be added in lexically sorted order.

The logic in StreamManifest::add is a little bit subtle and would benefit from some comments.

In StreamManifest::add, I believe a locator will get added twice if @locators has [A, B] and loc_a has [B, A] resulting in [A, B, A]

Both CollectionFile::add_range StreamManifest::add are performing a O(n) computations on the locator list, potentially multiple times per file. This could get expensive when processing streams with 100s of files and 1000s of locators.

I'd like to see some additional tests to ensure that files with multiple segments are preserved/normalized correctly:
  1. ". abc+5 efg+5 0:10:file1" (spanning blocks complete fit)
  2. ". abc+5 efg+5 0:7:file1" (spanning blocks fit beginning)
  3. ". abc+5 efg+5 3:7:file1" (spanning blocks fit end)
  4. ". abc+5 efg+5 3:5:file1" (spanning blocks middle)
  5. ". abc+5 efg+5 hij+5 0:15:file1" (spanning 3 blocks)
  6. ". abc+5 efg+5 0:3:file1 5:5:file1" (partial range replacement)
  7. ". abc+5 efg+5 0:3:file1 5:3:file1 3:2:file1" (insertion)
  8. ". abc+5 2:3:file1 2:3:file1" (duplication)

API notes

Suggest adding a modified? predicate method.

The behavior of the copy method is different from the Python SDK. In the Python SDK, if you copy a collection into another collection which already has a subcollection of the same name, it will either raise an exception or replace the entire subcollection with the new one. In the Ruby SDK, it will merge the contents of the subcollections and overwrite any conflicting files. We should decide which behavior we prefer and implement it consistently.

The Python SDK intentionally does not offer a "rename" method. We should decide if we want to keep it in the Ruby SDK, and if so it should implemented consistently.

The new Ruby Collection class lacks methods to actually iterate contents of the Collection.

#12 Updated by Brett Smith over 5 years ago

Thanks Peter. I'm writing up some of the in-office discussion about API considerations for Tom's benefit.

Peter Amstutz wrote:

API notes

The behavior of the copy method is different from the Python SDK. In the Python SDK, if you copy a collection into another collection which already has a subcollection of the same name, it will either raise an exception or replace the entire subcollection with the new one. In the Ruby SDK, it will merge the contents of the subcollections and overwrite any conflicting files. We should decide which behavior we prefer and implement it consistently.

I used cp -r as my model. I think its behavior can be described as, "Copy every file from the source to the corresponding target, creating subdirectories under target as necessary." Whereas the Python SDK is currently copies entire substreams over to target verbatim.

Another way to put it is, the Ruby SDK currently implements rsync -r, while the Python SDK implements rsync -r --delete.

I agree we should be consistent about this. I think both behaviors are useful. I think we should just hammer out method signatures to expose each, and then I think it would be relatively straightforward to get both in the Ruby SDK right away. And then we can bring the Python SDK in line next time we open it.

The Python SDK intentionally does not offer a "rename" method. We should decide if we want to keep it in the Ruby SDK, and if so it should implemented consistently.

Also agreed. Like I pointed out, I was following what story description I had (the subject). The method doesn't support renaming across collections, which I understand was the primary concern.

The new Ruby Collection class lacks methods to actually iterate contents of the Collection.

It's obvious why this would be useful, but it's a little less clear if our Rails servers want it today. Keep::Manifest provides a lower-level interface to parse manifests with better performance, and that seems to be meeting read-only use cases all right for now. It may make sense to have them continue iterating that way to help stay snappy. So, we'll definitely do this, but I wonder if it needs to be in this branch.

#13 Updated by Tom Clegg over 5 years ago

Brett Smith wrote:

Another way to put it is, the Ruby SDK currently implements rsync -r, while the Python SDK implements rsync -r --delete.

I agree we should be consistent about this. I think both behaviors are useful. I think we should just hammer out method signatures to expose each, and then I think it would be relatively straightforward to get both in the Ruby SDK right away. And then we can bring the Python SDK in line next time we open it.

IMO it is much more relevant to be consistent with the current language's stdlib than with some other language's Arvados library.

It seems to me we should offer
  • methods with names and behaviors corresponding directly (or as close as meaningful) to the language's stdlib
  • other methods, with non-conflicting names, to avoid incorrect assumptions "this must work just like [...] elsewhere in Ruby"

Ruby has FileUtils::copy_entry, which says dest must not exist. So, should "dest exists → exception" be our default behavior too, when copying a directory-like thing?

The Python SDK intentionally does not offer a "rename" method. We should decide if we want to keep it in the Ruby SDK, and if so it should implemented consistently.

Also agreed. Like I pointed out, I was following what story description I had (the subject). The method doesn't support renaming across collections, which I understand was the primary concern.

Not sure why Python SDK didn't get "rename". #4823 specified an API for renaming within a collection. I can't think why we wouldn't offer a rename() method that works within a collection...?

The new Ruby Collection class lacks methods to actually iterate contents of the Collection.

It's obvious why this would be useful, but it's a little less clear if our Rails servers want it today. Keep::Manifest provides a lower-level interface to parse manifests with better performance, and that seems to be meeting read-only use cases all right for now. It may make sense to have them continue iterating that way to help stay snappy. So, we'll definitely do this, but I wonder if it needs to be in this branch.

No, I don't think a lack of data-reading methods should block the current branch/story. (If there's a reason why it would, I'm not seeing it...)

#14 Updated by Brett Smith over 5 years ago

Tom Clegg wrote:

Brett Smith wrote:

Another way to put it is, the Ruby SDK currently implements rsync -r, while the Python SDK implements rsync -r --delete.

I agree we should be consistent about this. I think both behaviors are useful. I think we should just hammer out method signatures to expose each, and then I think it would be relatively straightforward to get both in the Ruby SDK right away. And then we can bring the Python SDK in line next time we open it.

IMO it is much more relevant to be consistent with the current language's stdlib than with some other language's Arvados library.

I like the principle that our SDKs should feel native to the language they're in, but the fact that manifests don't have strict POSIX semantics makes me wonder about the wisdom of carrying over names of POSIX methods or interfaces.

For example, in POSIX, you can't move a directory to become a child of itself. In manifests, this is no problem: you just rename the stream. FWIW, the current SDK supports this, and accordingly I chose to name the method rename rather than something like mv.

If you have strong opinions about method names, now would be a good time to voice them. Right now the Ruby SDK in branch has copy!, remove!, and rename!.

It seems to me we should offer
  • methods with names and behaviors corresponding directly (or as close as meaningful) to the language's stdlib
  • other methods, with non-conflicting names, to avoid incorrect assumptions "this must work just like [...] elsewhere in Ruby"

Ruby has FileUtils::copy_entry, which says dest must not exist. So, should "dest exists → exception" be our default behavior too, when copying a directory-like thing?

Ruby also has FileUtils::cp_r, which does not have that limit. So am I hearing this as a vote for "implement both, one called copy_entry and one called cp_r?"

#15 Updated by Tom Clegg over 5 years ago

Brett Smith wrote:

I like the principle that our SDKs should feel native to the language they're in, but the fact that manifests don't have strict POSIX semantics makes me wonder about the wisdom of carrying over names of POSIX methods or interfaces.

I think we just have to walk this line. Perhaps "the POSIX behaviors that could reasonably be expected to act the same here should act the same here." It seems that if you consider a collection as a snapshot of a filesystem, you can do lots of POSIX things (like atomic rename) plus some things that are painful or impossible in POSIX (like atomic copy).

For example, in POSIX, you can't move a directory to become a child of itself. In manifests, this is no problem: you just rename the stream. FWIW, the current SDK supports this, and accordingly I chose to name the method rename rather than something like mv.

Stuff that would fail in POSIX seems less worrisome than stuff that would do something different in POSIX. (FWIW the POSIX operation is rename -- mv is the utility program which sometimes calls rename and sometimes does create+copy+unlink -- but I'm still happy with rename.)

If you have strong opinions about method names, now would be a good time to voice them. Right now the Ruby SDK in branch has copy!, remove!, and rename!.

AFAICT, the bang convention (the dominant one) is that copy! should be used if copy is "dangerous", e.g., mutates the object (check) and there is a "safer" non-bang form.

This seems to suggest that we should call it copy! only if we also offer a b = a.copy(src,dest) form that doesn't mutate a, or if copy! is like copy but raises an exception in a situation where copy would return nil/false/...

Ruby also has FileUtils::cp_r, which does not have that limit. So am I hearing this as a vote for "implement both, one called copy_entry and one called cp_r?"

Indeed. Yes, I vote for using those names on our analogous features. I don't care too much whether we offer both variants right away (or, if not, which one we do first). Neither is perfect: cp_r copies src to dest/src, which means you need an explicit rename() if you're trying to make a copy of foo called foo.orig. cp_r doesn't seem to say whether it merges or replaces subdirs (or throws an error), but its name suggests that it merges. Maybe. copy_entry definitely doesn't merge, which could make it less convenient but at least more predictable.

#16 Updated by Brett Smith over 5 years ago

Tom Clegg wrote:

Brett Smith wrote:

Ruby also has FileUtils::cp_r, which does not have that limit. So am I hearing this as a vote for "implement both, one called copy_entry and one called cp_r?"

Indeed. Yes, I vote for using those names on our analogous features. I don't care too much whether we offer both variants right away (or, if not, which one we do first). Neither is perfect: cp_r copies src to dest/src, which means you need an explicit rename() if you're trying to make a copy of foo called foo.orig. cp_r doesn't seem to say whether it merges or replaces subdirs (or throws an error), but its name suggests that it merges. Maybe. copy_entry definitely doesn't merge, which could make it less convenient but at least more predictable.

Testing reveals that, despite the docs, copy_entry acts like you think cp_r does. Since you don't feel pressed to settle on which method must be there, I have implemented only cp_r. remove! has been split into rm and rm_r. rename! is just rename now.

I have implemented all the rest of the feedback. All the comments about traversing and duplicating Locators have been addressed with the introduction of the LocatorList class, which can return segments of itself from file specs, and then generate new file specs that avoid duplicating locators. Now at dad9a13.

#17 Updated by Peter Amstutz over 5 years ago

LGTM with import_manifest made private per discussion on IRC.

And a test for dst_coll.cp_r(".", ".", src_coll)

#18 Updated by Brett Smith over 5 years ago

Peter Amstutz wrote:

LGTM with import_manifest made private per discussion on IRC.

And a test for dst_coll.cp_r(".", ".", src_coll)

This ended up turning into a much bigger project than anticipated. Like I said, import_manifest was the intended way to resolve #5109, even though I realize the concatenation semantics often aren't what's desired.

But dst_coll.cp_r(".", ".", src_coll) doesn't have the intended effect either, because src_coll's root ends up at ./. in dst_coll. The Ruby docs for cp_r say that the way to work around this is to say cp_r("src_dir/.", "dst_dir")—but we can't do that either, for the same reason that . is a legal stream name.

Instead I've done the next closest thing: if source ends with /, it copies the contents of that stream into target, rather than the stream itself. rsync has the same semantics, so that's the mnemonic I'm going for here. I think it's also the most obvious option when you consider what Ruby wants and what manifests allow… although I'll concede that "obvious" may not be the best word for it.

To support this, I ended up making a CollectionRoot class, whose implementation is slightly goofy, but it's better than duplicating a bunch of special-case root-handling logic directly in Collection methods as I had been doing. Following our discussion, I also realized that LocatorList was two classes in one, so I split it out: now it just handles generating segments while parsing a raw manifest, and the manifest generation logic moved into StreamManifest.

Now at cfb2585. Thanks.

#19 Updated by Brett Smith over 5 years ago

  • Target version changed from 2015-03-11 sprint to 2015-04-01 sprint

#20 Updated by Peter Amstutz over 5 years ago

cfb2585 LGTM

#21 Updated by Brett Smith over 5 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:36f6241516d81c726bb7439650cf6ec56e6d6525.

Also available in: Atom PDF