Feature #5104
closed[SDKs] Add Ruby SDK methods for manipulating collection contents (rename, delete, copy file from another collection)
Related issues
Updated by Tom Clegg almost 10 years ago
- Target version changed from Arvados Future Sprints to 2015-02-18 sprint
Updated by Radhika Chippada almost 10 years ago
- Assigned To set to Radhika Chippada
Updated by Radhika Chippada almost 10 years ago
- Assigned To deleted (
Radhika Chippada)
Updated by Tom Clegg almost 10 years ago
- Subject changed from [API] Add actions for manipulating collection contents (rename, delete, copy file from another collection) to [SDKs] Add Ruby SDK methods for manipulating collection contents (rename, delete, copy file from another collection)
Updated by Tom Clegg almost 10 years ago
- Category set to SDKs
- Assigned To set to Tom Clegg
Updated by Tom Clegg almost 10 years ago
- Assigned To changed from Tom Clegg to Brett Smith
Updated by Tom Clegg almost 10 years ago
- Target version changed from 2015-02-18 sprint to 2015-03-11 sprint
Updated by Peter Amstutz over 9 years ago
Comments¶
For consistency with the Python SDK terminology, consider using the term "segments" instead of "ranges" to refer the segments of the stream tha makes up a file ("range" is more generic and used for a few different but related purposes.)
Add a comment that for StreamManifest to correctly generate a normalized stream, the files must be added in lexically sorted order.
The logic in StreamManifest::add
is a little bit subtle and would benefit from some comments.
In StreamManifest::add
, I believe a locator will get added twice if @locators
has [A, B]
and loc_a
has [B, A]
resulting in [A, B, A]
Both CollectionFile::add_range
StreamManifest::add
are performing a O(n) computations on the locator list, potentially multiple times per file. This could get expensive when processing streams with 100s of files and 1000s of locators.
- ". abc+5 efg+5 0:10:file1" (spanning blocks complete fit)
- ". abc+5 efg+5 0:7:file1" (spanning blocks fit beginning)
- ". abc+5 efg+5 3:7:file1" (spanning blocks fit end)
- ". abc+5 efg+5 3:5:file1" (spanning blocks middle)
- ". abc+5 efg+5 hij+5 0:15:file1" (spanning 3 blocks)
- ". abc+5 efg+5 0:3:file1 5:5:file1" (partial range replacement)
- ". abc+5 efg+5 0:3:file1 5:3:file1 3:2:file1" (insertion)
- ". abc+5 2:3:file1 2:3:file1" (duplication)
API notes¶
Suggest adding a modified?
predicate method.
The behavior of the copy
method is different from the Python SDK. In the Python SDK, if you copy a collection into another collection which already has a subcollection of the same name, it will either raise an exception or replace the entire subcollection with the new one. In the Ruby SDK, it will merge the contents of the subcollections and overwrite any conflicting files. We should decide which behavior we prefer and implement it consistently.
The Python SDK intentionally does not offer a "rename" method. We should decide if we want to keep it in the Ruby SDK, and if so it should implemented consistently.
The new Ruby Collection
class lacks methods to actually iterate contents of the Collection.
Updated by Brett Smith over 9 years ago
Thanks Peter. I'm writing up some of the in-office discussion about API considerations for Tom's benefit.
Peter Amstutz wrote:
API notes¶
The behavior of the
copy
method is different from the Python SDK. In the Python SDK, if you copy a collection into another collection which already has a subcollection of the same name, it will either raise an exception or replace the entire subcollection with the new one. In the Ruby SDK, it will merge the contents of the subcollections and overwrite any conflicting files. We should decide which behavior we prefer and implement it consistently.
I used cp -r
as my model. I think its behavior can be described as, "Copy every file from the source to the corresponding target, creating subdirectories under target as necessary." Whereas the Python SDK is currently copies entire substreams over to target verbatim.
Another way to put it is, the Ruby SDK currently implements rsync -r
, while the Python SDK implements rsync -r --delete
.
I agree we should be consistent about this. I think both behaviors are useful. I think we should just hammer out method signatures to expose each, and then I think it would be relatively straightforward to get both in the Ruby SDK right away. And then we can bring the Python SDK in line next time we open it.
The Python SDK intentionally does not offer a "rename" method. We should decide if we want to keep it in the Ruby SDK, and if so it should implemented consistently.
Also agreed. Like I pointed out, I was following what story description I had (the subject). The method doesn't support renaming across collections, which I understand was the primary concern.
The new Ruby
Collection
class lacks methods to actually iterate contents of the Collection.
It's obvious why this would be useful, but it's a little less clear if our Rails servers want it today. Keep::Manifest provides a lower-level interface to parse manifests with better performance, and that seems to be meeting read-only use cases all right for now. It may make sense to have them continue iterating that way to help stay snappy. So, we'll definitely do this, but I wonder if it needs to be in this branch.
Updated by Tom Clegg over 9 years ago
Brett Smith wrote:
Another way to put it is, the Ruby SDK currently implements
rsync -r
, while the Python SDK implementsrsync -r --delete
.I agree we should be consistent about this. I think both behaviors are useful. I think we should just hammer out method signatures to expose each, and then I think it would be relatively straightforward to get both in the Ruby SDK right away. And then we can bring the Python SDK in line next time we open it.
IMO it is much more relevant to be consistent with the current language's stdlib than with some other language's Arvados library.
It seems to me we should offer- methods with names and behaviors corresponding directly (or as close as meaningful) to the language's stdlib
- other methods, with non-conflicting names, to avoid incorrect assumptions "this must work just like [...] elsewhere in Ruby"
Ruby has FileUtils::copy_entry
, which says dest must not exist. So, should "dest exists → exception" be our default behavior too, when copying a directory-like thing?
The Python SDK intentionally does not offer a "rename" method. We should decide if we want to keep it in the Ruby SDK, and if so it should implemented consistently.
Also agreed. Like I pointed out, I was following what story description I had (the subject). The method doesn't support renaming across collections, which I understand was the primary concern.
Not sure why Python SDK didn't get "rename". #4823 specified an API for renaming within a collection. I can't think why we wouldn't offer a rename() method that works within a collection...?
The new Ruby
Collection
class lacks methods to actually iterate contents of the Collection.It's obvious why this would be useful, but it's a little less clear if our Rails servers want it today. Keep::Manifest provides a lower-level interface to parse manifests with better performance, and that seems to be meeting read-only use cases all right for now. It may make sense to have them continue iterating that way to help stay snappy. So, we'll definitely do this, but I wonder if it needs to be in this branch.
No, I don't think a lack of data-reading methods should block the current branch/story. (If there's a reason why it would, I'm not seeing it...)
Updated by Brett Smith over 9 years ago
Tom Clegg wrote:
Brett Smith wrote:
Another way to put it is, the Ruby SDK currently implements
rsync -r
, while the Python SDK implementsrsync -r --delete
.I agree we should be consistent about this. I think both behaviors are useful. I think we should just hammer out method signatures to expose each, and then I think it would be relatively straightforward to get both in the Ruby SDK right away. And then we can bring the Python SDK in line next time we open it.
IMO it is much more relevant to be consistent with the current language's stdlib than with some other language's Arvados library.
I like the principle that our SDKs should feel native to the language they're in, but the fact that manifests don't have strict POSIX semantics makes me wonder about the wisdom of carrying over names of POSIX methods or interfaces.
For example, in POSIX, you can't move a directory to become a child of itself. In manifests, this is no problem: you just rename the stream. FWIW, the current SDK supports this, and accordingly I chose to name the method rename
rather than something like mv
.
If you have strong opinions about method names, now would be a good time to voice them. Right now the Ruby SDK in branch has copy!
, remove!
, and rename!
.
It seems to me we should offer
- methods with names and behaviors corresponding directly (or as close as meaningful) to the language's stdlib
- other methods, with non-conflicting names, to avoid incorrect assumptions "this must work just like [...] elsewhere in Ruby"
Ruby has
FileUtils::copy_entry
, which says dest must not exist. So, should "dest exists → exception" be our default behavior too, when copying a directory-like thing?
Ruby also has FileUtils::cp_r
, which does not have that limit. So am I hearing this as a vote for "implement both, one called copy_entry
and one called cp_r
?"
Updated by Tom Clegg over 9 years ago
Brett Smith wrote:
I like the principle that our SDKs should feel native to the language they're in, but the fact that manifests don't have strict POSIX semantics makes me wonder about the wisdom of carrying over names of POSIX methods or interfaces.
I think we just have to walk this line. Perhaps "the POSIX behaviors that could reasonably be expected to act the same here should act the same here." It seems that if you consider a collection as a snapshot of a filesystem, you can do lots of POSIX things (like atomic rename) plus some things that are painful or impossible in POSIX (like atomic copy).
For example, in POSIX, you can't move a directory to become a child of itself. In manifests, this is no problem: you just rename the stream. FWIW, the current SDK supports this, and accordingly I chose to name the method
rename
rather than something likemv
.
Stuff that would fail in POSIX seems less worrisome than stuff that would do something different in POSIX. (FWIW the POSIX operation is rename -- mv is the utility program which sometimes calls rename and sometimes does create+copy+unlink -- but I'm still happy with rename.)
If you have strong opinions about method names, now would be a good time to voice them. Right now the Ruby SDK in branch has
copy!
,remove!
, andrename!
.
AFAICT, the bang convention (the dominant one) is that copy!
should be used if copy is "dangerous", e.g., mutates the object (check) and there is a "safer" non-bang form.
This seems to suggest that we should call it copy!
only if we also offer a b = a.copy(src,dest)
form that doesn't mutate a
, or if copy!
is like copy
but raises an exception in a situation where copy
would return nil/false/...
Ruby also has
FileUtils::cp_r
, which does not have that limit. So am I hearing this as a vote for "implement both, one calledcopy_entry
and one calledcp_r
?"
Indeed. Yes, I vote for using those names on our analogous features. I don't care too much whether we offer both variants right away (or, if not, which one we do first). Neither is perfect: cp_r copies src to dest/src, which means you need an explicit rename()
if you're trying to make a copy of foo called foo.orig. cp_r doesn't seem to say whether it merges or replaces subdirs (or throws an error), but its name suggests that it merges. Maybe. copy_entry definitely doesn't merge, which could make it less convenient but at least more predictable.
Updated by Brett Smith over 9 years ago
Tom Clegg wrote:
Brett Smith wrote:
Ruby also has
FileUtils::cp_r
, which does not have that limit. So am I hearing this as a vote for "implement both, one calledcopy_entry
and one calledcp_r
?"Indeed. Yes, I vote for using those names on our analogous features. I don't care too much whether we offer both variants right away (or, if not, which one we do first). Neither is perfect: cp_r copies src to dest/src, which means you need an explicit
rename()
if you're trying to make a copy of foo called foo.orig. cp_r doesn't seem to say whether it merges or replaces subdirs (or throws an error), but its name suggests that it merges. Maybe. copy_entry definitely doesn't merge, which could make it less convenient but at least more predictable.
Testing reveals that, despite the docs, copy_entry
acts like you think cp_r
does. Since you don't feel pressed to settle on which method must be there, I have implemented only cp_r
. remove!
has been split into rm
and rm_r
. rename!
is just rename
now.
I have implemented all the rest of the feedback. All the comments about traversing and duplicating Locators have been addressed with the introduction of the LocatorList class, which can return segments of itself from file specs, and then generate new file specs that avoid duplicating locators. Now at dad9a13.
Updated by Peter Amstutz over 9 years ago
LGTM with import_manifest made private per discussion on IRC.
And a test for dst_coll.cp_r(".", ".", src_coll)
Updated by Brett Smith over 9 years ago
Peter Amstutz wrote:
LGTM with import_manifest made private per discussion on IRC.
And a test for
dst_coll.cp_r(".", ".", src_coll)
This ended up turning into a much bigger project than anticipated. Like I said, import_manifest
was the intended way to resolve #5109, even though I realize the concatenation semantics often aren't what's desired.
But dst_coll.cp_r(".", ".", src_coll)
doesn't have the intended effect either, because src_coll's root ends up at ./.
in dst_coll. The Ruby docs for cp_r
say that the way to work around this is to say cp_r("src_dir/.", "dst_dir")
—but we can't do that either, for the same reason that .
is a legal stream name.
Instead I've done the next closest thing: if source
ends with /
, it copies the contents of that stream into target
, rather than the stream itself. rsync has the same semantics, so that's the mnemonic I'm going for here. I think it's also the most obvious option when you consider what Ruby wants and what manifests allow… although I'll concede that "obvious" may not be the best word for it.
To support this, I ended up making a CollectionRoot class, whose implementation is slightly goofy, but it's better than duplicating a bunch of special-case root-handling logic directly in Collection methods as I had been doing. Following our discussion, I also realized that LocatorList was two classes in one, so I split it out: now it just handles generating segments while parsing a raw manifest, and the manifest generation logic moved into StreamManifest.
Now at cfb2585. Thanks.
Updated by Brett Smith over 9 years ago
- Target version changed from 2015-03-11 sprint to 2015-04-01 sprint
Updated by Brett Smith over 9 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:36f6241516d81c726bb7439650cf6ec56e6d6525.