Feature #4823

[SDKs] Good Collection API for Python SDK

Added by Tim Pierce about 4 years ago. Updated over 2 years ago.

Status:
In Progress
Priority:
Normal
Assigned To:
-
Category:
SDKs
Target version:
Start date:
12/17/2014
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

Goals

More enjoyable API for Python programmers to use. Something like:
  • c=Collection(...)
    with c.open('foo.txt', 'w') as f:
      f.write('foo')
    with c.open('foo.txt', 'a') as f:
      f.write('bar')
    c.rename('foo.txt', 'baz/foobar.txt')
    with c.open('baz/foobar.txt', 'r') as f:
      print f.read()
    c.save()
    
Serialize/unserialize (manifest) code all in once place.
  • Abstract away the "manifest" encoding as much as possible to pave the way for upgrading/replacing it (say, with a richer JSON format).
  • Only one version of tokenizing/regexp parsing, string concatenation, making sure zero-length streams have a zero-length block locator, stuff like that.
In-memory data structure suitable for mutable collections.
  • Accommodate use of "data buffer" blocks for data not yet written to Keep.
  • Simplify file operations by using a distinct piece of memory for each file. (Modifying a stream in order to modify a file, without disrupting other files in the stream, is painful!)
  • See #4837

Collection interface

Collection()
  • Create a new empty collection.
Collection(uuid)
  • Retrieve the given collection from the API server.
Collection(manifest_text)
  • Create a new collection with the given content.
modified()
  • Return True if the collection has been modified since it was last retrieved or saved to the API server, otherwise False.
manifest_text()
  • Return the "manifest" string representation of this collection. This implicitly commits all buffered data to disk.
portable_manifest_text()
  • Return the "portable manifest" string representation of this collection used to compute portable_data_hash -- i.e., the manifest with the non-portable parts (like Keep permission signatures) removed. This can always be done without flushing any data to disk.
portable_data_hash()
  • Return the portable_data_hash that would be accepted/assigned by the API server if the collection were save()d right now.
listdir(path)
  • Return a list containing the names of the entries in the subcollection given by path.
walk(path, topdown=True, onerror=None)
  • (As close as possible to os.walk().) Generate the file names in a directory tree. For each subcollection (below and including path, where '.' is the whole collection) yield a 3-tuple (dirpath, dirnames, filenames).
remove(path)
  • Remove the file or subcollection named path.
unlink(path)
  • Alias for remove.
rename(old, new)
  • Rename a file from old to new.
rename(old, new, dest_collection)
  • Move a file old (in this collection) to new (in a different collection).
  • Rejected. Use copy + remove instead.
copy(old, new)
  • Create a new file new with the same content old has right now.
copy(source_path, dest_path, source_collection=None, overwrite=False)
  • Create a new file dest_path, with the same content the file at source_path has right now. If source_collection is not None, source_path refers to a file in source_collection, which might not be self. If dest_path already exists and overwrite is False, raise an exception.
open(filename, mode)
  • Semantics as close as practicable to open(). Return an object with (some subset of) the Python "file" interface.
glob(globpattern)
  • Returns an iterator that yields successive files that match globpattern from the collection.
  • Rejected. Use [f for f in fnmatch.filter(c.listdir(path), '*.o')] instead.

Subtasks

Task #4893: Review 4823-python-sdk-writable-collection-apiResolvedPeter Amstutz

Task #5103: Implement live sync, mergeResolvedPeter Amstutz

Task #4837: [SDKs] Define API and in-memory data structure for collections in Python SDKResolvedPeter Amstutz


Related issues

Related to Arvados - Story #3198: [FUSE] Writable streaming arv-mountResolved2015-04-14

Related to Arvados - Story #4930: [FUSE] Design: specify behavior for writable arv-mountResolved2015-01-28

Associated revisions

Revision bb19e060
Added by Peter Amstutz almost 4 years ago

Merge branch '4823-python-sdk-writable-collection-api' closes #4823

Revision 74b568d3 (diff)
Added by Peter Amstutz almost 4 years ago

Update arvados-fuse dependency on python sdk refs #4823

Revision a380fd23 (diff)
Added by Peter Amstutz almost 4 years ago

Fix arv-mount use arvados.config.settings() to initialize ThreadSafeApiCache
refs #4823

Revision 2365c0eb (diff)
Added by Peter Amstutz almost 4 years ago

Fixed SafeApi -> ThreadSafeApiCache refs #4823

Revision 20f3649b (diff)
Added by Peter Amstutz almost 4 years ago

Fix arv-normalize regression. refs #5145 refs #4823

History

#1 Updated by Peter Amstutz about 4 years ago

  • For create_file, by "blocks" do you mean "stream ranges" or "keep blocks"?
  • For the implementation, I suggest implementing each command the simplest way possible (for example, copy_file uses StreamFileReader.as_manifest(), update the stream/file name, then append to the new manifest) and then re-normalize to clean up duplicated streams, drop unnecessary blocks, etc.
  • Need to specify semantics for whether rename_file and copy_file should replace the target file if already present, or raise an error.

#2 Updated by Tom Clegg about 4 years ago

Initial reactions:
  • I'd rather see an API with names/behaviors as close as possible to the standard Python library, e.g., unlink() and remove() instead of delete_file().
  • I think these methods should be Collection methods, not Manifest methods: they seem to be concerned with collection content, rather than the manifest encoding.
  • create_file() looks like a pretty inconvenient API standing on its own. Could you add some context about the workflow where this is intended to be used?
  • dirty() looks like a generic SDK feature that isn't specific to collections at all. It raises the point that (at least IMO) the SDK would be much more enjoyable if "collection object useful for communicating with API server" and "collection object useful for reading and writing data" were the same thing, so you could do stuff like this:
    • collection = arvados.api().collections().create() # or collection = arvados.Collection()?
      with collection.open('foo.txt', 'w') as f:
        f.write('foo')
      collection.save()
      print collection['uuid']
      print collection['portable_data_hash']
      
Logistics:
  • This seems likely to overlap quite a bit with #3198, which will have to implement many of the same features.

#3 Updated by Tim Pierce about 4 years ago

To clarify, this task is explicitly intended to support Peter's work on #3198. The overlap is intentional. We are proposing that I write SDK support to allow his writable FUSE driver to skip assembling its own manifests by concatenating strings together.

Tom Clegg wrote:

Initial reactions:
  • I'd rather see an API with names/behaviors as close as possible to the standard Python library, e.g., unlink() and remove() instead of delete_file().

I'm kind of ambivalent. We discussed using names that mirror the Unix system calls or fileutils names. The argument against is that we can't guarantee actual POSIX semantics and so probably shouldn't pretend that we can, but it's not a terribly strong argument either way.

  • I think these methods should be Collection methods, not Manifest methods: they seem to be concerned with collection content, rather than the manifest encoding.

I guess? They're concerned with building and manipulating the manifest that Arvados uses to find content for the collection. Where exactly does the distinction lie?

  • create_file() looks like a pretty inconvenient API standing on its own. Could you add some context about the workflow where this is intended to be used?

The intent -- in my head, anyway -- is that a tool that's building a collection can stream bytes into Keep, and once it's streamed all of the data for a given collection file, can add an entry for that file to the current collection by calling Manifest.create_file() with the list of blocks that it's just generated.

This particular call may not really be useful to what Peter wants to do right now. What I really want is a SDK that makes it unnecessary for any code, anywhere, ever, to do something like this:

manifest = manifest + ". {} {}\n".format(" ".join(block_list), filename)

#4 Updated by Tom Clegg about 4 years ago

Tim Pierce wrote:

To clarify, this task is explicitly intended to support Peter's work on #3198. The overlap is intentional. We are proposing that I write SDK support to allow his writable FUSE driver to skip assembling its own manifests by concatenating strings together.

OK. I just wanted to point out the likelihood that #3198 will either block on this or implement by itself the same stuff that's implemented here.

  • I'd rather see an API with names/behaviors as close as possible to the standard Python library, e.g., unlink() and remove() instead of delete_file().

I'm kind of ambivalent. We discussed using names that mirror the Unix system calls or fileutils names. The argument against is that we can't guarantee actual POSIX semantics and so probably shouldn't pretend that we can, but it's not a terribly strong argument either way.

That doesn't sound to me like a compelling reason to invent new names.
  • I'm guessing people will expect POSIX semantics just as much regardless of whether we say "rename_file" or "rename". I think the fact they're using collection.whatever(), not os.whatever(), will have to be enough of a reminder. (Aside: what are the expected non-POSIX semantics?)
  • Presumably these operations' POSIX compliance will be exactly as good as using os.rename() etc in a Keep fuse mount.
  • We just recently went to a bunch of trouble to rearrange our original CollectionReader/Writer APIs. Let's avoid making the same mistake/fix/backward-compatibility-compromise again!
  • I think these methods should be Collection methods, not Manifest methods: they seem to be concerned with collection content, rather than the manifest encoding.

I guess? They're concerned with building and manipulating the manifest that Arvados uses to find content for the collection. Where exactly does the distinction lie?

A manifest is a lump of text. A collection is a set of files: it has filenames, data, a uuid, a name, etc.

The implementation behind the proposed API involves manipulating a manifest, but the functionality being exposed to the caller is reading/modifying the state of a collection. Why/when would we ask the caller to (or why/when would the caller want to) keep track of the manifest separately from the collection? I think of the SDK's job in this case as abstracting away the implementation details like "a collection's data block locators are listed in a lump of text we call a manifest" and present an interface that looks native to the language, like a directory tree of files that can be read, written, renamed, and deleted.

Another way to think of this is: If we threw out the "manifest" idea and did something different like store the block locators in one array and the filenames in another array, would it still make sense for a caller to use these same APIs and expect the same results? Or does the caller really care that these operations affect the manifest_text attribute specifically?

## a way

c = arvados.api().collections().get(...).execute()
m = arvados.Manifest(c['manifest_text'])
m.rename(...)
c['manifest_text'] = str(m)
c.save()

## another way

c = arvados.api().collections().get(...).execute()
c.rename(...)
c.save()
  • create_file() looks like a pretty inconvenient API standing on its own. Could you add some context about the workflow where this is intended to be used?

The intent -- in my head, anyway -- is that a tool that's building a collection can stream bytes into Keep, and once it's streamed all of the data for a given collection file, can add an entry for that file to the current collection by calling Manifest.create_file() with the list of blocks that it's just generated.

This particular call may not really be useful to what Peter wants to do right now. What I really want is a SDK that makes it unnecessary for any code, anywhere, ever, to do something like this:

manifest = manifest + ". {} {}\n".format(" ".join(block_list), filename)

Now that you put it that way, that amount of boilerplate seems minor compared to the "stream bytes into Keep" code someone would have to write to come up with that block_list in the first place. Given that CollectionWriter does both of these things (stream the bytes to Keep and make the manifest), perhaps the question is: what is it about the caller's situation that makes CollectionWriter insufficient? What makes the caller want to handle that array of blocks? Perhaps collectionB = collectionA + arvados.Collection(localfile) would be a nicer way to append to an existing collection -- but it's hard to pick the best solution without knowing the problem!

Another point about this particular API is that it only covers the degenerate case where each block contains data from exactly one file: it can't achieve the expected level of storage efficiency for multiple <<64MiB files.

#5 Updated by Tim Pierce about 4 years ago

Tom Clegg wrote:

Tim Pierce wrote:

To clarify, this task is explicitly intended to support Peter's work on #3198. The overlap is intentional. We are proposing that I write SDK support to allow his writable FUSE driver to skip assembling its own manifests by concatenating strings together.

OK. I just wanted to point out the likelihood that #3198 will either block on this or implement by itself the same stuff that's implemented here.

Yes, we're doing this work in parallel to avoid blocking as much as possible. There's enough work on #3198 that doesn't block on this that we expect we can effectively work on these tickets simultaneously.

That doesn't sound to me like a compelling reason to invent new names.
  • I'm guessing people will expect POSIX semantics just as much regardless of whether we say "rename_file" or "rename". I think the fact they're using collection.whatever(), not os.whatever(), will have to be enough of a reminder. (Aside: what are the expected non-POSIX semantics?)
  • Presumably these operations' POSIX compliance will be exactly as good as using os.rename() etc in a Keep fuse mount.
  • We just recently went to a bunch of trouble to rearrange our original CollectionReader/Writer APIs. Let's avoid making the same mistake/fix/backward-compatibility-compromise again!

I'm fine with following the existing Python file API.

Examples of POSIX semantics that we can't replicate in collections include hardlinks and the "update" mode to opening files (once a block has been written to Keep and added to the collection, it will be a lot harder to accommodate a client which wants to seek backward in the file and rewrite those bytes).

This particular call may not really be useful to what Peter wants to do right now. What I really want is a SDK that makes it unnecessary for any code, anywhere, ever, to do something like this:

manifest = manifest + ". {} {}\n".format(" ".join(block_list), filename)

Now that you put it that way, that amount of boilerplate seems minor compared to the "stream bytes into Keep" code someone would have to write to come up with that block_list in the first place. Given that CollectionWriter does both of these things (stream the bytes to Keep and make the manifest), perhaps the question is: what is it about the caller's situation that makes CollectionWriter insufficient? What makes the caller want to handle that array of blocks? Perhaps collectionB = collectionA + arvados.Collection(localfile) would be a nicer way to append to an existing collection -- but it's hard to pick the best solution without knowing the problem!

Another point about this particular API is that it only covers the degenerate case where each block contains data from exactly one file: it can't achieve the expected level of storage efficiency for multiple <<64MiB files.

There are two things that I think are important about providing an abstraction over manifest handling.

One, to separate the manifest structure from its representation. This is important even if this API level remains entirely internal to the Arvados SDK and is never exposed to clients. CollectionReader should never have to parse a manifest text lump by applying regexes to it. CollectionWriter should never build a manifest by concatenating strings. Leaving this level of manifest handling up to individual methods makes it easier for methods to introduce subtle formatting bugs that are only triggered by certain other methods in edge cases. Providing an internal API for handling manifests makes it easier for us to ensure internal consistency for the Arvados SDK itself.

Two, there really are cases where clients need to be able to ensure control over the portable data hash of the collection, and therefore, over the manifest itself. arv-copy is one recent example of a tool which needs this level of control. It has to construct its own manifest and cannot trust CollectionWriter to do it without disrupting the collection's portable data hash.

In most cases, it will be entirely sufficient to use high-level file and stream operations for collections, and the client won't care about these details. It makes sense to me to provide a multilevel API: CollectionReader and CollectionWriter for most user operations, and a manifest SDK that deals with the nitty gritty of parsing and serializing manifest_text blobs.

I get the sense that you think this is not only a low priority, but maybe even actively a bad idea, and I'm really puzzled at that, because I don't think this API is even optional, let alone a low priority. Should we chalk out some time to hangout and figure out whether we're actually talking about the same things?

#6 Updated by Tim Pierce about 4 years ago

  • Description updated (diff)

#7 Updated by Tom Clegg about 4 years ago

Tim Pierce wrote:

Examples of POSIX semantics that we can't replicate in collections include hardlinks and the "update" mode to opening files (once a block has been written to Keep and added to the collection, it will be a lot harder to accommodate a client which wants to seek backward in the file and rewrite those bytes).

OK. I think it's valuable to use the same names for the methods we do offer, even if we don't offer all of the os methods (e.g., link()). (Is there a case where the lack of hardlink support would prevent POSIX semantics for the features we're contemplating here? If there are cases like that (hardlinks or otherwise) I think it would be sensible to follow Python's approach to NTFS, i.e., raise an exception when the specified POSIX behavior is impossible.)

As for "update" mode, I sure hope the FUSE mount will support this, and I hope it's implemented somewhere in the Python SDK so other Python programs can use it too. (It would be a shame to require a Python client to go through a FUSE mount to make use of some functionality that's written in Python!)

There are two things that I think are important about providing an abstraction over manifest handling.

One, to separate the manifest structure from its representation. This is important even if this API level remains entirely internal to the Arvados SDK and is never exposed to clients. CollectionReader should never have to parse a manifest text lump by applying regexes to it. CollectionWriter should never build a manifest by concatenating strings. Leaving this level of manifest handling up to individual methods makes it easier for methods to introduce subtle formatting bugs that are only triggered by certain other methods in edge cases. Providing an internal API for handling manifests makes it easier for us to ensure internal consistency for the Arvados SDK itself.

I see. So this part is about extracting the string-manipulation parts from CollectionWriter into a library of well-defined operations? That makes a bit more sense. (Note CollectionReader already uses a lower-level API, namely StreamReader, for this.)

Two, there really are cases where clients need to be able to ensure control over the portable data hash of the collection, and therefore, over the manifest itself. arv-copy is one recent example of a tool which needs this level of control. It has to construct its own manifest and cannot trust CollectionWriter to do it without disrupting the collection's portable data hash.

The kind of manipulation needed by arv-copy is "change the signatures, but don't change anything else", right? Perhaps this section of arv-copy would be cleaner if it used a Python SDK feature analogous to apiserver's Collection.munge_manifest_locators! ...?

I notice create_file() proposed here is just as useless as CollectionWriter for arv-copy's purposes. In fact, I think you could say about any "add stuff to the manifest" API what you say here about CollectionWriter -- unless "stuff" is an entire stream/line. (This is why I think it's helpful to lay out specific examples of who will be using the proposed APIs before we try too hard to decide whether they're any good.)

Are there other examples of tools that need greater control over the manifest?

In most cases, it will be entirely sufficient to use high-level file and stream operations for collections, and the client won't care about these details. It makes sense to me to provide a multilevel API: CollectionReader and CollectionWriter for most user operations, and a manifest SDK that deals with the nitty gritty of parsing and serializing manifest_text blobs.

To me, the way to improve the external API is to offer a single read/write Collection class that is structurally capable of handling all normal file operations (even if not all are implemented right away, or ever). CollectionReader and CollectionWriter could then be refactored as compatibility shims that use Collection under the hood, so we don't maintain two implementations.

This would mean we could show a Python programmer this:

c = arvados.api().collections().get(uuid=x).execute()
with c.open('foo.txt', 'r') as f:
  foo = f.read()
with c.open('bar.txt', 'w') as f:
  f.write('bar')
c.save()

...and the Python Programmer would know how to work with data in Keep, without having to read any more docs.

I get the sense that you think this is not only a low priority, but maybe even actively a bad idea, and I'm really puzzled at that, because I don't think this API is even optional, let alone a low priority. Should we chalk out some time to hangout and figure out whether we're actually talking about the same things?

Well, I suppose that depends entirely on what "this API" ends up being. I definitely think adding a bad API is worse than doing nothing. It's good to implement a good API, though. So our mission right now is to figure out what the right API is!
  • An external API is a commitment, and having a lot of APIs to the same functionality is not necessarily better for users.
  • Proliferation of internal APIs can make the code harder to read, which is harmful. So I don't take it for granted that adding another API at a lower level than CollectionReader/Writer is a good idea. Quickly scanning CollectionWriter I see only one function, 12 lines long, that concatenates strings to build a manifest. Who else will stand to benefit from a new manifest-building API?

#8 Updated by Tom Clegg about 4 years ago

  • Subject changed from [SDKs] Manifest handling SDK to [SDKs] Good Collection API for Python SDK
  • Description updated (diff)

#9 Updated by Tim Pierce about 4 years ago

I'm a little worried that the API calls are misleading:

  • Collection.rename -- this looks like it will rename the collection, not a file it contains.
  • Collection.copy -- this looks like it will copy the collection, not a file it contains.
  • Collection.delete -- this looks like it will delete the collection, not a file it contains.

Maybe this will work out in practice: when we have stuff like

c = Collection.get(uuid)
c.rename("foo.csv", "bar.csv")
c.save()

... then it'll be clearer from context what's going on. But I think we should at least be aware of what we can do to provide clarity between "calls that operate on the collection contents" and "calls that operate on the collection record". Maybe instance methods handle the collection contents and class methods handle the container? Dunno.

For #3198 I think it will be sufficient if we can implement the open, copy, remove, unlink, and rename methods. The rest could come later.

manifest_text()
  • Return the "manifest" string representation of this collection. This implicitly commits all buffered data to disk.

We should also provide unsigned_manifest_text() to deliver a manifest stripped of any block permission signatures (i.e. suitable for calculating portable_data_hash

portable_data_hash()
  • Return the portable_data_hash that would be accepted/assigned by the API server if the collection were save()d right now. This implicitly writes all buffered data to disk, but does not update the collection record on the API server.

(TC) Alternate semantics: Return the pdh assigned/accepted by the server. Raise an exception if not saved(). But it would be weird to require save() in order to get manifest_text(), and weird if you could get manifest_text() but not portable_data_hash() when not saved().

I think generating portable_data_hash is just

"{}+{}".format(hashlib.md5(self.unsigned_manifest_text()).hexdigest(), len(self.unsigned_manifest_text()))

Yes? It could cache the PDH delivered by the server (or the one last calculated internally, for that matter) and deliver that if the collection has not been modified.

(TC) Assuming this doesn't atomically commit/save the two collections to the API server, which is currently impossible, atomicity affects only the current process. Perhaps it's OK to just offer copy+delete -- just like POSIX, which doesn't offer an atomic move (or even copy) across filesystems?

Makes sense to me. You're right about atomicity, of course (I just meant "don't delete the source file if the copy failed, for crying out loud")

(TC) One thing that makes me uncomfortable about the (old,new,dest) signature is that it's not obvious, looking at c1.copy('foo', 'bar', c2), whether we're copying c1→c2 or c2→c1.

Does it help if we establish a convention of using the dest argument as a kwarg? c1.copy('foo', 'bar', dest=c2)

Alternatively, we could add a method copy_to for copying a file to a different collection, but that seems clumsy.

#10 Updated by Peter Amstutz about 4 years ago

@dirty()@
  • I prefer "modified()"
manifest_text()
  • This should not modify anything (no implicit save/flush/sync whatever). We can easily compute the current hash of the buffer blocks on the fly to determine what the streams would be as if everything were already committed to Keep.
portable_data_hash()
  • This should just return the hash of manifest_text() and not do anything else.
copy(old, new, other_collection)
  • I think that self should be the destination collection and other_collection should be the source collection. That way, other_collection is not modified.
  • I think the method should accept either paths or a StreamFileReader/Writer objects.
rename(old, new, other_collection)
  • We can't do an atomic move between collections, so this can only be implemented as a copy and then a delete. The only question is whether or not to provide it as a convenience function or require that users write out both steps themselves (to make it clear what is happening.)

#11 Updated by Tom Clegg about 4 years ago

Tim Pierce wrote:

I'm a little worried that the API calls are misleading:

  • Collection.rename -- this looks like it will rename the collection, not a file it contains.
  • Collection.copy -- this looks like it will copy the collection, not a file it contains.
  • Collection.delete -- this looks like it will delete the collection, not a file it contains.

That's true -- it would seem a bit more OO to put this method on the file rather than the collection, maybe a getter-setter like file.name() / file.name(newname). But that gets back into "never mind what you know about files, you have to learn a whole new approach here" territory. os.rename() doesn't rename the OS, but Python programmers know that. This whole approach hinges on the user noticing "aha, collections have an os-like interface for browsing/manipulating its entries."

(And yes, collection.rename('foo','bar') -- at least to me -- strongly suggests that it renames something in the collection. "Collection has a rename() method", otoh, does seem ambiguous on its own.)

But I think we should at least be aware of what we can do to provide clarity between "calls that operate on the collection contents" and "calls that operate on the collection record". Maybe instance methods handle the collection contents and class methods handle the container? Dunno.

Hm, class methods wouldn't know which collection you're trying to operate on... If we follow the os lead, the pattern is "Collection methods only go as far as directory entries and file metadata. To get at the file content, open a file and use the file object."

(I'm not sure I follow the "collection contents" and "collection record" distinction here, though. By "collection contents" do you mean file content?)

For #3198 I think it will be sufficient if we can implement the open, copy, remove, unlink, and rename methods. The rest could come later.

I think open (or, more precisely, the File class whose instance is returned by open) is where #3198's write-magic will end up.

Check with Peter but it's possible the { serialize, deserialize, ... } set could unblock the write-magic development, and the { remove/unlink, copy, and rename } set could be done afterward, and ready in time to support various "rename" operations under FUSE. Something like that, maybe?

In fact, it seems to me that if we're agreed on #4837 Peter could already be writing and testing write-magic code against that data structure. But serialize/deserialize would be needed to test arv-mount↔write-magic integration.

We should also provide unsigned_manifest_text() to deliver a manifest stripped of any block permission signatures (i.e. suitable for calculating portable_data_hash

Currently we call that stripped_manifest_text(). If we're going to change its name (which is definitely worth considering) I suggest portable_manifest_text(). "Stripped" is vague, and "unsigned" is misleading (you need to strip more than just signatures in order to get portable_data_hash right).

I think generating portable_data_hash is just
[...]

Yes. (But we do need to make sure we're not serializing twice, or being thread-unsafe.)

Yes? It could cache the PDH delivered by the server (or the one last calculated internally, for that matter) and deliver that if the collection has not been modified.

Yes. I'd suggest starting with: Return whatever the server said if saved(), otherwise compute it. (We've bitten off a lot already, let's skip the "anything modified since pdh computed?" complication for now.)

Alternatively, we could add a method copy_to for copying a file to a different collection, but that seems clumsy.

I think my favorite alternative so far is my #2 above:

Collection.copy(c1.open('foo.txt','r'), c2.open('bar.txt','w'))

It's a bit verbose, but it's obvious what to expect. Also: if c2.open(...,'w') fails because c2 is not writable, you'll obviously get exactly the same exception you'd get if you opened it for "normal" writing. It also has the benefit of providing a fast implementation for this:

with dst.open('waz.txt', 'w') as outfile:
  Collection.copy(src.open('foo.txt','r'), outfile)
  Collection.copy(src.open('bar.txt','r'), outfile)
  Collection.copy(src.open('baz.txt','r'), outfile)

And could even be extended to do this sort of thing without moving any data:

with dst.open('waz.txt', 'w') as outfile:
  Collection.copy(src.open('foo.txt','r').slice(0,1024), outfile)
  Collection.copy(src.open('bar.txt','r').slice(0,1024), outfile)
  Collection.copy(src.open('baz.txt','r').slice(0,1024), outfile)

The same benefits would come with my #3 but that version looks trickier to implement.

#12 Updated by Tom Clegg about 4 years ago

Peter Amstutz wrote:

dirty()
  • I prefer "modified()"

Yes, modified() sounds better than saved/dirty.

manifest_text()
  • This should not modify anything (no implicit save/flush/sync whatever). We can easily compute the current hash of the buffer blocks on the fly to determine what the streams would be as if everything were already committed to Keep.

Hm. It'll be hard to predict the block signatures. (Hopefully!)

But this is possible for portable_manifest_text().

portable_data_hash()
  • This should just return the hash of manifest_text() and not do anything else.

Well, portable_manifest_text(), but yes, that's much better.

rename(old, new, other_collection)
  • We can't do an atomic move between collections, so this can only be implemented as a copy and then a delete. The only question is whether or not to provide it as a convenience function or require that users write out both steps themselves (to make it clear what is happening.)

To me, the attraction of "make it clear what is happening" + "do what you would do in POSIX when moving across filesystems (except that here we provide a copy() feature!)" make the convenience function seem somewhere between low-priority and unappealing.

#13 Updated by Tom Clegg about 4 years ago

  • Description updated (diff)

#14 Updated by Tim Pierce about 4 years ago

  • Assigned To changed from Tim Pierce to Peter Amstutz

#15 Updated by Peter Amstutz about 4 years ago

Benchmark note: taking and releasing a lock on every read imposes an up to ~30% performance hit in a tight loop with smallish (1 KiB) reads. This can be mitigated with larger reads (64 KiB).

#16 Updated by Peter Amstutz about 4 years ago

  • Status changed from New to In Progress

#17 Updated by Peter Amstutz about 4 years ago

Notes for when you get to this:

manifest_text() / portable_manifest_text():
  • Current a single manifest_text() method with optional parameters strip and normalize
listdir() / walk()
  • Not implemented. However you can use Collection.find() to look up a path, you can iterate over collection objects, and you can do regex matching yourself...
remove() / unlink()
  • remove() is implemented, there is currently no unlink() alias.
rename()
  • Removes files/directories from one collection and adds them to another. Haven't written any tests for it. May remove this in favor of copy()/remove().
copy()
  • Missing, remains to be implemented in some form. Should probably support deep copy of both nested collections and files. Need to define behavior when copying a file that has been written to but not committed.
glob()
  • Not implemented, arguably not necessary

Outstanding issues:

  • Collections and ArvadosFile objects are not threadsafe. Locking on every read and write incurs some overhead that is unnecessary when there is only a single thread and/or files are only opened for reading. Right now I'm inclined to make it the responsibility of the caller to manage access to collection/file objects by multiple threads.
  • Uses background threads for get and put. These use a passed in keep client object, which carries with it an api client object. The api client object may not be threadsafe. (The FUSE driver handles the api client thread safety problem by having a special api client wrapper that allocates a separate api client object per thread).
  • Lots of small files will result in lots of small blocks. It could pack small files into bigger blocks upon commit, but that code hasn't been written yet.

#18 Updated by Tom Clegg about 4 years ago

  • Target version changed from 2015-01-07 sprint to 2015-01-28 Sprint

#19 Updated by Tom Clegg about 4 years ago

  • Story points changed from 3.0 to 1.0

#20 Updated by Tim Pierce about 4 years ago

Review at 04f5f75659

This is some really excellent stuff, so please don't take the length of this review as criticism. IMHO this is a huge step forward and I think it's going to make a lot of things pretty awesome. I think most of these comments are about clarity and code organization anyway.

Overall

  • Please add docstrings to public methods, documenting arguments, return values and side effects. Most of the methods here look pretty self explanatory but there are several that would really benefit from some documentation, like:
    • BufferBlock
    • ArvadosFile
    • Collection.find
    • arvados.collection.import_manifest

ArvadosFileReaderBase

  • The _NameAttribute shim was added to provide backward compatibility for old code using StreamFileReader.name(), right? If so, it should probably stay in StreamFileReader.
  • For clarity, ArvadosFileReaderBase should implement stubs for these methods (i.e. just have them raise NotImplementedError)
    • size()
    • read()

BufferBlock

  • I don't see any mechanism for making sure that a client doesn't allocate a BufferBlock larger than the maximum permitted size of a Keep block. How do we want to handle this?
  • It seems weird for BufferBlock.append() to raise AssertionError if the block isn't writable. I'd expect this to be IOError.
  • Not clear why the _put_queue is limited to 2 blocks (or why the thread managers hold only 2 threads). Is it arbitrary or is there some problem with managing a longer queue or more threads? Either way, please document it in a comment so we don't lose track.
  • block_prefetch appears to get each block and then not do anything with it. AFAIK KeepClient.get() doesn't do anything with the block once it's fetched, so if we don't save it somewhere, it will just get thrown away. Have I misunderstood what's going on here?

ArvadosFile

  • This class seems to be the odd man out in arvfile.py -- it doesn't have any relationship to ArvadosFileBase, ArvadosFileReader, or anything else in this package. It would help to have a little package documentation giving the reader an idea of what these classes are for.
  • Both stream and segments are defined pretty vaguely here (and if they're both a list of Range objects, what's the difference?) This would be a good place in particular to define more clearly what a "segment" is.

Collection

  • Documentation for what the "parent" parameter is (it looks like this has something to do with exposing subdirectories in collections as their own Collection object?)
  • _my_api(), _my_keep(), _my_block_manager()
    • _my_api() could set self._api_client = self.parent._my_api() to avoid having to repeatedly perform a recursive lookup. Ditto for _my_keep() and _my_block_manager()
  • _populate_from_api_server(), _populate_from_keep()
    • The comment about creating an API client at the last possible moment now makes more sense before the _my functions.
    • Need to enumerate the exceptions that are being handled in _populate_from_api_server, _populate_from_keep etc.
  • Collection.find()
    • appears to assume that the last component in the path is a simple file. What if it isn't and the user has just given a path to a subcollection within the collection? Will find() still work?
  • import_manifest
    • should use arvados.util.portable_data_hash_pattern to match block locators.
    • I don't quite understand why this is a package function and not a method on a Collection object. What's the difference between these, given that a Collection will implicitly call import_manifest when it populates itself anyway?
          c = import_manifest('. 781e5e245d69b566979b86e28d23f2c7+10 0:10:count.txt\n', api_client=api, keep_client=keep)
      
          c = Collection('. 781e5e245d69b566979b86e28d23f2c7+10 0:10:count.txt\n', api_client=api, keep_client=keep)
      
  • Collection.save()
    • Is there any reason that this shouldn't just call Collection.save_as() if it's a new collection, rather than throwing an exception?

range.py

  • locators_and_ranges()
    • commented-out test should either be removed (or maybe used to raise an exception, but probably not worth it)
  • replace_range()
    • docstring doesn't reflect the actual parameters
    • commented-out test can be removed here too

stream.py

  • locator_block_size()
    • can be replaced with KeepLocator(loc).size I think
  • StreamReader
    • _keepget(), readfrom()
      • What are these wrappers for? Are they here just to override the underlying KeepClient's num_retries setting to None?
  • StreamWriter
    • Is this commented-out class a work in progress, or something that we started but didn't go anywhere? Either way it should probably be taken out and done something with.

#21 Updated by Tom Clegg almost 4 years ago

  • Target version changed from 2015-01-28 Sprint to 2015-02-18 sprint

#22 Updated by Brett Smith almost 4 years ago

Looking at 8128b45. The comments here are pretty high-level. I haven't even read many of the method implementations in detail. It seems like we'll both save time by worrying about the high-level API and design issues first, and then dig into the implementation.

What am I reviewing?

A lot of the code in this branch is meant to provide API support for #3198, but there hasn't been any discussion about what that API should look like, what we can support in the SDK, etc. I also don't have the benefit of being able to tell whether or how the FUSE driver will use it, because that end isn't implemented yet.

I think we need to have a team discussion about what public APIs you expect to need for #3198, and to make sure they make sense in the context of the SDK. Basically, it would be nice if the top of #3198 looked like the top of this ticket.

API differences from spec

Even after accounting for note-17,

  • There is no Collection() constructor in any form.
  • copy() does not match the spec.
  • You mentioned that the implementation for manifest_text() has diverged from the spec, but I don't understand why.

Merge with master

I know there have been a lot of outside forces leading to this situation, none of which are your fault, but you need to merge with master and account for some of the recent SDK changes.

  • Is AsyncKeepWriteErrors still needed after #3835?
  • #3021 did some work in arvados_testutil to create mock API objects that might help in some of your testing.

New API questions

  • It seems surprising that find() can create objects. The name suggests a read-only operation.
  • I'm not a fan of having methods take arguments that substantially change their behavior. Usually it seems simpler for both the user and developer to have separate methods.
    • Having api accept a separate dictionary of setup parameters looks odd when it already accepts those parameters as arguments. A wrapper seems more appropriate.
    • Same applies to cache_only in KeepClient.get().
  • I don't see any reason why the new functions in the ranges module should have debug arguments. It seems better to always log debug messages.

Changes to the existing API

  • locators_and_ranges is used by at least one Crunch script, so it needs to remain backward-compatible. I think this could be straightforward by making Range and LocatorAndRange NamedTuples (from the collections module)?
  • from arvfile import * - Following PEP 8, please don't import *. It brings a lot of stuff into the module's namespace that probably isn't designed to be there, and can make it difficult to move things around later. As far as I'm concerned, the existing imports in __init__.py are grandfathered in, and we should never do this again.

Naming things

I'm putting my foot down: single-letter variable names are appropriate for short loops and except blocks only. Anywhere else, it's too difficult to remember what they mean. Please spell out longer names.

Please don't use variable names that mask built-in functions, like id.

I've learned the hard way, it helps a lot if your decorators give useful names to the wrapper function, rather than the generic wrapper. That name shows up in tracebacks, so knowing at a glance what the wrapper does can save a lot of time in debugging.

SafeApi probably needs a more distinguishing name in our post-#5037 world. Just as one suggestion, maybe ThreadSafeApiCache?

It's surprising that ArvadosFile isn't related to ArvadosFileBase. A more distinct name could help clarify how its functionality is different from that class hierarchy.

recursive seems like a nicer name than rm_r for the remove method.

Tests

Please make tests more focused. Ideal is setting up state, then making assertions about that state. Interweaving a series of mutations and assertions makes it difficult to tell what went wrong when a test fails.

Please give the tests descriptive names so that future readers can understand what's being tested, and how tests differ from each other.

The following units look like they're meant to be independent but have no dedicated tests:

  • BufferBlock
  • BlockManager
  • import_collection
  • export_collection
  • The new ranges functions
  • The Collection methods portable_data_hash, update, save, and open with modes a, a+, and w

Miscellaneous

Please make docstrings follow PEP 257. In particular, each docstring should start with opening quotes followed immediately by a one-line summary of the class or method. Multi-line docstrings should have a blank line after that summary.

Don't call dunder methods that don't belong to self, maybe through super(). Prefer iter(foo) to foo.__iter__().

You committed your development setup to the API server's application.yml.example.

#23 Updated by Peter Amstutz almost 4 years ago

Brett Smith wrote:

What am I reviewing?

A lot of the code in this branch is meant to provide API support for #3198, but there hasn't been any discussion about what that API should look like, what we can support in the SDK, etc. I also don't have the benefit of being able to tell whether or how the FUSE driver will use it, because that end isn't implemented yet.

While originally intended just to support #3198, this story has taken on a life of its own, so the API is intended to provide general SDK support to users for Keep collection and file manipulations.

I think we need to have a team discussion about what public APIs you expect to need for #3198, and to make sure they make sense in the context of the SDK. Basically, it would be nice if the top of #3198 looked like the top of this ticket.

FUSE needs a representation that includes directories (implemented here as Collection and Subcollection), directory manipulation (Collection.mkdirs, Collection.remove) and writable files (implemented as ArvadosFile).

API differences from spec

Even after accounting for note-17,

  • There is no Collection() constructor in any form.

I had changed Collection to CollectionRoot, but have changed it back. However, users are encouraged to use ReadOnlyCollection() or WritableCollection() instead of the Collection() constructor generally.

  • copy() does not match the spec.

It's bad form to mutate a method's arguments, so I choose to specify the source collection rather than the destination collection.

  • You mentioned that the implementation for manifest_text() has diverged from the spec, but I don't understand why.

The spec says that accessing manifest_text() would implicitly commit the collection first. This seems like an unnecessary and confusing side effect. This sort of makes sense for the way CollectionWriter is used (the manifest text isn't available until the very end), but writable collections decouple collection modifications from commits and the manifest text can be generated at any time.

Merge with master

I know there have been a lot of outside forces leading to this situation, none of which are your fault, but you need to merge with master and account for some of the recent SDK changes.

Done.

  • Is AsyncKeepWriteErrors still needed after #3835?

Not needed any more. Uses KeepWriteError now.

  • #3021 did some work in arvados_testutil to create mock API objects that might help in some of your testing.

I did not rewrite any existing test mocks, but I'll look at using mocks more in the future.

New API questions

  • It seems surprising that find() can create objects. The name suggests a read-only operation.
  • I'm not a fan of having methods take arguments that substantially change their behavior. Usually it seems simpler for both the user and developer to have separate methods.

Separated into find() and find_or_create()

  • Having api accept a separate dictionary of setup parameters looks odd when it already accepts those parameters as arguments. A wrapper seems more appropriate.

This is a wart in the current api that I was trying to address. By default it queries the arvados.config object, but you can't directly override the arvados.config object, you have provide "host" and "token" separately, which means juggling those parameters separately with a wrapper like ThreadSafeApiCache and duplicating the logic for reading the configuration environment, which violates DRY. (I would just remove the the host and token parameters if it it didn't affect backwards compatibility.) I'm open to alternatives.

  • Same applies to cache_only in KeepClient.get().

Split out to KeepClient.get_from_cache

  • I don't see any reason why the new functions in the ranges module should have debug arguments. It seems better to always log debug messages.

Changed to use logger (or removed).

Changes to the existing API

  • locators_and_ranges is used by at least one Crunch script, so it needs to remain backward-compatible. I think this could be straightforward by making Range and LocatorAndRange NamedTuples (from the collections module)?

That's from an unused feature of the splitfastq crunch script, and locators_and_ranges is not intended to be a public API. Deleted the offending code from splitfastq.

This did get me thinking that ArvadoFile should have a "slice()" method that returns a new ArvadoFile with a subset of the existing file's contents, however this scoped has creeped enough.

  • from arvfile import * - Following PEP 8, please don't import *. It brings a lot of stuff into the module's namespace that probably isn't designed to be there, and can make it difficult to move things around later. As far as I'm concerned, the existing imports in __init__.py are grandfathered in, and we should never do this again.

Cleaned up imports.

Naming things

I'm putting my foot down: single-letter variable names are appropriate for short loops and except blocks only. Anywhere else, it's too difficult to remember what they mean. Please spell out longer names.

Fixed a bunch of instances of overly terse names in the code.

Please don't use variable names that mask built-in functions, like id.

Fixed.

I've learned the hard way, it helps a lot if your decorators give useful names to the wrapper function, rather than the generic wrapper. That name shows up in tracebacks, so knowing at a glance what the wrapper does can save a lot of time in debugging.

Fixed.

SafeApi probably needs a more distinguishing name in our post-#5037 world. Just as one suggestion, maybe ThreadSafeApiCache?

ThreadSafeApiCache it is.

It's surprising that ArvadosFile isn't related to ArvadosFileBase. A more distinct name could help clarify how its functionality is different from that class hierarchy.

Renamed "ArvadosFileBase" to "FileLikeObjectBase" which more accurately describes its function.

recursive seems like a nicer name than rm_r for the remove method.

Done.

Tests

Please make tests more focused. Ideal is setting up state, then making assertions about that state. Interweaving a series of mutations and assertions makes it difficult to tell what went wrong when a test fails.

Many of these tests are inherently designed to test that a representative sequence of operations produces the intended outcome. The purpose of making assertions about the intermediate state is specifically there to make it easier to tell what went wrong when a test fails, instead of having the failure only captured by a final assertion.

Please give the tests descriptive names so that future readers can understand what's being tested, and how tests differ from each other.

Fixed.

The following units look like they're meant to be independent but have no dedicated tests:

  • BufferBlock
  • BlockManager
  • import_collection
  • export_collection
  • The new ranges functions
  • The Collection methods portable_data_hash, update, save, and open with modes a, a+, and w

Added many new tests. There is NewCollectionTestCase.test_import_export_manifest, but also import_collection and export_collection are implicitly tested by many other methods.

Arguably the tests could benefit from another refactoring pass, but ultimately tests are a means to an end of ensuring code quality, and not an end to themselves.

Miscellaneous

Please make docstrings follow PEP 257. In particular, each docstring should start with opening quotes followed immediately by a one-line summary of the class or method. Multi-line docstrings should have a blank line after that summary.

Fixed.

Don't call dunder methods that don't belong to self, maybe through super(). Prefer iter(foo) to foo.__iter__().

Fixed, I think. I only found one place where this occurred.

You committed your development setup to the API server's application.yml.example.

Fixed.

#24 Updated by Peter Amstutz almost 4 years ago

  • Target version changed from 2015-02-18 sprint to 2015-03-11 sprint

#25 Updated by Brett Smith almost 4 years ago

Reviewing f39118d. I have read more into the implementations for this round, which informs some of my discussion about the API design. I'm holding off smaller comments about them because it's still not clear how many of them will be relevant after continued discussion.

Peter Amstutz wrote:

Brett Smith wrote:

API differences from spec

Even after accounting for note-17,

  • There is no Collection() constructor in any form.

I had changed Collection to CollectionRoot, but have changed it back. However, users are encouraged to use ReadOnlyCollection() or WritableCollection() instead of the Collection() constructor generally.

The whole suite of non-classes and extra createFooCollection() constructor methods here is confusing for readers, confusing for users, and more API than we need to meet the requirements. I think it should be implemented as:

  • Collection behaves as the current WritableCollection.
  • ReadOnlyCollection is a Collection subclass that uses NoopLock for the instance lock, and raises an error if you try any write operations.
  • LiveCollection is a Collection subclass that overrides the initializer and other methods as needed to manage an events subscription and update the object when events come in.
  • CollectionReader should fit into this hierarchy somehow. You're literally copying huge swaths of code from CollectionReader into Collection, so it should be possible to DRY this up.
  • None of these constructors take a sync argument. There are no extra createFooCollection() constructor methods for them.

Related:

  • import_manifest should be a method of Collection. You said it yourself: having methods mutate their arguments is weird. You don't need the make-a-new-Collection form because we already support constructing Collection objects from manifests.
  • Whatever export_manifest does should be covered by Collection.manifest_text. In general, polymorphism is more Pythonic than switching on isinstance results: it's more consistent with the duck typing philosophy.
  • You mentioned that the implementation for manifest_text() has diverged from the spec, but I don't understand why.

The spec says that accessing manifest_text() would implicitly commit the collection first.

I'm referring to the discrepancy discussed in note-17: the fact that the spec says that the method takes no arguments, but you have it taking two.

New API questions

  • locator_block_size(s) is redundant. We already have KeepLocator(s).size.
  • BlockManager has at least a couple of methods that grab the lock of one of its child blocks, and then manipulate that. Having one object grab another's lock is really surprising and seems bug-prone. Please move this functionality into BufferBlock methods.
  • Ditto for CollectionBase.copy.
  • It seems surprising that find() can create objects. The name suggests a read-only operation.
  • I'm not a fan of having methods take arguments that substantially change their behavior. Usually it seems simpler for both the user and developer to have separate methods.

Separated into find() and find_or_create()

There's a lot of repeated tree traversal code between these methods and remove(). Please DRY this up into a method.

  • Having api accept a separate dictionary of setup parameters looks odd when it already accepts those parameters as arguments. A wrapper seems more appropriate.

This is a wart in the current api that I was trying to address. By default it queries the arvados.config object, but you can't directly override the arvados.config object, you have provide "host" and "token" separately, which means juggling those parameters separately with a wrapper like ThreadSafeApiCache and duplicating the logic for reading the configuration environment, which violates DRY. (I would just remove the the host and token parameters if it it didn't affect backwards compatibility.) I'm open to alternatives.

I can agree that our earlier solution to this problem, adding individual arguments to api(), was probably not the best design-wise. But we're stuck with it, and accepting an extra configuration argument just exacerbates the issues around it. Now users are left to wonder which configuration mechanism they "should" use, and how they take precedence over each other. I would prefer to see a new method like api_from_config() (not wedded to the name) that takes a config-like dict and calls api() with the corresponding arguments.

Changes to the existing API

  • locators_and_ranges is used by at least one Crunch script, so it needs to remain backward-compatible. I think this could be straightforward by making Range and LocatorAndRange NamedTuples (from the collections module)?

That's from an unused feature of the splitfastq crunch script, and locators_and_ranges is not intended to be a public API. Deleted the offending code from splitfastq.

The problem here is that we've only recently started giving users hints about what APIs are intended to be public or not, and so a lot of them have been used as if they're public, regardless of our intentions as developers. Just because we're not using it externally doesn't mean none of our users are.

I'm willing to agree that the risk is low enough, and the upgrade path is easy enough, that we can let this one go. But in general, saying "the API wasn't intended to be public" doesn't justify an incompatible change unless the intent was clearly marked or documented.

Speaking of which: I thought we discussed moving some of the not-ready-for-public-consumption additions in this branch to a separate module marked with an underscore, or some other mechanism. What happened to that?

  • from arvfile import * - Following PEP 8, please don't import *. It brings a lot of stuff into the module's namespace that probably isn't designed to be there, and can make it difficult to move things around later. As far as I'm concerned, the existing imports in __init__.py are grandfathered in, and we should never do this again.

Cleaned up imports.

You missed several. git grep -F 'import *'

Also per PEP 8, please do all imports at the top of the file. At least StreamFileReader.as_manifset does an inline import.

Naming things

I'm putting my foot down: single-letter variable names are appropriate for short loops and except blocks only. Anywhere else, it's too difficult to remember what they mean. Please spell out longer names.

Fixed a bunch of instances of overly terse names in the code.

I think it's worth expanding

  • r in the different segments loops of ArvadosFile
  • n and s in import_manifest
  • Many in normalize_stream (especially the argument s)

SafeApi probably needs a more distinguishing name in our post-#5037 world. Just as one suggestion, maybe ThreadSafeApiCache?

ThreadSafeApiCache it is.

Can this just go in the api module? I think that's better for discoverability since it's meant to be public, and it's short enough that I don't think it would hurt maintainability too much either.

Tests

Please make tests more focused. Ideal is setting up state, then making assertions about that state. Interweaving a series of mutations and assertions makes it difficult to tell what went wrong when a test fails.

Many of these tests are inherently designed to test that a representative sequence of operations produces the intended outcome.

I'm fine when that's the intent of the test, but there are many instances where it isn't. Take this block from test_truncate:

       with WritableCollection('. 781e5e245d69b566979b86e28d23f2c7+10 0:10:count.txt\n',
                             api_client=api, keep_client=keep) as c:
            writer = c.open("count.txt", "r+")
            self.assertEqual(writer.size(), 10)
            writer.seek(5)
            self.assertEqual("56789", writer.read(8))
            writer.truncate(8)
            writer.seek(5, os.SEEK_SET)
            self.assertEqual("567", writer.read(8))

Asserting size() doesn't change the state at all. And seeking doesn't change the behavior of truncate, so why do it, let alone assert the result?

For another example, this block begins three different tests in CollectionCreateUpdateTest, none of which are intended to test the constructor:

       c1 = arvados.collection.createWritableCollection("hello world")
        self.assertEquals(c1.portable_data_hash(), "d41d8cd98f00b204e9800998ecf8427e+0")
        self.assertEquals(c1.api_response()["portable_data_hash"], "d41d8cd98f00b204e9800998ecf8427e+0" )

Arguably the tests could benefit from another refactoring pass, but ultimately tests are a means to an end of ensuring code quality, and not an end to themselves.

I agree, but improving test readability contributes directly to that goal. Focused tests give developers a lot more information beyond a simple boolean "this does or doesn't conform to expected behavior;" they can give the developer extra information about what went wrong to help debug the issue.

test_create_subdir doesn't assert anything.

Miscellaneous

Please make docstrings follow PEP 257. In particular, each docstring should start with opening quotes followed immediately by a one-line summary of the class or method. Multi-line docstrings should have a blank line after that summary.

Fixed.

I'm not seeing it. There are still many docstrings that start with multiple consecutive lines of text, or with a blank line.

Why does the code have so many explicit casts to long? Python generally takes care of this for you:

>>> import sys
>>> type(sys.maxint)
<type 'int'>
>>> sys.maxint
9223372036854775807
>>> sys.maxint + 1
9223372036854775808L
>>> type(_)
<type 'long'>

Thanks.

#26 Updated by Brett Smith almost 4 years ago

  • CollectionReader should fit into this hierarchy somehow. You're literally copying huge swaths of code from CollectionReader into Collection, so it should be possible to DRY this up.

Thinking over this some more, I was a little too adamant about the solution. I still think it's important to DRY this up. I think some kind of subclassing arrangement will probably end up being the most straightforward way to do that, but I don't care as much whether or not CollectionReader is somehow related in a class hierarchy, or even that there's necessarily subclassing at all, if there's a more straightforward way to do it.

Another thing I meant to mention earlier: what's up with tests calling private methods and asserting private attributes? I think it would be surprising for a developer to refactor class internals and have tests fail because of it. And asserting private state gets us further away from verifying public behavior, which is what users care about.

Thanks again.

#27 Updated by Brett Smith almost 4 years ago

The FUSE "edge cases" test revealed some awkward interactions between POSIX semantics and what's expressable in manifests. We discussed it some in the office. Highlights:

  • It seems safer, for now at least, to have the Collection objects obey the rules of the manifest format only. Collection(uuid).open('./.') should open a file named '.', not raise EISDIR. The root collection class can understand '.' at the beginning of a path to refer to self; no others should. Nothing should treat '..' specially. Users who construct relative paths can use os.path.normpath() to make those paths acceptable to Collection methods (assuming the normalized path is sensible).
  • If a manifest is not unambiguously expressable as a tree—e.g., it defines both a file and a stream named 'foo'—that should raise an error when instantiating the Collection. The API discussed in this story description assumes deep in its bones that paths are unambiguous, so it seems OK to refuse to support this, even if some methods could theoretically do something useful (e.g., currently CollectionReader.open() returns the file and ignores the stream).
  • We discussed DRYing up the path traversal code between find/find_or_create/remove. I still think it would be a good idea but I recognize the methods are doing very different things in the different cases, and don't consider this a blocker for merging.
  • API versioning would be cool but not in this branch.

#28 Updated by Peter Amstutz almost 4 years ago

Brett Smith wrote:

The whole suite of non-classes and extra createFooCollection() constructor methods here is confusing for readers, confusing for users, and more API than we need to meet the requirements. I think it should be implemented as:

  • Collection behaves as the current WritableCollection.

Done.

  • ReadOnlyCollection is a Collection subclass that uses NoopLock for the instance lock, and raises an error if you try any write operations.
  • CollectionReader should fit into this hierarchy somehow. You're literally copying huge swaths of code from CollectionReader into Collection, so it should be possible to DRY this up.

CollectionReader is now a subclass of Collection with the changes described above, and also backwards compatible with the existing CollectionReader (passes all tests).

  • LiveCollection is a Collection subclass that overrides the initializer and other methods as needed to manage an events subscription and update the object when events come in.

Removed LiveCollection entirely.

  • None of these constructors take a sync argument. There are no extra createFooCollection() constructor methods for them.

Removed sync argument and createFoo() functions.

Related:

  • import_manifest should be a method of Collection. You said it yourself: having methods mutate their arguments is weird. You don't need the make-a-new-Collection form because we already support constructing Collection objects from manifests.

Done.

  • Whatever export_manifest does should be covered by Collection.manifest_text. In general, polymorphism is more Pythonic than switching on isinstance results: it's more consistent with the duck typing philosophy.

Done.

  • You mentioned that the implementation for manifest_text() has diverged from the spec, but I don't understand why.

The spec says that accessing manifest_text() would implicitly commit the collection first.

I'm referring to the discrepancy discussed in note-17: the fact that the spec says that the method takes no arguments, but you have it taking two.

I think this was an oversight in the the spec since it is necessary to strip or normalize the contents in certain situations. The method signature for manifest_text() now implemented is compatible with the existing CollectionReader.

New API questions

  • locator_block_size(s) is redundant. We already have KeepLocator(s).size.

Done.

  • BlockManager has at least a couple of methods that grab the lock of one of its child blocks, and then manipulate that. Having one object grab another's lock is really surprising and seems bug-prone. Please move this functionality into BufferBlock methods.

Fixed.

  • Ditto for CollectionBase.copy.

This lock turns out to be unnecessary since target_dir is a subcollection of the 'self' and thus already under the umbrella lock for the whole collection.

I can agree that our earlier solution to this problem, adding individual arguments to api(), was probably not the best design-wise. But we're stuck with it, and accepting an extra configuration argument just exacerbates the issues around it. Now users are left to wonder which configuration mechanism they "should" use, and how they take precedence over each other. I would prefer to see a new method like api_from_config() (not wedded to the name) that takes a config-like dict and calls api() with the corresponding arguments.

Added api_from_config() also renamed api.py to apisetup.py because the api method was shadowing the api module making the api module impossible to import from outside the arvados module.

Changes to the existing API

Speaking of which: I thought we discussed moving some of the not-ready-for-public-consumption additions in this branch to a separate module marked with an underscore, or some other mechanism. What happened to that?

The _ranges is module now underscored, as well as various classes including _BufferBlock and _BlockManager.

  • from arvfile import * - Following PEP 8, please don't import *. It brings a lot of stuff into the module's namespace that probably isn't designed to be there, and can make it difficult to move things around later. As far as I'm concerned, the existing imports in __init__.py are grandfathered in, and we should never do this again.

Cleaned up imports.

You missed several. git grep -F 'import *'

Fixed all the "import *" in code that I touched.

Also per PEP 8, please do all imports at the top of the file. At least StreamFileReader.as_manifset does an inline import.

That was due to a circular dependency, so I couldn't import at the top. Fixed by refactoring.

I think it's worth expanding

  • r in the different segments loops of ArvadosFile
  • n and s in import_manifest
  • Many in normalize_stream (especially the argument s)

Fixed.

SafeApi probably needs a more distinguishing name in our post-#5037 world. Just as one suggestion, maybe ThreadSafeApiCache?

ThreadSafeApiCache it is.

Can this just go in the api module? I think that's better for discoverability since it's meant to be public, and it's short enough that I don't think it would hurt maintainability too much either.

ThreadSafeApiCache also creates a KeepClient object, which means it can't go into API without creating circular dependency since keep depends on api and api would depend on keep.

Take this block from test_truncate:

[...]

Asserting size() doesn't change the state at all. And seeking doesn't change the behavior of truncate, so why do it, let alone assert the result?

Fixed.

For another example, this block begins three different tests in CollectionCreateUpdateTest, none of which are intended to test the constructor:

Fixed.

Arguably the tests could benefit from another refactoring pass, but ultimately tests are a means to an end of ensuring code quality, and not an end to themselves.

I agree, but improving test readability contributes directly to that goal. Focused tests give developers a lot more information beyond a simple boolean "this does or doesn't conform to expected behavior;" they can give the developer extra information about what went wrong to help debug the issue.

Some of the tests have some noisy redundancy because of they were copied and pasted instead of attempting to refactor common code. On the other hand, in my experience the DRY principal is not necessarily a good thing in tests. Since by definition don't know ahead of time all possible errors are going to be caught by the test, it's better to test code paths in many places (even when it might appear redundant) since a failure might only appear in a specific circumstance where state is being invisibly mutated behind the scenes.

test_create_subdir doesn't assert anything.

Fixed.

Miscellaneous

Please make docstrings follow PEP 257. In particular, each docstring should start with opening quotes followed immediately by a one-line summary of the class or method. Multi-line docstrings should have a blank line after that summary.

Fixed.

I'm not seeing it. There are still many docstrings that start with multiple consecutive lines of text, or with a blank line.

All of the code I touched should have PEP 257/287 docstrings. There's lots of old code that's not up to standards, which might include stuff I moved around but did not otherwise modify.

Why does the code have so many explicit casts to long? Python generally takes care of this for you:

Probably out of an abundance of caution, but if Python DTRT then it's not necessary. Removed.

Now at ad69b48

#29 Updated by Brett Smith almost 4 years ago

Reviewing ad69b48

New API stuff

I think malformed manifests should always raise ArgumentError when you're instantiating with them, and SyntaxError any other time. Having it sometimes raise IOError (as in test_init_manifest_with_collision) seems surprising—parsing a manifest string doesn't involve any IO.

The tests (that one plus test_init_manifest_with_error) make it impossible to know whether the constructor or manifest_text() is expected to raise the error. We specified that it should happen during the constructor; removing the manifest_text() call would help clarify that to future readers.

Peter Amstutz wrote:

Brett Smith wrote:

  • None of these constructors take a sync argument. There are no extra createFooCollection() constructor methods for them.

Removed sync argument and createFoo() functions.

There's still a lot of internal references to the old sync arguments. Some of it is noop cruft. Other instances could be replaced with, e.g., a simple boolean writable() method on Collection objects. I think cleaning all this up would help make this code easier to follow for future readers, since "sync mode" isn't a meaningful concept in the current implementation.

  • import_manifest should be a method of Collection. You said it yourself: having methods mutate their arguments is weird. You don't need the make-a-new-Collection form because we already support constructing Collection objects from manifests.

Done.

The variables n, s, and f in this new implementation should be spelled out.

  • BlockManager has at least a couple of methods that grab the lock of one of its child blocks, and then manipulate that. Having one object grab another's lock is really surprising and seems bug-prone. Please move this functionality into BufferBlock methods.

Fixed.

This is still an issue in _BlockManager.dup_block.

Added api_from_config() also renamed api.py to apisetup.py because the api method was shadowing the api module making the api module impossible to import from outside the arvados module.

I would love to clean this up, but I don't think we can without breaking backward compatibility. With current master you can say things like from arvados.api import api and it works fine. Renaming the module would break that code.

Along the same lines, it's fine to move classes and methods around if the new organization makes more internal sense, but we need to make sure they're still reachable from their old locations. The main change I noticed here is that CollectionWriter is no longer available from arvados.collections. A simple import would fix that. Some others are in that quasi-public state where it's harder to say how much they matter, like stream.split.

Bugs

It would be good to add tests for these cases.

  • ArvadosFile.manifest_text() refers to a name stream which does not seem to be in scope. I believe this will raise NameError.
  • The code in ArvadosFileWriter.close() is a noop. It never calls super(), so the object doesn't actually go into a closed state.
  • find and find_or_create will return will return self if the path starts with /. I'm not 100% sure what the right thing to do is in this case (raise ArgumentError?), but I'm pretty sure returning self isn't it.
  • arv-normalize is trying to refer to KEEP_BLOCK_SIZE at its new location, but never imports config, so it crashes.

Other functional code issues

  • I don't think Collection.__exit__ should commit anything if the block raised an exception (exc_type is not None).
  • As Tim mentioned earlier, because it's a backward compatibility hack, I would prefer to see _NameAttribute used only in StreamFileReader. I don't think we should support a name method through any new objects with a file-like API.
  • BlockManager is not using KeepWriteError in line with its API. The error pairs are expected to have Keep service roots as the first argument, not block locators. I don't want to get in the way of you raising whatever information you need, but the mismatch here is awkward since the error has methods like service_errors that don't make sense with this data.
  • SynchronizedCollectionBase.manifest_text() does a lot of string concatenation. Per PEP 8, this can be surprisingly expensive in Python. Prefer building a list of strings and joining it into one after you're done looping.
  • SynchronizedCollectionBase.iterkeys() returns a full list of keys, rather than an iterator over them, which is surprising.
  • The __repr__ methods for range-related objects should should use %r instead %s in the format string. This gives you the argument's own repr, so you don't have to worry about quoting it, what happens if the string has quotes in it, etc.
  • Is it possible to tighten up the exceptions that the bufferblock workers catch?
  • Don't say just except: - at least BlockManager.block_prefetch() does this.
  • Don't leave debug prints lying around - at least BlockManager.commit_bufferblock() has one.

Style issues

  • It would be nicer if CollectionReader.__init__ required manifest_locator_or_text in its signature (just by naming it as the first argument, before *args). This way users get the Python native exception if they fail to provide it, so you don't have to worry about generating that yourself, and other tools can introspect it correctly.
  • For the path traversal methods of Collections, I think you can save yourself a lot of manual work by using a max split argument: pathcomponents = path.split("/", 1). Then the first element is the thing to find here, and the second element (if any) is the rest of the path. You don't have to worry about del pathcomponents[0] or rejoining the array.
  • Why did StreamReader methods get broken into public and private versions? It has no locking, so I don't understand what this indirection accomplishes.
  • There's commented out code in range methods. Please remove it. Yeah, it was there before, but you might as well take this opportunity while you're moving it anyway.
  • There's a redundant hashlib import in arvfile.

Tests

  • My earlier comment about testing internal attributes and methods in note-26 is still open.
  • Prefer assertMatches(…) to assertTrue(re.match(…)). This generates a nicer diagnostic if the assertion fails.
  • Many of those re.match patterns have dots in filename extensions that should very strictly be escaped.
  • assertEquals is deprecated. Prefer assertEqual. (Why? Darned if I know. But the docs are pretty clear about this)

Documentation

Please make docstrings follow PEP 257. In particular, each docstring should start with opening quotes followed immediately by a one-line summary of the class or method. Multi-line docstrings should have a blank line after that summary.

Fixed.

I'm not seeing it. There are still many docstrings that start with multiple consecutive lines of text, or with a blank line.

All of the code I touched should have PEP 257/287 docstrings. There's lots of old code that's not up to standards, which might include stuff I moved around but did not otherwise modify.

PEP 257 says, "Multi-line docstrings consist of a summary line just like a one-line docstring, followed by a blank line, followed by a more elaborate description. The summary line may be used by automatic indexing tools; it is important that it fits on one line and is separated from the rest of the docstring by a blank line." There are many new docstrings that open with more than one line of text; for example, many in the new classes ArvadosFile, BufferBlock, and BlockManager.

  • The docstrings for Collection.save and .save_new don't match the signatures.
  • Please add KeepClient's new session argument to the constructor docstring.
  • The find_or_create docstring has a couple of typos referreng to arvado.collection (missing 's').
  • stream.split previously had a method signature as its summary line. This is less compliant with PEP 287, but it had the upside that our generated documentation can parse it out to provide clearer information about arguments and return values. I'm not wild about losing that before our generated documentation understands PEP 287 better.

Thanks.

#30 Updated by Peter Amstutz almost 4 years ago

Brett Smith wrote:

Reviewing ad69b48

New API stuff

I think malformed manifests should always raise ArgumentError when you're instantiating with them, and SyntaxError any other time. Having it sometimes raise IOError (as in test_init_manifest_with_collision) seems surprising—parsing a manifest string doesn't involve any IO.

Done. SyntaxError gets caught and re-thrown as ArgumentError.

The tests (that one plus test_init_manifest_with_error) make it impossible to know whether the constructor or manifest_text() is expected to raise the error. We specified that it should happen during the constructor; removing the manifest_text() call would help clarify that to future readers.

Oops. Fixed.

There's still a lot of internal references to the old sync arguments. Some of it is noop cruft. Other instances could be replaced with, e.g., a simple boolean writable() method on Collection objects. I think cleaning all this up would help make this code easier to follow for future readers, since "sync mode" isn't a meaningful concept in the current implementation.

Refactored to use writable() instead.

The variables n, s, and f in this new implementation should be spelled out.

Done.

This is still an issue in _BlockManager.dup_block.

Done.

Added api_from_config() also renamed api.py to apisetup.py because the api method was shadowing the api module making the api module impossible to import from outside the arvados module.

I would love to clean this up, but I don't think we can without breaking backward compatibility. With current master you can say things like from arvados.api import api and it works fine. Renaming the module would break that code.

Is there any code that actually does from arvados.api import api to break? The other things in the arvados.api module seem to be mostly internal.

I've reverted it back to api.py .
Instead, I've added an optional arvapi= option to one_task_per_input_file() to provide a way to inject the mock into the right place.

Bugs

It would be good to add tests for these cases.

  • ArvadosFile.manifest_text() refers to a name stream which does not seem to be in scope. I believe this will raise NameError.

Fixed.

  • The code in ArvadosFileWriter.close() is a noop. It never calls super(), so the object doesn't actually go into a closed state.

Fixed.

  • find and find_or_create will return will return self if the path starts with /. I'm not 100% sure what the right thing to do is in this case (raise ArgumentError?), but I'm pretty sure returning self isn't it.

I think the behavior is different enough now that this comment no longer holds? Added test_find()

  • arv-normalize is trying to refer to KEEP_BLOCK_SIZE at its new location, but never imports config, so it crashes.

??? it doesn't do that for me

Other functional code issues

  • I don't think Collection.__exit__ should commit anything if the block raised an exception (exc_type is not None).

Fixed.

  • As Tim mentioned earlier, because it's a backward compatibility hack, I would prefer to see _NameAttribute used only in StreamFileReader. I don't think we should support a name method through any new objects with a file-like API.

Fixed.

  • BlockManager is not using KeepWriteError in line with its API. The error pairs are expected to have Keep service roots as the first argument, not block locators. I don't want to get in the way of you raising whatever information you need, but the mismatch here is awkward since the error has methods like service_errors that don't make sense with this data.

Well, It seems logical for commit_all() to raise a KeepWriteError since that's what we're trying to communicate, and KeepRequestError is already nicely designed to store a dictionary of keys mapped to exceptions. I propose tweaking the docstrings and/or adding a method alias to make it fit its dual role better. Thoughts?

  • SynchronizedCollectionBase.manifest_text() does a lot of string concatenation. Per PEP 8, this can be surprisingly expensive in Python. Prefer building a list of strings and joining it into one after you're done looping.

Improved a bit.

  • SynchronizedCollectionBase.iterkeys() returns a full list of keys, rather than an iterator over them, which is surprising.

Iteration is not threadsafe, so I just removed it entirely.

  • The __repr__ methods for range-related objects should should use %r instead %s in the format string. This gives you the argument's own repr, so you don't have to worry about quoting it, what happens if the string has quotes in it, etc.

Done.

  • Is it possible to tighten up the exceptions that the bufferblock workers catch?
  • Don't say just except: - at least BlockManager.block_prefetch() does this.

The background PUT workers want to catch and report whatever errors get thrown, and the prefetch GET workers don't care if an error is raised because they are on a non-critical path.

  • Don't leave debug prints lying around - at least BlockManager.commit_bufferblock() has one.

Fixed

Style issues

  • It would be nicer if CollectionReader.__init__ required manifest_locator_or_text in its signature (just by naming it as the first argument, before *args). This way users get the Python native exception if they fail to provide it, so you don't have to worry about generating that yourself, and other tools can introspect it correctly.

Done.

  • For the path traversal methods of Collections, I think you can save yourself a lot of manual work by using a max split argument: pathcomponents = path.split("/", 1). Then the first element is the thing to find here, and the second element (if any) is the rest of the path. You don't have to worry about del pathcomponents[0] or rejoining the array.

Done. Thanks.

  • Why did StreamReader methods get broken into public and private versions? It has no locking, so I don't understand what this indirection accomplishes.

I think that's a holdover from a much earlier version of the code. Fixed.

  • There's commented out code in range methods. Please remove it. Yeah, it was there before, but you might as well take this opportunity while you're moving it anyway.

Done.

  • There's a redundant hashlib import in arvfile.

Fixed.

Tests

  • My earlier comment about testing internal attributes and methods in note-26 is still open.

Could you be specific? Sometimes the best way to test a behavior is to directly induce a specific internal state of an object.

  • Prefer assertMatches(…) to assertTrue(re.match(…)). This generates a nicer diagnostic if the assertion fails.

Fixed.

  • Many of those re.match patterns have dots in filename extensions that should very strictly be escaped.

Fixed.

Fixed.

Documentation

PEP 257 says, "Multi-line docstrings consist of a summary line just like a one-line docstring, followed by a blank line, followed by a more elaborate description. The summary line may be used by automatic indexing tools; it is important that it fits on one line and is separated from the rest of the docstring by a blank line." There are many new docstrings that open with more than one line of text; for example, many in the new classes ArvadosFile, BufferBlock, and BlockManager.

So "one-line" in PEP 257 means literally one line, not one sentence. I think I cut all the summary lines down to one line, but typically that doesn't really leave enough room to provide summaries more sophisticated than "The GlobFrobber frobs globs".

  • The docstrings for Collection.save and .save_new don't match the signatures.

Fixed.

  • Please add KeepClient's new session argument to the constructor docstring.

Fixed.

  • The find_or_create docstring has a couple of typos referreng to arvado.collection (missing 's').

Fixed.

Restored the method signature.

Thanks.

Thank you!

#31 Updated by Brett Smith almost 4 years ago

Reviewing cb832096

Peter Amstutz wrote:

Brett Smith wrote:

I would love to clean this up, but I don't think we can without breaking backward compatibility. With current master you can say things like from arvados.api import api and it works fine. Renaming the module would break that code.

Is there any code that actually does from arvados.api import api to break?

It's hard to know without going through all the world's Crunch scripts. I'd be willing to jump if it was really worth the effort, but that said,

Instead, I've added an optional arvapi= option to one_task_per_input_file() to provide a way to inject the mock into the right place.

This seems like a good change anyway. We've been following this pattern in the SDK for a while, so following through here seems sensible.

  • arv-normalize is trying to refer to KEEP_BLOCK_SIZE at its new location, but never imports config, so it crashes.

??? it doesn't do that for me

Sorry, I summarized the issue too much. It's really _normalize_stream. From running the Workbench tests:

Traceback (most recent call last):
  File "/tmp/tmp.iInwaxz7G2/bin/arv-normalize", line 23, in <module>
    cr.normalize()
  File "/tmp/tmp.iInwaxz7G2/local/lib/python2.7/site-packages/arvados/collection.py", line 1477, in populate_streams_wrapper
    return orig_func(self, *args, **kwargs)
  File "/tmp/tmp.iInwaxz7G2/local/lib/python2.7/site-packages/arvados/collection.py", line 1502, in normalize
    for s in sorted(streams)]
  File "/tmp/tmp.iInwaxz7G2/local/lib/python2.7/site-packages/arvados/_normalize_stream.py", line 29, in normalize_stream
    stream_tokens.append(config.EMPTY_BLOCK_LOCATOR)
NameError: global name 'config' is not defined
  • BlockManager is not using KeepWriteError in line with its API. The error pairs are expected to have Keep service roots as the first argument, not block locators. I don't want to get in the way of you raising whatever information you need, but the mismatch here is awkward since the error has methods like service_errors that don't make sense with this data.

Well, It seems logical for commit_all() to raise a KeepWriteError since that's what we're trying to communicate, and KeepRequestError is already nicely designed to store a dictionary of keys mapped to exceptions. I propose tweaking the docstrings and/or adding a method alias to make it fit its dual role better. Thoughts?

I'm fine with this in general as long as we can tell the user what the keys are. Probably just accept that as an additional string argument to __init__. And then make the name service_errors throughout more generic. This API is new enough that I don't think we have to worry about anything using it outside the SDK. Heck, even the SDK never calls service_errors() directly outside of tests, it just relies on the nice stringification.

  • SynchronizedCollectionBase.iterkeys() returns a full list of keys, rather than an iterator over them, which is surprising.

Iteration is not threadsafe, so I just removed it entirely.

That's fine by me since this method seems a little extraneous, but for what it's worth, this seems like a bit of an overreaction? Yes, iterators are not threadsafe, but you get a new iterator every time you call the method and it's rare to share them across call boundaries. Plus, you already have and use __iter__, which does what iterkeys would do too.

If it helps you feel any better, Python notices and raises an exception if you munge a dictionary while you're iterating over it:

>>> d = {'a': 1, 'b': 2}
>>> for k in d:
...     d[k.upper()] = d[k]
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration
  • Why did StreamReader methods get broken into public and private versions? It has no locking, so I don't understand what this indirection accomplishes.

I think that's a holdover from a much earlier version of the code. Fixed.

Still an issue for size/_size.

Tests

  • My earlier comment about testing internal attributes and methods in note-26 is still open.

Could you be specific? Sometimes the best way to test a behavior is to directly induce a specific internal state of an object.

Accessing private attributes in your tests poses a couple of problems. The biggest one is that you get away from testing the behavior the user would see. If you call a private method to mutate the object during the test (and any attribute in Python could potentially be a method, if it's a property), then you lose assurance that users will see the desired behavior when they use the public methods, which is one of the primary goals of the tests.

The second is that future developers shouldn't have to worry about maintaining the API compatibility of private methods. Those should be there just to make implementation of the public methods easier, so their signatures, return values, and side effects can all change at whim. If tests call private methods, then developers may have to change the tests to accommodate interface changes to those methods, which erodes assurance that the public methods are maintaining their APIs.

In short: test behavior, not state. What would you do if this were Ruby, and the private-ness of these attributes were enforced? Do that.

Thanks.

#32 Updated by Brett Smith almost 4 years ago

Peter Amstutz wrote:

Brett Smith wrote:

  • My earlier comment about testing internal attributes and methods in note-26 is still open.

Could you be specific? Sometimes the best way to test a behavior is to directly induce a specific internal state of an object.

To be actually specific:

  • Many tests that peek at c._manifest_locator.
  • test_rewrite_* call writer.arvadosfile._repack_writes.
  • ArvadosFileReadTestCase.reader_for reaches into col._my_block_manager—couldn't you just construct it with the desired attributes and pass it in?

#33 Updated by Peter Amstutz almost 4 years ago

Brett Smith wrote:

Sorry, I summarized the issue too much. It's really _normalize_stream. From running the Workbench tests:

[...]

Fixed. Also added a test.

I'm fine with this in general as long as we can tell the user what the keys are. Probably just accept that as an additional string argument to __init__. And then make the name service_errors throughout more generic. This API is new enough that I don't think we have to worry about anything using it outside the SDK. Heck, even the SDK never calls service_errors() directly outside of tests, it just relies on the nice stringification.

Done. Now represents a dict of request_errors instead of service_errors. Added a "label" parameter.

I think that's a holdover from a much earlier version of the code. Fixed.

Still an issue for size/_size.

Fixed.

Tests

  • Many tests that peek at c._manifest_locator.
  • test_rewrite_* call writer.arvadosfile._repack_writes.
  • ArvadosFileReadTestCase.reader_for reaches into col._my_block_manager—couldn't you just construct it with the desired attributes and pass it in?

Fixed. Added ArvadosFile.flush().

Now at 63b03a3

#34 Updated by Brett Smith almost 4 years ago

Reviewing 63b03a3. We're coming down the home stretch. I expect this to be my last top-to-bottom review of the diff.

Follow-ups

  • Now that all the sync stuff is gone, I'm realizing SynchronizedCollectionBase is no longer the most applicable name either. This is not a blocker for merge, but is there a better one that might better describe the purpose? RichCollectionBase, or something like that? (I don't want to call it CollectionBaseV2 or anything like that, on the expectation we'll have a real v2 SDK at some point.)
  • Let's name the new argument of one_task_per_input_file api_client, for consistency with the rest of the SDK.

Code issues

  • FUSE's test_mount.py is missing an import:
    ======================================================================
    ERROR: runTest (tests.test_mount.FuseHomeTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/home/brett/repos/arvados/services/fuse/tests/test_mount.py", line 24, in setUp
        self.api = arvados.safeapi.ThreadSafeApiCache(arvados.config.settings())
    AttributeError: 'module' object has no attribute 'safeapi'
    
  • KeepLocator.stripped() needs to just return self.md5sum if self.size is None. We've recently fixed related bugs in other components; see, e.g., d302307a4a66867419722034228823d1fc3910a6.
  • In the Python file API, it's legal to close a file multiple times. Instead of decorating ArvadosFileWriter.close with _before_close, suggest wrapping the main body under if not self.closed:.
  • In the Python file API, truncating past the current size normally pads the file with zeroes. Was there a conscious decision to not support this in ArvadosFile.truncate()? I feel pretty OK with that decision if so (and I like indicating that with IOError), just wanted to double-check.

Documentation

  • In the KeepRequestError __init__ docstring, please update the method signature and document the label argument.
  • The api_from_config docstring references a cache argument that doesn't exist.
  • In the docstring for Collection.manifest_locator(), please explain when the return value is None.
  • The first paragraph of Collection.save()'s docstring refers to an update argument—it should say merge.
  • The docstring for locators_and_ranges() starts with multiple lines of text.
  • In a comment in ranges.py, there's a typo "contigious" (should be "contiguous").

Code cleanup

  • Prefer is not None in SynchronizedCollectionBase.exists(). (I appreciate that you've generally taken this suggestion to heart, and just missed this spot.)
  • ArvadosFile.manifest_text() defines item = self for no good reason. It never reassigns item. Just say self.
  • The same issue exists in SynchronizedCollectionBase.manifest_text().
  • The long() cast cleanup has ended up creating a number of noop lines in _ranges.py. It includes the line range_start = range_start, and again for range_size, new_range_start, and new_range_size.
  • In the tests, MockParent defines sync_mode().
  • This is definitely not required, but you would make me unreasonably happy if this branch followed the PEP 8 guidelines for separating definition blocks (one blank line after a function definition, two after a class).

Thanks.

#35 Updated by Peter Amstutz almost 4 years ago

Brett Smith wrote:

Reviewing 63b03a3. We're coming down the home stretch. I expect this to be my last top-to-bottom review of the diff.

Follow-ups

  • Now that all the sync stuff is gone, I'm realizing SynchronizedCollectionBase is no longer the most applicable name either. This is not a blocker for merge, but is there a better one that might better describe the purpose? RichCollectionBase, or something like that? (I don't want to call it CollectionBaseV2 or anything like that, on the expectation we'll have a real v2 SDK at some point.)

"Synchronized" actually referred to the fact that the methods has @synchronized decorators, but RichCollectionBase is fine too.

  • Let's name the new argument of one_task_per_input_file api_client, for consistency with the rest of the SDK.

Done

Code issues

  • FUSE's test_mount.py is missing an import:
    [...]

Fixed, but run-tests.sh is having trouble importing arvados.safeapi. ?!

Fixed.

  • In the Python file API, it's legal to close a file multiple times. Instead of decorating ArvadosFileWriter.close with _before_close, suggest wrapping the main body under if not self.closed:.

Fixed.

  • In the Python file API, truncating past the current size normally pads the file with zeroes. Was there a conscious decision to not support this in ArvadosFile.truncate()? I feel pretty OK with that decision if so (and I like indicating that with IOError), just wanted to double-check.

Yes, that's a conscious decision.

In the documentation for file.truncate():

"Note that if a specified size exceeds the file’s current size, the result is platform-dependent: possibilities include that the file may remain unchanged, increase to the specified size as if zero-filled, or increase to the specified size with undefined new content."

That doesn't preclude "raise an exception" as a platform-dependent behavior :-)

Documentation

  • In the KeepRequestError __init__ docstring, please update the method signature and document the label argument.\\\

Fixed.

  • The api_from_config docstring references a cache argument that doesn't exist.

Fixed.

  • In the docstring for Collection.manifest_locator(), please explain when the return value is None.

Fixed.

  • The first paragraph of Collection.save()'s docstring refers to an update argument—it should say merge.

Fixed.

  • The docstring for locators_and_ranges() starts with multiple lines of text.

Fixed.

  • In a comment in ranges.py, there's a typo "contigious" (should be "contiguous").

Fixed.

Code cleanup

  • Prefer is not None in SynchronizedCollectionBase.exists(). (I appreciate that you've generally taken this suggestion to heart, and just missed this spot.)

Fixed.

  • ArvadosFile.manifest_text() defines item = self for no good reason. It never reassigns item. Just say self.
  • The same issue exists in SynchronizedCollectionBase.manifest_text().

These are methods that were originally defined standalone and then pulled into the class; that was a quick trick to ensure that the methods continued to do the same thing without worrying about introducing typos. Fixed now.

  • The long() cast cleanup has ended up creating a number of noop lines in _ranges.py. It includes the line range_start = range_start, and again for range_size, new_range_start, and new_range_size.

Duh! fixed.

  • In the tests, MockParent defines sync_mode().

Removed.

  • This is definitely not required, but you would make me unreasonably happy if this branch followed the PEP 8 guidelines for separating definition blocks (one blank line after a function definition, two after a class).

Well, unreasonably happy reviewers are hopefully better reviewers, so I went through and tweaked some spacing, let me know if there's anything else.

#36 Updated by Brett Smith almost 4 years ago

Reviewing c691e44.

Peter Amstutz wrote:

Brett Smith wrote:

Code issues

  • FUSE's test_mount.py is missing an import:
    [...]

Fixed, but run-tests.sh is having trouble importing arvados.safeapi. ?!

Hmm, funny enough, I can't reproduce. The tests run fine for me, whether run as python setup.py test (after installing the latest SDK—you'll need to bump that version dependency after you merge) or via run-tests.sh. Are you calling run-tests with extra switches?

Documentation

  • In the KeepRequestError __init__ docstring, please update the method signature and document the label argument.

Fixed.

I see the signature update. Any chance of adding label to the arguments list below?

  • In a comment in ranges.py, there's a typo "contigious" (should be "contiguous").

Fixed.

Oh, that's an even better typo fix! But I was referring to this comment: # assumes that all of the blocks are contigious, so range_start is guaranteed

  • ArvadosFile.manifest_text() defines item = self for no good reason. It never reassigns item. Just say self.
  • The same issue exists in SynchronizedCollectionBase.manifest_text().

These are methods that were originally defined standalone and then pulled into the class; that was a quick trick to ensure that the methods continued to do the same thing without worrying about introducing typos. Fixed now.

Yeah, I figured that was the basic evolution. That's cool, but this seemed like the perfect opportunity to clean them up, when all the history is swapped in and we know it's no longer necessary. Thanks for that.

I'm comfortable merging after the KeepRequestError.__init__ documentation update, if you're OK with the test situation. Thanks.

#37 Updated by Peter Amstutz almost 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:bb19e060336c4e734b3e1922c5be3c4b40ff7da8.

#38 Updated by Tom Clegg almost 4 years ago

  • Description updated (diff)
  • Status changed from Resolved to In Progress
  • Assigned To deleted (Peter Amstutz)
  • Target version changed from 2015-03-11 sprint to Arvados Future Sprints

Reopening because some APIs aren't implemented yet, e.g., portable_manifest_text(), walk(), listdir().

#39 Updated by Tom Morris over 2 years ago

Are these additional methods needed? If so, can we enumerate the important ones?

Also available in: Atom PDF