Feature #4823

Updated by Tom Clegg over 4 years ago

h3. Goals

More enjoyable API for Python programmers to use. Something like:
* <pre><code class="python">
c=Collection(...)
with c.open('foo.txt', 'w') as f:
f.write('foo')
with c.open('foo.txt', 'a') as f:
f.write('bar')
c.rename('foo.txt', 'baz/foobar.txt')
with c.open('baz/foobar.txt', 'r') as f:
print f.read()
c.save()
</code></pre>

Serialize/unserialize (manifest) code all in once place.
* Abstract away the "manifest" encoding as much as possible to pave the way for upgrading/replacing it (say, with a richer JSON format).
* Only one version of tokenizing/regexp parsing, string concatenation, making sure zero-length streams have a zero-length block locator, stuff like that.

In-memory data structure suitable for mutable collections.
* Accommodate use of "data buffer" blocks for data not yet written to Keep.
* Simplify file operations by using a distinct piece of memory for each file. (Modifying a stream in order to modify a file, without disrupting other files in the stream, is painful!)
* See #4837

h3. Collection interface

@Collection()@
* Create a new empty collection.

@Collection(uuid)@
* Retrieve the given collection from the API server.

@Collection(manifest_text)@
* Create a new collection with the given content.

@modified()@ @dirty()@
* Return _True_ if the collection has been modified since it was last retrieved or saved to the API server, otherwise _False_.

> (TC) I'm not 100% sold on the term "dirty". It vaguely implies there's an automatic write caching system at work, and it's not clear whether it's meant to cover the "remote copy has changed, ours hasn't" case. Perhaps @saved()@ would be more direct?

@manifest_text()@
* Return the "manifest" string representation of this collection. This implicitly commits all buffered data to disk.

@portable_manifest_text()@
* Return the "portable manifest" string representation of this collection used to compute portable_data_hash -- i.e., the manifest with the non-portable parts (like Keep permission signatures) removed. This can always be done without flushing any data to disk.

@portable_data_hash()@
* Return the portable_data_hash that would be accepted/assigned by the API server if the collection were <code>save()</code>d right now. This implicitly writes all buffered data to disk, but does not update the collection record on the API server.

> (TC) Alternate semantics: Return the pdh assigned/accepted by the server. Raise an exception if not @saved()@. But it would be weird to require @save()@ in order to get @manifest_text()@, and weird if you could get @manifest_text()@ but not @portable_data_hash()@ when not @saved()@.

@listdir(path)@
* Return a list containing the names of the entries in the subcollection given by _path_.

@walk(path, topdown=True, onerror=None)@
* (As close as possible to @os.walk()@.) Generate the file names in a directory tree. For each subcollection (below and including _path_, where '.' is the whole collection) yield a 3-tuple @(dirpath, dirnames, filenames)@.

@remove(path)@
* Remove the file or subcollection named _path_.

@unlink(path)@
* Alias for @remove@.

@rename(old, new)@
* Rename a file from _old_ to _new_.

@rename(old, new, dest_collection)@
* Move a file _old_ (in this collection) to _new_ (in a different collection).

> (TC) Assuming this doesn't atomically commit/save the two collections to the API server, which is currently impossible, atomicity affects only the current process. Perhaps it's OK to just offer copy+delete -- just like POSIX, which doesn't offer an atomic move (or even copy) across filesystems?

@copy(old, new)@
* Create a new file _new_ with the same content _old_ has right now.

@copy(old, new, dest_collection)@
* Create a new file _new_ in a different collection, with the same content _old_ has right now.
* Alternate suggestion 1: @copy(old, new)@ copies across collections if _old_ is a file-like object obtained from a (different) Collection's @open()@ method.
* Alternate suggestion 2: @Collection.copy(old, new)@ (a class method) copies across collections: _old_ and _new_ are both file-like objects obtained from @open()@ methods on collections. This allows efficient "concatenate" operations too, not just entire-file copying, and could be extended in the future to support byte ranges from the source files as well.
* Alternate suggestion 3: could we put magic in place so this works without moving any data around?
** <pre><code class="python">
dest_collection.open(new_name, 'w').write(src_collection.open(old_name, 'r').read())
</code></pre>

> (TC) One thing that makes me uncomfortable about the @(old,new,dest)@ signature is that it's not obvious, looking at @c1.copy('foo', 'bar', c2)@, whether we're copying c1&rarr;c2 or c2&rarr;c1.

@open(filename, mode)@
* Semantics as close as practicable to open(). Return an object with (some subset of) the Python "file" interface.

@glob(globpattern)@
* Returns an iterator that yields successive files that match _globpattern_ from the collection.

> (TC) I'd suggest the @glob@ feature should be implemented by the caller. That lets the caller decide, and be explicit about, whether to use regexps, globs, etc. In Python it's pretty easy, and idiomatic, to do stuff like @[f for f in fnmatch.filter(c.listdir(path), '*.o')]@ -- that pattern can be extended to @walk()@ as well, all with well-defined and unsurprising semantics.

Back