Feature #4823

Updated by Tom Clegg over 4 years ago

h3. Goals

For more consistent and uniform manifest handling, we should provide an SDK that applications can use to build and manipulate manifest files. This proposal is to implement the methods described below, either directly on a Collection or on a Manifest class, as seems appropriate.

More enjoyable API for Python programmers to use. Something like:
* <pre><code class="python">
with c.open('foo.txt', 'w') as f:
    f.write('foo')  # illustrative content
with c.open('foo.txt', 'a') as f:
    f.write('bar')  # illustrative content
c.rename('foo.txt', 'baz/foobar.txt')
with c.open('baz/foobar.txt', 'r') as f:
    print(f.read())
</code></pre>

Serialize/unserialize (manifest) code all in one place.
* Abstract away the "manifest" encoding as much as possible, to pave the way for upgrading/replacing it (say, with a richer JSON format).
* Only one version of tokenizing/regexp parsing, string concatenation, making sure zero-length streams have a zero-length block locator, stuff like that.
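
As a rough illustration of what keeping the tokenizing and serializing code in one place could look like, here is a minimal sketch. It assumes the plain-text manifest layout (one stream per line: stream name, block locators, then @position:size:filename@ tokens). The names @parse_manifest@, @serialize_manifest@ and @EMPTY_BLOCK@ are hypothetical, filename escaping is ignored, and this is not the SDK's implementation.

```python
import re

# Assumption: the zero-length block locator is the MD5 of the empty
# string followed by a "+0" size hint.
EMPTY_BLOCK = 'd41d8cd98f00b204e9800998ecf8427e+0'

LOCATOR_RE = re.compile(r'^[0-9a-f]{32}\+\d+(\+\S+)*$')
FILE_RE = re.compile(r'^(\d+):(\d+):(\S+)$')


def parse_manifest(text):
    """Return a list of (stream_name, locators, files) tuples.

    files is a list of (position, size, filename) tuples.
    """
    streams = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        name, locators, files = tokens[0], [], []
        for tok in tokens[1:]:
            m = FILE_RE.match(tok)
            if m:
                files.append((int(m.group(1)), int(m.group(2)), m.group(3)))
            elif LOCATOR_RE.match(tok):
                locators.append(tok)
            else:
                raise ValueError('unrecognized token: %r' % tok)
        streams.append((name, locators, files))
    return streams


def serialize_manifest(streams):
    """Inverse of parse_manifest; one line per stream."""
    lines = []
    for name, locators, files in streams:
        # A stream with no data still needs one (zero-length) locator.
        if not locators:
            locators = [EMPTY_BLOCK]
        file_tokens = ['%d:%d:%s' % f for f in files]
        lines.append(' '.join([name] + list(locators) + file_tokens))
    return ''.join(line + '\n' for line in lines)
```

With both directions in one module, the zero-length-stream rule lives in exactly one function instead of being repeated by every caller.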

In-memory data structure suitable for mutable collections.
* Accommodate use of "data buffer" blocks for data not yet written to Keep.
* Simplify file operations by using a distinct piece of memory for each file. (Modifying a stream in order to modify a file, without disrupting other files in the stream, is painful!)
* See #4837

h3. Collection interface

Creating and loading collections:
* Create a new empty collection.
* Retrieve the given collection from the API server.
* Create a new collection with the given content.

Manifest-level methods:
* @rename(old, new, dest_manifest=None)@
** Rename a file in the manifest from _old_ to _new_, updating all necessary lines in the manifest. If _dest_manifest_ is supplied, this atomically performs @copy@ followed by @delete@.
* @copy(old, new, dest_manifest=None)@
** Add a new manifest entry for file _new_, duplicating the block list for existing file _old_. If _dest_manifest_ is supplied, the new entries should be added to that manifest instead.
* @delete(filename)@
** Remove all entries for _filename_.
* @create(filename, stream=".", blocks=[])@
** Create a new manifest entry for _filename_. Use the specified stream and blocklist.

@dirty()@
* Return _True_ if the collection has been modified since it was last retrieved or saved to the API server, otherwise _False_.

> (TC) I'm not 100% sold on the term "dirty". It vaguely implies there's an automatic write caching system at work, and it's not clear whether it's meant to cover the "remote copy has changed, ours hasn't" case. Perhaps @saved()@ would be more direct?

@manifest_text()@
* Return the "manifest" string representation of this collection. This implicitly commits all buffered data to disk.

@portable_data_hash()@
* Return the portable_data_hash that would be accepted/assigned by the API server if the collection were <code>save()</code>d right now. This implicitly writes all buffered data to disk, but does not update the collection record on the API server.
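
For orientation, a minimal sketch of how such a hash could be computed locally without touching the collection record. It assumes the portable data hash is the MD5 hex digest of the manifest text followed by @+@ and the text's length in bytes. That is an assumption of this sketch, and it ignores any stripping of non-portable locator hints the server may perform. The function name is hypothetical.

```python
import hashlib


def portable_data_hash(manifest_text):
    # Assumption: hash is md5(manifest text) + "+" + byte length.
    data = manifest_text.encode('utf-8')
    return '%s+%d' % (hashlib.md5(data).hexdigest(), len(data))
```

Under that assumption the value depends only on the manifest text, which is why it can be returned before @save()@ is called.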

> (TC) Alternate semantics: Return the pdh assigned/accepted by the server. Raise an exception if not @saved()@. But it would be weird to require @save()@ in order to get @manifest_text()@, and weird if you could get @manifest_text()@ but not @portable_data_hash()@ when not @saved()@.

If these methods are implemented in a new Manifest class:
* The constructor takes a string (default "") with which it populates a new Manifest object.
* A serialization method returns a string containing the serialized text representing this manifest.

@listdir(path)@
* Return a list containing the names of the entries in the subcollection given by _path_.

@walk(path, topdown=True, onerror=None)@
* (As close as possible to @os.walk()@.) Generate the file names in a directory tree. For each subcollection (below and including _path_, where '.' is the whole collection) yield a 3-tuple @(dirpath, dirnames, filenames)@.

@remove(path)@
* Remove the file or subcollection named _path_.
* An alias for @remove@ is also offered.

@rename(old, new)@
* Rename a file from _old_ to _new_.

@rename(old, new, dest_collection)@
* Move a file _old_ (in this collection) to _new_ (in a different collection).

> (TC) Assuming this doesn't atomically commit/save the two collections to the API server, which is currently impossible, atomicity affects only the current process. Perhaps it's OK to just offer copy+delete -- just like POSIX, which doesn't offer an atomic move (or even copy) across filesystems?

@copy(old, new)@
* Create a new file _new_ with the same content _old_ has right now.

@copy(old, new, dest_collection)@
* Create a new file _new_ in a different collection, with the same content _old_ has right now.
* Alternate suggestion 1: @copy(old, new)@ copies across collections if _old_ is a file-like object obtained from a (different) Collection's @open()@ method.
* Alternate suggestion 2: @Collection.copy(old, new)@ (a class method) copies across collections: _old_ and _new_ are both file-like objects obtained from @open()@ methods on collections.
* Alternate suggestion 3: could we put magic in place so this works without moving any data around?
** <pre><code class="python">
dest_collection.open(new_name, 'w').write(src_collection.open(old_name, 'r').read())
</code></pre>

> (TC) One thing that makes me uncomfortable about the @(old,new,dest)@ signature is that it's not obvious, looking at @c1.copy('foo', 'bar', c2)@, whether we're copying c1&rarr;c2 or c2&rarr;c1.

@open(filename, mode)@
* Semantics as close as practicable to @open()@. Return an object with (some subset of) the Python "file" interface.

Additional methods that are likely to prove useful include:
* A directory listing method that returns an iterator yielding successive file basenames found in directory _dirname_ inside the manifest.
* @glob(globpattern)@
** Returns an iterator that yields successive files that match _globpattern_ from the collection.

> (TC) I'd suggest the @glob@ feature should be implemented by the caller. That lets the caller decide, and be explicit about, whether to use regexps, globs, etc. In Python it's pretty easy, and idiomatic, to do stuff like @[f for f in fnmatch.filter(c.listdir(path), '*.o')]@ -- that pattern can be extended to @walk()@ as well, all with well-defined and unsurprising semantics.
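
To make the caller-side approach concrete, here is a small sketch. @ToyCollection@ is a hypothetical stand-in (just a list of file paths) that offers @listdir()@ and a simplified @walk()@ in the spirit of the interface above; only the @fnmatch@ and @posixpath@ calls are real standard-library APIs.

```python
import fnmatch
import posixpath


class ToyCollection(object):
    """Hypothetical stand-in: a set of file paths relative to '.'."""

    def __init__(self, paths):
        self._paths = sorted(paths)

    def _children(self, path):
        # Split the entries directly under `path` into subdirs and files.
        prefix = '' if path == '.' else path + '/'
        dirs, files = set(), set()
        for p in self._paths:
            if not p.startswith(prefix):
                continue
            head, sep, _ = p[len(prefix):].partition('/')
            (dirs if sep else files).add(head)
        return sorted(dirs), sorted(files)

    def listdir(self, path='.'):
        dirs, files = self._children(path)
        return sorted(dirs + files)

    def walk(self, path='.'):
        # Top-down only; yields (dirpath, dirnames, filenames) like os.walk().
        dirs, files = self._children(path)
        yield path, dirs, files
        for d in dirs:
            sub = d if path == '.' else posixpath.join(path, d)
            for item in self.walk(sub):
                yield item


c = ToyCollection(['main.o', 'main.c', 'lib/util.o', 'lib/util.c'])

# Glob within one directory: the caller picks the matching rule.
objects_here = fnmatch.filter(c.listdir('.'), '*.o')

# The same pattern extended to walk().
all_objects = [posixpath.join(dirpath, name)
               for dirpath, _, filenames in c.walk('.')
               for name in fnmatch.filter(filenames, '*.o')]
```

Swapping @fnmatch.filter@ for a regexp test changes the matching rule without any change to the collection class, which is the point of leaving globbing to the caller.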