Feature #4823

Updated by Tom Clegg over 4 years ago

h3. Goals

More enjoyable API for Python programmers to use. Something like:
* <pre><code class="python">
c=Collection(...)
with c.open('foo.txt', 'w') as f:
f.write('foo')
with c.open('foo.txt', 'a') as f:
f.write('bar')
c.rename('foo.txt', 'baz/foobar.txt')
with c.open('baz/foobar.txt', 'r') as f:
print f.read()
c.save()
</code></pre>

Serialize/unserialize (manifest) code all in once place.
* Abstract away the "manifest" encoding as much as possible to pave the way for upgrading/replacing it (say, with a richer JSON format).
* Only one version of tokenizing/regexp parsing, string concatenation, making sure zero-length streams have a zero-length block locator, stuff like that.

In-memory data structure suitable for mutable collections.
* Accommodate use of "data buffer" blocks for data not yet written to Keep.
* Simplify file operations by using a distinct piece of memory for each file. (Modifying a stream in order to modify a file, without disrupting other files in the stream, is painful!)
* See #4837

h3. Collection interface

@Collection()@
* Create a new empty collection.

@Collection(uuid)@
* Retrieve the given collection from the API server.

@Collection(manifest_text)@
* Create a new collection with the given content.

@modified()@
* Return _True_ if the collection has been modified since it was last retrieved or saved to the API server, otherwise _False_.

@manifest_text()@
* Return the "manifest" string representation of this collection. This implicitly commits all buffered data to disk.

@portable_manifest_text()@
* Return the "portable manifest" string representation of this collection used to compute portable_data_hash -- i.e., the manifest with the non-portable parts (like Keep permission signatures) removed. This can always be done without flushing any data to disk.

@portable_data_hash()@
* Return the portable_data_hash that would be accepted/assigned by the API server if the collection were <code>save()</code>d right now.

@listdir(path)@
* Return a list containing the names of the entries in the subcollection given by _path_.

@walk(path, topdown=True, onerror=None)@
* (As close as possible to @os.walk()@.) Generate the file names in a directory tree. For each subcollection (below and including _path_, where '.' is the whole collection) yield a 3-tuple @(dirpath, dirnames, filenames)@.

@remove(path)@
* Remove the file or subcollection named _path_.

@unlink(path)@
* Alias for @remove@.

@rename(old, new)@
* Rename a file from _old_ to _new_.

-@rename(old, @rename(old, new, dest_collection)@- dest_collection)@
* -Move Move a file _old_ (in this collection) to _new_ (in a different collection).-
* Rejected. Use copy + remove instead.
collection).

> (TC) Assuming this doesn't atomically commit/save the two collections to the API server, which is currently impossible, atomicity affects only the current process. Perhaps it's OK to just offer copy+delete -- just like POSIX, which doesn't offer an atomic move (or even copy) across filesystems?

@copy(old, new)@
* Create a new file _new_ with the same content _old_ has right now.

@copy(source_path, dest_path, source_collection=None, overwrite=False)@ @copy(old, new, dest_collection)@
* Create a new file _dest_path_, _new_ in a different collection, with the same content the file at _source_path_ _old_ has right now. If _source_collection_
* Alternate suggestion 1: @copy(old, new)@ copies across collections if _old_
is a file-like object obtained from a (different) Collection's @open()@ method.
* Alternate suggestion 2: @Collection.copy(old, new)@ (a class method) copies across collections: _old_ and _new_ are both file-like objects obtained from @open()@ methods on collections. This allows efficient "concatenate" operations too,
not @None@, _source_path_ refers just entire-file copying, and could be extended in the future to a file support byte ranges from the source files as well.
* Alternate suggestion 3: could we put magic
in source_collection, which might place so this works without moving any data around?
** <pre><code class="python">
dest_collection.open(new_name, 'w').write(src_collection.open(old_name, 'r').read())
</code></pre>

> (TC) One thing that makes me uncomfortable about the @(old,new,dest)@ signature is that it's
not be @self@. If _dest_path_ already exists and _overwrite_ is False, raise an exception. obvious, looking at @c1.copy('foo', 'bar', c2)@, whether we're copying c1&rarr;c2 or c2&rarr;c1.

@open(filename, mode)@
* Semantics as close as practicable to open(). Return an object with (some subset of) the Python "file" interface.

-@glob(globpattern)@- @glob(globpattern)@
* -Returns Returns an iterator that yields successive files that match _globpattern_ from the collection.-
* Rejected. Use
collection.

> (TC) I'd suggest the @glob@ feature should be implemented by the caller. That lets the caller decide, and be explicit about, whether to use regexps, globs, etc. In Python it's pretty easy, and idiomatic, to do stuff like
@[f for f in fnmatch.filter(c.listdir(path), '*.o')]@ instead. -- that pattern can be extended to @walk()@ as well, all with well-defined and unsurprising semantics.

Back