Story #3036

Updated by Tom Clegg over 7 years ago

h2. Note: This is a draft -- not yet ready to implement.

h3. Summary

Collections should have system-assigned UUIDs that look like other Arvados UUIDs; should be mutable; and should have a "name" attribute.

h4. Background (current behavior)

Collections are content-addressed: they have @uuid = hash(manifest_text)@ and they are immutable. This makes them behave differently than all other objects in Arvados, which tends to be confusing and awkward.
* Feature: This makes it possible to do a bitwise comparison of collections (e.g., job outputs) without even looking up the collection itself.
* Feature: This (partially) de-duplicates manifest storage: a given manifest is only stored once in the collections table. But this does not survive tiny changes like renaming a file within a collection.
* Feature: Collection metadata (manifests) cannot be deleted by users. Even if the content data has been deleted, a superuser can still see the filenames and sizes for every collection ever made. (Not clear whether this feature is valuable, though.)
* Drawback: Applications must create Link objects (link_class="name") in order to attach names to collections (analogous to file/directory names in regular filesystems). This API is unwieldy.
* Drawback: The default permission model -- i.e., the creator of an object has permission to read/edit/delete it -- cannot be achieved using the owner_uuid attribute. In order to provide a predictable outcome for "create a collection" regardless of whether another user has already created an identical collection, we are forced to give all collections owner_uuid=root and create a "permission" link for each collection creation. We also have to synchronize "name" links with "permission" links in order to achieve reasonable behavior for users.
* Drawback: The timestamps of collections are confusing. If I create a new collection when (unbeknownst to me) another user has already created one with identical content some days ago, the "created" and "updated" timestamps and similar metadata will be surprising and generally useless (except perhaps as an undesirable information leak).
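
The content-addressing behavior described above can be illustrated with a minimal sketch. The story only says @uuid = hash(manifest_text)@; the use of MD5, the helper name @content_address@, and the sample manifest lines are assumptions made for illustration and are not taken from the Arvados code.

```python
import hashlib


def content_address(manifest_text: str) -> str:
    """Hypothetical content address: hex digest of the manifest text.

    MD5 is an assumption here; the story only specifies hash(manifest_text).
    """
    return hashlib.md5(manifest_text.encode("utf-8")).hexdigest()


# Illustrative manifest lines (format assumed for the example only).
manifest_a = ". d41d8cd98f00b204e9800998ecf8427e+0 0:0:empty.txt\n"
manifest_b = ". d41d8cd98f00b204e9800998ecf8427e+0 0:0:empty.txt\n"
manifest_renamed = ". d41d8cd98f00b204e9800998ecf8427e+0 0:0:renamed.txt\n"

# Identical manifests created by different users map to the same identifier,
# so content can be compared without fetching either collection.
same_content = content_address(manifest_a) == content_address(manifest_b)

# Renaming a single file changes the identifier, so the de-duplication
# does not survive even tiny changes.
rename_changes_id = content_address(manifest_a) != content_address(manifest_renamed)
```

Because the identifier depends only on content, two users who save identical manifests end up sharing one record, which is the root of the ownership and timestamp drawbacks listed above.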

h4. New behavior

* Collections are mutable, and have a name attribute.
* Look up the hash of a collection's manifest when you want to do a bitwise comparison of content.

(Certainly incomplete) list of changes/consequences:
* First step: allow clients to call collections.create without providing a uuid.
* Update uuid→class regexps to accept collection UUIDs in the usual Arvados UUID format as well as portable_data_hashes.
* Copy current uuid values to portable_data_hash
* If clients provide portable_data_hash to collections.create, verify it the same way uuid is verified now (i.e., compare it to the portable_data_hash computed from the provided (stripped) manifest, and respond 422 if it doesn't match). Skip this check if the client provides no portable_data_hash.
* Fix clients so they pass the expected portable_data_hash instead of uuid (or pass neither) and use the uuid provided by Arvados, rather than assuming the new collection's uuid will be a content address.
* Add usual mutable fields like "name", "description", and "properties" to the collections table.
* Remove "all collections are owned by root" logic.
* Remove "add a permission link for me after creating a collection" logic.
* Update Workbench to use collections' "name" attributes instead of name links.
* Migrate existing name links in the database to become new collections.
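
The proposed @collections.create@ behavior in the list above might look roughly like the following sketch. Everything here is hypothetical: the function and exception names, the MD5-based @compute_portable_data_hash@, and the placeholder UUID generator (which does not reproduce the real Arvados UUID format) are assumptions, and the manifest is assumed to be already stripped.

```python
import hashlib
import uuid as uuidlib


class UnprocessableEntity(Exception):
    """Hypothetical error type standing in for an HTTP 422 response."""
    status = 422


def compute_portable_data_hash(stripped_manifest: str) -> str:
    # Assumption: MD5 hex digest; the story does not specify the hash.
    return hashlib.md5(stripped_manifest.encode("utf-8")).hexdigest()


def create_collection(manifest_text, portable_data_hash=None,
                      name=None, owner_uuid=None):
    """Create a collection record with a system-assigned uuid."""
    computed = compute_portable_data_hash(manifest_text)
    # Verify the client-supplied hash only when one was provided.
    if portable_data_hash is not None and portable_data_hash != computed:
        raise UnprocessableEntity("portable_data_hash does not match manifest")
    return {
        # Placeholder identifier, not the real Arvados UUID format.
        "uuid": uuidlib.uuid4().hex,
        "portable_data_hash": computed,
        "name": name,
        # The creator owns the record; no root ownership or permission link.
        "owner_uuid": owner_uuid,
        "manifest_text": manifest_text,
    }
```

Under these assumptions, two users creating identical content get distinct records with distinct @uuid@ values and the same @portable_data_hash@, so a bitwise comparison still only needs the hash.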

h4. Unresolved:

* Existing Workbench links -- and repeating old jobs -- with old collection UUIDs should still work.
** Look up by @portable_data_hash@ if collections.get called with old format collection UUID? (Redact the mutable fields?)
* Collection content still immutable? Assign @uuid=hash(random+manifest)@ and keep @random@ in the collection record, so integrity can still be verified by a client that remembers only the UUID?
** Alternative: record both @uuid@ and @portable_data_hash@ whenever referencing collections in job inputs, etc.
** Alternative: assign @uuid=random+hash(random+manifest)@ (or @uuid=random+HMAC(random,manifest)@) so clients can verify integrity of a retrieved collection without knowing anything more than the UUID and the UUID format.
