Idea #3036

closed

[API] Use regular uuids instead of content hashes to identify collections

Added by Tom Clegg almost 10 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Target version:
Start date: 08/07/2014
Due date:
Story points: 5.0

Description

Summary

Collections should have system-assigned UUIDs that look like other Arvados UUIDs; should be mutable; and should have a "name" attribute.

Background (current behavior)

Collections are content-addressed: they have uuid = hash(manifest_text) and they are immutable. This makes them behave differently from every other object in Arvados, which tends to be confusing and awkward.
  • Feature: This makes it possible to do a bitwise comparison of collections (e.g., job outputs) without even looking up the collection itself.
  • Feature: This (partially) de-duplicates manifest storage: a given manifest is only stored once in the collections table. But this does not survive tiny changes like renaming a file within a collection.
  • Feature: Collection metadata (manifests) cannot be deleted by users. Even if the content data has been deleted, a superuser can still see the filenames and sizes for every collection ever made. (Not clear whether this feature is valuable, though.)
  • Drawback: Applications must create Link objects (link_class="name") in order to attach names to collections (analogous to file/directory names in regular filesystems). This API is unwieldy.
  • Drawback: The default permission model -- i.e., the creator of an object has permission to read/edit/delete it -- cannot be achieved using the owner_uuid attribute. In order to provide a predictable outcome for "create a collection" regardless of whether another user has already created an identical collection, we are forced to give all collections owner_uuid=root and create a "permission" link for each collection creation. We also have to synchronize "name" links with "permission" links in order to achieve reasonable behavior for users.
  • Drawback: The timestamps of collections are confusing. If I create a new collection when (unbeknownst to me) another user has already created one with identical content some days ago, the "created" and "updated" timestamps and similar metadata will be surprising and generally useless (except perhaps as an undesirable information leak).
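
To make the content-addressing trade-off concrete, here is a minimal sketch of uuid = hash(manifest_text). It assumes the hash is an MD5 hex digest followed by "+" and the manifest's byte length, and it uses a simplified one-line manifest layout; `content_address` and the sample manifests are hypothetical names for illustration, not the API server's code.

```python
import hashlib


def content_address(manifest_text: str) -> str:
    """Hypothetical helper: MD5 hex digest of the manifest plus its byte length."""
    data = manifest_text.encode("utf-8")
    return f"{hashlib.md5(data).hexdigest()}+{len(data)}"


# Assumed manifest layout: "<stream> <block locator> <position:size:filename>\n".
# The block locator below is the MD5 of empty content with size 0.
BLOCK = "d41d8cd98f00b204e9800998ecf8427e+0"
original = f". {BLOCK} 0:0:a.txt\n"
renamed = f". {BLOCK} 0:0:b.txt\n"

# Identical manifests always map to the same identifier, so two users who
# create the same collection end up sharing one record.
same = content_address(original) == content_address(original)

# Renaming one file changes the manifest text, and therefore the identifier,
# even though the stored data block is unchanged.
changed = content_address(original) != content_address(renamed)
```

Under these assumptions both `same` and `changed` are true, which is the de-duplication feature and its limitation from the list above.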

New behavior

  • Collections are mutable, and have a name attribute.
  • Look up the hash of a collection's manifest when you want to do a bitwise comparison of content.
(Certainly incomplete) list of changes/consequences:
  • First step: allow clients to call collections.create without providing a uuid. (merged in 5bbd6abc)
  • Update uuid→class regexps to accept collection uuids in the usual arvados uuid format as well as portable_data_hashes.
  • Copy current uuid values to portable_data_hash
  • If clients provide portable_data_hash to collections.create, verify it the same way uuid is verified now (i.e., compare it to the portable_data_hash computed from the provided (stripped) manifest, and respond 422 if it doesn't match). Skip this check if the client provides no portable_data_hash.
  • Fix clients so they pass the expected portable_data_hash instead of uuid (or pass neither) and use the uuid provided by Arvados, rather than assuming the new collection's uuid will be a content address.
  • Add usual mutable fields like "name", "description", and "properties" to the collections table.
  • Remove "all collections are owned by root" logic.
  • Remove "add a permission link for me after creating a collection" logic.
  • Update Workbench to use collections' "name" attributes instead of name links.
  • Migrate existing name links in the database to become new collections.
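
The portable_data_hash check in the list above could be sketched as follows. This is an illustration under stated assumptions, not the server implementation: "stripped" is taken to mean removing permission hints of the form "+A<signature>@<expiry>" from block locators, the hash is taken to be MD5 plus byte length, and `strip_manifest`, `portable_data_hash`, `check_create_request`, and `UnprocessableEntity` are hypothetical names.

```python
import hashlib
import re
from typing import Optional


class UnprocessableEntity(Exception):
    """Stands in for an HTTP 422 response in this sketch."""
    status = 422


# Assumption: a block locator starts with 32 hex digits and "+<size>".
_LOCATOR = re.compile(r"^[0-9a-f]{32}\+\d+")
# Assumption: a permission hint looks like "+A<hex signature>@<hex expiry>".
_PERMISSION_HINT = re.compile(r"\+A[0-9a-f]+@[0-9a-f]+")


def strip_manifest(manifest_text: str) -> str:
    """Remove permission hints from locator tokens, leaving other tokens alone."""
    def strip_token(token: str) -> str:
        if _LOCATOR.match(token):
            return _PERMISSION_HINT.sub("", token)
        return token

    return "\n".join(
        " ".join(strip_token(token) for token in line.split(" "))
        for line in manifest_text.split("\n")
    )


def portable_data_hash(manifest_text: str) -> str:
    """Assumed format: MD5 hex digest of the stripped manifest plus its byte length."""
    data = strip_manifest(manifest_text).encode("utf-8")
    return f"{hashlib.md5(data).hexdigest()}+{len(data)}"


def check_create_request(manifest_text: str, claimed_pdh: Optional[str] = None) -> str:
    """Return the computed hash; reject a client-supplied hash that does not match."""
    computed = portable_data_hash(manifest_text)
    # The check is skipped entirely when the client sends no portable_data_hash.
    if claimed_pdh is not None and claimed_pdh != computed:
        raise UnprocessableEntity(f"portable_data_hash mismatch: expected {computed}")
    return computed
```

A client that passes neither uuid nor portable_data_hash calls `check_create_request(manifest_text)` and simply receives the computed value alongside the system-assigned uuid.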

Looking up collections by portable data hash

Existing Workbench links with old collection UUIDs should still work. Crunch jobs (new ones and repetitions of old ones) should continue to use portable data hashes.

  • If collections.get is called with an old-format collection UUID, look up by portable_data_hash and redact the mutable fields.
  • Jobs' script_parameters should be filled in with portable data hashes rather than collection UUIDs. Pipelines will record both fields (UUID and portable data hash) much like they do now with link_uuid, link_name, and value keys. (See Workbench application_helper.rb)
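
A rough sketch of the lookup rule described above, using an in-memory list in place of the collections table. The two regular expressions are assumptions about the identifier formats (five-character prefix, five-character type, fifteen-character suffix for regular UUIDs; 32 hex digits plus "+size" for portable data hashes), the redacted fields are the ones named earlier in this ticket, and `get_collection` is a hypothetical name.

```python
import re
from typing import Optional

# Assumed identifier formats; the real uuid→class regexps may differ.
ARVADOS_UUID = re.compile(r"[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{15}")
PORTABLE_DATA_HASH = re.compile(r"[0-9a-f]{32}\+\d+")

# Mutable fields to hide when a record is fetched by content address.
MUTABLE_FIELDS = ("name", "description", "properties")


def get_collection(identifier: str, collections: list) -> Optional[dict]:
    """Look up by uuid or, for old-format identifiers, by portable_data_hash."""
    if ARVADOS_UUID.fullmatch(identifier):
        for record in collections:
            if record["uuid"] == identifier:
                return dict(record)
        return None
    if PORTABLE_DATA_HASH.fullmatch(identifier):
        for record in collections:
            if record["portable_data_hash"] == identifier:
                # Several collections may share this content, so the
                # per-collection mutable fields are left out of the response.
                return {k: v for k, v in record.items() if k not in MUTABLE_FIELDS}
        return None
    raise ValueError(f"unrecognized collection identifier: {identifier!r}")


# Example records (made-up values for illustration).
table = [
    {
        "uuid": "zzzzz-4zz18-0123456789abcde",
        "portable_data_hash": "d41d8cd98f00b204e9800998ecf8427e+0",
        "name": "empty collection",
        "description": "",
        "properties": {},
    }
]
by_uuid = get_collection("zzzzz-4zz18-0123456789abcde", table)
by_hash = get_collection("d41d8cd98f00b204e9800998ecf8427e+0", table)
```

In this sketch `by_uuid` carries the name while `by_hash` does not, so old links keep resolving without exposing one particular collection's mutable metadata.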

Subtasks 8 (0 open, 8 closed)

Task #3502: Document expectations for content addressing (Resolved, Peter Amstutz, 08/07/2014)
Task #3503: Document intended behavior and design for projects, collections and permissions (Resolved, Peter Amstutz, 08/08/2014)
Task #3581: Update workbench (Resolved, Peter Amstutz, 08/07/2014)
Task #3509: Links to collections in workbench render the "name" field sensibly (Resolved, Peter Amstutz, 08/22/2014)
Task #3579: Update test fixtures (Resolved, Peter Amstutz, 08/07/2014)
Task #3632: Review 3036-collection-uuids (Resolved, Peter Amstutz, 08/07/2014)
Task #3580: Update tests (Resolved, Peter Amstutz, 08/07/2014)
Task #3578: Write db migration (Resolved, Peter Amstutz, 08/07/2014)

Related issues

Related to Arvados - Bug #3024: [API] Synchronize read permissions and collection name links (Resolved)
Related to Arvados - Idea #3504: [SDKs] Clients are compatible with #3036 (Resolved, Peter Amstutz, 08/07/2014)
Related to Arvados - Bug #4756: [API] Add migration to change collection uuids to portable_data_hash in old job script_parameters (Rejected, 12/09/2014)