Feature #17112

open

Store unsigned collection manifests in keep

Added by Joshua Randall almost 4 years ago. Updated 8 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
Keep
Target version:
Story points:
-
Release:
Release relationship:
Auto

Description

This is a write-up of the manifests-in-keep idea I mentioned on the Arvados community call earlier.

I don't think I was clear on the call about what I meant by getting around the 64MB block size limit. What I was trying to suggest was storing the manifest of a "manifest collection" in a single block; the block locator for that block could be used in place of a portable_data_hash. The content of that initial block would just be the raw manifest of a collection that itself contains a `manifest.txt` file and a `portable_data_hash.txt` file (the exact names are not important). The `manifest.txt` file in that manifest collection would contain the actual manifest for the real collection. This is also extensible: in the future, other files could be added alongside those two to hold additional collection properties, alternative manifest formats, or even indices into the text-based `manifest.txt` to improve access to large collections. By my calculations, a single 64MB "manifest collection" manifest block should be able to store references to those two files with a limit on the size of `manifest.txt` of 1597829 blocks. With full 64MB blocks, the actual collection manifest could theoretically be up to 97TB, which in turn means the actual collection could theoretically reference more than 2.5 trillion blocks (more than 145 exabytes). That should be enough to get by for a while. :-)
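The arithmetic above can be roughly checked with a short sketch. It assumes each block reference costs about 42 bytes of manifest text (a 32-character hash, `+`, an 8-digit size, and a separating space); that per-locator cost is my assumption, not a figure from the description, and it ignores the stream name and file tokens, which is presumably why the description's 1597829 is slightly below the rough bound.

```python
MIB = 2**20
BLOCK_SIZE = 64 * MIB   # full Keep block, as in the description
LOCATOR_BYTES = 42      # ASSUMPTION: md5 (32) + "+" + 8-digit size + 1 space


def locators_per(byte_budget, locator_bytes=LOCATOR_BYTES):
    """Upper bound on how many block references fit in byte_budget bytes."""
    return byte_budget // locator_bytes


# Rough bound for one 64MB manifest block (ignores stream name and file tokens).
rough_per_block = locators_per(BLOCK_SIZE)

# Figure quoted in the description for the size of manifest.txt, in blocks.
manifest_blocks = 1597829
manifest_txt_bytes = manifest_blocks * BLOCK_SIZE

# Blocks that a manifest.txt of that size could itself reference.
real_blocks = locators_per(manifest_txt_bytes)

print(f"locators per 64MB block (rough): {rough_per_block}")
print(f"max manifest.txt size: {manifest_txt_bytes / 2**40:.1f} TiB")
print(f"blocks referenced by manifest.txt: {real_blocks:.3e}")
```

Under these assumptions the results line up with the 97TB and 2.5 trillion block figures quoted above.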

Keep clients could continue to support traditional manifests stored in the API server (identified as they are now by either collection uuid or pdh), but could be extended to also support this new style of manifest-collection manifest block stored directly in keep. This would need some alternative scheme for referring to the manifest collection block instead of a uuid or pdh (perhaps something like `manifest_collection_block:<block_locator>`). Some slight changes to the keep clients would be required to add this, but since the manifests are all just regular keep manifests, it would mostly be plumbing: calls to existing keep client and manifest parsing functionality to get the manifest from keep rather than from the API server. No new manifest format is needed, just the semantics of using the keep client to get the final manifest from a collection rather than directly from the API server. The schema of the "manifest collection" would need to be defined but is extremely simple: as I proposed above, it could just have a `manifest.txt` and a `portable_data_hash.txt` at the top level of the collection.
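To make the plumbing concrete, here is a minimal sketch of writing and resolving a manifest collection. `FakeKeep` is an in-memory stand-in, not the Arvados SDK, and `store_manifest_collection`, `load_manifest` and `read_file` are hypothetical names. The parser only handles a single top-level stream (`.`) and unescaped file names, which is all the proposed schema needs.

```python
import hashlib

SCHEME = "manifest_collection_block:"  # reference scheme floated above


class FakeKeep:
    """In-memory stand-in for a Keep client (hypothetical, for illustration)."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        locator = f"{hashlib.md5(data).hexdigest()}+{len(data)}"
        self._blocks[locator] = data
        return locator

    def get(self, locator: str) -> bytes:
        return self._blocks[locator]


def read_file(keep, manifest_text: str, filename: str) -> bytes:
    """Return the bytes of a top-level file from a single-stream manifest."""
    for line in manifest_text.splitlines():
        tokens = line.split(" ")
        if tokens[0] != ".":
            continue
        # File tokens look like pos:size:name; everything else is a locator.
        locators = [t for t in tokens[1:] if ":" not in t]
        stream = b"".join(keep.get(loc) for loc in locators)
        for token in tokens[1:]:
            if ":" in token:
                pos, size, name = token.split(":", 2)
                if name == filename:
                    return stream[int(pos):int(pos) + int(size)]
    raise KeyError(filename)


def store_manifest_collection(keep, real_manifest: str, pdh: str,
                              block_size: int = 64 * 2**20) -> str:
    """Write manifest.txt and portable_data_hash.txt, return a reference."""
    files = [("manifest.txt", real_manifest.encode()),
             ("portable_data_hash.txt", pdh.encode())]
    stream = b"".join(data for _, data in files)
    locators = [keep.put(stream[i:i + block_size])
                for i in range(0, len(stream), block_size)]
    file_tokens, pos = [], 0
    for name, data in files:
        file_tokens.append(f"{pos}:{len(data)}:{name}")
        pos += len(data)
    outer = ". " + " ".join(locators + file_tokens) + "\n"
    # The outer manifest must fit in one block for the scheme to work.
    return SCHEME + keep.put(outer.encode())


def load_manifest(keep, reference: str) -> str:
    """Resolve a manifest_collection_block reference to the real manifest."""
    if not reference.startswith(SCHEME):
        raise ValueError("not a manifest collection reference")
    outer = keep.get(reference[len(SCHEME):]).decode()
    return read_file(keep, outer, "manifest.txt").decode()
```

A client would call `load_manifest` where it currently fetches `manifest_text` from the API server; everything downstream of that point stays the same.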

I believe all of that should work with permissions turned off, with the only change required being to add the functionality to read and write keep-based manifests to the keep client libraries. Importantly, it should not require any backwards-incompatible changes, and existing collections referenced by uuid or pdh could continue to work.

On an Arvados system with permissions turned on, the API server and the keep client would need additional functionality in order to sign block locators, which would obviously need to be stored unsigned in the keep-based manifests. Again, I believe these changes can be made in an entirely backwards-compatible way. The API server could have a new endpoint that takes a collection portable data hash and a set of block locators, and returns signed block locators for the requested blocks (assuming the user has permission). The keep client would need to be modified to handle unsigned blocks by going to the API server to get them signed when actual data is requested. In the first instance, the client could just request that the server sign all blocks in the collection, but in the future this would be a place for tuning behaviour to only request signatures relevant to a particular directory prefix, file, or portion of a file (perhaps in chunks of N blocks at a time). This could dramatically reduce the load on the API server by only requesting signatures for data that is actually being accessed by the client, rather than for whole (potentially enormous) collections at a time.
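A minimal sketch of both halves follows: a server-side handler for the proposed endpoint and a client-side generator that requests signatures N blocks at a time. All names are hypothetical, `can_read` stands in for the real permission check, and the HMAC message layout is illustrative only; it is not claimed to match the signing scheme Arvados actually uses.

```python
import hashlib
import hmac
import time
from typing import Callable, Iterator, List, Optional, Sequence


def sign_locator(locator: str, api_token: str, secret: bytes,
                 expiry: int) -> str:
    """Append an illustrative +A<signature>@<expiry> hint to a locator."""
    block_hash = locator.split("+", 1)[0]
    message = f"{block_hash}@{api_token}@{expiry:x}".encode()
    signature = hmac.new(secret, message, hashlib.sha1).hexdigest()
    return f"{locator}+A{signature}@{expiry:x}"


def sign_blocks(pdh: str, locators: Sequence[str], api_token: str,
                secret: bytes, can_read: Callable[[str, str], bool],
                now: Optional[int] = None, ttl: int = 3600) -> List[str]:
    """Hypothetical endpoint: sign the requested blocks of collection pdh.

    A real server would presumably also verify that each locator belongs
    to the collection; that check is omitted from this sketch.
    """
    if not can_read(api_token, pdh):
        raise PermissionError(f"token may not read {pdh}")
    expiry = int(now if now is not None else time.time()) + ttl
    return [sign_locator(loc, api_token, secret, expiry) for loc in locators]


def signed_lazily(locators: Sequence[str], n: int,
                  request: Callable[[Sequence[str]], List[str]]
                  ) -> Iterator[str]:
    """Yield signed locators, calling the endpoint for n blocks at a time.

    Because this is a generator, a chunk is only requested once the
    consumer actually reaches it.
    """
    for i in range(0, len(locators), n):
        yield from request(locators[i:i + n])
```

A reader that stops after the first few blocks of a large file would trigger only the first request; the remaining chunks are never sent to the API server.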

I would also note that a keep client with the new capability to request signed block locators only when required could apply the same strategy to collections that are entirely API-server based. It would request the collection using the existing API server functionality for returning an unsigned manifest text, and then use the new endpoint to request block signatures as needed, using the same code paths as above. In other words, if the client finds itself with unsigned block locators and permissions enabled, it would try to get them signed using the new block signing API endpoint.
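The client-side rule in the paragraph above could look something like the following sketch. The names are hypothetical, and it assumes that a signed locator carries a hint starting with `A` after the size and that the signing endpoint returns results in request order.

```python
from typing import Callable, List, Sequence


def is_signed(locator: str) -> bool:
    """True if the locator carries a permission hint (+A...)."""
    hints = locator.split("+")[2:]  # skip the hash and the size
    return any(hint.startswith("A") for hint in hints)


def ensure_signed(locators: Sequence[str], permissions_enabled: bool,
                  request_signatures: Callable[[List[str]], List[str]]
                  ) -> List[str]:
    """Return locators ready for reading, signing only where needed.

    The same path serves keep-based manifests and unsigned manifest text
    fetched from the API server.
    """
    if not permissions_enabled:
        return list(locators)
    unsigned = [loc for loc in locators if not is_signed(loc)]
    if not unsigned:
        return list(locators)
    # ASSUMPTION: the endpoint returns signed locators in request order.
    signed = dict(zip(unsigned, request_signatures(unsigned)))
    return [signed.get(loc, loc) for loc in locators]
```

Locators that already carry a signature pass through untouched, so existing signed manifests would not cause any extra API calls.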

The two portions of this work (the keep-based manifest-of-manifest-collection option for storing manifests, and the support for partial block signing) could be implemented independently and in either order. Together they should massively reduce the load on the API server and also enable extremely large collections, raising the theoretical single-file collection size limit from 195TB (given the default 128MB manifest text limit) to over 152043520TB, or 145EB (given the new 97TB manifest text limit).
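The before-and-after limits can be compared with the same kind of rough estimate. This again assumes roughly 42 bytes of manifest text per 64MB block reference, which is my assumption rather than a number from the description, and it treats TB and EB as binary units.

```python
MIB = 2**20
TIB = 2**40
EIB = 2**60
BLOCK_SIZE = 64 * MIB
LOCATOR_BYTES = 42  # ASSUMPTION: hash + "+" + 8-digit size + space


def max_collection_bytes(manifest_limit_bytes: int) -> int:
    """Largest data size a manifest of the given size could reference."""
    return (manifest_limit_bytes // LOCATOR_BYTES) * BLOCK_SIZE


# Current default: manifest text limited to 128MB.
current = max_collection_bytes(128 * MIB)

# Proposed: manifest.txt spread over 1597829 full blocks (about 97TB).
proposed = max_collection_bytes(1597829 * BLOCK_SIZE)

print(f"current limit:  {current // TIB} TiB")
print(f"proposed limit: {proposed // TIB} TiB ({proposed / EIB:.1f} EiB)")
```

With these assumptions the current limit comes out at 195 TiB, and the proposed limit lands above the 152043520TB (145EB) figure quoted above.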
