Feature #17112
Store unsigned collection manifests in keep
Status: open
Description
This is a write-up of the manifests-in-keep idea I mentioned on the Arvados community call earlier.
I think that I wasn't clear on the call about what I meant by getting around the 64MB block size limit. What I was trying to suggest was to store the manifest of a "manifest collection" in a single block; the locator for that block could then be used in place of a portable_data_hash. The content of that initial block would just be the raw manifest of a collection that itself contains a `manifest.txt` file as well as a `portable_data_hash.txt` file (exact names not important). Note that this is also extensible: in the future, other files could be added alongside those two to carry additional collection properties, alternative manifest formats, or even indices into the text-based `manifest.txt` to improve access to large collections. The `manifest.txt` file in that manifest collection would contain the actual manifest of the real collection. By my calculations, a single 64MB "manifest collection" manifest block should be able to reference those two files with `manifest.txt` spanning up to 1597829 blocks. With full 64MB blocks, the actual collection manifest could therefore theoretically be up to 97TB, which in turn could reference more than 2.5 trillion data blocks (more than 145 exabytes). That should be enough to get by for a while. :-)
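As a back-of-the-envelope check of those figures, here is a minimal sketch assuming 64 MiB blocks and roughly 42 bytes per block reference in a manifest ("<32-hex-md5>+<8-digit-size> "); the ~50-byte allowance for the stream name and file tokens is an assumption for illustration:

```python
# Back-of-the-envelope check of the capacity figures above.
# Assumed costs: 64 MiB blocks, ~42 bytes per block locator in a manifest.

BLOCK_SIZE = 64 * 1024 * 1024          # 67108864 bytes
LOCATOR_BYTES = 1 + 32 + 1 + 8         # space + md5 hex + '+' + size digits

# Block locators that fit in one 64 MiB meta-manifest block,
# leaving ~50 bytes for the stream name and file tokens.
meta_capacity = (BLOCK_SIZE - 50) // LOCATOR_BYTES

# Maximum size of manifest.txt, and how many data blocks it can reference.
max_manifest_bytes = meta_capacity * BLOCK_SIZE
data_blocks = max_manifest_bytes // LOCATOR_BYTES
max_data_bytes = data_blocks * BLOCK_SIZE

print(f"manifest.txt blocks: {meta_capacity}")                         # ~1.6 million
print(f"max manifest.txt:    {max_manifest_bytes / 1024**4:.1f} TiB")  # ~97 TiB
print(f"max collection:      {max_data_bytes / 1024**6:.1f} EiB")      # > 145 EiB
```

The exact block count depends on the stream-header overhead assumed, but the arithmetic lands within a block or two of the 1597829 figure quoted above.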
Keep clients could continue to support traditional manifests stored in the API server (identified, as they are now, by either collection uuid or pdh) but could be extended to also support this new style of manifest-collection manifest block stored directly in keep, using some alternative scheme for referring to the manifest collection block instead of a uuid or pdh (perhaps something like `manifest_collection_block:<block_locator>`). Some slight changes to the keep clients would be required, but since the manifests are all just regular keep manifests, it would mostly be plumbing: calls into existing keep client and manifest parsing functionality to fetch the manifest from keep rather than from the API server. No new manifest format is needed, just the semantics of using the keep client to get the final manifest from a collection rather than directly from the API server. The schema of the "manifest collection" would need to be defined but is extremely simple: as proposed above, it could just be a `manifest.txt` and a `portable_data_hash.txt` at the top level of the collection.
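The client-side plumbing for recognising the three kinds of reference could look something like the following sketch. The `manifest_collection_block:` prefix is the hypothetical scheme proposed above; the uuid/pdh patterns follow the usual Arvados shapes:

```python
import re

# Sketch of how a keep client might classify the three kinds of collection
# reference. Only the classification is shown; actual fetching would use the
# existing keep client and manifest parsing code.

UUID_RE = re.compile(r'^[a-z0-9]{5}-4zz18-[a-z0-9]{15}$')
PDH_RE = re.compile(r'^[0-9a-f]{32}\+\d+$')

def classify_collection_ref(ref):
    """Return (kind, value) for a collection reference string."""
    if ref.startswith('manifest_collection_block:'):
        # New style: fetch this block from keep and parse it as the
        # meta-manifest, instead of asking the API server for manifest_text.
        return ('manifest_collection_block', ref.split(':', 1)[1])
    if UUID_RE.match(ref):
        return ('uuid', ref)
    if PDH_RE.match(ref):
        return ('portable_data_hash', ref)
    raise ValueError(f"unrecognised collection reference: {ref}")
```

A client holding a `manifest_collection_block` reference would read that one block from keep, parse it as a manifest, then read `manifest.txt` out of it to obtain the real manifest, all via existing code paths.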
I believe all of that should work with permissions turned off, with the only changes required being to add the functionality to read and write keep-based manifests to the keep client libraries. Importantly, it should not require any backwards-incompatible changes, and existing collections referenced by uuid or pdh could continue to work.
With an Arvados system with permissions turned on, the API server and the keep client would need additional functionality in order to sign block locators, which would obviously be stored unsigned in the keep-based manifests. Again, I believe these changes can be made entirely backwards compatibly. The API server could have a new endpoint that takes a collection portable data hash and a set of block locators, and returns signed block locators for the requested blocks (assuming the user has permission). The keep client would need to be modified to handle unsigned blocks by going to the API server to get them signed when actual data is requested. In the first instance, the client could just request that the server sign all blocks in the collection, but in the future this would be a place for tuning behaviour to only request signatures relevant to a particular directory prefix, file, or portion of a file (perhaps signing N blocks at a time). This could dramatically reduce the load on the API server by requesting signatures only for data that is actually being accessed by the client rather than for whole (potentially enormous) collections at a time.
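The "sign N blocks at a time, on demand" behaviour can be sketched as follows. `sign_blocks` here is a stand-in for the hypothetical new API endpoint (nothing in this sketch is a real Arvados API):

```python
# Sketch of lazy block signing: signatures are fetched from the (hypothetical)
# signing endpoint in chunks, only when the client actually touches a block.

SIGN_CHUNK = 4  # sign this many blocks per API call

class LazySigner:
    def __init__(self, pdh, locators, sign_blocks):
        self.pdh = pdh
        self.locators = locators          # unsigned locators, in file order
        self.sign_blocks = sign_blocks    # callable standing in for the API
        self.signed = {}                  # cache: locator -> signed locator

    def get_signed(self, index):
        """Return a signed locator for block `index`, fetching a chunk of
        signatures from the API server only when needed."""
        loc = self.locators[index]
        if loc not in self.signed:
            start = (index // SIGN_CHUNK) * SIGN_CHUNK
            chunk = self.locators[start:start + SIGN_CHUNK]
            self.signed.update(self.sign_blocks(self.pdh, chunk))
        return self.signed[loc]
```

Reading one file out of an enormous collection then costs one signing call per SIGN_CHUNK blocks actually read, instead of one signature per block in the whole collection.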
I would note also that a keep client with the new capability to request signed block locators only when required could optionally apply the same strategy to collections that are entirely API-server based: it would request the unsigned manifest text using existing API server functionality, then use the new endpoint to request block signatures as needed via the same code paths as above. In other words, whenever the client finds itself with unsigned block locators and permissions enabled, it would try to get them signed using the new block signing API endpoint.
The two portions of this work - the keep-based manifest-of-manifest-collection option for storing manifests, and the support for partial block signing - could be implemented independently and in either order. Together they should massively reduce the load on the API server and also enable extremely large collections, raising the theoretical single-file collection size limit from 195TB (given the default 128MB manifest text limit) to over 152043520TB, or 145EB (given the new 97TB manifest text limit).
Updated by Peter Amstutz almost 4 years ago
In this scheme, the block contains a manifest which consists of a file which is itself a manifest.
It sounds like this would add two levels of indirection over the current approach: the API server provides the PDH of the meta-manifest, which has to be looked up, then it has to look up the actual manifest blocks.
Or does the API server provide the meta-manifest text?
The PDH would be the meta-manifest, and not the real manifest?
You still need to be able to prove to the system that you have permission to read the blocks. You also have to prevent permission laundering where you create a manifest with the (unsigned) blocks you want to read, then tell the system that you made a manifest with those blocks and you swear you have permission to them, it accepts the manifest, and then you read it back (now with the signatures).
If the issue is API server load, I feel like there's other ways to tackle it. We could scale out multiple API server instances. We could add a feature where you tell it to only sign a subset of the manifest. To handle really huge collections, a scheme that enables manifests to embed other collections in subdirectory (referenced by PDH) seems more generally useful.
Updated by Joshua Randall almost 4 years ago
Peter Amstutz wrote:
> In this scheme, the block contains a manifest which consists of a file which is itself a manifest.
> It sounds like this would add two levels of indirection over the current approach: the API server provides the PDH of the meta-manifest, which has to be looked up, then it has to look up the actual manifest blocks.
> Or does the API server provide the meta-manifest text?
> The PDH would be the meta-manifest, and not the real manifest?
It could be an option for the API server to provide the meta-manifest text (although the client would need a way to know it should interpret it as such), which would give the client the option of getting the meta-manifest block locators already signed in the usual manner. That may be easier.
I guess what I had been thinking was that the API server's new block signing endpoint would be able to take a collection locator that could either be a PDH of the collection itself or a meta-manifest block locator and use that to decide whether you have permission to view the block(s) they reference. The API server would still have to contain collection metadata (most importantly, permissions links) and in addition to being able to look up a collection by PDH it would need to now be able to do that by meta-manifest block locator as well.
I imagined that a new client first opening a meta-manifest based collection would either take the unsigned meta-manifest block locator it has (in place of PDH) and ask the API server to sign it so that it can access it, or it would have obtained an already signed meta-manifest block locator by querying the collection API.
To make signing decisions for the blocks on the next level(s) down, the API server would need to either maintain its own index of what blocks are associated with a particular meta-manifest block or it would need to be able to contact keep to get that information as needed (possibly with some caching so that it does not access the same set of manifests from keep repeatedly). Either way, it would use its own privileged keep client to read the meta-manifest and manifest to validate that the blocks requested for signing are present in that collection (either at the meta or actual level). It could cache that information the first time it comes across such a collection, either semi-permanently by recording the relationships between a collection and its constituent blocks into some new table(s) or possibly just using an in-memory (possibly distributed) cache.
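The caching idea above can be sketched as follows. `fetch_manifest_text` is a stand-in for the API server's privileged keep client; the server reads a meta-manifest once, records which blocks the collection references, and answers later signing requests from the cache:

```python
import re

# Sketch of a server-side block-membership cache for meta-manifest
# collections. Everything here is illustrative, not a real Arvados component.

LOCATOR_RE = re.compile(r'\b[0-9a-f]{32}\+\d+\b')

class BlockMembershipCache:
    def __init__(self, fetch_manifest_text):
        self.fetch = fetch_manifest_text   # meta_locator -> manifest text
        self.cache = {}                    # meta_locator -> set of locators

    def blocks_for(self, meta_locator):
        if meta_locator not in self.cache:
            # Privileged read of the meta-manifest (and, in a fuller version,
            # the manifest.txt it references) happens only on a cache miss.
            text = self.fetch(meta_locator)
            self.cache[meta_locator] = set(LOCATOR_RE.findall(text))
        return self.cache[meta_locator]

    def may_sign(self, meta_locator, requested):
        """Only sign blocks that the collection actually references."""
        return set(requested) <= self.blocks_for(meta_locator)
```

A persistent variant would record the same locator sets in new database tables rather than an in-memory dict, as suggested above.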
> You still need to be able to prove to the system that you have permission to read the blocks. You also have to prevent permission laundering where you create a manifest with the (unsigned) blocks you want to read, then tell the system that you made a manifest with those blocks and you swear you have permission to them, it accepts the manifest, and then you read it back (now with the signatures).
I hadn't considered the laundering issue. I guess the way that probably works now is that you need to use signed block locators whenever you create a new collection or update its manifest text, and the API server validates that all referenced blocks are valid before writing the collection's (unsigned) manifest text, or else refuses to write it and gives a permission error? Clients must be getting signed locators back when they upload new data to keepstores, and they can only do manifest text manipulations using valid signed locators, or else the API server will reject the request to create/modify a collection. Is that close to correct?
If that is the case, then we would just need the meta-manifest equivalent: a way to provide a comprehensive set of signed block locators when attempting to set the meta-manifest block locator (i.e. the alternative to manifest_text when referencing a meta-manifest stored in keep) for a collection. The meta-manifest block in keep is content addressed and immutable so we should not need to worry about it being manipulated after the fact. The API server would be able to use the same mechanism described above (with its own privileged keep client) to enumerate all blocks referenced by the collection at all levels (the meta manifest block, all manifest collection blocks, and all data blocks) and validate that a signed version of every block has been provided by the client before allowing the collection to be set to the meta manifest locator.
The gate preventing laundering in the current system is on an update to a collection in the API server. That would remain the case in this system, it would just need to be extended to validate a meta-manifest rather than the embedded manifest text, and a mechanism by which a client can provide a full set of signed blocks would be needed as well.
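That gate can be sketched as a set-containment check. Both helpers here are hypothetical stand-ins: `enumerate_blocks` for the server's privileged enumeration of all blocks at all levels, and `signature_valid` for the server's normal signature verification:

```python
# Sketch of the anti-laundering gate: before accepting a meta-manifest
# locator for a collection, the server checks that the client supplied a
# validly signed locator for every block the collection references.

def strip_signature(signed_locator):
    # Keep appends signature hints after the '<md5>+<size>' part; keep just
    # the bare locator for comparison.
    parts = signed_locator.split('+')
    return '+'.join(parts[:2])

def validate_meta_manifest(meta_locator, client_signed, enumerate_blocks,
                           signature_valid):
    """Accept the collection update only if every referenced block is
    covered by a valid client-provided signature."""
    provided = {strip_signature(s) for s in client_signed
                if signature_valid(s)}
    required = enumerate_blocks(meta_locator)
    return required <= provided
```

A client that merely lists blocks it does not hold signatures for fails this check, so laundering is blocked at the same point it is blocked today: the collection create/update.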
> If the issue is API server load, I feel like there's other ways to tackle it. We could scale out multiple API server instances. We could add a feature where you tell it to only sign a subset of the manifest.
I am aware that we can scale somewhat by running multiple API servers - we used to run 8 API servers behind a load balancer at Sanger - although that only works up to a point, since eventually the database becomes the bottleneck. The database can also be scaled to some degree, but it gets harder and more complex, with diminishing returns.
We did also add the unsigned_manifest_text option to avoid API server load in situations where you explicitly just want to know the structure of the collection and have no need to access the data, but IMHO it is usually going to be wasteful for a normal keep client to do that without also being able to follow up with a request to get a subset of blocks signed.
A scheme in which you can request only part of the manifest be signed might be useful in some situations, although from my perspective what I've suggested as a signing API could basically be thought of as exactly that -- the API server can return a set of signed block locators when provided a collection PDH and a set of unsigned block locators by a client. That could just be implemented as an extension of the collection API rather than a separate endpoint.
> To handle really huge collections, a scheme that enables manifests to embed other collections in subdirectory (referenced by PDH) seems more generally useful.
I agree embedding collections within other collections could be useful, although I would view its utility more as a mechanism for easier composition of collections that happen to already be organised into distinct subdirectories than as a mechanism for making them arbitrarily large. It could also be used to make larger collections, but with a potentially annoying need to manually manage how large any one collection gets. I suppose clients used in pipeline jobs could be set up to automatically turn subdirectories into subcollections in some circumstances (to avoid failing with a collection-too-large error and needing manual refactoring). The main issue I have with this approach as a way to enable larger collections (and with the current collections-in-a-project way of accomplishing something similar) is that it exacerbates the already high load placed on the API server: a client must now contact the API server for the main collection (or project) as well as each of N subcollections (or collections) before it can even enumerate the files contained within them.
The proposal here can also be used for composition of very large collections, by manipulating manifest texts at the meta-manifest level. Manifest texts from several enormous collections could be concatenated by appending the blocks of one `manifest.txt` to another, reading back the resulting `manifest.txt` content to calculate the resulting PDH (which might be done on the API server), and proving to the server that you have access to all of the blocks within it. The manifest text content itself is also deduplicated (by being stored in the same keep blocks) rather than duplicating 100MB of manifest text in the database each time a collection is copied or composed. We had hundreds of gigabytes of manifest text in our database at Sanger and had to aggressively delete collections just to keep it reasonably sized.
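Since a collection's PDH is just the md5 of its manifest text plus the text's length in bytes, the composition step can be sketched without the database ever holding either text. `compose_manifests` below is a hypothetical helper, not an existing API:

```python
import hashlib

# Sketch of composition at the meta-manifest level: concatenate two manifest
# texts and compute the PDH of the result directly.

def portable_data_hash(manifest_text):
    """PDH = md5 hex of the manifest text, plus its length in bytes."""
    data = manifest_text.encode()
    return f"{hashlib.md5(data).hexdigest()}+{len(data)}"

def compose_manifests(manifest_a, manifest_b):
    """Concatenate two manifest texts (each a sequence of newline-terminated
    stream lines) and return the combined text with its new PDH."""
    combined = manifest_a + manifest_b
    return combined, portable_data_hash(combined)
```

At meta-manifest scale, "concatenate" means appending the keep blocks of one `manifest.txt` to the other's block list; only the resulting text needs to be streamed through md5 to produce the new PDH.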
Where we intend to regularly work with many collections that have 50-100MB of manifest text, keep seems like a much better place to store that data as most relational databases will struggle when pushed up towards hundreds of gigabytes or even terabytes of data.
I do agree that smaller collections are best stored as they are now as inline manifest_text in the API server database.