Federated collections

In a federation, a client on cluster A can read a collection that is hosted on cluster B. Cluster A pulls the metadata and file content from cluster B as needed. The client's behavior is exactly the same as it is for collections hosted on cluster A.

  • Read collection by uuid
  • Read collection by pdh
  • Update collection by uuid (not covered here yet; needs a strategy for writing the data through to the remote cluster)

Differences from federated workflow retrieval

If the collection is requested from cluster A with GET /arvados/v1/collections/{uuid}, cluster A can proxy a request to cluster B, using the same approach used for workflows in #13493.
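The routing decision can be sketched as follows. The cluster table, hostnames, and the HMAC salting construction are illustrative assumptions for this sketch, not the exact Arvados implementation:

```python
import hashlib
import hmac

# Assumed stand-ins: in practice these come from the discovery document.
REMOTE_CLUSTERS = {"bbbbb": "bbbbb.example.org"}
LOCAL_CLUSTER_ID = "aaaaa"

def cluster_for_uuid(uuid):
    # The first five characters of an Arvados UUID name its home cluster.
    return uuid[:5]

def salted_token(secret, remote_cluster_id):
    # Salt the local token secret for the remote cluster so the original
    # secret is never sent off-cluster (HMAC construction assumed here).
    return hmac.new(secret.encode(), remote_cluster_id.encode(),
                    hashlib.sha1).hexdigest()

def route_collection_request(uuid):
    # Return the remote URL to proxy to, or None to serve locally.
    cid = cluster_for_uuid(uuid)
    if cid == LOCAL_CLUSTER_ID:
        return None
    host = REMOTE_CLUSTERS[cid]
    return "https://%s/arvados/v1/collections/%s" % (host, uuid)
```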

If the collection is requested from cluster A with GET /arvados/v1/collections/{pdh}, and cluster A does not have a matching collection, it can scan remote clusters until it finds one.
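The scan can be sketched as a local lookup followed by a walk over the remote cluster list; `fetch` is a placeholder for an authenticated proxied request:

```python
# Sketch: resolve a collection by portable data hash, falling back to
# remote clusters when there is no local match. All callables here are
# placeholders for the real database and federation layers.
def find_collection_by_pdh(pdh, local_lookup, remote_clusters, fetch):
    coll = local_lookup(pdh)
    if coll is not None:
        return coll
    for cluster_id in remote_clusters:
        coll = fetch(cluster_id, pdh)
        if coll is not None:
            return coll  # first match wins
    return None
```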

Once the collection is retrieved, the client also needs to read the data blocks. Without some additional mechanism, this won't work: the local keepstore servers will reject the blob signatures provided by the remote cluster, and they generally won't have the requested data anyway.

Remote data hints

If cluster A uses a salted token to retrieve a collection from cluster B, cluster B provides a signed manifest:

. acbd18db4cc2f85cedef654fccc4a4d8+3+Aabcdef@12345678 0:3:foo.txt

Cluster A propagates cluster B's signature but modifies it to be a remote cluster signature:

. acbd18db4cc2f85cedef654fccc4a4d8+3+Rbbbbb-abcdef@12345678 0:3:foo.txt
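The rewrite from +A to +R hints can be sketched with a substitution over the manifest text; the hint grammar used here is inferred from the examples above:

```python
import re

# Sketch: rewrite cluster B's local signature hints (+A...) into remote
# cluster hints (+R{cluster}-...) before returning the manifest.
def remotify_manifest(manifest, remote_cluster_id):
    return re.sub(
        r'\+A([0-9a-f]+@[0-9a-f]{8})',
        '+R%s-\\1' % remote_cluster_id,
        manifest)
```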

Any keepstore service on cluster A will be able to fetch the block from cluster B:
  • Look up bbbbb in remote cluster list in discovery doc
  • Look up bbbbb's keepproxy address in bbbbb's discovery doc
  • Fetch https://{keepproxy}/acbd18db4cc2f85cedef654fccc4a4d8+3+Aabcdef@12345678
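The three lookup steps above can be sketched like this; the keepproxy table stands in for the two discovery-document lookups:

```python
import re

# Assumed stand-in for the remote cluster / keepproxy discovery data.
KEEPPROXY = {"bbbbb": "keep.bbbbb.example.org"}

def remote_fetch_url(locator):
    # Parse "{hash}+{size}+R{cluster}-{sig}@{ts}", resolve the remote
    # keepproxy, and rebuild the locator with the original +A signature.
    m = re.match(
        r'^([0-9a-f]{32}\+\d+)\+R([a-z0-9]{5})-([0-9a-f]+@[0-9a-f]{8})$',
        locator)
    if m is None:
        return None  # not a remote-hinted locator
    blk, cluster_id, sig = m.groups()
    host = KEEPPROXY[cluster_id]
    return "https://%s/%s+A%s" % (host, blk, sig)
```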

Remote signature hint

  • acbd18db4cc2f85cedef654fccc4a4d8+3+Rbbbbb-abcdef@12345678

The syntax supports having both local and remote signatures on a single locator. This can help a sophisticated (future) controller communicate securely to keepstore, on a per-block or per-collection basis, whether keepstore should skip contacting the remote cluster when returning remote data that also happens to be stored locally.

  • acbd18db4cc2f85cedef654fccc4a4d8+3+Rbbbbb-abcdef@12345678+Aabcdef@12345678
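A keepstore deciding between the two hints might split the locator and check for a local signature; whether that +A signature is actually valid is assumed to be checked elsewhere:

```python
# Sketch: split a locator into its hash+size part and its hint list,
# then decide whether the block may be served locally. A +A hint means
# keepstore can skip contacting the remote cluster.
def parse_hints(locator):
    parts = locator.split('+')
    return '+'.join(parts[:2]), parts[2:]

def can_serve_locally(locator):
    _, hints = parse_hints(locator)
    return any(h.startswith('A') for h in hints)
```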

Optimization: Data cache on cluster A

A keepstore service on cluster A, when proxying a GET request to cluster B, has some opportunities to conserve network resources:
  1. Before proxying, check whether the block exists on a local volume. If so:
    1. Request a content challenge from the remote cluster to ensure the remote cluster does in fact have the data. (This can be skipped if cluster A trusts cluster B to enforce data access permissions.)
    2. Return the local copy.
  2. When passing a proxied response through to the client, write the data to a local volume as well, so it can be returned more efficiently next time.
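Both optimizations can be sketched in one read path. The store and proxy callables are placeholders for keepstore's volume and HTTP layers, and the content challenge is modeled as a simple flag:

```python
# Sketch: serve a block from a local volume when possible, otherwise
# proxy to the remote cluster and tee the response into local storage.
def get_block(blk, local_store, proxy_remote, trust_remote_acl,
              challenge_remote=lambda blk: True):
    data = local_store.get(blk)
    if data is not None:
        # 1. Block exists locally; optionally verify the remote cluster
        #    really has it before returning the local copy.
        if trust_remote_acl or challenge_remote(blk):
            return data
    # 2. Proxy the request, caching the data for next time.
    data = proxy_remote(blk)
    if data is not None:
        local_store[blk] = data
    return data
```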

Optimization: Identical content exists on cluster A

When proxying a "get collection by UUID" request to cluster B, cluster A might notice that the PDH returned by cluster B matches a collection already stored on cluster A. In that case all of the collection's data blocks are already stored locally, so cluster A can replace cluster B's signatures with its own, and the client will end up reading the blocks from local volumes.


To avoid an information leak, a configuration setting can restrict this optimization to cases where the caller's token has permission to read the existing local collection.
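That permission gate can be sketched as a predicate over the caller's access; the parameter names here are illustrative, not actual Arvados configuration keys:

```python
# Sketch: substitute local signatures for the remote cluster's only when
# a matching local collection exists and (if configured) the caller is
# allowed to read it, so the substitution does not leak the existence
# of data the caller could not otherwise see.
def maybe_use_local_copy(remote_pdh, local_pdhs, caller_can_read,
                         require_read_permission=True):
    if remote_pdh not in local_pdhs:
        return False
    if require_read_permission and not caller_can_read(remote_pdh):
        return False  # avoid the information leak
    return True
```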

Related issues

  • #13993 [API] Fetch remote-hosted collection by UUID
  • #13994 [Keepstore] Fetch blocks from federated clusters