Peter Amstutz, 06/13/2017 11:57 PM


Keep storage groups

Use cases

  • User has the option to store some data in cheaper storage, but only certain data qualifies. This can be indicated on a per-collection basis.
  • User wants data moved from "hot" to "cool" storage a certain amount of time after it has been generated.

Requirements

  • arv-put has option to specify storage group.
    • When writing blocks, client can specify which storage group for the block.
    • The API can specify that the blocks belonging to a collection should go into a certain storage group.
  • The API can specify the storage group for the output collection of a container request.
    • arvados-cwl-runner has options to specify storage groups for intermediate and final output collections.

Design

A "pool" (the storage group referred to above) is effectively a tagging scheme that specifies a subset of keep servers where a block should be preferentially stored.

Pools are related to (but not the same thing as) Keep storage tiers. Tiers assume a roughly linear relationship between slow/cheap and fast/expensive storage, and for some use cases that assumption doesn't necessarily hold.

Each service has access to one or more storage pools. Storage pools are independent: there is no implied relationship between pools. Data assigned to a pool may still be sharded among multiple servers. Pools can be identified with labels or UUIDs instead of integers. The keep services table adds a column that lists the pools available at each service.
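
To make the service-to-pool relationship concrete, here is a minimal sketch of how a client might narrow the list of keep services to those offering a given pool. The KeepService class, its field names, and the sample labels ("default", "hot", "cold") are hypothetical and only illustrate the idea; they are not the actual keep services schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeepService:
    """Hypothetical record: one row of the keep services table."""
    uuid: str
    pools: frozenset  # labels of the pools available at this service


def services_for_pool(services, pool):
    """Return the services that can store blocks in the given pool.

    A pool may be served by several services, so data in one pool
    can still be sharded among all of the services returned here.
    """
    return [s for s in services if pool in s.pools]


# Illustrative data only; the labels are assumptions.
SERVICES = [
    KeepService("keep0", frozenset({"default", "hot"})),
    KeepService("keep1", frozenset({"default", "cold"})),
    KeepService("keep2", frozenset({"cold"})),
]

cold_services = services_for_pool(SERVICES, "cold")
```

Because pools are independent labels rather than ordered integers, the lookup is a plain membership test with no notion of "higher" or "lower" pools.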

When writing blocks, keepstore recognizes an X-Keep-Pool header and accepts or denies the block based on whether it can place the block in the designated pool. If the header is not supplied, keepstore should fall back to a default pool. The value of X-Keep-Pool should be reported in the response.

A keepstore mount is associated with a specific pool.
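
The write-path decision described above could look roughly like the following sketch. It assumes each mount maps to exactly one pool and that the default pool is a configuration value. The function name, the PutResult type, and the mount paths are hypothetical, and how a denial is reported to the client (for example, which HTTP status) is deliberately left out because the design does not specify it.

```python
from dataclasses import dataclass
from typing import Optional

POOL_HEADER = "X-Keep-Pool"


@dataclass(frozen=True)
class PutResult:
    """Hypothetical outcome of a block write request."""
    accepted: bool
    pool: str              # echoed back to the client in the response
    mount: Optional[str]   # mount chosen for the block, if accepted


def choose_mount(headers, mount_pools, default_pool):
    """Pick a mount for an incoming block, or deny the write.

    headers:      request headers as a dict
    mount_pools:  dict mapping mount path -> pool label (one pool per mount)
    default_pool: pool used when the client sends no X-Keep-Pool header
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    pool = normalized.get(POOL_HEADER.lower(), default_pool)

    for mount, mount_pool in mount_pools.items():
        if mount_pool == pool:
            return PutResult(accepted=True, pool=pool, mount=mount)

    # No mount on this keepstore belongs to the requested pool.
    return PutResult(accepted=False, pool=pool, mount=None)
```

For example, with mounts {"/mnt/ssd": "hot", "/mnt/hdd": "default"}, a request carrying X-Keep-Pool: hot would land on /mnt/ssd, a request without the header would land on /mnt/hdd, and a request for "cold" would be denied.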

Collections may specify a desired pool for the blocks in the collection. Keep balance should move blocks to the desired pool. If multiple collections reference the same block in different pools, each pool should have a copy.
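
As a rough illustration of the balancing rule above, the sketch below first computes the set of pools each block should live in (the union over all collections that reference it), then compares that with where the block currently is. The function names and input shapes are hypothetical, and treating copies in no-longer-wanted pools as removable is an assumption drawn from the word "move"; this is not the actual keep-balance algorithm.

```python
from collections import defaultdict


def desired_pools(collections):
    """Map each block to the set of pools that should hold a copy.

    collections: iterable of (pool, block_hashes) pairs, one per collection.
    A block referenced by collections in different pools ends up with
    every one of those pools in its set.
    """
    wanted = defaultdict(set)
    for pool, block_hashes in collections:
        for block in block_hashes:
            wanted[block].add(pool)
    return dict(wanted)


def plan_moves(wanted, current):
    """Compare desired and current placement.

    wanted:  dict block -> set of pools that should hold it
    current: dict block -> set of pools that hold it now
    Returns dict block -> (missing, surplus), listing only blocks that
    need work. 'missing' pools need a new copy; 'surplus' pools hold a
    copy that no collection asks for (assumed safe to remove).
    """
    plan = {}
    for block, pools in wanted.items():
        have = current.get(block, set())
        missing = pools - have
        surplus = have - pools
        if missing or surplus:
            plan[block] = (missing, surplus)
    return plan
```

With one collection in "hot" referencing blocks a and b, and another in "cold" referencing b and c, block b is wanted in both pools. If all three blocks currently sit only in "hot", the plan copies b to "cold" and moves c from "hot" to "cold", while a is left alone.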

Data management policies, for example "move data from hot storage to cold storage if not accessed after 1 month", should be implemented later with additional tooling/scripts on top of keepstore.
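
A policy script of that kind might, as a sketch, select the collections whose desired pool should change and leave the actual data movement to keep-balance. Everything here is an assumption made for illustration: the record fields ("uuid", "pool", "last_accessed"), the pool labels, and the reading of "1 month" as 30 days. In particular, this design does not say how last-access time would be tracked.

```python
from datetime import datetime, timedelta


def collections_to_cool(collections, now, max_idle=timedelta(days=30),
                        hot="hot"):
    """Return UUIDs of hot collections idle for longer than max_idle.

    collections: iterable of dicts with hypothetical keys
                 "uuid", "pool", and "last_accessed" (a datetime).
    now:         the current time, passed in so the policy is testable.
    """
    return [
        c["uuid"]
        for c in collections
        if c["pool"] == hot and now - c["last_accessed"] > max_idle
    ]
```

The caller would then set the desired pool of each returned collection to the cold pool through the API, which keeps the policy itself outside keepstore as the paragraph above suggests.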

Updated by Peter Amstutz almost 7 years ago · 6 revisions