Keep storage classes » History » Revision 7

Revision 6 (Peter Amstutz, 06/13/2017 11:57 PM) → Revision 7/15 (Peter Amstutz, 06/14/2017 01:16 AM)

h1. Keep storage groups 

h2. Use cases

* Partition data among several cloud buckets for legal or financial reasons.
* Shift data from "hot" to "cool" storage (e.g. SSD to disk) for price/performance tradeoff.
* Move data from on-line to off-line storage (e.g. Glacier) but maintain provenance.

h2. Requirements

* arv-put has an option to specify the storage group.
** When writing each block, the client can specify the storage group for the block.
** Use the API to specify that the blocks belonging to a collection should go into a certain storage group.
* Workbench permits changing the storage group on a collection.
* arvados-cwl-runner has options to specify storage groups for intermediate and final output collections.
** Use the API to specify the storage group for the output collection of a container request.
* TBD: access controls on storage groups. Can we restrict which users can place collections in which storage group?

h2. Design

A "storage group" is effectively a tagging scheme to specify a group of keep servers where a block should be preferentially stored.

Related to (but not the same thing as) [[Keep storage tiers]]. For some use cases, the assumption of a roughly linear relationship between slow/cheap and fast/expensive doesn't necessarily hold.

Each keepstore service has access to one or more storage groups. Storage groups are independent; there is no implied relationship between groups. Data assigned to a group may still be sharded among multiple servers. Groups are identified with labels or uuids instead of integers. The keep services table adds a column which lists which groups are available at which services.
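As a rough illustration of the extra column, the sketch below models a few rows of the keep services table and picks the services that can store a block for a given group. The row layout, the @storage_groups@ field name, the group labels and the placeholder uuids are all assumptions made for this example, not the actual schema.

```python
# Hypothetical rows of the keep services table. The "storage_groups" field
# stands in for the proposed column; its name and the labels are assumed.
KEEP_SERVICES = [
    {"uuid": "keep0", "storage_groups": ["default", "ssd"]},
    {"uuid": "keep1", "storage_groups": ["default"]},
    {"uuid": "keep2", "storage_groups": ["ssd", "glacier"]},
]


def services_for_group(services, group):
    """Return the uuids of the services that advertise the given group."""
    return [s["uuid"] for s in services if group in s["storage_groups"]]
```

A client writing to the "ssd" group would then only consider @keep0@ and @keep2@, and could still shard blocks between them.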

When writing blocks, keepstore recognizes a header @X-Keep-Storage-Group@ and accepts or denies the block based on whether it can place the block in the designated group. The value of @X-Keep-Storage-Group@ should be reported in the response.
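The accept/deny decision could look roughly like the sketch below. Only the header name comes from this page. The function name, the status codes, and the handling of a request without the header (an optional @default_group@ argument) are assumptions for illustration.

```python
def handle_block_write(headers, local_groups, default_group=None):
    """Decide whether this keepstore accepts a block write.

    headers: request headers as a dict.
    local_groups: set of storage groups backed by this keepstore's volumes.
    default_group: group used when the header is absent (assumed behavior).
    Returns (status_code, response_headers).
    """
    group = headers.get("X-Keep-Storage-Group", default_group)
    if group is None:
        # No group requested and no default configured: reject (assumed code).
        return 400, {}
    if group not in local_groups:
        # This server cannot place the block in that group (assumed code).
        return 503, {}
    # Accepted: report the group the block was stored in.
    return 200, {"X-Keep-Storage-Group": group}
```

Echoing the header in the response lets the client confirm which group the block actually landed in.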

Each keepstore volume (mount) is associated with a storage group.

Collections may specify a desired group for the blocks in the collection. Keep balance should move blocks to the desired group. If multiple collections reference the same block in different groups, each group should have a copy with full replication.
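The balancing rule can be sketched as two steps: work out which groups each block should live in, then compare against where it lives now. This is a simplified model, not keep-balance itself. The record layout and function names are assumptions, and the replication count inside each group is not modeled.

```python
from collections import defaultdict


def desired_groups(collections):
    """Map each block to the set of groups requested by collections using it."""
    wanted = defaultdict(set)
    for coll in collections:
        for block in coll["blocks"]:
            wanted[block].add(coll["storage_group"])
    return dict(wanted)


def plan_moves(wanted, current):
    """Compare desired and current placement of referenced blocks.

    wanted and current map block -> set of groups.
    Returns (copies, removals) as sorted lists of (block, group) pairs.
    Blocks no longer referenced by any collection are ignored here.
    """
    copies, removals = [], []
    for block, groups in wanted.items():
        have = current.get(block, set())
        copies.extend((block, g) for g in groups - have)
        removals.extend((block, g) for g in have - groups)
    return sorted(copies), sorted(removals)
```

A block shared by an "ssd" collection and a "disk" collection ends up wanted in both groups, so neither copy is scheduled for removal.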

Data management policies, such as "move data from hot storage to cool storage if not accessed after 1 month", should be implemented with additional tooling/scripts that set storage groups on collections.
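A policy script of this kind might be as small as the sketch below, which retags collections in memory. The @last_accessed@ and @storage_group@ fields, the "hot" and "cool" labels, and the 30-day threshold are assumptions. A real script would read and update collections through the API.

```python
from datetime import datetime, timedelta


def apply_cooling_policy(collections, now, max_idle=timedelta(days=30),
                         hot="hot", cool="cool"):
    """Retag hot collections that have been idle for longer than max_idle.

    Each collection is a dict with "uuid", "storage_group" and
    "last_accessed" (a datetime); all three field names are hypothetical.
    Returns the uuids of the collections that were changed.
    """
    changed = []
    for coll in collections:
        idle = now - coll["last_accessed"]
        if coll["storage_group"] == hot and idle > max_idle:
            coll["storage_group"] = cool
            changed.append(coll["uuid"])
    return changed
```

The script only changes the tag on the collection. Moving the blocks is left to keep balance, which keeps policy separate from the storage layer.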

Storage groups could be used for moving data into long-term storage (e.g. Glacier, tape backup, etc.). As an example, the user would change the storage group to "glacier", which would copy the blocks into offline storage and delete them from the online storage. To retrieve the blocks, the user would change the storage group to "s3". This would fetch the blocks and copy them back to online storage. (TBD: how does the client find out when the data actually becomes available?)