Keep storage classes » History » Version 11

Tom Clegg, 06/14/2017 01:05 PM

h1. Keep storage groups

h2. Use cases

* Partition data among several cloud buckets for legal or financial reasons.
* Shift data from "hot" to "cool" storage (e.g. SSD to disk) for price/performance tradeoff.
* Move data from online to offline storage (e.g. Glacier) while maintaining provenance.

h2. Requirements

* arv-put and arv-copy have an option to specify the storage group.
** When writing each block, client can specify storage group for the block.
** Use API to specify that the blocks belonging to a collection should go into a certain storage group.
* Workbench permits changing the storage group of a collection.
* arvados-cwl-runner has options to specify storage groups for intermediate and final output collections.
** Use API to specify the storage group for the output collection of a container request.
* TBD: access controls on storage groups. Should it be possible to restrict which users can place collections in which storage group?
* TBD: rules for de-duplicating blocks across groups? (e.g., if collections with identical data exist in both the "hot" and "cool" groups, do we really need a copy of the data in "cool" as well as the copy in "hot"?)

h2. Design

A "storage group" is effectively a tagging scheme that identifies a set of Keep servers (and volumes/mounts on a Keep server) where a block should preferentially be stored.

This design is generalized from [[Keep storage tiers]]. Unlike the storage tiers proposal, it implies no price/performance relationship between groups.

Each keepstore service has access to one or more storage groups. Storage groups are independent. Data assigned to a group may still be sharded among multiple servers. Groups are identified by labels or UUIDs instead of integers. The keep services table gains a column that lists which groups are available at which services.
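To illustrate how a client might use the new column, here is a minimal sketch that filters services by group. The column name @storage_groups@, the row layout, and the service identifiers are assumptions made for illustration; the proposal only states that a column lists the groups available at each service.

```python
from typing import Dict, List

# Hypothetical rows of the keep services table. The "storage_groups" column
# name is an assumption; the proposal does not name the column.
KEEP_SERVICES: List[Dict] = [
    {"uuid": "svc-0", "storage_groups": ["default", "hot"]},
    {"uuid": "svc-1", "storage_groups": ["default"]},
    {"uuid": "svc-2", "storage_groups": ["cool"]},
]


def services_for_group(services: List[Dict], group: str) -> List[str]:
    """Return the identifiers of services that advertise the given group."""
    return [s["uuid"] for s in services if group in s["storage_groups"]]
```

A client writing to the "cool" group would then only contact the services returned by @services_for_group(KEEP_SERVICES, "cool")@.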

When writing blocks, keepstore recognizes an @X-Keep-Storage-Group@ header and accepts or denies the block based on whether it can place the block in the designated group. If the header is not supplied, keepstore should place the block in a default group. The response should report the value of @X-Keep-Storage-Group@.
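The accept/deny decision described above could look roughly like the following sketch. It is not keepstore code: the function name @handle_put@, the status codes, and the choice to echo the header on a denied write are all assumptions, since the proposal does not specify them.

```python
from typing import Dict, Set, Tuple

HEADER = "X-Keep-Storage-Group"


def handle_put(headers: Dict[str, str], local_groups: Set[str],
               default_group: str = "default") -> Tuple[int, Dict[str, str]]:
    """Decide whether a block write can be placed in the requested group.

    Returns a (status, response_headers) pair.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    # Fall back to the default group when the client sends no header.
    group = normalized.get(HEADER.lower(), default_group)
    if group not in local_groups:
        # Placeholder status: the proposal does not say which code to use.
        return 503, {HEADER: group}
    # Report the group in the response, as the proposal asks.
    return 200, {HEADER: group}
```

For example, a server whose volumes cover only "default" would deny a write that requests "glacier" and accept a write that carries no header.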

Each keepstore volume (mount) is associated with a storage group.
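As a rough sketch of the volume-to-group association, a keepstore could derive the set of groups it offers from its volume list and pick candidate volumes for a write. The @Volume@ type, its field names, and the mount paths are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Volume:
    mount: str          # hypothetical mount path
    storage_group: str  # exactly one group per volume, as proposed


def advertised_groups(volumes: List[Volume]) -> Set[str]:
    """Groups this server can accept blocks for."""
    return {v.storage_group for v in volumes}


def volumes_in_group(volumes: List[Volume], group: str) -> List[Volume]:
    """Volumes eligible to store a block assigned to the given group."""
    return [v for v in volumes if v.storage_group == group]
```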

Collections may specify a desired group for the blocks in the collection. keep-balance should move blocks to the desired group. If multiple collections reference the same block in different groups, each group should have a copy with full replication.
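The per-block computation that keep-balance would need can be sketched as follows. The collection layout (@blocks@, @replication@) is simplified for illustration, and taking the maximum replication when several collections in the same group share a block is an assumption rather than something the proposal states.

```python
from collections import defaultdict
from typing import Dict, List


def desired_groups(collections: List[dict],
                   default_group: str = "default") -> Dict[str, Dict[str, int]]:
    """Map each block to the groups that need it and the replication wanted there."""
    want: Dict[str, Dict[str, int]] = defaultdict(dict)
    for coll in collections:
        # Collections without an explicit group fall back to the site default.
        group = coll.get("desired_storage_group") or default_group
        for block in coll["blocks"]:
            previous = want[block].get(group, 0)
            # Assumption: the highest replication requested in a group wins.
            want[block][group] = max(previous, coll["replication"])
    return dict(want)
```

A block referenced from both a "hot" and a "cool" collection therefore appears under both groups, each with its own replication target.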

Data management policies, such as "move data from hot storage to cool storage after 1 month", should be implemented on top of the keepstore layer with additional tooling/scripts that set storage groups on collections.

Storage groups could be used for moving data into long-term storage (e.g. Glacier, tape backup). As an example, the user would change the collection's storage group to "glacier", which would copy the blocks into offline storage and delete them from online storage. To retrieve the blocks, the user would change the storage group to "s3", which would fetch the blocks and copy them back to online storage. (TBD: how does the client find out when the data actually becomes available?)
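A policy such as "move data from hot storage to cool storage after 1 month" could be expressed as a small planning function that external tooling runs periodically. This is only a sketch: the collection fields, the 30-day approximation of one month, and the function name @plan_moves@ are assumptions, and actually applying the plan (updating each collection through the API) is left out.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def plan_moves(collections: List[dict], now: datetime,
               max_age: timedelta = timedelta(days=30),
               src: str = "hot", dst: str = "cool") -> List[Tuple[str, str]]:
    """Return (collection uuid, new group) pairs for collections older than max_age."""
    moves = []
    for coll in collections:
        in_source_group = coll["desired_storage_group"] == src
        too_old = now - coll["modified_at"] > max_age
        if in_source_group and too_old:
            moves.append((coll["uuid"], dst))
    return moves
```

The same shape would work for an archival policy by passing @src="s3"@ and @dst="glacier"@.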

h2. Development tasks

# keepstore: configurable group per volume/mount
# keepstore: support @X-Keep-Storage-Group@ header
# apiserver: collections.desired_storage_group column, site default group (probably called "default")
# keep-balance: compute/report desired storage group(s) for each block