Keep-balance » History » Version 4
Tom Clegg, 02/14/2014 10:22 AM
h1. Data Manager

The Data Manager enforces policies and generates reports about storage resource usage. The Data Manager interacts with the [[Keep server]] and the metadata database. Clients/users do not interact with the Data Manager directly.

See also:
* [[Keep server]]
* [[Keep manifest format]]
* source: n/a (design phase)

Responsibilities:
* Garbage collector: decide what is eligible for deletion (with some partial order of preference)
* Replication enforcer: copy and delete blobs in various backing stores to achieve the desired replication level
* Rebalancer: move blobs to redistribute free space and reduce client probes
* Tell managers how much disk space is being saved by content-addressed storage (CAS)
* Tell managers how much disk space is occupied in a given backing store service
* Tell managers how disk usage would be affected by modifying storage policy
* Tell managers how much disk space+time is used (per user, group, node, disk)
* Tell users when the replication/policy specified for a collection is not currently satisfied (and why, for how long, etc.)
* Tell users how much disk space is represented by a given set of collections
* Tell users how much disk space can be made available by garbage collection
* Tell users how soon they should expect their cached data to disappear
* Tell users performance statistics (how fast should I expect my job to read data?)
* Tell ops where each block was most recently read/written, in case data recovery is needed
* Tell ops how unbalanced the backing stores are across the cluster
* Tell ops activity level and performance statistics
* Tell ops activity level vs. amount of space (how much of the data is being accessed by users?)
* Tell ops disk performance/error/status trends (and SMART reports) to help identify bad hardware
* Tell ops the history of disk adds, removals, and moves

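The replication enforcer responsibility can be illustrated with a minimal sketch. Nothing here is taken from an existing implementation (the source is still in the design phase): the @Action@ type, the @plan_replication@ function, and the "most free space first" store ordering are all hypothetical choices made for illustration. Blocks with no stated replication level are skipped and left to the garbage collector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One step the enforcer would ask a backing store to perform (hypothetical)."""
    kind: str   # "copy" or "delete"
    block: str  # block identifier
    store: str  # backing store name


def plan_replication(locations, desired, stores_by_free_space):
    """Plan copies/deletes so each block reaches its desired replica count.

    locations: dict mapping block -> set of stores currently holding it
    desired: dict mapping block -> desired number of replicas
    stores_by_free_space: list of store names, most free space first
    """
    actions = []
    for block in sorted(locations):
        if block not in desired:
            continue  # no policy known: leave the decision to garbage collection
        have = locations[block]
        want = desired[block]
        if len(have) < want:
            # Under-replicated: copy to the emptiest stores that lack the block.
            candidates = [s for s in stores_by_free_space if s not in have]
            for store in candidates[: want - len(have)]:
                actions.append(Action("copy", block, store))
        elif len(have) > want:
            # Over-replicated: delete from the fullest stores first.
            holders = [s for s in reversed(stores_by_free_space) if s in have]
            for store in holders[: len(have) - want]:
                actions.append(Action("delete", block, store))
    return actions
```

Preferring emptier stores for copies and fuller stores for deletes also nudges the cluster toward balance, which overlaps with the rebalancer's goal; a real design might keep those concerns separate.
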
Basic kinds of data in the index:
* Which blocks are used by which collections (and which collections are valued by which users/groups)
* Which blocks are stored on which disks
* Which disks are attached to which nodes
* Read events
* Write events
* Exceptions (checksum mismatch, IO error)

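As a rough illustration of how these kinds of data could answer some of the questions listed under Responsibilities, here is a minimal in-memory sketch. The @Index@ and @Event@ classes, their field names, and the per-block size table are assumptions made for this example, not a proposed schema. User/group valuation of collections is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    time: float       # seconds since the epoch
    kind: str         # "read", "write", or "exception"
    block: str
    disk: str
    detail: str = ""  # e.g. "checksum mismatch" or "IO error"


@dataclass
class Index:
    collection_blocks: dict = field(default_factory=dict)  # collection -> list of blocks
    block_disks: dict = field(default_factory=dict)         # block -> set of disks
    disk_node: dict = field(default_factory=dict)           # disk -> node
    block_size: dict = field(default_factory=dict)          # block -> size in bytes (assumed known)
    events: list = field(default_factory=list)              # list of Event

    def referenced_blocks(self):
        """Blocks used by at least one collection."""
        return {b for blocks in self.collection_blocks.values() for b in blocks}

    def unreferenced_blocks(self):
        """Stored blocks that no collection uses: garbage collection candidates."""
        return set(self.block_disks) - self.referenced_blocks()

    def cas_savings(self):
        """Bytes saved because identical blocks are stored once, not once per reference."""
        logical = sum(self.block_size[b]
                      for blocks in self.collection_blocks.values() for b in blocks)
        unique = sum(self.block_size[b] for b in self.referenced_blocks())
        return logical - unique

    def nodes_holding(self, block):
        """Nodes with at least one disk that stores the block."""
        return {self.disk_node[d] for d in self.block_disks.get(block, ())}

    def last_access(self, block):
        """Most recent read or write event for the block, or None."""
        hits = [e for e in self.events
                if e.block == block and e.kind in ("read", "write")]
        return max(hits, key=lambda e: e.time, default=None)
```

Note that @cas_savings@ ignores replication: it compares one copy of each unique block against one copy per reference. A report for managers would probably need to multiply by the replication level in force.
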
h2. Implementation considerations

Overview
* REST service
* API server may cache/proxy some queries
* API server may redirect some queries

Permissions
* Support +A tokens, as the [[Keep server]] does, when accepting collection/blob UUIDs in a request?
* Require an admin api_token for some queries (site-configurable)?

Distributed/asynchronous
* It should be easy to run multiple Keep index services.
* Most features do not need synchronous operation or real-time data.
* Features that move or delete data should be tied to a single "primary" indexing service (a failover event will likely require resetting some state).
* Substantial disagreement between multiple index services should be easy to flag on the admin dashboard.
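
One way to flag disagreement between index services is to compare their views of block placement pairwise. The sketch below is only an illustration: the report shape (block mapped to a set of disks), the function names, and the 5% threshold standing in for "substantial" are all assumptions, since the design does not define them.

```python
from itertools import combinations


def disagreement(report_a, report_b):
    """Fraction of blocks, over the union of both reports, whose placement differs."""
    blocks = set(report_a) | set(report_b)
    if not blocks:
        return 0.0
    differing = sum(1 for b in blocks if report_a.get(b) != report_b.get(b))
    return differing / len(blocks)


def flag_disagreements(reports, threshold=0.05):
    """Return (service_a, service_b, fraction) for each pair above the threshold.

    reports: dict mapping index service name -> {block: set of disks}
    threshold: hypothetical cutoff for what counts as "substantial"
    """
    flagged = []
    for a, b in combinations(sorted(reports), 2):
        fraction = disagreement(reports[a], reports[b])
        if fraction > threshold:
            flagged.append((a, b, fraction))
    return flagged
```

Because most features tolerate stale data, some disagreement is expected while services catch up; a dashboard would likely want to flag only differences that persist across several comparisons.
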