Tom Clegg, 02/04/2014 02:09 AM

h1. Keep index

See also:
* [[Keep server]]
* [[Keep manifest format]]
* source: n/a (design phase)

Purposes of index:
* Tell the garbage collector which blocks are eligible for deletion (with a partial order of preference)
* Tell the replication enforcer how many copies of each block should be stored (and in which [types of] backing store)
* Tell the rebalancer which blocks should be moved to redistribute free space and reduce probe time
* Tell managers how much disk space is being saved by content-addressable storage (CAS)
* Tell managers how much disk space is occupied in a given backing store service
* Tell managers how disk usage would be affected by modifying storage policy
* Tell users how much disk space is represented by a given set of collections
* Tell users how much disk space can be made available by garbage collection
* Tell users how soon they should expect their cached data to disappear
* Tell users performance statistics (how fast should I expect my job to read data?)
* Tell ops where each block was most recently read/written, in case data recovery is needed
* Tell ops how unbalanced the backing stores are across the cluster
* Tell ops activity level and performance statistics
* Tell ops activity level vs. amount of space (how much of the data is being accessed by users?)
* Tell ops disk performance/error/status trends to help identify bad hardware
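
Two of the reporting purposes above (space saved by CAS, and space represented by a set of collections) reduce to simple arithmetic once the index knows which blocks each collection uses. The following is a minimal sketch under stated assumptions, not a proposed implementation: the @CollectionBlocks@ mapping and the function names are hypothetical, and block sizes are assumed to be known to the index.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical in-memory view of "which blocks are used by which collections":
# collection id -> list of (block_hash, size_in_bytes) references.
# This is an illustration only, not an existing Keep data structure.
CollectionBlocks = Dict[str, List[Tuple[str, int]]]


def logical_bytes(index: CollectionBlocks, collections: Iterable[str]) -> int:
    """Bytes referenced by the given collections, counting every reference."""
    return sum(size for name in collections for _, size in index[name])


def stored_bytes(index: CollectionBlocks, collections: Iterable[str]) -> int:
    """Bytes needed to hold one copy of each distinct block referenced."""
    unique: Dict[str, int] = {}
    for name in collections:
        for block_hash, size in index[name]:
            unique[block_hash] = size
    return sum(unique.values())


def cas_savings(index: CollectionBlocks, collections: Iterable[str]) -> int:
    """Bytes saved because identical blocks are stored once, not per reference."""
    names = list(collections)  # allow generators to be passed safely
    return logical_bytes(index, names) - stored_bytes(index, names)


example_index: CollectionBlocks = {
    "collection-1": [("aaa", 10), ("bbb", 20)],
    "collection-2": [("bbb", 20), ("ccc", 5)],
}
```

Here @stored_bytes@ counts a single copy per block; multiplying by a replication level would be a separate policy decision that this sketch deliberately leaves out.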

Basic kinds of data in the index:
* Which blocks are used by which collections (and which collections are valued by which users/groups)
* Which blocks are stored on which disks
* Which disks are attached to which nodes
* Read events
* Write events
* Exceptions (checksum mismatch, IO error)
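
The kinds of data listed above could be modelled roughly as follows. This is only a sketch to make the relationships concrete: every class, field, and method name here is hypothetical, and a real index would presumably live in a database rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Event:
    """One read, write, or exception observed for a block on a disk."""
    kind: str          # "read", "write", or "exception"
    block: str         # block hash
    disk: str          # disk identifier
    timestamp: float   # seconds since the epoch
    detail: str = ""   # e.g. "checksum mismatch" or "IO error"


@dataclass
class Index:
    collection_blocks: Dict[str, Set[str]] = field(default_factory=dict)
    collection_owners: Dict[str, Set[str]] = field(default_factory=dict)
    block_disks: Dict[str, Set[str]] = field(default_factory=dict)
    disk_node: Dict[str, str] = field(default_factory=dict)
    events: List[Event] = field(default_factory=list)

    def record(self, event: Event) -> None:
        """Store the event; a write also records where the block now lives."""
        self.events.append(event)
        if event.kind == "write":
            self.block_disks.setdefault(event.block, set()).add(event.disk)

    def last_seen(self, block: str) -> Optional[Event]:
        """Most recent read or write of a block, or None if never seen."""
        hits = [e for e in self.events
                if e.block == block and e.kind in ("read", "write")]
        return max(hits, key=lambda e: e.timestamp, default=None)

    def nodes_holding(self, block: str) -> Set[str]:
        """Nodes whose attached disks are recorded as storing the block."""
        disks = self.block_disks.get(block, set())
        return {self.disk_node[d] for d in disks if d in self.disk_node}
```

With this shape, @last_seen@ corresponds to the "where each block was most recently read/written" purpose, and @nodes_holding@ joins the block-to-disk and disk-to-node mappings.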

h2. Implementation considerations

Overview
* REST service
* API server may cache/proxy some queries
* API server may redirect some queries

Permissions
* Support +A tokens (as [[Keep server]] does) when accepting collection/blob UUIDs in a request?
* Require an admin api_token for some queries (site-configurable)?

Distributed/asynchronous
* It should be easy to run multiple Keep index services.
* Most features do not need synchronous operation or real-time data.
* Features that move or delete data should be tied to a single "primary" indexing service (a failover event will likely require resetting some state).
* Substantial disagreement between multiple index services should be easy to flag on the admin dashboard.
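
One possible way to detect "substantial disagreement" between index services is to compare their views of where each block is stored. The sketch below assumes each service can export a snapshot mapping block hash to the set of disks holding it; the @Snapshot@ type, the function names, and the 5% default threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, FrozenSet, List

# Hypothetical snapshot exported by one index service:
# block hash -> set of disk identifiers believed to hold that block.
Snapshot = Dict[str, FrozenSet[str]]


def disagreement(a: Snapshot, b: Snapshot) -> float:
    """Fraction of blocks (known to either side) whose locations differ."""
    blocks = set(a) | set(b)
    if not blocks:
        return 0.0
    differing = sum(1 for blk in blocks if a.get(blk) != b.get(blk))
    return differing / len(blocks)


def should_flag(snapshots: List[Snapshot], threshold: float = 0.05) -> bool:
    """True if any pair of index services disagrees on more than threshold."""
    return any(
        disagreement(x, y) > threshold
        for i, x in enumerate(snapshots)
        for y in snapshots[i + 1:]
    )
```

Because most features do not need real-time data, some disagreement is expected while services catch up; the threshold is meant to separate that normal lag from a genuine problem, and a real value would need tuning per site.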