Keep Design Doc » History » Version 2

Peter Amstutz, 06/03/2015 07:53 PM

{{>toc}}

h1=. Keep Design Doc

p>. *"Misha Zatsman":mailto:misha@curoverse.com
Last Updated: July 21, 2014*

h2. Overview

h3. Objective

Keep is a content-addressable distributed file system that yields high performance in I/O-bound cluster environments. Keep is designed to run on low-cost commodity hardware and integrate with the rest of the Arvados system. It provides high fault tolerance and high aggregate performance to a large number of clients.

h3. Scope

This design doc provides an overview of the Keep system along with details about the interfaces exposed to users and shared between components. The intended audience is software engineers and bioinformaticians.

h3. Background

* [[Introduction to Arvados]]
* [[Technical Architecture]]
* [[Provenance and Reproducibility]]

There is also [[Keep|another overview of the Keep system]].

h3. Alternatives

* *"HDFS":http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html* - Is not content-addressed.
* *"CEPH":http://ceph.com/* - Is not content-addressed.
* *"GFS":http://en.wikipedia.org/wiki/Google_File_System* - Is neither content-addressed nor available outside Google.
* *"Blobstore":https://developers.google.com/appengine/docs/python/blobstore/* - Is not available outside of "Google App Engine":https://developers.google.com/appengine/docs/whatisgoogleappengine.

h3. Design Goals and Tradeoffs

* *Scale* - Keep architecture is built to handle data sets that grow to petabyte and ultimately exabyte scale.

* *Fault-Tolerance* - Hardware failure is expected. Installations will include tens of thousands of drives on thousands of nodes; therefore, drive, node, and rack failures are inevitable. Keep has redundancy and recovery capabilities at its core.

* *Large Files* - Unlike disk file systems, Keep is tuned to store large files from gigabytes to terabytes in size. It also reduces the overhead of storing small files by packing those files' data into larger data blocks.

* *High Aggregate Performance Over Low Latency* - Keep is optimized for large batch computations instead of lots of small reads and writes by users. It maximizes overall throughput in a busy cluster environment and data bandwidth from client to disk, while at the same time minimizing disk thrashing commonly caused by multiple simultaneous readers.

* *Provenance and Reproducibility* - Keep makes tracking the origin of data sets, validating contents of objects, and enabling the reproduction of complex computations on large datasets a core function of the storage system.

* *Data Preservation* - Keep is ideal for data sets that don't change once saved (e.g. whole genome sequencing data). It's organized around a write once, read many (WORM) model rather than create-read-update-delete (CRUD) use cases.

* *Complex Data Management* - Keep operates well in environments where there are many independent users accessing the same data or users who want to organize data in many different ways. Keep facilitates data sharing without expecting users either to agree with one another about directory structures or to create redundant copies of the data.

* *Security* - Biomedical data is assumed to have strong security requirements, so Keep has on-disk and in-transport encryption as a native capability.

* *Ease of use* - The combination of the above goals demands some complexity in the Keep system. We endeavor to shield our users from much of the complexity and instead provide them with a simple model of storage so that they can easily and correctly reason about their data.

h3. Usage

h4. Persisting Data

Data stored in Keep is ephemeral by default and will only be saved for two weeks. To store it for longer, a user must create a persisted collection containing the data.

To permanently store data in Keep, a user must make the following two calls in the Keep Client:

# The user stores data, which will be recorded in one or more blocks.
# The user creates a persisted collection containing one or more blocks.

* A collection's permanence is either ephemeral or persisted.
* Ephemeral collections will be deleted by Keep two weeks after the collection became ephemeral, but not before.
* Persisted collections will be kept forever (or until a user changes their permanence to ephemeral).
* Both ephemeral and persisted collections can be read and used as inputs for jobs; deleted collections cannot.
* A user may switch a collection's permanence back and forth between persisted and ephemeral.
* A user may delete a collection at any point, regardless of its permanence.

The following chart illustrates the possible permanence state transitions for collections. Solid arrows indicate transitions that users can perform. The dashed arrow indicates that ephemeral collections can be deleted by users or by Keep.

!https://docs.google.com/drawings/d/1AVp3zuyCzgWNNjgF-HoI-SbNdeblXMR8vmO8-SyUigM/pub?w=727&h=304!:https://docs.google.com/a/curoverse.com/drawings/d/1AVp3zuyCzgWNNjgF-HoI-SbNdeblXMR8vmO8-SyUigM/edit?usp=sharing
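
The transitions in the chart can also be expressed in code. The following Python sketch is illustrative only; the state and actor names are ours, not an Arvados API.

```python
# Illustrative sketch of the permanence state machine (names are ours,
# not an Arvados API).
EPHEMERAL, PERSISTED, DELETED = "ephemeral", "persisted", "deleted"

# Map each allowed transition to the actors that may perform it.  Keep
# itself only deletes ephemeral collections (two weeks after they became
# ephemeral); every other transition is user-initiated.
TRANSITIONS = {
    (EPHEMERAL, PERSISTED): {"user"},
    (PERSISTED, EPHEMERAL): {"user"},
    (EPHEMERAL, DELETED): {"user", "keep"},
    (PERSISTED, DELETED): {"user"},
}

def transition(state, new_state, actor):
    """Return the new state, or raise if the transition is not allowed."""
    if actor not in TRANSITIONS.get((state, new_state), set()):
        raise ValueError(
            f"{actor} may not move a collection from {state} to {new_state}")
    return new_state
```

Note that there is no transition out of the deleted state, and Keep never deletes a persisted collection on its own.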

h4. Referencing a Collection

Users can refer to collections stored in Keep using either a UUID or a name.

* A UUID is a _Universally Unique IDentifier_ which Keep assigns to a collection when it is created. Other Arvados entities have UUIDs too.
* A user can assign a name to a collection through the Keep Client, either when creating the collection or at a later time.
* Collections have at most one name, so assigning a name to a collection will overwrite the previous name, if any.
* Collections aren't required to have names, but some parts of Arvados (e.g. fuse mounts) will not work with collections without names.
* Within a particular project, each named collection must have a unique name.

h3. Architecture

!https://docs.google.com/drawings/d/14NSDPiQSeAeWkwHYuXIkFrupwoBwRYjmqm0wemkAvTA/pub?w=841&h=222!:https://docs.google.com/a/curoverse.com/drawings/d/14NSDPiQSeAeWkwHYuXIkFrupwoBwRYjmqm0wemkAvTA/edit

The Keep system involves three components:
* The *Keep Client* can be built into any binary through an SDK. The client is responsible for:
** Sending requests to write and read blocks.
** Determining which servers to send each read/write request to.
** Combining collections of files and breaking them into blocks.
** Replicating blocks to the level specified by the user.
** Reporting the specified replication levels for collections to the API server.
* The *Keep Server* stores and returns blocks in response to requests it receives.
** By design, each Keep Server instance runs independently of the others.
** A given instance does not communicate with other instances, nor does it know where they are or which blocks they store.
* The *[[Data Manager Design Doc|Data Manager]]* periodically examines the aggregate state of Keep storage across Keep Server instances. The Data Manager is responsible for:
** Validating that each block is sufficiently replicated.
** Reporting on user disk usage.
** Determining which blocks should be deleted when space is needed.

h2. Specifics

h3. Interfaces

h4. Locators

Each block is identified by a locator, which has the following format (described in "BNF":http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form):

<pre>locator      ::= address hint*
address      ::= digest size
digest       ::= <32 hexadecimal digits>
size         ::= "+" [0-9]+
hint         ::= "+" hint-type hint-content
hint-type    ::= [A-Z]
hint-content ::= [A-Za-z0-9@_-]+</pre>

Individual hints may have their own required format:

<pre>sign-hint      ::= "+A" <40 lowercase hex digits> "@" sign-timestamp
sign-timestamp ::= <8 lowercase hex digits></pre>
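
As an illustration, the grammar above can be checked, and an address constructed, with a few lines of Python. This is a sketch, not part of any Keep SDK; it assumes the 32-hex-digit digest is the MD5 checksum of the block.

```python
# Illustrative sketch (not part of a Keep SDK): constructing and checking
# locators per the grammar above.  Assumes the 32-hex-digit digest is the
# MD5 checksum of the block.
import hashlib
import re

LOCATOR_RE = re.compile(r'^[0-9a-f]{32}\+[0-9]+(\+[A-Z][A-Za-z0-9@_-]+)*$')
SIGN_HINT_RE = re.compile(r'^A[0-9a-f]{40}@[0-9a-f]{8}$')  # the "+A" signature hint

def make_address(block):
    """Build the address part of a locator: digest plus size."""
    return f"{hashlib.md5(block).hexdigest()}+{len(block)}"

def is_valid_locator(locator):
    """Check a locator against the grammar, including any +A hint's format."""
    if not LOCATOR_RE.match(locator):
        return False
    hints = locator.split('+')[2:]
    return all(SIGN_HINT_RE.match(h) for h in hints if h.startswith('A'))
```

For example, @make_address(b"foo")@ yields @acbd18db4cc2f85cedef654fccc4a4d8+3@, which @is_valid_locator@ accepts.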

h4. Keep Server

_Locators_ are built from the relevant block, as described above.

The Keep Server supports the following HTTP requests sent by the Keep Client.

|_.Request                    |_.Meaning                                                                 |_.Possible Responses |
| PUT /locator                | Write a block. The content of the block goes in the body of the request. | -- |
| HEAD /locator               | Check whether the block exists                                           | -- |
| HEAD /locator?checksum=true | Check whether the block exists and verify its checksum                   | -- |
| GET /locator                | Retrieve a block                                                         | -- |
| GET /locator?checksum=true  | Retrieve a block, verifying its checksum on the server before sending    | -- |

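
For illustration, here is a minimal Python sketch of the client side of the block API above, using only the standard library. The server address is hypothetical, and a real Keep Client would also handle replication, retries, and signed locators.

```python
# Illustrative sketch of the block API above, using only the Python standard
# library.  The server address is hypothetical; a real Keep Client would also
# handle replication, retries, and signed locators.
import hashlib
import urllib.request

KEEP_SERVER = "http://keep0.example.com:25107"  # hypothetical address

def compute_locator(block):
    """A block's locator is derived from its content: digest plus size."""
    return f"{hashlib.md5(block).hexdigest()}+{len(block)}"

def put_block(block):
    """PUT /locator - write a block, with its content as the request body."""
    req = urllib.request.Request(
        f"{KEEP_SERVER}/{compute_locator(block)}", data=block, method="PUT")
    with urllib.request.urlopen(req):
        pass
    return compute_locator(block)

def get_block(locator, verify=False):
    """GET /locator - retrieve a block, optionally verifying the checksum
    on the server before it is sent."""
    url = f"{KEEP_SERVER}/{locator}" + ("?checksum=true" if verify else "")
    with urllib.request.urlopen(url) as resp:
        return resp.read()
```

Because the locator is computed from the content, a reader can always re-hash the returned bytes and compare them against the locator it requested.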
The Keep Server also supports the following requests sent by the Data Manager. Each of these requests requires a privileged token.

|_.Request          |_.Meaning |_.Possible Responses |
| DELETE /locator   | Delete all copies of the specified block. | 200 OK: body like {"copies_deleted":2,"copies_not_deleted":1} (this would mean one copy was found on a read-only volume, two copies were found on writable volumes). 401 Unauthorized: Token is invalid, probably expired. 403 Forbidden: Scope problem. 403 Forbidden: User does not have admin privileges. 404 Not Found: Block not found on this server. 422 Unprocessable Entity: Block was written too recently to be deleted. |
| GET /index        | Get the full list of blocks stored here, including the size and Unix timestamp of the most recent PUT for each block. | -- |
| GET /index/prefix | Get the list of blocks stored here whose locators start with _prefix_, including the size and Unix timestamp of the most recent PUT for each block. | -- |
| GET /state.json   | Get the list of backing filesystems, disk fullness, IO counters, and perhaps recent IO statistics. | -- |
| PUT /trash        | Update the trash list. The trash list specifies blocks that Keep should delete when it runs low on disk space. For details, see "Trash List" below. | 200 OK. 400 Bad Request. 401 Unauthorized. |
| PUT /pull         | Update the pull list. The pull list specifies blocks that Keep should pull from other Keep servers in order to maintain appropriate replication levels. For details, see "Pull List" below. | 200 OK. 400 Bad Request. 401 Unauthorized. |

h4. Pull List

The _pull list_ is a JSON message containing a list of blocks to pull from sibling Keep servers. It is supplied as the message body in @PUT /pull@ requests.  When Keep has resources available, it should request these blocks in order, trying the servers in the order given, and store the block returned in its own storage.

Only the Data Manager is allowed to submit a pull list to Keep.  If the sender cannot be authenticated as the Data Manager, 401 Unauthorized is returned.

Each element of the pull list is an object with the following fields:
* @locator@ - a Keep locator string specifying the block to be replicated.
* @servers@ - a list of servers, each one specified as @"host:port"@.

An example pull list:

<pre>[
    {
        "locator":"e4d909c290d0fb1ca068ffaddf22cbd0+4985",
        "servers":["https://keep0.qr1hi.arvadosapi.com:25107","https://keep1.qr1hi.arvadosapi.com:25108"]
    },
    {
        "locator":"55ae4d45d2db0793d53f03e805f656e5+658395",
        "servers":["http://10.0.1.5:25107","http://10.0.1.6:25107","http://10.0.1.7:25108"]
    },
    ...
]</pre>

If the pull list cannot be parsed into a list with the expected fields, Keep should return 400 Bad Request.

The pull list should be discarded when Keep exits, or when a new pull list is received.  Only one pull list is retained at a time.
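
The parse-or-400 and try-servers-in-order rules above can be sketched as follows. This is illustrative Python, not the actual Keep Server implementation; @fetch@ and @store@ stand in for the server's HTTP client and storage layers.

```python
# Illustrative sketch of server-side pull-list handling (not the actual
# Keep Server implementation).  fetch and store stand in for the server's
# HTTP client and storage layers.
import json

class PullListError(ValueError):
    """Raised when the body cannot be parsed; maps to 400 Bad Request."""

def parse_pull_list(body):
    """Parse the JSON body into (locator, servers) pairs, or raise."""
    try:
        entries = json.loads(body)
        assert isinstance(entries, list)
        return [(e["locator"], list(e["servers"])) for e in entries]
    except (AssertionError, KeyError, TypeError, ValueError) as exc:
        raise PullListError("malformed pull list") from exc

def process(pull_list, fetch, store):
    """Request each block in order, trying servers in the order given,
    and store the first copy successfully returned."""
    for locator, servers in pull_list:
        for server in servers:
            block = fetch(server, locator)
            if block is not None:
                store(locator, block)
                break
```

Keeping only the most recently received list then amounts to replacing the stored @pull_list@ wholesale on each @PUT /pull@.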

h4. Trash List

The _trash list_ specifies which blocks should be recycled.  The @PUT /trash@ request supplies the trash list as a JSON message in the message body.  When Keep runs low on disk space, it should delete all local copies of each block on the trash list.

Only the Data Manager is allowed to submit a trash list to Keep.  If the sender cannot be authenticated as the Data Manager, 401 Unauthorized is returned.

The trash list consists of a JSON object with the following fields:

* @expiration_time@ - an integer signifying the Unix time at which the trash list expires. After this time, any unprocessed entries on the trash list should be discarded.
* @trash_blocks@ - a list of block hashes to be recycled.

Example:

<pre>{
    "expiration_time":1409082153,
    "trash_blocks":[
        "e4d909c290d0fb1ca068ffaddf22cbd0",
        "55ae4d45d2db0793d53f03e805f656e5",
        "1fd08fc162a5c6413070a8bd0bffc818",
        ...
    ]
}</pre>

If the @PUT /trash@ request does not have this structure, Keep should return 400 Bad Request.

When processing the list, Keep should check that each block is at least two weeks old (or whatever default systemwide expiration time we have chosen). Keep should never delete a block younger than that.
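
These rules can be sketched as follows. This is illustrative Python, not the Keep Server implementation; @mtime@ and @delete@ stand in for the server's storage layer.

```python
# Illustrative sketch of trash-list processing (not the Keep Server
# implementation).  mtime and delete stand in for the storage layer.
import time

TWO_WEEKS = 14 * 24 * 3600  # the systemwide default grace period, in seconds

def process_trash(trash, mtime, delete, now=None):
    """Apply a parsed trash list, honoring expiration and the grace period."""
    now = time.time() if now is None else now
    if now > trash["expiration_time"]:
        return  # unprocessed entries are discarded once the list expires
    for block in trash["trash_blocks"]:
        # Never delete a block younger than the grace period.
        if now - mtime(block) >= TWO_WEEKS:
            delete(block)
```

The grace period protects blocks a client has just written but not yet registered in a collection, which is exactly the guarantee discussed under "Open Questions and Risks" below.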

h3. Detailed Designs

The details of each Keep component will be described in its own document. Currently we have the following documents:

* [[Data Manager Design Doc]] (in progress)
* [[Data Manager]]
* [[Keep Server]]

h3. Debugging

To be determined.

h3. Caveats

To be determined.

h3. Security Concerns

To be determined.

h3. Open Questions and Risks

h4. Pull list

In the /pull list that the Data Manager sends to the Keep Server, who should sign the block locators?
* (TC) The signature mechanism uses a shared key known to all Keep Servers, which means that a Keep Server can easily sign a GET request by itself.
* (TC) Each Keep server can be provisioned with an API token. Any token should work if it passes the usual validation done by the remote Keep Server for GET requests (e.g., it doesn't need admin privileges, but it must not be expired).

Should /pull and /trash be renamed to /pull.json and /trash.json to reflect the expected format of the request body?

* (twp) No -- there is no particular benefit to it. If we expect to support different message formats (e.g. YAML) then we should expect to negotiate it via the Content-Type header.  If we don't, then there's no point.
* (TC) I thought this could help make it more obvious what's happening, much like /index.json and /status.json do.

h4. Trash list

An alternative way of specifying expiration for the trash list would be to list each block's original timestamp along with its hash, instead of listing a single expiration date for the entire list.  Keep should delete the specified block only if the block's actual mtime matches the mtime specified in the trash list.

Advantages of this approach include:
* It is not necessary to include a grace period for deletable blocks.
* The trash list can be used to delete blocks at any time, e.g., for lowering the replication count of blocks whose excessive replicas are new.
* (TC) This means changing the implicit guarantee of PUT:
** from "PUT gives you X days of storage _for this block on this node_ even without telling API server about it"
** to "PUT gives you X days of storage _for this block at up to 2x_ even without telling API server about it".
* (TC) This also provides an extra sanity check: if a "delete block B" request somehow gets delivered to the wrong node, or after a significant delay, it is much less likely to result in accidental data loss. If Data Manager wants to be paranoid, it can notice cases where all 40 copies have the same timestamp, and decline to issue any delete requests until it refreshes 2x of the timestamps to some different value.

Disadvantages:
* If one Keep server has multiple copies of a block on different volumes, with different timestamps, only one copy will be deleted per garbage collection round. (Possible answer: allow the Data Manager to specify an mtime range for blocks to delete.)
** (TC) Another possible solution: Allow for both {block1,timestamp1} and {block1,timestamp2} to appear in the delete list.

h3. Future Work

To be determined.

h3. Revision History

|_.Date          |_.Revisions Made |_.Author        |_.Reviewed By |
| June 23, 2014  | Initial Draft   | Misha Zatsman  |=. ----       |
| July 21, 2014  | Defined Usage   | Misha Zatsman  | Tom Clegg, Ward Vandewege, Tim Pierce |