Version 5 - History - Keep Design Doc - Arvados

1

Misha Zatsman

{{>toc}}

2

3

h1=. Keep Design Doc

4

5

p>. *"Misha Zatsman":mailto:misha@curoverse.com

6

Last Updated: July 21, 2014*

7

8

h2. Overview

9

10

h3. Objective

11

12

Keep is a content-addressable distributed file system that yields high performance in I/O-bound cluster environments. Keep is designed to run on low-cost commodity hardware and integrate with the rest of the Arvados system. It provides high fault tolerance and high aggregate performance to a large number of clients.

13

14

h3. Scope

15

16

This design doc provides an overview of the Keep system along with details about the interfaces exposed to users and shared between components. The intended audience is software engineers and bioinformaticians.

17

18

h3. Background

19

20

* [[Introduction to Arvados]]

21

* [[Technical Architecture]]

22

* [[Provenance and Reproducibility]]

23

24

There is also [[Keep|another overview of the Keep system]].

25

26

h3. Alternatives

27

28

* *"HDFS":http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html* - Is not content-addressed.

29

* *"CEPH":http://ceph.com/* - Is not content-addressed.

30

* *"GFS":http://en.wikipedia.org/wiki/Google_File_System* - Is neither content-addressed nor available outside google.

31

* *"Blobstore":https://developers.google.com/appengine/docs/python/blobstore/* - Is not available outside of "google's app engine":https://developers.google.com/appengine/docs/whatisgoogleappengine

32

33

h3. Design Goals and Tradeoffs

34

35

* *Scale* - Keep architecture is built to handle data sets that grow to petabyte and ultimately exabyte scale.

36

37

* *Fault-Tolerance* - Hardware failure is expected. Installations will include tens of thousands of drives on thousands of nodes; therefore, drive, node, and rack failures are inevitable. Keep has redundancy and recovery capabilities at its core.

38

39

* *Large Files* - Unlike disk file systems, Keep is tuned to store large files from gigabytes to terabytes in size. It also reduces the overhead of storing small files by packing those files’ data into larger data blocks.

40

41

* *High Aggregate Performance Over Low Latency* - Keep is optimized for large batch computations instead of lots of small reads and writes by users. It maximizes overall throughput in a busy cluster environment and data bandwidth from client to disk, while at the same time minimizing disk thrashing commonly caused by multiple simultaneous readers.

42

43

* *Provenance and Reproducibility* - Keep makes tracking the origin of data sets, validating contents of objects, and enabling the reproduction of complex computations on large datasets a core function of the storage system.

44

45

* *Data Preservation* - Keep is ideal for data sets that don’t change once saved (e.g. whole genome sequencing data). It’s organized around a write once, read many (WORM) model rather than create-read-update-delete (CRUD) use cases.

46

47

* *Complex Data Management* - Keep operates well in environments where there are many independent users accessing the same data or users who want to organize data in many different ways. Keep facilitates data sharing without expecting users either to agree with one another about directory structures or to create redundant copies of the data.

48

49

* *Security* - Biomedical data is assumed to have strong security requirements, so Keep has on-disk and in-transport encryption as a native capability.

50

51

* *Ease of use* - The combination of the above goals demands some complexity in the keep system. We endeavor to shield our users from much of the complexity and instead provide them with a simple model of storage so that they can easily and correctly reason about their data.

52

53

h3. Usage

54

55

h4. Persisting Data

56

57

Data stored in Keep is ephemeral by default and will only be saved for two weeks.  To store it for longer, a user must create a persisted collection containing the data.

58

59

To permanently store data in Keep, a user must make the following two calls in the Keep Client:

60

61

# The user stores data which will get recorded in one or more blocks.

62

# The user creates a persisted collection containing one or more blocks.

63

64

* A collection's permanence is either ephemeral or persisted.

65

* Ephemeral collections will be deleted by Keep two weeks after the collection became ephemeral, but not before.

66

* Persisted collections will be kept forever (or until a user changes its permanence to ephemeral).

67

* Both ephemeral and persisted collections can be read and used as inputs for jobs, deleted collections cannot.

68

* A user may switch a collection's permanence back and forth between persisted and ephemeral.

69

* A user may delete a collection at any point regardless of it's permanence.

70

71

The following chart illustrates the possible permanence state transitions for collections. Solid arrows indicate transitions that users can perform. The dashed arrow indicates that ephemeral collections can be deleted by users or by Keep.

72

73

!https://docs.google.com/drawings/d/1AVp3zuyCzgWNNjgF-HoI-SbNdeblXMR8vmO8-SyUigM/pub?w=727&h=304!:https://docs.google.com/a/curoverse.com/drawings/d/1AVp3zuyCzgWNNjgF-HoI-SbNdeblXMR8vmO8-SyUigM/edit?usp=sharing

74

75

h4. Referencing a Collection

76

77

Users can refer to collections stored in Keep using either a UUID or a name.

78

79

* A UUID is a _Universally Unique IDentifier_ which Keep assigns to a collection when it is created. Other Arvados entities have UUIDs too.

80

* A user can assign a name to a collection through the Keep Client, either when creating the collection or at a later time.

81

* Collections have at most one name, so assigning a name to a collection will overwrite the previous name, if any.

82

* Collections aren't required to have names, but some parts of Arvados (e.g. fuse mounts) will not work with collections without names.

83

* Within a particular project, each named collection must have a unique name.

84

85

h3. Architecture

86

87

!https://docs.google.com/drawings/d/14NSDPiQSeAeWkwHYuXIkFrupwoBwRYjmqm0wemkAvTA/pub?w=841&h=222!:https://docs.google.com/a/curoverse.com/drawings/d/14NSDPiQSeAeWkwHYuXIkFrupwoBwRYjmqm0wemkAvTA/edit

88

89

The Keep system involves three components:

90

* The *Keep Client* can be built into any binary through an sdk. The client is responsible for:

91

** Sending requests to write and read blocks.

92

** Determining which servers to send each read/write request to.

93

** Combining collections of files and breaking them into blocks.

94

** Replicating blocks to the level specified by the user.

95

** Reporting the specified replication levels for collections to the API server.

96

* The *Keep Server* stores and returns blocks in response to requests it receives.

97

** By design, each keep server instance runs independently of the others.

98

** A given instance does not communicate with other instances, know where they are or which blocks they store.

99

* The *[[Data Manager Design Doc|Data Manager]]* periodically examines the aggregate state of keep storage across keep server instances. The Data Manager is responsible for:

100

** Validating that each block is sufficiently replicated.

101

** Reporting on user disk usage.

102

** Determining which blocks should be deleted when space is needed.

103

104

h2. Specifics

105

106

h3. Interfaces

107

108

h4. Locators

109

110

Each block is identified by a locator, which has the following format (described in "BNF":http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form):

111

112

<pre>locator      ::= address hint*

113

address      ::= digest size

115

1

Misha Zatsman

size         ::= "+" [0-9]+

116

hint         ::= "+" hint-type hint-content

117

hint-type    ::= [A-Z]

118

hint-content ::= [A-Za-z0-9@_-]+</pre>

119

120

Individual hints may have their own required format:

121

122

<pre>sign-hint      ::= "+A" <40 lowercase hex digits> "@" sign-timestamp

123

sign-timestamp ::= <8 lowercase hex digits></pre>

124

125

h4. Keep Server

126

127

_locators_ are built from the relevant block, as described above.

128

129

The Keep Server supports the following http requests sent by the Keep Client.

130

131

|_.Request                                    |_. Meaning                                                                                 |_.Possible Responses |

132

| PUT /locator                               | Write a block

133

                                                       The content of the block goes in the body of the request.        | -- |

134

| HEAD /locator                            | Check whether block exists                                                       | -- |

135

| HEAD /locator?checksum=true   | Check whether block exists and verify checksum                      | -- |

136

| GET /locator                               | Retrieve a block                                                                         | -- |

137

| GET /locator?checksum=true     | Retrieve a block, verify checksum on server before sending       | -- |

138

139

The Keep Server also supports the following requests sent by the Data Manager. Each of these requests requires a privileged token.

140

141

|_.Request                                 |_. Meaning                                                                                 |_.Possible Responses |

142

| DELETE /locator                      | delete all copies of the specified block                                       | 200 OK: body like {"copies_deleted":2,"copies_not_deleted":1} (this would mean one copy was found on a read-only volume, two copies were found on writable volumes).

143

                                                                                                                                                       401 Unauthorized: Token is invalid, probably expired.

144

                                                                                                                                                       403 Forbidden: Scope problem.

145

                                                                                                                                                       403 Forbidden: User does not have admin privileges.

146

                                                                                                                                                       404 Not Found: Block not found on this server.

147

                                                                                                                                                       422 Unprocessable Entity: Block was written too recently to be deleted. |

148

| GET /index                             | get full list of blocks stored here

149

                                                   including size and unix timestamp of most recent PUT              | -- |

150

| GET /index/prefix                 | get list of blocks stored here whose locators start with _prefix_

151

                                                   including size and unix timestamp of most recent PUT              | -- |

152

| GET /state.json                       | get list of backing filesystems, disk fullness,

153

                                                   IO counters, and perhaps recent IO statistics                             | -- |

154

| PUT /trash                               | Update the trash list.

155

                                                    The trash list specifies blocks that Keep should delete when it runs low on disk space. For details see "Trash List" below.        | 200 OK

156

400 Bad Request

157

401 Unauthorized |

158

| PUT /pull                               | Update the pull list.

159

                                                    The pull list specifies blocks that Keep should pull from other Keep servers in order to maintain appropriate replication levels. For details, see "Pull List", below.        | 200 OK

160

400 Bad Request

161

401 Unauthorized |

162

163

h4. Pull List

164

165

The _pull list_ is a JSON message containing a list of blocks to pull from sibling Keep servers. It is supplied as the message body in @PUT /pull@ requests.  When Keep has resources available, it should request these blocks in order, trying the servers in the order given, and store the block returned in its own storage.

166

167

Only the Data Manager is allowed to submit a pull list to Keep.  If the sender cannot be authenticated as the Data Manager, 401 Unauthorized is returned.

168

169

Each element of the pull list is an object with the following fields:

170

* @locator@ - a Keep locator string specifying the block to be replicated

172

1

Misha Zatsman

173

An example pull list:

174

175

176

177

         "locator":"e4d909c290d0fb1ca068ffaddf22cbd0+4985",

179

1

Misha Zatsman

},

180

181

         "locator":"55ae4d45d2db0793d53f03e805f656e5+658395",

183

1

Misha Zatsman

},

184

...

185

186

187

If the pull list cannot be parsed into a list with the expected fields, Keep should return 400 Bad Request.

188

189

The pull list should be discarded when Keep exits, or when a new pull list is received.  Only one pull list is retained at a time.

190

191

h4. Trash List

192

193

The _trash list_ specifies which blocks should be recycled.  The @PUT /trash@ request supplies the trash list as a JSON message in the message body.  When Keep runs low on disk space, it should delete all local copies of each block on the trash list.

194

195

Only the Data Manager is allowed to submit a trash list to Keep.  If the sender cannot be authenticated as the Data Manager, 401 Unauthorized is returned.

196

197

The trash list consists of a JSON object with the following fields:

198

199

* @expiration_time@ - an integer signifying the Unix time at which the trash list expires. After this time, any unprocessed entries on the trash list should be discarded.

200

* @trash_blocks@ - a list of block hashes to be recycled.

201

202

Example:

203

204

205

     "expiration_time":1409082153

206

     "trash_blocks":[

207

         "e4d909c290d0fb1ca068ffaddf22cbd0",

208

         "55ae4d45d2db0793d53f03e805f656e5",

209

         "1fd08fc162a5c6413070a8bd0bffc818",

210

...

211

212

213

214

If the @PUT /trash@ request does not have this structure, Keep should return 400 Bad Request.

215

216

When processing the list, Keep should check that each block is at least two weeks old (or whatever default systemwide expiration time we have chosen). Keep should never delete a block younger than that.

217

218

h3. Detailed Designs

219

220

The details of each Keep component will be described in its own document. Currently we have the following documents:

221

222

*  [[Data Manager Design Doc]] (in progress)

223

*  [[Data Manager]]

224

*  [[Keep Server]]

225

226

h3. Debugging

227

228

To be determined.

229

230

h3. Caveats

231

232

To be determined.

233

234

h3. Security Concerns

235

236

To be determined.

237

238

h3. Open Questions and Risks

239

240

h4. Pull list

241

242

In the /pull list that the Data Manager sends to the Keep Server, who should sign the block locators?

243

* (TC) The signature mechanism uses a shared key known to all Keep Servers, which means that a Keep Server can easily sign a GET request by itself.

244

* (TC) Each Keep server can be provisioned with an API token. Any token should work if it passes the usual validation done by the remote Keep Server for GET requests (e.g., it doesn't need admin privileges, but it must not be expired).

245

246

Should /pull and /trash be renamed to /pull.json and /trash.json to reflect the expected format of the request body?

247

248

* (twp) No -- there is no particular benefit to it. If we expect to support different message formats (e.g. YAML) then we should expect to negotiate it via the Content-Type header.  If we don't, then there's no point.

249

* (TC) I thought this could help make it more obvious what's happening, much like /index.json and /status.json do.

250

251

h4. Trash list

252

253

An alternative way of specifying expiration for the trash list would be to list each block's original timestamp along with its hash, instead of listing a single expiration date for the entire list.  Keep should delete the specified block only if the block's actual mtime matches the mtime specified in the trash list.

254

255

Advantages of this approach include:

256

* It is not necessary to include a grace period for deletable blocks.

257

* The trash list can be used to delete blocks at any time, e.g., for lowering replication count of blocks whose excessive replicas are new.

258

* (TC) This means changing the implicit guarantee of PUT:

259

** from "PUT gives you X days of storage _for this block on this node_ even without telling API server about it"

260

** to "PUT gives you X days of storage _for this block at up to 2x_ even without telling API server about it".

261

* (TC) This also provides an extra sanity check: if a "delete block B" request somehow gets delivered to the wrong node, or after a significant delay, it is much less likely to result in accidental data loss. If Data Manager wants to be paranoid, it can notice cases where all 40 copies have the same timestamp, and decline to issue any delete requests until it refreshes 2x of the timestamps to some different value.

262

263

Disadvantages:

264

* If one Keep server has multiple copies of a block on different volumes, with different timestamps, only one copy will be deleted per garbage collection round. (Possible answer: allow data manager to specify a mtime range for blocks to delete).

265

** (TC) Another possible solution: Allow for both {block1,timestamp1} and {block1,timestamp2} to appear in the delete list.

266

267

268

h3. Future Work

269

270

To be determined.

271

272

h3. Revision History

273

274

|_.Date              |_.Revisions Made |_.Author            |_.Reviewed By     |

275

| June 23, 2014 | Initial Draft         | Misha Zatsman |=. ----              |

276

| July 21, 2014  | Defined Usage    | Misha Zatsman | Tom Clegg, Ward Vandewege, Tim Pierce |

Project

General

Profile

Arvados

Keep Design Doc » History » Version 5