Keep Design Doc » History » Revision 3
Revision 2 (Peter Amstutz, 06/03/2015 07:53 PM) → Revision 3/5 (Peter Amstutz, 06/03/2015 07:54 PM)
{{>toc}} h1=. Keep Design Doc p>. *"Misha Zatsman":mailto:misha@curoverse.com Last Updated: July 21, 2014* h2. Overview h3. Objective Keep is a content-addressable distributed file system that yields high performance in I/O-bound cluster environments. Keep is designed to run on low-cost commodity hardware and integrate with the rest of the Arvados system. It provides high fault tolerance and high aggregate performance to a large number of clients. h3. Scope This design doc provides an overview of the Keep system along with details about the interfaces exposed to users and shared between components. The intended audience is software engineers and bioinformaticians. h3. Background * [[Introduction to Arvados]] * [[Technical Architecture]] * [[Provenance and Reproducibility]] There is also [[Keep|another overview of the Keep system]]. h3. Alternatives * *"HDFS":http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html* - Is not content-addressed. * *"CEPH":http://ceph.com/* - Is not content-addressed. * *"GFS":http://en.wikipedia.org/wiki/Google_File_System* - Is neither content-addressed nor available outside google. * *"Blobstore":https://developers.google.com/appengine/docs/python/blobstore/* - Is not available outside of "google's app engine":https://developers.google.com/appengine/docs/whatisgoogleappengine h3. Design Goals and Tradeoffs * *Scale* - Keep architecture is built to handle data sets that grow to petabyte and ultimately exabyte scale. * *Fault-Tolerance* - Hardware failure is expected. Installations will include tens of thousands of drives on thousands of nodes; therefore, drive, node, and rack failures are inevitable. Keep has redundancy and recovery capabilities at its core. * *Large Files* - Unlike disk file systems, Keep is tuned to store large files from gigabytes to terabytes in size. It also reduces the overhead of storing small files by packing those files’ data into larger data blocks. * *High Aggregate Performance Over Low Latency* - Keep is optimized for large batch computations instead of lots of small reads and writes by users. It maximizes overall throughput in a busy cluster environment and data bandwidth from client to disk, while at the same time minimizing disk thrashing commonly caused by multiple simultaneous readers. * *Provenance and Reproducibility* - Keep makes tracking the origin of data sets, validating contents of objects, and enabling the reproduction of complex computations on large datasets a core function of the storage system. * *Data Preservation* - Keep is ideal for data sets that don’t change once saved (e.g. whole genome sequencing data). It’s organized around a write once, read many (WORM) model rather than create-read-update-delete (CRUD) use cases. * *Complex Data Management* - Keep operates well in environments where there are many independent users accessing the same data or users who want to organize data in many different ways. Keep facilitates data sharing without expecting users either to agree with one another about directory structures or to create redundant copies of the data. * *Security* - Biomedical data is assumed to have strong security requirements, so Keep has on-disk and in-transport encryption as a native capability. * *Ease of use* - The combination of the above goals demands some complexity in the keep system. We endeavor to shield our users from much of the complexity and instead provide them with a simple model of storage so that they can easily and correctly reason about their data. h3. Usage h4. Persisting Data Data stored in Keep is ephemeral by default and will only be saved for two weeks. To store it for longer, a user must create a persisted collection containing the data. To permanently store data in Keep, a user must make the following two calls in the Keep Client: # The user stores data which will get recorded in one or more blocks. # The user creates a persisted collection containing one or more blocks. * A collection's permanence is either ephemeral or persisted. * Ephemeral collections will be deleted by Keep two weeks after the collection became ephemeral, but not before. * Persisted collections will be kept forever (or until a user changes its permanence to ephemeral). * Both ephemeral and persisted collections can be read and used as inputs for jobs, deleted collections cannot. * A user may switch a collection's permanence back and forth between persisted and ephemeral. * A user may delete a collection at any point regardless of it's permanence. The following chart illustrates the possible permanence state transitions for collections. Solid arrows indicate transitions that users can perform. The dashed arrow indicates that ephemeral collections can be deleted by users or by Keep. !https://docs.google.com/drawings/d/1AVp3zuyCzgWNNjgF-HoI-SbNdeblXMR8vmO8-SyUigM/pub?w=727&h=304!:https://docs.google.com/a/curoverse.com/drawings/d/1AVp3zuyCzgWNNjgF-HoI-SbNdeblXMR8vmO8-SyUigM/edit?usp=sharing h4. Referencing a Collection Users can refer to collections stored in Keep using either a UUID or a name. * A UUID is a _Universally Unique IDentifier_ which Keep assigns to a collection when it is created. Other Arvados entities have UUIDs too. * A user can assign a name to a collection through the Keep Client, either when creating the collection or at a later time. * Collections have at most one name, so assigning a name to a collection will overwrite the previous name, if any. * Collections aren't required to have names, but some parts of Arvados (e.g. fuse mounts) will not work with collections without names. * Within a particular project, each named collection must have a unique name. h3. Architecture !https://docs.google.com/drawings/d/14NSDPiQSeAeWkwHYuXIkFrupwoBwRYjmqm0wemkAvTA/pub?w=841&h=222!:https://docs.google.com/a/curoverse.com/drawings/d/14NSDPiQSeAeWkwHYuXIkFrupwoBwRYjmqm0wemkAvTA/edit The Keep system involves three components: * The *Keep Client* can be built into any binary through an sdk. The client is responsible for: ** Sending requests to write and read blocks. ** Determining which servers to send each read/write request to. ** Combining collections of files and breaking them into blocks. ** Replicating blocks to the level specified by the user. ** Reporting the specified replication levels for collections to the API server. * The *Keep Server* stores and returns blocks in response to requests it receives. ** By design, each keep server instance runs independently of the others. ** A given instance does not communicate with other instances, know where they are or which blocks they store. * The *[[Data Manager Design Doc|Data Manager]]* periodically examines the aggregate state of keep storage across keep server instances. The Data Manager is responsible for: ** Validating that each block is sufficiently replicated. ** Reporting on user disk usage. ** Determining which blocks should be deleted when space is needed. h2. Specifics h3. Interfaces h4. Locators Each block is identified by a locator, which has the following format (described in "BNF":http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form): <pre>locator ::= address hint* address ::= digest size digest ::= <32 hexadecimal digits> size ::= "+" [0-9]+ hint ::= "+" hint-type hint-content hint-type ::= [A-Z] hint-content ::= [A-Za-z0-9@_-]+</pre> Individual hints may have their own required format: <pre>sign-hint ::= "+A" <40 lowercase hex digits> "@" sign-timestamp sign-timestamp ::= <8 lowercase hex digits></pre> h4. Keep Server _locators_ are built from the relevant block, as described above. The Keep Server supports the following http requests sent by the Keep Client. |_.Request |_. Meaning |_.Possible Responses | | PUT /locator | Write a block The content of the block goes in the body of the request. | -- | | HEAD /locator | Check whether block exists | -- | | HEAD /locator?checksum=true | Check whether block exists and verify checksum | -- | | GET /locator | Retrieve a block | -- | | GET /locator?checksum=true | Retrieve a block, verify checksum on server before sending | -- | The Keep Server also supports the following requests sent by the Data Manager. Each of these requests requires a privileged token. |_.Request |_. Meaning |_.Possible Responses | | DELETE /locator | delete all copies of the specified block | 200 OK: body like {"copies_deleted":2,"copies_not_deleted":1} (this would mean one copy was found on a read-only volume, two copies were found on writable volumes). 401 Unauthorized: Token is invalid, probably expired. 403 Forbidden: Scope problem. 403 Forbidden: User does not have admin privileges. 404 Not Found: Block not found on this server. 422 Unprocessable Entity: Block was written too recently to be deleted. | | GET /index | get full list of blocks stored here including size and unix timestamp of most recent PUT | -- | | GET /index/prefix | get list of blocks stored here whose locators start with _prefix_ including size and unix timestamp of most recent PUT | -- | | GET /state.json | get list of backing filesystems, disk fullness, IO counters, and perhaps recent IO statistics | -- | | PUT /trash | Update the trash list. The trash list specifies blocks that Keep should delete when it runs low on disk space. For details see "Trash List" below. | 200 OK 400 Bad Request 401 Unauthorized | | PUT /pull | Update the pull list. The pull list specifies blocks that Keep should pull from other Keep servers in order to maintain appropriate replication levels. For details, see "Pull List", below. | 200 OK 400 Bad Request 401 Unauthorized | h4. Pull List The _pull list_ is a JSON message containing a list of blocks to pull from sibling Keep servers. It is supplied as the message body in @PUT /pull@ requests. When Keep has resources available, it should request these blocks in order, trying the servers in the order given, and store the block returned in its own storage. Only the Data Manager is allowed to submit a pull list to Keep. If the sender cannot be authenticated as the Data Manager, 401 Unauthorized is returned. Each element of the pull list is an object with the following fields: * @locator@ - a Keep locator string specifying the block to be replicated * @servers@ - a list of servers, each one specified as a URL @"schema://host:port"@. @"host:port"@. An example pull list: [ { "locator":"e4d909c290d0fb1ca068ffaddf22cbd0+4985", "servers":["https://keep0.qr1hi.arvadosapi.com:25107","https://keep1.qr1hi.arvadosapi.com:25108"] }, { "locator":"55ae4d45d2db0793d53f03e805f656e5+658395", "servers":["http://10.0.1.5:25107","http://10.0.1.6:25107","http://10.0.1.7:25108"] }, ... ] If the pull list cannot be parsed into a list with the expected fields, Keep should return 400 Bad Request. The pull list should be discarded when Keep exits, or when a new pull list is received. Only one pull list is retained at a time. h4. Trash List The _trash list_ specifies which blocks should be recycled. The @PUT /trash@ request supplies the trash list as a JSON message in the message body. When Keep runs low on disk space, it should delete all local copies of each block on the trash list. Only the Data Manager is allowed to submit a trash list to Keep. If the sender cannot be authenticated as the Data Manager, 401 Unauthorized is returned. The trash list consists of a JSON object with the following fields: * @expiration_time@ - an integer signifying the Unix time at which the trash list expires. After this time, any unprocessed entries on the trash list should be discarded. * @trash_blocks@ - a list of block hashes to be recycled. Example: { "expiration_time":1409082153 "trash_blocks":[ "e4d909c290d0fb1ca068ffaddf22cbd0", "55ae4d45d2db0793d53f03e805f656e5", "1fd08fc162a5c6413070a8bd0bffc818", ... ] } If the @PUT /trash@ request does not have this structure, Keep should return 400 Bad Request. When processing the list, Keep should check that each block is at least two weeks old (or whatever default systemwide expiration time we have chosen). Keep should never delete a block younger than that. h3. Detailed Designs The details of each Keep component will be described in its own document. Currently we have the following documents: * [[Data Manager Design Doc]] (in progress) * [[Data Manager]] * [[Keep Server]] h3. Debugging To be determined. h3. Caveats To be determined. h3. Security Concerns To be determined. h3. Open Questions and Risks h4. Pull list In the /pull list that the Data Manager sends to the Keep Server, who should sign the block locators? * (TC) The signature mechanism uses a shared key known to all Keep Servers, which means that a Keep Server can easily sign a GET request by itself. * (TC) Each Keep server can be provisioned with an API token. Any token should work if it passes the usual validation done by the remote Keep Server for GET requests (e.g., it doesn't need admin privileges, but it must not be expired). Should /pull and /trash be renamed to /pull.json and /trash.json to reflect the expected format of the request body? * (twp) No -- there is no particular benefit to it. If we expect to support different message formats (e.g. YAML) then we should expect to negotiate it via the Content-Type header. If we don't, then there's no point. * (TC) I thought this could help make it more obvious what's happening, much like /index.json and /status.json do. h4. Trash list An alternative way of specifying expiration for the trash list would be to list each block's original timestamp along with its hash, instead of listing a single expiration date for the entire list. Keep should delete the specified block only if the block's actual mtime matches the mtime specified in the trash list. Advantages of this approach include: * It is not necessary to include a grace period for deletable blocks. * The trash list can be used to delete blocks at any time, e.g., for lowering replication count of blocks whose excessive replicas are new. * (TC) This means changing the implicit guarantee of PUT: ** from "PUT gives you X days of storage _for this block on this node_ even without telling API server about it" ** to "PUT gives you X days of storage _for this block at up to 2x_ even without telling API server about it". * (TC) This also provides an extra sanity check: if a "delete block B" request somehow gets delivered to the wrong node, or after a significant delay, it is much less likely to result in accidental data loss. If Data Manager wants to be paranoid, it can notice cases where all 40 copies have the same timestamp, and decline to issue any delete requests until it refreshes 2x of the timestamps to some different value. Disadvantages: * If one Keep server has multiple copies of a block on different volumes, with different timestamps, only one copy will be deleted per garbage collection round. (Possible answer: allow data manager to specify a mtime range for blocks to delete). ** (TC) Another possible solution: Allow for both {block1,timestamp1} and {block1,timestamp2} to appear in the delete list. h3. Future Work To be determined. h3. Revision History |_.Date |_.Revisions Made |_.Author |_.Reviewed By | | June 23, 2014 | Initial Draft | Misha Zatsman |=. ---- | | July 21, 2014 | Defined Usage | Misha Zatsman | Tom Clegg, Ward Vandewege, Tim Pierce |