Revision 2 (Tom Clegg, 04/10/2013 06:04 PM) → Revision 3/26 (Tom Clegg, 04/10/2013 10:21 PM)

h1. Keep 

 Keep is a distributed content-addressable storage system designed for high performance in I/O-bound cluster environments. 

 Notable design goals and features include: 

 * High scalability 
 * Node-level redundancy 
 * Maximum overall throughput in a busy cluster environment 
 * Maximum data bandwidth from client to disk 
 * Minimum transaction overhead 
 * Elimination of disk thrashing (commonly caused by multiple simultaneous readers) 
 * Client-controlled redundancy 

 h2. Design 

 The above goals are accomplished by the following design features. 

 * Data is transferred directly between the client and the physical node where the disk is installed. 
 * Data collections are encoded in large (≤64 MiB) blocks to minimize short read/write operations. 
 * Each disk accepts only one block-read/write operation at a time. This prevents disk thrashing and maximizes total throughput when many clients compete for a disk. 
 * The client controls storage redundancy directly by writing a block of data to multiple nodes, and can verify it just as easily by reading the block back from those nodes. 
 * Data block distribution is computed based on the MD5 digest of the data block being stored or retrieved. This eliminates the need for a central or synchronized database of block storage locations. 
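
As an illustration of digest-based distribution, the sketch below ranks storage nodes for a block using only the block's MD5 digest and the list of node names, so any client can compute the same order without consulting a location database. The ranking method shown (rendezvous hashing) is an assumption chosen for illustration; this page does not specify the actual placement algorithm. The names @block_digest@, @probe_order@, @write_targets@, and the node names are hypothetical.

```python
import hashlib


def block_digest(data: bytes) -> str:
    """Return the hex MD5 digest that identifies a block."""
    return hashlib.md5(data).hexdigest()


def probe_order(digest: str, nodes: list[str]) -> list[str]:
    """Rank nodes for a block (assumed scheme: rendezvous hashing).

    Each node gets a weight derived from the block digest and the node
    name; sorting by that weight gives an order that depends only on the
    digest and the set of nodes, not on any shared state.
    """
    def weight(node: str) -> str:
        return hashlib.md5((digest + node).encode("utf-8")).hexdigest()

    return sorted(nodes, key=weight)


def write_targets(digest: str, nodes: list[str], copies: int) -> list[str]:
    """Pick the first `copies` nodes in probe order as write targets."""
    return probe_order(digest, nodes)[:copies]


if __name__ == "__main__":
    nodes = ["keep0", "keep1", "keep2", "keep3"]  # hypothetical node names
    digest = block_digest(b"example block")
    print(digest, write_targets(digest, nodes, copies=2))
```

A reader would walk the same probe order and stop at the first node that holds the block.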

 h2. Components 

 The Keep storage system consists of data block read/write services, SDKs, and management agents. 

 The responsibilities of the Keep service are: 

 * Write data blocks 
 * When writing: ensure data integrity by comparing the client-supplied MD5 digest to the digest of the client-supplied data 
 * Read data blocks (subject to permission, which is determined by the system/metadata DB) 
 * Send read/write/error event logs to management agents 
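
The write-time integrity check can be sketched as follows. This is a minimal illustration, not the Keep service's code: the in-memory @store@ dictionary stands in for a disk, and @put_block@ and @DigestMismatch@ are hypothetical names.

```python
import hashlib


class DigestMismatch(Exception):
    """Raised when the supplied digest does not match the supplied data."""


def put_block(store: dict, claimed_md5: str, data: bytes) -> str:
    """Store a block only if its MD5 digest matches the client's claim."""
    actual = hashlib.md5(data).hexdigest()
    if actual != claimed_md5.lower():
        # Reject the write: the data was corrupted in transit or the
        # client computed the digest over different bytes.
        raise DigestMismatch(f"claimed {claimed_md5}, computed {actual}")
    store[actual] = data  # the digest is the storage key
    return actual
```

Because the digest is also the storage key, a block that passes this check can later be verified again by any reader.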

 The responsibilities of the SDK are: 

 * When writing: split data into ≤64 MiB chunks 
 * When writing: encode directory trees as manifests 
 * When writing: write data to the desired number of nodes to achieve storage redundancy 
 * After writing: register a collection with Arvados 
 * When reading: parse manifests 
 * When reading: verify data integrity by comparing locator to MD5 digest of retrieved data 
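
The splitting and read-verification steps above can be sketched as follows. The sketch assumes that a locator is simply the hex MD5 digest of the block; the function names are hypothetical and not taken from any SDK.

```python
import hashlib
from typing import Iterator

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MiB upper bound on block size


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Yield consecutive chunks of at most `block_size` bytes."""
    for offset in range(0, len(data), block_size):
        yield data[offset:offset + block_size]


def locator(block: bytes) -> str:
    """Locator for a block (assumed here to be its hex MD5 digest)."""
    return hashlib.md5(block).hexdigest()


def verify_block(loc: str, block: bytes) -> bool:
    """After reading, check that the retrieved data matches its locator."""
    return hashlib.md5(block).hexdigest() == loc
```

A client would call @split_blocks@ and @locator@ before writing, and @verify_block@ on every block it reads back.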

 The responsibilities of management agents are: 

 * Verify validity of permission tokens 
 * Determine which blocks have higher or lower redundancy than required 
 * Monitor disk space and move or delete blocks as needed 
 * Collect per-user, per-group, per-node, and per-disk usage statistics
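
As a sketch of the redundancy check, the function below counts how many nodes hold each block and reports the blocks that fall below or above a required number of copies. It assumes a management agent can obtain, for each node, the set of block digests stored there; how that index is collected is not described on this page, and @redundancy_report@ is a hypothetical name.

```python
from collections import Counter


def redundancy_report(node_indexes: dict[str, set[str]], required: int):
    """Return (under, over): blocks with too few or too many copies.

    `node_indexes` maps a node name to the set of block digests it holds.
    Blocks with no copies at all never appear in any index, so they cannot
    be detected from this input alone.
    """
    copies = Counter()
    for digests in node_indexes.values():
        copies.update(digests)

    under = {d: n for d, n in copies.items() if n < required}
    over = {d: n for d, n in copies.items() if n > required}
    return under, over
```

Blocks in @under@ are candidates for copying to another node, and blocks in @over@ are candidates for deletion when disk space runs low.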