Revision 1 (Tom Clegg, 04/10/2013 04:23 PM) → Revision 2/26 (Tom Clegg, 04/10/2013 06:04 PM)
h1. Keep

Keep is a distributed content-addressable storage system designed for high performance in I/O-bound cluster environments. Notable design goals and features include:

* High scalability
* Node-level redundancy
* Maximum overall throughput in a busy cluster environment
* Maximum data bandwidth from client to disk
* Minimum transaction overhead
* Elimination of disk thrashing (commonly caused by multiple simultaneous readers)
* Client-controlled redundancy

h2. Design

The above goals are accomplished by the following design features.

* Data is transferred directly between the client and the physical node where the disk is installed.
* Data collections are encoded in large (≤64 MiB) blocks to minimize short read/write operations.
* Each disk accepts only one block-read/write operation at a time. This prevents disk thrashing and maximizes total throughput when many clients compete for a disk.
* Storage redundancy is directly controlled, and can be easily verified, by the client, simply by reading or writing a block of data on multiple nodes.
* Data block distribution is computed based on the MD5 digest of the data block being stored or retrieved. This eliminates the need for a central or synchronized database of block storage locations.

h2. Components

The Keep storage system consists of data block read/write services, SDKs, and management agents.
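As a minimal sketch of the digest-based distribution described under Design: the locator (hex MD5 digest plus size) identifies the block, and every client independently derives the same node ranking from it, so no central location database is needed. The node names and the specific probe-order rule here (rank each node by hashing the block digest together with the node identifier, a rendezvous-hashing variant) are illustrative assumptions, not necessarily Keep's actual algorithm.

```python
import hashlib

def block_locator(data: bytes) -> str:
    """Return a content address for a data block: MD5 hex digest plus size."""
    return f"{hashlib.md5(data).hexdigest()}+{len(data)}"

def probe_order(locator: str, nodes: list) -> list:
    """Deterministically rank storage nodes for a block by hashing the
    block digest together with each node's identifier. Every client
    computes the same ranking from the same inputs."""
    digest = locator.split("+")[0]
    return sorted(nodes, key=lambda n: hashlib.md5((digest + n).encode()).hexdigest())

block = b"hello world\n"
loc = block_locator(block)
nodes = ["keep0", "keep1", "keep2", "keep3"]  # hypothetical node identifiers
order = probe_order(loc, nodes)
```

A client writing with redundancy 2 would simply store the block on the first two reachable nodes in @order@; a reader probes the same list in the same sequence.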
The responsibilities of the Keep service are:

* Write data blocks
* When writing: ensure data integrity by comparing the client-supplied MD5 digest to the client-supplied data
* Read data blocks (subject to permission, which is determined by the system/metadata DB)
* Send read/write/error event logs to management agents

The responsibilities of the SDK are:

* When writing: split data into ≤64 MiB chunks
* When writing: encode directory trees as manifests
* When writing: write data to the desired number of nodes to achieve storage redundancy
* After writing: register a collection with Arvados
* When reading: parse manifests
* When reading: verify data integrity by comparing the locator to the MD5 digest of the retrieved data
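The SDK-side chunking and the integrity checks performed on both write (by the service) and read (by the SDK) can be sketched as follows. This is an illustrative sketch only; the tiny block size in the demonstration stands in for the real 64 MiB limit.

```python
import hashlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MiB: the maximum Keep block size

def split_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Yield (locator, chunk) pairs for successive chunks of at most block_size bytes."""
    for i in range(0, len(data), block_size):
        chunk = data[i:i + block_size]
        yield f"{hashlib.md5(chunk).hexdigest()}+{len(chunk)}", chunk

def verify_block(locator: str, data: bytes) -> bool:
    """Re-hash the data and compare it with the digest and size in the locator.
    The service runs this check on write; the SDK runs it again on read."""
    digest, size = locator.split("+")
    return hashlib.md5(data).hexdigest() == digest and len(data) == int(size)

# Tiny block size for demonstration; real clients would use BLOCK_SIZE.
pairs = list(split_blocks(b"0123456789", block_size=4))
ok = all(verify_block(loc, chunk) for loc, chunk in pairs)
```

Because the locator is derived from the content itself, any corruption in transit or on disk is detectable by either party without consulting a central authority.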
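A hedged sketch of manifest encoding: the single-line, space-delimited layout used here (a stream name for the directory, then block locators, then @position:size:filename@ tokens locating each file within the stream's concatenated block data) is an illustrative assumption about the format, and each file is stored as its own block for simplicity.

```python
import hashlib

def encode_manifest(stream_name: str, files) -> str:
    """Encode one directory ("stream") as a single manifest line.

    files: list of (filename, content) pairs. The file tokens record each
    file's offset and size within the stream's concatenated block data.
    """
    locators, tokens, offset = [], [], 0
    for name, data in files:
        locators.append(f"{hashlib.md5(data).hexdigest()}+{len(data)}")
        tokens.append(f"{offset}:{len(data)}:{name}")
        offset += len(data)
    return " ".join([stream_name] + locators + tokens) + "\n"

manifest = encode_manifest(".", [("foo.txt", b"foo"), ("bar.txt", b"bar")])
```

A reader parsing such a line recovers the block list to fetch and the byte ranges belonging to each file, which is all the SDK needs to reassemble the directory tree.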