Tom Clegg, 04/10/2013 06:04 PM


Keep

Keep is a distributed content-addressable storage system designed for high performance in I/O-bound cluster environments.

Notable design goals and features include:

  • High scalability
  • Node-level redundancy
  • Maximum overall throughput in a busy cluster environment
  • Maximum data bandwidth from client to disk
  • Minimum transaction overhead
  • Elimination of disk thrashing (commonly caused by multiple simultaneous readers)
  • Client-controlled redundancy

Design

The above goals are accomplished by the following design features.

  • Data is transferred directly between the client and the physical node where the disk is installed.
  • Data collections are encoded in large (≤64 MiB) blocks to minimize short read/write operations.
  • Each disk serves only one block read or write operation at a time. This prevents disk thrashing and maximizes total throughput when many clients compete for a disk.
  • The client controls storage redundancy directly by writing a block of data to multiple nodes, and can verify it just as easily by reading the block back from each of those nodes.
  • Data block distribution is computed based on the MD5 digest of the data block being stored or retrieved. This eliminates the need for a central or synchronized database of block storage locations.
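
As a rough illustration of digest-based distribution, the sketch below derives a node ordering for a block from nothing but the block's MD5 digest and the list of node names. The ranking scheme (hashing the digest together with each node name, in the style of rendezvous hashing), the function name probe_order, and the node names are all assumptions made for this example. The text above does not specify how Keep actually maps digests to nodes.

```python
import hashlib


def probe_order(block_digest: str, nodes: list) -> list:
    """Return the nodes in the order a client would try them for this block.

    Hypothetical scheme: rank each node by MD5(block digest + node name).
    Any client that knows the same node list computes the same order, so no
    central database of block locations is needed.
    """
    def weight(node: str) -> str:
        return hashlib.md5((block_digest + node).encode()).hexdigest()

    return sorted(nodes, key=weight)


data = b"example block"
digest = hashlib.md5(data).hexdigest()
nodes = ["keep0", "keep1", "keep2", "keep3"]  # hypothetical node names
order = probe_order(digest, nodes)
```

A client wanting two copies would write to the first two nodes in order, and a reader would probe the nodes in that same order.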

Components

The Keep storage system consists of data block read/write services, SDKs, and management agents.

The responsibilities of the Keep service are:

  • Write data blocks
  • When writing: ensure data integrity by comparing the client-supplied MD5 digest to the MD5 digest of the client-supplied data
  • Read data blocks (subject to permission, which is determined by the system/metadata DB)
  • Send read/write/error event logs to management agents
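
The service responsibilities above can be sketched with a toy in-memory store. This is a minimal illustration under stated assumptions, not the Keep service itself: the class BlockStore, its put and get methods, and the events list (a stand-in for logs sent to management agents) are hypothetical names, and the permission check on reads is omitted. The lock models the rule that a disk handles one block operation at a time.

```python
import hashlib
import threading


class BlockStore:
    """Toy stand-in for one disk behind a Keep service (hypothetical)."""

    def __init__(self):
        self._blocks = {}
        self._lock = threading.Lock()  # one block read/write at a time
        self.events = []  # stand-in for event logs sent to management agents

    def put(self, claimed_md5: str, data: bytes) -> str:
        # Integrity check: the digest of the received data must match the
        # digest the client supplied.
        actual = hashlib.md5(data).hexdigest()
        if actual != claimed_md5:
            self.events.append(("error", claimed_md5))
            raise ValueError("data does not match the supplied MD5 digest")
        with self._lock:
            self._blocks[claimed_md5] = data
        self.events.append(("write", claimed_md5))
        return claimed_md5

    def get(self, md5_hex: str) -> bytes:
        # A real service would check read permission first (omitted here).
        with self._lock:
            data = self._blocks[md5_hex]
        self.events.append(("read", md5_hex))
        return data
```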

The responsibilities of the SDK are:

  • When writing: split data into ≤64 MiB chunks
  • When writing: encode directory trees as manifests
  • When writing: write data to the desired number of nodes to achieve storage redundancy
  • After writing: register a collection with Arvados
  • When reading: parse manifests
  • When reading: verify data integrity by comparing locator to MD5 digest of retrieved data
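
The write and read paths of the SDK can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, each node is modelled as a plain dict rather than a network service, and the locator is assumed to be the bare MD5 hex digest of the block. Manifest encoding and collection registration are left out because their formats are not described here.

```python
import hashlib

BLOCK_SIZE = 64 * 2**20  # 64 MiB upper bound on block size


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Yield consecutive chunks of at most block_size bytes."""
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]


def write_replicated(block: bytes, nodes: list, copies: int) -> str:
    """Store the block on `copies` nodes and return its locator."""
    if copies > len(nodes):
        raise ValueError("not enough nodes for the requested redundancy")
    locator = hashlib.md5(block).hexdigest()
    for node in nodes[:copies]:
        node[locator] = block
    return locator


def read_verified(locator: str, nodes: list) -> bytes:
    """Return the first stored copy whose MD5 digest matches the locator."""
    for node in nodes:
        block = node.get(locator)
        if block is not None and hashlib.md5(block).hexdigest() == locator:
            return block
    raise KeyError(locator)


# Usage with a tiny block size so the example stays small.
nodes = [{}, {}, {}]
locators = [write_replicated(b, nodes, copies=2)
            for b in split_blocks(b"abcdefghij", block_size=4)]
restored = b"".join(read_verified(loc, nodes) for loc in locators)
```

Because every copy is checked against the locator on read, a corrupted copy on one node is skipped and the next node's copy is used instead.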