Keep » History » Revision 9
Revision 8 (Tom Clegg, 04/12/2013 05:32 PM) → Revision 9/26 (Tom Clegg, 04/17/2013 01:06 PM)
h1. Content-Addressable Distributed File System (“Keep”) Keep Keep is a distributed content-addressable distributed file storage system that yields designed for high performance in I/O-bound cluster environments. Keep is designed to run on low-cost commodity hardware and integrate with the rest of the Arvados system. It provides high fault tolerance and high aggregate performance to a large number of clients. h2. Design Goals and Assumptions Notable design goals and features include: * *Scale* - Keep architecture is built to handle data sets that grow to petabyte and ultimately exabyte scale. High scalability * *Fault-Tolerance* - Hardware failure is expected. Installations will include tens of thousands of drives on thousands of nodes; therefore, drive, node, and rack failures are inevitable. Keep has Node-level redundancy and recovery capabilities at its core. * *Large Files* - Unlike disk file systems, Keep is tuned to store large files from gigabytes to terabytes in size. It also reduces the overhead of storing small files by packing those files’ data into larger data blocks. * *High Aggregate Performance Over Low Latency* - Keep is optimized for large batch computations instead of lots of small reads and writes by users. It maximizes Maximum overall throughput in a busy cluster environment and * Maximum data bandwidth from client to disk, while at the same time minimizing disk * Minimum transaction overhead * Elimination of disk thrashing commonly (commonly caused by multiple simultaneous readers. readers) * *Provenance and Reproducibility* - Keep makes tracking Client-controlled redundancy h2. Design The above goals are accomplished by the origin of data sets, validating contents of objects, and enabling the reproduction of complex computations on large datasets a core function of the storage system. * *Data Preservation* - Keep is ideal for data sets that don’t change once saved (e.g. whole genome sequencing data). It’s organized around a write once, read many (WORM) model rather than create-read-update-delete (CRUD) use cases. following design features. * *Complex Data Management* - Keep operates well in environments where there are many independent users accessing transferred directly between the same data or users who want to organize data in many different ways. Keep facilitates data sharing without expecting users either to agree with one another about directory structures or to create redundant copies of the data. * *Security* - Biomedical data is assumed to have strong security requirements, so Keep has on-disk client and in-transport encryption as a native capability. h2. Content Addressing Keep is a content-addressable storage system (CAS). Instead of using a location-addressed storage approach -- where each file is retrieved based on the location physical node where it is stored, typically using POSIX semantics -- Keep uses a content address to locate, retrieve, and validate files. The content address is based on a cryptographic digest of the data in the object. Each content address disk is a permanent universally unique identifier (UUID). By using content addresses, Keep uses a flat address space that is highly scalable and connected. * Data collections are encoded in effect virtualizes storage across large pools of commodity drives. A metadata store makes it possible (≤64 MiB) blocks to tag objects with an arbitrary number of metadata tags. By separating metadata from objects, it’s possible to create separate metadata namespaces for different projects without unnecessarily duplicating the underlying file data. Keep uses manifests (text files with names and content addresses which are themselves also stored in Keep) to define collections of files. minimize short read/write operations. * Each collection has its own content address, like any other object stored in Keep. Collections make it possible to uniquely reference very large datasets using a very small identifier: a 40 byte collection UUID can describe 100 terabytes of data using a simple text-based manifest, while adding disk accepts only one layer of indirection increases this to 180 exabytes. Collections make it possible to create an arbitrary number of dataset definitions for different computations and analyses that are permanently recorded and validated, and do not require any physical reorganization or duplication of data in the underlying file system. For example, block-read/write operation at a researcher can query the metadata database to select existing collections, time. This prevents disk thrashing and build maximizes total throughput when many clients compete for a new collection using only the desired files from the existing sets. This operation disk. * Storage redundancy is both robust directly controlled, and efficient, since it can be done without easily verified, by the client simply by reading or writing any a block of the underlying file data. Collections can be directly referenced in pipeline instances to pass data to individual jobs as inputs or record data from jobs as outputs. Content-addressable storage is well suited to sequencing and other biomedical big data because it addresses the needs for provenance, reproducibility, and data validation. h2. Distributed File System Keep combines content addressing with a distributed file system architecture that was inspired by Google File System (GFS) and is similar in some ways to Hadoop Distributed File system (HDFS). h3. on multiple nodes. * Data Blocks Keep block distribution is designed to store large amounts of large files computed based on large clusters of commodity drives distributed across many nodes. The system splits and combines file data into 64 MiB data blocks and saves those to an underlying file system (e.g. Linux ext4). It also replicates blocks across multiple nodes and optionally multiple racks. Keep creates a cryptographic digest of each the data block for content addressing and creates a manifest that describes the underlying data. The manifest is also being stored in Keep and has a unique content address. Metadata is recorded in the metadata database. h3. Clients and Servers The Keep system is organized around Clients and Servers. Keep clients use an SDK for convenient access to the client-server API. A Keep Server runs on each node which has physical disks (in the optimal private cloud configuration, every node has physical disks). Clients connect directly to Keep Servers. or retrieved. This many-to-many architecture eliminates the single point of failure and data bottleneck that would result from employing need for a master node. Instead central or synchronized database of connecting to an indexing service, clients compute the expected location for each block based on the block’s content address. h4. Clients storage locations. A h2. Components The Keep client SDK can be incorporated in to a variety storage system consists of different clients that interface with Keep. It takes care of most of the data block read/write services, SDKs, and management agents. The responsibilities of the client in the Keep architecture: service are: * split large file Write data into 64 MiB blocks * When writing: ensure data integrity by comparing client-supplied cryptographic digest and data * Read data blocks (subject to permission, which is determined by the system/metadata DB) * combine small file Send read/write/error event logs to management agents The responsibilities of the SDK are: * When writing: split data into 64 ≤64 MiB data blocks chunks * When writing: encode directory trees as manifests * When writing: write data to the desired number of nodes to achieve storage redundancy * After writing: register collections in the metadata database, using the a collection with Arvados REST API * When reading: parse manifests * When reading: verify data integrity by comparing locator to the cryptographic MD5 digest of the retrieved data The responsibilities of the client application include: management agents are: * Record appropriate metadata in the metadata database Verify validity of permission tokens * Specify the desired storage Determine which blocks have higher or lower redundancy (if different from the site default) h4. Servers Keep Servers have several responsibilities: than required * Write data blocks to Monitor disk * Ensure data integrity when writing by comparing data with client-supplied cryptographic digest * Ensure data integrity when reading by comparing data with content address * Retrieve data space and move or delete blocks and send to clients over the network (subject to permission, which is determined by the system/metadata DB) as needed * Send read, write, Collect per-user, per-group, per-node, and error event logs to the Data Manager per-disk usage statistics h2. Data Manager Arvados leverages Keep extensively to enable the development of pipelines and applications. In addition to Keep, Arvados includes the [[Data Manager]] which is a component that assists Keep servers in enforcing site policies and monitors the state of the storage facility as a whole. We expect that the Data Manager will also broker backup and archival services and interfaces to other external data stores. Benefits h2. Interface Keep provides a straightforward API that makes it easy to write, read, organize, and validate files. It does not conform to traditional POSIX semantics (however, individual collections and the whole storage system can be mounted with a FUSE driver and accessed as a read-only POSIX volume). h2. Benefits Keep offers a variety of major benefits over other distributed POSIX file storage systems and scale out disk other object file systems such as Isilon. storage systems. This is a summary of some of those benefits: * *Elimination of Duplication* - — One of the major storage management problems for research labs today is the unnecessary duplication of data. Often researchers will make copies of data for backup or to re-organize files for different projects. Content addressing automatically eliminates unnecessary duplication: if a program saves a file when an identical file has already been stored, Keep simply reports success without having to write a second copy. * *Canonical Records* - — Content addressing creates clear and verifiable canonical records for files. By combining Keep with the computation system in Arvados, it becomes trivial to verify the exact file that was used for a computation. By using a collection to define an entire data set (which could be 100s of terabytes or petabytes), you maintain a permanent and verifiable record of which data were used for each computation. The file that defines a collection is very small relative to the underlying data, so you can make as many as you need. * *Provenance* - — The combination of Keep and the computation system make it possible to maintain clear provenance for all the data in the system. This has a number of benefits including making it easy to ascertain how data were derived at any point in time. * *Easy Management of Temporary Data* - — One benefit of systematic provenance tracking is that Arvados can automatically manage temporary and intermediate data. If you know how a data set or file is was created, you can decide whether it is worthwhile to save keep a copy. copy on disk. Knowing what pipeline was run on which input data, how long it took, etc., makes it possible to automate such decisions. * *Flexible Organization* - — In Arvados, files are grouped in collections and can be easily tagged with metadata. Different researcher and research teams can manage independent sets of metadata. This makes it possible for researchers to organize files in a variety of different ways without duplicating or physically moving the data. A collection is represented by a text file, which lists the filenames and data blocks comprising the collection, and is itself stored in Keep. As a result, the same underlying data can be referenced by many different collections, without ever copying or moving the data itself. * *High Reliability* - — By combining content addressing with an object file store, Keep is fault tolerant across drive and even node failures. The Data Manager monitors the replication level of each data collection. Storage redundancy can thus be adjusted according to the relative importance of individual datasets in addition to default site policy. * *Easier Tiering of Storage* - — The Data Manager in Arvados manages the distribution of files to storage systems such as a NAS or cloud back-up back up service. The files are all content addressed and tracked in the metadata database: when a pipeline uses data which is not on the cluster, Arvados can automatically move the necessary data onto the cluster before starting the job. This makes tiered storage feasible without imposing an undue burden on end users. * *Security and Access Control* - — Keep can encrypt files on disk and this storage architecture makes the implementation of very fine grained access control significantly easier than traditional disk POSIX file systems. * *POSIX Interface* - Collections — Keep can be mounted as drive with a POSIX semantics using FUSE filesystems in a virtual machine in order to access data with tools that expect a POSIX interface. Because collections are so flexible, one can easily create many different virtual directory structures for the same underlying files without copying or even reading the underlying data. Combining the native Arvados tools with UNIX pipes provides better performance, but the POSIX mount option is more convenient in some situations. * *Data Sharing* - — Keep makes it much easier to share data between clusters in different data centers and organizations. Keep content addresses include information about which cluster data is stored on. With federated clusters, collections of data can reside on multiple clusters, and distribution of computations across clusters can eliminate slow, costly data transfers. h2. Background Keep was first developed for the Harvard Personal Genome Project (PGP) in 2007 (see overall project [[history]]. The first Keep code commit was Friday Dec 28 07:41:49 2007. The system was designed to combine the innovations "documented by Google":http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/gfs-sosp2003.pdf in the Google File System (GFS) with the concept of content-addressable storage to address the unique needs of biomedical data. GFS was never made public, but other implementations of the approach have become popular storing and managing big data sets including "Hadoop Distributed File System (HDFS)":http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html and "Ceph":http://ceph.com/. While Keep uses a similar approach to these systems, the focus on content addressing results and a very different set of capabilities aligned with the needs of genomic and biomedical data analysis. "Content-addressable storage (CAS)":http://en.wikipedia.org/wiki/Content-addressable_storage is an idea that was pioneered in the late 1990’s by FilePool and first commercialized in the EMC Centera platform.