Tom Clegg, 04/17/2013 01:06 PM
h1. Content-Addressable Distributed File System (“Keep”)
Keep is a content-addressable distributed file system that yields high performance in I/O-bound cluster environments. Keep is designed to run on low-cost commodity hardware and integrate with the rest of the Arvados system. It provides high fault tolerance and high aggregate performance to a large number of clients.
h2. Design Goals and Assumptions
Notable design goals and features include:
* *Scale* - Keep architecture is built to handle data sets that grow to petabyte and ultimately exabyte scale.
* *Fault-Tolerance* - Hardware failure is expected. Installations will include tens of thousands of drives on thousands of nodes; therefore, drive, node, and rack failures are inevitable. Keep has redundancy and recovery capabilities at its core.
* *Large Files* - Unlike disk file systems, Keep is tuned to store large files from gigabytes to terabytes in size. It also reduces the overhead of storing small files by packing those files’ data into larger data blocks.
* *High Aggregate Performance Over Low Latency* - Keep is optimized for large batch computations instead of lots of small reads and writes by users. It maximizes overall throughput in a busy cluster environment and data bandwidth from client to disk, while at the same time minimizing disk thrashing commonly caused by multiple simultaneous readers.
* *Provenance and Reproducibility* - Keep makes tracking the origin of data sets, validating contents of objects, and enabling the reproduction of complex computations on large datasets a core function of the storage system.
* *Data Preservation* - Keep is ideal for data sets that don’t change once saved (e.g. whole genome sequencing data). It’s organized around a write once, read many (WORM) model rather than create-read-update-delete (CRUD) use cases.
* *Complex Data Management* - Keep operates well in environments where there are many independent users accessing the same data or users who want to organize data in many different ways. Keep facilitates data sharing without expecting users either to agree with one another about directory structures or to create redundant copies of the data.
* *Security* - Biomedical data is assumed to have strong security requirements, so Keep has on-disk and in-transport encryption as a native capability.
h2. Content Addressing
Keep is a content-addressable storage system (CAS).
Instead of using a location-addressed storage approach -- where each file is retrieved based on the location where it is stored, typically using POSIX semantics -- Keep uses a content address to locate, retrieve, and validate files. The content address is based on a cryptographic digest of the data in the object. Each content address is a permanent universally unique identifier (UUID).
By using content addresses, Keep uses a flat address space that is highly scalable and in effect virtualizes storage across large pools of commodity drives. A metadata store makes it possible to tag objects with an arbitrary number of metadata tags. By separating metadata from objects, it’s possible to create separate metadata namespaces for different projects without unnecessarily duplicating the underlying file data.
Keep uses manifests (text files with names and content addresses which are themselves also stored in Keep) to define collections of files. Each collection has its own content address, like any other object stored in Keep. Collections make it possible to uniquely reference very large datasets using a very small identifier: a 40 byte collection UUID can describe 100 terabytes of data using a simple text-based manifest, while adding one layer of indirection increases this to 180 exabytes.
Collections make it possible to create an arbitrary number of dataset definitions for different computations and analyses that are permanently recorded and validated, and do not require any physical reorganization or duplication of data in the underlying file system. For example, a researcher can query the metadata database to select existing collections, and build a new collection using only the desired files from the existing sets. This operation is both robust and efficient, since it can be done without reading or writing any of the underlying file data. Collections can be directly referenced in pipeline instances to pass data to individual jobs as inputs or record data from jobs as outputs.
Content-addressable storage is well suited to sequencing and other biomedical big data because it addresses the needs for provenance, reproducibility, and data validation.
h2. Distributed File System
Keep combines content addressing with a distributed file system architecture that was inspired by Google File System (GFS) and is similar in some ways to Hadoop Distributed File system (HDFS).
h3. Data Blocks
Keep is designed to store large amounts of large files on large clusters of commodity drives distributed across many nodes. The system splits and combines file data into 64 MiB data blocks and saves those to an underlying file system (e.g. Linux ext4). It also replicates blocks across multiple nodes and optionally multiple racks. Keep creates a cryptographic digest of each block for content addressing and creates a manifest that describes the underlying data. The manifest is also stored in Keep and has a unique content address. Metadata is recorded in the metadata database.
h3. Clients and Servers
The Keep system is organized around Clients and Servers. Keep clients use an SDK for convenient access to the client-server API. A Keep Server runs on each node which has physical disks (in the optimal private cloud configuration, every node has physical disks). Clients connect directly to Keep Servers. This many-to-many architecture eliminates the single point of failure and data bottleneck that would result from employing a master node. Instead of connecting to an indexing service, clients compute the expected location for each block based on the block’s content address.
A Keep client SDK can be incorporated in to a variety of different clients that interface with Keep. It takes care of most of the responsibilities of the client in the Keep architecture:
* split large file data into 64 MiB data blocks
* combine small file data into 64 MiB data blocks
* encode directory trees as manifests
* write data to the desired number of nodes to achieve storage redundancy
* register collections in the metadata database, using the Arvados REST API
* parse manifests
* verify data integrity by comparing locator to the cryptographic digest of the retrieved data
The responsibilities of the client application include:
* Record appropriate metadata in the metadata database
* Specify the desired storage redundancy (if different from the site default)
Keep Servers have several responsibilities:
* Write data blocks to disk
* Ensure data integrity when writing by comparing data with client-supplied cryptographic digest
* Ensure data integrity when reading by comparing data with content address
* Retrieve data blocks and send to clients over the network (subject to permission, which is determined by the system/metadata DB)
* Send read, write, and error event logs to the Data Manager
h2. Data Manager
Arvados leverages Keep extensively to enable the development of pipelines and applications. In addition to Keep, Arvados includes the [[Data Manager]] which is a component that assists Keep servers in enforcing site policies and monitors the state of the storage facility as a whole. We expect that the Data Manager will also broker backup and archival services and interfaces to other external data stores.
Keep provides a straightforward API that makes it easy to write, read, organize, and validate files. It does not conform to traditional POSIX semantics (however, individual collections and the whole storage system can be mounted with a FUSE driver and accessed as a read-only POSIX volume).
Keep offers a variety of major benefits over other distributed file storage systems and scale out disk file systems such as Isilon. This is a summary of some of those benefits:
* *Elimination of Duplication* - One of the major storage management problems for research labs is the unnecessary duplication of data. Often researchers will make copies of data for backup or to re-organize files for different projects. Content addressing automatically eliminates unnecessary duplication: if a program saves a file when an identical file has already been stored, Keep simply reports success without having to write a second copy.
* *Canonical Records* - Content addressing creates clear and verifiable canonical records for files. By combining Keep with the computation system in Arvados, it becomes trivial to verify the exact file that was used for a computation. By using a collection to define an entire data set (which could be 100s of terabytes or petabytes), you maintain a permanent and verifiable record of which data were used for each computation. The file that defines a collection is very small relative to the underlying data, so you can make as many as you need.
* *Provenance* - The combination of Keep and the computation system make it possible to maintain clear provenance for all the data in the system. This has a number of benefits including making it easy to ascertain how data were derived at any point in time.
* *Easy Management of Temporary Data* - One benefit of systematic provenance tracking is that Arvados can automatically manage temporary and intermediate data. If you know how a data set or file is was created, you can decide whether it is worthwhile to save a copy. Knowing what pipeline was run on which input data, how long it took, etc., makes it possible to automate such decisions.
* *Flexible Organization* - In Arvados, files are grouped in collections and can be easily tagged with metadata. Different researcher and research teams can manage independent sets of metadata. This makes it possible for researchers to organize files in a variety of different ways without duplicating or physically moving the data. A collection is represented by a text file, which lists the filenames and data blocks comprising the collection, and is itself stored in Keep. As a result, the same underlying data can be referenced by many different collections, without ever copying or moving the data itself.
* *High Reliability* - By combining content addressing with an object file store, Keep is fault tolerant across drive and even node failures. The Data Manager monitors the replication level of each data collection. Storage redundancy can thus be adjusted according to the relative importance of individual datasets in addition to default site policy.
* *Easier Tiering of Storage* - The Data Manager in Arvados manages the distribution of files to storage systems such as a NAS or cloud back-up service. The files are all content addressed and tracked in the metadata database: when a pipeline uses data which is not on the cluster, Arvados can automatically move the necessary data onto the cluster before starting the job. This makes tiered storage feasible without imposing an undue burden on end users.
* *Security and Access Control* - Keep can encrypt files on disk and this storage architecture makes the implementation of very fine grained access control significantly easier than traditional disk file systems.
* *POSIX Interface* - Collections can be mounted as drive with POSIX semantics using FUSE to access data with tools that expect a POSIX interface. Because collections are so flexible, one can easily create many different virtual directory structures for the same underlying files without copying or even reading the underlying data. Combining the native Arvados tools with UNIX pipes provides better performance, but the POSIX mount option is more convenient in some situations.
* *Data Sharing* - Keep makes it much easier to share data between clusters in different data centers and organizations. Keep content addresses include information about which cluster data is stored on. With federated clusters, collections of data can reside on multiple clusters, and distribution of computations across clusters can eliminate slow, costly data transfers.
Keep was first developed for the Harvard Personal Genome Project (PGP) in 2007 (see overall project [[history]]. The first Keep code commit was Friday Dec 28 07:41:49 2007.
The system was designed to combine the innovations "documented by Google":http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/gfs-sosp2003.pdf in the Google File System (GFS) with the concept of content-addressable storage to address the unique needs of biomedical data. GFS was never made public, but other implementations of the approach have become popular storing and managing big data sets including "Hadoop Distributed File System (HDFS)":http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html and "Ceph":http://ceph.com/.
While Keep uses a similar approach to these systems, the focus on content addressing results and a very different set of capabilities aligned with the needs of genomic and biomedical data analysis. "Content-addressable storage (CAS)":http://en.wikipedia.org/wiki/Content-addressable_storage is an idea that was pioneered in the late 1990’s by FilePool and first commercialized in the EMC Centera platform.