Keep » History » Version 12

Anonymous, 04/17/2013 02:18 PM

h1. Content-Addressable Distributed File System (“Keep”) 
Keep is a content-addressable distributed file system that yields high performance in I/O-bound cluster environments. Keep is designed to run on low-cost commodity hardware and integrate with the rest of the Arvados system. It provides high fault tolerance and high aggregate performance to a large number of clients. 
h2. Design Goals and Assumptions
Notable design goals and features include:
* *Scale* - Keep architecture is built to handle data sets that grow to petabyte and ultimately exabyte scale. 
* *Fault-Tolerance* - Hardware failure is expected. Installations will include tens of thousands of drives on thousands of nodes; therefore, drive, node, and rack failures are inevitable. Keep has redundancy and recovery capabilities at its core.
* *Large Files* - Unlike disk file systems, Keep is tuned to store large files from gigabytes to terabytes in size. It also reduces the overhead of storing small files by packing those files’ data into larger data blocks. 
* *High Aggregate Performance Over Low Latency* - Keep is optimized for large batch computations instead of lots of small reads and writes by users. It maximizes overall throughput in a busy cluster environment and data bandwidth from client to disk, while at the same time minimizing disk thrashing commonly caused by multiple simultaneous readers. 
* *Provenance and Reproducibility* - Keep makes tracking the origin of data sets, validating contents of objects, and enabling the reproduction of complex computations on large datasets a core function of the storage system. 
* *Data Preservation* - Keep is ideal for data sets that don’t change once saved (e.g. whole genome sequencing data). It’s organized around a write once, read many (WORM) model rather than create-read-update-delete (CRUD) use cases.
* *Complex Data Management* - Keep operates well in environments where there are many independent users accessing the same data or users who want to organize data in many different ways. Keep facilitates data sharing without expecting users either to agree with one another about directory structures or to create redundant copies of the data.
* *Security* - Biomedical data is assumed to have strong security requirements, so Keep has on-disk and in-transport encryption as a native capability. 
h2. Content Addressing 
Keep is a content-addressable storage system (CAS). 
Instead of using a location-addressed storage approach -- where each file is retrieved based on the location where it is stored, typically using POSIX semantics -- Keep uses a content address to locate, retrieve, and validate files. The content address is based on a cryptographic digest of the data in the object. Each content address is a permanent universally unique identifier (UUID). 
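
As a concrete sketch, a Keep-style content address can be computed as a cryptographic digest of the data plus the data's size (Keep uses MD5-based "hash+size" locators; the function below is illustrative, not the production implementation):

```python
import hashlib

def content_address(data: bytes) -> str:
    """Compute a Keep-style locator for a data object: the MD5
    digest of the bytes, plus the object's size in bytes."""
    return "%s+%d" % (hashlib.md5(data).hexdigest(), len(data))

# The same bytes always produce the same address, so the address
# both locates the object and validates its contents on retrieval.
print(content_address(b"foo"))  # acbd18db4cc2f85cedef654fccc4a4d8+3
```

Because the address is a pure function of the content, any client can independently verify that retrieved data matches its address.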
By using content addresses, Keep uses a flat address space that is highly scalable and in effect virtualizes storage across large pools of commodity drives. A metadata store makes it possible to tag objects with an arbitrary number of metadata tags. By separating metadata from objects, it’s possible to create separate metadata namespaces for different projects without unnecessarily duplicating the underlying file data.
Keep uses manifests (text files listing file names and content addresses, which are themselves also stored in Keep) to define collections of files. Each collection has its own content address, like any other object stored in Keep. Collections make it possible to uniquely reference very large datasets using a very small identifier: a 40-byte collection UUID can describe 100 terabytes of data using a simple text-based manifest, while adding one layer of indirection increases this to 180 exabytes. 
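
The manifest idea can be illustrated with a simplified parser (the real manifest grammar also handles block hints, name escaping, and multiple streams per collection; the structure below is a sketch):

```python
def parse_manifest(text: str):
    """Parse a simplified Keep-style manifest. Each line names a
    directory ("stream"), then lists the data blocks backing it,
    then the files as position:size:name spans into the
    concatenation of those blocks."""
    streams = []
    for line in text.strip().splitlines():
        tokens = line.split()
        # Block locators look like "hash+size"; file tokens contain colons.
        blocks = [t for t in tokens[1:] if "+" in t and ":" not in t]
        files = []
        for tok in tokens[1 + len(blocks):]:
            pos, size, name = tok.split(":", 2)
            files.append({"name": name, "pos": int(pos), "size": int(size)})
        streams.append({"stream": tokens[0], "blocks": blocks, "files": files})
    return streams

# Two 3-byte files packed into two data blocks, in one manifest line:
manifest = (". acbd18db4cc2f85cedef654fccc4a4d8+3 "
            "37b51d194a7513e45b56f6524f2d51f2+3 0:3:foo.txt 3:3:bar.txt")
```

Note that the manifest references data only by address: building a new collection from existing files never touches the underlying blocks.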
Collections make it possible to create an arbitrary number of dataset definitions for different computations and analyses that are permanently recorded and validated, and do not require any physical reorganization or duplication of data in the underlying file system. For example, a researcher can query the metadata database to select existing collections, and build a new collection using only the desired files from the existing sets. This operation is both robust and efficient, since it can be done without reading or writing any of the underlying file data. Collections can be directly referenced in pipeline instances to pass data to individual jobs as inputs or record data from jobs as outputs.
Content-addressable storage is well suited to sequencing and other biomedical big data because it addresses the needs for provenance, reproducibility, and data validation. 
h2. Distributed File System 
Keep combines content addressing with a distributed file system architecture that was inspired by the Google File System (GFS) and is similar in some ways to the Hadoop Distributed File System (HDFS).
h3. Data Blocks 
Keep is designed to store large numbers of large files on large clusters of commodity drives distributed across many nodes. The system splits and combines file data into 64 MiB data blocks and saves those to an underlying file system (e.g. Linux ext4). It also replicates blocks across multiple nodes and optionally multiple racks. Keep creates a cryptographic digest of each block for content addressing and creates a manifest that describes the underlying data. The manifest is also stored in Keep and has a unique content address. Metadata is recorded in the metadata database. 
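
The splitting step can be sketched as follows; the 64 MiB block size matches the text above, while the helper name and return shape are illustrative:

```python
import hashlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MiB

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split file data into fixed-size blocks, returning
    (locator, block) pairs where each locator is the block's MD5
    digest plus its size -- the block's content address."""
    blocks = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        locator = "%s+%d" % (hashlib.md5(block).hexdigest(), len(block))
        blocks.append((locator, block))
    return blocks
```

Each block is then written independently, so replication and placement decisions are made per block rather than per file.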
!keep_workflow_diagram_2.png!
h3. Clients and Servers 
The Keep system is organized around Clients and Servers. Keep clients use an SDK for convenient access to the client-server API. A Keep Server runs on each node which has physical disks (in the optimal private cloud configuration, every node has physical disks). Clients connect directly to Keep Servers. This many-to-many architecture eliminates the single point of failure and data bottleneck that would result from employing a master node. Instead of connecting to an indexing service, clients compute the expected location for each block based on the block’s content address.
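
One way to compute block locations without an index service -- a sketch of the general technique, not necessarily Keep's exact probe-order algorithm -- is rendezvous (highest-random-weight) hashing; the server identifiers here are hypothetical:

```python
import hashlib

def probe_order(block_locator: str, servers):
    """Rank servers for a block by hashing (block hash + server id).
    Every client computes the same ranking from the content address
    alone, so no master node or index service is consulted."""
    block_hash = block_locator.split("+")[0]
    return sorted(
        servers,
        key=lambda s: hashlib.md5((block_hash + s).encode()).hexdigest(),
        reverse=True,
    )

# Write a block to the first N servers in probe order for N-way
# replication; read it back by trying servers in the same order.
servers = ["keep0", "keep1", "keep2", "keep3"]  # hypothetical ids
```

Because the ranking is deterministic, readers and writers agree on where a block should live without coordinating with each other.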
h4. Clients
A Keep client SDK can be incorporated into a variety of different clients that interface with Keep. It takes care of most of the responsibilities of the client in the Keep architecture:
* Split large file data into 64 MiB data blocks
* Combine small file data into 64 MiB data blocks
* Encode directory trees as manifests
* Write data to the desired number of nodes to achieve storage redundancy
* Register collections in the metadata database, using the Arvados REST API
* Parse manifests
* Verify data integrity by comparing each locator to the cryptographic digest of the retrieved data
The responsibilities of the client application include:
* Record appropriate metadata in the metadata database
* Specify the desired storage redundancy (if different from the site default)
h4. Servers
Keep Servers have several responsibilities: 
* Write data blocks to disk
* Ensure data integrity when writing by comparing data with the client-supplied cryptographic digest
* Ensure data integrity when reading by comparing data with the content address
* Retrieve data blocks and send them to clients over the network (subject to permission, which is determined by the system/metadata DB)
* Send read, write, and error event logs to the Data Manager
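
The write-path integrity check can be sketched as follows (names are illustrative; the real server also handles permissions, disk placement, and event logging):

```python
import hashlib

def handle_put(claimed_locator: str, data: bytes) -> bool:
    """Recompute the digest of the uploaded bytes and refuse the
    write if it does not match the client-supplied content address,
    catching corruption in transit or a misbehaving client."""
    claimed_hash = claimed_locator.split("+")[0]
    if hashlib.md5(data).hexdigest() != claimed_hash:
        return False  # reject: data does not match its address
    # ...write the block to the underlying file system here...
    return True
```

The read path performs the symmetric check: before serving a block, the server verifies the on-disk bytes still match the requested content address.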
h2. Data Manager
Arvados leverages Keep extensively to enable the development of pipelines and applications. In addition to Keep, Arvados includes the [[Data Manager]], a component that assists Keep servers in enforcing site policies and monitors the state of the storage facility as a whole. We expect that the Data Manager will also broker backup and archival services and interfaces to other external data stores. 
h2. Interface
Keep provides a straightforward API that makes it easy to write, read, organize, and validate files. It does not conform to traditional POSIX semantics (however, individual collections and the whole storage system can be mounted with a FUSE driver and accessed as a read-only POSIX volume).
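
A minimal in-memory sketch of that write/read/validate cycle (the real service speaks HTTP and enforces permissions; the class and method names below are illustrative, not the actual Keep API):

```python
import hashlib

class KeepStore:
    """Toy content-addressed store: put() returns a content address,
    get() retrieves by that address and re-validates the data."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        locator = "%s+%d" % (hashlib.md5(data).hexdigest(), len(data))
        # Identical data already stored? Just report success (dedup).
        self._blocks.setdefault(locator, data)
        return locator

    def get(self, locator: str) -> bytes:
        data = self._blocks[locator]
        if "%s+%d" % (hashlib.md5(data).hexdigest(), len(data)) != locator:
            raise IOError("stored block fails content-address check")
        return data
```

Because the address is derived from the content, writing the same data twice costs nothing extra -- the deduplication behavior described under Benefits.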
h2. Benefits
Keep offers a variety of major benefits over other distributed file storage systems and scale out disk file systems such as Isilon. This is a summary of some of those benefits: 
* *Elimination of Duplication* - One of the major storage management problems for research labs is the unnecessary duplication of data. Often researchers will make copies of data for backup or to re-organize files for different projects. Content addressing automatically eliminates unnecessary duplication: if a program saves a file when an identical file has already been stored, Keep simply reports success without having to write a second copy.
* *Canonical Records* - Content addressing creates clear and verifiable canonical records for files. By combining Keep with the computation system in Arvados, it becomes trivial to verify the exact file that was used for a computation. By using a collection to define an entire data set (which could be hundreds of terabytes or petabytes), you maintain a permanent and verifiable record of which data were used for each computation. The file that defines a collection is very small relative to the underlying data, so you can make as many as you need. 
* *Provenance* - The combination of Keep and the computation system makes it possible to maintain clear provenance for all the data in the system. This has a number of benefits, including making it easy to ascertain how data were derived at any point in time. 
* *Easy Management of Temporary Data* - One benefit of systematic provenance tracking is that Arvados can automatically manage temporary and intermediate data. If you know how a data set or file was created, you can decide whether it is worthwhile to save a copy. Knowing what pipeline was run on which input data, how long it took, etc., makes it possible to automate such decisions.
* *Flexible Organization* - In Arvados, files are grouped in collections and can be easily tagged with metadata. Different researchers and research teams can manage independent sets of metadata. This makes it possible for researchers to organize files in a variety of different ways without duplicating or physically moving the data. A collection is represented by a text file, which lists the filenames and data blocks comprising the collection, and is itself stored in Keep. As a result, the same underlying data can be referenced by many different collections, without ever copying or moving the data itself.
* *High Reliability* - By combining content addressing with an object file store, Keep is fault tolerant across drive and even node failures. The Data Manager monitors the replication level of each data collection. Storage redundancy can thus be adjusted according to the relative importance of individual datasets in addition to default site policy.
* *Easier Tiering of Storage* - The Data Manager in Arvados manages the distribution of files to storage systems such as a NAS or cloud back-up service. The files are all content addressed and tracked in the metadata database: when a pipeline uses data which is not on the cluster, Arvados can automatically move the necessary data onto the cluster before starting the job. This makes tiered storage feasible without imposing an undue burden on end users.
* *Security and Access Control* - Keep can encrypt files on disk, and this storage architecture makes very fine-grained access control significantly easier to implement than in traditional disk file systems. 
* *POSIX Interface* - Collections can be mounted as a drive with POSIX semantics using FUSE to access data with tools that expect a POSIX interface. Because collections are so flexible, one can easily create many different virtual directory structures for the same underlying files without copying or even reading the underlying data. Combining the native Arvados tools with UNIX pipes provides better performance, but the POSIX mount option is more convenient in some situations.
* *Data Sharing* - Keep makes it much easier to share data between clusters in different data centers and organizations. Keep content addresses include information about which cluster data is stored on. With federated clusters, collections of data can reside on multiple clusters, and distribution of computations across clusters can eliminate slow, costly data transfers.
h2. Background
Keep was first developed for the Harvard Personal Genome Project (PGP) in 2007 (see the overall project [[history]]). The first Keep code commit was Friday Dec 28 07:41:49 2007. 
The system was designed to combine the innovations "documented by Google":http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/gfs-sosp2003.pdf in the Google File System (GFS) with the concept of content-addressable storage to address the unique needs of biomedical data. GFS was never made public, but other implementations of the approach have become popular for storing and managing big data sets, including "Hadoop Distributed File System (HDFS)":http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html and "Ceph":http://ceph.com/. 
While Keep uses a similar approach to these systems, the focus on content addressing results in a very different set of capabilities, aligned with the needs of genomic and biomedical data analysis. "Content-addressable storage (CAS)":http://en.wikipedia.org/wiki/Content-addressable_storage is an idea that was pioneered in the late 1990s by FilePool and first commercialized in the EMC Centera platform.