Project

General

Profile

Keep » History » Version 21

Anonymous, 08/28/2015 01:21 PM

1 21 Anonymous
h1. Keep - Content-Addressable Distributed Storage 
2 1 Tom Clegg
3 16 Tom Clegg
Developer documentation:
4
* [[Keep server]]
5
* [[Keep index]]
6
* [[Keep manifest format]]
7 18 Tom Clegg
* [[Keep locator format]]
8 17 Tom Clegg
* [[Hacking Keep]]
9 16 Tom Clegg
10 20 Anonymous
Keep is a content-addressable distributed storage system that yields high performance in I/O-bound cluster environments. Keep is designed to run on low-cost commodity hardware or cloud services and integrate with the rest of the Arvados system. It provides high fault tolerance and high aggregate performance to a large number of clients. 
11 1 Tom Clegg
12 9 Tom Clegg
h2. Design Goals and Assumptions
13
14 2 Tom Clegg
Notable design goals and features include:
15
16 9 Tom Clegg
* *Scale* - Keep architecture is built to handle data sets that grow to petabyte and ultimately exabyte scale. 
17 2 Tom Clegg
18 9 Tom Clegg
* *Fault-Tolerance* - Hardware failure is expected. Installations will include tens of thousands of drives on thousands of nodes; therefore, drive, node, and rack failures are inevitable. Keep has redundancy and recovery capabilities at its core.
19 1 Tom Clegg
20 9 Tom Clegg
* *Large Files* - Unlike disk file systems, Keep is tuned to store large files from gigabytes to terabytes in size. It also reduces the overhead of storing small files by packing those files’ data into larger data blocks. 
21 2 Tom Clegg
22 9 Tom Clegg
* *High Aggregate Performance Over Low Latency* - Keep is optimized for large batch computations instead of lots of small reads and writes by users. It maximizes overall throughput in a busy cluster environment and data bandwidth from client to disk, while at the same time minimizing disk thrashing commonly caused by multiple simultaneous readers. 
23 2 Tom Clegg
24 9 Tom Clegg
* *Provenance and Reproducibility* - Keep makes tracking the origin of data sets, validating contents of objects, and enabling the reproduction of complex computations on large datasets a core function of the storage system. 
25 1 Tom Clegg
26 9 Tom Clegg
* *Data Preservation* - Keep is ideal for data sets that don’t change once saved (e.g. whole genome sequencing data). It’s organized around a write once, read many (WORM) model rather than create-read-update-delete (CRUD) use cases.
27 1 Tom Clegg
28 9 Tom Clegg
* *Complex Data Management* - Keep operates well in environments where there are many independent users accessing the same data or users who want to organize data in many different ways. Keep facilitates data sharing without expecting users either to agree with one another about directory structures or to create redundant copies of the data.
29 1 Tom Clegg
30 19 Anonymous
* *Security* - Biomedical data is assumed to have strong security requirements, so Keep has on-disk and in-transport encryption as a native capability. Fine grain permission management can be done using data collections. 
31 1 Tom Clegg
32 9 Tom Clegg
h2. Content Addressing 
33 1 Tom Clegg
34 20 Anonymous
Keep is a content-addressable distributed storage system. 
35 1 Tom Clegg
36 9 Tom Clegg
Instead of using a location-addressed storage approach -- where each file is retrieved based on the location where it is stored, typically using POSIX semantics -- Keep uses a content address to locate, retrieve, and validate files. The content address is based on a cryptographic digest of the data in the object. Each content address is a permanent universally unique identifier (UUID). 
37 1 Tom Clegg
38 20 Anonymous
By using content addresses, Keep uses a flat address space that is highly scalable and in effect virtualizes storage across large pools of commodity drives or cloud storage. A metadata store makes it possible to tag objects with an arbitrary number of metadata tags. By separating metadata from objects, it’s possible to create separate metadata namespaces for different projects without unnecessarily duplicating the underlying file data.
39 1 Tom Clegg
40 20 Anonymous
Keep uses manifests (text files with names and content addresses which are themselves also stored in Keep) to define collections of files. Each collection has its own content address and GUID, like any other object stored in Keep. Collections make it possible to uniquely reference very large data sets using a very small identifier: a 40 byte collection UUID can describe 100 terabytes of data using a simple text-based manifest, while adding one layer of indirection increases this to 180 exabytes. 
41 1 Tom Clegg
42 20 Anonymous
Collections make it possible to create an arbitrary number of data set definitions for different computations and analyses that are permanently recorded and validated, and do not require any physical reorganization or duplication of data in the underlying file system. For example, a researcher can query the metadata database to select existing collections, and build a new collection using only the desired files from the existing sets. This operation is both robust and efficient, since it can be done without reading or writing any of the underlying file data. Collections can be directly referenced in pipeline instances to pass data to individual jobs as inputs or record data from jobs as outputs.
43 1 Tom Clegg
44 9 Tom Clegg
Content-addressable storage is well suited to sequencing and other biomedical big data because it addresses the needs for provenance, reproducibility, and data validation. 
45 1 Tom Clegg
46 20 Anonymous
h2. Distributed Storage and Data Blocks
47 1 Tom Clegg
48 20 Anonymous
Keep combines content addressing with a distributed storage architecture that was inspired by Google File System (GFS).
49 1 Tom Clegg
50 20 Anonymous
Keep is designed to store large amounts of large files on large clusters of commodity drives distributed across many nodes or cloud storage. The system splits and combines file data into 64 MiB data blocks and saves those to an underlying file system (e.g. Linux ext4 or Amazon S3). It also replicates blocks across multiple nodes and optionally multiple racks. Keep creates a cryptographic digest of each block for content addressing and creates a manifest that describes the underlying data. The manifest is also stored in Keep and has a unique content address. Metadata is recorded in the metadata database. 
51 1 Tom Clegg
52 12 Anonymous
!keep_workflow_diagram_2.png!
53 1 Tom Clegg
54 9 Tom Clegg
h3. Clients and Servers 
55 1 Tom Clegg
56 20 Anonymous
The Keep system is organized around Clients and Servers. Keep clients use an SDK for convenient access to the client-server API. A Keep Server runs on each node which has physical disks or access to cloud storage (in the optimal on premise system, every node has physical disks). Clients connect directly to Keep servers. This many-to-many architecture eliminates the single point of failure and data bottleneck that would result from employing a master node. Instead of connecting to an indexing service, clients compute the expected location for each block based on the block’s content address.
57 1 Tom Clegg
58 13 Anonymous
!keep_architecture_diagram_v1.png!
59
60 9 Tom Clegg
h4. Clients
61
62
A Keep client SDK can be incorporated in to a variety of different clients that interface with Keep. It takes care of most of the responsibilities of the client in the Keep architecture:
63
64
* split large file data into 64 MiB data blocks
65
* combine small file data into 64 MiB data blocks
66
* encode directory trees as manifests
67
* write data to the desired number of nodes to achieve storage redundancy
68
* register collections in the metadata database, using the Arvados REST API
69
* parse manifests
70
* verify data integrity by comparing locator to the cryptographic digest of the retrieved data
71
72
The responsibilities of the client application include:
73
74
* Record appropriate metadata in the metadata database
75
* Specify the desired storage redundancy (if different from the site default)
76
77
h4. Servers
78
79
Keep Servers have several responsibilities: 
80
81 1 Tom Clegg
* Write data blocks to disk
82 9 Tom Clegg
* Ensure data integrity when writing by comparing data with client-supplied cryptographic digest
83
* Ensure data integrity when reading by comparing data with content address
84 1 Tom Clegg
* Retrieve data blocks and send to clients over the network (subject to permission, which is determined by the system/metadata DB)
85 9 Tom Clegg
* Send read, write, and error event logs to the Data Manager
86
87
h2. Data Manager
88
89
Arvados leverages Keep extensively to enable the development of pipelines and applications. In addition to Keep, Arvados includes the [[Data Manager]] which is a component that assists Keep servers in enforcing site policies and monitors the state of the storage facility as a whole. We expect that the Data Manager will also broker backup and archival services and interfaces to other external data stores. 
90
91
h2. Interface
92
93 20 Anonymous
Keep provides a straightforward API that makes it easy to write, read, organize, and validate files. It does not conform to traditional POSIX semantics. However, individual collections and the whole storage system can be mounted with a FUSE driver and accessed as a POSIX volume).
94 9 Tom Clegg
95 1 Tom Clegg
h2. Benefits
96 9 Tom Clegg
97 20 Anonymous
Keep offers a variety of major benefits over other distributed storage systems and scale out disk file systems such as Isilon. This is a summary of some of those benefits: 
98 9 Tom Clegg
99
* *Elimination of Duplication* - One of the major storage management problems for research labs is the unnecessary duplication of data. Often researchers will make copies of data for backup or to re-organize files for different projects. Content addressing automatically eliminates unnecessary duplication: if a program saves a file when an identical file has already been stored, Keep simply reports success without having to write a second copy.
100
101
* *Canonical Records* - Content addressing creates clear and verifiable canonical records for files. By combining Keep with the computation system in Arvados, it becomes trivial to verify the exact file that was used for a computation. By using a collection to define an entire data set (which could be 100s of terabytes or petabytes), you maintain a permanent and verifiable record of which data were used for each computation. The file that defines a collection is very small relative to the underlying data, so you can make as many as you need. 
102
103
* *Provenance* - The combination of Keep and the computation system make it possible to maintain clear provenance for all the data in the system. This has a number of benefits including making it easy to ascertain how data were derived at any point in time. 
104 1 Tom Clegg
105
* *Easy Management of Temporary Data* - One benefit of systematic provenance tracking is that Arvados can automatically manage temporary and intermediate data. If you know how a data set or file is was created, you can decide whether it is worthwhile to save a copy. Knowing what pipeline was run on which input data, how long it took, etc., makes it possible to automate such decisions.
106 9 Tom Clegg
107
* *Flexible Organization* - In Arvados, files are grouped in collections and can be easily tagged with metadata. Different researcher and research teams can manage independent sets of metadata. This makes it possible for researchers to organize files in a variety of different ways without duplicating or physically moving the data. A collection is represented by a text file, which lists the filenames and data blocks comprising the collection, and is itself stored in Keep. As a result, the same underlying data can be referenced by many different collections, without ever copying or moving the data itself.
108
109 20 Anonymous
* *High Reliability* - By combining content addressing with distributed storage, Keep is fault tolerant across drive and even node failures. The Data Manager monitors the replication level of each data collection. Storage redundancy can thus be adjusted according to the relative importance of individual data sets in addition to default policy.
110 9 Tom Clegg
111
* *Easier Tiering of Storage* - The Data Manager in Arvados manages the distribution of files to storage systems such as a NAS or cloud back-up service. The files are all content addressed and tracked in the metadata database: when a pipeline uses data which is not on the cluster, Arvados can automatically move the necessary data onto the cluster before starting the job. This makes tiered storage feasible without imposing an undue burden on end users.
112
113
* *Security and Access Control* - Keep can encrypt files on disk and this storage architecture makes the implementation of very fine grained access control significantly easier than traditional disk file systems. 
114
115
* *POSIX Interface* - Collections can be mounted as drive with POSIX semantics using FUSE to access data with tools that expect a POSIX interface. Because collections are so flexible, one can easily create many different virtual directory structures for the same underlying files without copying or even reading the underlying data. Combining the native Arvados tools with UNIX pipes provides better performance, but the POSIX mount option is more convenient in some situations.
116
117
* *Data Sharing* - Keep makes it much easier to share data between clusters in different data centers and organizations. Keep content addresses include information about which cluster data is stored on. With federated clusters, collections of data can reside on multiple clusters, and distribution of computations across clusters can eliminate slow, costly data transfers.
118
119 20 Anonymous
* *Versioning* - Each collection in Keep has a Globally Unique ID (GUID), which is consistent for the collection over time. The collection also has a content address, which provides a durable reference to a specific version of a collection. This mechanism is similar to Git, and it makes it possible to add or remove files from a collection and still reliably retrieve older versions of the collection. 
120
121 9 Tom Clegg
h2. Background
122
123 14 Tom Clegg
Keep was first developed for the Harvard Personal Genome Project (PGP) in 2007 (see overall project [[history]]). 
124 9 Tom Clegg
125 20 Anonymous
The system was designed to combine the innovations "documented by Google":http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/gfs-sosp2003.pdf in the Google File System (GFS) with the concept of content-addressable storage to address the unique needs of biomedical data. "Content-addressable storage (CAS)":http://en.wikipedia.org/wiki/Content-addressable_storage is an idea that was pioneered in the late 1990s by FilePool and first commercialized in the EMC Centera platform.
126 9 Tom Clegg
127 20 Anonymous
While Keep uses a similar approach to these systems, the focus on content addressing results and a very different set of capabilities aligned with the needs of large scale data science make it very different.