h1. Keep - Content-Addressable Storage 
Developer documentation:
* [[Keep server]]
* [[Keep index]]
* [[Keep manifest format]]
* [[Keep locator format]]
* [[Keep real world performance numbers]]
* [[Hacking Keep]]
Keep is a content-addressable storage system that yields high performance for I/O-bound workloads. Keep is designed to run on low-cost commodity hardware or cloud services and integrates with the rest of the Arvados system. It provides high fault tolerance and high aggregate performance to a large number of clients. 
h2. Design Goals and Assumptions
Notable design goals and features include:
* *Scale* - The Keep architecture is built to handle data sets that grow to petabyte and ultimately exabyte scale.
* *Fault-Tolerance* - Hardware failure is expected. Installations will include tens of thousands of drives on thousands of nodes; therefore, drive, node, and rack failures are inevitable. Keep has redundancy and recovery capabilities at its core.
* *Large Files* - Unlike disk file systems, Keep is tuned to store large files from gigabytes to terabytes in size. It also reduces the overhead of storing small files by packing those files’ data into larger data blocks. 
* *High Aggregate Performance Over Low Latency* - Keep is optimized for large batch computations instead of lots of small reads and writes by users. It maximizes overall throughput in a busy cluster environment and data bandwidth from client to disk, while at the same time minimizing disk thrashing commonly caused by multiple simultaneous readers. 
* *Provenance and Reproducibility* - Keep makes tracking the origin of data sets, validating contents of objects, and enabling the reproduction of complex computations on large datasets a core function of the storage system. 
* *Data Preservation* - Keep is ideal for data sets that don’t change once saved (e.g. whole genome sequencing data). It’s organized around a write once, read many (WORM) model rather than create-read-update-delete (CRUD) use cases.
* *Complex Data Management* - Keep operates well in environments where there are many independent users accessing the same data or users who want to organize data in many different ways. Keep facilitates data sharing without expecting users either to agree with one another about directory structures or to create redundant copies of the data.
* *Security* - Biomedical data is assumed to have strong security requirements. Keep works well combined with encryption at rest and in transit. Fine-grained permission management is achieved using data collections.
h2. Content Addressing 
Keep is a content-addressable distributed storage system. 
Instead of using a location-addressed storage approach -- where each file is retrieved based on the location where it is stored, typically using POSIX semantics -- Keep uses a content address to locate, retrieve, and validate files. The content address is based on a cryptographic digest of the data in the object. Each content address is a permanent universally unique identifier (UUID). 
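
To make this concrete, here is a minimal sketch of how a Keep-style content address for a single data block can be computed, assuming the md5-plus-size locator form described in [[Keep locator format]] (permission and other hints are omitted):

<pre><code class="python">
import hashlib

def block_locator(data: bytes) -> str:
    """Compute a Keep-style locator for one data block: the MD5
    digest of the content, plus the content's size in bytes."""
    return "%s+%d" % (hashlib.md5(data).hexdigest(), len(data))

print(block_locator(b"foo"))  # acbd18db4cc2f85cedef654fccc4a4d8+3
</code></pre>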
By using content addresses, Keep uses a flat address space that is highly scalable and in effect virtualizes storage across large pools of commodity drives or cloud storage. A metadata store makes it possible to tag objects with an arbitrary number of metadata tags. By separating metadata from objects, it’s possible to create separate metadata namespaces for different projects without needing to duplicate the underlying data.
Keep uses manifests (text files, themselves stored in Keep, that list file names and content addresses) to define collections of files. Each collection has its own content address and GUID, like any other object stored in Keep. Collections make it possible to uniquely reference very large data sets using a very small identifier: a 40-byte collection UUID can describe 100 terabytes of data using a simple text-based manifest, while adding one layer of indirection increases this to 180 exabytes.
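
As a concrete illustration, the following sketch parses a minimal, hypothetical one-stream manifest holding two three-byte files packed into two blocks (see [[Keep manifest format]] for the authoritative definition):

<pre><code class="python">
# A hypothetical manifest: stream name, block locators, then
# position:size:name tokens describing files within the stream.
manifest = (". acbd18db4cc2f85cedef654fccc4a4d8+3 "
            "37b51d194a7513e45b56f6524f2d51f2+3 "
            "0:3:foo.txt 3:3:bar.txt\n")

for line in manifest.splitlines():
    tokens = line.split()
    stream_name = tokens[0]
    locators = [t for t in tokens[1:] if ":" not in t]
    files = [t for t in tokens[1:] if t.count(":") == 2]
    print(stream_name, locators, files)
# . ['acbd18db4cc2f85cedef654fccc4a4d8+3', ...] ['0:3:foo.txt', '3:3:bar.txt']
</code></pre>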
Collections make it possible to create an arbitrary number of data set definitions for different computations and analyses that are permanently recorded and validated, and do not require any physical reorganization or duplication of data in the underlying file system. For example, a researcher can query the metadata database to select existing collections, and build a new collection using only the desired files from the existing sets. This operation is both robust and efficient, since it can be done without reading or writing any of the underlying file data. Collections can be directly referenced in pipeline instances to pass data to individual jobs as inputs or record data from jobs as outputs.
Content-addressable storage is well suited to sequencing and other biomedical big data because it addresses the needs for provenance, reproducibility, and data validation. 
h2. Distributed Storage and Data Blocks
Keep combines content addressing with a distributed storage architecture inspired by the Google File System (GFS).
Keep is designed to store large numbers of large files on large clusters of commodity drives distributed across many nodes, or on cloud storage. The system splits and combines file data into 64 MiB data blocks and saves those to an underlying file system (e.g. Linux ext4) or object store (e.g. Amazon S3). It also replicates blocks across multiple nodes and optionally multiple racks. Keep creates a cryptographic digest of each block for content addressing and creates a manifest that describes the underlying data. The manifest is also stored in Keep and has a unique content address. Metadata is recorded in the metadata database.
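
The client-side splitting step can be sketched as follows; this is an illustration of the idea, not the SDK's actual implementation:

<pre><code class="python">
import hashlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MiB

def split_into_blocks(path):
    """Read a file and yield (locator, data) for each 64 MiB block."""
    with open(path, "rb") as f:
        while True:
            data = f.read(BLOCK_SIZE)
            if not data:
                break
            yield "%s+%d" % (hashlib.md5(data).hexdigest(), len(data)), data
</code></pre>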
!keep_workflow_diagram_3.png!
h3. Clients and Servers 
The Keep system is organized around Clients and Servers. Keep clients use an SDK for convenient access to the client-server API. A Keep Server runs on each node that has physical disks or access to cloud storage (in the optimal on-premises system, every node has physical disks). Clients connect directly to Keep servers. This many-to-many architecture eliminates the single point of failure and data bottleneck that would result from employing a master node. Instead of connecting to an indexing service, clients compute the expected location for each block based on the block's content address.
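
The idea can be sketched with rendezvous hashing; the server names below are made up, and the real probe-order computation is defined by the Keep SDKs:

<pre><code class="python">
import hashlib

def probe_order(block_hash: str, servers: list) -> list:
    """Rank servers for a block by hashing its content address together
    with each server's identity. Every client derives the same ranking,
    so no central index or master node is needed."""
    return sorted(servers,
                  key=lambda s: hashlib.md5((block_hash + s).encode()).hexdigest(),
                  reverse=True)

servers = ["keep0.example", "keep1.example", "keep2.example"]  # hypothetical
print(probe_order("acbd18db4cc2f85cedef654fccc4a4d8", servers))
</code></pre>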
!keep_architecture_diagram_v1.png!
h4. Clients
A Keep client SDK can be incorporated into a variety of different clients that interface with Keep. It takes care of most of the responsibilities of the client in the Keep architecture (a minimal end-to-end sketch follows the lists below):
* split large file data into 64 MiB data blocks
* combine small file data into 64 MiB data blocks
* encode directory trees as manifests
* write data to the desired number of nodes to achieve storage redundancy
* register collections in the metadata database, using the Arvados REST API
* parse manifests
* verify data integrity by comparing the locator to the cryptographic digest of the retrieved data
The responsibilities of the client application include:
* Record appropriate metadata in the metadata database
* Specify the desired storage redundancy (if different from the site default)
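
The sketch below ties these responsibilities together for the write path: it packs two small files into one block, computes the block's locator, and encodes the result as a one-stream manifest. It is illustrative only, and omits replication and the REST API call that registers the collection:

<pre><code class="python">
import hashlib

def write_collection(files):
    """files: dict mapping file name -> bytes. Pack the file data into
    one block and build a minimal one-stream manifest for it."""
    block = b"".join(files.values())
    locator = "%s+%d" % (hashlib.md5(block).hexdigest(), len(block))
    tokens, pos = [], 0
    for name, data in files.items():
        tokens.append("%d:%d:%s" % (pos, len(data), name))
        pos += len(data)
    # A real client would now write the block to Keep servers in probe
    # order and register the manifest via the Arvados REST API.
    return ". %s %s\n" % (locator, " ".join(tokens))

print(write_collection({"foo.txt": b"foo", "bar.txt": b"bar"}))
# . 3858f62230ac3c915f300c664312c63f+6 0:3:foo.txt 3:3:bar.txt
</code></pre>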
h4. Servers
Keep Servers have several responsibilities (a sketch of the write-path integrity check follows the list):
* Write data blocks to disk
* Ensure data integrity when writing by comparing data with client-supplied cryptographic digest
* Ensure data integrity when reading by comparing data with content address
* Retrieve data blocks and send to clients over the network (subject to permission, which is determined by the system/metadata DB)
* Send read, write, and error event logs to the Data Manager
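
A minimal sketch of the write-path integrity check, where an in-memory dict stands in for the real disk or object-store backend:

<pre><code class="python">
import hashlib

BLOCKS = {}  # stand-in for the underlying file system or object store

def handle_put(requested_hash: str, body: bytes) -> int:
    """Store the block only if the MD5 of the received data matches
    the client-supplied digest; otherwise reject the request."""
    if hashlib.md5(body).hexdigest() != requested_hash:
        return 400  # digest mismatch: corrupted in transit or bad request
    BLOCKS[requested_hash] = body
    return 200

print(handle_put("acbd18db4cc2f85cedef654fccc4a4d8", b"foo"))  # 200
print(handle_put("acbd18db4cc2f85cedef654fccc4a4d8", b"bar"))  # 400
</code></pre>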
h2. Keep Balance
Arvados leverages Keep extensively to enable the development of pipelines and applications. In addition to Keep, Arvados includes [[keep-balance|Keep Balance]], a component that assists Keep servers in enforcing site policies and monitors the state of the storage facility as a whole. By 2018, Keep Balance is also expected to support automatically moving data between storage tiers (e.g. Hot/Cool/Cold on Microsoft Azure).
h2. Interface
Keep provides a straightforward API that makes it easy to write, read, organize, and validate files. It does not conform to traditional POSIX semantics; however, individual collections and the whole storage system can be mounted with a FUSE driver and accessed as a POSIX volume.
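
For example, once a collection is mounted with the FUSE driver (the @arv-mount@ tool), any POSIX-aware program or library can read it; the mount point and file name below are hypothetical:

<pre><code class="python">
# Read a file from a FUSE-mounted collection with ordinary POSIX I/O.
# "/mnt/keep" is a hypothetical mount point created with arv-mount.
with open("/mnt/keep/by_id/<collection-uuid>/foo.txt", "rb") as f:
    data = f.read()
print(len(data))
</code></pre>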
h2. Benefits
Keep offers a variety of major benefits over other distributed storage systems and scale-out disk file systems such as Isilon. This is a summary of some of those benefits:
* *Elimination of Duplication* - One of the major storage management problems for research labs is the unnecessary duplication of data. Often researchers will make copies of data for backup or to re-organize files for different projects. Content addressing automatically eliminates unnecessary duplication: if a program saves a file when an identical file has already been stored, Keep simply reports success without having to write a second copy.
* *Canonical Records* - Content addressing creates clear and verifiable canonical records for files. By combining Keep with the computation system in Arvados, it becomes trivial to verify the exact file that was used for a computation. By using a collection to define an entire data set (which could be hundreds of terabytes or petabytes), you maintain a permanent and verifiable record of which data were used for each computation. The file that defines a collection is very small relative to the underlying data, so you can make as many as you need.
* *Provenance* - The combination of Keep and the computation system make it possible to maintain clear provenance for all the data in the system. This has a number of benefits including making it easy to ascertain how data were derived at any point in time. 
* *Easy Management of Temporary Data* - One benefit of systematic provenance tracking is that Arvados can automatically manage temporary and intermediate data. If you know how a data set or file was created, you can decide whether it is worthwhile to save a copy. Knowing what pipeline was run on which input data, how long it took, etc., makes it possible to automate such decisions.
* *Flexible Organization* - In Arvados, files are grouped in collections and can be easily tagged with metadata. Different researchers and research teams can manage independent sets of metadata. This makes it possible for researchers to organize files in a variety of different ways without duplicating or physically moving the data. A collection is represented by a text file, which lists the filenames and data blocks comprising the collection, and is itself stored in Keep. As a result, the same underlying data can be referenced by many different collections, without ever copying or moving the data itself.
* *High Reliability* - By combining content addressing with distributed storage, Keep is fault tolerant across drive and even node failures. The Data Manager monitors the replication level of each data collection. Storage redundancy can thus be adjusted according to the relative importance of individual data sets in addition to default policy.
* *Security and Access Control* - Keep can work on top of encrypted filesystems. Its storage architecture makes implementing very fine-grained access control significantly easier than on traditional disk file systems.
* *POSIX Interface* - Collections can be mounted as a drive with POSIX semantics using FUSE to access data with tools that expect a POSIX interface. Because collections are so flexible, one can easily create many different virtual directory structures for the same underlying files without copying or even reading the underlying data. Combining the native Arvados tools with UNIX pipes provides better performance, but the POSIX mount option is more convenient in some situations.
* *Data Sharing* - Keep makes it much easier to share data between clusters in different data centers and organizations. Keep content addresses include information about which cluster data is stored on. With federated clusters, collections of data can reside on multiple clusters, and distribution of computations across clusters can eliminate slow, costly data transfers.
* *Versioning* - Each collection in Keep has a Globally Unique ID (GUID), which is consistent for the collection over time. The collection also has a content address, which provides a durable reference to a specific version of a collection. This mechanism is similar to Git, and it makes it possible to add or remove files from a collection and still reliably retrieve older versions of the collection. 
h2. Background
Keep was first developed for the Harvard Personal Genome Project (PGP) in 2007 (see overall project [[history]]). 
The system was designed to combine the innovations "documented by Google":http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/gfs-sosp2003.pdf in the Google File System (GFS) with the concept of content-addressable storage to address the unique needs of biomedical data. "Content-addressable storage (CAS)":http://en.wikipedia.org/wiki/Content-addressable_storage is an idea that was pioneered in the late 1990s by FilePool and first commercialized in the EMC Centera platform.
While Keep takes a similar approach to these systems, its focus on content addressing and a set of capabilities aligned with the needs of large-scale data science make it very different.