Keep » History » Version 8

Tom Clegg, 04/12/2013 05:32 PM

h1. Keep

Keep is a distributed content-addressable storage system designed for high performance in I/O-bound cluster environments.

Notable design goals and features include:

* High scalability
* Node-level redundancy
* Maximum overall throughput in a busy cluster environment
* Maximum data bandwidth from client to disk
* Minimum transaction overhead
* Elimination of disk thrashing (commonly caused by multiple simultaneous readers)
* Client-controlled redundancy

h2. Design

The above goals are accomplished by the following design features.

* Data are transferred directly between the client and the physical node where the disk is connected.
* Data collections are encoded in large (≤64 MiB) blocks to minimize short read/write operations.
* Each disk accepts only one block-read/write operation at a time. This prevents disk thrashing and maximizes total throughput when many clients compete for a disk.
* The client directly controls storage redundancy, and can easily verify it, simply by writing or reading a block of data on multiple nodes.
* Data block distribution is computed based on a cryptographic digest of the data block being stored or retrieved. This eliminates the need for a central or synchronized database of block storage locations.
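The last point can be illustrated with a minimal Python sketch. It assumes an MD5 digest (the digest named later on this page) and a rendezvous-style ranking of nodes. The ranking rule, the node names, and the functions @block_digest@, @probe_order@ and @choose_nodes@ are hypothetical illustrations, not Keep's actual placement algorithm.

```python
import hashlib
from typing import List


def block_digest(data: bytes) -> str:
    """Return the hex MD5 digest used here as the block's content address."""
    return hashlib.md5(data).hexdigest()


def probe_order(digest: str, nodes: List[str]) -> List[str]:
    """Rank nodes for a block using only the digest and the node names.

    Assumption: a rendezvous-style ranking, chosen for illustration only.
    """
    def weight(node: str) -> str:
        return hashlib.md5((digest + node).encode("utf-8")).hexdigest()

    return sorted(nodes, key=weight)


def choose_nodes(data: bytes, nodes: List[str], replicas: int = 2) -> List[str]:
    """Pick the first `replicas` nodes in probe order for this block."""
    return probe_order(block_digest(data), nodes)[:replicas]


if __name__ == "__main__":
    nodes = ["keep0", "keep1", "keep2", "keep3"]  # hypothetical node names
    print(choose_nodes(b"example block", nodes))
```

Because every client derives the same order from the digest alone, a reader can locate a block without consulting a central index of storage locations.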

h2. Components

The Keep storage system consists of data block read/write services, SDKs, and management agents.

The responsibilities of the Keep service are:

* Write data blocks
* When writing: ensure data integrity by checking the data against the client-supplied cryptographic digest
* Read data blocks (subject to permission, which is determined by the system/metadata DB)
* Send read/write/error event logs to management agents
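The write-time integrity check can be sketched as follows. This is an in-memory stand-in, not the Keep service: the @BlockStore@ class, its method names, and the use of a plain hex MD5 string as the block address are assumptions for illustration, and permission checks and event logging are omitted.

```python
import hashlib
from typing import Dict


class DigestMismatch(ValueError):
    """Raised when the data does not match the client-supplied digest."""


class BlockStore:
    """Hypothetical in-memory block store keyed by MD5 digest."""

    def __init__(self) -> None:
        self._blocks: Dict[str, bytes] = {}

    def put(self, claimed_digest: str, data: bytes) -> str:
        """Store a block only if its digest matches what the client claimed."""
        actual = hashlib.md5(data).hexdigest()
        if actual != claimed_digest.lower():
            raise DigestMismatch(f"expected {claimed_digest}, got {actual}")
        # Content addressing: an identical block is stored only once.
        self._blocks.setdefault(actual, data)
        return actual

    def get(self, digest: str) -> bytes:
        """Return the block stored under this digest (KeyError if absent)."""
        return self._blocks[digest]
```

Writing the same block twice leaves a single stored copy, and a block whose content does not match the supplied digest is rejected before anything is stored.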

The responsibilities of the SDK are:

* When writing: split data into ≤64 MiB chunks
* When writing: encode directory trees as manifests
* When writing: write data to the desired number of nodes to achieve storage redundancy
* After writing: register a collection with Arvados
* When reading: parse manifests
* When reading: verify data integrity by comparing locator to MD5 digest of retrieved data
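The splitting and read-time verification steps above can be sketched as follows. The function names are hypothetical, and the sketch assumes that a locator is simply the hex MD5 digest of the block. Manifest encoding, multi-node writes, and collection registration are left out.

```python
import hashlib
from typing import Iterator

MAX_BLOCK_SIZE = 64 * 1024 * 1024  # 64 MiB upper bound from the text above


def split_blocks(data: bytes, block_size: int = MAX_BLOCK_SIZE) -> Iterator[bytes]:
    """Yield consecutive chunks of at most `block_size` bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for offset in range(0, len(data), block_size):
        yield data[offset:offset + block_size]


def locator_for(block: bytes) -> str:
    """Assumed locator format: the hex MD5 digest of the block."""
    return hashlib.md5(block).hexdigest()


def verify_block(locator: str, block: bytes) -> bool:
    """Check retrieved data against the locator it was requested by."""
    return locator_for(block) == locator
```

A reader would call @verify_block@ on every block it retrieves and treat a mismatch as a failed read, for example by retrying on another node.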

The responsibilities of management agents are:

* Verify validity of permission tokens
* Determine which blocks have higher or lower redundancy than required
* Monitor disk space and move or delete blocks as needed
* Collect per-user, per-group, per-node, and per-disk usage statistics
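The redundancy check in the second item could look roughly like this. The inputs are hypothetical: @node_index@ maps each node name to the set of block digests it reports, @wanted@ gives the required copy count per block, and @default_copies@ is an assumed fallback. How a management agent actually gathers this information is not specified here.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def audit_replication(
    node_index: Dict[str, Set[str]],
    wanted: Dict[str, int],
    default_copies: int = 2,
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (under, over): blocks mapped to their missing or surplus copies."""
    counts: Counter = Counter()
    for digests in node_index.values():
        counts.update(digests)

    under: Dict[str, int] = {}
    over: Dict[str, int] = {}
    # Include blocks that are required but not present on any node.
    for digest in set(counts) | set(wanted):
        diff = counts[digest] - wanted.get(digest, default_copies)
        if diff < 0:
            under[digest] = -diff
        elif diff > 0:
            over[digest] = diff
    return under, over
```

The @under@ result lists blocks that need additional copies, and @over@ lists candidates for deletion when disk space is short.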

h2. Benefits

Keep offers a variety of major benefits over POSIX file systems and other object file storage systems. This is a summary of some of those benefits:

* *Elimination of Duplication* — One of the major storage management problems today is the duplication of data. Often researchers will make copies of data for backup or to re-organize files for different projects. Content addressing automatically eliminates unnecessary duplication: if a program saves a file when an identical file has already been stored, Keep simply reports success without having to write a second copy.
* *Canonical Records* — Content addressing creates clear and verifiable canonical records for files. By combining Keep with the computation system in Arvados, it becomes trivial to verify the exact file that was used for a computation. By using a collection to define an entire data set (which could be 100s of terabytes or petabytes), you maintain a permanent and verifiable record of which data were used for each computation. The file that defines a collection is very small relative to the underlying data, so you can make as many as you need. 
* *Provenance* — The combination of Keep and the computation system makes it possible to maintain clear provenance for all the data in the system. Among other benefits, this makes it easy to ascertain how data were derived at any point in time.
* *Easy Management of Temporary Data* — One benefit of systematic provenance tracking is that Arvados can automatically manage temporary and intermediate data. If you know how a data set or file was created, you can decide whether it is worthwhile to keep a copy on disk. Knowing which pipeline was run on which input data, how long it took, etc., makes it possible to automate such decisions.
* *Flexible Organization* — In Arvados, files are grouped in collections and can be easily tagged with metadata. Different researchers and research teams can manage independent sets of metadata. This makes it possible for researchers to organize files in a variety of different ways without duplicating or physically moving the data. A collection is represented by a text file, which lists the filenames and data blocks comprising the collection, and is itself stored in Keep. As a result, the same underlying data can be referenced by many different collections, without ever copying or moving the data itself.
* *High Reliability* — By combining content addressing with an object file store, Keep is fault tolerant across drive and even node failures. The Data Manager monitors the replication level of each data collection. Storage redundancy can thus be adjusted according to the relative importance of individual datasets in addition to default site policy.
* *Easier Tiering of Storage* — The Data Manager in Arvados manages the distribution of files to storage systems such as a NAS or cloud backup service. The files are all content addressed and tracked in the metadata database: when a pipeline uses data that are not on the cluster, Arvados can automatically move the necessary data onto the cluster before starting the job. This makes tiered storage feasible without imposing an undue burden on end users.
* *Security and Access Control* — Keep can encrypt files on disk, and this storage architecture makes very fine-grained access control significantly easier to implement than on traditional POSIX file systems.
* *POSIX Interface* — Keep can be mounted as a POSIX filesystem in a virtual machine in order to access data with tools that expect a POSIX interface. Because collections are so flexible, one can easily create many different virtual directory structures for the same underlying files without copying or even reading the underlying data. Combining the native Arvados tools with UNIX pipes provides better performance, but the POSIX mount option is more convenient in some situations.
* *Data Sharing* — Keep makes it much easier to share data between clusters in different data centers and organizations. Keep content addresses include information about which cluster data is stored on. With federated clusters, collections of data can reside on multiple clusters, and distribution of computations across clusters can eliminate slow, costly data transfers.