Tom Clegg, 04/12/2013 04:21 PM

h1. Provenance and Reproducibility

Arvados aims to make it straightforward to repeat computations, reproduce results, and determine the provenance of data (_i.e._, how it was produced and where the raw input data came from). Several major design decisions, such as the use of content-addressable storage, specifically support this goal.

h2. The Value of Provenance and Reproducibility

Provenance and reproducibility benefit both system administrators and informaticians.

For system administrators, these capabilities make it easier and more cost-effective to maintain systems:

* *Data Management*: When you know the provenance of a file and whether you can easily regenerate it, you can automate the decision to retain or delete it. Intermediate results can be easily reproduced and are less likely to be needed at all, so deleting them or reducing their replication can substantially reduce storage costs.

* *Computation Optimization*: Arvados recognizes which jobs in a pipeline have already been run successfully. By default, when similar or identical pipelines are run, Arvados uses existing results where possible instead of running duplicate jobs. This optimizes the use of compute resources and simplifies application design.

* *Fault Tolerance*: The same features that support provenance and reproducibility also provide fault tolerance. Arvados can resume or repeat any job in order to recover from the hardware failures that are inevitable in large-scale computational analysis.

For informaticians, provenance and reproducibility offer further benefits:

* *Easily Ascertain the Origin of Data*: Informaticians frequently need to determine where data came from. When researchers change organizations, share data, or work on large projects, it's easy to lose track of the source of the data. In Arvados, any file created in the system can be automatically traced back to its original source with a single command.

* *Compare Pipelines*: It's common to run pipelines repeatedly, both during development and as part of an experimental method. Because Arvados automatically records every run of a pipeline, it's straightforward to compare two pipelines to see how they differ and where in a sequence of pipelines the output began to change. Similarly, after a pipeline has been used to analyze many data sets over time, Arvados makes it easy to verify that a large aggregate result set was generated with consistent settings, software versions, and other parameters.

* *Speed up Pipeline Iteration*: Because Arvados can check whether an identical job in a pipeline has already been run, and skip it if so, re-running a pipeline in which only a single job has changed is much faster. (Without automated features to support this optimization, informaticians are tempted to make such decisions manually, a practice which is error-prone and therefore often wastes more time than it saves.)

* *Maintain Permanent Copies of Work*: When informaticians publish their work, it is critical to maintain a permanent record of the methods used. Arvados makes it easy to do this correctly. All of the relevant data is stored in Keep in a way that can be verified exactly, to the bit. The pipelines and jobs that produced the analysis results are recorded in the database automatically. The code that was written is kept permanently in the Git repository and can be verified with cryptographic hashes. A copy of an entire virtual machine can also be made. This means published research, even complex analysis of very large datasets, can be easily reproduced and independently verified.

* *Continuous Background Validation*: Arvados can be configured to use idle compute resources to continuously validate data integrity and pipeline integrity.

Provenance and reproducibility are hard problems to solve. Historically, these responsibilities have been left to the informaticians. In the absence of suitable tools, the tendency is to retain far more data than necessary, consuming a large excess of storage space. Arvados is designed to make these operations happen automatically, allowing informaticians to work more efficiently and produce results that are more accurate and more useful.

h2. How Provenance and Reproducibility Work

Several key design features of Arvados work together to provide these provenance and reproducibility benefits.

* The content-addressable storage system, Keep, ensures that program inputs and outputs can always be specified and retrieved in a way that is immune to race conditions, data corruption, and renaming.

* The content-addressable revision control system, Git, provides similar features for program source code. Each version of code has a cryptographic hash which can be used to unambiguously specify a complete source code tree.

* The Job Manager records the hashes of all source code and input data used in a job, as well as the output produced by the job. This makes it easy to verify the integrity of the code and data used, and also provides enough information to repeat the job in the future.

* The Metadata Database maintains records of all pipelines and jobs that have been run. This makes it possible to search by either input or output to discover how data was produced and which computations have used it as input.
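
The way these features fit together can be illustrated with a small, self-contained Python sketch. It is not Arvados code and does not use any Arvados API: @ContentStore@, @JobLog@, @to_upper@, and the commit strings are hypothetical names chosen for illustration, and SHA-256 is simply an assumed hash function. The sketch shows data addressed by its content hash, a job record that stores the hashes of code, input, and output, reuse of an identical job, and a provenance lookup by output hash.

```python
import hashlib
import json


class ContentStore:
    """Toy content-addressable store (illustrative only, not Keep)."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # The address of the data is the hash of its bytes.
        digest = hashlib.sha256(data).hexdigest()
        self._blocks[digest] = data
        return digest

    def get(self, digest: str) -> bytes:
        data = self._blocks[digest]
        # Re-hash on read so corruption is detected, not silently returned.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("stored data does not match its address")
        return data


class JobLog:
    """Hypothetical stand-in for the job records described above."""

    def __init__(self, store: ContentStore):
        self.store = store
        self.records = {}  # job key -> job record

    @staticmethod
    def job_key(code_version: str, input_hash: str, params: dict) -> str:
        # Identical code, input, and parameters always yield the same key.
        spec = json.dumps(
            {"code": code_version, "input": input_hash, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(spec.encode()).hexdigest()

    def run(self, code_version, func, input_hash, params):
        """Run func on the stored input; return (record, reused)."""
        key = self.job_key(code_version, input_hash, params)
        if key in self.records:
            return self.records[key], True  # identical job already ran
        output = func(self.store.get(input_hash), **params)
        record = {
            "code": code_version,
            "input": input_hash,
            "params": params,
            "output": self.store.put(output),
        }
        self.records[key] = record
        return record, False

    def provenance(self, output_hash: str) -> list:
        """Return every recorded job that produced the given output."""
        return [r for r in self.records.values() if r["output"] == output_hash]


def to_upper(data: bytes, repeat: int = 1) -> bytes:
    """Example job body: upper-case the input and repeat it."""
    return data.upper() * repeat


if __name__ == "__main__":
    store = ContentStore()
    log = JobLog(store)
    raw = store.put(b"acgt\n")
    # "abc123" stands in for a commit hash identifying the code version.
    record, reused = log.run("abc123", to_upper, raw, {"repeat": 2})
    again, reused_again = log.run("abc123", to_upper, raw, {"repeat": 2})
    print(reused, reused_again)
    print(log.provenance(record["output"]))
```

In this sketch the second call to @run@ finds an existing record under the same key and returns it without executing the job again, which is the kind of reuse described earlier; a real system would additionally need to handle nondeterministic jobs, failures, and access permissions.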