Provenance and Reproducibility

Arvados aims to make it straightforward to repeat computations, reproduce results, and determine the provenance of data (that is, how it was produced and where the raw input data came from). Several major design decisions, such as the use of content-addressable storage, specifically support this goal.

The Value of Provenance and Reproducibility

Provenance and reproducibility have a wide variety of benefits for both system administrators and informaticians.

For system administrators, these capabilities make it easier and more cost-effective to maintain systems:

  • Data Management — When you know the provenance of a file and whether it can easily be regenerated, you can automate the decision to retain or delete it. Intermediate results can be easily reproduced and are less likely to be needed again, so deleting them or reducing their replication can substantially cut storage costs.
  • Computation Optimization — Arvados recognizes which jobs in a pipeline have already been run successfully. By default, when similar or identical pipelines are run, Arvados uses existing results where possible instead of running duplicate jobs. This optimizes the use of compute resources and simplifies application design.
  • Fault Tolerance — The same features that support provenance and reproducibility also create fault tolerance. Arvados can resume or repeat any job in order to recover from the hardware failures that are inevitable in large-scale computational analysis.

For informaticians, provenance and reproducibility offer a variety of benefits:

  • Easily Ascertain the Origin of Data — Informaticians frequently need to determine where data came from. When researchers change organizations, share data, or work on large projects, it's easy to lose track of the source of the data. In Arvados, any file created in the system can be automatically traced back to its original source with a single command.
  • Compare Pipelines — It's common to run pipelines repeatedly, both during development and as part of an experimental method. Because every run of a pipeline is automatically recorded by Arvados, it's straightforward to compare two pipeline runs to see how they differ and where in a sequence of runs the output began to change. Similarly, after using a pipeline to analyze many data sets over time, Arvados makes it easy to verify that a large aggregate result set was generated with consistent settings, software versions, and other parameters.
  • Speed up Pipeline Iteration — Because Arvados can check if an identical job in a pipeline has already been run and can therefore skip that job, re-running pipelines where only a single job has changed can happen much more quickly. (Without automated features to support this optimization, informaticians are tempted to make such decisions manually, a practice which is error-prone and therefore often wastes more time than it saves.)
  • Maintain Permanent Copies of Work — When informaticians publish their work, it is critical to maintain a permanent record of the methods used. Arvados makes it easy to do this correctly. All of the relevant data is stored in Keep in a way that can be verified exactly, to the bit. The pipelines and jobs that produced the analysis results are recorded in the database automatically. The code that was written is stored permanently in the Git repository and can be verified with cryptographic hashes. A copy of an entire virtual machine can also be made. This means published research, even complex analysis of very large datasets, can be easily reproduced and independently verified.
  • Continuous Background Validation — Arvados can be configured to use idle compute resources to continuously validate and check data integrity and pipeline integrity.
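
The comparison described under Compare Pipelines can be sketched with plain dictionaries standing in for recorded pipeline runs. The record fields (`script_version`, `parameters`, `output`) and the function names are hypothetical and chosen for illustration; they are not the Arvados record schema or API.

```python
def diff_runs(run_a, run_b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    differences = {}
    for field in sorted(set(run_a) | set(run_b)):
        a, b = run_a.get(field), run_b.get(field)
        if a != b:
            differences[field] = (a, b)
    return differences


def first_divergence(runs):
    """Return the index of the first run whose output differs from the previous one."""
    for i in range(1, len(runs)):
        if runs[i]["output"] != runs[i - 1]["output"]:
            return i
    return None


def consistent_settings(runs, fields):
    """Return True if every run recorded the same value for each given field."""
    if not runs:
        return True
    reference = runs[0]
    return all(
        run.get(field) == reference.get(field)
        for run in runs[1:]
        for field in fields
    )
```

Since outputs are identified by content hash, comparing two `output` values is a string comparison and does not require reading the data itself.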

Provenance and reproducibility are hard problems to solve. Historically, these responsibilities have been left to informaticians. Without suitable tools, the tendency is to keep far more data than necessary, which consumes a large excess of storage space. Arvados is designed to make these operations happen automatically, allowing informaticians to work more efficiently and produce results that are more accurate and more useful.

How Provenance and Reproducibility Work

Several key design features of Arvados work together to provide these provenance and reproducibility benefits.

  • The content-addressable storage system, Keep, ensures that program inputs and outputs can always be specified and retrieved in a way that is immune to race conditions, data corruption, and renaming.
  • The content-addressable revision control system, Git, provides similar features for program source code. Each version of code has a cryptographic hash which can be used to unambiguously specify a complete source code tree.
  • The Job Manager records the hashes of all source code and input data used in a job, as well as the output produced by the job. This simultaneously makes it easy to verify the integrity of the code and data used, and provides enough information to repeat the job in the future.
  • The Metadata Database maintains records of all pipelines and jobs that have been run. This makes it possible to search by either input or output and discover how data was produced and which computations have used it as input.
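
How these pieces fit together can be sketched in a few lines. The code below is an illustration under stated assumptions, not the Keep or Arvados API: `ContentStore`, `JobRecord`, `trace_origins`, and `used_by` are hypothetical names, storage is an in-memory dictionary, and SHA-256 is used only as an example hash.

```python
import hashlib
from dataclasses import dataclass


class ContentStore:
    """Toy content-addressable store: the address of data is a hash of its bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data  # identical content always maps to one address
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        # Re-hash on read so silent corruption is detected, not returned.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored data does not match its address")
        return data


@dataclass(frozen=True)
class JobRecord:
    """What a job record needs for provenance: code version, inputs, output."""

    code_version: str        # e.g. a revision-control commit hash
    input_addresses: tuple   # content addresses of the inputs
    output_address: str      # content address of the output


def trace_origins(address, jobs):
    """Walk job records backwards to the raw inputs that no recorded job produced."""
    produced_by = {job.output_address: job for job in jobs}
    origins, seen, stack = set(), set(), [address]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        job = produced_by.get(current)
        if job is None:
            origins.add(current)  # nothing produced it, so it is raw input
        else:
            stack.extend(job.input_addresses)
    return origins


def used_by(address, jobs):
    """Search in the other direction: which jobs consumed this data as input?"""
    return [job for job in jobs if address in job.input_addresses]
```

Because an address depends only on content, renaming a file or storing it twice cannot change what a job record refers to, which is what makes the backward and forward searches reliable.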