Porting Pathomap / Ancestry Mapper to Arvados

I recently read an article on the difference between scientific coding and software engineering. The main takeaway I got was: "Before software can be reusable, it has first to be usable." - Ralph Johnson.

As a recent graduate in the bioinformatics world, I quickly learned that most software created for research is very hard to reproduce. You need to have the right dependencies, settings, environments, scripts, paths to files, and so on. It is very hard to get all of that right on your first try. You do some research, keep rerunning your pipeline, and hope that this time you finally see it complete and produce new results. To scientists, it's not about creating reproducible code, it's about getting results. To software engineers, it's about making your code readable and modular. Scientists go for speed, whereas software engineers go for reusability.

Arvados is an easy way to show others your entire project from start to finish. It's trivial to reproduce others' findings and inspect for yourself how everything was created. But before you get there, you first need to have your code hosted on the platform. Other platforms for genomics make it pretty difficult to get from your script to production, but with Arvados it's pretty simple. I can see scientists wanting to port their pipelines to Arvados because so little has to be transformed. There's not much to change within your scripts, no matter how unique they are.

I learned a great deal about the bioinformatics world from porting the Pathomap Project to Arvados. For example, I've often seen bioinformaticians hard-code the paths to files using their own directories. Writing "vcf <- /home/name/folder/path/to/file/foo.vcf" is common practice for pointing to files. Although not modular, it's not a problem within Arvados. You can just change that one line to accept outside parameters. Modularity of code is at the core of pipeline porting, because a pipeline needs to be able to accept different parameters all the time. For example, bioinformatics software is constantly getting updated. The version of BWA you used two weeks ago could be outdated tomorrow. You wouldn't want to search through all your scripts to change that parameter. It would take hours, and even if you did, how would you know for sure that you changed everything?
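As a rough illustration of changing that one hard-coded line to accept an outside parameter, here is a minimal Python sketch. It is not taken from the Pathomap scripts: the `--vcf` flag and the helper names `parse_args` and `count_variant_records` are hypothetical, and the counting step only stands in for whatever the real script does with the file.

```python
import argparse


def parse_args(argv=None):
    """Read the VCF path from the command line instead of hard-coding it."""
    parser = argparse.ArgumentParser(description="Hypothetical analysis step")
    # "--vcf" is an illustrative flag name (an assumption, not a Pathomap option).
    parser.add_argument("--vcf", required=True, help="path to the input VCF file")
    return parser.parse_args(argv)


def count_variant_records(vcf_path):
    """Count the non-header, non-empty lines of an uncompressed VCF file."""
    with open(vcf_path) as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )


def main(argv=None):
    """Entry point: the path now comes from the caller, not from the script."""
    args = parse_args(argv)
    return count_variant_records(args.vcf)
```

A script built this way would be run as something like `python script.py --vcf /path/to/foo.vcf` (with a call to `main()` at the bottom), so the same code works on any machine or platform without editing it.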

Onto porting pipelines to Arvados. The first thing I use to port pipelines as quickly as possible is our generic "command line wrapper", because it takes away the need to build infrastructure and boilerplate code to use our system. I can use this script and insert parameters from the command line in order to make user scripts modular. I simply change the input paths in the script to accept command line inputs and go through testing to make sure it works. I put all the scripts I need in my git repository so that I can retrieve the exact commit I made and know exactly what is going into my pipeline. I also write my Dockerfile from scratch so that I know exactly what is going into my environment as well.

Once the pipeline is ported, the first thing I want to know is whether it is working as intended. I test intensively, making sure that the md5sums and contents of the output files match the results produced locally as well as those in the publication. Once everything works as intended, I make sure the versions of my git repo and Docker image are pinned. This ensures that changing the Docker image slightly or committing to my repo will not change the pipeline.
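The checksum comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes both sets of outputs are available as ordinary local directories (for example, after downloading the ported pipeline's results), and the function names `md5sum` and `compare_outputs` are made up for this example.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_outputs(local_dir, ported_dir):
    """List file names whose checksums differ or that exist on one side only."""
    local = {p.name: md5sum(p) for p in Path(local_dir).iterdir() if p.is_file()}
    ported = {p.name: md5sum(p) for p in Path(ported_dir).iterdir() if p.is_file()}
    # A name missing from one side yields None there, so it is reported too.
    return sorted(
        name for name in local.keys() | ported.keys()
        if local.get(name) != ported.get(name)
    )
```

An empty list from `compare_outputs` means every file matched byte for byte. Matching checksums are a strict test, so outputs that embed timestamps or run-specific paths may still need their contents inspected by hand.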

The easiest method for porting pipelines:

  1. Add the script to an Arvados git repository
  2. Create a JSON template
  3. Create or reuse a Docker image/Dockerfile with all the dependencies that the script needs
  4. Test to make sure the results match