Bryan Cosca, 02/18/2015 10:49 PM

h1. Porting Pathomap / Ancestry Mapper to Arvados

I recently read an article on the difference between scientific coding and software engineering. My main takeaway was this quote: "Before software can be reusable, it has first to be usable." - Ralph Johnson.

As a recent graduate in the bioinformatics world, I quickly learned that most software created for research is very hard to run reproducibly. You need the right dependencies, settings, environments, scripts, paths to files, and so on, and it is very hard to get all of that right on your first try. You do some research, keep rerunning your pipeline, and hope that this time you finally see a completed sign and some new results. To scientists, it's not about creating reproducible code, it's about getting results. To software engineers, it's about making your code perfectly readable and modular. Scientists go for speed, whereas software engineers go for reusability.

Arvados is an easy way to show others your entire project from start to finish. It's trivial to reproduce others' findings and inspect for yourself how everything was created. But before you get there, you first need to have your code hosted on the platform. Other platforms for genomics make it pretty difficult to get from your script to production, but with Arvados it's pretty simple. I can see scientists wanting to port their pipelines to Arvados because so little has to be transformed: there's not much to change within your scripts, no matter how unique they are.

I learned a great deal about the bioinformatics world from porting the "Pathomap":http://www.pathomap.org/ project to Arvados. For example, I have often seen bioinformaticians hard-code file paths that point into their own directories. Writing "vcf <- /home/name/folder/path/to/file/foo.vcf" is common practice. Although that is not modular, it's not a problem within Arvados: you can just change that one line to accept outside parameters. Modular code is at the core of pipeline porting, because a pipeline has to accept different parameters all the time. Bioinformatics software, for instance, is constantly being updated. The version of BWA you used two weeks ago could be outdated tomorrow. You wouldn't want to search through all your scripts to change that parameter. It would take hours, and even if you did, how would you know for sure that you changed everything?
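
To make the "accept outside parameters" idea concrete, here is a minimal sketch in Python (the line quoted above looks like R, so this is an illustration of the idea, not the Pathomap code). The flag names @--vcf@ and @--bwa-version@ and the helper names are made up for this example.

```python
import argparse


def parse_args(argv=None):
    """Read the input path and a tool version label from the command line."""
    parser = argparse.ArgumentParser(description="Hypothetical pipeline step")
    # Hypothetical flags: the path is no longer baked into the script.
    parser.add_argument("--vcf", required=True, help="path to the input VCF file")
    parser.add_argument("--bwa-version", default="unspecified",
                        help="version label to record alongside the results")
    return parser.parse_args(argv)


def describe_run(args):
    """Return a one-line summary of what this step would process."""
    return "input={} bwa={}".format(args.vcf, args.bwa_version)


if __name__ == "__main__":
    print(describe_run(parse_args()))
```

With something like this in place, changing the input file or the recorded tool version means changing one argument on the command line instead of searching through every script.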

On to porting pipelines to Arvados. The first thing I use to port pipelines as quickly as possible is our generic "command line wrapper", because it takes away the need to build infrastructure and boilerplate code to use our system. I can use this script to insert parameters from the command line, which makes user scripts modular. I simply change the input paths in the script to accept command line inputs, then test to make sure it works. I put all the scripts I need in my git repository so that I can get back the exact change I committed and know exactly what is going into my pipeline. I also write my Dockerfile from scratch so that I know exactly what is going into my environment as well.

Once the pipeline is ported, the first thing I want to know is whether it is working as intended. I test intensively, making sure that the md5sums and contents of the output files match what I get locally as well as what is in the publication. Once everything works as intended, I make sure the versions of my git repo and Docker image are pinned. This ensures that changing the Docker image slightly or committing to my repo will not change the pipeline.
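
The md5sum comparison can be scripted. This is a minimal sketch using Python's standard @hashlib@ module; the function names @md5sum@ and @outputs_match@ are my own for this example, and the two paths stand for a locally produced file and the same file produced by the ported pipeline.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(local_path, ported_path):
    """True if the local output and the ported output are byte-for-byte identical."""
    return md5sum(local_path) == md5sum(ported_path)
```

A matching checksum only shows that the bytes are identical. Outputs that embed timestamps or run IDs can differ while still being correct, which is one reason to compare contents as well.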

The easiest method for porting pipelines:

# Add the script to an Arvados git repository
# Create a JSON template
# Create or reuse a Docker image (or Dockerfile) with all the dependencies the script needs
# Test to make sure the results match
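
As a rough illustration of the pinning mentioned earlier, the sketch below checks that a template component names an exact commit and an explicit image tag. The field names @script_version@ and @docker_image@, the script name, and all values are placeholders I chose for the example. They are assumptions and may not match the actual template schema.

```python
import re

# A full 40-character SHA-1, as printed by `git rev-parse HEAD`.
FULL_COMMIT = re.compile(r"^[0-9a-f]{40}$")


def unpinned_fields(component):
    """List the fields of a component that could still drift over time."""
    problems = []
    # A branch name such as "master" moves with every new commit.
    if not FULL_COMMIT.match(component.get("script_version", "")):
        problems.append("script_version")
    # Require an explicit tag other than "latest" on the image name.
    image = component.get("docker_image", "")
    if not image or ":" not in image or image.endswith(":latest"):
        problems.append("docker_image")
    return problems


# Hypothetical template component; every value here is a placeholder.
component = {
    "script": "ancestry_step.R",
    "script_version": "0123456789abcdef0123456789abcdef01234567",
    "docker_image": "example/pathomap:2015-02-18",
}

print(unpinned_fields(component))
```

An empty list means both fields are pinned in the sense checked here. Note that an image tag can still be reassigned, so an image hash would be a stricter pin.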