Pathomap blog » History » Version 4

Bryan Cosca, 02/17/2015 07:28 PM

h1. Porting Pathomap / Ancestry Mapper to Arvados

I recently read an article on the difference between scientific coding and software engineering. My main takeaway was a quote from Ralph Johnson: "Before software can be reusable, it first has to be usable."

As a recent graduate in the bioinformatics world, I quickly learned that the results of most research software are very hard to reproduce. You need the right dependencies, settings, environments, scripts, file paths, and so on, and it is very hard to get all of that right on the first try. You do some research, keep rerunning your pipeline, and hope that this time you finally see a completed sign and some new results. To scientists, it's not about creating reproducible code, it's about getting results. To software engineers, it's about making your code readable and modular. Scientists go for speed, whereas software engineers go for reusability.

Arvados is an easy way to show others your entire project from start to finish. It's trivial to reproduce others' findings and inspect for yourself how everything was created. But before you get there, you first need to have your code hosted on the platform. Other genomics platforms make it pretty difficult to get from your script to production, but with Arvados it's pretty simple. I can see scientists wanting to port their pipelines to Arvados because so little has to change: there's not much to modify within your scripts, no matter how unique they are.

I learned a great deal about the bioinformatics world from porting the "Pathomap":http://www.pathomap.org/ project to Arvados. For example, I've often seen bioinformaticians hard-code file paths that point into their own directories. Writing @vcf <- "/home/name/folder/path/to/file/foo.vcf"@ is common practice for pointing to files. Although not modular, it's not a problem within Arvados: you can just change that one line to accept outside parameters. Modularity is at the core of pipeline porting, because you need to be able to accept different parameters all the time. For example, bioinformatics software is constantly being updated. The version of BWA you used two weeks ago could be outdated tomorrow. You wouldn't want to search through all your scripts to change that parameter. It would take hours, and even if you did, how would you know for sure that you changed everything?

On to porting pipelines to Arvados. I currently use a generic "command line wrapper" for porting pipelines because it takes away the need to build infrastructure and boilerplate code to use our system. With this script I can insert parameters from the command line, which makes user scripts modular. I simply change the input paths in the script to accept command-line inputs and hope it works. I put all the scripts I need in my git repository so that I can retrieve the exact commit I made and know exactly what is going into my pipeline. I also write my Dockerfile from scratch so that I know exactly what is going into my environment as well.
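
To make "accept command-line inputs" concrete, here is a minimal Python sketch of a script that takes its input path as a parameter instead of hard-coding it. This is not the wrapper script mentioned above, only an illustration of the idea. The @--vcf@ option name and the @count_variant_records@ helper are hypothetical, made up for this example.

```python
import argparse


def parse_args(argv=None):
    """Read the input path from the command line instead of hard-coding it."""
    parser = argparse.ArgumentParser(
        description="Count variant records in a VCF file."
    )
    # --vcf is a hypothetical parameter name chosen for this sketch.
    parser.add_argument("--vcf", required=True, help="path to the input VCF file")
    return parser.parse_args(argv)


def count_variant_records(vcf_path):
    """Count non-empty lines that are not headers (VCF headers start with '#')."""
    with open(vcf_path) as handle:
        return sum(
            1 for line in handle if line.strip() and not line.startswith("#")
        )


def main(argv=None):
    # The path now comes from whoever launches the script, so the same
    # code runs unchanged on any machine or platform.
    args = parse_args(argv)
    print(count_variant_records(args.vcf))
```

Calling @main()@ at the bottom of the file would let whoever launches the script supply the path, so pointing it at a different file no longer means editing the code.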

The easiest method for porting pipelines:

# Insert the script into the git repository
# Create a JSON template
# Create or reuse a Docker image/Dockerfile with all the dependencies that the script needs
# Test to make sure the results match the original results
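
For the last step, one possible way to check that results match is to compare checksums of the output files from the original run and the ported run. The sketch below assumes both sets of outputs are available as local directories. The @compare_outputs@ and @file_digest@ names are hypothetical, and this is not an Arvados feature, just a plain Python illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_files(root):
    """Map each file's path relative to root onto its full path."""
    root = Path(root)
    return {p.relative_to(root): p for p in root.rglob("*") if p.is_file()}


def compare_outputs(original_dir, ported_dir):
    """Return relative paths that are missing on one side or differ in content."""
    original = list_files(original_dir)
    ported = list_files(ported_dir)
    mismatches = []
    for rel in sorted(set(original) | set(ported)):
        if rel not in original or rel not in ported:
            mismatches.append(str(rel))  # present in only one run
        elif file_digest(original[rel]) != file_digest(ported[rel]):
            mismatches.append(str(rel))  # same name, different bytes
    return mismatches
```

An empty list means every output file is byte-identical. A byte-for-byte comparison can be too strict for tools that embed timestamps or version strings in their output, so a mismatch is a prompt to look closer rather than proof that the port is wrong.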