
h1. Porting Pathomap to Arvados. 

I recently read an article on the difference between scientific coding and software engineering, and one quote really stood out: "Before software can be reusable, it has first to be usable." - Ralph Johnson.

As a recent graduate in the bioinformatics world, I quickly learned that most software created for research is very hard to reproduce. You need the right dependencies, settings, environment, scripts, paths to files, and so on. Getting all of that right on your first try is very hard. You do some research, keep running your pipeline, and hope that when you hit run you finally see a completed sign and some new results. To scientists, it's not about creating reproducible code, it's about getting results. To software engineers, it's about making your code readable and modular. Scientists go for speed, whereas software engineers go for reusability.

Arvados is an easy way to show others your entire project from start to finish. It's trivial to reproduce others' findings and inspect for yourself how everything was created. But before you get there, you first need to have your code hosted on the platform. Other PaaS for genomics make it pretty difficult to get from your script to their platform, but with Arvados it's pretty simple. I can see scientists wanting to port their pipelines to Arvados for its ease of transformation: there's not much to change within your scripts, no matter how unique they are.

I learned a great deal about the bioinformatic world from porting the "Pathomap":http://www.pathomap.org/ project to Arvados. Porting did require some script tweaking. For example, I've seen many times that bioinformaticians like to hard-code paths to files in their own directories. Writing "vcf <- /home/name/folder/path/to/file/foo.vcf" is common practice, but it isn't helpful for anyone else trying to find those files. Although not modular, it's not a problem within Arvados: you can just change that one line to accept outside parameters. Modularity of code is important to recreate what you've found, and I think it's the core of pipeline porting. You need to be able to accept different parameters all the time, so that other people can use your code on their own system as well as on platforms for actually re-running data. For example, bioinformatic software is constantly getting updated: the version of BWA you used two weeks ago could be outdated tomorrow. You wouldn't want to search through all your scripts and change that path by hand; it would take hours, and even if you did, how would you know for sure that you changed everything? Once your code is modular, however, it's very easy to change that parameter.
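To make that concrete, here's a minimal sketch of the same idea in Python (the file and argument names are hypothetical, not from the Pathomap scripts): instead of hard-coding the path, the script takes it from the command line, so a platform or another person can swap in a different file without editing the code.

<pre>
#!/usr/bin/env python
# Hypothetical example: accept the VCF path as a parameter instead of
# hard-coding something like /home/name/folder/path/to/file/foo.vcf.
import argparse

parser = argparse.ArgumentParser(description="Count records in a VCF")
parser.add_argument("--vcf", required=True, help="path to the input VCF file")
args = parser.parse_args()

count = 0
with open(args.vcf) as f:
    for line in f:
        if not line.startswith("#"):  # skip VCF header lines
            count += 1
print("records:", count)
</pre>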

Onto porting pipelines to Arvados! I currently use a generic "command line wrapper" for porting pipelines because it takes away the need to build infrastructure and boilerplate code to use our system. I can use this script and insert parameters from the command line to get other scripts to work; I simply change the pathing to the inputs. However, this method does have flaws. You lose a ton of flexibility in how you run your scripts: you cannot choose what's going into your output collections, or use any sort of boolean logic. It's simply a script that accepts command line inputs, pipes them to your script, which then pipes to your output, and you hope it works. I put all the scripts I need in my git repository to make sure I can get the exact change I committed and know exactly what is going into my pipeline. I also write my dockerfile from scratch to make sure I know exactly what is going into my environment as well.
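The wrapper itself isn't shown here, but the shape of it is roughly the following sketch (a hypothetical, simplified stand-in, not the actual Arvados wrapper): take an input file, an output file, and a command; pipe the input into the command and pipe whatever it prints into the output. That's also where the flaws come from: there's no boolean logic and no say over what lands in the output.

<pre>
#!/usr/bin/env python
# Rough sketch of a generic "command line wrapper" (hypothetical, simplified).
# It pipes the input file into the wrapped command's stdin and captures the
# command's stdout as the output file -- nothing more, nothing less.
import subprocess
import sys

def run_wrapped(command, input_path, output_path):
    """Feed input_path to `command` on stdin and write its stdout to output_path."""
    with open(input_path) as infile, open(output_path, "w") as outfile:
        subprocess.check_call(command, stdin=infile, stdout=outfile)

if __name__ == "__main__":
    # Hypothetical usage: wrapper.py input.vcf output.txt python my_script.py --flag
    run_wrapped(sys.argv[3:], sys.argv[1], sys.argv[2])
</pre>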

My method for porting pipelines, in the simplest sense, is:

# Insert script into git repository
# Create JSON template for job (a rough sketch follows this list)
# Create/reuse docker image for all the dependencies that the script needs
 # Test to make sure the results match 
 # Repeat.
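As a rough illustration of step 2, the JSON template is what ties the other steps together: the git repository and script, the parameters the script accepts, and the docker image carrying its dependencies. The field names below follow an Arvados pipeline template as I understand it, but the names and values are hypothetical, so treat this as a sketch rather than a working template.

<pre>
#!/usr/bin/env python
# Hypothetical sketch of a job/pipeline template: a script in a git repo,
# its parameters, and a docker image for the dependencies. Field names are
# an assumption about the Arvados template layout, not a verified schema.
import json

template = {
    "name": "pathomap-example-step",              # hypothetical name
    "components": {
        "align": {
            "repository": "your_repo",            # step 1: git repository holding the script
            "script": "run_alignment.py",         # hypothetical script name
            "script_version": "master",
            "script_parameters": {
                "input": {"required": True, "dataclass": "Collection"}
            },
            "runtime_constraints": {
                "docker_image": "your/docker-image"  # step 3: image with the dependencies
            },
        }
    },
}

with open("pipeline_template.json", "w") as f:
    json.dump(template, f, indent=2)
</pre>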