Pathomap blog » History » Revision 2

Bryan Cosca, 02/17/2015 12:29 AM


Porting Pathomap to Arvados.

I recently read an article on the difference between scientific coding and software engineering, and one quote really stood out: "Before software can be reusable, it has first to be usable." - Ralph Johnson. As a recent graduate in the bioinformatics world, I quickly learned that most software created for research is very hard to reproduce. You need to have the right dependencies, settings, environment, scripts, etc. You get all of that right and then hope, just hope, that you can hit run, see a completed sign, and finally see some new results.

Arvados is an easy way to show others your entire project from start to finish. It's trivial to reproduce others' findings and inspect for yourself how everything was created. But before you get there, you first need to have your code hosted on the platform. Other PaaS offerings for genomics make it pretty difficult to get from your script to their platform, but with Arvados, it's pretty simple.

Porting the Pathomap project to Arvados required some script tweaking. For example, I've often seen bioinformaticians hard-code file paths that point into their own directories. Writing "vcf <- /home/name/folder/path/to/file/foo.vcf" is not helpful for anyone trying to recreate what you've found. I think the bioinformatics world needs to learn how to write modular code, so that other people can use it on their own systems as well as on platforms for actually re-running the analysis. Once the code is modular, however, it's very easy to get it into Arvados!

I currently use a generic "command line wrapper" for porting pipelines because it removes the need to build infrastructure and write boilerplate code to use our system. I can use this one script to call other scripts generically. However, this method does have flaws: you lose a ton of flexibility in how you run your scripts. You cannot choose what goes into your output collections or apply any sort of boolean logic. It's simply a script that takes your inputs and pipes them to your script, which then pipes to your output.
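
To make the wrapper idea concrete, here is a minimal sketch in Python. It is not the actual Arvados wrapper; the function name `run_wrapped` and its parameters are hypothetical, chosen only for illustration. It shows the same shape, though: the input and output locations are passed in as arguments rather than hard-coded, and the wrapped script just reads stdin and writes stdout.

```python
import subprocess


def run_wrapped(command, input_path, output_path):
    """Run `command` with input_path on stdin and save its stdout to output_path.

    `command` is a list such as ["python", "my_script.py"]. Nothing about the
    caller's directory layout is baked in: both paths are arguments.
    (Hypothetical helper for illustration, not part of Arvados.)
    """
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        result = subprocess.run(command, stdin=src, stdout=dst)
    # Hand the exit code back so the caller can decide what counts as failure.
    return result.returncode
```

A call might look like `run_wrapped(["sort"], "variants.txt", "variants.sorted.txt")`, where both file names are placeholders. The sketch also shows the limitation mentioned above: everything the script prints ends up in the single output file, and there is no place to add conditional logic without writing a more specific script.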

My method for porting pipelines, in the simplest sense, is:

  1. Add the script to a git repo
  2. Create a JSON template for the job
  3. Create/reuse a Docker image with all the dependencies that the script needs
  4. Test to make sure the results match
  5. Repeat.
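
For step 4, one simple way to check that results match is to compare checksums of the output from the original run and the ported run. The sketch below assumes the outputs are expected to be byte-for-byte identical, which will not hold for tools that embed timestamps or other run-specific details in their output. The function names are hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large genomics files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(original_path, ported_path):
    """True if the two files have identical contents (assumed byte-for-byte)."""
    return file_sha256(original_path) == file_sha256(ported_path)
```

If the digests differ, a line-by-line diff of the two files is usually the next thing to look at.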

Updated by Bryan Cosca over 9 years ago · 2 revisions