Story #4687

[Crunch] Support Brad Chapman to port bcbio tools and workflows to CWL

Added by Peter Amstutz over 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 01/12/2015
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 2.0

Subtasks

Task #4949: Sketch out how to port bcbio-nextgen StandardPipeline (Resolved, Peter Amstutz)


Related issues

Related to Arvados - Story #4783: [CWL] Implement CWL prototype workflow runner (Resolved, 03/10/2015)

Related to Arvados - Story #4035: [Sample pipelines] Proof-of-concept support for common-workflow-language tool description in Arvados (Resolved, 11/13/2014)

Related to Arvados - Story #4685: [Crunch] CWL prototype workflow runner in Arvados (Resolved, 04/07/2015)

Related to Arvados - Feature #4605: [Workbench] Workbench support for generating input UI from common workflow tool description documents. (Resolved, 03/07/2017)

History

#1 Updated by Peter Amstutz over 5 years ago

  • Subject changed from Support Brad Chapman to port bcbio tools and workflows to CWL to [Crunch] Support Brad Chapman to port bcbio tools and workflows to CWL

#2 Updated by Tom Clegg over 5 years ago

  • Target version changed from Arvados Future Sprints to 2015-01-28 Sprint

#3 Updated by Peter Amstutz over 5 years ago

  • Assigned To set to Peter Amstutz

#4 Updated by Peter Amstutz about 5 years ago

Core of StandardPipeline with boilerplate stripped out:

samples = run_parallel("organize_samples", [[dirs, config, run_info_yaml,
                                             [x[0]["description"] for x in samples]]])
samples = run_parallel("process_alignment", samples)
samples = run_parallel("prep_samples", [samples])
samples = run_parallel("postprocess_alignment", samples)
samples = run_parallel("combine_sample_regions", [samples])
samples = qcsummary.generate_parallel(samples, run_parallel)

#5 Updated by Peter Amstutz about 5 years ago

Bcbio "pipelines" are just Python methods called run() that call each step in sequence. The work of each step is recorded in a structure called "samples", which each step transforms in turn. Individual bcbio steps can also be invoked on the command line using bcbio_nextgen.py runfn step_name config_file (where "config_file" is a subset of the "samples" structure describing one work task).
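To make the pattern concrete, here is a minimal Python sketch of it. It is not bcbio code: the step bodies, the sample fields other than "description", and the file name sample1.yaml are hypothetical stand-ins, and the command line simply follows the runfn form quoted above (assuming the script is named bcbio_nextgen.py).

```python
import shlex


# Hypothetical stand-ins for bcbio steps: each one takes the "samples"
# structure and returns a transformed copy of it.
def organize_samples(samples):
    return [dict(s, organized=True) for s in samples]


def process_alignment(samples):
    return [dict(s, aligned=True) for s in samples]


STEPS = [organize_samples, process_alignment]


def run(samples):
    """Call each step in sequence, threading "samples" through."""
    for step in STEPS:
        samples = step(samples)
    return samples


def runfn_command(step_name, config_file):
    """Build the argument list for invoking one step on its own."""
    return ["bcbio_nextgen.py", "runfn", step_name, config_file]


result = run([{"description": "sample1"}])
command = runfn_command("process_alignment", "sample1.yaml")
print(result)
print(shlex.join(command))
```

Because every step has the same shape (samples in, samples out), each one is a natural candidate for wrapping as a standalone tool.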

There are two possible approaches:

1) There are hooks woven into bcbio for starting parallel jobs (this is how bcbio uses IPython, for example). I started to implement Arvados support in July 2014, but that project was put on the back burner. This is a fairly substantial amount of work, since not all of the assumptions that bcbio makes line up with the assumptions that Crunch makes.

2) From conversations with Brad, he would prefer to use Common Workflow Language to wrap each step as a "tool" and replace the existing Python-based pipelines with a CWL description of the pipeline. This is more interesting to him than just adding Arvados support, since it potentially means bcbio could run on more platforms.

I think this is also interesting to us, since it represents a significant use case to motivate CWL development.

Approach:

  1. Write tool description wrappers for several bcbio tools (enough to be able to port one of the small pipelines such as StandardPipeline)
    • Need to add the features that bcbio requires but the CWL tool description language currently lacks. In particular, need to add support for configuration file templates.
  2. Write the actual CWL pipeline and validate with reference implementation (#4783)
    • A draft CWL reference implementation doesn't exist yet (although prototypes are floating around); we need to push that forward.
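As a rough illustration of what "configuration file templates" could mean for a tool wrapper, the Python sketch below fills in a per-task config file from the tool's inputs before the step runs. This is only an assumption about the general idea, not the CWL design: the template text, its field names, and render_config are hypothetical.

```python
from string import Template

# Hypothetical config template; the field names are illustrative only.
CONFIG_TEMPLATE = Template(
    "description: $description\n"
    "algorithm:\n"
    "  aligner: $aligner\n"
)


def render_config(inputs):
    """Fill the template from a mapping of input names to values.

    Template.substitute raises KeyError if an input is missing, so an
    incomplete set of inputs fails before the step is started.
    """
    return CONFIG_TEMPLATE.substitute(inputs)


rendered = render_config({"description": "sample1", "aligner": "bwa"})
print(rendered)
```

The rendered text would then be written to the config_file passed to the step.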

At this point it should be possible to turn over the remaining work of porting bcbio to CWL to the bcbio community, but there's more work to be done to actually implement CWL in Arvados:

  1. Implement Arvados CWL pipeline runner and make sure bcbio works on it. (#4035) (#4685)
    • Will run as a single Arvados job that creates a bunch of tasks; we will not be able to get fine-grained task reuse or provenance until Crunch v2.
  2. Workbench features for running CWL pipelines and for generating a user interface for CWL pipelines. (#4605)

#6 Updated by Peter Amstutz about 5 years ago

  • Status changed from New to In Progress

#7 Updated by Ward Vandewege about 5 years ago

  • Status changed from In Progress to Resolved
