[Crunch] Support Brad Chapman to port bcbio tools and workflows to CWL
#4 Updated by Peter Amstutz about 5 years ago
Core of StandardPipeline with boilerplate stripped out:
samples = run_parallel("organize_samples", [[dirs, config, run_info_yaml, [x["description"] for x in samples]]])
samples = run_parallel("process_alignment", samples)
samples = run_parallel("prep_samples", [samples])
samples = run_parallel("postprocess_alignment", samples)
samples = run_parallel("combine_sample_regions", [samples])
samples = qcsummary.generate_parallel(samples, run_parallel)
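To make the shape of that code easier to follow, here is a minimal sketch of the pattern, not bcbio's actual implementation. The registry, the serial run_parallel stand-in, and the bodies of the two toy steps are all assumptions made for illustration; only the step names and the calling convention (a step name plus a list of argument lists) come from the snippet above.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping step names to functions (not bcbio's own).
STEPS: Dict[str, Callable[..., List[list]]] = {}


def step(fn):
    """Register a function under its own name so it can be looked up by string."""
    STEPS[fn.__name__] = fn
    return fn


def run_parallel(fn_name, items):
    """Serial stand-in: call the named step once per argument list and
    concatenate the lists each call returns."""
    results = []
    for args in items:
        results.extend(STEPS[fn_name](*args))
    return results


@step
def process_alignment(data):
    # Toy per-sample step: one sample in, one wrapped sample out.
    return [[dict(data, aligned=True)]]


@step
def combine_sample_regions(*samples):
    # Toy whole-batch step: sees every sample at once.
    count = len(samples)
    return [[dict(data, batch_size=count)] for (data,) in samples]


def run(samples):
    # Each step consumes "samples" and returns the transformed structure.
    samples = run_parallel("process_alignment", samples)
    samples = run_parallel("combine_sample_regions", [samples])
    return samples


result = run([[{"description": "s1"}], [{"description": "s2"}]])
```

Passing samples fans a step out over individual samples, while passing [samples] makes a single call that receives the whole batch. That distinction is what a workflow description would have to express as per-sample versus gather steps.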
#5 Updated by Peter Amstutz about 5 years ago
Bcbio "pipelines" are just Python methods called run() that call each step in sequence. The work of each step is recorded in a structure called "samples", which each step transforms. Distinct steps of bcbio can also be invoked individually on the command line:
bcbio_nextgen.py runfn step_name config_file
(where "config_file" is a subset of the "samples" structure describing one work task.)
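As a rough illustration of how an external runner might drive individual steps, the sketch below only builds the argument vector for the command form quoted above; it does not execute anything. The helper name runfn_command and the example file names are hypothetical.

```python
import shlex
from typing import List


def runfn_command(step_name: str, config_file: str) -> List[str]:
    """Build the argv for running one bcbio step on one work task.

    Mirrors the form 'bcbio_nextgen.py runfn step_name config_file';
    no other flags are assumed.
    """
    return ["bcbio_nextgen.py", "runfn", step_name, config_file]


# Hypothetical per-task config files, one per work task.
task_files = ["task_0.cfg", "task_1.cfg"]
commands = [runfn_command("process_alignment", path) for path in task_files]

for argv in commands:
    # shlex.quote keeps the printed form safe to paste into a shell.
    print(" ".join(shlex.quote(part) for part in argv))
```

Keeping the command as a list rather than a single string means it can be handed straight to a job runner without another round of shell quoting.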
There are two possible approaches:
1) There are hooks woven into bcbio for starting parallel jobs (this is how bcbio uses IPython, for example). I started to implement Arvados support in July 2014, but that project was put on the back burner. This is a fairly substantial amount of work, since not all of the assumptions that bcbio makes line up with the assumptions that Crunch makes.
2) From conversations with Brad, he would prefer to use Common Workflow Language to wrap each step as a "tool" and replace the existing Python-based pipelines with a CWL description of the pipeline. This is more interesting to him than just adding Arvados support, since it potentially means bcbio could run on more platforms.
I think this is also interesting to us, since it represents a significant use case to motivate CWL development.
- Write tool description wrappers for several bcbio tools (enough to be able to port one of the small pipelines such as StandardPipeline)
- Need to add the features missing from the CWL tool description language that are necessary to support bcbio. In particular, need to add support for configuration file templates.
- Write the actual CWL pipeline and validate with reference implementation (#4783)
- The draft CWL reference implementation doesn't exist yet (although prototypes are floating around); we need to push that forward
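For a sense of what a tool description wrapper for one step could look like, here is a hedged sketch that builds one as a plain Python dict and serializes it as JSON. The field names follow the CommandLineTool layout of the later published CWL v1.0 specification, which postdates this discussion, so treat the exact keys as illustrative rather than as what the draft language supports. The helper make_tool is hypothetical, and outputs are left empty because nothing above says what a step produces.

```python
import json


def make_tool(step_name: str) -> dict:
    """Describe 'bcbio_nextgen.py runfn <step_name> <config_file>' as a tool.

    Keys follow CWL v1.0 CommandLineTool conventions (an assumption; the
    draft language under discussion may differ).
    """
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": ["bcbio_nextgen.py", "runfn", step_name],
        "inputs": {
            # The per-task subset of the "samples" structure.
            "config_file": {
                "type": "File",
                "inputBinding": {"position": 1},
            }
        },
        # Left empty: the outputs of each step are not specified here.
        "outputs": [],
    }


tool = make_tool("process_alignment")
document = json.dumps(tool, indent=2)
print(document)
```

Generating the wrappers from a list of step names, rather than writing each by hand, would keep them uniform while the description language is still changing.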
At this point it should be possible to turn over the remaining work of porting bcbio to CWL to the bcbio community, but there's more work to be done to actually implement CWL in Arvados:
- Implement Arvados CWL pipeline runner and make sure bcbio works on it. (#4035) (#4685)
- Will run as a single Arvados job that creates a bunch of tasks; we will not be able to get fine-grained task reuse or provenance until Crunch v2.
- Workbench features for running CWL pipelines, generating user interface for CWL pipelines. (#4605)