
Peter Amstutz, 09/22/2014 01:30 PM

h1. Everything should be a job

h2. Problem

Currently we have tasks, jobs, and pipelines. While this corresponds to a common pattern for building bioinformatics analyses, in practice we are finding that this design is overly rigid and has several unintended consequences:

# arv-run-pipeline-instance currently gets special treatment. Pipeline instances are run automatically by the crunch-dispatch system using the resources of the dispatcher node, and creating pipelines from pipeline templates is the only way to run jobs using Workbench. However, our users already rely on third-party pipeline frameworks such as bcbio-nextgen, rmake, snakemake, and Nextflow; rewriting these pipelines to use arv-run-pipeline-instance is time-consuming (due to the need to analyze the original pipeline) or infeasible (because they make use of features that don't map cleanly onto existing Arvados features).
# There is no good way to work with batches or with pipelines of pipelines. If we have a pipeline that processes a single sample and we want to run it 100 times, we need to create 100 pipelines by hand or with a script.
# Currently, we can create jobs that either (1) submit stages as subtasks or (2) submit stages as additional jobs.
## In the first approach, job reuse features are not available with tasks, and all subtasks must be able to run out of the same Docker image. There is also (by design) reduced visibility into the inner workings of tasks as compared to jobs.
## In the second approach, the controller job currently ties up a whole node, even though it is mostly idle. Additionally (and unlike tasks and pipelines) we do not track which job submissions were made by which other jobs, so provenance information is lost.

h2. Proposed solution

# Improve job scheduling so that more than one job can run on a node, with each job allocated a single core (possibly even a fraction of a core).
# Remove arv-run-pipeline-instance from its privileged position and run it as a job in a container just like everything else.
# Deprecate tasks and prefer submitting jobs instead, which enables work reuse.
# Track which job submissions were made by the controlling job (add a spawned_by_job_uuid field to the jobs object), possibly using the API token associated with the job.
# Workbench summary pages such as the dashboard should display only jobs that were submitted directly by a user (spawned_by_job_uuid is null). Unify the display of pipelines and jobs so that a pipeline is just a job that creates other jobs, and permit drilling down through the process tree.
# Improve SDK APIs to make it easy to spawn a bunch of jobs as "futures" and then wait for them all to finish.
# Possibly add a field to indicate if a job is "task-like" or "pipeline-like" 
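
To make the provenance and drill-down proposals above concrete, here is a minimal sketch of how a spawned_by_job_uuid field could drive both the dashboard filter and the process tree. The @Job@ class, its other fields, and the helper names are hypothetical stand-ins for illustration, not the actual Arvados job schema or API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    """Hypothetical, simplified stand-in for a job record."""
    uuid: str
    script: str
    # Proposed field: UUID of the job that submitted this one,
    # or None when a user submitted it directly.
    spawned_by_job_uuid: Optional[str] = None


def top_level_jobs(jobs: List[Job]) -> List[Job]:
    """Jobs a dashboard would list: those submitted directly by a user."""
    return [j for j in jobs if j.spawned_by_job_uuid is None]


def children_by_parent(jobs: List[Job]) -> Dict[str, List[Job]]:
    """Index jobs by the UUID of the job that spawned them."""
    index: Dict[str, List[Job]] = {}
    for job in jobs:
        if job.spawned_by_job_uuid is not None:
            index.setdefault(job.spawned_by_job_uuid, []).append(job)
    return index


def process_tree(root: Job, index: Dict[str, List[Job]], depth: int = 0) -> List[str]:
    """Return indented lines describing root and everything it spawned."""
    lines = ["  " * depth + f"{root.script} ({root.uuid})"]
    for child in index.get(root.uuid, []):
        lines.extend(process_tree(child, index, depth + 1))
    return lines


# Made-up example data: one user-submitted controller job and its descendants.
jobs = [
    Job("job-1", "run-pipeline"),
    Job("job-2", "align", spawned_by_job_uuid="job-1"),
    Job("job-3", "call-variants", spawned_by_job_uuid="job-1"),
    Job("job-4", "align-chunk", spawned_by_job_uuid="job-2"),
]
index = children_by_parent(jobs)
tree_lines = process_tree(jobs[0], index)
print("\n".join(tree_lines))
```

Under this model a "pipeline" needs no separate object type: it is whichever job sits at the root of a tree, and the dashboard only has to filter on a null parent.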

Another benefit: this supports the proposed v2 Python SDK by letting users orchestrate pipelines in which the Python code is the same whether it runs locally, runs locally and submits jobs, or runs as a crunch job itself and submits jobs.
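
One way to picture "the same Python everywhere" is to write the orchestration against the standard @concurrent.futures.Executor@ interface and swap the executor. The sketch below is only an illustration under that assumption: @JobSubmittingExecutor@ and @process_sample@ are hypothetical names, no v2 SDK API is implied, and the "cluster" executor merely delegates to a thread pool so that the example stays runnable.

```python
import concurrent.futures as cf
from typing import Callable, List


def process_sample(sample: str) -> str:
    """Placeholder for one pipeline stage; a real stage would do analysis."""
    return f"processed:{sample}"


class JobSubmittingExecutor(cf.Executor):
    """Hypothetical executor that would submit each call as a cluster job.

    To keep the sketch runnable it delegates to a thread pool; a real
    implementation would create a job and resolve the future on completion.
    """

    def __init__(self, max_workers: int = 4) -> None:
        self._pool = cf.ThreadPoolExecutor(max_workers=max_workers)
        self.submitted: List[str] = []

    def submit(self, fn: Callable, *args, **kwargs) -> cf.Future:
        self.submitted.append(fn.__name__)  # record what would become a job
        return self._pool.submit(fn, *args, **kwargs)

    def shutdown(self, wait: bool = True, **kwargs) -> None:
        self._pool.shutdown(wait=wait)


def run_batch(executor: cf.Executor, samples: List[str]) -> List[str]:
    """Spawn one unit of work per sample, then wait for all of them."""
    futures = [executor.submit(process_sample, s) for s in samples]
    cf.wait(futures)
    return [f.result() for f in futures]


samples = ["s1", "s2", "s3"]

# Same orchestration code, run purely locally...
with cf.ThreadPoolExecutor(max_workers=2) as local:
    local_results = run_batch(local, samples)

# ...or through the (hypothetical) job-submitting executor.
with JobSubmittingExecutor() as cluster:
    cluster_results = run_batch(cluster, samples)
```

Because @run_batch@ only depends on the executor interface, it also covers the batch case from the problem list: running a single-sample stage 100 times is one list comprehension rather than 100 hand-made pipelines.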

h2. Related ideas

Currently, porting tools like bcbio or rmake still requires modifying the tool so that it schedules jobs on the cluster instead of running them locally. We could use LD_PRELOAD to intercept a whitelist of exec() calls and redirect them to a script that causes the tool to run on the cluster.
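
The interception itself would live in a small native shim loaded through LD_PRELOAD, but the redirect decision it makes can be sketched in a few lines. The following is only an illustration of that policy: the whitelist contents, the wrapper path, and the function name are all assumptions, not existing Arvados components.

```python
import os
from typing import List, Sequence

# Hypothetical whitelist of programs worth sending to the cluster.
CLUSTER_WHITELIST = frozenset({"bwa", "samtools", "gatk"})

# Hypothetical wrapper script that would submit its arguments as a job.
SUBMIT_WRAPPER = "/usr/local/bin/submit-as-job"


def redirect_exec(argv: Sequence[str]) -> List[str]:
    """Return the command line that should actually be executed.

    Whitelisted programs are prefixed with the submit wrapper; everything
    else is returned unchanged so that it runs locally.
    """
    if not argv:
        raise ValueError("argv must contain at least the program name")
    program = os.path.basename(argv[0])
    if program in CLUSTER_WHITELIST:
        return [SUBMIT_WRAPPER, *argv]
    return list(argv)


# A whitelisted tool is redirected; an ordinary command is left alone.
redirected = redirect_exec(["/usr/bin/bwa", "mem", "ref.fa", "reads.fq"])
untouched = redirect_exec(["ls", "-l"])
```

Matching on the basename keeps the whitelist independent of where a tool is installed, at the cost of also matching unrelated programs that happen to share a name.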