h1. Everything is a job

h2. Problem

Currently we have tasks, jobs, and pipelines. While this corresponds to a common pattern for building bioinformatics analyses, in practice we are finding that this design is overly rigid, with several unintended consequences:

# arv-run-pipeline-instance is currently a special, privileged pipeline runner. However, there are potentially many other pipeline runners we would like to support, such as bcbio-nextgen, rmake, snakemake, Nextflow, etc., which should be usable by regular users and so cannot be privileged processes.
# We need to work with batches and pipelines of pipelines. If we have a pipeline that processes a single sample and we want to run it 100 times, currently we have to create 100 pipeline instances by hand, with a script that runs outside of the system, or using a separate controlling job.
# Such a controlling job either (a) submits its stages as subtasks, or (b) submits its stages as additional jobs.
## The problem with (a) is that job reuse features are not available for tasks, and all subtasks must be able to run out of the same docker image.
## The problem with (b) is that the controller job ties up a whole node even though it is generally idle, and we do not track the process tree (which job submissions were made by which other jobs).

h2. Proposed solution

# Improve job scheduling so that more than one job can run on a node, with jobs allocated to a single core (possibly even fractions of a core).
# Remove arv-run-pipeline-instance from its privileged position and run it as a job in a container just like everything else.
# Fix crunch-dispatch so that a pipeline runner job only takes up a single slot, and other jobs or tasks can be scheduled on the same node.
# Deprecate tasks; prefer to submit jobs instead (this makes reuse available for individual stages).
# Use the API token associated with the job to track which job submissions were made by the controlling job (add a spawned_by_job_uuid field to the jobs object). A sketch of what this could look like appears at the end of this page.
# Unify the display of jobs and pipelines so that a pipeline is just a job that creates other jobs.

Another benefit: this supports the proposed v2 Python SDK by enabling users to orchestrate pipelines where "python program.py" behaves the same whether it runs locally, runs locally and submits jobs, or runs as a crunch job itself and submits jobs (see the sketch at the end of this page).

h2. Related ideas

Currently, porting tools like bcbio or rmake still requires that the tool be modified so that it schedules jobs on the cluster instead of running them locally. We could use LD_PRELOAD to intercept a whitelist of exec() calls and redirect them to a script that causes the tool to run on the cluster instead.
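
The redirect target mentioned above could be little more than the following sketch. This is only an illustration: the shim that intercepts exec() is not shown, the script name is invented, and the job spec is an assumption about how an intercepted command line might be turned into a job.

<pre><code class="python">
#!/usr/bin/env python
# run-on-cluster.py -- hypothetical redirect target for the LD_PRELOAD idea.
# A small shim library (not shown) would intercept whitelisted exec() calls
# and re-exec this script with the original command line, so a tool such as
# bcbio or rmake transparently runs its subcommands as cluster jobs.

import sys

import arvados


def main(argv):
    command = argv[1:]
    if not command:
        sys.exit('usage: run-on-cluster.py COMMAND [ARG...]')

    # Submit the intercepted command line as a job.  The job spec here is an
    # assumption for the sake of the sketch; "run-command" stands in for a
    # generic "run this command inside a job" crunch script.
    job = arvados.api('v1').jobs().create(body={'job': {
        'script': 'run-command',
        'script_version': 'master',
        'repository': 'arvados',
        'script_parameters': {'command': command},
    }}).execute()

    # A real shim would also wait for the job to finish and propagate its
    # exit code and output back to the calling tool.
    print('submitted %s as job %s' % (command[0], job['uuid']))


if __name__ == '__main__':
    main(sys.argv)
</code></pre>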
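
To make the "pipeline is just a job" idea from the proposed solution concrete, here is a minimal sketch of what "python program.py" could look like. The v2 Python SDK is still a proposal, so the sketch drives the current jobs API directly; the repository name and the local/cluster switch are made up for illustration.

<pre><code class="python">
#!/usr/bin/env python
# program.py -- sketch of a pipeline expressed as an ordinary controlling job.

import os
import subprocess

import arvados


def run_stage(script, params):
    """Run one pipeline stage, either locally or by submitting a job."""
    if os.environ.get('RUN_LOCALLY'):
        # Local mode: execute the stage script in this process tree.
        args = ['%s=%s' % (k, v) for k, v in params.items()]
        subprocess.check_call(['python', script] + args)
        return None
    # Cluster mode: submit the stage as a job.  Because the request is made
    # with the API token of the controlling job, the API server could record
    # the proposed spawned_by_job_uuid field on the child job automatically.
    job = arvados.api('v1').jobs().create(body={'job': {
        'script': script,
        'script_version': 'master',
        'repository': 'example/pipeline',   # hypothetical repository name
        'script_parameters': params,
    }}).execute()
    return job['uuid']


# The "pipeline" is nothing more than ordinary control flow in the program:
for sample in ['sample1', 'sample2', 'sample3']:
    run_stage('align.py', {'input': sample})
</code></pre>

The same program works as a quick local test, as a submit-from-workstation runner, or as a crunch job that spawns child jobs, which is the benefit described above.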
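
Similarly, here is a sketch of how the proposed spawned_by_job_uuid field might be used from a client's point of view once the API server records which job made each submission. The field and the job UUID below are placeholders, not part of the current schema.

<pre><code class="python">
import arvados

api = arvados.api('v1')

# A controlling job submits work with its own job-scoped API token, so the
# API server knows which job made each submission and could fill in the
# proposed spawned_by_job_uuid field on the child jobs automatically.
#
# Reconstructing the process tree would then be a simple filtered list.
# Note: spawned_by_job_uuid is a proposed field, and the job UUID below is
# just a placeholder.
children = api.jobs().list(filters=[
    ['spawned_by_job_uuid', '=', 'zzzzz-8i9sb-0123456789abcde'],
]).execute()

for child in children['items']:
    print('%s %s' % (child['uuid'], child['script']))
</code></pre>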