Pipelines as jobs

Peter Amstutz, 09/18/2014 03:56 PM
h1. Everything is a job

h2. Problem

Currently we have tasks, jobs, and pipelines. While this corresponds to a common pattern for building bioinformatics analyses, in practice we are finding that this design is overly rigid, with several unintended consequences:

# arv-run-pipeline-instance is currently a special, privileged pipeline runner. However, there are potentially many other pipeline frameworks we would like to support, such as bcbio-nextgen, rmake, snakemake, and Nextflow; these should be usable by regular users and so cannot be privileged processes.
# We need to work with batches and with pipelines of pipelines. If we have a pipeline that processes a single sample and we want to run it 100 times, we currently have to create 100 pipelines by hand, with a script that runs outside the system, or with a separate job.
# Currently, we can create jobs that either (1) submit stages as subtasks or (2) submit stages as additional jobs.
## In the first approach, job reuse features are not available for tasks, and all subtasks must be able to run from the same Docker image.
## In the second approach, the controller job ties up a whole node even though it is mostly idle, and we do not track the process tree (that is, which job submissions were made by which other jobs).

h2. Proposed solution

# Improve job scheduling so that we can run more than one job on a node, with jobs allocated to a single core (or possibly even a fraction of a core).
# Remove arv-run-pipeline-instance from its privileged position and run it as a job in a container, just like everything else.
# Deprecate tasks; prefer submitting jobs instead (this enables work reuse).
# Use the API token associated with the job to track which job submissions were made by the controlling job (add a spawned_by_job_uuid field to the job object). The top-level UI displays only jobs submitted directly by a user (those with a null spawned_by_job_uuid). Unify the display of pipelines and jobs so that a pipeline is just a job that creates other jobs (see the sketch after this list).
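
A minimal sketch of the parent-tracking idea in item 4, written against the Python SDK's jobs API. The spawned_by_job_uuid field is the proposal above, not an existing attribute, and the script name, repository, and parameters are hypothetical; the exact create/list body shapes are assumptions:

<pre><code class="python">
import arvados

# Inside a controller job, the SDK client authenticates with the job's
# own API token (ARVADOS_API_TOKEN), so the API server can attribute
# this submission to the controller and set spawned_by_job_uuid
# (proposed field) on the child job.
api = arvados.api('v1')

# Submit one stage as an ordinary job (hypothetical script/parameters).
child = api.jobs().create(body={
    'script': 'align-sample',
    'script_version': 'master',
    'repository': 'example/pipeline',
    'script_parameters': {'sample': 'sample1'},
}).execute()

# Top-level view: only jobs a user submitted directly, i.e. jobs with
# no controlling parent (spawned_by_job_uuid is null).
top_level = api.jobs().list(
    filters=[['spawned_by_job_uuid', '=', None]],
).execute()
</code></pre>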

Another benefit: this supports the proposed v2 Python SDK by letting users orchestrate pipelines where "python program.py" behaves the same whether it runs locally, runs locally while submitting jobs, or runs as a crunch job itself and submits jobs.
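
A sketch of what that could look like. The run() helper is hypothetical (the v2 SDK is only proposed), and the job body fields are assumptions; the idea is simply that the presence of cluster credentials decides whether a stage executes locally or is submitted as a job:

<pre><code class="python">
import os
import subprocess


def run(command, **params):
    """Run one pipeline stage: locally, or as a crunch job."""
    if os.environ.get('ARVADOS_API_TOKEN'):
        # Credentials present (e.g. we are ourselves a crunch job):
        # submit the stage as a job instead of executing it here.
        import arvados
        return arvados.api('v1').jobs().create(body={
            'script': command[0],           # hypothetical stage script
            'script_version': 'master',
            'script_parameters': params,
        }).execute()
    # No credentials: run the stage as a local process, passing the
    # parameters as key=value arguments.
    args = ['%s=%s' % item for item in params.items()]
    return subprocess.check_call(command + args)


# program.py is identical in all three modes.
for sample in ['sample1', 'sample2', 'sample3']:
    run(['align-sample'], sample=sample)
</code></pre>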

h2. Related ideas

Currently, porting tools like bcbio or rmake still requires modifying the tool so that it schedules jobs on the cluster instead of running them locally. We could instead use LD_PRELOAD to intercept a whitelist of exec() calls and redirect them to a script that causes the tool to run on the cluster (see the sketch below).
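
A sketch of what the redirect target might look like; the LD_PRELOAD shim itself (a small C library wrapping execve()) is not shown. Everything here is an assumption: the wrapper name is made up, and "run-command" stands in for any crunch script that can execute an arbitrary command line:

<pre><code class="python">
#!/usr/bin/env python
# exec-on-cluster (hypothetical): the LD_PRELOAD shim rewrites a
# whitelisted exec() call into `exec-on-cluster <original argv...>`,
# and this wrapper submits the original command line as a crunch job
# instead of running it locally.
import sys

import arvados


def main():
    argv = sys.argv[1:]  # the command the tool tried to exec()
    api = arvados.api('v1')
    job = api.jobs().create(body={
        'script': 'run-command',   # assumed generic command runner
        'script_version': 'master',
        'repository': 'arvados',   # assumed home of run-command
        'script_parameters': {'command': argv},
    }).execute()
    # Report the submitted job so the shim/tool can wait on it.
    print(job['uuid'])


if __name__ == '__main__':
    main()
</code></pre>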