Tasks as jobs » History » Version 1

Peter Amstutz, 10/03/2014 01:59 AM

h1. Tasks as jobs

In #3807 we want to be able to reuse tasks.  However, adding reuse logic to the existing job_tasks table has some drawbacks:

* Redundant reuse logic between jobs and tasks, with a high potential for inconsistent behavior between them.
* Tasks will need to either:
*# Add new columns that duplicate many columns from the jobs table (e.g. script, script_version, docker_image), or
*# Compare the fields of the parent jobs when deciding on reuse, which likely reduces the utility of task reuse (because differences in job fields such as script_parameters will prevent jobs with tasks in common from sharing tasks).

h2. Proposed concept: task-like jobs

* Drop the job_tasks table (migrating existing tasks to jobs if that makes sense).  Provide a backwards-compatible job_tasks endpoint to support existing crunch scripts; this endpoint creates "task-like" jobs.
* Add a "parent_job_uuid" column to jobs.
** If this column is null, the job behaves as a normal job does now and is run by crunch-dispatch.
** If this column is not null, the job is managed by crunch-job.  Instead of running tasks, crunch-job runs task-like jobs whose parent_job_uuid is the uuid of the primary job.
* Job reuse logic remains mostly the same (but see the outstanding questions below).
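
The routing rule in the list above can be sketched in a few lines of Python.  This is only an illustration: the @Job@ class is a hypothetical in-memory stand-in for a row of the jobs table, and @manager_for@ and @task_like_jobs@ are made-up helper names, not existing crunch-dispatch or crunch-job code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    """Hypothetical stand-in for a row of the jobs table."""
    uuid: str
    script: str
    parent_job_uuid: Optional[str] = None  # the proposed new column


def manager_for(job: Job) -> str:
    """Return the component that would be responsible for running the job."""
    # Null parent: an ordinary job, picked up by crunch-dispatch as it is now.
    if job.parent_job_uuid is None:
        return "crunch-dispatch"
    # Non-null parent: a task-like job, run by the crunch-job of its parent.
    return "crunch-job"


def task_like_jobs(jobs: List[Job], parent_uuid: str) -> List[Job]:
    """Select the task-like jobs that belong to one primary job."""
    return [j for j in jobs if j.parent_job_uuid == parent_uuid]


parent = Job(uuid="job-parent", script="pipeline")
child = Job(uuid="job-child-1", script="pipeline", parent_job_uuid=parent.uuid)

print(manager_for(parent))  # crunch-dispatch
print(manager_for(child))   # crunch-job
print([j.uuid for j in task_like_jobs([parent, child], parent.uuid)])  # ['job-child-1']
```

Under this sketch, crunch-job would query for jobs by parent_job_uuid where it currently queries job_tasks.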

h2. Outstanding questions

Tasks often access the parent job's script_parameters.  This complicates reusing tasks (whether as job_tasks or task-like jobs) because the task may depend on job parameters such as the reference genome or GATK version.  One possible solution is to capture the parent job's script_parameters in the task's script_parameters (for backwards compatibility, when using the job_tasks endpoint) while requiring, going forward, that parameters such as the reference genome be passed through to the task explicitly.
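
To make the trade-off concrete, here is a rough Python sketch of the two behaviours.  Everything in it is an assumption made for illustration: the helper names, the rule that task values win on conflict, and the idea of comparing jobs through a digest of script, script_version and script_parameters.  The real reuse logic may compare fields differently.

```python
import hashlib
import json
from typing import Dict, Iterable


def capture_parent_parameters(parent_params: Dict, task_params: Dict) -> Dict:
    """Legacy path (job_tasks endpoint): copy all parent parameters into the task.

    Task-specific values win on conflict; that precedence is an assumption.
    """
    merged = dict(parent_params)
    merged.update(task_params)
    return merged


def pass_through(parent_params: Dict, task_params: Dict, names: Iterable[str]) -> Dict:
    """Going forward: copy only the parent parameters the task names explicitly."""
    merged = {name: parent_params[name] for name in names}
    merged.update(task_params)
    return merged


def reuse_key(script: str, script_version: str, script_parameters: Dict) -> str:
    """Hypothetical reuse key: a digest over the fields compared for reuse."""
    canonical = json.dumps(
        {
            "script": script,
            "script_version": script_version,
            "script_parameters": script_parameters,
        },
        sort_keys=True,  # make the digest independent of key order
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two parent jobs that share a reference genome but differ in an unrelated field.
parent_a = {"reference_genome": "hg19", "input": "sample-A"}
parent_b = {"reference_genome": "hg19", "input": "sample-B"}
task = {"chunk": 1}

legacy_a = capture_parent_parameters(parent_a, task)
legacy_b = capture_parent_parameters(parent_b, task)
print(reuse_key("s", "v1", legacy_a) == reuse_key("s", "v1", legacy_b))  # False

explicit_a = pass_through(parent_a, task, ["reference_genome"])
explicit_b = pass_through(parent_b, task, ["reference_genome"])
print(reuse_key("s", "v1", explicit_a) == reuse_key("s", "v1", explicit_b))  # True
```

In this sketch, capturing every parent parameter keeps old crunch scripts working but makes the two tasks look different, while explicit pass-through lets them share a result.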

(This doesn't stop a job from following parent_job_uuid and reading the parent's script parameters.  However, that raises the broader issue of "pure" vs "impure" jobs: jobs that access the database should be marked "impure", except when they only use a whitelist of "pure" operations such as accessing collections by content hash.)
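
One way to picture the pure/impure distinction is as a check of the operations a job performed against a whitelist.  The operation names below are hypothetical labels invented for this sketch, and how such operations would actually be recorded is left open.

```python
from typing import Iterable

# Hypothetical operation names.  Only "reading a collection by content hash"
# comes from the text above; the label itself is an assumption.
PURE_OPERATIONS = frozenset({"collection_get_by_content_hash"})


def is_pure(operations: Iterable[str]) -> bool:
    """A job is pure if every database operation it performed is whitelisted.

    A job that performed no database operations at all is trivially pure.
    """
    return all(op in PURE_OPERATIONS for op in operations)


print(is_pure(["collection_get_by_content_hash"]))                     # True
print(is_pure(["collection_get_by_content_hash", "job_get_by_uuid"]))  # False
```

Here a job that looks up its parent through parent_job_uuid would come out as impure, because that lookup is not on the whitelist.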