Peter Amstutz, 10/03/2014 01:59 AM
h1. Tasks as jobs

In #3807 we want to be able to reuse tasks. However, adding reuse logic to the existing job_tasks table has some drawbacks:

* Redundant reuse logic between jobs and tasks, with high potential for inconsistent behavior between them.
* Tasks would need to either:
*# Add new columns that duplicate many columns from the jobs table (e.g. script, script_version, docker_image), or
*# Compare the fields of the parent jobs when deciding on reuse, which would likely reduce the utility of task reuse (because differences in job fields such as script_parameters would prevent jobs with tasks in common from sharing them).

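The trade-off between the two options above can be sketched in a few lines of Python. This is only an illustration: the @reuse_key@ helper, the key layout, and the sample records are assumptions made for the example, not the actual reuse logic of the API server.

```python
import hashlib
import json


def reuse_key(fields):
    """Hash a dict of fields deterministically (hypothetical reuse key)."""
    canonical = json.dumps(fields, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def task_key_with_parent(job, task):
    """Second option: the key includes every field of the parent job."""
    return reuse_key({"job": job, "task": task})


def task_key_standalone(job, task):
    """First option: the task carries its own copies of selected job fields."""
    return reuse_key({
        "script": job["script"],
        "script_version": job["script_version"],
        "task": task,
    })


# Two parent jobs (made-up records) that differ only in a parameter
# the shared task never reads.
job_a = {"script": "align", "script_version": "abc123",
         "script_parameters": {"input": "c1", "threads": 4}}
job_b = {"script": "align", "script_version": "abc123",
         "script_parameters": {"input": "c1", "threads": 8}}
task = {"sequence": 0, "parameters": {"input": "c1"}}
```

With these sample records, the parent-based key differs between @job_a@ and @job_b@ even though the task is identical, so the task could not be shared. The standalone key matches, at the cost of duplicating job fields on the task.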
h2. Proposed concept: task-like jobs

* Drop the job_tasks table (migrating existing tasks to jobs if that makes sense). Provide a backwards-compatible job_tasks endpoint to support existing crunch scripts; this endpoint creates "task-like" jobs.
* Add a "parent_job_uuid" column to jobs.
** If this column is null, the job acts like a normal job does now and is run by crunch-dispatch.
** If this column is not null, the job is managed by crunch-job. Instead of running tasks, crunch-job runs task-like jobs whose parent_job_uuid is the primary job.
* Job reuse logic remains mostly the same (but see below).

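The routing rule for the proposed column can be sketched as follows. Jobs are modelled as plain dicts, and the function names and placeholder UUID strings are assumptions for illustration only. Neither crunch-dispatch nor crunch-job exposes these functions.

```python
def runner_for(job):
    """Return the component that would run this job under the proposal."""
    if job.get("parent_job_uuid") is None:
        # Normal job: handled the same way as today.
        return "crunch-dispatch"
    # Task-like job: managed by the crunch-job process of its parent.
    return "crunch-job"


def task_like_jobs(jobs, primary_uuid):
    """Select the task-like jobs that belong to one primary job."""
    return [j for j in jobs if j.get("parent_job_uuid") == primary_uuid]


# Placeholder records; real UUIDs would come from the API server.
jobs = [
    {"uuid": "primary-job", "parent_job_uuid": None},
    {"uuid": "child-1", "parent_job_uuid": "primary-job"},
    {"uuid": "child-2", "parent_job_uuid": "primary-job"},
    {"uuid": "other-job", "parent_job_uuid": None},
]
```

Here @runner_for@ sends @primary-job@ and @other-job@ to crunch-dispatch, while @task_like_jobs(jobs, "primary-job")@ returns the two children that crunch-job would run in place of tasks.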
h2. Outstanding questions

Tasks often access the parent job's script_parameters. This complicates reusing tasks (whether as job_tasks or task-like jobs) because the task may depend on job parameters such as reference genome or GATK version. One possible solution is to capture the parent job's script_parameters in the task's script_parameters (for backwards compatibility, when using the job_tasks endpoint), while requiring going forward that parameters like reference genome be passed through explicitly.

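A minimal sketch of the capture idea, assuming the backwards-compatible endpoint builds the task-like job from the parent job record and the task's own parameters. The function name and the @parent_script_parameters@ and @task_parameters@ keys are hypothetical; the proposal does not specify a layout.

```python
import copy


def task_like_job_from_task(parent_job, task_parameters):
    """Build a task-like job that carries a snapshot of the parent's parameters."""
    return {
        "script": parent_job["script"],
        "script_version": parent_job["script_version"],
        "parent_job_uuid": parent_job["uuid"],
        "script_parameters": {
            # Snapshot of the parent's parameters, so that anything the task
            # might read from the parent is visible to reuse comparison.
            "parent_script_parameters": copy.deepcopy(parent_job["script_parameters"]),
            "task_parameters": copy.deepcopy(task_parameters),
        },
    }


# Made-up parent jobs that differ only in the reference genome.
parent_hg19 = {"uuid": "parent-1", "script": "gatk", "script_version": "abc123",
               "script_parameters": {"reference": "hg19"}}
parent_hg38 = {"uuid": "parent-2", "script": "gatk", "script_version": "abc123",
               "script_parameters": {"reference": "hg38"}}

job_hg19 = task_like_job_from_task(parent_hg19, {"chunk": 0})
job_hg38 = task_like_job_from_task(parent_hg38, {"chunk": 0})
```

Because the snapshot is part of the task-like job's own script_parameters, the two jobs above differ and would not be mistaken for each other during reuse. The cost is the same one noted earlier: unrelated differences in the parent's parameters also block reuse, which is why passing only the needed parameters explicitly is the preferred direction going forward.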
(This doesn't stop a job from accessing parent_job_uuid and reading the parent's script parameters. That raises the broader issue of "pure" vs. "impure" jobs: jobs that access the database should be marked "impure", except for a whitelist of "pure" operations such as accessing collections by content hash.)