Tasks as jobs
In #3807 we want to be able to reuse tasks. However, adding reuse logic to the existing job_tasks table has some drawbacks:
- Redundant reuse logic between jobs and tasks. High potential for inconsistent behavior between them.
- Tasks will need to either:
  - Add new columns that duplicate many columns from the jobs table (e.g. script, script_version, docker_image), or
  - Compare the fields of the parent jobs when deciding on reuse, which likely reduces the utility of task reuse (differences in job fields such as script_parameters would prevent jobs with tasks in common from sharing tasks).
Proposed concept: task-like jobs
- Drop the job_tasks table (migrating existing tasks to jobs if that makes sense). Provide a backwards-compatible job_tasks endpoint to support existing crunch scripts; this endpoint creates "task-like" jobs.
- Add a "parent_job_uuid" column to jobs.
  - If this column is null, the job behaves as a normal job does today and is run by crunch-dispatch.
  - If this column is not null, the job is managed by crunch-job. Instead of running tasks, crunch-job runs task-like jobs whose parent_job_uuid is the uuid of the primary job.
- Job reuse logic remains mostly the same (but see below)
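The routing rule on parent_job_uuid can be sketched as follows. This is only an illustration of the proposal, not existing code: the Job class, responsible_runner, task_like_jobs, and the uuid values are hypothetical names introduced here, and the only fields modelled are the ones the rule needs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    uuid: str
    script: str
    # Proposed column: None for an ordinary job, otherwise the uuid of the
    # primary job that this task-like job belongs to.
    parent_job_uuid: Optional[str] = None


def responsible_runner(job: Job) -> str:
    """Name the component that would run this job under the proposal."""
    return "crunch-dispatch" if job.parent_job_uuid is None else "crunch-job"


def task_like_jobs(jobs: List[Job], primary: Job) -> List[Job]:
    """Select the task-like jobs that crunch-job would run for a primary job."""
    return [j for j in jobs if j.parent_job_uuid == primary.uuid]


# Hypothetical example data: one primary job with two task-like jobs.
primary = Job(uuid="job-1", script="align")
children = [
    Job(uuid="job-2", script="align", parent_job_uuid="job-1"),
    Job(uuid="job-3", script="align", parent_job_uuid="job-1"),
]
```

Here the primary job would be picked up by crunch-dispatch, while job-2 and job-3 would be left for crunch-job to run on behalf of job-1.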
Tasks often access the parent job's script_parameters. This complicates reusing tasks (whether as job_tasks or task-like jobs) because the task may depend on job parameters such as the reference genome or GATK version. One possible solution is to capture the parent job's script_parameters in the task's script_parameters (for backwards compatibility, when using the job_tasks endpoint), while requiring going forward that parameters such as the reference genome be passed through explicitly.
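A minimal sketch of the two options, to show why explicit pass-through helps reuse. Everything here is an assumption made for illustration: the reuse_key function stands in for whatever comparison the real job reuse logic performs, the merge order in capture_parent_parameters (task values override parent values) is one possible choice, and the parameter names and values are invented.

```python
import hashlib
import json


def reuse_key(script, script_version, script_parameters):
    """Hypothetical reuse key: a digest over the fields that determine the result."""
    payload = json.dumps([script, script_version, script_parameters], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def capture_parent_parameters(parent_params, task_params):
    """Backwards-compatible path: copy all parent parameters into the task.

    Assumption: task-level values win when both define the same key.
    """
    captured = dict(parent_params)
    captured.update(task_params)
    return captured


# Two parent jobs that differ only in a parameter the task never reads.
parent_a = {"reference_genome": "hg19", "output_name": "run-a"}
parent_b = {"reference_genome": "hg19", "output_name": "run-b"}
task = {"input": "chunk-0"}

# Capturing everything: the irrelevant difference leaks into the task's key.
captured_a = reuse_key("align", "v1", capture_parent_parameters(parent_a, task))
captured_b = reuse_key("align", "v1", capture_parent_parameters(parent_b, task))

# Passing through only what the task depends on: the keys match.
explicit = {"input": "chunk-0", "reference_genome": "hg19"}
explicit_a = reuse_key("align", "v1", explicit)
explicit_b = reuse_key("align", "v1", dict(explicit))
```

With captured parameters the two tasks get different keys and cannot be shared, even though they do the same work. With explicit pass-through the keys are equal, so the second parent could reuse the first parent's task.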
(This doesn't stop a job from accessing parent_job_uuid and reading the parent's script parameters. However, it raises a broader issue of "pure" vs "impure" jobs: jobs that access the database should be marked "impure", except for a whitelist of "pure" operations such as accessing collections by content hash.)
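One way the pure/impure distinction could be checked is sketched below. This is speculative: it assumes that a job's database accesses are recorded as (resource, identifier) pairs, that a content hash looks like 32 hexadecimal digits followed by "+" and a size, and that the whitelist contains only collection lookups by content hash. The function names and the sample identifiers are hypothetical.

```python
import re

# Assumed shape of a content hash: an MD5 digest, "+", and a byte count.
CONTENT_HASH = re.compile(r"^[0-9a-f]{32}\+\d+$")


def is_pure_access(resource: str, identifier: str) -> bool:
    """Whitelist check: only collection lookups by content hash count as pure."""
    return resource == "collections" and CONTENT_HASH.match(identifier) is not None


def is_pure_job(accesses) -> bool:
    """A job is pure if every recorded access is on the whitelist."""
    return all(is_pure_access(resource, ident) for resource, ident in accesses)


# Hypothetical access logs for two jobs.
pure_log = [("collections", "d41d8cd98f00b204e9800998ecf8427e+0")]
impure_log = pure_log + [("jobs", "some-parent-job-uuid")]
```

Under these assumptions, a job that reads its parent's record (as in impure_log) would be marked impure and could be excluded from reuse, while a job that only fetches collections by content hash stays eligible.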