Tasks as jobs

In #3807 we want to be able to reuse tasks. However, adding reuse logic to the existing job_tasks table has some drawbacks:
  • Redundant reuse logic between jobs and tasks. High potential for inconsistent behavior between them.
  • Tasks would need one of the following:
    1. New columns that duplicate many columns from the jobs table (e.g. script, script_version, docker_image).
    2. Reuse logic that compares the fields of the parent jobs, which likely reduces the utility of task reuse (differences in job fields such as script_parameters would prevent jobs with tasks in common from sharing tasks).

Proposed concept: task-like jobs

  • Drop the job_tasks table (migrating existing tasks to jobs if that makes sense). Provide a backwards-compatible job_tasks endpoint to support existing crunch scripts; this endpoint creates "task-like" jobs.
  • Add a "parent_job_uuid" column to jobs.
    • If this column is null, the job behaves like a normal job does now and is run by crunch-dispatch.
    • If this column is not null, the job is managed by crunch-job. Instead of running tasks, crunch-job runs task-like jobs whose parent_job_uuid is the uuid of the primary job.
  • Job reuse logic remains mostly the same (but see below)
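
The routing rule in the list above can be sketched as follows. This is only an illustration under stated assumptions: the Job class and the helper names responsible_runner and task_like_jobs are hypothetical and do not come from the Arvados code; only parent_job_uuid, crunch-dispatch, and crunch-job are taken from the proposal.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    # Hypothetical in-memory stand-in for a row in the jobs table.
    uuid: str
    script: str
    script_version: str
    parent_job_uuid: Optional[str] = None  # the proposed new column


def responsible_runner(job: Job) -> str:
    """Name the component that would run this job under the proposal."""
    if job.parent_job_uuid is None:
        # A normal job: dispatched exactly as jobs are today.
        return "crunch-dispatch"
    # A task-like job: managed by the crunch-job process of its parent.
    return "crunch-job"


def task_like_jobs(jobs: List[Job], parent_uuid: str) -> List[Job]:
    """Select the task-like jobs that belong to one primary job."""
    return [job for job in jobs if job.parent_job_uuid == parent_uuid]
```

Because task-like jobs are ordinary rows in the jobs table, the same reuse comparison could in principle apply to both kinds of job, which is the main point of the proposal.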

Outstanding questions

Tasks often access the parent job's script_parameters. This complicates reusing tasks (whether as job_tasks or as task-like jobs) because a task may depend on job parameters such as the reference genome or the GATK version. One possible solution is to capture the parent job's script_parameters in the task's script_parameters when the job_tasks endpoint is used (for backwards compatibility), while requiring, going forward, that parameters such as the reference genome be passed through explicitly.
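
To make the trade-off concrete, here is a minimal sketch of the two ways a task's script_parameters could be built. Everything in it is an assumption made for illustration: the function names, the rule that a task's own parameters override the parent's, and the reuse key (a SHA-1 over canonical JSON) are hypothetical and are not the actual Arvados reuse logic.

```python
import hashlib
import json
from typing import Any, Dict, Iterable


def reuse_key(script: str, script_version: str, script_parameters: Dict[str, Any]) -> str:
    """Hypothetical reuse key: two jobs with equal keys could share results."""
    canonical = json.dumps(
        {
            "script": script,
            "script_version": script_version,
            "script_parameters": script_parameters,
        },
        sort_keys=True,
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def legacy_task_parameters(parent_params: Dict[str, Any],
                           task_params: Dict[str, Any]) -> Dict[str, Any]:
    """Backwards-compatible path: copy every parent parameter into the task.

    Assumption: on a name collision the task's own value wins.
    """
    merged = dict(parent_params)
    merged.update(task_params)
    return merged


def explicit_task_parameters(parent_params: Dict[str, Any],
                             task_params: Dict[str, Any],
                             passthrough: Iterable[str]) -> Dict[str, Any]:
    """Going-forward path: copy only the parent parameters named in passthrough."""
    selected = {name: parent_params[name] for name in passthrough if name in parent_params}
    selected.update(task_params)
    return selected
```

With two parent jobs that share a reference genome but differ in an unrelated parameter, the legacy path yields different reuse keys for otherwise identical tasks, while the explicit path yields the same key. That is why capturing everything keeps old scripts working but limits reuse.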

(This doesn't stop a job from accessing parent_job_uuid and reading the parent's script parameters. However, it raises the broader issue of "pure" vs "impure" jobs: jobs that access the database should be marked "impure", except for a whitelist of "pure" operations such as accessing collections by content hash.)
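
One way the pure/impure distinction might be expressed is sketched below. It assumes that each database access a job makes is recorded as a (resource, identifier) pair, that the whitelist contains only the one operation named above, and that a content hash looks like 32 hexadecimal digits followed by "+" and a size. The function names and the recording mechanism are hypothetical.

```python
import re
from typing import Iterable, Tuple

# Assumed content hash format: 32 hex digits, "+", then a size in bytes.
CONTENT_HASH = re.compile(r"^[0-9a-f]{32}\+\d+$")


def is_pure_access(resource: str, identifier: str) -> bool:
    """Whitelist check: only collections looked up by content hash are pure."""
    return resource == "collections" and CONTENT_HASH.match(identifier) is not None


def classify_job(accesses: Iterable[Tuple[str, str]]) -> str:
    """Mark a job "impure" as soon as one access falls outside the whitelist.

    A job that made no database accesses at all is classified as "pure".
    """
    if all(is_pure_access(resource, identifier) for resource, identifier in accesses):
        return "pure"
    return "impure"
```

Under this sketch, a task-like job that reads its parent's script_parameters through parent_job_uuid would be classified as impure, so it would not be a candidate for reuse.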

Updated by Peter Amstutz about 10 years ago · 1 revisions