Reusable tasks » History » Version 4
Tom Clegg, 12/10/2014 07:14 AM
{{>toc}}

h1=. Reusable tasks

p>. *"Tom Clegg":mailto:tom@curoverse.com
Last Updated: October 6, 2014*

h2. Overview

h3. Objective

Say jobs A and B, although not identical, have some tasks in common. Job A is complete. Job B is starting now. They use the same script, version, docker image, etc. The only difference between A and B is that B's input collection has one more file; the rest of the files are identical. The script processes each input file independently, and it is a pure function (re-computing the same files will produce the same result). This means most of Job B's work has already been done. Task re-use will allow Arvados to recognize this condition and re-use the outputs of Job A's tasks instead of recomputing them.

Task re-use will not attempt to detect equivalence conditions like differently-encoded collection manifests with identical data, differing git commits with identical trees, and differing docker images with functionally equivalent content.

The intended audience for this document is software engineers.

h3. Background

The arvados.v1.jobs.create API offers a find_or_create feature which searches for an existing job which meets criteria specified by the client (e.g., same script, compatible script_version) and additional criteria (e.g., did not fail, is not marked impure/nondeterministic, does not disagree with other jobs passing the same criteria about what the correct output is).

* http://doc.arvados.org/api/methods/jobs.html#create

h3. Alternatives

Always recompute each task (i.e., leave existing behavior).

bq. This makes desirable use cases prohibitively expensive.

Use smaller jobs, and more jobs per pipeline.

bq. We could make the dynamic-structure capabilities of crunch jobs available at the pipeline level, and de-emphasize or stop using the features that encourage long-running jobs. Disadvantages include:
* The process of running a pipeline is not done in a controlled environment. This effectively reduces the utility of reproducibility and provenance features.
* Pipelines are currently encoded as JSON which is awkward to use as a DSL.

h3. Tradeoffs

_TODO_

h3. High Level Design

Before executing a job_task that qualifies for re-use, crunch-job uses the API to discover existing job_tasks that are functionally identical, are marked as "pure", and have already finished. If any are found, crunch-job copies the output of one of them into the new job_task instead of executing the task.

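The flow described above can be sketched in a few lines. This is only an illustration under stated assumptions, not crunch-job's implementation: the @Task@ class, its @state@ field with the value "done", and the @find_candidates@ and @execute@ callables are hypothetical stand-ins for the job_task record, the API lookup, and the actual task runner.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Task:
    # Hypothetical stand-in for a job_task record. Only is_pure and output
    # are named in this design; uuid and state are assumptions.
    uuid: str
    is_pure: bool = False
    state: str = "todo"
    output: Optional[str] = None


def run_or_reuse(
    task: Task,
    find_candidates: Callable[[Task], Iterable[Task]],
    execute: Callable[[Task], None],
) -> str:
    """Copy the output of a finished pure task if one exists, else execute."""
    if task.is_pure:
        for other in find_candidates(task):
            # Only finished, pure tasks with an output qualify for reuse.
            if other.is_pure and other.state == "done" and other.output is not None:
                task.output = other.output
                task.state = "done"
                return "reused"
    # No reusable task was found (or the task is not pure): run it normally.
    execute(task)
    return "executed"
```

The lookup runs immediately before execution, and a task that is not marked pure never triggers a lookup at all.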
h2. Specifics

h3. Detailed Design

The JobTask schema has a new boolean flag @is_pure@ (not null, default @false@).

Just before starting a task having @is_pure==true@, crunch-job does an API query to look up other tasks with @is_pure==true@ and identical inputs, parameters, script_version, etc.
* Some attributes like script and script_version are currently stored in the job record, not the job_task record. This will make the lookup interesting, in the absence of a generic "join" API.
* This should be done just before executing the task, rather than upon noticing the task has been queued. This increases the chance of finding duplicates when jobs overlap. (Otherwise, two identical jobs that run at nearly the same time will both find no reusable tasks, both queue the same set of tasks, and both execute all of them.)

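Because script and script_version live on the job record, a client without a join API has to combine the two record types itself. The following is a minimal sketch of that client-side comparison, assuming records are plain dictionaries. The field names @job_uuid@, @docker_image@ and @finished@ are assumptions made for illustration; @script@, @script_version@, @parameters@ and @is_pure@ come from the text above.

```python
import json


def reuse_key(task: dict, job: dict) -> tuple:
    """Build a comparable key from job-level and task-level attributes."""
    return (
        job.get("script"),
        job.get("script_version"),
        job.get("docker_image"),  # assumed field name
        # Serialize with sorted keys so dict ordering does not matter.
        json.dumps(task.get("parameters", {}), sort_keys=True),
    )


def find_reusable(new_task: dict, new_job: dict, tasks: list, jobs_by_uuid: dict) -> list:
    """Return finished pure tasks whose key matches the new task's key."""
    wanted = reuse_key(new_task, new_job)
    matches = []
    for candidate in tasks:
        if not candidate.get("is_pure") or not candidate.get("finished"):
            continue
        # "Join" by hand: fetch the job that owns the candidate task.
        job = jobs_by_uuid.get(candidate.get("job_uuid"))
        if job is not None and reuse_key(candidate, job) == wanted:
            matches.append(candidate)
    return matches
```

A candidate whose job is missing from @jobs_by_uuid@ (for example, because it is not readable) is skipped rather than treated as a match.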
Job tasks have one especially noteworthy side effect: queueing additional tasks. In order to reuse tasks safely without races, we need additional constraints:
* Tasks with @is_pure==true@ cannot queue additional tasks, *and* @is_pure@ cannot change from @false@ to @true@.
* Tasks do not qualify for reuse until they have completed[1]. When reusing a task, copy (and reset to "todo" state) each task whose @created_by_job_task_uuid@ attribute references the task being reused.

fn1. At least in the short term, this constraint is a good way to limit the complexity of implementation without sacrificing too much of the user benefit.

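The second constraint, copying the tasks that a reused task had queued, might look roughly like this. It is a sketch under assumptions: records are plain dictionaries, and the @uuid@, @job_uuid@, @state@ and @output@ field names and the @new_uuid@ callable are hypothetical. Only @created_by_job_task_uuid@ and the "todo" state come from the text above.

```python
import copy
from typing import Callable


def copy_children(
    reused_uuid: str,
    new_parent_uuid: str,
    new_job_uuid: str,
    all_tasks: list,
    new_uuid: Callable[[], str],
) -> list:
    """Copy every task queued by the reused task, reset to the "todo" state."""
    copies = []
    for task in all_tasks:
        if task.get("created_by_job_task_uuid") != reused_uuid:
            continue
        child = copy.deepcopy(task)  # leave the original record untouched
        child.update(
            uuid=new_uuid(),
            job_uuid=new_job_uuid,
            created_by_job_task_uuid=new_parent_uuid,
            state="todo",
            output=None,  # the copy has not produced anything yet
        )
        copies.append(child)
    return copies
```

Each copied child is itself subject to the same lookup just before it would start, so it may in turn be satisfied by reuse.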
h3. Code Location

@sdk/cli/bin/crunch-job@ will have new task reuse logic.

@services/api/db/migrate@ will have a new migration, which will be reflected in @services/api/db/structure.sql@.
* add :is_pure boolean

@services/api/app/models/job_task.rb@ will
* add :is_pure to the API response
* prohibit any transaction that changes @is_pure@ from @false@ to @true@. (IOW, @is_pure@ can be set to @true@ only at creation time.)

@doc/api/schema/JobTask.html.textile.liquid@ will document the @is_pure@ flag.

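The rule the model has to enforce is small enough to state as a function. The sketch below expresses it in Python purely for illustration; the real check would live in the Rails model, and the names @PurityError@ and @check_is_pure_update@ are hypothetical.

```python
class PurityError(ValueError):
    """Raised when an update tries to turn is_pure on after creation."""


def check_is_pure_update(old_value: bool, new_value: bool) -> None:
    """Reject the one forbidden transition: false -> true."""
    if not old_value and new_value:
        raise PurityError("is_pure can be set to true only at creation time")
```

Under this rule a task created with @is_pure@ set may later have it turned off, but never the reverse.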
h3. Testing Plan

_TODO_

h3. Logging

@crunch-job@ will log the fact that it has copied a task's output attribute (and, if applicable, queued additional tasks) from an existing completed task.

h3. Debugging

_TODO_

h3. Caveats

To be determined.

h3. Security Concerns

The existing permission model can prevent user A's job from reusing completed tasks merely because they were initiated by a different user. In such cases (where user A has no other way of knowing about user B's job or task), this is preferable to exposing to user A the fact that any other user has run the task.

h3. Open Questions and Risks

In the absence of a generic join API, it might be easy enough to implement a subset of full (ha, ha) join functionality to enable queries like @filters=[["job.script_version","=","abc123..."]]@ under some simplifying conditions:
* the current model (job_tasks) has a belongs_to relation called "job". (This could extend easily to a few other relations.)
* anyone with permission to read a job_task also has permission to read the corresponding job. (This seems correct for job_task→job, but not for most other relations.)

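To make the idea concrete, here is a minimal sketch of how a dotted attribute such as @job.script_version@ could be resolved through a single belongs_to relation. It is not a proposal for the API server's implementation: the @RELATIONS@ table, the @job_uuid@ foreign key name, and the in-memory dictionaries are assumptions, and only the @=@ operator is handled.

```python
# Hypothetical map from relation name to the foreign-key attribute on job_task.
RELATIONS = {"job": "job_uuid"}


def task_matches(task: dict, filters: list, jobs_by_uuid: dict) -> bool:
    """Apply [attr, "=", operand] filters, following "job." prefixes."""
    for attr, operator, operand in filters:
        if operator != "=":
            raise ValueError("only '=' is supported in this sketch")
        if "." in attr:
            relation, field = attr.split(".", 1)
            if relation not in RELATIONS:
                raise ValueError("unknown relation: " + relation)
            related = jobs_by_uuid.get(task.get(RELATIONS[relation]))
            if related is None:
                return False  # the related job is missing or unreadable
            value = related.get(field)
        else:
            value = task.get(attr)
        if value != operand:
            return False
    return True
```

With @filters=[["job.script_version","=","abc123..."], ["is_pure","=",True]]@, a task matches only if its own @is_pure@ is true and the job it belongs to has that script_version.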
Should purity be enforced or monitored?
* Each task could be given a token with scopes restricting it to reading the collection hashes in its @parameters@ hash and its own JobTask and Job resources.
* API server could notice when a task with @is_pure=true@ retrieves a collection record keyed by UUID, or any other resource that isn't content-addressed, and turn off @is_pure@ automatically. This would be less disruptive, but can waste resources by going unnoticed.

Will there be a special-purpose API for looking up a reusable task, or a generic join-and-filter API? If neither, crunch-job will have to fetch multiple pages of job_tasks and jobs in order to reject ones with mismatched script, script_version, docker image, etc.

Do we indicate in the job_task record that the output was copied from an existing task? If so, how? (Note that a reference to the existing job_task can become stale due to permission changes.)

What are the appropriate values for a job_task's start/finish timestamp attributes, if the task's outputs were copied from existing tasks?

h3. Work Estimates

_TODO_

h3. Future Work

The database tables could be refactored into @jobs@, @job_tasks@, and @tasks@ where @job_tasks@ establishes a many-to-many relationship.

|_.Table|_.Significance of a row|
|jobs|A user initiated some work (requested an output) using Crunch.|
|job_tasks|A job must run a task in order to generate part of its output.|
|tasks|A unit of work was (or will be, or is being) performed as part of a job.|

This way, jobs could reference existing tasks directly rather than copying data between rows in @job_tasks@. Jobs could share tasks even before the tasks have completed.

A facility (and incentive) could be provided to denote tasks as reusable even by users to whom they are otherwise invisible: "If you can guess exactly what I did, and you have permission to read the inputs, I'll admit I did that work and I'll show you the output."

h3. Revision History

|_.Date |_.Revisions Made |_.Author |_.Reviewed By |
| October 6, 2014 | Initial Draft | Tom Clegg |=. ---- |
| October 15, 2014 | (cont'd) | Tom Clegg |=. ---- |
| December 10, 2014 | (cont'd) | Tom Clegg |=. ---- |