Reusable tasks » History » Version 3
Tom Clegg, 10/15/2014 03:44 PM
{{>toc}}

h1=. Reusable tasks

p>. *"Tom Clegg":mailto:tom@curoverse.com
Last Updated: October 6, 2014*

h2. Overview

h3. Objective

Suppose jobs A and B, although not identical, have some tasks in common. Job A is complete and job B is starting now. They use the same script, version, docker image, etc. The only difference between A and B is that B's input collection has one more file; the rest of the files are identical. The script processes each input file independently, and it is a pure function (re-computing the same files will produce the same result). This means most of job B's work has already been done. Task re-use will allow Arvados to recognize this condition and re-use the outputs of job A's tasks instead of recomputing them.

Task re-use will not attempt to detect equivalence conditions like differently-encoded collection manifests with identical data, differing git commits with identical trees, and differing docker images with functionally equivalent content.

The intended audience for this document is software engineers.

h3. Background

The arvados.v1.jobs.create API offers a find_or_create feature which searches for an existing job which meets criteria specified by the client (e.g., same script, compatible script_version) and additional criteria (e.g., did not fail, is not marked impure/nondeterministic, does not disagree with other jobs passing the same criteria about what the correct output is).

* http://doc.arvados.org/api/methods/jobs.html#create

h3. Alternatives

Always recompute each task (i.e., leave existing behavior).

bq. This makes desirable use cases prohibitively expensive.

Use smaller jobs, and more jobs per pipeline.

bq. We could make the dynamic-structure capabilities of crunch jobs available at the pipeline level, and de-emphasize or stop using the features that encourage long-running jobs. Disadvantages include:
* The process of running a pipeline is not done in a controlled environment. This effectively reduces the utility of reproducibility and provenance features.
* Pipelines are currently encoded as JSON which is awkward to use as a DSL.

h3. Tradeoffs

_TODO_

h3. High Level Design

Before executing a job_task that qualifies for re-use, crunch-job uses the API to discover existing job_tasks that are functionally identical, are marked as "pure", and have already finished. If any are found, crunch-job copies the existing job_tasks' output into the new job_task instead of executing the task.

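As a rough illustration of the flow described above, the following Python sketch models the decision crunch-job would make before running a task. It is not the crunch-job implementation: the @Task@ class, the @state@ values, and the @find_equivalent@ and @execute@ callbacks are all hypothetical stand-ins for the real job_task records and API calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Task:
    """Hypothetical stand-in for a job_task record (not the real schema)."""
    uuid: str
    is_pure: bool
    state: str = "todo"           # assumed states: "todo" or "done"
    output: Optional[str] = None  # e.g. a collection locator


def run_or_reuse(
    task: Task,
    find_equivalent: Callable[[Task], Iterable[Task]],
    execute: Callable[[Task], None],
) -> str:
    """Copy output from a finished, pure, equivalent task, else execute."""
    if task.is_pure:
        for candidate in find_equivalent(task):
            if candidate.is_pure and candidate.state == "done":
                # Reuse: take the existing output instead of running the task.
                task.output = candidate.output
                task.state = "done"
                return "reused"
    # No qualifying match (or the task is not pure): run it as usual.
    execute(task)
    return "executed"
```

Tasks that are not marked pure never reach the lookup, so existing behavior is unchanged for them.
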
h2. Specifics

h3. Detailed Design

The JobTask schema has a new boolean flag @is_pure@ (not null, default @false@).

Just before starting a task having @is_pure==true@, crunch-job does an API query to look up other tasks with @is_pure==true@ and identical inputs, parameters, script_version, etc.
* Some attributes like script and script_version are currently stored in the job record, not the job_task record. This will make the lookup interesting, in the absence of a generic "join" API.

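Because the matching attributes are split between the job and job_task records, one possible approach is a client-side join. The sketch below assumes the candidate records have already been fetched as plain dictionaries; the @docker_image@ and @finished@ keys, and both function names, are hypothetical and only serve the illustration.

```python
import json


def reuse_key(job: dict, task: dict) -> tuple:
    """Combine the job-level and task-level attributes that must match."""
    return (
        job["script"],
        job["script_version"],
        job.get("docker_image"),  # hypothetical field name
        # Serialize with sorted keys so key order does not affect equality.
        json.dumps(task["parameters"], sort_keys=True),
    )


def find_reusable(new_job, new_task, jobs_by_uuid, candidate_tasks):
    """Return finished pure tasks whose job and parameters match new_task."""
    wanted = reuse_key(new_job, new_task)
    matches = []
    for task in candidate_tasks:
        if not (task["is_pure"] and task["finished"]):  # "finished" is assumed
            continue
        # Client-side join: look up the job that owns this candidate task.
        job = jobs_by_uuid.get(task["job_uuid"])
        if job is not None and reuse_key(job, task) == wanted:
            matches.append(task)
    return matches
```

This compares exact attribute values only, consistent with the stated objective of not detecting equivalent-but-different commits, manifests, or images.
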
Job tasks have one especially noteworthy side effect: queueing additional tasks. In order to reuse tasks safely without races, we need additional constraints:
* Tasks with @is_pure==true@ cannot queue additional tasks, *and* @is_pure@ cannot change from @false@ to @true@.
* Tasks do not qualify for reuse until they have completed[1]. When reusing a task, copy (and reset to "todo" state) each task whose @created_by_job_task_uuid@ attribute references the task being reused.

fn1. At least in the short term, this constraint is a good way to limit the complexity of implementation without sacrificing too much of the user benefit.

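The copy-and-reset step in the second constraint could look roughly like the following. This is a sketch under several assumptions: tasks are plain dictionaries, @state@ and @output@ are illustrative field names, the copies are re-parented to the new task, and @new_uuid@ is a hypothetical callable supplying identifiers for the copies.

```python
import copy


def copy_child_tasks(reused_uuid, new_task_uuid, new_job_uuid, known_tasks, new_uuid):
    """Clone each task queued by the reused task, reset to the "todo" state."""
    children = []
    for task in known_tasks:
        if task.get("created_by_job_task_uuid") != reused_uuid:
            continue
        # Deep copy so the original (completed) task record is left untouched.
        child = copy.deepcopy(task)
        child.update(
            uuid=new_uuid(),
            job_uuid=new_job_uuid,
            created_by_job_task_uuid=new_task_uuid,  # assumed re-parenting
            state="todo",
            output=None,
        )
        children.append(child)
    return children
```

Each copied child is then subject to the same reuse check when its turn comes, so a chain of completed tasks can be reused one level at a time.
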
h3. Code Location

@sdk/cli/bin/crunch-job@ will have new task reuse logic.

@services/api/db/migrate@ will have a new migration, which will be reflected in @services/api/db/structure.sql@.

@services/api/app/models/job_task.rb@ will add @:is_pure@ to the API response and prohibit any transaction that changes @is_pure@ from @false@ to @true@. In other words, @is_pure@ can be set to @true@ only at creation time.

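The rule enforced by the model can be stated compactly. The following Python sketch only illustrates the rule itself; the real check would live in the Rails model, and the function name and the use of @None@ to mean "record is being created" are assumptions.

```python
from typing import Optional


def check_is_pure_change(old_value: Optional[bool], new_value: bool) -> bool:
    """Validate an is_pure assignment; old_value is None at creation time."""
    if old_value is False and new_value is True:
        # An existing impure task may already have had side effects.
        raise ValueError("is_pure cannot change from false to true")
    return new_value
```

Setting @is_pure@ at creation time and changing it from @true@ to @false@ both pass; only the @false@ to @true@ transition is rejected.
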
@doc/api/schema/JobTask.html.textile.liquid@ will document the @is_pure@ flag.

h3. Testing Plan

_TODO_

h3. Logging

@crunch-job@ will log the fact that it has copied a task's output attribute (and, if applicable, queued additional tasks) from an existing completed task.

h3. Debugging

_TODO_

h3. Caveats

To be determined.

h3. Security Concerns

The existing permission model can prevent user A's job from reusing completed tasks merely because they were initiated by a different user. In such cases (where user A has no other way of knowing about user B's job or task), this is preferable to revealing to user A that another user has run the task.

h3. Open Questions and Risks

Should purity be enforced or monitored?
* Each task could be given a token with scopes restricting it to reading the collection hashes in its @parameters@ hash and its own JobTask and Job resources.

Will there be a special-purpose API for looking up a reusable task, or a generic join-and-filter API? If neither, crunch-job will have to fetch multiple pages of job_tasks and jobs in order to reject ones with mismatched script, script_version, docker image, etc.

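To give a sense of the fallback in the "neither" case, here is a minimal paging sketch. It assumes an offset/limit style listing; @fetch_page@ is a hypothetical callable wrapping whatever list request is used, and nothing here reflects an existing crunch-job function.

```python
from typing import Callable, Iterator, List


def iter_all(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 100,
) -> Iterator[dict]:
    """Yield every item from an offset/limit style listing, page by page."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return  # an empty page means the listing is exhausted
        yield from page
        offset += len(page)
```

The caller would filter the yielded job_tasks against the matching jobs, which costs at least one extra request per page compared with a server-side join.
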
Do we indicate in the job_task record that the output was copied from an existing task? If so, how? (Note that a reference to the existing job_task can become stale due to permission changes.)

What are the appropriate values for a job_task's start/finish timestamp attributes, if the task's outputs were copied from existing tasks?

h3. Work Estimates

_TODO_

h3. Future Work

The database tables could be refactored into @jobs@, @job_tasks@, and @tasks@ where @job_tasks@ establishes a many-to-many relationship.

|_.Table|_.Significance of a row|
|jobs|A user initiated some work (requested an output) using Crunch.|
|job_tasks|A job must run a task in order to generate part of its output.|
|tasks|A unit of work was (or will be, or is being) performed as part of a job.|

This way, jobs could reference existing tasks directly rather than copying data between rows in @job_tasks@. Jobs could share tasks even before the tasks have completed.

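The proposed three-table layout can be tried out with an in-memory SQLite database. The column names below are assumptions chosen for the illustration (only the table names come from the proposal), and the real schema would carry many more attributes.

```python
import sqlite3

# Minimal, hypothetical columns: just enough to show the relationship.
SCHEMA = """
CREATE TABLE jobs (uuid TEXT PRIMARY KEY);
CREATE TABLE tasks (uuid TEXT PRIMARY KEY, output TEXT);
CREATE TABLE job_tasks (
    job_uuid TEXT NOT NULL REFERENCES jobs(uuid),
    task_uuid TEXT NOT NULL REFERENCES tasks(uuid),
    PRIMARY KEY (job_uuid, task_uuid)
);
"""


def build_demo() -> sqlite3.Connection:
    """Create two jobs that share one task through the job_tasks table."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO jobs VALUES (?)", [("job-a",), ("job-b",)])
    db.executemany(
        "INSERT INTO tasks VALUES (?, ?)",
        [("task-1", "out-1"), ("task-2", None)],  # task-2 has not finished
    )
    db.executemany(
        "INSERT INTO job_tasks VALUES (?, ?)",
        [("job-a", "task-1"), ("job-b", "task-1"), ("job-b", "task-2")],
    )
    return db


def jobs_sharing(db: sqlite3.Connection, task_uuid: str) -> list:
    """List the jobs that depend on a given task."""
    rows = db.execute(
        "SELECT job_uuid FROM job_tasks WHERE task_uuid = ? ORDER BY job_uuid",
        (task_uuid,),
    )
    return [row[0] for row in rows]
```

Here @task-1@ is referenced by both jobs without its output being copied, and a job could be linked to @task-2@ before that task has produced any output.
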
A facility (and incentive) could be provided to denote tasks as reusable even by users to whom they are otherwise invisible: "If you can guess exactly what I did, and you have permission to read the inputs, I'll admit I did that work and I'll show you the output."

h3. Revision History

|_.Date |_.Revisions Made |_.Author |_.Reviewed By |
| October 6, 2014 | Initial Draft | Tom Clegg |=. ---- |
| October 15, 2014 | (cont'd) | Tom Clegg |=. ---- |