Reusable tasks » History » Version 2
Tom Clegg, 10/08/2014 03:28 PM
1 | 1 | Tom Clegg | {{>toc}} |
---|---|---|---|
2 | |||
3 | h1=. Reusable tasks |
||
4 | |||
5 | p>. *"Tom Clegg":mailto:tom@curoverse.com |
||
6 | Last Updated: October 6, 2014* |
||
7 | |||
8 | h2. Overview |
||
9 | |||
10 | h3. Objective |
||
11 | |||
12 | Suppose jobs A and B, although not identical, have some tasks in common. Job A is complete. Job B is starting now. They use the same script, version, docker image, etc. The only difference between A and B is that B's input collection has one more file; the rest of the files are identical. The script processes each input file independently, and it is a pure function (re-computing the same files will produce the same result). This means most of Job B's work has already been done. Task re-use will allow Arvados to recognize this condition and re-use the outputs of Job A's tasks instead of recomputing them. |
||
13 | |||
14 | Task re-use will not attempt to detect equivalence conditions like differently-encoded collection manifests with identical data, differing git commits with identical trees, and differing docker images with functionally equivalent content. |
||
15 | |||
16 | The intended audience for this document is software engineers. |
||
17 | |||
18 | h3. Background |
||
19 | |||
20 | 2 | Tom Clegg | The arvados.v1.jobs.create API offers a find_or_create feature, which searches for an existing job that meets criteria specified by the client (e.g., same script, compatible script_version) and additional criteria (e.g., did not fail, is not marked impure/nondeterministic, does not disagree with other jobs passing the same criteria about what the correct output is). |
21 | 1 | Tom Clegg | |
22 | 2 | Tom Clegg | * http://doc.arvados.org/api/methods/jobs.html#create |
23 | |||
24 | 1 | Tom Clegg | h3. Alternatives |
25 | |||
26 | 2 | Tom Clegg | Always recompute each task (i.e., leave existing behavior). |
27 | 1 | Tom Clegg | |
28 | 2 | Tom Clegg | bq. This makes desirable use cases prohibitively expensive. |
29 | 1 | Tom Clegg | |
30 | 2 | Tom Clegg | Use smaller jobs, and more jobs per pipeline. |
31 | 1 | Tom Clegg | |
32 | 2 | Tom Clegg | bq. We could make the dynamic-structure capabilities of crunch jobs available at the pipeline level, and de-emphasize or stop using the features that encourage long-running jobs. Disadvantages include: |
33 | * The process of running a pipeline is not done in a controlled environment. This effectively reduces the utility of reproducibility and provenance features. |
||
34 | * Pipelines are currently encoded as JSON which is awkward to use as a DSL. |
||
35 | 1 | Tom Clegg | |
36 | 2 | Tom Clegg | h3. Tradeoffs |
37 | 1 | Tom Clegg | |
38 | 2 | Tom Clegg | _TODO_ |
39 | 1 | Tom Clegg | |
40 | 2 | Tom Clegg | h3. High Level Design |
41 | 1 | Tom Clegg | |
42 | 2 | Tom Clegg | Before executing a job_task that qualifies for re-use, crunch-job uses the API to discover existing job_tasks that are functionally identical, are marked as "pure", and have already finished. |
43 | |||
44 | 1 | Tom Clegg | h2. Specifics |
45 | |||
46 | h3. Detailed Design |
||
47 | |||
48 | 2 | Tom Clegg | The JobTask schema has a new boolean flag @is_pure@ (not null, default @false@). |
49 | 1 | Tom Clegg | |
50 | 2 | Tom Clegg | Just before starting a task having @is_pure==true@, crunch-job does an API query to look up other tasks with @is_pure==true@ and identical inputs, parameters, script_version, etc. |
51 | * Some attributes like script and script_version are currently stored in the job record, not the job_task record. In the absence of a generic "join" API, this complicates the lookup. |
||
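The lookup described above can be sketched in memory, without committing to any particular API call. This is a minimal illustration, not crunch-job's implementation: the @Job@ and @JobTask@ classes, the @docker_image@ and @finished@ field names, and the helpers @reuse_key@ and @find_reusable_task@ are all hypothetical. Because script and script_version live on the job record, the sketch resolves each candidate task's job first and then compares a combined key, which is roughly what a client has to do when no join is available.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    # Attributes the text says are stored on the job record, not the task.
    uuid: str
    script: str
    script_version: str
    docker_image: str  # hypothetical field name, for illustration only


@dataclass
class JobTask:
    uuid: str
    job_uuid: str
    parameters: dict
    is_pure: bool = False
    finished: bool = False  # hypothetical stand-in for "has already finished"
    output: Optional[str] = None


def reuse_key(job: Job, task: JobTask) -> tuple:
    """Combine job-level and task-level attributes into one comparable key."""
    return (
        job.script,
        job.script_version,
        job.docker_image,
        json.dumps(task.parameters, sort_keys=True),  # order-independent
    )


def find_reusable_task(
    task: JobTask, jobs: Dict[str, Job], tasks: List[JobTask]
) -> Optional[JobTask]:
    """Return a finished, pure task equivalent to `task`, or None."""
    if not task.is_pure:
        return None  # only pure tasks qualify for re-use
    wanted = reuse_key(jobs[task.job_uuid], task)
    for other in tasks:
        if other.uuid == task.uuid or not (other.is_pure and other.finished):
            continue
        # Manual "join": fetch the candidate's job to compare job attributes.
        if reuse_key(jobs[other.job_uuid], other) == wanted:
            return other
    return None


# Example data: job B repeats one of job A's tasks and adds a new one.
jobs = {
    "job-a": Job("job-a", "align.py", "abc123", "image-1"),
    "job-b": Job("job-b", "align.py", "abc123", "image-1"),
}
done = JobTask("task-a1", "job-a", {"input": "file1"},
               is_pure=True, finished=True, output="out1")
shared = JobTask("task-b1", "job-b", {"input": "file1"}, is_pure=True)
extra = JobTask("task-b2", "job-b", {"input": "file2"}, is_pure=True)
all_tasks = [done, shared, extra]
```

With this data, @find_reusable_task(shared, jobs, all_tasks)@ returns the finished task from job A, while the task for the extra input file finds no match and would have to run normally.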
52 | 1 | Tom Clegg | |
53 | 2 | Tom Clegg | Job tasks have one especially noteworthy side effect: queueing additional tasks. In order to reuse tasks safely without races, we need additional constraints: |
54 | * Tasks with @is_pure==true@ cannot queue additional tasks, *and* @is_pure@ cannot change from @false@ to @true@. |
||
55 | * Tasks do not qualify for reuse until they have completed.[1] When reusing a task, copy (and reset to "todo" state) each task whose @created_by_job_task_uuid@ attribute references the task being reused. |
||
56 | 1 | Tom Clegg | |
57 | 2 | Tom Clegg | fn1. At least in the short term, this constraint is a good way to limit the complexity of implementation without sacrificing too much of the user benefit. |
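The @is_pure@ transition rule and the copy step from the second bullet could look roughly like the following. This is a sketch under stated assumptions: the @state@ values, the generated UUIDs, and the helpers @check_is_pure_update@ and @reuse_task@ are hypothetical, and pointing each copy's @created_by_job_task_uuid@ at the reusing task is a guess rather than something the design specifies. The sketch does not enforce the rule that pure tasks cannot queue additional tasks, since the text does not say how that rule and the copy step interact.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class JobTask:
    uuid: str
    job_uuid: str
    is_pure: bool = False
    state: str = "todo"  # hypothetical states: "todo" or "done"
    output: Optional[str] = None
    created_by_job_task_uuid: Optional[str] = None


def check_is_pure_update(old_value: bool, new_value: bool) -> None:
    """Reject the one transition the design prohibits: false to true."""
    if not old_value and new_value:
        raise ValueError("is_pure cannot change from false to true")


def reuse_task(
    current: JobTask, finished: JobTask, all_tasks: List[JobTask]
) -> List[JobTask]:
    """Copy `finished`'s output into `current` and return copies of the
    tasks that `finished` queued, each reset to the "todo" state."""
    if finished.state != "done":
        raise ValueError("tasks do not qualify for reuse until completed")
    current.output = finished.output
    current.state = "done"
    copies = []
    for task in all_tasks:
        if task.created_by_job_task_uuid != finished.uuid:
            continue
        copies.append(replace(
            task,
            uuid=f"{current.uuid}-child{len(copies) + 1}",  # hypothetical id
            job_uuid=current.job_uuid,
            state="todo",
            output=None,
            created_by_job_task_uuid=current.uuid,  # assumption, see lead-in
        ))
    return copies
```

The originals are left untouched: only the copies belong to the new job, so the finished job's records stay intact as provenance.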
58 | 1 | Tom Clegg | |
59 | h3. Code Location |
||
60 | |||
61 | 2 | Tom Clegg | @sdk/cli/bin/crunch-job@ will have new task reuse logic. |
62 | 1 | Tom Clegg | |
63 | 2 | Tom Clegg | @services/api/db/migrate@ will have a new migration, which will be reflected in @services/api/db/structure.sql@. |
64 | |||
65 | @services/api/app/models/job_task.rb@ will add :is_pure to the API response and prohibit @is_pure@ from changing from @false@ to @true@. |
||
66 | |||
67 | @doc/api/schema/JobTask.html.textile.liquid@ will document the :is_pure flag. |
||
68 | |||
69 | 1 | Tom Clegg | h3. Testing Plan |
70 | |||
71 | 2 | Tom Clegg | _TODO_ |
72 | 1 | Tom Clegg | |
73 | h3. Logging |
||
74 | |||
75 | 2 | Tom Clegg | @crunch-job@ will log whenever it copies a task's output attribute (and, if applicable, queues additional tasks) from an existing completed task. |
76 | 1 | Tom Clegg | |
77 | h3. Debugging |
||
78 | |||
79 | 2 | Tom Clegg | _TODO_ |
80 | 1 | Tom Clegg | |
81 | h3. Caveats |
||
82 | |||
83 | To be determined. |
||
84 | |||
85 | h3. Security Concerns |
||
86 | |||
87 | 2 | Tom Clegg | _TODO_ |
88 | 1 | Tom Clegg | |
89 | h3. Open Questions and Risks |
||
90 | |||
91 | 2 | Tom Clegg | _TODO_ |
92 | 1 | Tom Clegg | |
93 | h3. Work Estimates |
||
94 | |||
95 | 2 | Tom Clegg | _TODO_ |
96 | 1 | Tom Clegg | |
97 | h3. Future Work |
||
98 | |||
99 | 2 | Tom Clegg | _TODO_ |
100 | 1 | Tom Clegg | |
101 | h3. Revision History |
||
102 | |||
103 | |_.Date |_.Revisions Made |_.Author |_.Reviewed By | |
||
104 | | October 6, 2014 | Initial Draft | Tom Clegg |=. ---- | |