Reusable tasks » History » Version 2

Tom Clegg, 10/08/2014 03:28 PM

{{>toc}}

h1=. Reusable tasks

p>. *"Tom Clegg":mailto:tom@curoverse.com
Last Updated: October 6, 2014*

h2. Overview

h3. Objective

Say jobs A and B, although not identical, have some tasks in common. Job A is complete and Job B is starting now. They use the same script, version, docker image, etc. The only difference between A and B is that B's input collection has one more file; the rest of the files are identical. The script processes each input file independently, and it is a pure function (re-computing the same files will produce the same result). This means most of Job B's work has already been done. Task re-use will allow Arvados to recognize this condition and re-use the outputs of Job A's tasks instead of recomputing them.

Task re-use will not attempt to detect equivalence conditions like differently-encoded collection manifests with identical data, differing git commits with identical trees, and differing docker images with functionally equivalent content.

The intended audience for this document is software engineers.

h3. Background

The arvados.v1.jobs.create API offers a find_or_create feature, which searches for an existing job that meets criteria specified by the client (e.g., same script, compatible script_version) and additional criteria (e.g., did not fail, is not marked impure/nondeterministic, does not disagree with other jobs passing the same criteria about what the correct output is).

* http://doc.arvados.org/api/methods/jobs.html#create

h3. Alternatives

Always recompute each task (i.e., leave existing behavior).

bq. This makes desirable use cases prohibitively expensive.

Use smaller jobs, and more jobs per pipeline.

bq. We could make the dynamic-structure capabilities of crunch jobs available at the pipeline level, and de-emphasize or stop using the features that encourage long-running jobs. Disadvantages include:
* The process of running a pipeline is not done in a controlled environment. This effectively reduces the utility of reproducibility and provenance features.
* Pipelines are currently encoded as JSON, which is awkward to use as a DSL.

h3. Tradeoffs

_TODO_

h3. High Level Design

Before executing a job_task that qualifies for re-use, crunch-job uses the API to discover existing job_tasks that are functionally identical, are marked as "pure", and have already finished.

h2. Specifics

h3. Detailed Design

The JobTask schema has a new boolean flag @is_pure@ (not null, default @false@).

Just before starting a task having @is_pure==true@, crunch-job queries the API to look up other tasks with @is_pure==true@ and identical inputs, parameters, script_version, etc.
* Some attributes like script and script_version are currently stored in the job record, not the job_task record. This complicates the lookup, in the absence of a generic "join" API.
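
The lookup described above can be sketched as a two-step search that works around the missing "join": first select the jobs that share the job-level attributes, then select the finished pure tasks belonging to those jobs. This is a minimal in-memory illustration in Python, not the crunch-job implementation. The @Job@ and @JobTask@ classes, their reduced field sets, the @success@ convention, and the @find_reusable_task@ helper are all assumptions made for the example; a real implementation would issue API list queries instead of scanning lists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    # Hypothetical, reduced job record: only the attributes compared here.
    uuid: str
    script: str
    script_version: str


@dataclass
class JobTask:
    # Hypothetical, reduced job_task record.
    uuid: str
    job_uuid: str
    parameters: dict
    is_pure: bool = False
    success: Optional[bool] = None  # assumed: None until the task has finished
    output: Optional[str] = None


def find_reusable_task(task, job, jobs, tasks):
    """Return a finished pure task equivalent to `task`, or None."""
    if not task.is_pure:
        return None
    # Step 1: job-level attributes live on the job record, so find the
    # jobs that ran the same script at the same version first.
    matching_jobs = {
        j.uuid for j in jobs
        if j.script == job.script and j.script_version == job.script_version
    }
    # Step 2: among tasks of those jobs, keep pure tasks that finished
    # successfully with identical parameters.
    for other in tasks:
        if (other.uuid != task.uuid
                and other.job_uuid in matching_jobs
                and other.is_pure
                and other.success is True
                and other.parameters == task.parameters):
            return other
    return None


job_a = Job("job-a", "hash.py", "abc123")
job_b = Job("job-b", "hash.py", "abc123")
done = JobTask("task-a1", "job-a", {"input": "file1"},
               is_pure=True, success=True, output="out1")
new_same = JobTask("task-b1", "job-b", {"input": "file1"}, is_pure=True)
new_extra = JobTask("task-b2", "job-b", {"input": "file2"}, is_pure=True)
all_tasks = [done, new_same, new_extra]

match = find_reusable_task(new_same, job_b, [job_a, job_b], all_tasks)
no_match = find_reusable_task(new_extra, job_b, [job_a, job_b], all_tasks)
```

In this example the task for the shared input file finds Job A's finished task, while the task for the one additional file finds nothing and would have to run normally.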

Job tasks have one especially noteworthy side effect: queueing additional tasks. In order to reuse tasks safely without races, we need additional constraints:
* Tasks with @is_pure==true@ cannot queue additional tasks, *and* @is_pure@ cannot change from @false@ to @true@.
* Tasks do not qualify for reuse until they have completed.[1] When reusing a task, copy (and reset to "todo" state) each task whose @created_by_job_task_uuid@ attribute references the task being reused.

fn1. At least in the short term, this constraint is a good way to limit the complexity of implementation without sacrificing too much of the user benefit.
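
The second constraint above (reuse only completed tasks, and re-queue copies of the tasks they created) might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the planned crunch-job code: the reduced @JobTask@ class, the "todo"/"done" state names, the @reuse_task@ helper, the generated uuids, and the choice to attach each copy to the new job and the new parent task are all hypothetical.

```python
import copy
import itertools
from dataclasses import dataclass
from typing import Optional

_copy_ids = itertools.count(1)  # hypothetical uuid source for copied tasks


@dataclass
class JobTask:
    # Hypothetical, reduced job_task record.
    uuid: str
    job_uuid: str
    state: str = "todo"  # assumed states: "todo" or "done"
    output: Optional[str] = None
    created_by_job_task_uuid: Optional[str] = None


def reuse_task(new_task, old_task, tasks):
    """Copy old_task's output into new_task; return re-queued child copies."""
    if old_task.state != "done":
        # Tasks do not qualify for reuse until they have completed.
        raise ValueError("only completed tasks qualify for reuse")
    new_task.output = old_task.output
    new_task.state = "done"
    children = []
    for t in tasks:
        if t.created_by_job_task_uuid == old_task.uuid:
            child = copy.copy(t)
            child.uuid = f"copy-{next(_copy_ids)}"
            # Assumption: the copy belongs to the new job and new parent.
            child.job_uuid = new_task.job_uuid
            child.created_by_job_task_uuid = new_task.uuid
            # Reset to "todo" so the copy is executed (or reused) itself.
            child.state = "todo"
            child.output = None
            children.append(child)
    return children


old = JobTask("a1", "job-a", "done", "out1")
old_child = JobTask("a2", "job-a", "done", "out2", created_by_job_task_uuid="a1")
new = JobTask("b1", "job-b")
queued = reuse_task(new, old, [old, old_child])
```

Because each copied child starts in the "todo" state, it goes through the same reuse check as any other task before it runs, so reuse can proceed level by level without the original task tree being modified.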

h3. Code Location

@sdk/cli/bin/crunch-job@ will have new task reuse logic.

@services/api/db/migrate@ will have a new migration, which will be reflected in @services/api/db/structure.sql@.

@services/api/app/models/job_task.rb@ will add :is_pure to the API response and prohibit @is_pure@ from changing from @false@ to @true@.

@doc/api/schema/JobTask.html.textile.liquid@ will document the :is_pure flag.
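
The rule that the model enforces on @is_pure@ is small enough to state as code. The actual check would live in the Ruby model named above; the Python sketch below only illustrates the transition rule, and both function names are hypothetical.

```python
def is_pure_change_allowed(old_value: bool, new_value: bool) -> bool:
    """Only the false -> true transition is prohibited."""
    return not (old_value is False and new_value is True)


def validate_is_pure_change(old_value: bool, new_value: bool) -> None:
    """Raise if an update would turn an impure task into a pure one."""
    if not is_pure_change_allowed(old_value, new_value):
        raise ValueError("is_pure cannot change from false to true")


# All four possible transitions; only false -> true is rejected.
allowed = {
    (old, new): is_pure_change_allowed(old, new)
    for old in (False, True)
    for new in (False, True)
}
```

Under this rule a task must be created with @is_pure@ already set to @true@ in order to be reusable, since it cannot be marked pure later on.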

h3. Testing Plan

_TODO_

h3. Logging

When it reuses a task, @crunch-job@ will log the fact that it has copied the task's output attribute (and, if applicable, queued additional tasks) from an existing completed task.

h3. Debugging

_TODO_

h3. Caveats

To be determined.

h3. Security Concerns

_TODO_

h3. Open Questions and Risks

_TODO_

h3. Work Estimates

_TODO_

h3. Future Work

_TODO_

h3. Revision History

|_. Date |_. Revisions Made |_. Author |_. Reviewed By |
| October 6, 2014 | Initial Draft | Tom Clegg |=. ---- |