Crunch1-in-Crunch2 » History » Version 3

Tom Clegg, 07/15/2015 04:09 PM

{{>TOC}}

h1. Crunch1-in-Crunch2 (DRAFT)

Detail about how Crunch2 runs jobs that were written for Crunch1.

See
* Crunch2 [[Containers API]]
* [[Crunch2 migration]]

h2. Background

In order for Crunch2 to replace Crunch1, Crunch2 must:
* run jobs that rely on Crunch1's API, like
** run-command
** arv-run (via run-command)
** existing tutorial/example jobs
** user scripts based on existing tutorials
* accept job submissions from clients using the Crunch1 API, like
** arv-run-pipeline-instance
** user scripts
* maintain the ability to view progress of Crunch1 jobs using Crunch1 clients, like
** Workbench
** arv-run-pipeline-instance

Crunch1 jobs rely on the following pieces:
* Keep mount available within the container
* Some environment variables (CRUNCH_SRC, ARVADOS_API_*, etc)
* jobs and job_tasks APIs for executing work on multiple nodes

h2. Approach

h3. Submitting a job

Translate the incoming Crunch1 job submission to a Crunch2 job request.
* The container and command given in the job request are determined by the server configuration, because the Crunch1 API doesn't specify which crunch-job (or which version of it) is to be used.

Create the job request using the JobRequests controller.

Create a job record just as before, but set a flag so crunch-dispatch doesn't try to run it. (This could be implemented as a "Proxy" state.)
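
The submission steps above could look roughly like the following sketch. It is an illustration only, not the actual API server code: the configuration keys, the job request fields ("container", "command", "inputs"), the "job_request_uuid" attribute, and the create_job_request callable (standing in for the JobRequests controller) are all assumptions made for the example.

```python
# Hypothetical server configuration. The Crunch1 API does not say which
# crunch-job to use, so the runner's container and command come from here.
SERVER_CONFIG = {
    "runner_container": "example/crunch1-runner",  # assumed image name
    "runner_command": ["crunch1-runner"],          # assumed command
}


def translate_job(crunch1_job, config=SERVER_CONFIG):
    """Translate a Crunch1 job submission into a Crunch2 job request.

    All field names here are illustrative, not the real Crunch2 schema.
    """
    return {
        "container": config["runner_container"],
        "command": list(config["runner_command"]),
        # Assumption: the original submission is handed to the runner as an input.
        "inputs": {"crunch1_job": dict(crunch1_job)},
    }


def submit_crunch1_job(crunch1_job, create_job_request, config=SERVER_CONFIG):
    """Create the job request, then the flagged Crunch1 job record.

    create_job_request stands in for the JobRequests controller: it takes a
    job request dict and returns the stored request, including a "uuid".
    """
    job_request = create_job_request(translate_job(crunch1_job, config))
    # "Proxy" is the flag that tells crunch-dispatch not to run this job itself.
    job_record = dict(crunch1_job, state="Proxy",
                      job_request_uuid=job_request["uuid"])
    return job_record, job_request
```

Keeping the translation in its own function means the server configuration is the only place that decides which runner is used, which matches the point that the Crunch1 API itself has no say in it.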

h3. Running a job

Once it has been translated to a job request, a Crunch1 job is merely a Crunch2 job (the "parent") that acts as any "workflow runner" would: it submits additional job requests of its own (the "children"). The notable difference is that it uses an additional communication channel not normally used by Crunch2 jobs:
* The children perform Arvados API requests (jobs.get, job_tasks.get, job_tasks.update, and job_tasks.create) to get information about themselves and to ask the parent to submit more job requests.
* The parent performs Arvados API requests (presumably job_tasks.list and job_tasks.get) to get the information submitted by the children.

The Crunch1 runner implements the same algorithm as crunch-job, but with a few simplifying restrictions.
* It has only one way to run tasks: submit a job request[1].
* It doesn't construct docker command lines or run docker itself. Instead, it writes Crunch2 job requests.
* It doesn't retry tasks. Crunch2 is responsible for this.
* It doesn't look for node failure. Crunch2 is responsible for this too.
* It doesn't copy stderr to Keep. Crunch2 is responsible for this too.
* It doesn't know anything about slurm.

With all of that removed, the Crunch1 runner algorithm reduces to something like this:
* Submit a job request for "task 0".
* When the assigned job succeeds, look for new job_tasks that it submitted. Add these to a list of "pending" tasks.
* Take min(sequence) across all pending job_tasks. Remove the job_tasks with that sequence from "pending", translate them, and submit them as job requests.
* Repeat until all submitted job requests have been assigned and finished, and "pending" is empty.
* Collate task outputs into a job output.
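
The loop above can be sketched in a few lines. This is a simplified illustration, not the real runner. The two callables are hypothetical stand-ins: submit_and_wait submits one task as a job request and blocks until the assigned job finishes, and list_new_tasks returns the job_tasks that a finished task created. The sketch also assumes that each batch is run to completion before the next min(sequence) is taken, that a failed child aborts the run, and that "collating" simply means collecting the outputs in a list.

```python
def run_crunch1_job(submit_and_wait, list_new_tasks):
    """Run a Crunch1 job as a series of Crunch2 job requests.

    submit_and_wait(task) -> {"success": bool, "output": ...}   (hypothetical)
    list_new_tasks(task)  -> list of {"name": ..., "sequence": int}  (hypothetical)
    """
    pending = []   # job_tasks created by children, not yet submitted
    outputs = []   # task outputs, to be collated into the job output
    batch = [{"name": "task 0", "sequence": 0}]

    while batch:
        for task in batch:
            result = submit_and_wait(task)
            if not result["success"]:
                # Retries are Crunch2's job; the sketch just gives up here.
                raise RuntimeError("child job failed: " + task["name"])
            outputs.append(result["output"])
            # Look for new job_tasks submitted by the job that just succeeded.
            pending.extend(list_new_tasks(task))

        if not pending:
            break

        # Take min(sequence) and move the matching tasks out of "pending".
        lowest = min(t["sequence"] for t in pending)
        batch = [t for t in pending if t["sequence"] == lowest]
        pending = [t for t in pending if t["sequence"] != lowest]

    return outputs
```

Note that a task created late with a low sequence number is still picked up correctly, because min(sequence) is recomputed over whatever is pending each time a batch finishes.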

TBD:
* If a child job (formerly "job_task") sets the parent job's (formerly "job's") output attribute, that child job cannot be reused to fulfill a future job request. Either this should be handled transparently, or this use case should be prohibited (at the cost of breaking some Crunch1 jobs).
* If a child job reads the parent job record (which is nearly universal among Crunch1 jobs), it cannot be reused to fulfill a future job request _except_ where the future job request would return the same values. This could be ensured by copying the parent job's Crunch1 job record into the Crunch2 job request's inputs. However, this would effectively prevent _any_ Crunch1 job from reusing tasks across non-identical jobs.
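
The trade-off in the second point can be made concrete with a small sketch. This page does not say how Crunch2 decides that an earlier job can fulfill a new request, so the sketch assumes, purely for illustration, that reuse is keyed on a hash of the request content. The function names, the "run-task" command, and the request fields are all hypothetical.

```python
import hashlib
import json


def reuse_key(job_request):
    """Assumed reuse rule: two requests match if their content hashes match."""
    canonical = json.dumps(job_request, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def child_request(task_parameters, parent_job_record=None):
    """Build a child job request, optionally embedding the parent job record."""
    inputs = {"task": task_parameters}
    if parent_job_record is not None:
        # Copying the parent's Crunch1 job record into the inputs makes
        # everything the child can read part of the reuse key.
        inputs["crunch1_job"] = parent_job_record
    return {"command": ["run-task"], "inputs": inputs}


task = {"input": "file1"}
same_parent = reuse_key(child_request(task, {"uuid": "job-1"}))
other_parent = reuse_key(child_request(task, {"uuid": "job-2"}))
# Identical task, different parent job record: the keys differ, so no reuse.
print(same_parent == other_parent)
```

Under this assumption, embedding the parent record keeps reuse safe, but any difference between two parent job records, even an irrelevant one, is enough to block reuse of an otherwise identical task.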

fn1. This means "local dev jobs" will require a dev/transient install of the Crunch infrastructure. This is probably a good thing overall, but it does mean we need to do the work of making the transient infrastructure spring up quickly and easily.

h3. Getting job status

Clients must be able to get the current status of a Crunch1 job (i.e., one that was submitted with the Crunch1 API) by using the Crunch1 "list" and "get" APIs. This is necessary for existing clients (including Crunch1 jobs themselves) to continue working without modifications after Crunch2 has replaced Crunch1 as the execution engine.

Clients must be able to get job status for both Crunch1- and Crunch2-submitted jobs using only the Crunch2 "list" and "get" APIs. This makes it possible to migrate Workbench from Crunch1 to Crunch2 without losing the ability to see old jobs.

However, it is _not_ necessary for Crunch1 clients to see Crunch2 jobs.

When a job is or was executed by Crunch2, the Crunch2 API is the source of truth about its state. Therefore:
* Crunch1 APIs that modify a job must also modify the corresponding Crunch2 record(s). This might be the empty set, though: crunch-job's replacement will use Crunch2 directly rather than using Crunch1's jobs.update API to update job output/progress, for example.
* Crunch1 APIs that retrieve a job must read the Crunch2 record.
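
A read path that follows the second rule might look like the sketch below. It is only an illustration: the "Proxy" state comes from the suggestion under "Submitting a job", while the "job_request_uuid" attribute, the fetch_crunch2_record callable, and the choice of which fields to overlay are assumptions. It also assumes the Crunch2 record uses state names that Crunch1 clients already understand, which a real implementation would have to check.

```python
# Fields for which the Crunch2 record is treated as the source of truth.
# This list is an assumption made for the example.
CRUNCH2_OWNED_FIELDS = ("state", "output")


def get_crunch1_job(job_record, fetch_crunch2_record):
    """Return the view of a job that a Crunch1 "get" would serve.

    fetch_crunch2_record(uuid) is a hypothetical lookup that returns the
    corresponding Crunch2 record as a dict.
    """
    view = dict(job_record)
    if job_record.get("state") != "Proxy":
        # Not executed by Crunch2: the Crunch1 record is served as-is.
        return view
    crunch2 = fetch_crunch2_record(job_record["job_request_uuid"])
    for name in CRUNCH2_OWNED_FIELDS:
        view[name] = crunch2.get(name)
    return view
```

Overlaying the fields at read time, rather than copying them into the Crunch1 record, avoids having two stored copies of the job state that could drift apart.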

h3. job_tasks APIs

The job_tasks APIs are used by Crunch1 jobs to communicate between crunch-job and the processes it runs on allocated nodes. The API server doesn't need to touch these.