Project

General

Profile

Container dispatch » History » Version 13

Peter Amstutz, 12/13/2015 04:33 PM

1 1 Peter Amstutz
h1. Crunch2 dispatch
2 2 Peter Amstutz
3 9 Peter Amstutz
h2. Design sketch
4
5 13 Peter Amstutz
crunch-dispatch-slurm (sLURm to crunCH = "Lurch"?)
6 9 Peter Amstutz
7 13 Peter Amstutz
Responsibilities:
8 9 Peter Amstutz
9 1 Peter Amstutz
* Monitor containers table. 
10
## When a new container appears, claim it
11 12 Peter Amstutz
## Put the job into the slurm job queue with  @sbatch --immediate --share --export=ARVADOS_API_HOST,ARVADOS_API_TOKEN@
12
## Possibly request new node from node manager (mechanism TBD, see below)
13 11 Peter Amstutz
## Job consists of running "arv-container" on the compute node given the API token and container uuid.
14
## arv-container is responsible for actually running the container, see https://dev.arvados.org/issues/8001
15 12 Peter Amstutz
## arv-container should update the container record unless there is a node failure
16
* Monitor slurm job status 
17
## Run @squeue@ to get status
18
## Synchronize with containers table: cancel priority 0 jobs, update container status from slurm queue.
19 1 Peter Amstutz
20 13 Peter Amstutz
Decisions to make:
21 1 Peter Amstutz
22 12 Peter Amstutz
Two new components: arv-container and crunch-dispatch-slurm.
23
24 1 Peter Amstutz
What language to write arv-container in?
25 12 Peter Amstutz
26
* If Python
27 13 Peter Amstutz
** Will require Python SDK be installed on every compute node.
28
** Could integrate FUSE mount directly instead of requiring arv-mount subprocess, would make it easier to detect keep errors
29
** Could talk to docker daemon directly instead of using docker CLI
30 12 Peter Amstutz
* In Go:
31
** Will still require Python SDK for arv-mount
32
** However, could integrate with a future Go FUSE mount; would be possible to eventually eliminate python dependencies making arv-container much more lightweight.
33
34
What language to write crunch-dispatch-slurm?
35
36
* In Python
37
** Could be integrated with node manager?
38
* In Go:
39
** If using websockets to listen for container events (new containers added, priority changes) will require Go websocket client
40
41
How to decide and communicate requests for new cloud nodes?
42
43
* Could use http://slurm.schedmd.com/elastic_computing.html
44
** Potentially non-trival amount of work to configure slurm
45
* Compute wishlist & send to node manager
46
** Need to predict how slurm will allocate jobs to nodes or risk under- or over-allocating nodes. 
47
48
Retry
49
50
* Slurm can re-queue jobs on node failure automatically
51 13 Peter Amstutz
* Can make retry decisions at API server, can detect and retry on wider class of infrastructure failure
52 9 Peter Amstutz
53
h1. Notes
54
55 3 Peter Amstutz
Some discussion notes from https://dev.arvados.org/issues/6429 and https://dev.arvados.org/issues/6518
56
57 4 Peter Amstutz
h2. Interaction with node manager
58 3 Peter Amstutz
59 2 Peter Amstutz
Suggest writing crunch 2 job dispatcher as a new set of actors in node manger.
60
61
This would enable us to solve the question of communication between the scheduler and cloud node management (#6520).
62
63
Node manager already has a lot of the framework we will want like concurrency (can have one actor per job) and a configuration system.
64
65 1 Peter Amstutz
Different schedulers (slurm, sge, kubernetes) can be implemented as modules similarly to how different cloud providers are supported now.
66 4 Peter Amstutz
67
If we don't actually combine them, we should at least move the logic of deciding how many and what size nodes to run from the node manager to the dispatcher; the dispatcher can then communicate its wishlist to node manager.
68 2 Peter Amstutz
69
h2.  Interaction with API
70
71
More ideas:
72
73
Have a "dispatchers" table. Dispatcher processes are responsible for pinging the API server similar to how it is done for nodes to show they are alive.
74
75
A dispatcher claims a container by setting "dispatcher" field to it's UUID. This field can only be set once and that locks the record so that only the dispatcher can update it.
76
77
If a dispatcher stops pinging, the containers it has claimed should be marked as TempFail.
78
79
Dispatchers should be able to annotate containers (preferably through links) for example "I can't run this because I don't have any nodes with 40 GiB of RAM".
80
81
h2. Retry
82
83
How do we handle failure? Is the dispatcher required to retry containers that fail, or is the dispatcher a "best effort" service and the API decides to retry by scheduling a new container?
84
85
Currently the container_uuid field only holds a single container_uuid at a time. If the API schedules a new container, does that mean any container requests associated with that container get updated with the new container?
86
87
If the container_uuid field only holds one container at a time, and container don't link back to the container requests that created, then we don't have a way to record of past attempts to fulfill this request. This means we don't have anything to check against container_count_max. A few possible solutions:
88
89
* Make container_uuid an array of containers created to fulfill a given container request (this introduces complexity)
90
* Decrement container_count_max on the request when submitting a new container
91
* Compute content address of the container request and discover containers with that content address. This would conflict with "no reuse" or "impure" requests which are supposed to ignore past execution history. Could solve this by salting the content address with a timestamp; "no reuse" containers would never ever be reusable which might be fine.
92
93
I think we should distinguish between infrastructure failure and task failure by distinguishing between "TempFail" and "PermFail" in the container state. "TempFail" shouldn't count againt the container_count_max count, or alternately we only honor container_count_max for "TempFail" tasks and don't retry "PermFail".
94
95
Ideally, "TempFail" containers should retry forever, but with a backoff. One way to do the backoff is to schedule the container to run at a specific time in the future.
96
97
h2. Scheduling
98
99
Having a field specifying "wait until time X to run this container" would be generally useful for cron-style tasks.
100 5 Peter Amstutz
101
h2. State changes
102
103
Container request:
104
105
||priority == 0|priority > 0|
106
|Uncommitted|nothing|nothing|
107
|Committed|set container priority to max(all request priority)|set container priority to max(all request priority)|
108 6 Peter Amstutz
|Final|invalid|invalid|
109 1 Peter Amstutz
110 7 Peter Amstutz
When a container request goes to the "committed" state, it is assigned a container.
111
112 1 Peter Amstutz
Container:
113 6 Peter Amstutz
114
||priority == 0|priority > 0|
115
|Queued|Desire change state to "Cancelled"|Desire to change change state to "Running"|
116
|Running|Desire change state to "Cancelled"|nothing|
117
|Complete|invalid|invalid|
118
|Cancelled|invalid|invalid|
119 7 Peter Amstutz
120
When the container goes to either the "complete" or "cancelled" state, any associated container requests go to "final" state.
121 8 Peter Amstutz
122
h2. Claims
123
124
Have a field "claimed_by_uuid" on the Container record.  A queued container is claimed by a dispatcher process via an atomic "claim" operation.  A claim can be released if the container is still in the Queued state.
125
126
The container record cannot be updated by anyone except the owner of the claim.