Container dispatch » History » Version 26
Tom Clegg, 06/27/2016 07:53 PM
h1. Container dispatch

{{toc}}

h2. Summary

A dispatcher uses available compute resources to execute queued containers.

Dispatch is meant to be a small, simple component rather than a pluggable framework: e.g., "slurm dispatch" can be a small standalone program, rather than a plugin for a big generic dispatch program.
h2. Pseudocode

* Notice there is a queued container
* Decide whether the required resources are available to run the container
* Lock the container (this avoids races with other dispatch processes)
* Translate the container's runtime constraints and priority into instructions for the lower-level scheduler, if any
* Invoke the "crunch-run" executor
* When the priority changes on a container taken by this dispatch process, update the lower-level scheduler accordingly (cancel if priority is zero)
* If the lower-level scheduler indicates the container is finished or abandoned, but the Container record is locked by this dispatcher and has state=Running, fail the container
h2. Examples

slurm batch mode
* Use "sinfo" to determine whether it is possible to run the container
* Submit a batch job to the queue: "echo crunch-run --job {uuid} | sbatch -N1"
* When container priority changes, use scontrol and scancel to propagate changes to slurm
* Use strigger to run a cleanup script when a container exits
standalone worker
* Inspect /proc/meminfo, /proc/cpuinfo, "docker ps", etc. to determine local capacity
* Invoke crunch-run as a child process (or perhaps a detached daemon process)
* Signal crunch-run to stop if container priority changes to zero
h2. Security

A dispatch process needs to:
* List queued containers
* List containers locked by its own token
* Lock queued containers
* Update containers locked by its own token

Crunch-run needs to:
* Get the inside-container API token (either from the API or from its caller/environment)
* Update its own container record
* Get the manifest for the ContainerImage collection
* Create output/log collections

The container itself, and arv-mount if enabled, need to:
* Act as the requesting user (if RuntimeConstraints.API is enabled)
The security model should assume worker nodes are less trusted than dispatch nodes. For example, there may be cheaper, less-trusted worker nodes where less-sensitive containers can be run, and those worker nodes should never be given a token that would let them see any of the more-sensitive containers. In such cases the dispatcher needs a less-powerful token to give to crunch-run: it cannot pass along its own token, but crunch-run still needs some way to update container state (the token passed to the container itself can't do this). This less-powerful token must be scoped to permit updates only to specific containers. For example, a new single-container token can be created each time a container changes state from Queued to Locked.
h2. Locking

At a given moment, there should be (at most) one process responsible for running a given container, and that process should be the only one updating the database record.

Certain common scenarios threaten to disrupt this:
# Two dispatch processes are running. They both notice a queued container, and they both decide to run it.
# A dispatch process decides to run a container and starts a crunch-run process (e.g., via slurm), but the dispatch service restarts while crunch-run is still running.
# A sysadmin or daemon-supervisor mishap results in two concurrent dispatch processes using the same token. This _should_ be preventable, but it's still desirable to behave correctly if it happens.
The first scenario ("multiple dispatch, different tokens") is addressed by the locked_by_uuid field.

In the second scenario ("amnesiac dispatch"):
* As long as the original crunch-run is running (or queued in slurm), the new dispatch process should leave it alone.
* If the new dispatch process knows somehow (e.g., via squeue) that the original crunch-run process has stopped without moving the container record out of Running state, it should clean up the container record accordingly.
* If the new dispatch process makes a mistake here, and tries to clean up the container record while crunch-run is still alive, _one of them must lose:_ if the cleanup transaction is successful, all of crunch-run's subsequent transactions must fail.
** If state=Running, cleanup will change state to Cancelled, which itself ensures subsequent transactions will fail.
** If state=Locked, cleanup will change state to Queued, and the new dispatch process might use the same dispatch token to take it off the queue and change state to Locked. An additional locking mechanism is needed here.

In the third scenario ("multiple dispatch, same token"):
* If both processes acquire the lock by doing an "update" transaction with state=Locked, using the same token, then after a race both will think they succeeded: the loser's update will look like a no-op.
* Solution 1: use an explicit "lock" API.
* Solution 2: use an If-Match HTTP header when updating with intent to acquire a lock.
h2. Arvados API support

Each dispatch process has an Arvados API token that allows it to see queued containers.
* No two dispatch processes can run at the same time with the same token. One way to achieve this is to make a user record for each dispatch service.

Container APIs relevant to a dispatch program:
* List Queued containers (might be a subset of Queued containers)
* List containers with state=Locked or state=Running associated with the current token
** arvados.v1.containers.current (equivalent to @filters=[["dispatch_auth_uuid","=",current_client_auth.uuid]]@)
* Receive event when a container is created or modified and state is Queued (it might become runnable)
* Change state Queued->Locked
* Change state Locked->Queued
* Change state Locked->Running
* Change state Running->Complete
* Receive event when priority changes
* Receive event when state changes to Complete
* Retrieve an API token to pass into the container and its arv-mount process (via crunch-run)
** Token is automatically created/assigned when container state changes to Locked
** Token is automatically expired/destroyed when container state changes away from Running
** arvados.v1.containers.container_auth(uuid=container.uuid) → returns an api_client_authorization record
* Create events/logs
** Decided to run this container
** Decided not to run this container (e.g., no node with those resources)
** Lock failed
** Dispatched to crunch-run
** Cleaned up crashed crunch-run (lower-level scheduler indicates the job finished, but crunch-run didn't leave the container in a final state)
** Cleaned up abandoned container (container belongs to this process, but dispatch and lower-level scheduler don't know about it)
h2. Non-responsibilities

Dispatch doesn't retry failed containers. If something needs to be reattempted, a new container will appear in the queue.

Dispatch doesn't fail a container that it can't run. It doesn't know whether other dispatchers will be able to run it.

h2. Additional notes

(see also #6429 and #6518 and #8028)

Using websockets to listen for container events (new containers added, priority changes) will benefit from some Go SDK support.
h2. Cloud Container Services

Cloud providers now offer container execution services. However, rather than being just an API to run containers (similar to Crunch), these take the form of preconfigured clusters set up with a container orchestration system.

AWS offers Elastic Container Service. It appears that the leader runs on AWS infrastructure (?) and you spin up worker VMs which run the ECS Agent: https://github.com/aws/amazon-ecs-agent

Google Container Engine provides a preconfigured Kubernetes cluster. https://cloud.google.com/container-engine/docs/clusters/operations

Azure provides a preconfigured Mesos or Docker Swarm cluster. https://azure.microsoft.com/en-us/services/container-service/