h1. Fixing cloud scheduling

Our current approach to scheduling containers on the cloud using SLURM has a number of problems:

* Head-of-line problem: with a single queue, slurm only schedules the job at the top of the queue; if that job cannot be scheduled, every other job has to wait. This leaves nodes wastefully idle and reduces throughput.
* Queue ordering doesn't reflect our desired priority order without a lot of hacking around with "niceness".
* Slurm forgets its dynamic node configuration, requiring constant maintenance processes to reset it.

Things that slurm currently provides:

* allocating containers to specific nodes
* reporting node state: idle, busy, failed, down, or out of contact

h2. Crunch-dispatch-local

(currently the preferred option)

Node manager generates a wishlist based on the container queue. Compute nodes run crunch-dispatch-local or a similar service, which asks the API server for work and then runs it.
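
A minimal sketch, in Go, of what that polling loop could look like. The endpoint name, response shape, and token plumbing are assumptions for illustration, not an existing API:

<pre><code class="go">
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"
	"os/exec"
	"time"
)

// assignedWork is the assumed shape of the API server's reply to a
// request for work: the UUID of a locked container, or empty if the
// node should go idle.
type assignedWork struct {
	ContainerUUID string `json:"container_uuid"`
}

func main() {
	token := os.Getenv("ARVADOS_API_TOKEN") // per-node API token
	for {
		work, err := requestWork(token)
		switch {
		case err != nil:
			log.Printf("asking for work: %v", err)
		case work.ContainerUUID != "":
			// Run the container to completion; the node stays "busy"
			// until the next poll (see "Running containers" below).
			if err := exec.Command("crunch-run", work.ContainerUUID).Run(); err != nil {
				log.Printf("crunch-run %s: %v", work.ContainerUUID, err)
			}
		}
		time.Sleep(10 * time.Second) // poll/ping interval
	}
}

// requestWork asks the API server for the next container appropriate
// for this node; the endpoint name is made up for this sketch.
func requestWork(token string) (assignedWork, error) {
	var w assignedWork
	req, err := http.NewRequest("POST", "https://api.example.com/arvados/v1/nodes/assign_work", nil)
	if err != nil {
		return w, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return w, err
	}
	defer resp.Body.Close()
	return w, json.NewDecoder(resp.Body).Decode(&w)
}
</code></pre>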

Advantages:

* Complete control over scheduling decisions / priority

Disadvantages:

* Additional load on the API server (but probably not that much)
* Need a new scheme for nodes to report their status so that node manager knows whether they are busy or idle. Node manager has to be able to put nodes in the equivalent of a "draining" state to ensure they don't get shut down while doing work. (We can use the "nodes" table for this.)
* Need to be able to detect node failure.

h3. Starting up

# Node manager looks at pending containers to get a "wishlist"
# Nodes spin up the way they do now. However, instead of registering with slurm, they start crunch-dispatch-local.
# The node ping token should have a corresponding API token, to be used by the dispatcher to talk to the API server
# C-d-l pings the API server to ask for work; the ping operation puts the node in either the "busy" state (if work is returned) or "idle" (see the state sketch below)
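
Pulling together the states mentioned in this and the following sections, the node lifecycle on the nodes table might look like this. The state names come from the prose; the exact transition table is an assumption:

<pre><code class="go">
// Hypothetical summary of the node lifecycle used in this proposal.
package nodestate

type NodeState string

const (
	Idle     NodeState = "idle"     // pinged for work, none assigned
	Busy     NodeState = "busy"     // running a container
	Draining NodeState = "draining" // finish current work, accept no more
	Drained  NodeState = "drained"  // safe to destroy the cloud node
	Failed   NodeState = "failed"   // missed pings, or self-reported failure
)

// ValidNext maps each state to the states it may move to;
// "drained" and "failed" are terminal.
var ValidNext = map[NodeState][]NodeState{
	Idle:     {Busy, Draining, Failed},
	Busy:     {Idle, Draining, Failed},
	Draining: {Drained, Failed},
	Drained:  {},
	Failed:   {},
}
</code></pre>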

h3. Running containers

Assumption: nodes only run one container at a time.

# Add an "I am idle, give me work" API which locks and returns the next container that is appropriate for the node, or marks the node "idle" if no work is available (sketched below)
# The node record records which container the node is supposed to be running (this can be part of the "Lock" call, based on the per-node API token)
# C-d-l makes an API call to the nodes table to say it is "busy"
# C-d-l calls crunch-run to run the container
# C-d-l must continue to ping that it is "busy" every X seconds
# When the container finishes, c-d-l pings that it is "idle"
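
A sketch of how the server side of that API might behave, using simplified in-memory stand-ins for the containers and nodes tables; the real version would be an API server endpoint operating inside a database transaction:

<pre><code class="go">
package scheduler

import "sync"

// Container and Node loosely mirror the containers and nodes tables;
// both are simplified stand-ins for this sketch.
type Container struct {
	UUID     string
	State    string // Queued -> Locked -> Running -> Complete/Cancelled
	Priority int
}

type Node struct {
	UUID          string
	State         string // idle/busy/draining/drained/failed
	ContainerUUID string // the container this node is supposed to run
}

var mu sync.Mutex // stands in for a database transaction

// giveMeWork implements the "I am idle, give me work" call: lock and
// return the next appropriate container, or mark the node idle (or
// drained, if it was draining) when there is nothing to run.
func giveMeWork(node *Node, queue []*Container) *Container {
	mu.Lock()
	defer mu.Unlock()
	if node.State == "draining" {
		node.State = "drained" // see "Shutting down" below
		return nil
	}
	var best *Container
	for _, c := range queue {
		if c.State == "Queued" && fitsOn(node, c) &&
			(best == nil || c.Priority > best.Priority) {
			best = c
		}
	}
	if best == nil {
		node.State = "idle"
		return nil
	}
	best.State = "Locked"          // lock the container (step 1)
	node.ContainerUUID = best.UUID // record the assignment (step 2)
	node.State = "busy"            // or leave this to c-d-l's "busy" ping (step 3)
	return best
}

// fitsOn would check the container's RAM/CPU/scratch constraints
// against the node size; trivially true in this sketch.
func fitsOn(*Node, *Container) bool { return true }
</code></pre>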
||
47 | |||
48 | h3. Shutting down |
||
49 | |||
50 | # When node manager decides a node is ready for shutdown, it makes an API call on the node record to indicate "draining". |
||
51 | # C-d-l pings "I am idle" on a "draining" record. This puts the state in "drained" and c-d-l does not get any new work. |
||
52 | 8 | Peter Amstutz | # Node manager sees the node is "drained" and can proceed with destroying the cloud node. |
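
Node manager's side of that handshake might look like the following; the three helper functions are hypothetical stubs for API and cloud-driver calls:

<pre><code class="go">
package nodemanager

import "time"

func shutDown(nodeUUID string) {
	// Step 1: mark the node record "draining" so c-d-l takes no new work.
	setNodeState(nodeUUID, "draining")
	// Step 2: wait for c-d-l's next "idle" ping to flip the record to
	// "drained", which guarantees no container is running there.
	for nodeState(nodeUUID) != "drained" {
		time.Sleep(5 * time.Second)
	}
	// Step 3: destroy the cloud node.
	destroyCloudNode(nodeUUID)
}

func setNodeState(uuid, state string) {}                   // API call
func nodeState(uuid string) string    { return "drained" } // API call
func destroyCloudNode(uuid string)    {}                   // cloud driver call
</code></pre>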

h3. Handling failure

# If a node enters a failure state and there is a container associated with it, the container should either be unlocked (if the container is in the Locked state) or cancelled (if it is in the Running state).
# The API server should have a background process which looks for nodes that haven't pinged recently and puts them into the failed state (sketched below).
# A node can also put itself into the failed state with an API call.
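
A sketch of that background sweep, with hypothetical stand-ins for the node rows and the container API calls:

<pre><code class="go">
package api

import "time"

const pingDeadline = 5 * time.Minute // missed-ping threshold (tunable)

type nodeRow struct {
	UUID          string
	State         string
	ContainerUUID string
	LastPing      time.Time
}

// sweepFailedNodes fails nodes that have stopped pinging, then
// releases or cancels their containers per step 1 above.
func sweepFailedNodes(nodes []*nodeRow) {
	for _, n := range nodes {
		if n.State == "failed" || time.Since(n.LastPing) < pingDeadline {
			continue
		}
		n.State = "failed"
		if n.ContainerUUID == "" {
			continue
		}
		switch containerState(n.ContainerUUID) {
		case "Locked":
			unlockContainer(n.ContainerUUID) // back to Queued for another node
		case "Running":
			cancelContainer(n.ContainerUUID)
		}
	}
}

func containerState(uuid string) string { return "Locked" } // API lookup
func unlockContainer(uuid string)       {}                  // container -> Queued
func cancelContainer(uuid string)       {}                  // container -> Cancelled
</code></pre>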

h1. Other options

h2. Kubernetes

Submit containers to a Kubernetes cluster. Kubernetes handles cluster scaling and scheduling.

Advantages:

* Get rid of node manager
* Desirable as part of the overall plan to be able to run Arvados on Kubernetes

Disadvantages:

* Running crunch-run inside a container requires docker-in-docker (a privileged container) or access to the host's Docker socket, as in the pod spec sketched below.
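
For illustration, a pod spec for crunch-run built with the k8s.io/api/core/v1 types would need something like the following hostPath mount; the image name and other details are assumptions:

<pre><code class="go">
package k8sdispatch

import corev1 "k8s.io/api/core/v1"

// crunchRunPodSpec shows the Docker-socket mount crunch-run would need
// in order to start sibling containers on the host's Docker daemon.
func crunchRunPodSpec(containerUUID string) corev1.PodSpec {
	sock := corev1.HostPathVolumeSource{Path: "/var/run/docker.sock"}
	return corev1.PodSpec{
		RestartPolicy: corev1.RestartPolicyNever,
		Volumes: []corev1.Volume{{
			Name:         "docker-sock",
			VolumeSource: corev1.VolumeSource{HostPath: &sock},
		}},
		Containers: []corev1.Container{{
			Name:    "crunch-run",
			Image:   "arvados/crunch-run", // hypothetical image
			Command: []string{"crunch-run", containerUUID},
			// Mounting the host's Docker socket is what makes this
			// pod effectively privileged.
			VolumeMounts: []corev1.VolumeMount{{
				Name:      "docker-sock",
				MountPath: "/var/run/docker.sock",
			}},
		}},
	}
}
</code></pre>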

h2. Cloud provider scheduling APIs

Use cloud provider scheduling APIs, such as Azure Batch, AWS Batch, or the Google Pipelines API, to perform cluster scaling and scheduling.

This would be implemented as custom Arvados dispatcher services: crunch-dispatch-azure, crunch-dispatch-aws, crunch-dispatch-google.
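
These dispatchers could share most of their logic behind a thin abstraction over the provider APIs; a hypothetical interface:

<pre><code class="go">
package dispatchcloud

// BatchProvider abstracts the provider-specific scheduling API
// (Azure Batch, AWS Batch, Google Pipelines); crunch-dispatch-azure,
// crunch-dispatch-aws, and crunch-dispatch-google would each
// implement it.
type BatchProvider interface {
	// Submit asks the provider to run crunch-run for a container on an
	// instance type satisfying the container's resource constraints.
	Submit(containerUUID, instanceType string) (jobID string, err error)
	// Status reports the provider-side job state, for reconciling with
	// the Arvados container state.
	Status(jobID string) (state string, err error)
	// Cancel stops the provider-side job, e.g. when container priority
	// drops to zero.
	Cancel(jobID string) error
}
</code></pre>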

Advantages:

* Get rid of node manager

Disadvantages:

* Has to be implemented separately for each cloud provider.
* May be hard to customize behavior, such as job priority.

h2. Use slurm better

Most of our slurm problems are self-inflicted: we have a single partition and a single queue with heterogeneous, dynamically configured nodes. We would have fewer problems if we instead configured slurm node ranges such as "compute-small-[0-255]", "compute-medium-[0-255]", and "compute-large-[0-255]" with appropriate specs, and defined a partition for each size range, so that a job waiting for one node size does not hold up jobs that want a different node size. A possible configuration is sketched below.
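
For example, a slurm.conf fragment along these lines; the hostname ranges come from the paragraph above, while the CPU/memory specs are placeholders (a sketch, not a tested configuration):

<pre>
# One hostname range and one partition per node size.
NodeName=compute-small-[0-255]  CPUs=2  RealMemory=7680   State=CLOUD
NodeName=compute-medium-[0-255] CPUs=8  RealMemory=30720  State=CLOUD
NodeName=compute-large-[0-255]  CPUs=32 RealMemory=61440  State=CLOUD

PartitionName=small  Nodes=compute-small-[0-255]
PartitionName=medium Nodes=compute-medium-[0-255]
PartitionName=large  Nodes=compute-large-[0-255]
</pre>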

Advantages:

* Least overall change compared to the current architecture

Disadvantages:

* Requires coordinated changes to the API server, node manager, crunch-dispatch-slurm, and cluster configuration
* Ops suspects that defining (sizes * max nodes) hostnames might be a problem
* Can't adjust node configurations without restarting the whole cluster