h1. Fixing cloud scheduling

Our current approach to scheduling containers on the cloud using slurm has a number of problems:

* Head-of-line problem: with a single queue, slurm only schedules the job at the top of the queue; if that job cannot be scheduled, every other job has to wait.  This leaves nodes sitting idle and reduces throughput.
* Queue ordering doesn't reflect our desired priority order without a lot of hacking around with "niceness".
* The slurm queue forgets dynamic configuration, so constantly running maintenance processes are needed to reset it.

Things that slurm currently provides:

* allocating containers to specific nodes
* reporting node state: idle, busy, failed, down, or out of contact

Some solutions:

h2. Crunch-dispatch-local

(currently the preferred option)

Node manager spins up nodes based on the container queue.  Compute nodes run crunch-dispatch-local or a similar service, which asks the API server for work and then runs it.  Possibly node manager directly decides which containers should go onto which nodes.

Advantages:

* Complete control over scheduling decisions / priority

Disadvantages:

* Additional load on the API server (but probably not that much)
* Need a new scheme for nodes to report their status so that node manager knows whether they are busy or idle.  Node manager has to be able to put nodes in the equivalent of a "draining" state to ensure they don't get shut down while doing work.  (We can use the "nodes" table for this.)
* Need to be able to detect node failure.

Proposed design:

h3. Starting up

# Node manager looks at pending containers to get a "wishlist"
# Nodes spin up the way they do now.  However, instead of registering with slurm, they start crunch-dispatch-local.
# The node ping token should have a corresponding API token, to be used by the dispatcher to talk to the API server
# C-d-l pings the API server to indicate it is "idle" (see the sketch after this list)
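
A minimal sketch of the idle ping in Go (the language of our other dispatch services), assuming a hypothetical @nodes/ping@ endpoint that accepts a @state@ field; the real endpoint shape and token plumbing are still to be decided:

<pre><code class="go">
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
	"os"
	"strings"
)

// pingState reports this node's state ("idle", "busy", ...) to the API
// server.  The endpoint path and "state" field are assumptions made for
// this sketch, not a settled API.
func pingState(apiHost, apiToken, nodeUUID, state string) error {
	form := url.Values{"state": {state}}
	req, err := http.NewRequest("POST",
		"https://"+apiHost+"/arvados/v1/nodes/"+nodeUUID+"/ping",
		strings.NewReader(form.Encode()))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "OAuth2 "+apiToken)
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("ping returned %s", resp.Status)
	}
	return nil
}

func main() {
	// The per-node API token would be issued alongside the ping token.
	err := pingState(os.Getenv("ARVADOS_API_HOST"),
		os.Getenv("ARVADOS_API_TOKEN"),
		os.Getenv("NODE_UUID"), "idle")
	if err != nil {
		log.Fatal(err)
	}
}
</code></pre>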

h3. Running containers

# C-d-l finds a container appropriately sized for the node and locks it (the full loop is sketched after this list)
## Could use the existing list / lock APIs
## Alternatively, to reduce contention, could add an "I am idle, give me work" API which locks and returns the next container appropriate for the node, or marks the node as "idle" if none is available
# The node record records which container it is supposed to be running (can be part of the "Lock" call, based on the per-node API token)
# C-d-l makes an API call to the nodes table to say it is "busy"
# C-d-l calls crunch-run to run the container
# C-d-l must continue to ping that it is "busy" every X seconds
# When the container finishes, c-d-l pings that it is "idle"
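
A sketch of the whole c-d-l loop under the same assumptions; @lockNextContainer@ stands in for whichever locking API we pick, and the intervals are placeholders:

<pre><code class="go">
package main

import (
	"log"
	"os/exec"
	"time"
)

// lockNextContainer asks the API server for the next container that fits
// this node, locking it in the same call ("I am idle, give me work").
// Hypothetical: it could equally be the existing list + lock APIs.
func lockNextContainer() (uuid string, ok bool) {
	// ... call the API server ...
	return "", false
}

// setNodeState pings the nodes table, as sketched in "Starting up".
func setNodeState(state string) {}

func main() {
	for {
		uuid, ok := lockNextContainer()
		if !ok {
			setNodeState("idle")
			time.Sleep(10 * time.Second) // placeholder poll interval
			continue
		}
		setNodeState("busy")

		// Keep pinging "busy" every X seconds while the container
		// runs, so the API server doesn't mark this node failed.
		stop := make(chan struct{})
		go func() {
			ticker := time.NewTicker(30 * time.Second)
			defer ticker.Stop()
			for {
				select {
				case <-ticker.C:
					setNodeState("busy")
				case <-stop:
					return
				}
			}
		}()

		// crunch-run does the real work of running the container.
		if err := exec.Command("crunch-run", uuid).Run(); err != nil {
			log.Printf("container %s: %v", uuid, err)
		}

		close(stop)
		setNodeState("idle")
	}
}
</code></pre>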

h3. Shutting down

# When node manager decides a node is ready for shutdown, it makes an API call on the node record to indicate "draining".
# C-d-l pings "I am idle" on a "draining" record.  This puts the record in the "drained" state, and c-d-l does not get any new work (the transition is sketched after this list).
# Node manager sees the node is "drained" and can proceed with shutdown.
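
The key property is that an "idle" ping on a "draining" record lands the node in "drained" instead of making it eligible for new work.  A sketch of the transition the API server might apply (state names as above; the function itself is illustrative, not an existing API):

<pre><code class="go">
package scheduler

// nextState computes the node state the API server stores when a ping
// arrives.
func nextState(current, reported string) string {
	if current == "draining" && reported == "idle" {
		// Node manager asked to drain and the node has no work:
		// mark it "drained" so it gets no new containers and node
		// manager can safely shut it down.
		return "drained"
	}
	if current == "draining" || current == "drained" {
		// Stay on the drain path even while the node finishes work.
		return current
	}
	// Normal operation: store whatever the node reported.
	return reported
}
</code></pre>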

h3. Handling failure

# If a node enters a failure state and there is a container associated with it, make sure to cancel the container.
# The API server should have a background process which looks for nodes that haven't pinged recently and puts them into the failed state (see the sketch after this list).
# A node can also put itself into the failed state with an API call.
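
A sketch of that background sweep; the @Node@ fields and the deadline are placeholders for whatever the nodes table ends up storing:

<pre><code class="go">
package scheduler

import "time"

// Node is a minimal stand-in for a row in the nodes table.
type Node struct {
	UUID          string
	State         string
	LastPing      time.Time
	ContainerUUID string // container this node is supposed to be running
}

// sweepFailedNodes marks nodes that have missed their ping deadline as
// failed, and cancels any container associated with a failed node so it
// can be rescheduled.  Meant to run periodically on the API server.
func sweepFailedNodes(nodes []*Node, deadline time.Duration, cancel func(containerUUID string)) {
	for _, n := range nodes {
		if n.State == "failed" || time.Since(n.LastPing) < deadline {
			continue
		}
		n.State = "failed"
		if n.ContainerUUID != "" {
			cancel(n.ContainerUUID)
		}
	}
}
</code></pre>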

h1. Other options

h2. Use slurm better

Most of our slurm problems are self-inflicted.  We have a single partition and a single queue with heterogeneous, dynamically configured nodes.  We would have fewer problems if we statically configured slurm node ranges such as "compute-small-[0-255]", "compute-medium-[0-255]", and "compute-large-[0-255]" with appropriate specs, and defined a partition for each size range, so that a job waiting for one node size does not hold up jobs that want a different node size.
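
The static configuration might look something like the following sketch; node counts, hardware specs, and the @State=CLOUD@ setting are illustrative placeholders, not a worked-out ops plan:

<pre>
# slurm.conf sketch -- sizes and counts are placeholders
NodeName=compute-small-[0-255]  CPUs=2  RealMemory=3840  State=CLOUD
NodeName=compute-medium-[0-255] CPUs=8  RealMemory=30720 State=CLOUD
NodeName=compute-large-[0-255]  CPUs=16 RealMemory=61440 State=CLOUD

# One partition per size, so a pending job for one size cannot block
# jobs that want a different size.
PartitionName=small  Nodes=compute-small-[0-255]  Default=YES
PartitionName=medium Nodes=compute-medium-[0-255]
PartitionName=large  Nodes=compute-large-[0-255]
</pre>

With partitions split by size, slurm's head-of-line blocking is confined to jobs that actually want the same node size.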

Advantages:

* Least overall change compared to the current architecture

Disadvantages:

* Requires coordinated changes to the API server, node manager, crunch-dispatch-slurm, and cluster configuration
* Ops suspects that defining (sizes * max nodes) hostnames might be a problem
* Can't adjust node configurations without restarting the whole cluster

h2. Cloud provider scheduling APIs

Use cloud provider scheduling APIs such as Azure Batch, AWS Batch, or the Google Pipelines API to perform cluster scaling and scheduling.

This would be implemented as custom Arvados dispatcher services: crunch-dispatch-azure, crunch-dispatch-aws, crunch-dispatch-google.

Advantages:

* Get rid of Node Manager

Disadvantages:

* Has to be implemented per cloud provider.
* May be hard to customize behavior, such as job priority.

h2. Kubernetes

Submit containers to a Kubernetes cluster.  Kubernetes handles cluster scaling and scheduling.

Advantages:

* Get rid of node manager
* Desirable as part of the overall plan to be able to run Arvados on Kubernetes

Disadvantages:

* Running crunch-run inside a container requires docker-in-docker (a privileged container) or access to the Docker socket.