Fixing cloud scheduling » History » Version 2

Peter Amstutz, 07/25/2018 05:06 PM

h1. Fixing cloud scheduling

Our current approach to scheduling containers on the cloud using SLURM has a number of problems:

* Head-of-line blocking: with a single queue, slurm only tries to schedule the job at the top of the queue. If that job cannot be scheduled, every other job has to wait. This leaves nodes idle and reduces throughput.
* Queue ordering doesn't reflect our desired priority order without a lot of hacking around with "niceness".
* Slurm forgets dynamic configuration, so we need constantly running maintenance processes to reset it.
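
The head-of-line problem can be illustrated with a toy model. This is a minimal sketch, not a model of slurm's real scheduler: the job names, the size labels, and both function names are made up for illustration, and each job is assumed to need exactly one node of a given size.

```python
from collections import deque


def schedule_single_queue(jobs, idle):
    """Strict FIFO over one queue: stop at the first job that cannot be placed."""
    idle = dict(idle)  # copy so the caller's counts are not modified
    started = []
    queue = deque(jobs)
    while queue:
        name, size = queue[0]
        if idle.get(size, 0) == 0:
            break  # the head of the line is stuck, so everything behind it waits
        idle[size] -= 1
        started.append(name)
        queue.popleft()
    return started


def schedule_per_size_queues(jobs, idle):
    """One FIFO queue per node size: a stuck job only blocks jobs of its own size."""
    idle = dict(idle)
    started = []
    for name, size in jobs:
        if idle.get(size, 0) > 0:
            idle[size] -= 1
            started.append(name)
    return started


# Hypothetical workload: one job wants a large node, but only small nodes are idle.
jobs = [("a", "large"), ("b", "small"), ("c", "small")]
idle = {"small": 2, "large": 0}

print(schedule_single_queue(jobs, idle))     # []
print(schedule_per_size_queues(jobs, idle))  # ['b', 'c']
```

With a single queue, job "a" blocks "b" and "c" even though two small nodes sit idle. With one queue per size, both small jobs start.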

Things that slurm currently provides:

* Allocating containers to specific nodes
* Reporting node state: idle, busy, failed, down, or out of contact

Some solutions:

h2. Use slurm better

Most of our slurm problems are self-inflicted.  We have a single partition and a single queue with heterogeneous, dynamically configured nodes.  We would have fewer problems if we configured slurm with fixed node ranges such as "compute-small-[0-255]", "compute-medium-[0-255]", and "compute-large-[0-255]", each with appropriate specs, and defined a partition for each size range, so that a job waiting for one node size does not hold up jobs that want a different node size.
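
As a rough sketch of what that configuration could look like, the snippet below generates @NodeName@ and @PartitionName@ lines in the usual slurm.conf style. The CPU and memory figures, the partition names, and the @slurm_conf_lines@ helper are placeholders invented for illustration, not our real node specs or tooling.

```python
# Hypothetical node sizes; the CPU and memory numbers are placeholders.
SIZES = {
    "small": {"CPUs": 2, "RealMemory": 4096},
    "medium": {"CPUs": 8, "RealMemory": 16384},
    "large": {"CPUs": 32, "RealMemory": 65536},
}
MAX_NODES = 256  # nodes per size, giving ranges like compute-small-[0-255]


def slurm_conf_lines(sizes, max_nodes):
    """Return one NodeName line and one PartitionName line per node size."""
    lines = []
    for size, spec in sizes.items():
        nodes = f"compute-{size}-[0-{max_nodes - 1}]"
        lines.append(
            f"NodeName={nodes} CPUs={spec['CPUs']} "
            f"RealMemory={spec['RealMemory']} State=CLOUD"
        )
        # One partition per size, so each size effectively gets its own queue.
        lines.append(f"PartitionName={size} Nodes={nodes} State=UP")
    return lines


for line in slurm_conf_lines(SIZES, MAX_NODES):
    print(line)

# Total hostnames that must be defined: sizes * max nodes.
print(len(SIZES) * MAX_NODES)  # 768
```

A job would then be submitted to the partition matching the node size it needs, which is the part that requires changes to the dispatcher.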

Advantages:

* Least overall change compared to current architecture

Disadvantages:

* Requires coordinated changes to the API server, node manager, crunch-dispatch-slurm, and cluster configuration
* Ops seems to think that defining (sizes * max nodes) hostnames might be a problem
* Can't adjust node configurations without restarting the whole cluster

h2. Cloud provider scheduling APIs

Use cloud provider scheduling APIs such as Azure Batch, AWS Batch, or the Google Pipelines API to perform cluster scaling and scheduling.

This would be implemented as custom Arvados dispatcher services: crunch-dispatch-azure, crunch-dispatch-aws, crunch-dispatch-google.

Advantages:

* Get rid of node manager

Disadvantages:

* Has to be implemented per cloud provider.
* May be hard to customize behavior, such as job priority.

h2. Kubernetes

Submit containers to a Kubernetes cluster.  Kubernetes handles cluster scaling and scheduling.

Advantages:

* Get rid of node manager
* Desirable as part of the overall plan to be able to run Arvados on Kubernetes

Disadvantages:

* Running crunch-run inside a container requires docker-in-docker (privileged container) or access to the Docker socket.

h2. Crunch-dispatch-local

Node manager spins up nodes based on the container queue.  Compute nodes run crunch-dispatch-local or a similar service, which asks the API server for work and then runs it.  Alternatively, node manager could decide directly which jobs go onto which nodes.

Advantages:

* Complete control over scheduling decisions / priority

Disadvantages:

* Requesting work puts additional load on the API server (may not be any worse than live logging, though)
* Need a new scheme for nodes to report their status so that node manager knows whether they are busy or idle.  Node manager has to be able to put nodes in the equivalent of a "draining" state to ensure they don't get shut down while doing work.  (We can use the "nodes" table for this.)
* Need to be able to detect node failure.
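
The status reporting, draining, and failure detection requirements above could be sketched as follows. This is only an in-memory illustration under assumptions: the @NodeTable@ class, its method names, the hostnames, and the 60 second timeout are all hypothetical, and a real version would presumably keep this state in the "nodes" table rather than in a Python dict. "Draining" here means the node should take no new work and may be shut down once it is no longer busy.

```python
from dataclasses import dataclass


@dataclass
class NodeRecord:
    busy: bool = False       # last state reported by the node itself
    draining: bool = False   # set by node manager; node should take no new work
    last_ping: float = 0.0   # time of the most recent status report


class NodeTable:
    """Hypothetical in-memory stand-in for the "nodes" table."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds of silence before a node counts as failed
        self.nodes = {}

    def report(self, hostname, busy, now):
        """Called when a compute node reports its status."""
        rec = self.nodes.setdefault(hostname, NodeRecord())
        rec.busy = busy
        rec.last_ping = now

    def drain(self, hostname):
        """Mark a node as draining before shutting it down."""
        self.nodes[hostname].draining = True

    def may_accept_work(self, hostname):
        return not self.nodes[hostname].draining

    def can_shut_down(self, hostname):
        """Only shut down nodes that are draining and have finished their work."""
        rec = self.nodes[hostname]
        return rec.draining and not rec.busy

    def failed(self, now):
        """Nodes that have not reported within the timeout."""
        return sorted(
            h for h, rec in self.nodes.items() if now - rec.last_ping > self.timeout
        )


table = NodeTable(timeout=60)
table.report("node-a", busy=True, now=0)
table.report("node-b", busy=False, now=0)
table.drain("node-a")
print(table.can_shut_down("node-a"))   # False: still running a container
table.report("node-a", busy=False, now=30)
print(table.can_shut_down("node-a"))   # True
print(table.failed(now=80))            # ['node-b']
```

Passing @now@ explicitly keeps the sketch deterministic. A real service would use the server's clock when a report arrives.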