Brett Smith, 10/03/2014 09:38 PM

h1. Hacking Node Manager

h2. Important dependencies

h3. libcloud

"Apache Libcloud":https://libcloud.readthedocs.org/en/latest/ gives us a consistent interface to manage compute nodes across different cloud providers.
h3. Pykka

The Node Manager uses "Pykka":http://www.pykka.org/en/latest/ to easily set up lots of small workers in a multithreaded environment. You'll probably want to read the introduction in Pykka's documentation before you get started, because the Node Manager makes heavy use of Pykka's proxies.
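
To make the proxy idea concrete without depending on Pykka itself, here is a minimal standard-library sketch of the concept: a proxy turns a method call into a queued message and immediately returns a future. This is not Pykka's implementation. @TinyActor@, @TinyProxy@, and @Counter@ are hypothetical names, and Pykka's own futures are read with @get()@ rather than @result()@.

```python
import queue
import threading
from concurrent.futures import Future


class TinyActor:
    """Hypothetical stand-in for an actor: one thread, one inbox."""

    def __init__(self):
        self._inbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Messages are handled one at a time, so actor state needs no locks.
        while True:
            message = self._inbox.get()
            if message is None:  # sentinel: stop the actor
                break
            method_name, args, kwargs, future = message
            try:
                future.set_result(getattr(self, method_name)(*args, **kwargs))
            except Exception as error:
                future.set_exception(error)

    def proxy(self):
        return TinyProxy(self._inbox)

    def stop(self):
        self._inbox.put(None)
        self._thread.join()


class TinyProxy:
    """Turns method calls into queued messages and returns a Future."""

    def __init__(self, inbox):
        self._inbox = inbox

    def __getattr__(self, method_name):
        def send(*args, **kwargs):
            future = Future()
            self._inbox.put((method_name, args, kwargs, future))
            return future
        return send


class Counter(TinyActor):
    def __init__(self):
        self.count = 0  # set state before the actor thread starts
        super().__init__()

    def add(self, amount):
        self.count += amount
        return self.count


counter = Counter()
proxy = counter.proxy()
proxy.add(2)                   # fire and forget: the Future is ignored
total = proxy.add(3).result()  # block only when the answer is needed
counter.stop()
print(total)
```

Because the inbox is processed in order, the second call sees the effect of the first even though the caller never waited on it.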
h2. Overview - Subscriptions

Most actors in the Node Manager only need to tell other actors about one kind of event:

* ArvadosNodeListMonitorActor: updated information about Arvados Node objects
* ComputeNodeListMonitorActor: updated information about compute nodes running in the cloud
* JobQueueMonitorActor: updated information about the number and sizes of compute nodes that would best satisfy the job queue
* ComputeNodeSetupActor: compute node setup is finished
* ComputeNodeShutdownActor: compute node is successfully shut down
* ComputeNodeActor: compute node is eligible for shutdown
These communications happen through subscriptions.  Each actor has a @subscribe@ method that takes an arbitrary callable object, usually a proxy method.  Those callables are called with new information whenever there's a state change.
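
As a rough illustration of the @subscribe@ pattern, the sketch below uses a plain Python class instead of a real actor. @Publisher@ and @publish@ are hypothetical names chosen for this example and are not taken from the Node Manager source.

```python
class Publisher:
    """Hypothetical stand-in for an actor that offers subscribe()."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        # Any callable works; in the Node Manager it is usually a proxy method.
        self._subscribers.append(callback)

    def publish(self, *args):
        # Call every subscriber with the new information.
        for callback in self._subscribers:
            callback(*args)


received = []
monitor = Publisher()
monitor.subscribe(received.append)
monitor.publish(["node-a", "node-b"])
monitor.publish(["node-a"])
print(received)
```

The publisher knows nothing about its subscribers beyond the fact that they are callable, which is what lets a proxy method, a bound method, or a test mock be used interchangeably.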
List monitor actors also have a @subscribe_to@ method. It registers a callable that is called on every update with information about one specific object in the response (e.g., every update about an Arvados node with a specific UUID).
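
One way to picture @subscribe_to@ is as a second table of subscribers indexed by object key. The sketch below assumes that shape. @ListMonitor@, @key_func@, @send_update@, and the sample UUID strings are hypothetical and only serve the illustration.

```python
from collections import defaultdict


class ListMonitor:
    """Hypothetical list monitor with whole-list and per-object subscribers."""

    def __init__(self, key_func):
        self._key_func = key_func
        self._all_subscribers = []
        self._key_subscribers = defaultdict(list)

    def subscribe(self, callback):
        self._all_subscribers.append(callback)

    def subscribe_to(self, key, callback):
        self._key_subscribers[key].append(callback)

    def send_update(self, items):
        # Whole-list subscribers get the full response.
        for callback in self._all_subscribers:
            callback(items)
        # Per-object subscribers get only the item that matches their key.
        for item in items:
            for callback in self._key_subscribers.get(self._key_func(item), []):
                callback(item)


monitor = ListMonitor(key_func=lambda node: node["uuid"])
everything, just_one = [], []
monitor.subscribe(everything.append)
monitor.subscribe_to("uuid-2", just_one.append)
monitor.send_update([
    {"uuid": "uuid-1", "hostname": "compute1"},
    {"uuid": "uuid-2", "hostname": "compute2"},
])
print(just_one)
```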
Thanks to this pattern, it's rare for our code to directly use the Future objects that are returned from proxy methods. Instead, the different actors send messages to each other about interesting state changes. The 30,000-foot overview of the program is:

* Start the list monitor actors.
* Start the NodeManagerDaemonActor. It subscribes to those monitors.
* The daemon creates different compute node actors to manage different stages of a node's lifecycle, and subscribes to their updates as well.
* When the daemon creates a ComputeNodeActor, it subscribes that new actor to updates from the list monitors about the underlying cloud and Arvados data.
See @launcher.py@, and the @update_cloud_nodes@ and @update_arvados_nodes@ methods in @daemon.py@.
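
The wiring described in this overview can be sketched with ordinary objects. This is not the code in @daemon.py@: @Monitor@, @NodeActor@, @Daemon@, and the node dictionaries are hypothetical, and only the cloud side is shown. The real program uses Pykka actors and proxy methods where this sketch uses direct calls.

```python
class Monitor:
    """Hypothetical list monitor offering subscribe() and subscribe_to()."""

    def __init__(self):
        self.subscribers = []
        self.key_subscribers = {}

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def subscribe_to(self, key, callback):
        self.key_subscribers.setdefault(key, []).append(callback)

    def send_update(self, nodes):
        for callback in self.subscribers:
            callback(nodes)
        for node in nodes:
            for callback in self.key_subscribers.get(node["id"], []):
                callback(node)


class NodeActor:
    """Hypothetical per-node actor that tracks the latest cloud record."""

    def __init__(self, node):
        self.node = node

    def update_cloud_node(self, node):
        self.node = node


class Daemon:
    def __init__(self, cloud_monitor):
        self.cloud_monitor = cloud_monitor
        self.node_actors = {}
        # Step 2 of the overview: the daemon subscribes to the monitor.
        cloud_monitor.subscribe(self.update_cloud_nodes)

    def update_cloud_nodes(self, nodes):
        for node in nodes:
            if node["id"] not in self.node_actors:
                actor = NodeActor(node)
                self.node_actors[node["id"]] = actor
                # Step 4: subscribe the new actor to updates about its node.
                self.cloud_monitor.subscribe_to(node["id"], actor.update_cloud_node)


cloud_monitor = Monitor()                      # step 1: start the monitor
daemon = Daemon(cloud_monitor)
cloud_monitor.send_update([{"id": "n1", "state": "booting"}])
cloud_monitor.send_update([{"id": "n1", "state": "running"}])
final_state = daemon.node_actors["n1"].node["state"]
print(final_state)
```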
h3. Test Mocks

The subscription pattern also simplifies testing with mocks. Each test starts at most one actor. We send messages to that actor with mock data, and then check the results through a mock subscriber. As long as you can commit to particular message semantics, this makes it possible to write well-isolated, fast tests. @testutil.py@ provides rich mocks for different kinds of objects, as well as a mixin class to help test actors.
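
A test in this style might look roughly like the following. It uses only the standard library's @unittest.mock@ and does not use the helpers in @testutil.py@. @FakeMonitor@ and @send_update@ are hypothetical stand-ins for the single actor under test.

```python
import unittest
from unittest import mock


class FakeMonitor:
    """Hypothetical stand-in for the one actor a test would start."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def send_update(self, wishlist):
        for callback in self._subscribers:
            callback(wishlist)


class SubscriberTest(unittest.TestCase):
    def test_subscriber_gets_update(self):
        monitor = FakeMonitor()
        subscriber = mock.Mock(name="subscriber")
        monitor.subscribe(subscriber)
        # Send mock data in, then check what the mock subscriber saw.
        monitor.send_update(["small", "large"])
        subscriber.assert_called_once_with(["small", "large"])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SubscriberTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The test never inspects the monitor's internals. It only checks the message that reached the subscriber, so the test stays valid as long as the message semantics do.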
h2. Driver wrappers

When we start a compute node, we need to seed it with information from the associated Arvados node object. The mechanism for passing that information differs for each cloud provider. To accommodate this, there are driver classes under @arvnodeman.computenode@ that handle the translation. They also proxy public methods from the "real" libcloud driver, so except for the @create_node@ method, you can usually use libcloud's standard interfaces on our custom drivers.
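
The wrapper idea can be sketched as a class that overrides @create_node@ and forwards every other public attribute to the wrapped driver. This is an illustration only, not the code under @arvnodeman.computenode@. @FakeCloudDriver@, @DriverWrapper@, @translate_arvados_node@, and the @metadata@ keyword are hypothetical.

```python
class FakeCloudDriver:
    """Hypothetical stand-in for a real libcloud driver."""

    def list_nodes(self):
        return ["node-1"]

    def create_node(self, name, size, **kwargs):
        return {"name": name, "size": size, "extra": kwargs}


class DriverWrapper:
    def __init__(self, real_driver):
        self.real = real_driver

    def __getattr__(self, name):
        # Only reached when normal lookup fails, so methods defined on the
        # wrapper win and all other public names go to the real driver.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self.real, name)

    def translate_arvados_node(self, arvados_node):
        # Provider-specific translation would live here (assumed shape).
        return {"metadata": {"arvados_uuid": arvados_node["uuid"]}}

    def create_node(self, size, arvados_node):
        # The one method with a different signature from the real driver.
        return self.real.create_node(
            name=arvados_node["hostname"],
            size=size,
            **self.translate_arvados_node(arvados_node))


driver = DriverWrapper(FakeCloudDriver())
listed = driver.list_nodes()  # proxied to the real driver unchanged
created = driver.create_node("small", {"uuid": "uuid-1", "hostname": "compute1"})
print(listed, created["name"])
```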
h2. Configuration

@doc/ec2.example.cfg@ has lots of comments describing what parameters are available and how they behave. Bear in mind that settings in the Cloud and Size sections are specific to the provider named in the Daemon section.
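
To show how the sections relate, here is a small sketch that parses an INI-style fragment with Python's @configparser@. Only the section names Daemon, Cloud, and Size come from the text above. Every key and value, and the @Size small@ section title, are made up for illustration. @doc/ec2.example.cfg@ remains the reference for the real parameters.

```python
import configparser

# Hypothetical fragment: the keys and values below are placeholders.
EXAMPLE_CONFIG = """
[Daemon]
provider = ec2

[Cloud]
region = example-region

[Size small]
max_nodes = 4
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_CONFIG)

provider = config.get("Daemon", "provider")
# Cloud and Size settings are interpreted in terms of that provider.
size_sections = [name for name in config.sections() if name.startswith("Size")]
max_nodes = config.getint("Size small", "max_nodes")
print(provider, size_sections, max_nodes)
```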
@doc/local.example.cfg@ lets you run a development node manager, backed by libcloud's dummy driver and your development Arvados API server.  Refer to the instructions at the top of that file.