Node manager policy matrix » History » Version 1

Peter Amstutz, 04/14/2016 06:12 PM

h1. Node manager policy matrix

arvados node state (last_ping_at, crunch_worker_state):
* (no arvados record associated with cloud node) -> unpaired
* last_ping_at is stale -> down
* slurm state is idle -> idle
* slurm state is drng, alloc, maint -> busy
* slurm state is drain, down, error, fail, unknown, any* -> down
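The mapping above can be sketched as a small function. This is only an illustration, not node manager's actual code: the function name, the dict-shaped @arvados_node@ with a datetime @last_ping_at@, and the @stale_after@ threshold are all assumptions, as is the choice to evaluate the rules in the order listed and to treat a missing ping as stale.

```python
from datetime import datetime, timedelta, timezone

# Slurm states that the list above maps to "busy".
BUSY_SLURM_STATES = {"drng", "alloc", "maint"}


def crunch_worker_state(arvados_node, slurm_state, now, stale_after):
    """Classify a cloud node as 'unpaired', 'down', 'idle' or 'busy'.

    Assumptions (hypothetical, for illustration only):
    - arvados_node is None when no Arvados record is paired with the cloud
      node, otherwise a dict holding a datetime under 'last_ping_at';
    - stale_after is a timedelta after which a ping counts as stale;
    - rules are checked in the order they are listed.
    """
    if arvados_node is None:
        return "unpaired"
    last_ping_at = arvados_node.get("last_ping_at")
    if last_ping_at is None or now - last_ping_at > stale_after:
        return "down"
    if slurm_state == "idle":
        return "idle"
    if slurm_state in BUSY_SLURM_STATES:
        return "busy"
    # drain, down, error, fail, unknown, and any state with a trailing "*"
    # (or anything else not listed) fall through to "down".
    return "down"
```

Falling through to "down" for unlisted states keeps the function total, so a new or unexpected slurm state never leaves the node unclassified.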

billing window:
* open
* closed

boot_grace (time since boot):
* under boot expiry -> boot wait
* exceeded boot expiry -> boot exceeded

idle_grace (time since last state change to "idle"):
* arvados node state is not 'idle' -> not idle
* idle and not exceeded grace period -> idle wait
* idle and exceeded grace period -> idle exceeded
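The two grace-period dimensions could be computed along these lines. This is a hedged sketch: the function and parameter names (@booted_at@, @idle_since@, @boot_expiry@, @idle_grace@) are hypothetical, and how to treat a node that sits exactly on a threshold is an assumption, since the lists above do not say.

```python
from datetime import datetime, timedelta


def boot_grace_state(booted_at, now, boot_expiry):
    """Return 'boot wait' while the node is under the boot expiry.

    Assumption: a node exactly at the expiry counts as 'boot exceeded'.
    """
    if now - booted_at < boot_expiry:
        return "boot wait"
    return "boot exceeded"


def idle_grace_state(node_state, idle_since, now, idle_grace):
    """Return 'not idle', 'idle wait' or 'idle exceeded'.

    node_state is the arvados node state ('unpaired', 'busy', 'idle',
    'down'); idle_since is the time of the last state change to 'idle'
    and is ignored unless the node is idle.
    Assumption: a node exactly at the grace period is still 'idle wait'.
    """
    if node_state != "idle":
        return "not idle"
    if now - idle_since <= idle_grace:
        return "idle wait"
    return "idle exceeded"
```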

Node manager will construct a state tuple (crunch_worker_state, window, boot_grace, idle_grace) and then consult the following table to determine what action to take. Actions are:

* None (do nothing)
* START_DRAIN (put the node into slurm draining state)
* START_SHUTDOWN (initiate cloud shutdown)

<pre>
crunch_worker_state = ['unpaired', 'busy', 'idle', 'down']
window = ["open", "closed"]
boot_grace = ["boot wait", "boot exceeded"]
idle_grace = ["not idle", "idle wait", "idle exceeded"]

{('busy', 'closed', 'boot exceeded', 'idle exceeded'): None,
 ('busy', 'closed', 'boot exceeded', 'idle wait'): None,
 ('busy', 'closed', 'boot exceeded', 'not idle'): None,
 ('busy', 'closed', 'boot wait', 'idle exceeded'): None,
 ('busy', 'closed', 'boot wait', 'idle wait'): None,
 ('busy', 'closed', 'boot wait', 'not idle'): None,
 ('busy', 'open', 'boot exceeded', 'idle exceeded'): None,
 ('busy', 'open', 'boot exceeded', 'idle wait'): None,
 ('busy', 'open', 'boot exceeded', 'not idle'): None,
 ('busy', 'open', 'boot wait', 'idle exceeded'): None,
 ('busy', 'open', 'boot wait', 'idle wait'): None,
 ('busy', 'open', 'boot wait', 'not idle'): None,

 ('down', 'closed', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
 ('down', 'closed', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
 ('down', 'closed', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
 ('down', 'closed', 'boot wait', 'idle exceeded'): "START_SHUTDOWN",
 ('down', 'closed', 'boot wait', 'idle wait'): "START_SHUTDOWN",
 ('down', 'closed', 'boot wait', 'not idle'): "START_SHUTDOWN",
 ('down', 'open', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
 ('down', 'open', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
 ('down', 'open', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
 ('down', 'open', 'boot wait', 'idle exceeded'): "START_SHUTDOWN",
 ('down', 'open', 'boot wait', 'idle wait'): "START_SHUTDOWN",
 ('down', 'open', 'boot wait', 'not idle'): "START_SHUTDOWN",

 ('idle', 'closed', 'boot exceeded', 'idle exceeded'): None,
 ('idle', 'closed', 'boot exceeded', 'idle wait'): None,
 ('idle', 'closed', 'boot exceeded', 'not idle'): None,
 ('idle', 'closed', 'boot wait', 'idle exceeded'): None,
 ('idle', 'closed', 'boot wait', 'idle wait'): None,
 ('idle', 'closed', 'boot wait', 'not idle'): None,
 ('idle', 'open', 'boot exceeded', 'idle exceeded'): "START_DRAIN",
 ('idle', 'open', 'boot exceeded', 'idle wait'): None,
 ('idle', 'open', 'boot exceeded', 'not idle'): None,
 ('idle', 'open', 'boot wait', 'idle exceeded'): "START_DRAIN",
 ('idle', 'open', 'boot wait', 'idle wait'): None,
 ('idle', 'open', 'boot wait', 'not idle'): None,

 ('unpaired', 'closed', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
 ('unpaired', 'closed', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
 ('unpaired', 'closed', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
 ('unpaired', 'closed', 'boot wait', 'idle exceeded'): None,
 ('unpaired', 'closed', 'boot wait', 'idle wait'): None,
 ('unpaired', 'closed', 'boot wait', 'not idle'): None,
 ('unpaired', 'open', 'boot exceeded', 'idle exceeded'): "START_SHUTDOWN",
 ('unpaired', 'open', 'boot exceeded', 'idle wait'): "START_SHUTDOWN",
 ('unpaired', 'open', 'boot exceeded', 'not idle'): "START_SHUTDOWN",
 ('unpaired', 'open', 'boot wait', 'idle exceeded'): None,
 ('unpaired', 'open', 'boot wait', 'idle wait'): None,
 ('unpaired', 'open', 'boot wait', 'not idle'): None}
</pre>
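Reading the 48 rows above, they appear to reduce to four rules: busy nodes are left alone, down nodes are always shut down, unpaired nodes are shut down once boot expiry is exceeded, and idle nodes are drained only when the billing window is open and the idle grace period is exceeded. The sketch below regenerates the table from those rules and looks up an action for a state tuple. The names @action_for@, @POLICY@ and @decide@ are hypothetical, and this is one possible encoding rather than how node manager implements it.

```python
from itertools import product

CRUNCH_WORKER_STATES = ["unpaired", "busy", "idle", "down"]
WINDOWS = ["open", "closed"]
BOOT_GRACE = ["boot wait", "boot exceeded"]
IDLE_GRACE = ["not idle", "idle wait", "idle exceeded"]


def action_for(state, window, boot_grace, idle_grace):
    """Rule form of the table: returns None, 'START_DRAIN' or 'START_SHUTDOWN'."""
    if state == "down":
        return "START_SHUTDOWN"
    if state == "unpaired":
        # Give a booting node until boot expiry to pair with an Arvados record.
        return "START_SHUTDOWN" if boot_grace == "boot exceeded" else None
    if state == "idle" and window == "open" and idle_grace == "idle exceeded":
        return "START_DRAIN"
    # Busy nodes, and idle nodes that are still waiting or whose window is closed.
    return None


# Expand the rules into the full lookup table keyed by state tuple.
POLICY = {
    key: action_for(*key)
    for key in product(CRUNCH_WORKER_STATES, WINDOWS, BOOT_GRACE, IDLE_GRACE)
}


def decide(state, window, boot_grace, idle_grace):
    """Look up the action for one state tuple; raises KeyError on unknown values."""
    return POLICY[(state, window, boot_grace, idle_grace)]
```

For example, @decide("idle", "open", "boot exceeded", "idle exceeded")@ gives @"START_DRAIN"@, matching the corresponding row of the table. Keeping the explicit dictionary means an unexpected state value fails loudly with a @KeyError@ instead of silently defaulting to no action.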

Note on libcloud node states:
* error, unknown -> broken
* everything else -> ok

However, we don't use the libcloud node state: it is expensive to fetch on some clouds and not as useful as knowing whether the node is actually live and in communication. A blanket policy that shuts down nodes that are unavailable to do useful work should also catch broken nodes.